Core i7 6800K Linux CPU Scaling Benchmarks With Ubuntu 16.10


  • Core i7 6800K Linux CPU Scaling Benchmarks With Ubuntu 16.10

    Phoronix: Core i7 6800K Linux CPU Scaling Benchmarks With Ubuntu 16.10

    Earlier today I posted some Linux game CPU scaling benchmarks using a Core i7 6800K Broadwell-E, showing how well (or not) current Linux games make use of multiple CPU cores; that testing originated from discussions among Linux gamers, following the AMD Ryzen CPU launch, about how many cores are really needed. While going through the process of running those Linux game CPU scaling benchmarks, I also ran some other workloads for those curious...


  • #2
    A good page of examples of how multi-threaded programming should be done.



    • #3
      Originally posted by existensil:
      A good page of examples of how multi-threaded programming should be done.
      Actually, there is nothing good on this page. Most programs cannot even double performance on clearly embarrassingly parallel tasks (going from one to two threads), and most of the software cannot scale beyond 4 threads without a sharp drop in efficiency. Consider that it is common to have servers/nodes/workstations with more than twenty cores/threads, plus the added complexity of NUMA placement. Only compute-bound applications with a small memory footprint can use those cores effectively, and even then only with lots of work. This page is a statement on the poor programming tools available to developers. And why use the DES standard when it is nowadays essentially broken/backdoored?
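      (A minimal sketch of how one might check that one-to-two-thread claim: time an embarrassingly parallel, compute-bound toy workload at several worker counts. The busy-loop kernel below is a made-up stand-in, not one of the article's benchmarks; real programs add synchronization and memory-bandwidth effects that this toy deliberately omits.)

      Code:
      import time
      from multiprocessing import Pool

      def burn(n):
          # Pure-CPU work with a tiny memory footprint.
          acc = 0
          for i in range(n):
              acc += i * i
          return acc

      def run(workers, chunks=24, size=2_000_000):
          # Split identical chunks across a process pool and time the batch.
          start = time.perf_counter()
          with Pool(workers) as pool:
              pool.map(burn, [size] * chunks)
          return time.perf_counter() - start

      if __name__ == "__main__":
          base = run(1)
          for w in (2, 4, 6):
              print(f"{w} workers: {base / run(w):.2f}x speedup (ideal: {w}x)")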



      • #4
        Looks quite good to me. (Only the +HT often doesn't achieve very much.)



        • #5
          Originally posted by defaultUser:

          Actually, there is nothing good on this page. Most programs cannot even double performance on clearly embarrassingly parallel tasks (going from one to two threads), and most of the software cannot scale beyond 4 threads without a sharp drop in efficiency. Consider that it is common to have servers/nodes/workstations with more than twenty cores/threads, plus the added complexity of NUMA placement. Only compute-bound applications with a small memory footprint can use those cores effectively, and even then only with lots of work. This page is a statement on the poor programming tools available to developers. And why use the DES standard when it is nowadays essentially broken/backdoored?
          I don't think you bothered doing the math. Many of these examples scale great. The very first example, John the Ripper, scales almost completely linearly.

          1 → 2 cores: >2x performance improvement
          2 → 3 cores: 1.5x improvement (the theoretical max)
          3 → 4 cores: 1.3325x improvement (theoretical max: 4/3 ≈ 1.333)
          4 → 5 cores: 1.247x improvement (theoretical max: 1.25)
          5 → 6 cores: 1.172x improvement (theoretical max: 1.2)

          It's not perfect, but for real-world tasks to get this close to the theoretical scaling maximums is fairly impressive, and it's worth recognizing the developers' achievement.

          Some of the other examples get pretty close to maximum ideal scaling as well. I stand by my statement that this is a page full of examples of doing multi-threaded programming well.
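          (To put cumulative numbers on those per-step ratios: going from n to n+1 cores can at best multiply throughput by (n+1)/n, so dividing the accumulated speedup by the core count gives the parallel efficiency. A quick sketch using the figures quoted above:)

          Code:
          # Per-step speedup ratios quoted above (n -> n+1 cores), taking
          # the first ">2x" step as an even 2.0 to stay conservative.
          steps = [2.0, 1.5, 1.3325, 1.247, 1.172]

          speedup = 1.0
          for n, ratio in enumerate(steps, start=1):
              speedup *= ratio
              cores = n + 1
              print(f"{cores} cores: {speedup:.2f}x "
                    f"({100 * speedup / cores:.0f}% efficiency)")
          # Works out to roughly 5.84x on 6 cores, i.e. ~97% efficiency.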



          • #6
            Originally posted by indepe:
            Looks quite good to me. (Only the +HT often doesn't achieve very much.)
            Since hyperthreading fakes additional cores, and the vast majority of resources are competitively shared between threads, any benefit is a pretty big win. Theoretically it should be possible to optimize many workloads so well that HT provides little benefit, since a single thread would fully utilize each physical core's resources; but some resources are dedicated to each thread, so even then HT might give you a tiny bit more.



            • #7
              Originally posted by existensil:

              Since hyperthreading fakes additional cores, and the vast majority of resources are competitively shared between threads, any benefit is a pretty big win. Theoretically it should be possible to optimize many workloads so well that HT provides little benefit, since a single thread would fully utilize each physical core's resources; but some resources are dedicated to each thread, so even then HT might give you a tiny bit more.
              Well, that's not actually true.
              As in any SMT scheme, HT does have duplicated hardware resources. In contrast to full cores, however, these are just a small portion of the processing logic, and the main point is to use them to pick up currently idle resources in other execution paths, with the goal of achieving greater utilization (most of the time a large portion of a CPU sits unused because it has to wait for another unit to finish its work).
              So there's nothing fake about it; it's just tricky to actually pull off, and there are many cases where you can't achieve any significant benefit (not even with perfectly optimized software).

              The trend, of course, is that simpler workloads tend to do better. With compiling, stuffing more cores into a CPU scales basically linearly, since there are no interdependencies between separate compilation units.
              Similarly, rather "simple" (i.e. not complex) mathematical problems like the steps of brute-force password cracking tend to do well with SMT, because there is usually one computationally heavy step the whole CPU has to wait for, and SMT can exploit that, since future steps and all interdependencies are known and simple. That makes them much easier to parallelize than many other things, especially anything driven by real-time input like games or user interaction. (One way to see the SMT pairing on your own machine is sketched below.)
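              (On Linux you can see the SMT pairing directly: sysfs reports which logical CPUs are siblings of the same physical core. A small sketch; the topology path below is standard on recent kernels:)

              Code:
              import glob

              # Each logical CPU lists its SMT siblings; two IDs in the
              # same list are hyperthreads sharing one physical core.
              seen = set()
              paths = glob.glob(
                  "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")
              for path in sorted(paths):
                  siblings = open(path).read().strip()  # e.g. "0,6" on a 6c/12t chip
                  if siblings not in seen:
                      seen.add(siblings)
                      print("physical core -> logical CPUs:", siblings)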



              • #8
                IIRC, the main problem with HT is that both execution engines share the same FSB. Depending on the workload, this can be anything from irrelevant to a total performance killer.
                Going into the BIOS to enable/disable HT depending on what you're doing is not the most convenient thing, but at least in theory you get an option.

                Also, I would like to stress this again: these programs are fundamentally different from games - they don't need to sync with user input/choices.
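                (Regarding the BIOS remark: newer kernels (around 4.19 and later, if I recall correctly) also expose a runtime SMT switch in sysfs, so no reboot is needed. A sketch, assuming that control file is present; writing it requires root:)

                Code:
                # Runtime SMT toggle via sysfs; valid values include
                # "on", "off", and "forceoff". Writing requires root.
                SMT_CONTROL = "/sys/devices/system/cpu/smt/control"

                def smt_state():
                    with open(SMT_CONTROL) as f:
                        return f.read().strip()

                def set_smt(state):
                    with open(SMT_CONTROL, "w") as f:
                        f.write(state)

                if __name__ == "__main__":
                    print("SMT is currently:", smt_state())
                    # set_smt("off")  # uncomment (as root) to park the sibling threads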



                • #9
                  Originally posted by existensil:

                  I don't think you bothered doing the math. Many of these examples scale great. The very first example, John the Ripper, scales almost completely linearly.

                  1 → 2 cores: >2x performance improvement
                  2 → 3 cores: 1.5x improvement (the theoretical max)
                  3 → 4 cores: 1.3325x improvement (theoretical max: 4/3 ≈ 1.333)
                  4 → 5 cores: 1.247x improvement (theoretical max: 1.25)
                  5 → 6 cores: 1.172x improvement (theoretical max: 1.2)

                  It's not perfect, but for real-world tasks to get this close to the theoretical scaling maximums is fairly impressive, and it's worth recognizing the developers' achievement.

                  Some of the other examples get pretty close to maximum ideal scaling as well. I stand by my statement that this is a page full of examples of doing multi-threaded programming well.
                  As I said, this is a compute-bound problem that uses very little memory, and workloads like this are the exception, not the norm. Even if they were more common, you can usually obtain better results with algorithms that have hardware support (try, for instance, running the cryptsetup benchmark on machines with and without AES-NI) than by going the multithreaded route. Most practical (and common) problems will bottleneck on memory or I/O operations long before saturating the CPU cores.
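                  (The AES-NI comparison is easy to set up: "cryptsetup benchmark" prints per-cipher throughput, and whether the CPU advertises the instructions shows up as the "aes" flag in /proc/cpuinfo. A quick sketch of the flag check:)

                  Code:
                  def has_aes_ni():
                      # The "aes" flag in /proc/cpuinfo means the CPU supports
                      # the AES-NI instructions.
                      with open("/proc/cpuinfo") as f:
                          for line in f:
                              if line.startswith("flags"):
                                  return "aes" in line.split()
                      return False

                  print("AES-NI available:", has_aes_ni())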



                  • #10
                    Originally posted by defaultUser:
                    Consider that it is common to have servers/nodes/workstations with more than twenty cores/threads, plus the added complexity of NUMA placement. Only compute-bound applications with a small memory footprint can use those cores effectively, and even then only with lots of work.
                    I think the canonical case for efficient use of machines with ten or more cores or threads is heavy multitasking: twenty different threads doing twenty different things for twenty virtual machines or Docker containers. Or maybe twenty different threads doing five or ten different things for five or ten virtual machines or Docker containers.

                    I could use something like that for work, actually. I do a lot of virtual machine testing, and it would be nicer to run six or ten VMs locally than ssh into servers to manage them.

                    It would certainly be nice if you could easily scale any single task process-intensive program to more than four cores, though.

