Folding Forum

Posted: **Sun Mar 09, 2014 1:53 pm**

I found this online article. It may explain why, for GPUs, doubling the Streaming Multiprocessor (SM) count does not always double performance.

http://www.extremetech.com/computing/11 ... ll-stuck/2
Note the interesting illustration in the above article, which shows that as the number of cores increases, performance improves less than proportionally before eventually leveling off.

Posted: **Sun Mar 09, 2014 3:04 pm**

The only thing you can be referring to is the doubling of shaders going from the Fermi architecture to the Kepler architecture. And why that did not double performance has been explained many times over. Google it so I don't have to repeat it again here.

And this article misses the point completely. Everyday software has yet to catch up with today's multicore processors. That's not the fault of the CPU makers.

And this doesn't apply to fah in the least. The SMP client is well known to scale in performance in a linear manner up to 64 cores and beyond. That puts fah on that green line in that chart in the article. Which means that until consumer desktop CPUs start shipping with 256 cores (not any time soon), the article is very premature in its predictions with regard to fah.

Posted: **Sun Mar 09, 2014 3:18 pm**

The article talks about 'software' but there are only individual programs. Some scale very well, some do not. I wrote a payroll program that scaled to over 16 CPUs (in theory it could scale to CPUs = the number of employees) as no part of your check interacts with anyone else's check. Most lack of scalability has to do with tight interaction. Multi-core programming is going to be about breaking interaction.

Posted: **Sun Mar 09, 2014 5:35 pm**

Do note that F@H uses the FPU (Floating Point Unit) heavily. Thus, depending on the actual specifications of the processor, not the marketing adverts, your performance could vary significantly. For example, a system that shows 4 CPUs could have either of these processors:
1) 2 Cores with 4 Threads
2) 4 Cores with 4 Threads

In this case, processor 2 would be significantly faster than processor 1 since it has twice the number of FPUs (4 vs. 2). The virtual cores may increase the performance of F@H by 0% to 25%, depending on the type of WU assigned. Thus, when it comes to F@H, you should always focus on the actual number of FPUs (real cores) present and not the total number of CPUs/threads, since the latter may mislead you. For example, a processor with 6 cores and 6 threads is faster than a processor with 4 cores and 8 threads.
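The comparison above can be put into rough numbers. This is only a sketch under simplifying assumptions: every real core is treated as one FPU of equal speed, and the virtual threads add the same bonus to every core. The function name `estimated_throughput` and the parameter `ht_gain` are made up for illustration.

```python
def estimated_throughput(real_cores, threads, ht_gain=0.25):
    """Rough relative throughput: one unit per real core (one FPU each),
    plus a bonus of ht_gain when the chip exposes extra virtual threads."""
    bonus = ht_gain if threads > real_cores else 0.0
    return real_cores * (1.0 + bonus)


# (label, real cores, threads) for the processors mentioned in the post
processors = [("2C/4T", 2, 4), ("4C/4T", 4, 4), ("4C/8T", 4, 8), ("6C/6T", 6, 6)]

for label, cores, threads in processors:
    worst = estimated_throughput(cores, threads, ht_gain=0.0)   # 0% gain from virtual cores
    best = estimated_throughput(cores, threads, ht_gain=0.25)   # 25% gain from virtual cores
    print(f"{label}: {worst:.2f} to {best:.2f} core-equivalents")
```

Even with the full 25% bonus, 4 cores with 8 threads come out at 5 core-equivalents, which is still below the 6 real cores of the 6C/6T part.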

Posted: **Mon Mar 10, 2014 11:26 pm**

7im wrote:And this doesn't apply to fah in the least. The SMP client is well known to scale in performance in a linear manner up to 64 cores and beyond. That puts fah on that green line in that chart in the article. Which means that until consumer desktop CPUs start shipping with 256 cores (not any time soon), the article is very premature in its predictions with regard to fah.

This is accurate ... almost. The programmers at Gromacs and at OpenMM have gone to great lengths to create code that is highly parallelizable -- sufficient to keep all your cores busy -- but there's still going to be some serial code. The speed at which a WU progresses from 0% to 100% will essentially (i.e., almost) double if you can give it twice as many FPUs. But the time between reaching 100% on one WU and seeing the next 0% message, while it packs up the data for upload and initializes the next WU, is still serial, and that part can't be sped up by adding cores.

The other serial segment is internal to Gromacs -- namely synchronizing threads from each CPU. When there are a large number of atoms, that's insignificant, but a very small protein (measured by atom count) can become less efficient with a really large number of cores. If JimboPalmer's company has 256 employees and they happen to have a 256-core CPU, there is zero advantage to processing all the checks concurrently, since it still takes time to initialize the program, and the checks can't all be transmitted to the bank concurrently, nor can the printer process them all concurrently. Admittedly, that's an unrealistically extreme case, but it does explain why those efficiency curves tend to flatten out at the top. Fortunately, the proteins that FAH analyzes tend to have a lot more atoms than your hardware has CPU-cores, so FAH is decidedly in the best part of the curve.
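The flattening described here is the behavior usually written down as Amdahl's law: if a fraction s of the work is serial, the speedup on N cores is 1 / (s + (1 - s) / N). A minimal sketch follows; the 5% serial fraction is a made-up illustrative value, not a measurement of Gromacs, OpenMM, or FAH.

```python
def amdahl_speedup(cores, serial_fraction):
    """Amdahl's law: speedup on `cores` cores when `serial_fraction`
    of the single-core run time cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


SERIAL_FRACTION = 0.05  # hypothetical value, chosen only to show the shape of the curve

for cores in (1, 2, 4, 16, 64, 256, 1024):
    speedup = amdahl_speedup(cores, SERIAL_FRACTION)
    print(f"{cores:5d} cores -> speedup {speedup:6.2f}")

# However many cores are added, the speedup stays below 1 / serial_fraction.
```

The smaller the serial fraction (for FAH, the more atoms per core), the longer the curve stays close to the straight line before it bends over.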

Posted: **Tue Mar 11, 2014 3:29 am**

bruce wrote:If JimboPalmer's company has 256 employees and they happen to have a 256-core CPU, there is zero advantage to processing all the checks concurrently, since it still takes time to initialize the program, and the checks can't all be transmitted to the bank concurrently, nor can the printer process them all concurrently. Admittedly, that's an unrealistically extreme case, but it does explain why those efficiency curves tend to flatten out at the top. Fortunately, the proteins that FAH analyzes tend to have a lot more atoms than your hardware has CPU-cores, so FAH is decidedly in the best part of the curve.

The sole internal shared logic is in reducing the contents of the company's payroll account, a single subtraction that only one employee can be allowed to do at a time. And the program must wait for ALL employees to finish before terminating, so if some employees have a very complex job history, they might delay completion out of proportion to their percentage of the workforce. In serial mode the program took 22 hours; with 16 threads handling 1600 employees, it finished in 90 minutes.
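The structure described above (independent per-employee work, one locked subtraction on the shared account, and a wait for everyone at the end) can be sketched roughly as follows. This is not the original payroll program; `run_payroll`, the employee data, and the balance are all hypothetical. Note also that in standard CPython with the GIL, threads would not speed up CPU-bound work, so the sketch only illustrates the shape of the program.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_payroll(employees, starting_balance, workers=16):
    """employees: dict mapping a name to that employee's net pay (hypothetical data)."""
    balance = starting_balance
    lock = threading.Lock()

    def pay(item):
        nonlocal balance
        name, amount = item
        # The per-employee calculation (taxes, job history, ...) would go here.
        # It touches no other employee's data, so it needs no lock.
        with lock:
            # The one shared step: only one thread may subtract at a time.
            balance -= amount
        return name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() blocks until every employee is done, so the slowest one
        # determines when the whole run finishes.
        paid = list(pool.map(pay, employees.items()))
    return balance, paid


remaining, paid = run_payroll({"alice": 100, "bob": 250, "carol": 50}, 1000, workers=2)
print(remaining, paid)
```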

Posted: **Tue Mar 11, 2014 4:24 am**

Complex tasks with larger data sets are where multiple processors can really shine, just like fah.

Posted: **Tue Mar 11, 2014 4:19 pm**

Right. You brought up a different example than I had in mind but it's a good one. Working backward, the best possible scenario is that 22 hrs of serial work divided equally across 16 threads might take as little as 82.5 minutes. With it actually taking 90 minutes, we can say it's really good code because there's only 7.5 minutes of extra overhead plus the remaining serial operations. With half the number of employees, the 82.5 minutes might become 41.3 plus almost all of the 7.5, for a total of ~49 minutes, which is not quite double the speed.
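The arithmetic above can be checked with a few lines. The sketch assumes the parallel part divides perfectly across threads and that the full 7.5 minutes of overhead stays fixed when the employee count is halved (the post says "almost all"); the function name `payroll_scaling` is made up for illustration.

```python
def payroll_scaling(serial_hours=22, threads=16, measured_minutes=90):
    serial_minutes = serial_hours * 60
    ideal = serial_minutes / threads             # perfect split across threads
    overhead = measured_minutes - ideal          # time not explained by the perfect split
    speedup = serial_minutes / measured_minutes  # measured speedup over the serial run
    efficiency = speedup / threads               # fraction of the ideal speedup achieved
    # Half the employees: the parallel part halves, the overhead is assumed fixed.
    half_estimate = ideal / 2 + overhead
    return {
        "ideal": ideal,
        "overhead": overhead,
        "speedup": speedup,
        "efficiency": efficiency,
        "half_estimate": half_estimate,
        "half_speed_ratio": measured_minutes / half_estimate,
    }


for key, value in payroll_scaling().items():
    print(f"{key}: {value:.2f}")
```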

I also like your example for another reason. 1600 employees distributed over 16 threads means an average of 100 employees per thread. FAH's proteins currently have between 250 and 22000 atoms. The number I like to use to estimate an ideal number of processors is 100 or more serial operations per thread, meaning FAHCore_a5 at 22000 atoms could still be improved almost linearly by adding more CPUs. It also means that the proteins that are being tested on FAHCore_17 at 16000-17000 atoms are very happy with today's high-end GPUs. Conversely, the proteins with only a few hundred atoms are too small to use either Core_a5 or Core_17 efficiently. I'd say that 100 atoms per thread is near the point where the curve flattens out pretty significantly. (It's just a rule-of-thumb anyway.)
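As a quick illustration of that rule of thumb, here is a sketch that turns an atom count into a rough ceiling on useful threads. The helper `max_useful_threads` is hypothetical, and the 100-atoms-per-thread figure is only the estimate given above, not a documented FAH limit.

```python
def max_useful_threads(atoms, atoms_per_thread=100):
    """Rough ceiling on threads before scaling flattens, using the
    100-atoms-per-thread rule of thumb (at least one thread)."""
    return max(1, atoms // atoms_per_thread)


# Atom counts mentioned in the post
for atoms in (250, 16000, 17000, 22000):
    print(f"{atoms:6d} atoms -> up to about {max_useful_threads(atoms)} threads")
```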

Anyway, the whole point is that for the right kind of operations more cores can be beneficial, but for other types, less will be gained. FAH is the right kind of operation for the range of hardware that we have today.

Posted: **Sun Jun 22, 2014 12:47 pm**

Sounds like there's no need to worry about diminishing returns with increasing CPU thread count in FAH in the near future.

Generally, the atom count in SMP work units is increasing, right?

Posted: **Mon Jun 23, 2014 4:10 pm**

user123 wrote:Generally, the atom count in SMP work units is increasing, right?

In general terms, probably. In more accurate terms, not rapidly and not if the science can be evaluated with fewer atoms.

For implicit solvent solutions, there has been a tendency to identify simple proteins whenever possible. For explicit solvent models, the protein is enclosed in a box of water molecules and while a larger box would contain more atoms and those atoms add up, there's no need to wonder what a larger box of water will do.

Posted: **Sun Jun 29, 2014 1:44 pm**

(Sorry for a late detailed reply)
There appears to be a misunderstanding.

7im wrote: The only thing you can be referring to is the doubling of shaders going from the Fermi architecture to the Kepler architecture. And why that did not double performance has been explained many times over. Google it so I don't have to repeat it again here.

And this article misses the point completely. Everyday software has yet to catch up with today's multicore processors. That's not the fault of the CPU makers.

And this doesn't apply to fah in the least. The SMP client is well known to scale in performance in a linear manner up to 64 cores and beyond. That puts fah on that green line in that chart in the article. Which means that until consumer desktop CPUs start shipping with 256 cores (not any time soon), the article is very premature in its predictions with regard to fah.

I wasn't referring to the doubling of shaders going from Fermi to Kepler architecture.
I was referring to doubling of shaders within the same architecture.

From personal experience: back in 2012, I ran FAH on a Palit GTX460 (336 shaders, overclocked to 810 MHz) and compared its performance to a Gigabyte GT430 (96 shaders, also overclocked to 810 MHz).
The Palit GTX460 had 3.5 times as many shaders as the GT430 but was less than 3.5 times as fast.

GPUs have a lot more shaders (analogous to cores in a CPU) than CPUs have cores and are much more parallel. With a large number of shaders, and travelling along the green line in the chart in the article, it can be seen why there are diminishing returns.

Posted: **Sun Jun 29, 2014 2:08 pm**

CPUs are not GPUs. The article is about CPUs.

And you singled out only one data set for comparison, which unfortunately is old and flawed. Your 460 GPU is similar to the benchmark GPU at the time. And the 430 is way below it and very bottlenecked.

And you used a very vague "as fast" measurement for your comparison. Was that PPD? Which fahcore? With bonus? Or only the more accurate base points? Same project?

I'm sure there is an old GPU performance chart around here somewhere. It will show a more complete picture.

Increasing no. of cores and diminishing returns