128 threads and low PPD


Skillz
Posts: 29
Joined: Sun Feb 10, 2008 8:19 pm

128 threads and low PPD

Post by Skillz »

A friend of mine on my team is running some CPU work on his big rigs and has experienced some issues with a8. He has 128 threads and a7 will scale to utilize all threads on the system while a8 will only use 64 threads. Is this by design? A bug? Bad configuration?
FahCore_a7 scales easily to 128 threads.
FahCore_a8 only scales to 64 threads. When loaded into a 128-threaded slot, it logs this…
ERROR:128 OpenMP threads were requested. Since the non-bonded force buffer reduction is prohibitively slow with more than 64 threads, we do not allow this. Use 64 or less OpenMP threads.
…and starts singlethreaded. :facepalm:

Edit: dual EPYC 7452, TDP/PPT set to 180 W:
FahCore_a8 in a 64-threaded slot (other 64 threads unused): 170 kPPD
FahCore_a7 in a 128-threaded slot: 2.8 M PPD
Quoted from here: https://forums.anandtech.com/threads/fo ... t-40423126
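To put the two quoted figures side by side, here is a quick sketch in Python. It uses only the numbers from the quote above; the thread counts are the slot sizes given there, and the variable names are just illustrative.

```python
# Figures quoted above for the dual EPYC 7452 (TDP/PPT 180 W).
a8_ppd = 170_000       # FahCore_a8 in a 64-threaded slot
a7_ppd = 2_800_000     # FahCore_a7 in a 128-threaded slot

ratio = a7_ppd / a8_ppd                            # whole-machine comparison
per_thread_ratio = (a7_ppd / 128) / (a8_ppd / 64)  # normalised per thread used

print(f"a7 vs a8, whole machine: {ratio:.1f}x")
print(f"a7 vs a8, per thread:    {per_thread_ratio:.1f}x")
```

So the gap is roughly 16x for the whole machine and still roughly 8x per thread, which means the idle 64 threads alone do not account for it.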

He wishes to keep his overall online footprint low, so he doesn't want to create a new account here, which is why I am posting this on his behalf.

Additionally, another member on my team experiences the same low PPD, but doesn't have any rigs larger than 64 threads.
That has been my experience as well, outside of that I do not have any machines that are 64 threads or higher. I was excited to get an A8 task thinking the PPD would be much better due to AVX2/FMA3 and it being a newer core, but the PPD was significantly lower than A7 tasks on both a Ryzen 3950x and Haswell Xeon.
As seen in this post here: https://forums.anandtech.com/threads/fo ... t-40423149
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 128 threads and low PPD

Post by bruce »

FAH requires frequent inter-process syncing. When the gromacs.org folks measure performance and observe that N+K processes run slower than N processes, they tend to limit the FAHCore version to run with no more than N threads. While there may be special options that allow gromacs to work efficiently with N+K threads on a dedicated installation, it's not reasonable to expect FAH Donors to manage those options. After all, FAH is designed for the *@home market, and not many people actually have a home computer with that many threads.

FAH's development resources are severely limited and they have to concentrate on things that make FAH run better for the masses, not for a select few. If you happen to develop test reports that show how overall FAH performance can be improved, feel free to submit open-source suggestions on GitHub. Subtract the costs associated with testing and implementing your idea from the total benefit to FAH before expecting anything to happen.
Joe_H
Site Admin
Posts: 7937
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 128 threads and low PPD

Post by Joe_H »

One known issue with the A8 folding core is that an option that would make it more compatible with very large core count processors and multi-processor systems was not selected for the distributed executable. They ran into an issue with that turned on during development, and removed that optimization from the current version of the core being distributed. They may revisit that decision if/when they find the cause and release a newer version of A8.

The tradeoff then was between getting out a new CPU folding core that supported new features related to folding science and would also support ARM processors with the ability to run on most commonly seen hardware, versus taking more time to get out a version that would run efficiently on a small portion of the hardware.

Also, your friend may have run into a first WU download "bug" that exists in the client. Often the first WU will be downloaded and limited to 1 CPU thread, later WUs will use the maximum of the slot setting. If no WUs are available to run on 64 threads, then a lower thread requirement WU may get assigned instead, but it could not be for as few as a single thread. Once assigned at a particular max thread count, a WU can be run on fewer, but not more threads.
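The last rule above (a WU can run on fewer threads than it was assigned for, but not more) can be written as a one-line model. This is only an illustrative sketch of the rule as described in this post, not code from the client; the function name `threads_used` and its parameters are made up for the example.

```python
def threads_used(slot_threads: int, wu_max_threads: int) -> int:
    """Threads a WU would run on, under the rule described above.

    slot_threads:   thread count configured for the CPU slot
    wu_max_threads: max thread count the WU was assigned at
    Both names are hypothetical; this is a model, not client code.
    """
    if slot_threads < 1 or wu_max_threads < 1:
        raise ValueError("thread counts must be positive")
    # Fewer threads than assigned is allowed, more is not.
    return min(slot_threads, wu_max_threads)


# A WU assigned at 64 threads in a 128-thread slot stays at 64.
print(threads_used(128, 64))
# The same WU in a 32-thread slot simply runs on 32.
print(threads_used(32, 64))
```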

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Skillz
Posts: 29
Joined: Sun Feb 10, 2008 8:19 pm

Re: 128 threads and low PPD

Post by Skillz »

Thanks for the feedback guys.

Is there any way to opt out of receiving those work units on systems with more than 64 threads, other than using scripts to abort the tasks if they're downloaded?
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 128 threads and low PPD

Post by Neil-B »

Aborting tasks once downloaded is really unhelpful for the progress of science :(
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
Skillz
Posts: 29
Joined: Sun Feb 10, 2008 8:19 pm

Re: 128 threads and low PPD

Post by Skillz »

Neil-B wrote:Aborting tasks once downloaded is really unhelpful for the progress of science :(
Tasks that run only half of the available resources on the CPU are really unhelpful for the volunteers.
JimboPalmer
Posts: 2522
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: 128 threads and low PPD

Post by JimboPalmer »

Skillz wrote:
Neil-B wrote:Aborting tasks once downloaded is really unhelpful for the progress of science :(
Tasks that run only half of the available resources on the CPU are really unhelpful for the volunteers.
I hope you come to realize that this is all about science and not at all about you.

If you go on to the reddit to give information that can help the developers, be sure to give detailed info about the version of the OS you/they are using: not just "Windows" or "Linux", but version numbers as exact as you can make them. That will have a great deal of impact on how the OS assigns threads.
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
gs60
Posts: 55
Joined: Mon Aug 31, 2020 12:33 pm

Re: 128 threads and low PPD

Post by gs60 »

Just curious, would this setup be a good example of defining 2 cpu slots with 64 threads each (or even 4, 32 thread slots)?
JimboPalmer
Posts: 2522
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: 128 threads and low PPD

Post by JimboPalmer »

gs60 wrote:Just curious, would this setup be a good example of defining 2 cpu slots with 64 threads each (or even 4, 32 thread slots)?
As most versions of Windows only use 32 threads per app, if you (for some reason) had a consumer version of Windows on a 128-thread monster CPU you would need four 32-thread slots.

I know almost nothing about Linux*, but I am unaware of any issues there.

*Where it resembles Unix I have some knowledge.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 128 threads and low PPD

Post by Neil-B »

Skillz wrote:
Neil-B wrote:Aborting tasks once downloaded is really unhelpful for the progress of science :(
Tasks that run only half of the available resources on the CPU are really unhelpful for the volunteers.
I'll put it another way then ... folders should not do this as it does more harm than good ... this is not volunteering it is vandalism ... to quote the FaH guidance/best practice:

Donors should not delete/dump a WU for any reason other than mentioned below. Deleting WUs disrupts the project since it takes longer for WUs to pass their deadline, get reassigned, and finally completed. Deleting a WU solely for PPD advantage is prohibited. The permitted reasons for deleting/dumping WUs are:
WU Instability -> When this happens, please report it in this Forum
FAH Client instability -> If this happens, please report it in the appropriate FAH Client Forum
Inability of the host system to complete the WU before the Deadline -> If it happens, please visit this thread or this guide to reconfigure your FAH Client to better fit your computing needs.


So please ask your friend (and anyone else you know who does this) not to dump wus for any reasons other than the above ... there are other ways of addressing the current issue with A8 core limitations (which in time may be resolved anyway) that are not destructive or damaging to the science.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 128 threads and low PPD

Post by Neil-B »

gs60 wrote:Just curious, would this setup be a good example of defining 2 cpu slots with 64 threads each (or even 4, 32 thread slots)?
Possibly ... It depends ... If the 128 core slot is regularly being used properly (as it may have been in the past with a7) then leaving it as 128 core may well be best for progress of science ... If however it is now regularly being stepped down to 64 cores due to impact of current a8 restrictions then changing config to two 64 core slots may well provide best contribution ... dropping to 32 core slots would be best on windows systems due to current limitation enforced by FaHClient (not a windows restriction) but if someone has been running 128 core slots they aren't running windows (or if they are I would love to know how) !!

When it comes down to it folders should do what they feel comfortable with ... I for instance don't change my slots from 24 and 32 cores when the occasional period of low cpu core count limitations mean they are down stepped to lower values simply because I know at some point it will all sort itself out - yes for a period my cores aren't all maxing but the faff of changing back and forth is one I choose to avoid :) ... Yes I could do more science/get more points (which I don't actually care about) by altering configs at such times but in the big scheme of things it isn't for me an issue.
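For anyone weighing up the slot options discussed above, the arithmetic of carving a machine into equal CPU slots under a per-slot cap can be sketched like this. The helper `split_into_slots` is hypothetical and only does the division; it does not touch the client or its configuration, and the caps of 64 and 32 are simply the figures mentioned in this thread.

```python
from typing import List


def split_into_slots(total_threads: int, per_slot_cap: int) -> List[int]:
    """Split total_threads into slots of at most per_slot_cap threads.

    Hypothetical helper for illustration only; it returns the thread
    count for each slot, with any remainder placed in a final slot.
    """
    if total_threads < 1 or per_slot_cap < 1:
        raise ValueError("thread counts must be positive")
    full_slots, remainder = divmod(total_threads, per_slot_cap)
    slots = [per_slot_cap] * full_slots
    if remainder:
        slots.append(remainder)
    return slots


# 128 threads with the 64-thread limit reported for a8: two slots.
print(split_into_slots(128, 64))
# 128 threads with a 32-thread cap per slot: four slots.
print(split_into_slots(128, 32))
```

Whether more, smaller slots actually give a better contribution than one large slot still depends on which cores are being assigned, as described above.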
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: 128 threads and low PPD

Post by MeeLee »

In the defense, dumping is sometimes inevitable.
So long as you're maintaining an 8/10 ratio (>8 successfully completed, and <2 erroneous and dumped) you'll get the bonus PPD.
Sometimes dumping is the better choice, if you choose not to continue for whatever reason (perhaps you want to reinstall another OS, or something).
Whatever contribution is donated, is appreciated.
It's preferable to finish the WUs, to at least get the points, however, if he can enable all cores on another operating system, I see no reason why not to dump it, rather than install the new OS, and have the server wait for the WUs that are now overwritten.


Joe_H wrote:One known issue with the A8 folding core is that an option that would make it more compatible with very large core count processors and multi-processor systems was not selected for the distributed executable. They ran into an issue with that turned on during development, and removed that optimization from the current version of the core being distributed. They may revisit that decision if/when they find the cause and release a newer version of A8.

The tradeoff then was between getting out a new CPU folding core that supported new features related to folding science and would also support ARM processors with the ability to run on most commonly seen hardware, versus taking more time to get out a version that would run efficiently on a small portion of the hardware.

Also, your friend may have run into a first WU download "bug" that exists in the client. Often the first WU will be downloaded and limited to 1 CPU thread, later WUs will use the maximum of the slot setting. If no WUs are available to run on 64 threads, then a lower thread requirement WU may get assigned instead, but it could not be for as few as a single thread. Once assigned at a particular max thread count, a WU can be run on fewer, but not more threads.

We'd have to look at modern hardware, Joe.
Some of the new Neoverse CPUs may not have those same limitations found in the older ARM CPU architecture.
Ampere CPUs use 80 threads, and are pretty optimized already. Amazon Graviton CPUs (not available for consumers) also have many tweaks from the standard ARM architecture for mobile CPUs.
I think this is where CPUs are heading.

The limitation could be removed on hardware that supports it, and leave it to the end user to see if his PPD is dropping when he surpasses an x-amount of cores.
Joe_H
Site Admin
Posts: 7937
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 128 threads and low PPD

Post by Joe_H »

MeeLee wrote:We'd have to look at modern hardware, Joe.
Some of the new Neoverse CPUs may not have those same limitations found in the older ARM CPU architecture.
Ampere CPUs use 80 threads, and are pretty optimized already. Amazon Graviton CPUs (not available for consumers) also have many tweaks from the standard ARM architecture for mobile CPUs.
I think this is where CPUs are heading.

The limitation could be removed on hardware that supports it, and leave it to the end user to see if his PPD is dropping when he surpasses an x-amount of cores.
The option was not selected because it was not working properly, not because of anything related to new or old ARM architecture. The A8 core uses a newer version of Gromacs than the A7 core, and support for ARM is one of several changes involved. They are looking into where the problem might be coming from; it could be a bug or an unforeseen interaction in the code once the optimization is added. As for leaving it to the end user, they are not going to do that in most cases, since the results do need to be verifiable science.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 128 threads and low PPD

Post by Neil-B »

MeeLee wrote:In the defense, dumping is sometimes inevitable.
.. and where it is inevitable then it is inevitable and yes it happens - but that is not what is being talked about here .. scripting dumping for performance gains is not inevitable !!! .. and doing so is wrong and against the guidance/policy set by fah.

.. now if you want someone to write a script to dump all gpu wus that dont 100% utilise the gpu then feel free to defend this behaviour .. the reasons for the limit on cores for a8 was explained - no doubt efforts will be made to sort that out in future drops of the core - but until then there are ways to resolve this without damaging the progress of science ... and scripting dumping of wus is not one of them.

... and the 80% figure for qrb doesn't mean one should aim for only 80% - 100% return should be the aim :(
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: 128 threads and low PPD

Post by MeeLee »

Neil-B wrote:
MeeLee wrote:In the defense, dumping is sometimes inevitable.
.. and where it is inevitable then it is inevitable and yes it happens - but that is not what is being talked about here .. scripting dumping for performance gains is not inevitable !!! .. and doing so is wrong and against the guidance/policy set by fah.

.. now if you want someone to write a script to dump all gpu wus that dont 100% utilise the gpu then feel free to defend this behaviour .. the reasons for the limit on cores for a8 was explained - no doubt efforts will be made to sort that out in future drops of the core - but until then there are ways to resolve this without damaging the progress of science ... and scripting dumping of wus is not one of them.

... and the 80% figure for qrb doesn't mean one should aim for only 80% - 100% return should be the aim :(
I doubt many are busy with dumping WUs that won't utilize the GPU fully.
Even if a script existed, you'd lose PPD by the time it takes to download, and discard the WU, plus you're getting closer to the minimum 80% needed for the Quick return bonus that way.
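To make the 80% point concrete, here is a rough sketch of how each dumped WU eats into the return rate. The 80% threshold is the figure quoted in this thread; how the servers actually count WUs for the Quick Return Bonus is not shown here, so treat `QRB_MIN_RETURN_RATE` and `return_rate` as illustrative names and assumptions only.

```python
QRB_MIN_RETURN_RATE = 0.80  # figure quoted in this thread (assumption)


def return_rate(completed: int, dumped_or_failed: int) -> float:
    """Fraction of assigned WUs that were returned successfully."""
    total = completed + dumped_or_failed
    if completed < 0 or dumped_or_failed < 0 or total == 0:
        raise ValueError("need at least one WU and non-negative counts")
    return completed / total


# Starting from 10 assigned WUs, see how the rate moves with each dump.
for dumped in range(0, 4):
    rate = return_rate(10 - dumped, dumped)
    status = "at or above" if rate >= QRB_MIN_RETURN_RATE else "below"
    print(f"{dumped} dumped of 10: {rate:.0%} ({status} the quoted 80%)")
```

With only ten WUs, two dumps already sit right at the quoted minimum and a third falls below it.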