Re: Bug report: FAH client cannot detect more than 32 CPUs
Posted: Sun Sep 06, 2020 6:22 pm
But 2 slots of 32 cores is still better than a single one.
Community driven support forum for Folding@home
https://foldingforum.org/
Humm... can you please elaborate on the assumptions or use case under which you would suggest 2 CPU:32 slots instead of 1 CPU:64 slot?
MeeLee wrote: But 2 slots of 32 cores is still better than a single one.
I think that MeeLee was trying to say that 2 slots of 32 cores is better than 1 slot of 32 cores. Personally, I can only see 2 slots of 32 cores being better than 1 slot of 64 cores if the assigned tasks are so small that at least one of them would leave several cores idle. Some work units have so few atoms that the FahCore will refuse to use all cores when given a large number of them. It may be server-driven, though. I don't remember exactly what my logs said when this happened to my computer.
PantherX wrote: Humm... can you please elaborate on what assumptions or use case you would suggest 2 CPU:32 Slots instead of 1 CPU:64 Slot?
MeeLee wrote: But 2 slots of 32 cores is still better than a single one.
One of the problems with Ryzen and Threadripper CPUs is that all CPU chiplets communicate with one another over the Infinity Fabric.
Neil-B wrote: In my case for the moment I run a 32/56 and a 24/56 slot because of this issue ... Have been providing some analysis into the team which posted elsewhere but relevant to this:
Working "outside" of the Client running the new A8 Core on a test p16810 WU the current 24 core slot and a 32 core slot produced less than a single 54 core slot (54 was better than 56) ... Not only did the single slot process slightly more over time (20.5 WUs/d to 20 WUs/day) but 11 of them are completed 60 mins quicker using a single 54 core slot (in 70 mins rather than 130 mins) and the other 9 are completed 90 mins quicker (in 70 mins rather than 160 mins) ... Only a little bit more science but one heck of a lot faster completion of each WU.
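The throughput figures in the quote above can be checked with a little arithmetic. This is only a rough sketch: it assumes each slot finishes WUs back to back at the quoted completion times, and that the 130 min figure belongs to the 32 core slot and the 160 min figure to the 24 core slot (the post does not say which is which). The function name is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60

def wus_per_day(minutes_per_wu: float) -> float:
    """Work units one slot finishes per day at a steady completion time."""
    return MINUTES_PER_DAY / minutes_per_wu

# Completion times quoted in the post (minutes per WU).
single_54 = wus_per_day(70)

# Assumption: 130 min is the 32 core slot, 160 min is the 24 core slot.
split_32 = wus_per_day(130)
split_24 = wus_per_day(160)
split_total = split_32 + split_24

print(f"single 54 core slot: {single_54:.1f} WUs/day")
print(f"32 + 24 core slots:  {split_total:.1f} WUs/day "
      f"({split_32:.1f} + {split_24:.1f})")
```

The results land close to the quoted 20.5 and 20 WUs/day, and the per-slot split is consistent with the "11 of them" and "the other 9" in the post.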
Actually, there are problems where having too many processes competing with each other can cause slowdowns. For example, after all of the BOINC clients that are part of NFS@home return their results, the staff who set up the back-end postprocessing of those results reserve many more cores on the compute servers than they actually use. That way the cores that do process instructions face little or no competition for the memory controllers, since the back-end processes are not CPU hogs but memory throughput hogs. Having many cores sit idle drastically speeds up the cores that spend most of their time waiting for memory accesses.
MeeLee wrote: One of the problems with Ryzen and Threadripper CPUs, is that all CPU chiplets communicate with one another over the Infinity fabric.
Neil-B wrote: In my case for the moment I run a 32/56 and a 24/56 slot because of this issue ... Have been providing some analysis into the team which posted elsewhere but relevant to this:
Working "outside" of the Client running the new A8 Core on a test p16810 WU the current 24 core slot and a 32 core slot produced less than a single 54 core slot (54 was better than 56) ... Not only did the single slot process slightly more over time (20.5 WUs/d to 20 WUs/day) but 11 of them are completed 60 mins quicker using a single 54 core slot (in 70 mins rather than 130 mins) and the other 9 are completed 90 mins quicker (in 70 mins rather than 160 mins) ... Only a little bit more science but one heck of a lot faster completion of each WU.
Basically a similar idea to (and probably derived from) the ring bus in Intel's Core i CPUs, with the exception that a ring bus goes from core to core, while Infinity Fabric is some sort of blazing fast network connecting all cores pretty much directly to one another. This poses a problem when a lot of data is moved around, which is why a Ryzen speeds up so much with faster RAM (and thus a faster Infinity Fabric).
Data to and from RAM also gets transported via this Infinity Fabric network.
You can imagine an Infinity Fabric running at 1.8 GHz having to provide data to all CPU cores and RAM, which run at twice that frequency...
Thankfully they're equipped with (from what I can understand) 4 lanes connecting each chiplet to another, and an additional 6 lanes connecting the cores within a chiplet. (The chiplet is the block enclosing 4, 6, or 8 CPU cores. Ryzen and Threadripper combine several chiplets, and within each chiplet are several CPU cores; basically, a CPU core on a chiplet is a CPU within a CPU.) The marketing names don't make things easier...
Anyway, the reason why your Threadripper (and even a Ryzen 3900X/XT or 3950X) isn't operating as fast under an all-core load is threefold.
One, because the IF is bottlenecked.
Two, I firmly believe that Threadrippers have too many cores asking for data from RAM, so the RAM is somewhat bottlenecked too.
Three, because leaving a few cores idle allows that power to be routed to the other cores, letting them reach a higher clock speed.
This, in my opinion, is the difference between Intel and AMD.
Intel would never fabricate (or at least never has fabricated) a CPU where the additional cores would slow down operation.
They even measure voltages and power consumption levels, so that each CPU is optimized about as well as it can be, and the average of a bunch of tasks gets completed at the lowest carbon footprint possible for that technology.
Meaning, if they increased the CPU frequency, the CPU would need more power, and overall the energy used to finish the job would rise.
Conversely, lowering the frequency lowers the power requirement, but the task also takes longer to complete, which results in an increase in energy consumption as well.
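The trade-off described above (more frequency costs more power, less frequency costs more time) can be sketched with a toy model. Everything in it is an assumption made for illustration and not a measurement of any real CPU: a fixed static power draw, a dynamic power term that grows with the cube of the frequency, run time inversely proportional to frequency, and the hypothetical parameter values 20.0 and 1.0.

```python
def energy_per_task(freq_ghz, static_w=20.0, dyn_coeff=1.0, work=1.0):
    """Energy to finish a fixed amount of work in a toy model.

    power = static_w + dyn_coeff * freq**3   (cubic term is an assumption)
    time  = work / freq                      (arbitrary units)
    """
    power = static_w + dyn_coeff * freq_ghz ** 3
    time = work / freq_ghz
    return power * time

def optimal_freq(static_w=20.0, dyn_coeff=1.0):
    """Frequency that minimises energy_per_task in this model.

    Setting d/df (static_w / f + dyn_coeff * f**2) = 0 gives
    f = (static_w / (2 * dyn_coeff)) ** (1/3).
    """
    return (static_w / (2.0 * dyn_coeff)) ** (1.0 / 3.0)

best = optimal_freq()
for f in (0.5 * best, best, 1.5 * best):
    print(f"{f:.2f} GHz -> {energy_per_task(f):.2f} energy units")
```

In this model the energy per task rises on both sides of the optimum, which is the behaviour the post describes. Real chips have voltage/frequency curves that differ from the cubic assumption, so only the shape of the curve should be read from it, not the numbers.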
AMD, on the other hand, I don't feel looks at this.
They either just shoot for the fastest CPU frequency or, in the case of the 3000 series Ryzens and Threadrippers, their CPU algorithms are a total mess!
You could be running a 6-thread workload on a 3900X, and instead of running it at the maximum rated frequency of 4.? GHz, the CPU runs it at 2.5 GHz.
Not to mention the initial BIOS issues on all Ryzens!
I feel their latest products were released hastily and aren't as refined as Intel's, despite running on a smaller lithography (7nm) and having more cores...
Some tests are showing Intel regaining superiority with their 11th gen CPUs, which run workloads more efficiently at 10nm than AMD does at 7nm.
Personally, it seems slightly unfair to compare a new architecture that has been out for ~3 years to something that has been out for ~11 years. I am sure that the first few Intel Core i series had their own heat issues too. However, I am still thankful to AMD for shaking up the market and producing CPUs at a more affordable price than Intel. That's healthy competition, which is needed as it benefits customers like us and drives innovation.
MeeLee wrote: ...I feel their latest products were released hastily, and aren't as refined as Intel, despite they running a smaller lithography (7nm), and having more cores...
Some tests are showing Intel to gain superiority back with their 11th gen CPUs, that are running workloads more efficient at 10nm, than AMD at 7nm.
AFAIK, there's no real benefit to a 64-bit client. All folding is done in FahCore_a7 and FahCore_a8, which are 64-bit compatible.
jnv11 wrote: ...All generations of Threadripper will require benchmarking to see if NUMA mode or non-NUMA mode works better for each generation once a 64-bit client that is properly NUMA-aware is developed and shipped...
If you were referring to my comment re 54/56 being quicker than 56/56, just to point out this was on a dual Intel Xeon server .. and tbh the reason was probably contention and overhead from other software on the server during the testing .. and I kind of like 54 as a count as it doesn't have any "large" prime issues .. and yes I know A8 shouldn't have these, but at least for a while the ingrained avoidance twitch will still kick in
MeeLee wrote: Anyway, the reason why your Threadripper (and even Ryzen 3900x/xt or 3950x) isn't operating as fast at an all core load is twofold.
But they aren't. The Ryzen 3000 series only came out last year, and Intel Core i 11th gen is a big leap from the 2nd through 9th gens.
PantherX wrote: Personally, it seems slightly unfair to compare a new architecture that has been out for ~3 years to something that has been out for ~11 years. I am sure that the first few Intel Core i Series had their own heat issues too. However, I am still thankful to AMD for shaking up the market and producing CPUs that a more affordable price than Intel. That's healthy competition which is needed as it benefits customers like us and drives up innovation.
It's a general rule of thumb that if you're folding on that many cores, one core is needed to feed the others, much like a core is needed to feed a GPU.
Neil-B wrote: If you were referring to my comment re 54/56 being quicker than 56/56 just to point out this was on an dual Intel Xeon server .. and tbh the reason was probably due to the contention and overhead with other software on the server at the time during the testing .. and I kind of like 54 as a count as doesn't have an "large" prime issues .. and yes I know A8 shouldn't have these but at least for a while the ingrained avoidance twitch will still kick in
If you were referring to the world in general re Threadripper then I can't comment as I am an Intel only user !!
Microsoft reworked the data structures for 64-bit Windows to greatly expand its ability to cope with more cores. A 32-bit client cannot hope to get accurate data structures, since high-end systems today exceed the capacity limits of the 32-bit structures. The only thing it could get accurately is a core count, by using a new API; there is no hope of getting accurate versions of the rest of the data entirely within the 32-bit client. A 64-bit client could use the 64-bit versions of the data structures. Since the new structures are designed to work with the same source code unless one is writing assembly, a 64-bit recompile without changing the high-level language code could likely fix the issue.
PantherX wrote: AFAIK, there's no real benefit for a 64-bit client. All folding is done on FahCore_a7 and FahCore_a8 which are 64-bit compatible.
jnv11 wrote: ...All generations of Threadripper will require benchmarking to see if NUMA mode or non-NUMA mode works better for each generation once a 64-bit client that is properly NUMA-aware is developed and shipped...
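To illustrate why the width of a data structure can cap the number of CPUs a program sees, here is a minimal sketch of a one-bit-per-CPU mask. It is a toy model only, not the Folding@home client's code and not a call into any Windows API; the function name and the idea that the client relies on such a mask are assumptions made for illustration.

```python
def visible_cpus(online_cpus: int, mask_bits: int) -> int:
    """Count the CPUs that fit in a one-bit-per-CPU mask of mask_bits bits."""
    full_mask = (1 << online_cpus) - 1               # one bit set per online CPU
    truncated = full_mask & ((1 << mask_bits) - 1)   # bits that fit in the mask
    return bin(truncated).count("1")

for cpus in (16, 32, 64):
    print(cpus, "CPUs ->",
          visible_cpus(cpus, 32), "via a 32-bit mask,",
          visible_cpus(cpus, 64), "via a 64-bit mask")
```

With a 32-bit mask, anything beyond the 32nd CPU simply has no bit to live in, which matches the symptom in the thread title. A wider mask in a 64-bit build removes that particular limit without any change to the logic.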
Since Microsoft greatly expanded the data structures when designing 64-bit Windows to accommodate more cores, and my system already exceeds the limits of the 32-bit data structures, a 32-bit Windows client getting accurate information by itself is ruled out. Asking the folding cores to pass the data to the client would create loads of messy complexity and invite spaghetti code that is a nightmare to maintain. Furthermore, the current Windows Folding@home client ignores user requests to set the number of CPU cores in one folding slot to more than 32, so that will need to be changed in the next version of the client software. Asking the cores to manage themselves would anger users who want manual control over setting more than 32 cores per slot.
gunnarre wrote: If the 32 bit client can get correct answers from the OS about the structure of a 64 bit system (core count, NUMA, multi-CPU structure, RAM bandwidth), then that shouldn't be a problem, but if it can't then those questions would need to be asked by the 64 bit folding core instead and either passed up to the client or managed completely within the folding core.