Neil-B wrote:Using a single slot utilising both CPUs makes this a non-issue? ... the a8 core works fine under Windows utilising two CPUs and delivers great performance - just needs a client that allows the core to run as it can
We're talking about more than 32 cores or threads; there seem to be some issues.
gunnarre wrote:The OS. As far as I understand, an SMP-aware Linux kernel will by default attempt to keep all the threads of a particular process running on one CPU. So I would test with one slot per CPU and see if that makes folding faster. It should spawn one process for each work unit, and if the kernel does its job correctly, it should run each process on one CPU without having to move threads between them. You shouldn't have to force the process to run on a particular CPU, although you can do so manually if you want:
From the Linux taskset documentation:
Note that the Linux scheduler also supports natural CPU affinity: the scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications.
The kernel scheduler has a process tree and knows which processes communicate, so a multi-threaded process with fewer threads than one CPU has should automatically be kept on that CPU.
Likewise, a NUMA-aware Linux kernel should automatically try to localize each process to one NUMA node, to reduce cross-node memory access. I don't run a Threadripper though, so I'm not sure about this part.
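If you do want to pin a slot by hand, here is a minimal sketch of what taskset does underneath, using Python's os.sched_setaffinity (Linux-only). The PID and the core range below are placeholders; check /sys/devices/system/cpu/cpuN/topology/physical_package_id to see which core IDs actually belong to each physical CPU on your box.

import os

# Hypothetical PID of a running FahCore process (look it up with pgrep, top, etc.)
pid = 12345

# Assumption: core IDs 0-31 all sit on the first physical CPU / NUMA node.
first_cpu_cores = set(range(0, 32))

# Pin that task to the first CPU. This only changes the named thread's mask;
# worker threads the core has already started keep theirs ("taskset -a -p" covers all of them).
os.sched_setaffinity(pid, first_cpu_cores)

# Confirm the new affinity mask
print(os.sched_getaffinity(pid))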
Neil-B wrote:Using a single slot utilising both CPUs makes this a non-issue? ... the a8 core works fine under Windows utilising two CPUs and delivers great performance - just needs a client that allows the core to run as it can
We're talking about more than 32 cores or threads; there seem to be some issues.
Yeah, CPUs are different from chiplets. CPU chiplets are on the same CPU package.
You can have a CPU like a Threadripper with 2 CPU chiplets, each with 8 or 16 cores (16 to 32 threads per chiplet, multiplied by the number of chiplets to give the total CPU thread count).
It would definitely lower performance if a few threads of one chiplet end up running on another chiplet, since they're pulling data from a different L-cache section: the data has to be loaded from one L-cache into a CPU core, which then forwards it to the L-cache block on the other chiplet. That's a lot of added latency. Hence the question.
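If anyone wants to see where those cache boundaries fall on their own machine, the kernel exposes them in sysfs. A rough sketch (assuming Linux, and that the L3 shows up as cache index3, which is the usual case):

import glob

# Group core IDs by the set of cores sharing their L3 cache.
# Cores in the same group are on the same chiplet/CCX, so threads that stay
# inside one group never pay the cross-chiplet latency described above.
l3_groups = {}
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"):
    cpu_id = int(path.split("/")[5].lstrip("cpu"))
    with open(path) as f:
        shared = f.read().strip()  # e.g. "0-7,64-71"
    l3_groups.setdefault(shared, []).append(cpu_id)

for shared, cpus in sorted(l3_groups.items()):
    print(f"L3 domain {shared}: cores {sorted(cpus)}")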