Neil-B wrote:Using a single slot utilising both CPUs makes this a non-issue? ... the a8 core works fine under Windows utilising two CPUs and delivers great performance - just needs a client that allows the core to run as it can
We're talking about more than 32 cores or threads; there seem to be some issues.
gunnarre wrote:The OS. As far as I understand, an SMP-aware Linux kernel will by default attempt to keep all the threads of a particular process running on one CPU. So I would test with one slot per CPU and see if that makes folding faster. It should spawn one process for each work unit, and if the kernel does its job correctly, it should run each process on one CPU without having to move threads between them. You shouldn't have to force the process to run on a particular CPU, although you can do so manually if you want:
From the Linux taskset documentation:
Note that the Linux scheduler also supports natural CPU affinity: the scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications.
The kernel scheduler has a process tree and knows which processes communicate, so a multi-threaded process with fewer threads than one CPU has should automatically be kept on that CPU.
Likewise, a NUMA-aware Linux kernel should automatically try to localize each process to one NUMA node, to reduce cross-node memory access. I don't run a Threadripper though, so I'm not sure about this part.
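If you do want to pin a slot by hand, here is a minimal sketch of what taskset does underneath, using Python's os.sched_setaffinity (Linux-only). The PID and the core range below are placeholders; check /sys/devices/system/cpu/cpuN/topology/physical_package_id to see which core IDs actually belong to each physical CPU on your box.

import os

# Hypothetical PID of a running FahCore process (look it up with pgrep, top, etc.)
pid = 12345

# Assumption: core IDs 0-31 all sit on the first physical CPU / NUMA node.
first_cpu_cores = set(range(0, 32))

# Pin that task to the first CPU. This only changes the named thread's mask;
# worker threads the core has already started keep theirs ("taskset -a -p" covers all of them).
os.sched_setaffinity(pid, first_cpu_cores)

# Confirm the new affinity mask
print(os.sched_getaffinity(pid))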
Neil-B wrote:Using a single slot utilising both CPUs makes this a non-issue? ... the a8 core works fine under Windows utilising two CPUs and delivers great performance - just needs a client that allows the core to run as it can
We're talking about more than 32 cores or threads; there seem to be some issues.
Yeah, CPUs are different from chiplets. CPU chiplets are on the same CPU package.
You can have a CPU like a Threadripper with 2 CPU chiplets, each with 8 or 16 cores (16 to 32 threads per chiplet, multiplied by the number of chiplets to give the total CPU thread count).
It would definitely lower performance if a few threads of one chiplet end up running on another chiplet, since they're pulling data from a different L-cache section: the data has to be loaded from one L-cache into a CPU core, which then forwards it to the L-cache block on the other chiplet. That's a lot of added latency. Hence the question.
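If anyone wants to see where those cache boundaries fall on their own machine, the kernel exposes them in sysfs. A rough sketch (assuming Linux, and that the L3 shows up as cache index3, which is the usual case):

import glob

# Group core IDs by the set of cores sharing their L3 cache.
# Cores in the same group are on the same chiplet/CCX, so threads that stay
# inside one group never pay the cross-chiplet latency described above.
l3_groups = {}
for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"):
    cpu_id = int(path.split("/")[5].lstrip("cpu"))
    with open(path) as f:
        shared = f.read().strip()  # e.g. "0-7,64-71"
    l3_groups.setdefault(shared, []).append(cpu_id)

for shared, cpus in sorted(l3_groups.items()):
    print(f"L3 domain {shared}: cores {sorted(cpus)}")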