Which device are the tasks running on?


arisu
Posts: 586
Joined: Mon Feb 24, 2025 11:11 pm

Re: Which device are the tasks running on?

Post by arisu »

The minimum granularity that the kernel scheduler can handle without making changes to core debug parameters is 0.75 ms but that will cause severe context switch overhead. The default for SCHED_BATCH which FAH might be switching to in the future is 1500 ms.

Most GPU kernels only take a few microseconds to execute. Even slow ones like the computeNonbonded kernel can take something like 20 us.

That's thousands of kernels that are not being executed every single time the GPU feeder thread loses the CPU.
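
As a rough back-of-the-envelope check on those figures, here is a small sketch. Only the 0.75 ms granularity and the 20 us slow-kernel time come from the post; the 3 us "typical" kernel and the 4 ms and 10 ms preemptions are assumed values for illustration, not measurements.

```python
def kernels_missed(preempt_ms: float, kernel_us: float) -> float:
    """Kernels that could have run back to back while the feeder thread was off the CPU."""
    return preempt_ms * 1000.0 / kernel_us

# 0.75 ms is the minimum granularity quoted above; 20 us is the slow-kernel figure.
# The 3 us kernel time and the 4 ms / 10 ms preemptions are assumptions.
for preempt_ms in (0.75, 4.0, 10.0):
    for kernel_us in (3.0, 20.0):
        n = kernels_missed(preempt_ms, kernel_us)
        print(f"{preempt_ms:5.2f} ms off-CPU, {kernel_us:4.1f} us/kernel -> ~{n:,.0f} kernels")
```

Under these assumptions the count reaches the thousands once the feeder thread is off the CPU for several milliseconds at a time.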
muziqaz
Posts: 2131
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 9950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, Intel B580
Location: London

Re: Which device are the tasks running on?

Post by muziqaz »

Let's just concentrate on getting folding to work instead of thinking to rewrite some molecular dynamics code from ground up to achieve imaginary efficiency gains :)
FAH Omega tester
arisu

Re: Which device are the tasks running on?

Post by arisu »

muziqaz wrote: Sat Apr 05, 2025 5:34 am Let's just concentrate on getting folding to work instead of thinking to rewrite some molecular dynamics code from ground up to achieve imaginary efficiency gains :)
Exactly this. Worse, we'd have to re-design how superscalar processors work from the ground up!

We could all talk all day about hypothetical hardware designs that allow instruction interleaving or efficient sub-microsecond timeslices, but running N+1 folding tasks on a system with N hardware threads will harm performance, even if one of those threads is "just" a thread feeding the GPU.
Peter_Hucker
Posts: 370
Joined: Wed Feb 16, 2022 1:18 am
Hardware configuration: Ryzen 9 3900XT: 24 cores, 128GB RAM, 1TB NVME, 4TB HDD, R9 Nano (Fiji) GPU.
Ryzen 9 3900X: 24 cores, 64GB RAM, 250GB NVME.
Xeon X5650 dual CPU server: 24 cores, 64GB RAM, 250GB NVME, R9 290(Hawaii) GPU.
Xeon X5650 dual CPU server: 24 cores, 64GB RAM, 250GB NVME.
I3-6100: 4 cores, 32GB RAM, 250GB NVME, 2 of R9 280X (Tahiti) GPUs.
5 other smaller computers.
Location: Scotland

Re: Which device are the tasks running on?

Post by Peter_Hucker »

arisu wrote: Sat Apr 05, 2025 5:34 am The minimum granularity that the kernel scheduler can handle without making changes to core debug parameters is 0.75 ms but that will cause severe context switch overhead. The default for SCHED_BATCH which FAH might be switching to in the future is 1500 ms.
One and a half seconds?! That's gonna screw things up, remember there are Windows background tasks, and the user etc.
arisu wrote: Sat Apr 05, 2025 5:34 am Most GPU kernels only take a few microseconds to execute. Even slow ones like the computeNonbonded kernel can take something like 20 us.

That's thousands of kernels that are not being executed every single time the GPU feeder thread loses the CPU.
No buffering?
arisu

Re: Which device are the tasks running on?

Post by arisu »

Peter_Hucker wrote: Sat Apr 05, 2025 6:29 am One and a half seconds?! That's gonna screw things up, remember there are Windows background tasks, and the user etc.
That's the default time slice not the granularity. It can still be preempted before that. The minimum theoretical is still 0.75 ms. Btw that's only on Linux. Windows uses something else. Probably something inferior considering the same hardware folds on the same GPU faster on Linux than on Windows.
Peter_Hucker wrote: Sat Apr 05, 2025 6:29 am No buffering?
The CPU sends a series of kernels to the GPU plus data for them to work on and they all execute in parallel. When they return, the GPU returns to a ready state. The CPU thread that feeds the GPU is continuously polling the GPU's state in a loop (at least on Nvidia platforms) and when it sees that the GPU is back in the ready state, it collects the data and uses it to prepare a new batch of data with new kernels and then it sends that.

Buffering everything isn't possible because the next batch of data getting sent depends on the results of computations on the previous batch. Going from one batch to the next batch is an inherently serial operation, even though each "batch" contains a bunch of small kernels that are run independently and in parallel. It's probably possible in theory to process everything on the GPU itself, keeping everything in VRAM and, effectively, moving the CPU feeder thread into the GPU, but that hasn't been done and there are probably technical and logistical reasons why it would be hard to do with FAH.
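
A minimal sketch of that dependency, using a made-up `kernel` function and a thread pool as a stand-in for the GPU (none of this is FAH or OpenMM code): the work inside one batch can be spread out in parallel, but the loop over batches cannot be reordered or queued up in advance, because each batch's input is the previous batch's output.

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(x: float) -> float:
    """Hypothetical stand-in for one small GPU kernel: updates one value independently."""
    return 0.5 + 0.5 * x

def run_batch(state: list, pool: ThreadPoolExecutor) -> list:
    """Within a batch every kernel is independent, so they can all run in parallel."""
    return list(pool.map(kernel, state))

def simulate(state: list, steps: int) -> list:
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(steps):
            # Batch n+1 cannot be prepared until batch n has returned:
            # its input is the previous batch's output.
            state = run_batch(state, pool)
    return state

print(simulate([0.0, 0.5, 1.0], steps=3))
```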

As long as the feeder thread has exclusive access to a hardware thread, the GPU will never be idle. The more often the feeder has to be kicked off of the hardware thread to make room for something else, the longer the GPU will stay idle. Folding runs with low priority so that's not a bad thing and you'll only lose a tiny amount of work, but if you have something else competing for 100% CPU usage, you will find that the GPU gets severely starved.

The AMD GPU driver is a little more forgiving btw. Unlike the Nvidia driver which uses a spin wait loop (the CPU asking the GPU over and over if it's done with its current batch), the AMD driver uses interrupts (the CPU sends the batch then goes on to do its thing and the GPU will notify the CPU when it is done). For some reason, Nvidia performs a lot better when using a spin wait loop than when using interrupts.
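
The difference between the two waiting strategies can be sketched with plain Python threads. This is only a toy: `fake_gpu` is a hypothetical stand-in that sleeps and then sets a flag, and nothing here touches a real driver. For reference, the CUDA runtime does expose both behaviours through `cudaSetDeviceFlags` (`cudaDeviceScheduleSpin` and `cudaDeviceScheduleBlockingSync`); how the folding core configures them is not shown here.

```python
import threading
import time

def fake_gpu(done: threading.Event, work_s: float) -> None:
    """Hypothetical stand-in for the GPU: 'computes' for work_s seconds, then signals."""
    time.sleep(work_s)
    done.set()

def wait_spin(done: threading.Event) -> int:
    """Spin wait: keep asking 'are you done yet?', occupying a hardware thread."""
    polls = 0
    while not done.is_set():
        polls += 1
    return polls

def wait_blocking(done: threading.Event) -> int:
    """Interrupt-style wait: sleep until notified, leaving the CPU free meanwhile."""
    done.wait()
    return 1

def run(wait_fn, work_s: float = 0.05):
    """Start the fake GPU, wait with the given strategy, report checks and CPU time."""
    done = threading.Event()
    gpu = threading.Thread(target=fake_gpu, args=(done, work_s))
    cpu_start = time.process_time()
    gpu.start()
    checks = wait_fn(done)
    gpu.join()
    return checks, time.process_time() - cpu_start

for name, fn in (("spin", wait_spin), ("blocking", wait_blocking)):
    checks, cpu_s = run(fn)
    print(f"{name:8s}: {checks} completion checks, {cpu_s * 1000:.1f} ms CPU time")
```

The spin version reacts as soon as the flag flips but burns CPU time for the whole wait; the blocking version uses almost none but only resumes once the scheduler wakes it up.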
Peter_Hucker

Re: Which device are the tasks running on?

Post by Peter_Hucker »

I've tested several Boinc projects under Linux and Windows. Some are faster in one, some in the other. I guess it depends on the programming style.

I decided I'd just stick to Windows; it's easier to use. But I do have Linux on a Mac because I can't stand their kid's OS.

I guess the Nvidia does better because the CPU is ready immediately. With an AMD the CPU could be busy when interrupted. I choose cards by price anyway, and it's always been AMD.

Running two tasks on the GPU at once would be better, but then they would get returned slower.

With very fast GPUs, can one CPU core feed it fast enough?
arisu

Re: Which device are the tasks running on?

Post by arisu »

Peter_Hucker wrote: Sat Apr 05, 2025 9:05 am With very fast GPUs, can one CPU core feed it fast enough?
It's limited to one CPU core anyway. A CPU core, if it's not extremely busy with other things or being locked at a super low clock, will always be able to saturate the PCIe bus to feed the GPU. And folding doesn't use up much PCIe bandwidth in the first place. Unless you are doing something silly like connecting an RTX 5090 over a single PCIe 4.0 lane on a rig meant for crypto mining, one core will always be fine.

Despite using 100% of a single core (technically hardware thread), it actually doesn't use that much extra power because it's spending almost all its cycles in a spin wait loop and it's not engaging any of the really power hungry components of the CPU like the ALU or FPU or register file.
Peter_Hucker wrote: Sat Apr 05, 2025 9:05 am I guess the Nvidia does better because the CPU is ready immediately. With an AMD the CPU could be busy when interrupted. I choose cards by price anyway, and it's always been AMD.
For the same price, Nvidia GPUs usually have better performance for two reasons: 1) Nvidia has been in the HPC market a long time and their GPUs are designed for compute tasks and 2) Nvidia GPUs use CUDA and AMD GPUs use OpenCL, and CUDA is just more efficient (though soon AMD GPUs will use HIP, which will bring folding performance in line with that of CUDA).

Especially on the higher end, Nvidia always wins out. Compared to an AMD RX 7900 XTX, an Nvidia RTX 5090 has about 70% higher FP32 speed (104.8 vs 61.4 TFLOPS) while using about 60% more power (575 W vs 355 W TDP) at 2 times the price ($1999 vs $999 MSRP). On paper that looks like a win for the AMD card, but the Nvidia card folds 5 times faster (50M vs 10M PPD for the best projects), and there's no way HIP will grant a 5x speedup over OpenCL.
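
To make the comparison easier to check, here is the arithmetic as a short script. The spec and PPD figures are the ones quoted in this post, not independently measured; the dictionary keys and helper names are just for illustration.

```python
# Figures as quoted in the post: FP32 TFLOPS, TDP in watts, MSRP in USD, PPD in millions.
RTX_5090 = {"fp32_tflops": 104.8, "tdp_w": 575, "msrp_usd": 1999, "ppd_millions": 50}
RX_7900_XTX = {"fp32_tflops": 61.4, "tdp_w": 355, "msrp_usd": 999, "ppd_millions": 10}

def ratio(key: str) -> float:
    """How many times larger the RTX 5090 figure is than the RX 7900 XTX figure."""
    return RTX_5090[key] / RX_7900_XTX[key]

def ppd_per(card: dict, key: str) -> float:
    """Points per day per unit of `key` (per watt or per dollar)."""
    return card["ppd_millions"] * 1e6 / card[key]

for key in ("fp32_tflops", "tdp_w", "msrp_usd", "ppd_millions"):
    print(f"{key:13s}: {ratio(key):.2f}x")

for name, card in (("RTX 5090", RTX_5090), ("RX 7900 XTX", RX_7900_XTX)):
    print(f"{name:12s}: {ppd_per(card, 'tdp_w'):,.0f} PPD/W, "
          f"{ppd_per(card, 'msrp_usd'):,.0f} PPD/$")
```

With these inputs the raw FP32 ratio is well under 2x while the folding ratio is 5x, which is the gap the post attributes to design and software stack rather than to raw FLOPS.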
Peter_Hucker wrote: Sat Apr 05, 2025 9:05 am Running two tasks on the GPU at once would be better, but then they would get returned slower.
Folding rewards fast returns, so one WU completed at full speed is worth more (in points and scientific value) than two WUs completed at half speed. But for GPUs with a huge number of shader cores like a top of the line RTX 5090, a lot of projects don't have enough atoms to fully utilize it, so some shader cores are sitting idle. Even running two WUs at once wouldn't make it better because only one would run at a time in spite of each WU not fully utilizing the GPU (only one "batch" gets worked on at a time even if it only utilizes half of the shader cores). There is a solution to that called MPS which can coalesce two batches sent from two folding processes into one batch, but the folding client doesn't support it yet because it is only for Nvidia on Linux.
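
A rough illustration of the fast-return point, assuming the quick-return bonus takes the form given in the FAH points FAQ, `base * max(1, sqrt(k * deadline / elapsed))`. The base points, `k`, deadline and run time below are hypothetical values chosen only to show the effect.

```python
import math

def wu_points(base: float, k: float, deadline_days: float, elapsed_days: float) -> float:
    """Assumed quick-return bonus: faster returns scale the base points by a sqrt factor."""
    return base * max(1.0, math.sqrt(k * deadline_days / elapsed_days))

# Hypothetical project parameters and run time (not taken from any real project).
BASE, K, DEADLINE_DAYS = 10_000, 0.75, 3.0
T = 0.1  # days for one WU when it has the GPU to itself

# Over a window of 2*T both strategies finish two WUs.
one_at_a_time = 2 * wu_points(BASE, K, DEADLINE_DAYS, T)       # each WU takes T
two_at_once = 2 * wu_points(BASE, K, DEADLINE_DAYS, 2 * T)     # each WU takes 2*T

print(f"one at a time: {one_at_a_time:,.0f} points")
print(f"two at once:   {two_at_once:,.0f} points")
print(f"ratio:         {one_at_a_time / two_at_once:.3f}")
```

As long as both return times are fast enough to earn a bonus, the one-at-a-time strategy comes out ahead by a factor of sqrt(2) under this formula.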
Peter_Hucker

Re: Which device are the tasks running on?

Post by Peter_Hucker »

It seems insane to spin wait. Surely the thing doing the work (GPU) should be the thing saying "ready!" What a waste of CPU time.

I have yet to see a better value Nvidia. I always look at power to price ratio. I'm not interested in it being faster for one particular application. No fast FP, me no buy.

Folding only rewards beating the timeout. Base value or base+bonus value.

Why can't two run without messing about with whatever MPS is? Boinc will run any number of anything on a GPU, even two completely different projects. A GPU will also play a game while crunching.
arisu

Re: Which device are the tasks running on?

Post by arisu »

Peter_Hucker wrote: Sat Apr 05, 2025 12:31 pm It seems insane to spin wait. Surely the thing doing the work (GPU) should be the thing saying "ready!" What a waste of CPU time.
I agree. AMD seems to get away with an interrupt-driven system. Nvidia supports that too but it has a bad performance hit for some reason (I suspect it would be possible to get around the performance hit but that would require changing the driver itself): https://github.com/openmm/openmm/issues/2955

I asked about it when I found out about it too because it seemed insane to me as well until I read deeper: viewtopic.php?t=42624
Peter_Hucker wrote: Sat Apr 05, 2025 12:31 pm Why can't two run without messing about with whatever MPS is? Boinc will run any number of anything on a GPU, even two completely different projects. A GPU will also play a game while crunching.
I don't know what BOINC is doing but on Nvidia GPUs it's not possible to run two compute tasks simultaneously. You can run graphics and compute, but not two compute tasks. Not truly at the same time (they will be multitasked, but the batch of work from both processes won't get coalesced into a single batch, and only one will be running in any instant even if it is not fully utilizing all shader cores). It's not possible without MPS (or an enterprise-GPU-only thing called vGPU).
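
A toy model of that difference, heavily idealized: the function names, the 1 ms batch time and the "half of the shader cores" figure are all assumptions for illustration, and real scheduling has overheads this ignores.

```python
import math

def time_sliced_ms(batch_ms: float, n_tasks: int) -> float:
    """Without coalescing: batches from different processes run one after another,
    regardless of how few shader cores each one occupies."""
    return batch_ms * n_tasks

def coalesced_ms(batch_ms: float, core_fraction: float, n_tasks: int) -> float:
    """MPS-like coalescing (idealized): batches run side by side until the
    shader cores are full, then spill into another round."""
    rounds = math.ceil(n_tasks * core_fraction)
    return batch_ms * rounds

# Assumed numbers: two WUs, each batch takes 1 ms and fills half of the shader cores.
print("time-sliced:", time_sliced_ms(1.0, 2), "ms for one batch from each WU")
print("coalesced:  ", coalesced_ms(1.0, 0.5, 2), "ms for one batch from each WU")
```

In this model two half-width batches take twice as long time-sliced as they do coalesced, so without coalescing running a second WU gains nothing over running them one after the other.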
Peter_Hucker

Re: Which device are the tasks running on?

Post by Peter_Hucker »

I read your linked topic, and it has put me off Nvidia completely. They seem like Apple: let's do everything differently and be incompatible. I'm sticking to Windows, Android, and AMD. No Macs, no Linux, no Nvidia.

Boinc most certainly does run two compute tasks on AMD or Nvidia. In fact you can run several.

And of course on AMD or Nvidia you can play games (they presumably use shader cores) while computing, and display webpages, etc. Any GPU multitasks just like a CPU. Perhaps you are correct and it doesn't work like I thought - 50% of shader cores to each task, maybe it increases efficiency by running a second GPU task while the CPU is loading the first one, I don't know. But with many Boinc projects, I get more work done overall if I run 2 or more tasks on the GPU at once. It means more than one CPU core can be feeding, and the GPU doesn't have to wait for that feed, it can be getting on with something else.
arisu

Re: Which device are the tasks running on?

Post by arisu »

AMD is the same, doing things to create vendor lock-in, such as switching to a proprietary CUDA competitor (HIP) and cutting funding to the open source OpenCL replacement (SYCL). In the case of MPS only supporting Linux, that is because MPS is for compute tasks and most HPC computing servers run Linux. Maybe BOINC itself coalesces the batches like MPS does or something? I don't know.

I also prefer AMD because their driver is open source and I like AMD better in general, but Nvidia is still (currently) better with compute tasks.
muziqaz

Re: Which device are the tasks running on?

Post by muziqaz »

AMD does run 2 FAH WUs on one GPU, and there is a bit of benefit to doing that, depending on the project. But OpenCL is so slow compared to Nvidia and to HIP that there is no point in doing that at the moment.
HIP does not lock companies into using AMD; it's just that Nvidia has their CUDA, which is truly proprietary.
SYCL or HIP?
I would take HIP any day, with the speedup it offers.
AMD lost big time with OpenCL being so open and friendly. They need to catch up to Nvidia with their software stack.
Peter_Hucker

Re: Which device are the tasks running on?

Post by Peter_Hucker »

But HIP also runs on Nvidia doesn't it? Even if it didn't, I'd say they're just getting revenge on Nvidia. AMD have to protect their cashflow after all.

And look at Apple. Why make programmers write code twice? Their computers are not better, they're shinier I'll grant you, but they do the same at double the price. I even saw a devout Apple worshipper change to PC when I showed him the prices. But in this case, Apple is one company. PCs are a billion companies. It's surprising Apples sell any at all.

I don't bother looking into CUDA/OpenCL efficiency per project. I just look at the basic number, flops per £. I used to go for the cards with more DP (as I loved running Milkyway), and Nvidia is useless at DP. Even now that the project has finished (on the GPU side), SP per £ is still higher on AMD than on Nvidia. I just bought a Radeon Pro WX 9100, not chosen for compute this time, but it was the cheapest card with 16GB RAM on it. I need this to run 5 Android emulators simultaneously, to play 5 different accounts in a war game :-)

I nearly bought a slightly cheaper Nvidia with 16GB, but it had no monitor outputs! This AMD has 6, and I use 5 of them. The old card only had 4, so I used a very unreliable USB GPU.
muziqaz

Re: Which device are the tasks running on?

Post by muziqaz »

Nvidia does not support HIP. But HIP is CUDA code recompiled for AMD hardware, I believe.
Peter_Hucker

Re: Which device are the tasks running on?

Post by Peter_Hucker »

From the AMD website:

"The Heterogeneous-computing Interface for Portability (HIP) is a C++ runtime API and kernel language that lets you create portable applications for AMD and NVIDIA GPUs from a single source code."

https://rocm.docs.amd.com/projects/HIP/ ... index.html