Using MPS to dramatically increase PPD on big GPUs (Linux guide)


arisu
Posts: 586
Joined: Mon Feb 24, 2025 11:11 pm

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by arisu »

enroscado wrote: Thu Jul 31, 2025 12:31 pm I have a couple of 5090s that I could try this on. However, I am not running fah-client.service (my setup doesn't allow it), but rather running it as a command in an (Ubuntu) terminal window that remains open.

Can you please help me figure out a workaround with no service? I'll let it run for a week or so on both 5090s, and compare overall output after a week.
You should be able to do it like this:

Code: Select all

# Put the GPUs into exclusive-process mode so all CUDA work goes through the MPS server
sudo nvidia-smi -c EXCLUSIVE_PROCESS
# Tell MPS and the client which GPUs they may use (all of them, listed by UUID)
export CUDA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | paste -sd ',')
# Start the MPS control daemon (it forks itself to the background)
nvidia-cuda-mps-control -d
# Start folding
fah-client
That assumes both GPUs are in the same computer. It sets the GPUs into exclusive-process mode, exports an environment variable that tells the MPS daemon and the folding client which GPUs they can use, starts the MPS server (which forks itself to the background), and then starts folding.

To stop folding and MPS, either press Control+C in the fah-client terminal or send the process SIGTERM (which tells it to shut down gracefully), then run this:

Code: Select all

# Put the GPUs back into default compute mode
sudo nvidia-smi -c DEFAULT
# Clear the GPU list exported for MPS and the client
unset CUDA_VISIBLE_DEVICES
# Tell the MPS control daemon to shut down the MPS server and exit
echo quit | nvidia-cuda-mps-control
That resets all GPUs to default mode, removes the environment variable, and tells the MPS control daemon to terminate the MPS server.
arisu
Posts: 586
Joined: Mon Feb 24, 2025 11:11 pm

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by arisu »

Nicolas_orleans wrote: Thu Jul 31, 2025 12:56 pm To my knowledge (from fah-client --help) there is no command line option to restart fah-client when it is not running as a service.
I would pause fah-client from the web UI (using the finish function to reach "no work" status), follow arisu's steps, and replace the service restart with a pkill of the fah-client process, then the usual start with ./fah-client
That's what I plan to do on a 5080 instance when I find time... pkill is ugly but I have no better idea.
As calxalot says, both SIGTERM (the pkill default) and SIGINT (the signal sent when you press Control+C while the process runs) are safe.

In fact systemd just sends SIGTERM to the process when stopping it (source). The client detects the SIGTERM and relays the message to the cores, asking them to shut down too. Once the cores shut down and save their work, the client logs the event and the core's progress and then terminates itself. The client doesn't know who is responsible for shutting it down (systemd or pkill) and it doesn't care.

Avoid SIGKILL unless the client refuses to shut down for many minutes even after multiple SIGTERMs, because the kill signal rips out its guts without giving it time to clean up, which can cause loss of progress or even database corruption if you're unlucky.

This is what I do:

Code: Select all

pkill -x fah-client
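If you want a belt-and-braces version, a small wrapper along these lines (just a sketch using standard pgrep/pkill) sends the default SIGTERM and then simply waits for the client to exit before you even think about SIGKILL:

Code: Select all

# pkill sends SIGTERM by default, so the client shuts the cores down cleanly
pkill -x fah-client
# Wait up to ~10 minutes for it to exit before even considering SIGKILL
for i in $(seq 1 60); do
    pgrep -x fah-client > /dev/null || break
    sleep 10
done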
umfaddi
Posts: 8
Joined: Fri Aug 19, 2016 10:49 am

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by umfaddi »

When using a two-GPU setup, what is the best way of adding a new resource group?

The default group with both GPUs enabled and...
another GPU resource group with both GPUs enabled, for a total of 2 groups...
...OR:
one resource group with GPU no. 1 enabled and
one resource group with GPU no. 2 enabled, for a total of 3 groups.

Does it even matter which option I go for?
muziqaz
Posts: 2131
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 9950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, Intel B580
Location: London
Contact:

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by muziqaz »

Create as many resource groups as the number of slices you want to split each GPU into.
For 2 GPUs each split in half, you need 4 resource groups: two of them assigned to one GPU, and the other two assigned to the second GPU.
FAH Omega tester
Albuquerquefx
Posts: 11
Joined: Wed Oct 01, 2025 3:05 am
Hardware configuration: AMD 9800X3D + 5090, Windows 11
AMD 5950X + 4070 Super, Fedora 42 VM on Proxmox 8.3
AMD 5500 + 4070 Super + 4090, Fedora 42 + Cuda MPS

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by Albuquerquefx »

I created an account here on the forums specifically to say "Thank you!" to @arisu. If I may, here are some additional findings...

First, the box I'm running:
  • Intel DX79Si "Siler" motherboard; PCIe 3.0 was in "beta" at the time and only supported with very specific steppings of compatible processors
  • Intel i7-3930k C2-stepping CPU, which supported PCIe 3.0 and the VT-* instructions
  • Eight 4GB DDR3/1600 sticks of RAM in quad-channel 1T CL8 1600MT config.
  • Asus TUF Gaming 4090 at 350W power limit, and +180 GPU / -1000 memory offsets, plugged into a PCIe 3.0 x16 slot.
  • MSI Ventus 4070 Super at 200W power limit, and +90 GPU / -1000 memory offsets, plugged into a PCIe 3.0 x8 slot.
  • Fedora 42 running NVIDIA 580.82 drivers, with coolbits enabled, on Xorg desktop
Without MPS, the 4090 will reliably generate somewhere between 20-22MPPD, and the 4070 Super about 9-11MPPD, obviously depending on WU distribution. With MPS enabled for two sessions each, the 4090 delivers between 22-30MPPD in the aggregate, whereas the 4070 Super seems to choke down about 8-12MPPD in the aggregate. Without any context other than these scores, you would imagine the 4070 Super is simply ill-equipped for use with MPS.

However, digging further into the situation shows that @arisu's concerns about the disconnect between artificial scoring (PPD!) and the actual science output are well and truly the problem. Looking at the actual WU output, both cards increased their throughput by significant double digits: the 4090 saw an almost 55% increase in WU throughput, and the 4070S a 35% increase. Exactly as @arisu expected, the wall-clock time needed to compute a single WU artificially constrains these cards, which clearly have the performance to crank out more than one WU at a time.

I did test three MPS instances with the 4090 (with 16,384 CUDA cores, it seemed like ~5,460 CUDA cores per instance would be more than enough to chew through a lot of work), and while the total WU output increased even further, to about 65%, the scores dropped dramatically to the 12-15MPPD range. Again, almost a two-thirds increase in total output, but a whopping 30% decrease in allocated points seems a little too much.

I've since left the 4090 split into two MPS sessions, and have left MPS configured for the 4070S but for now only have one resource group assigned to it. A week ago, that one box would average about 33MPPD, it now averages right around 40MPPD. This is also on a motherboard and processor from literally 14 years ago (!). I have another 4070 Super with the same specs and config running on an ASRock B550 + 5950X rig and, again using the same exact everything, belts out 11-12MPPD. I have another ASRock B550 + Ryzen 5500 coming in the mail today and I'll get that i7-3930k put out to pasture shortly. I should find another ~10% or better performance hiding in there...
Last edited by Albuquerquefx on Wed Oct 01, 2025 6:42 pm, edited 1 time in total.
foldinghomealone
Posts: 164
Joined: Wed Feb 01, 2017 7:07 pm
Hardware configuration: 5900x + 5080

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by foldinghomealone »

Great summary, thanks guys.
I wish I had the Linux knowledge to set this up.
I'm expecting a 5080 in the next few days and would like to run some tests.

What about power draw in MPS "mode"?
(edit: this can be seen in the first post; it seems to be only a bit more)
Albuquerquefx
Posts: 11
Joined: Wed Oct 01, 2025 3:05 am
Hardware configuration: AMD 9800X3D + 5090, Windows 11
AMD 5950X + 4070 Super, Fedora 42 VM on Proxmox 8.3
AMD 5500 + 4070 Super + 4090, Fedora 42 + Cuda MPS

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by Albuquerquefx »

foldinghomealone wrote: Mon Oct 06, 2025 12:23 am I wish I had the Linux knowledge to set this up.
Which Linux distro are you running / are you planning to run?
foldinghomealone wrote: Mon Oct 06, 2025 12:23 am What about power draw in MPS "mode"?
The answer is "it depends" but you should expect a few more watts...

Your CPU will have additional core(s) in a spinlock state servicing the extra F@H thread(s). Depending on your CPU, this is likely to be only a small handful of watts (less than double digits). Main system memory will also be a bit busier, which could add another watt or three at the wall socket.

As you're already aware, the GPU power consumption varies significantly based on WU. The question to ask first: is the GPU already power limited before enabling MPS? Because this is a 24/7 folding rig, and is also bundled with a 4070 Super, I had both cards power limited down a bit because the WU/watt (or PPD/watt) efficiency is significantly better at lower power levels. With only one CUDA worker thread, my 4090 would typically sit at my 350W power limit all the time. With MPS enabled, I only noticed a tiny bit of power difference, mostly because the card could remain loaded with one WU while another one finished uploading / downloading / preparing to start.

All in all, I wager the addition of MPS probably added another ~25 or so watt-hours of consumption to a rig that's already chewing through nearly 625Whr when the monitor is off.
foldinghomealone
Posts: 164
Joined: Wed Feb 01, 2017 7:07 pm
Hardware configuration: 5900x + 5080

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by foldinghomealone »

As a Linux newbie I managed to install Xubuntu 24.04.3 LTS on my system and started to fold a bit in Windows, Xubuntu and with MPS activated.

However, contrary to what I expected, folding doesn't seem to benefit from MPS on my RTX 5080.
PPD didn't increase much, but power draw rose significantly.

Code: Select all

Run  OS       Mode        PPD      Power   Efficiency    WUs
1    Win11    GPU only    18.5M    375W    49.3 kPPD/W   26
2    Xubuntu  GPU only    21.0M    355W    59.2 kPPD/W   26
3    Xubuntu  MPS / M+C   22.6M    470W    48.1 kPPD/W   23

- power draw measured without monitor etc.
- GPU is stock --> no OC/UV
When MPS is activated, the GPU is constantly pinned at its power limit (reported GPU draw: ~350W). In normal mode it is barely ever power limited.

I would be really interested in how much MPS has increased the scientific value.
arisu wrote: Sat Jul 19, 2025 7:50 am ...The ultimate way to determine how much benefit you are getting is by checking to see if the total ns/day has increased. Look for the growing science.log files in your work directory. Every checkpoint, it will log the most recent ns/day. When MPS is in use, you'll have multiple science.log files for the same GPU. Add together the ns/day values and see if the sum is greater than the ns/day value of a single WU when MPS is not active. ...
Is there a way to check the total ns/day after finishing folding? When the client stops, there is no science.log file anymore.

I will continue MPS testing after I have a stable OC/UV.

If you have any further suggestions or tips, please let me know.
muziqaz
Posts: 2131
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 9950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, Intel B580
Location: London
Contact:

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by muziqaz »

The idea is to create 2 resource groups per GPU when you have MPS enabled. That way MPS splits and distributes the GPU's resources between those 2 WUs.
Each WU has its own directory inside the fah-client work folder, and each of those directories has a folder called 00, which should contain the science.log for that WU. Keep in mind that science.log is sent back to the server once the WU is completed, and no local copy is kept after the WU has been sent back. Monitoring ns/day is not the correct way to check performance unless you happen to receive three copies of the same project (one without MPS and two with); only WUs from the same project can be compared by ns/day or TPF. GPU utilisation is also a good indicator of whether the card is being used to its full potential.
So let's say you receive a p10 WU on your GPU while MPS is disabled. You get 10M PPD while folding it, science.log shows an output of 10 ns/day, and the normal log shows a 10s TPF. Your GPU utilisation is reported as only 60%, which is bad.
Now you enable MPS and create a second resource group for the same GPU. You receive one p10 WU in the 1st group and another p10 WU in the 2nd group. Since it is the same project you were folding before, you can directly compare the performance, either through TPF or ns/day.
The TPF on both resource groups might now show 12s, which is 2 seconds slower than folding without MPS on a single resource group, and each WU's science.log might now report 9 ns/day. Then you look at your GPU utilisation: you might now see the GPU being used at 80%, which is much better than the 60% reported before with a single WU running.
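If you want to keep an eye on that while the WUs are running, something like this is a quick check (just a sketch; these nvidia-smi query fields should be available on recent drivers):

Code: Select all

# Print GPU utilisation, power draw/limit and SM clock every 5 seconds (Ctrl+C to stop)
nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit,clocks.sm --format=csv -l 5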
So, with a single WU and no MPS you get 1 WU returned in 1000s; a second identical WU takes another 1000s, which makes 2000s for 2 WUs.
With MPS enabled you get 2 WUs returned in 1200s. That is 800s less than without MPS, and it will be reflected in PPD.
However, people rarely get the same WUs at the same time. Over time, though, it is easy to collect several samples of the same WU, and the normal logs are still saved locally once WUs are sent back, so you can take the TPFs and compare them against each other; just make sure it is the same project.
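If you do want to add up the live ns/day figures for same-project WUs the way arisu suggested earlier in the thread, a rough sketch along these lines can do it (it assumes each running WU has a science.log under the work directory as described above, and that the figure is printed as "... ns/day"; adjust the pattern to whatever your core actually writes):

Code: Select all

# May need sudo to read the client's work directory
for f in /var/lib/fah-client/work/*/00/science.log; do
    grep 'ns/day' "$f" | tail -n 1
done | grep -oE '[0-9]+(\.[0-9]+)? ns/day' | awk '{sum += $1} END {print sum, "ns/day combined"}'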
FAH Omega tester
foldinghomealone
Posts: 164
Joined: Wed Feb 01, 2017 7:07 pm
Hardware configuration: 5900x + 5080

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by foldinghomealone »

I have done exactly that:
muziqaz wrote: Sun Oct 12, 2025 12:05 pm The idea is to create 2 resource groups per 1 GPU when you have mps enabled. That way MPS splits and distributes the GPU resources to those 2 WUs.
I have made a small database / Google sheet from the limited data the client provides; the following projects were folded both normally and with MPS. Shown are the average TPF in seconds and the PPD.

Code: Select all

                xub,gpu                       xub,mps
Project         TPF [s]   PPD                 TPF [s]   PPD
15002           32.3      19,957,836          45.3      11,978,074
15314           31.0      21,572,442          41.5      13,719,026
15315           17.8      19,359,746          26.0      10,928,137
15316           30.0      22,537,633          42.0      13,567,329
15318           23.0      20,455,716          35.2      10,975,790
So, what does this tell me?
In "normal mode" with a single resource group (xub,gpu), for two consecutive WUs of the same project I can add the TPFs: for project 15002 that would be 32.3s + 32.3s, while the PPD stays the same.
In "MPS mode" with two resource groups (xub,mps), the TPF stays the same and I can add the PPDs.

But how do I see the increase in scientific value?
- How many more WUs (of the same project) would I complete? Factor = (32.3 + 32.3) / 45.3 = 1.42 ???
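If that logic is right, the same calculation over the whole table (a quick sketch using the values above) lands between roughly 1.3x and 1.5x for these projects:

Code: Select all

# throughput factor = (2 x TPF without MPS) / (TPF with MPS and two groups)
while read project tpf_gpu tpf_mps; do
    printf '%s: %sx\n' "$project" "$(echo "scale=2; 2 * $tpf_gpu / $tpf_mps" | bc -l)"
done <<'EOF'
15002 32.3 45.3
15314 31.0 41.5
15315 17.8 26.0
15316 30.0 42.0
15318 23.0 35.2
EOF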
muziqaz
Posts: 2131
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 9950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, Intel B580
Location: London
Contact:

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by muziqaz »

There is no way to find out from such a small sample. What you can do is this:
While a WU is running on your GPU, go to the /var/lib/fah-client/work/1234 folder, where 1234 is the current WU directory; you can find it in the log:
13:04:19:I1:WU156: Args: -dir hOQhIMNm50Q63kh31yNmjqDk1D4rYoD8Yr-tS2Fsqww -suffix 01
The -dir value (hOQ...) is my "1234" folder, so in my case I would need to go to:
/var/lib/fah-client/work/hOQhIMNm50Q63kh31yNmjqDk1D4rYoD8Yr-tS2Fsqww <-- yours will obviously be different
From there, copy wudata_01.dat to a more convenient place, like Desktop/mps_test/00.
Now check which fahcore this WU is using, then go to /var/lib/fah-client/cores and find the directory of that core (openmm-core-nn, where nn is the core number). I'm going to use core22 as an example.
Go as deep into the directory tree as you can until you see FahCore_22 alongside some .so files. Copy all those files, together with FahCore_22, into the Desktop/mps_test/ folder. Set your GPU to finish and pause folding, so that this WU does not sit there pointlessly while you are doing the MPS testing. Once it finishes, open a terminal window inside Desktop/mps_test and run the following commands (I don't have an NVIDIA card, so I hope this works):

Code: Select all

sudo rm /home/myxubuntu/Desktop/mps_test/libstdc++.so.6
sudo ln -s /lib/x86_64-linux-gnu/libstdc++.so.6 /home/myxubuntu/Desktop/mps_test/libstdc++.so.6
export LD_LIBRARY_PATH="$PWD"
./FahCore_22 -dir 00 -suffix 01 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0 --log-time=true
If this works, you should see the GPU loaded. The terminal will not show you the progress, but progress can be seen in the log inside Desktop/mps_test/00. Give it 3-4 frames and record the average time between frames. For ns/day you would need to wait for 5 frames or more (can't remember exactly). Once you have done 4 frames, Ctrl+C in the terminal window you used to start the test.
Now copy the contents of the Desktop/mps_test folder into Desktop/mps_test1, so that you have two sets of folders with the same WU to test. Obviously enable MPS now (it can also stay enabled from before, it doesn't matter). Clear any log files in both directories.
Open one terminal window inside Desktop/mps_test and a second terminal window inside Desktop/mps_test1.
In mps_test1 run:

Code: Select all

sudo rm /home/myxubuntu/Desktop/mps_test1/libstdc++.so.6
sudo ln -s /lib/x86_64-linux-gnu/libstdc++.so.6 /home/myxubuntu/Desktop/mps_test1/libstdc++.so.6
export LD_LIBRARY_PATH="$PWD"
./FahCore_22 -dir 00 -suffix 01 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0 --log-time=true
In the mps_test folder run:

Code: Select all

export LD_LIBRARY_PATH="$PWD"
./FahCore_22 -dir 00 -suffix 01 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0 --log-time=true
Give them both some time to finish several frames each. Then go to each folder, open the respective log files, and record the average time per frame. Compare those times with the time you recorded in the very first test, when you were running a single terminal window. Once you have recorded your data, Ctrl+C in both terminals, and you can then delete both Desktop/mps_test(1) folders. Post the findings here.
Last edited by muziqaz on Fri Oct 17, 2025 9:06 pm, edited 2 times in total.
FAH Omega tester
foldinghomealone
Posts: 164
Joined: Wed Feb 01, 2017 7:07 pm
Hardware configuration: 5900x + 5080

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by foldinghomealone »

Thank you. Will try to do it in the next few days. Will let you know.
Albuquerquefx
Posts: 11
Joined: Wed Oct 01, 2025 3:05 am
Hardware configuration: AMD 9800X3D + 5090, Windows 11
AMD 5950X + 4070 Super, Fedora 42 VM on Proxmox 8.3
AMD 5500 + 4070 Super + 4090, Fedora 42 + Cuda MPS

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by Albuquerquefx »

Since your prior WUs were not fully loading your card, it's not difficult to picture how MPS would allow more work to fill up the GPU, thus burning through more power. Ultimately that was Arisu's design goal with this post: on really wide GPUs like the top tier of the last two or three generations (3090, 4080 Super, 4090, 5080 and 5090), there are a lot of WUs which simply can't take advantage of that many CUDA cores simultaneously. As such, to properly make use of these really wide GPUs, loading more than one WU onto the GPU can make sense.

Also, as Arisu mentioned earlier in the thread, the increase in throughput may not show up in PPD but rather in total WUs per day. Since each individual WU processes more slowly, the PPD goes down -- yet the total number of WUs should increase by a notable amount, with the percentage depending entirely on how wide your GPU is. With your 5080 suddenly drawing much more power, your PPD may not necessarily bump up, but you're certainly getting more WUs pushed through.

The important part for the F@H organization is completed (and correctly processed) WUs; the points are purely window dressing. As such, if you're actually pumping out more WUs in a given day, you're doing more science regardless of the points. It's like the old US TV show Whose Line Is It Anyway, where the stories are made up and the points don't matter :D
foldinghomealone
Posts: 164
Joined: Wed Feb 01, 2017 7:07 pm
Hardware configuration: 5900x + 5080

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by foldinghomealone »

I understand the correlation between (longer) TPF and (significantly less) PPD.

Since the TPF can vary significantly across different WUs and projects, it's difficult to determine whether the GPU is actually processing more WUs per unit of time.
I can only rely on – or hope – that the MPS server isn’t just burning energy on parallelization without real gains.
At the moment, my view is this: the GPU is constrained by the Power Limit, which prevents WUs from being processed more quickly.
Once I find a stable undervolting setup, there should be more headroom before hitting the power limit. Ideally, this would allow faster processing of WUs and lead to an increase in PPD.
Of course, it's possible that another bottleneck / limiter kicks in before the power limiter does.
muziqaz
Posts: 2131
Joined: Sun Dec 16, 2007 6:22 pm
Hardware configuration: 9950x, 9950x3D, 5950x, 5800x3D
7900xtx, RX9070, Radeon 7, 5700xt, 6900xt, Intel B580
Location: London
Contact:

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by muziqaz »

The test sequence I wrote a few posts back will tell you whether you are gaining anything from MPS or not.
FAH Omega tester