Using MPS to dramatically increase PPD on big GPUs (Linux guide)

arisu
Posts: 541
Joined: Mon Feb 24, 2025 11:11 pm

Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by arisu »

An RTX 3090 on Linux earns around 9M PPD. By utilizing a trick called MPS (Multi-Process Service), this can be increased to 14M PPD. Other big GPUs, like the 4080/4090 and 5080/5090, may see even greater gains. Remember that smaller GPUs do not benefit from this guide.

If you already know a bit about GPU architecture, feel free to skip to the section "A guide to using MPS with FAH".

Why are small projects not suitable for high-end GPUs?

Nvidia GPUs contain hundreds to tens of thousands of units called CUDA cores that perform 32-bit floating point calculations (the type that dominates the work done in FAH simulations). When running a compute application with CUDA on a GPU, small programs called kernels are sent to the GPU sequentially. Once a kernel has finished its task, which usually takes mere microseconds, the CPU sends the next kernel to the GPU.

Each kernel performs some computational work like determining which atoms are interacting, calculating the force of chemical bonds, or even something as mundane as erasing a buffer in memory. Dozens of kernel launches in a row, each one doing a different job, make up a single simulation step and advance the simulation by 2-4 femtoseconds. Tens of thousands of simulation steps make up 1% of a work unit.

When a kernel executes, it launches hundreds to thousands of threads. Each thread performs the same core computation defined by the kernel, but on a different piece of data (such as a different set of atoms or forces). The GPU's huge pool of CUDA cores executes the arithmetic instructions from these threads concurrently, which is what makes GPUs so efficient at parallel computation.

If a kernel has too few threads to occupy all of the GPU's CUDA cores, the leftover cores sit idle and do nothing but draw power. High-end GPUs don't execute individual threads any faster than small ones; they just run more of them concurrently. Work units that simulate systems with few atoms therefore can't spawn enough threads to fully utilize a high-end GPU, which is why projects with few atoms earn less PPD on these cards than projects with many atoms. FAH servers prioritize assigning larger projects to big GPUs, but even those might not be large enough to saturate a modern high-end card. As GPUs get bigger and bigger, projects have a harder time fully utilizing them: the parallelism available in a fixed-size problem is limited (the same effect described by Amdahl's law), and that is what this guide helps overcome.
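The tail-off described above can be sketched with a toy occupancy model. This is a deliberate simplification (real GPUs hide latency by running many more threads than cores, and the 8000-atom project below is hypothetical), but it shows why the same kernel leaves a 3090 partly idle while keeping a 970M busy:

```python
# Toy model of GPU occupancy for a single kernel launch: one thread per
# atom, executed in "waves" of up to one thread per CUDA core. A
# partially filled final wave (or a launch smaller than one wave) leaves
# cores idle. Real scheduling is far more complex; this is illustrative.
import math

def utilization(threads: int, cuda_cores: int) -> float:
    """Fraction of core-slots doing useful work across all waves."""
    waves = math.ceil(threads / cuda_cores)
    return threads / (waves * cuda_cores)

atoms = 8000  # hypothetical small project, one thread per atom
print(f"RTX 3090 (10496 cores): {utilization(atoms, 10496):.0%}")  # 76%
print(f"GTX 970M  (1280 cores): {utilization(atoms,  1280):.0%}")  # 89%
```

In this model the small GPU is naturally better utilized because the same thread count fills many full waves, while on the big GPU a single under-sized wave wastes a quarter of the cores.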

For nerds like me who want to know what a kernel's code looks like, here is one for reduceEnergy, used by FAH to sum up the energy of the system. It shows the CUDA C++ code, the PTX code (an assembly-like intermediate representation that can run on all Nvidia GPUs) that the CUDA C++ compiles to, and the SASS code (the largely-undocumented raw assembly that is unique to each GPU generation, in this case Ampere) that the PTX compiles to at launch.

What if we could run multiple projects on a single GPU simultaneously?

There is no speedup if you simply start two projects on the same GPU together. Even if they appear to be running simultaneously, they are being interleaved and the GPU is only running one kernel from one project at any given instant. But there is a way to run two (or more) projects on the same GPU truly simultaneously, using the Linux-specific CUDA MPS server.

The MPS server attaches directly to the GPU. CUDA applications then connect to MPS instead of the GPU. MPS combines the requests and enables true concurrent execution which increases thread density and GPU utilization. This requires launching the MPS server and setting the GPU into exclusive mode.

A guide to using MPS with FAH:

While MPS is well-documented and technical guides exist for using MPS with molecular simulations, so far no one has written one that integrates seamlessly with Folding@home. So, here it is.

First, you need to install the MPS program and a program for managing the GPU called nvidia-smi. On Ubuntu, install the packages as shown here, replacing "570" in the MPS package with whatever your Nvidia driver version is. On Debian, the MPS package may be named "nvidia-cuda-mps" instead.

Code: Select all

sudo apt install nvidia-smi nvidia-compute-utils-570
Now create a systemd service override by running:

Code: Select all

sudo systemctl edit fah-client.service
This will open up an editor where an override service file will be created. Put the following contents into it to launch MPS whenever FAH is launched, and to terminate MPS whenever FAH is terminated (the GPU is also brought into and taken out of exclusive compute mode as needed):

Code: Select all

[Service]
ExecStartPre=+/usr/bin/nvidia-smi -c EXCLUSIVE_PROCESS
ExecStartPre=/usr/bin/nvidia-cuda-mps-control -d
ExecStopPost=/bin/sh -c '/bin/echo quit | /usr/bin/nvidia-cuda-mps-control'
ExecStopPost=+/usr/bin/nvidia-smi -c DEFAULT
Restart FAH for the changes to take effect:

Code: Select all

sudo systemctl restart fah-client.service
Now FAH should be running with MPS. But we still need to allow multiple work units to run on one GPU. To do this, open the web client and create a new resource group as described in the v8.4 client guide. Enable only the GPU in the new resource group. This will cause a WU to be downloaded and run on that GPU in addition to the one you have set in your default resource group. You can add more than one extra resource group if your GPU is big enough (especially if it's a 5090), but usually one extra resource group is enough.

To determine if you are using MPS, use the nvidia-smi program to get a summary of the processes the GPU is servicing:

Code: Select all

$ nvidia-smi
Sat Jul 19 12:01:40 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169                Driver Version: 570.169        CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:03:00.0 Off |                  N/A |
|100%   83C    P0            401W /  420W |     794MiB /  24576MiB |     99%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           10239      C   nvidia-cuda-mps-server                   28MiB |
|    0   N/A  N/A           40499    M+C   ...4bit-release-8.1.4/FahCore_24        378MiB |
|    0   N/A  N/A           40962    M+C   ...4bit-release-8.1.4/FahCore_24        378MiB |
+-----------------------------------------------------------------------------------------+
The processes listed as "M+C" are compute apps communicating with the GPU through the MPS server.

You should be good to go now! Monitor your PPD to make sure that you're seeing an increase. Give it a few hours for PPD to stabilize. Each work unit yields lower PPD individually, but the combined output increases significantly. A single WU alone gives me 9M PPD, but MPS allows me to run two WUs for 6.5M PPD each, totaling 14M PPD! All on a single 3090!

How can I tell if my GPU is being fully-utilized?

You can get basic statistics from your GPU that will let you determine if it's big enough to benefit from MPS, and whether you could benefit from adding another resource group. This is what it looks like when your GPU is being well-utilized. Each line represents one second. Notice how power usage doesn't fluctuate too rapidly, and SM usage (the percentage of time that one or more kernels was executing on the GPU in the last sample period) is consistently high:

Code: Select all

$ nvidia-smi dmon -s pu
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa 
# Idx      W      C      C      %      %      %      %      %      % 
    0    399     83      -    100     41      0      0      0      0 
    0    396     82      -    100     45      0      0      0      0 
    0    400     84      -    100     43      0      0      0      0 
    0    414     82      -     99     44      0      0      0      0 
    0    386     82      -     97     45      0      0      0      0 
    0    397     83      -    100     43      0      0      0      0 
    0    401     83      -    100     46      0      0      0      0 
    0    390     81      -    100     43      0      0      0      0 
    0    391     82      -     92     40      0      0      0      0 
    0    402     82      -    100     45      0      0      0      0 
Contrast this with stats that you get when only one WU is running. See how both power usage and SM usage are lower and fluctuate rapidly:

Code: Select all

$ nvidia-smi dmon -s pu
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa 
# Idx      W      C      C      %      %      %      %      %      % 
    0    376     82      -     82     28      0      0      0      0 
    0    394     82      -     90     32      0      0      0      0 
    0    373     82      -     86     27      0      0      0      0 
    0    393     82      -     75     24      0      0      0      0 
    0    375     81      -     96     34      0      0      0      0 
    0    391     82      -     95     34      0      0      0      0 
    0    396     83      -     79     25      0      0      0      0 
    0    389     82      -     87     29      0      0      0      0 
    0    391     82      -     83     26      0      0      0      0 
    0    397     83      -     77     25      0      0      0      0 
These are stats from an RTX 3090. Notice how a single WU does not make full use of the GPU's resources. Almost a fifth of the GPU's time is spent idling, doing nothing at all. When running two WUs at once with MPS, this idle time is reduced to nearly 1%. Average power usage only increases by about 10W, but total PPD increases by 5M!

Not all GPUs will benefit from MPS, however. The same project that only partially-utilizes an RTX 3090's 10496 CUDA cores can fully utilize a GTX 970M's mere 1280 CUDA cores without any tricks:

Code: Select all

$ nvidia-smi dmon -s pu
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa 
# Idx      W      C      C      %      %      %      %      %      % 
    0     85     58      -     99     56      0      0      -      - 
    0     84     58      -    100     57      0      0      -      - 
    0     82     58      -    100     59      0      0      -      - 
    0     75     58      -     99     55      0      0      -      - 
    0     84     58      -     94     50      0      0      -      - 
    0     85     58      -    100     58      0      0      -      - 
    0     87     58      -     95     52      0      0      -      - 
    0     80     58      -     98     55      0      0      -      - 
    0     76     58      -     99     55      0      0      -      - 
    0     85     58      -     99     58      0      0      -      -
This makes sense. The 970M is more than a decade old and has just a bit more than a tenth as many cores as the 3090. This little GPU gets me about 0.5M PPD, but as even medium-sized projects can fully saturate it, there's not much room for improvement.

Adding too many WUs to one GPU can decrease PPD, however. GPUs work best when modestly oversubscribed (running more threads than they can simultaneously execute) because it ensures they always have work to do, but each additional WU yields diminishing returns, and eventually the per-WU slowdown outweighs the extra utilization.

Is this allowed? Could this taint the science?

MPS is completely safe and does not change FAH's behavior or tamper with its code (which would violate the EULA). It is not cheating either: the increase in PPD comes from more efficient GPU usage, not from gaming the points algorithm. MPS is officially supported by Nvidia on Linux and is designed to let multiple compute apps run at once without interfering with each other. MPS and OpenMM (the simulation software that FAH uses for GPU folding) are compatible and are often used together.

What if I have multiple GPUs?

I don't have a multi-GPU system to test it, but launching MPS with the CUDA_VISIBLE_DEVICES environment variable set to a comma-delimited list of the GPU indexes or UUIDs on your system should enable MPS for multiple GPUs. You can get a comma-delimited list this way:

Code: Select all

nvidia-smi --query-gpu=uuid --format=csv,noheader | paste -sd ','
In the fah-client.service override file, add CUDA_VISIBLE_DEVICES as an environment variable in the [Service] section. It should look like this:

Code: Select all

Environment=CUDA_VISIBLE_DEVICES=GPU-c7bf74dd-ec57-41ae-a1dc-c8f8ee96053e,GPU-b4c9e155-45c9-4906-9090-6bd6fa9e0b37
What if I'm using AMD and not Nvidia?

As far as I know, AMD does not require any workarounds at all and can natively run multiple simultaneous simulations without needing something like MPS. You can create a new resource group to add a second WU to the same GPU right away. There aren't many AMD GPUs that are big enough to benefit from this, but perhaps someone with a 7900XTX could test this out.

What if I'm using Windows and not Linux?

Unfortunately, Nvidia has not released a version of MPS for Windows. Maybe it could work by running FAH and MPS under WSL2, but I have no idea if that's feasible due to WSL2's use of paravirtualization. The GPU will have to be set to EXCLUSIVE_PROCESS mode from Windows and not from WSL2.

PPD on Windows is typically lower than that of Linux anyway even for the same GPU. If you're chasing PPD, you really should be using Linux.
Nicolas_orleans
Posts: 138
Joined: Wed Aug 08, 2012 3:08 am

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by Nicolas_orleans »

Brilliant work, arisu!
MSI Z77A-GD55 - Core i5-3550 - Inno3D RTX 5080 - Ubuntu 25.04 - 6.14 kernel - 570.133
MSI MPG B550 - Ryzen 5 5600X - PNY RTX 4080 Super - Ubuntu 24.04 - 6.11 kernel - 550.144
Nicolas_orleans
Posts: 138
Joined: Wed Aug 08, 2012 3:08 am

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by Nicolas_orleans »

I gave it a try today on a 4080 Super that had by default a 91-96% usage, for around 20.6M PPD 5-day average. Long story short: it works and it's very easy to set up thanks to arisu's explanations, but PPD-wise it's not worth it (cumulative PPD for two WUs in parallel with MPS: 18-20M PPD).

I suspect we are gaining computing time in a linear fashion but losing PPD in an exponential fashion, so there should be a kind of crossover point in terms of average utilization without MPS (somewhere between 80 and 90%?) to inform the with/without MPS decision. It could even be calculated if all projects were perfectly calibrated, with no inter- and intra-project variability... which is obviously not the case.

Would be interesting to test on a 5080, but if the 4080S is a no-go because it's already a sufficiently optimized folder, it could be only an x090 thing.

Without MPS

Code: Select all

nicolas@nicolas-MS-7C56:~$ nvidia-smi dmon -s pu
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa 
# Idx      W      C      C      %      %      %      %      %      % 
    0    250     74      -     91     15      0      0      0      0 
    0    255     74      -     92     16      0      0      0      0 
    0    258     74      -     96     17      0      0      0      0 
    0    255     74      -     91     15      0      0      0      0 
    0    253     75      -     91     15      0      0      0      0 
    0    256     72      -     96     17      0      0      0      0 
    0    251     74      -     93     16      0      0      0      0 
    0    254     75      -     92     15      0      0      0      0 
    0    255     74      -     96     16      0      0      0      0 
    0    255     74      -     92     16      0      0      0      0 
    0    257     74      -     93     16      0      0      0      0 
    0    251     74      -     94     16      0      0      0      0 
    0    253     75      -     92     16      0      0      0      0 
    0    253     74      -     91     15      0      0      0      0 
    0    255     73      -     93     16      0      0      0      0 
    0    254     75      -     92     15      0      0      0      0 
    0    253     74      -     92     15      0      0      0      0 
    0    251     75      -     93     16      0      0      0      0 
    0    248     73      -     93     18      0      0      0      0 
    0    247     71      -     95     19      0      0      0      0 
With MPS

Code: Select all

nicolas@nicolas-MS-7C56:~$ nvidia-smi dmon -s pu
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa 
# Idx      W      C      C      %      %      %      %      %      % 
    0    301     77      -    100     33      0      0      0      0 
    0    302     78      -    100     34      0      0      0      0 
    0    302     77      -    100     35      0      0      0      0 
    0    302     77      -    100     32      0      0      0      0 
    0    302     77      -    100     34      0      0      0      0 
    0    302     77      -     99     33      0      0      0      0 
    0    303     77      -    100     33      0      0      0      0 
    0    302     78      -    100     34      0      0      0      0 
    0    303     78      -    100     33      0      0      0      0 
    0    302     77      -    100     35      0      0      0      0 
    0    301     77      -    100     34      0      0      0      0 
    0    302     77      -    100     33      0      0      0      0 
    0    298     77      -    100     33      0      0      0      0 
    0    302     78      -    100     34      0      0      0      0 
    0    298     76      -     98     28      0      0      0      0 
    0    285     78      -     99     34      0      0      0      0 
    0    301     78      -    100     32      0      0      0      0 
    0    302     78      -    100     32      0      0      0      0 
    0    304     78      -    100     34      0      0      0      0 
MSI Z77A-GD55 - Core i5-3550 - Inno3D RTX 5080 - Ubuntu 25.04 - 6.14 kernel - 570.133
MSI MPG B550 - Ryzen 5 5600X - PNY RTX 4080 Super - Ubuntu 24.04 - 6.11 kernel - 550.144
arisu
Posts: 541
Joined: Mon Feb 24, 2025 11:11 pm

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by arisu »

MPS improves performance, but perhaps a 4080S with the project you were folding didn't improve it enough to overcome the QRB.

It looks like that is an instance of the QRB being a perverse incentive. MPS is speeding up the system and the sum of ns/day is higher, but not quite high enough to overcome the penalty from a lower QRB. Two projects at 30 ns/day is slightly more valuable (meaning the project will complete faster) than one project at 59 ns/day, even though the latter will earn significantly more points.
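The 30-vs-59 trade-off can be put into rough numbers. A back-of-the-envelope sketch, assuming the published bonus formula (points ~ base * sqrt(k * deadline / elapsed)), that the bonus branch applies, and that k, deadline, and base points are identical so they cancel out in the ratios:

```python
# Under the QRB formula, points per WU scale as sqrt(1/elapsed), so PPD
# per slot scales as 1/elapsed**1.5. If each of n concurrent WUs runs at
# fraction f of its solo speed, elapsed time per WU grows by 1/f.

def mps_ratios(n_wus: int, f: float) -> tuple[float, float]:
    """f = per-WU speed as a fraction of solo speed.
    Returns (science ratio, points ratio) relative to one solo WU."""
    science = n_wus * f          # total ns/day scales linearly with speed
    points = n_wus * f ** 1.5    # QRB penalizes the longer per-WU time
    return science, points

# The 30-vs-59 ns/day example: each of two WUs runs at 30/59 of solo speed
science, points = mps_ratios(2, 30 / 59)
print(f"science: {science:.2f}x, points: {points:.2f}x")
# -> science: 1.02x, points: 0.73x
```

With these assumptions, two concurrent WUs only pay off in points when each keeps at least 2^(-2/3) ≈ 63% of its solo speed; 30/59 ≈ 51% falls below that, hence slightly more science but noticeably fewer points.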

I suspect that the QRB was implemented for psychological reasons. My RTX 3090 is only about 20x faster than my GTX 780M in ns/day for one particular project, but the RTX gets 9M PPD on that project and the GTX gets less than 100k. Without the QRB, if the RTX got 9M PPD then the crappy decade old mobile GPU would be getting almost 0.5M PPD (which would better reflect its contribution, but would also provide less incentive to upgrade).

Which project were you folding on btw? The one I tested on was the 154xx series (128k atoms).
Nicolas_orleans
Posts: 138
Joined: Wed Aug 08, 2012 3:08 am

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by Nicolas_orleans »

I was folding a mix of 18230, 12705, 16581.

On this rig, usually (I browsed the available logs for min/max TPF )
- 18230: between 86 and 94 ns/d
- 12705: between 330 and 336 ns/d
- 16581: between 101 and 104 ns/d

With MPS
- 18230: 58 ns/d
- 12705: 155 ns/d
- 16581: 55 ns/d

I think it did something like:
slot 1 18230 58 ns/d + slot 2 12705 155 ns/d
slot 1 18230 58 ns/d + slot 2 16581 55 ns/d
then another 18230 and slot 1 that folded with MPS until slot 2 finished and was paused

We cannot add ns/d across different projects, so I'm not sure how to know the output is actually maximized vs non-MPS. By that I mean how to be sure in the field, not from a theoretical perspective. If higher utilization, temperature and power draw decrease the frequency of the GPU, for example, couldn't a 100% utilized GPU have a lower output than a 94% utilized GPU?
MSI Z77A-GD55 - Core i5-3550 - Inno3D RTX 5080 - Ubuntu 25.04 - 6.14 kernel - 570.133
MSI MPG B550 - Ryzen 5 5600X - PNY RTX 4080 Super - Ubuntu 24.04 - 6.11 kernel - 550.144
arisu
Posts: 541
Joined: Mon Feb 24, 2025 11:11 pm

Re: Using MPS to dramatically increase PPD on big GPUs (Linux guide)

Post by arisu »

Nicolas_orleans wrote: Mon Jul 21, 2025 12:21 pm If higher utilization, temperature and power draw decrease the frequency of the GPU for example, a 100% utilized GPU could have a lower output compared to a 94% utilized GPU ?
A 94% utilized GPU means that, 6% of the time, not a single kernel is running and the GPU might as well be off. But in those brief, microsecond-long idle periods, the clock rate can spike even though the GPU is spinning its wheels. The idle periods are brief enough that the GPU doesn't leave the P0 state and enter a power-saving state.

SM% isn't a perfect measure of utilization though. A kernel that spawns only 32 threads, the smallest possible number, still counts as a utilization event, so a "mere" 6% idle time can mask a much larger fraction of idle hardware. You'd need more advanced profiling tools like Nsight Compute to determine the actual usage, but usually a lower clock rate combined with higher power draw indicates that real work is happening on those cycles and that the GPU isn't just spinning its wheels.
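To illustrate why SM% overstates utilization, here is a contrived sketch (the trace numbers are made up, and real SM% sampling is more involved than this): the counter only asks "was any kernel resident during this sample period?", not how many cores it kept busy.

```python
# Contrived model: sm_percent mimics what nvidia-smi reports (any kernel
# resident counts the whole period as busy); core_utilization is the
# fraction of core-slots actually doing work in the same trace.

def sm_percent(samples):
    """samples: threads resident during each sample period."""
    return 100 * sum(1 for t in samples if t > 0) / len(samples)

def core_utilization(samples, cores=10496):
    return 100 * sum(min(t, cores) for t in samples) / (len(samples) * cores)

# Hypothetical trace: a tiny 32-thread kernel is resident in every period
trace = [32] * 100
print(sm_percent(trace))        # 100.0 -> reported as "fully busy"
print(core_utilization(trace))  # ~0.3  -> almost all cores actually idle
```

The same 100% SM reading can thus correspond to anything from full occupancy to a nearly idle chip, which is why the power draw and clock behavior described above are useful secondary signals.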

A higher clock rate on an under-utilized GPU just means that the GPU is doing nothing, but faster. You can see the same effect the instant a WU completes. You'll see the GPU clock max out for a few seconds before the GPU realizes that no new kernels are coming and reduces its performance level to save power. When that clock is maxed out, it's running the instruction dispatch units at their fastest speed, but all those units have no instructions to issue so the GPU does nothing.

With your 4080S and the current batch of projects, it looks like this is a situation where you'd be forced to choose between doing more science and getting more points. A 4090 or 5090 would see much more consistent benefit (doing more science and getting more points) because those are very hard to fully utilize without those rare million atom projects, so they are perfect for MPS. I guess the 4080S is right on the edge, at least for the current projects.