If you already know a bit about GPU architecture, feel free to skip to the section "A guide to using MPS with FAH".
Why are small projects not suitable for high-end GPUs?
Nvidia GPUs contain hundreds to tens of thousands of units called CUDA cores that perform 32-bit floating point calculations (the type that dominates the work done in FAH simulations). When running a compute application with CUDA on a GPU, small programs called kernels are sent to the GPU sequentially. Once a kernel has finished its task, which usually takes mere microseconds, the CPU sends the next kernel to the GPU.
Each kernel performs some computational work like determining which atoms are interacting, calculating the force of chemical bonds, or even something as mundane as erasing a buffer in memory. Dozens of kernel launches in a row, each one doing a different job, make up a single simulation step and advance the simulation by 2-4 femtoseconds. Tens of thousands of simulation steps make up 1% of a work unit.
When a kernel executes, it launches hundreds to thousands of threads. Each thread performs the same core computation defined by the kernel, but on a different piece of data (such as a different set of atoms or forces). The GPU's many CUDA cores execute the arithmetic instructions of these threads concurrently, and that is what makes GPUs so efficient at parallel computation.
If a kernel has too few threads to occupy all of the GPU's CUDA cores, some cores sit idle and do nothing but draw power. High-end GPUs don't execute individual threads any faster than smaller GPUs; they just run more of them concurrently. Work units that simulate systems with few atoms may not generate enough threads to fully utilize a high-end GPU, which is why projects with few atoms earn less PPD on those cards than projects with many atoms. FAH servers prioritize assigning larger projects to these GPUs, but even those may not be big enough to saturate a modern high-end card. As GPUs keep getting bigger, projects have a harder time fully utilizing them. This is Amdahl's law in action, and it is what this guide helps overcome.
For nerds like me who want to know what a kernel's code looks like, here is one for reduceEnergy, used by FAH to sum up the energy of the system. It shows the CUDA C++ code, the PTX code (an assembly-like intermediate representation that can run on all Nvidia GPUs) that the CUDA C++ compiles to, and the SASS code (the largely-undocumented raw assembly that is unique to each GPU generation, in this case Ampere) that the PTX compiles to at launch.
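If you want to see that compilation pipeline for yourself, the CUDA toolkit (the nvidia-cuda-toolkit package on Ubuntu/Debian) can produce both intermediate forms from any kernel source file. This is just a generic sketch: mykernel.cu is a placeholder name, and sm_86 targets Ampere-generation cards like the RTX 3090.
Code: Select all
# CUDA C++ -> PTX (the portable intermediate representation)
nvcc -arch=sm_86 -ptx mykernel.cu -o mykernel.ptx
# CUDA C++ -> cubin, then disassemble the Ampere SASS inside it
nvcc -arch=sm_86 -cubin mykernel.cu -o mykernel.cubin
cuobjdump -sass mykernel.cubin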
What if we could run multiple projects on a single GPU simultaneously?
There is no speedup if you simply start two projects on the same GPU together. Even if they appear to be running simultaneously, they are being interleaved, and the GPU is only running one kernel from one project at any given instant. But there is a way to run two (or more) projects on the same GPU truly simultaneously: the Linux-only CUDA MPS (Multi-Process Service) server.
The MPS server attaches directly to the GPU, and CUDA applications then connect to MPS instead of the GPU. MPS combines their requests and enables true concurrent execution, which increases thread density and GPU utilization. All this requires is launching the MPS server and setting the GPU into the EXCLUSIVE_PROCESS compute mode.
A guide to using MPS with FAH:
While MPS is well-documented and technical guides exist for using MPS with molecular simulations, so far no one has written one that integrates seamlessly with Folding@home. So, here it is.
First, you need to install the MPS program and a program for managing the GPU called nvidia-smi. On Ubuntu, install the packages as shown here, replacing "570" in the MPS package with whatever your Nvidia driver version is. On Debian, the MPS package may be named "nvidia-cuda-mps" instead.
Code: Select all
sudo apt install nvidia-smi nvidia-compute-utils-570
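Not sure which driver version you're running? You can read it from the loaded kernel module before installing anything:
Code: Select all
cat /proc/driver/nvidia/version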
Next, create a systemd override for the FAH client service so that MPS is started before the client and shut down after it stops:
Code: Select all
sudo systemctl edit fah-client.service
In the editor that opens, paste the following, then save and exit. The "+" prefix on the two nvidia-smi lines tells systemd to run them with full root privileges, which changing the compute mode requires:
Code: Select all
[Service]
ExecStartPre=+/usr/bin/nvidia-smi -c EXCLUSIVE_PROCESS
ExecStartPre=/usr/bin/nvidia-cuda-mps-control -d
ExecStopPost=/bin/sh -c '/bin/echo quit | /usr/bin/nvidia-cuda-mps-control'
ExecStopPost=+/usr/bin/nvidia-smi -c DEFAULT
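You can verify that the override was saved correctly with systemctl cat, which prints the unit file together with any drop-in overrides:
Code: Select all
systemctl cat fah-client.service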
Finally, restart the client so the changes take effect:
Code: Select all
sudo systemctl restart fah-client.service
To determine if you are using MPS, use the nvidia-smi program to get a summary of the processes the GPU is servicing:
Code: Select all
$ nvidia-smi
Sat Jul 19 12:01:40 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169 Driver Version: 570.169 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:03:00.0 Off | N/A |
|100% 83C P0 401W / 420W | 794MiB / 24576MiB | 99% E. Process |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 10239 C nvidia-cuda-mps-server 28MiB |
| 0 N/A N/A 40499 M+C ...4bit-release-8.1.4/FahCore_24 378MiB |
| 0 N/A N/A 40962 M+C ...4bit-release-8.1.4/FahCore_24 378MiB |
+-----------------------------------------------------------------------------------------+
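You can also confirm that the GPU really is in exclusive mode (the "E. Process" shown under "Compute M." in the table above) with a direct query:
Code: Select all
nvidia-smi --query-gpu=compute_mode --format=csv,noheader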
You should be good to go now! Monitor your PPD to make sure you're seeing an increase, and give it a few hours to stabilize. Each work unit yields lower PPD individually, but the combined output is significantly higher. A single WU alone gives me 9M PPD, but with MPS and a second resource group I can run two WUs at roughly 6.5M PPD each, for a combined total of over 13M PPD. All on a single 3090!
How can I tell if my GPU is being fully utilized?
You can get basic statistics from your GPU that will let you determine whether it's big enough to benefit from MPS, and whether you could benefit from adding another resource group. This is what it looks like when your GPU is well utilized. Each line represents one second. Notice how power usage doesn't fluctuate much, and SM usage (the percentage of time that one or more kernels was executing on the GPU during the last sample period) is consistently high:
Code: Select all
$ nvidia-smi dmon -s pu
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa
# Idx W C C % % % % % %
0 399 83 - 100 41 0 0 0 0
0 396 82 - 100 45 0 0 0 0
0 400 84 - 100 43 0 0 0 0
0 414 82 - 99 44 0 0 0 0
0 386 82 - 97 45 0 0 0 0
0 397 83 - 100 43 0 0 0 0
0 401 83 - 100 46 0 0 0 0
0 390 81 - 100 43 0 0 0 0
0 391 82 - 92 40 0 0 0 0
0 402 82 - 100 45 0 0 0 0
And this is what it looks like when the GPU is under-utilized: power draw jumps around and SM usage regularly dips well below 100% because the GPU keeps running out of work to do:
Code: Select all
$ nvidia-smi dmon -s pu
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa
# Idx W C C % % % % % %
0 376 82 - 82 28 0 0 0 0
0 394 82 - 90 32 0 0 0 0
0 373 82 - 86 27 0 0 0 0
0 393 82 - 75 24 0 0 0 0
0 375 81 - 96 34 0 0 0 0
0 391 82 - 95 34 0 0 0 0
0 396 83 - 79 25 0 0 0 0
0 389 82 - 87 29 0 0 0 0
0 391 82 - 83 26 0 0 0 0
0 397 83 - 77 25 0 0 0 0
Not all GPUs will benefit from MPS, however. The same project that only partially utilizes an RTX 3090's 10496 CUDA cores can fully utilize a GTX 970M's mere 1280 CUDA cores without any tricks:
Code: Select all
$ nvidia-smi dmon -s pu
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa
# Idx W C C % % % % % %
0 85 58 - 99 56 0 0 - -
0 84 58 - 100 57 0 0 - -
0 82 58 - 100 59 0 0 - -
0 75 58 - 99 55 0 0 - -
0 84 58 - 94 50 0 0 - -
0 85 58 - 100 58 0 0 - -
0 87 58 - 95 52 0 0 - -
0 80 58 - 98 55 0 0 - -
0 76 58 - 99 55 0 0 - -
0 85 58 - 99 58 0 0 - -
GPUs work best when they are oversubscribed (running more threads than they can execute at once), because that ensures they always have work queued up. But the returns diminish quickly, and piling too many WUs onto one GPU can end up lowering your total PPD.
Is this allowed? Could this taint the science?
MPS is completely safe: it does not change FAH's behavior or tamper with its code (which would violate the EULA). It is not cheating either; the increase in PPD comes from more efficient GPU usage, not from gaming the points algorithm. MPS is officially supported by Nvidia on Linux and is designed to let multiple compute applications run at once without interfering with each other. MPS and OpenMM (the simulation software FAH uses for GPU folding) are compatible and are often used together.
What if I have multiple GPUs?
I don't have a multi-GPU system to test this on, but launching MPS with the CUDA_VISIBLE_DEVICES environment variable set to a comma-delimited list of the GPU indexes or UUIDs on your system should enable MPS for all of them. You can get a comma-delimited list of UUIDs this way:
Code: Select all
nvidia-smi --query-gpu=uuid --format=csv,noheader | paste -sd ','
Then add the result as an Environment= line in the [Service] section of the same systemd override you created earlier (systemd applies Environment= to the ExecStartPre commands too, so the MPS daemon will pick it up):
Code: Select all
Environment=CUDA_VISIBLE_DEVICES=GPU-c7bf74dd-ec57-41ae-a1dc-c8f8ee96053e,GPU-b4c9e155-45c9-4906-9090-6bd6fa9e0b37
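Since nvidia-smi -c without -i applies to every GPU in the system, you can check after restarting the service that all of them switched over:
Code: Select all
nvidia-smi --query-gpu=index,name,compute_mode --format=csv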
As far as I know, AMD does not require any workarounds at all and can natively run multiple simultaneous simulations without needing something like MPS. You can create a new resource group to add a second WU to the same GPU right away. There aren't many AMD GPUs that are big enough to benefit from this, but perhaps someone with a 7900XTX could test this out.
What if I'm using Windows and not Linux?
Unfortunately, Nvidia has not released a version of MPS for Windows. Maybe it could work by running FAH and MPS under WSL2, but I have no idea if that's feasible due to WSL2's use of paravirtualization. The GPU will have to be set to EXCLUSIVE_PROCESS mode from Windows and not from WSL2.
PPD on Windows is typically lower than on Linux anyway, even on the same GPU. If you're chasing PPD, you really should be using Linux.