Multiple projects on one GPU with MPS
Moderators: Site Moderators, FAHC Science Team
Multiple projects on one GPU with MPS
Has anyone tried to use Nvidia MPS (https://docs.nvidia.com/deploy/mps/index.html) to make better use of high-capacity GPUs?
If two WUs are run on the same GPU without MPS, there will be no net speed-up even if they are small. Imagine a WU that is too small to fully utilize the GPU and only uses 40% of the CUDA cores. Running two such WUs at once means the driver time-slices them: it schedules WU A, which uses 40%, then switches to WU B, which uses 40%, then back to A, and so on, and at no time do both actually run at once (to the user they appear to run concurrently, but each at roughly half speed). When MPS is turned on, however, they are truly scheduled in parallel, and the GPU now uses 80% of its CUDA cores.
OpenMM, the software behind the GPU cores, has an open feature request for simultaneous simulations, but using MPS for that was rejected for FAH because MPS only supports Linux (https://github.com/openmm/openmm/issues ... -713723978). Linux users, however, should be able to use this easily, without modifying the core or making any changes to the client. And unlike vGPU, which splits one GPU into several virtual ones, MPS is available on consumer cards.
If no one has tried this with FAH before, I will try it out and write a guide for it.
Re: Multiple projects on one GPU with MPS
Nvidia does not support 2 workloads on a single consumer GPU.
Developer resources are scarce as is, especially for corner cases like this.
Re: Multiple projects on one GPU with MPS
Anything with a CUDA Compute Capability above 3.5 supports it. Please see here: https://developer.nvidia.com/cuda-gpus
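If you want to check what a given card reports, newer drivers can print it directly; this is just a convenience sketch, and the compute_cap query field may not exist on older drivers, in which case use the list linked above.
Code: Select all
$ nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# e.g. "NVIDIA GeForce RTX 4090 Laptop GPU, 8.9" -- compare against the capability requirement above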
Re: Multiple projects on one GPU with MPS
I've tested it with non-FAH CUDA utilities but I'd like to test with FAH now. Could someone give me two very small GPU WUs' wudata_01.dat files, ones small enough that they won't even utilize half of a 3090 or 4090 Mobile's CUDA cores?
Re: Multiple projects on one GPU with MPS
Try to choose the correct subforum when posting new threads.
This thread would have been perfect for "Problems with nVidia drivers", since technically this is an nVidia driver issue.
Anyways: https://send.vis.ee/download/6cbd410b62 ... RPckoyuMJA
This contains a WU, as well as the Windows fahcore_23. But you might be on Linux, in which case ignore that. If you need more guidance on how to run this, let me know.
The download link expires in 3 days or after 10 downloads.

Re: Multiple projects on one GPU with MPS
Thank you!
I thought this would be a better subforum because I'm testing out MPS but not having any problems (yet). It's going smoothly so far.
Re: Multiple projects on one GPU with MPS
Running multiple WUs at the same time on a single wide GPU works with MPS, with no changes needed to the FAH client, the cores, or the Nvidia drivers. I tested several different projects of several different sizes.
Here are the results for 12600, the smallest project I could find, running on my RTX 4090 Mobile.
Total ns/day without MPS:
1 WU: 1080
2 WUs: 813 (~407 ea)
3 WUs: 814 (~407 ea)
4 WUs: 810 (~405 ea)
5 WUs: 807 (~403 ea)
Total ns/day with MPS:
1 WU: 1016
2 WUs: 1687 (~844 ea)
3 WUs: 2191 (~730 ea)
4 WUs: 2514 (~629 ea)
5 WUs: 2823 (~565 ea)
The ns/day is inversely proportional to TPF (each frame simulates a fixed amount of time, so ns/day = ns per frame × 86400 / TPF in seconds). With this, *every* project can fully utilize even a 5090, no matter how small the project is.
For a single GPU with an index of 0, it is as simple as:
Code: Select all
# nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
# CUDA_VISIBLE_DEVICES=0 nvidia-cuda-mps-control -d
And then start up multiple cores and they will run simultaneously on the same GPU with minimal overhead. The wider the GPU, the better this scales.
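For completeness, here is a rough sketch of that launch step plus the standard teardown, not a definitive recipe: wu1/, wu2/ and run_core.sh below are hypothetical placeholders for your own per-WU directories and whatever command you already use to start a single core by hand, while get_server_list and quit are the documented nvidia-cuda-mps-control commands.
Code: Select all
# Launch one core per prepared working directory; with MPS running they all
# land on GPU 0 and execute concurrently. wu1/, wu2/ and run_core.sh are
# placeholders for your own setup.
for d in wu1 wu2; do
  (cd "$d" && CUDA_VISIBLE_DEVICES=0 ./run_core.sh) &
done
wait

# When finished: confirm the MPS server is there, shut the daemon down,
# and restore the normal compute mode.
echo get_server_list | nvidia-cuda-mps-control
echo quit | nvidia-cuda-mps-control
sudo nvidia-smi -i 0 -c DEFAULT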
Last edited by arisu on Thu Apr 24, 2025 7:34 am, edited 1 time in total.
Re: Multiple projects on one GPU with MPS
I'm slow this morning. So are you actually gaining anything overall, science-wise?
I don't care about utilisation; I care whether you can do 2 WUs on a single GPU at the same or a faster rate than if you were doing them on separate GPUs.
Re: Multiple projects on one GPU with MPS
Yes. You can do 2 WUs on a single GPU at approximately the same rate as if you were doing them on two separate GPUs, provided the GPU is wide enough (there is no benefit for lower-end GPUs).
In other words, if we were to race to complete 12600 projects (or projects of a similar size) and you had three RTX 4090 Mobile GPUs, my one 4090 Mobile would complete more WUs per day than your three, if I'm using MPS.
Re: Multiple projects on one GPU with MPS
What about the large WUs?
I take it it is unrealistic to hope for users to sit there by their PCs applying the fix for small WUs and then disabling it for large WUs?
The fix should be applied once and left in place regardless of which WUs are being downloaded.
Re: Multiple projects on one GPU with MPS
On a very wide GPU, almost all WUs will see at least some benefit. For a GPU much less powerful than a 3090, I think that there would not be a major benefit. For a very powerful one like a 5090, there should be a considerable benefit at almost all times.
To get a bigger benefit for more GPUs, the client and server would have to coordinate: if a species 10 Nvidia GPU requests an assignment and there are no big assignments available, it could instead be given two simultaneous assignments (or three, or four) intended for lesser GPUs.
But without the client and server adapting, this is only of interest to people who have very wide GPUs.
Re: Multiple projects on one GPU with MPS
I don't think you're gonna get this implemented client-side.
It depends on the constraints that we set, and we never split those per OS, as that is a pure hassle to do for every project.
Probably the best way to do it is just for the user to create a couple of resource groups with the same GPU ticked, and download different projects that way.
Re: Multiple projects on one GPU with MPS
I think John Chodera briefly looked into implementing this client- and server-side but decided against it because it is Linux-only.
There are some other techniques that could theoretically work on Windows, but they all have serious downsides that limit reproducibility or cause simulation instability, such as stacking a bunch of periodic boxes next to each other in one OpenMM context with constraints to limit the range of forces. That would make it possible to fully utilize a wide GPU, but FP32 accuracy goes down as the position diverges from 0 (there are fewer representable points between 10.0 and 11.0 than between 0.0 and 1.0).
I actually started working on implementing just that before I found out that it had already been discussed on the OpenMM GitHub and that Peter Eastman had shot it down quickly because of the aforementioned issues.
Re: Multiple projects on one GPU with MPS
I have suggested doing a 4-box environment that holds 4 WUs of the same project, with each box separated from the others by a membrane that does not interfere with the folding process.
Currently each work unit is inside a single cube (box) in which it folds. If we take 4 of those cubes, add a membrane between them, and package it all as a single WU, the GPU would treat all those boxes as one WU and would simulate each of them on a separate shader group or something.
This is a very simplified suggestion from a person with no biochemistry background, though I did not receive any major pushback on it.
However, implementing a system like that is not easy; it would need its own grant or dedicated time, and I doubt anyone would undertake it as a hobby or free-time activity.
That said, seeing how current and future GPUs are getting wider and wider, someone will have to come up with a solution to utilise all this power within GPUs.
My suggestion, by the way, in my universe requires minimal (if any) input from the OpenMM crowd, so we would technically be free of Peter's wrath.
On the other hand, the US federal government shutting down antiviral research funding does not give anyone confidence to start these out-of-the-box projects.
Re: Multiple projects on one GPU with MPS
Implementing that isn't actually very hard, and it's what I was in the middle of testing (although with GROMACS, only because I am more familiar with its tools). But there are two big problems with the idea that make it a non-starter:
1. A membrane (or constraints) is not enough. Most projects use periodic boxes, which means that a particle "wraps around" from one end to the other instead of bouncing off it. From the perspective of the protein, it exists in an endless ocean of solvent; in fact, it "exists" in an infinite space tiled with infinitely repeating copies of the same system. But if you add a barrier, the protein can bump into it, solvent can get trapped between the protein and the barrier, which changes its behavior, and so on. So the solution would be to just use really large boxes: basically one system with the different proteins all put inside the same box, but placed far enough from each other that they never interact. But that leads to the next problem.
2. FP precision limitations cause issues, especially with FP32. The center of the box is (0,0,0) in XYZ coordinates, so a particle exactly three units from the center in the X direction is at position (3,0,0). But IEEE 754 floating point numbers lose precision the farther they get from zero: the spacing between adjacent representable values (one increment of the mantissa) is proportional to the magnitude of the value (the exponent). This causes two sub-problems:
2a. The more distant particles have less precision because there are fewer discrete points between the same two real numbers (see the quick check after this list). Calculations closer to the center of the box always have the best accuracy because the mantissa is representing smaller values. In an extreme case, this can cause the more distant particles to clip into each other. This wouldn't be a problem with FP64, whose mantissa is large enough that the spacing stays negligible at these distances, but FP64 is too slow for FAH to use for force calculations.
2b. The "parallel" simulations also lose reproducibility, because each system being simulated is effectively operating under slightly different numerics depending on how far it sits from the center, even if you can perfectly isolate the systems from each other.
If OpenMM supported multiple systems per context, each with its own periodic box centered at (0,0,0), then this would be doable. But it does not support that, so you'd have to deal with Peter.
On the plus side, he does not have any problem with multiple systems per context iirc, but it would require some substantial changes to the code, so it's a case of "patches accepted".
I'm not a Windows person, but if the MPS daemon were running in WSL2, could CUDA contexts be managed by it? People with very wide GPUs are probably more willing to go through a few extra steps for a substantial PPD increase.