I created an account here on the forums specifically to say "Thank you!" to @arisu. If I may, here are some additional findings...
First, the box I'm running:
- Intel DX79SI "Siler" motherboard. PCIe 3.0 was in "beta" on this board at the time and was only supported with very specific steppings of compatible processors.
- Intel i7-3930K C2-stepping CPU, which supports PCIe 3.0 and also the VT-* instructions.
- Eight 4GB DDR3-1600 sticks of RAM in a quad-channel config at 1600MT/s, CL8, 1T.
- Asus TUF Gaming 4090 at 350W power limit, and +180 GPU / -1000 memory offsets, plugged into a PCIe 3.0 x16 slot.
- MSI Ventus 4070 Super at 200W power limit, and +90 GPU / -1000 memory offsets, plugged into a PCIe 3.0 x8 slot.
- Fedora 42 running the NVIDIA 580.82 drivers, with Coolbits enabled, on an Xorg desktop.
Without MPS, the 4090 will reliably generate 20-22MPPD and the 4070 Super about 9-11MPPD, obviously depending on WU distribution. With MPS enabled for two sessions each, the 4090 delivers 22-30MPPD in the aggregate, whereas the 4070 Super seems to choke down about 8-12MPPD in the aggregate. With no context other than these scores, you would imagine the 4070 Super is simply ill-equipped for use with MPS.
However, digging further shows that @arisu's concern about the disconnect between artificial scoring (PPD!) and the actual science output is well and truly the problem. Looking at the actual WU output, both cards increased their throughput by double-digit percentages: the 4090 completed almost 55% more WUs, and the 4070S about 35% more. Exactly as @arisu expected, it seems the longer wall-clock time needed to finish each individual WU drags the score down, even though the card as a whole gets far more work done by cranking out more than one WU at a time.
I did test three MPS instances on the 4090 (with 16,384 CUDA cores, it seemed like ~5,460 CUDA cores per instance would be more than enough to chew through a lot of work). The gain in total WU output over the non-MPS baseline grew further, to about 65%, but the score dropped dramatically to the 12-15MPPD range. Again, that is almost a two-thirds increase in total output, yet a whopping 30% or so decrease in allocated points, which seems a little too much.
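To put a number on that mismatch, here is a rough sketch of the arithmetic using only the figures quoted above. It takes the midpoint of each 4090 MPPD range as the "typical" score, which is my own simplification, and the helper names are made up for illustration. Dividing the PPD ratio by the WU throughput ratio gives the average credit per WU relative to running without MPS.

```python
def midpoint(low, high):
    """Midpoint of a reported MPPD range (a simplifying assumption)."""
    return (low + high) / 2


def relative_points_per_wu(ppd_ratio, throughput_gain):
    """Average credit per WU relative to the non-MPS baseline.

    ppd_ratio:       PPD with MPS divided by PPD without MPS
    throughput_gain: fractional increase in completed WUs (0.55 means +55%)
    """
    return ppd_ratio / (1 + throughput_gain)


# 4090 figures from the post, in MPPD
baseline_ppd = midpoint(20, 22)       # no MPS
two_session_ppd = midpoint(22, 30)    # two MPS sessions
three_session_ppd = midpoint(12, 15)  # three MPS sessions

two = relative_points_per_wu(two_session_ppd / baseline_ppd, 0.55)
three = relative_points_per_wu(three_session_ppd / baseline_ppd, 0.65)

print(f"two sessions:   {two:.2f}x credit per WU")
print(f"three sessions: {three:.2f}x credit per WU")
```

With these midpoints, each WU earns roughly 0.80x the baseline credit with two sessions and roughly 0.39x with three, so going to three sessions costs far more points per WU than it gains in throughput.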
I've since left the 4090 split into two MPS sessions. MPS is still configured for the 4070S, but for now it only has one resource group assigned to it. A week ago that one box would average about 33MPPD; it now averages right around 40MPPD. This is also on a motherboard and processor from literally 14 years ago (!). I have another 4070 Super with the same specs and config running on an ASRock B550 + 5950X rig which, again using the same exact everything, belts out 11-12MPPD. I have another ASRock B550 + Ryzen 5500 coming in the mail today and I'll get that i7-3930K put out to pasture shortly. I should find another ~10% or better performance hiding in there...