Fun Fact:
Of the two systems running, the setup with an i5-6600 and 7 x RX 580 is around 15% faster than the G4400 running 5 x RX 580 when both machines happen to be running WUs from the same project. This has held steady over several days, not just once in a while.
PantherX wrote:The servers don't differentiate between the speeds of GPUs, just the architecture, so they don't know whether a Pascal GPU is a low-end or a high-end one. However, they do get the RAM available, so some researchers might use that if their project uses a significant amount of RAM. Like you said, though, if you have heaps of RAM and a low-end GPU, you will still be assigned those WUs.
Actually, getting bigger WUs that use more RAM would probably be better, since they ease the CPU load and take longer on the GPU. That can already be seen in the numbers I get, where higher RAM usage comes with a lower CPU load and generates more PPD. So, are there any statistics available showing the WU size distribution based on the amount of RAM available?
What I do find rather pointless is that the GPU's memory is hardly used at all. The VRAM usage I see is minimal at best, so judging by the numbers it doesn’t seem to matter whether the graphics card has 100 MB of memory or 8 GB, since it isn’t used much either way.
I would assume that more of the WU could be put directly into GPU memory in one go, instead of constantly feeding it tiny pieces, especially since I find it hard to believe that, now that Core 22 requires OpenCL 1.2, you will find a graphics card with less than 512 MB of available memory.
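For anyone who wants to sanity-check how little VRAM the cores actually touch, here is a minimal sketch in Python, assuming a Linux host with the amdgpu driver (which exposes per-card VRAM counters in sysfs); on Nvidia cards the same numbers come from nvidia-smi instead.

```python
#!/usr/bin/env python3
# Minimal VRAM check for AMD cards on Linux. Assumes the amdgpu driver,
# which exposes mem_info_vram_used / mem_info_vram_total under sysfs.
import glob
import os

for dev in sorted(glob.glob("/sys/class/drm/card[0-9]/device")):
    try:
        with open(os.path.join(dev, "mem_info_vram_used")) as f:
            used = int(f.read())
        with open(os.path.join(dev, "mem_info_vram_total")) as f:
            total = int(f.read())
    except (FileNotFoundError, ValueError):
        continue  # not an amdgpu card, or an older kernel without the counters
    card = dev.split("/")[-2]
    print(f"{card}: {used / 2**20:.0f} MiB of {total / 2**20:.0f} MiB VRAM in use")
```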
Umm.. No.
Increase the RAM size, and you increase the data going back and forth between the RAM and the CPU. That's an additional bottleneck.
Run small WUs, and more of the data being exchanged comes straight from the CPU's cache.
The GPU will still need to be constantly fed by the CPU, whether the WU is large or small.
As far as GPU VRAM goes, I've seen it as high as just a little over 500 MB.
Too bad they don't sell an RTX 2080 Ti with only 1 or 2 GB of VRAM at a reduced cost.
I bet if the GPU only has 100 MB of VRAM, it'll need to do some paging in and out of VRAM, which adds PCIe load and causes a major performance penalty.
The same can be said of games, where increased VRAM results in increased performance on an otherwise identical system.
Anything below 500 MB will most certainly lower performance on some WUs.
HugoNotte wrote:In general FAH needs 1 CPU thread per GPU. Since AMD works with interrupts, unlike Nvidia, the CPU doesn't really need a full thread, and you can get away with a bit more than 1.5 GPUs per CPU thread.
Are you running Windows? If so, you might have better luck with Linux.
A better way to put it is that Nvidia prefers 1 CPU thread per GPU, but it doesn't strictly need it.
You can very well run 2 GPUs on a single thread (provided the CPU is fast enough).
You'll just notice a small dip in performance of a few percent, which is what AMD GPUs get by default.
The reverse is true too: if AMD had optimized their drivers like Nvidia has, their GPUs would see a minor bump in performance.
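If anyone wants to check which behaviour their own setup shows, here is a rough sketch in Python using psutil; it assumes the folding core processes carry "FahCore" in their name (the usual naming on both Windows and Linux), so treat it as an illustration rather than anything official.

```python
#!/usr/bin/env python3
# Rough check of how much CPU each folding core really uses, to see whether
# a GPU slot pins a full thread (busy-waiting) or mostly idles on interrupts.
# Assumes core processes are named like "FahCore_22"; requires psutil.
import time
import psutil

cores = [p for p in psutil.process_iter(["name"])
         if "fahcore" in (p.info["name"] or "").lower()]

for p in cores:
    p.cpu_percent(None)   # first call just primes the counter
time.sleep(5)             # sample over a 5-second window
for p in cores:
    print(f"pid {p.pid:>6}  {p.info['name']:<16}  {p.cpu_percent(None):5.1f}% of one core")
```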
MeeLee wrote:Umm.. No.
Increase the RAM size, and you increase the data going back and forth between the RAM and the CPU. That's an additional bottleneck.
This doesn’t match the numbers this software produces on the hardware, since everything I have seen so far shows that the more RAM a WU uses in a FAH process, the less CPU intensive it is and the more PPD it gets.
A good example is the P13409 test project flooding the network these days: the WU is tiny, yet it still manages to use more CPU resources than all the other non-P13409 WUs running at the same time put together, while also getting a very low PPD.
Sparkly wrote:...Actually, getting bigger WUs that use more RAM would probably be better, since they ease the CPU load and take longer on the GPU. That can already be seen in the numbers I get, where higher RAM usage comes with a lower CPU load and generates more PPD. So, are there any statistics available showing the WU size distribution based on the amount of RAM available?...
AFAIK, there hasn't been a chart of RAM usage vs. project. However, the closest approximation would be based on the atom count of a project, which can be found here: https://apps.foldingathome.org/psummary?visibility=ALL. You will need to sort on atoms and then focus only on OPENMM_21 and OPENMM_22, as those are the GPU projects.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
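If anyone wants that sort without scrolling the page by hand, here is a quick sketch (Python with pandas); it assumes the psummary page still renders as a plain HTML table with "Core" and "Atoms" columns, so adjust the column names if the layout has changed.

```python
#!/usr/bin/env python3
# Pull the project summary table and list the biggest GPU projects by atoms.
# Column names ("Core", "Atoms") are an assumption about the current layout.
import pandas as pd

URL = "https://apps.foldingathome.org/psummary?visibility=ALL"

projects = pd.read_html(URL)[0]          # first <table> on the page
gpu = projects[projects["Core"].isin(["OPENMM_21", "OPENMM_22"])]
print(gpu.sort_values("Atoms", ascending=False).head(20).to_string(index=False))
```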
PantherX wrote:AFAIK, there hasn't been a chart of RAM usage vs. project. However, the closest approximation would be based on the atom count of a project, which can be found here: https://apps.foldingathome.org/psummary?visibility=ALL. You will need to sort on atoms and then focus only on OPENMM_21 and OPENMM_22, as those are the GPU projects.
Well, based on that list I am also getting the high atom count projects, since I can see in the logs that I have P14415, P14416, P14417 and P14561, so the WU distribution doesn’t seem very picky about the RAM size reported during distribution. These projects are also the ones showing the highest PPD in the logs, so more atoms use more RAM, which in turn uses less CPU and more GPU, and thus gives more PPD.
Clearly the software has significant room for improvement regarding the handling of WUs on GPU systems, especially for the smaller atom count ones, since in comparison the P13409 test project, which has a tiny atom count, is the worst performer in this setup.
Sparkly wrote:...Clearly the software has significant room for improvement regarding the handling of WUs on GPU systems, especially for the smaller atom count ones, since in comparison the P13409 test project, which has a tiny atom count, is the worst performer in this setup.
Keep in mind that GPU architecture has a big impact on some work units, and it appears any given GPU might perform worse on some WUs and better on others. This would probably happen regardless of RAM and/or CPU use. I've found that in a system with only one meager onboard GPU, CPU use is not related to atom counts at all, but more to checkpoints. RAM does seem to vary with atom counts more, and in my case the GPU memory allocation can change since it shares system memory. A small atom count might use less than half the GPU memory of a large atom count work unit.
In the case of 13409, the small atom count seems to be disliked by many of the mid to higher end cards, in particular AMD GPUs. They get much better throughput with higher atom count projects and for some reason bottleneck on the smaller ones. But having run several on my 2400G with Vega 11, I've found they are by far the highest PPD returns I've had on any WU, producing about twice my usual average PPD. As atom counts grow, PPD trends down on my system, but this is also affected by checkpoints and CPU use. Larger does not always mean a lower points return.
I think the server-side assignment logic needed to only feed us the specific WUs that always run most efficiently on our rigs would be fairly complex. Even if system hardware specifics are accounted for, it would be impossible to account for background tasks taking up memory and resources on any system not 100% dedicated to folding. And in the case of your rig... it would see available memory but have no way to know how many cards you always keep folding. Even beyond that, there would be no way to be sure how many "optimum" systems are available for a project, so it might have to overflow onto less-than-optimum systems to get the science done quickly.
In a perfect world, we would all only get the projects that our hardware runs the best. But with all the variables involved, it's far from perfect and probably a complex issue to tackle.
I've had some of my best performing projects followed by some near my worst performing. I just fold them. It all balances out somewhat over time.
Why would you say larger WUs with a higher atom count use less CPU?
It's counter-intuitive.
The CPU has to feed the GPU. The GPU needs to process more data, so more data needs to be fed into the GPU.
Perhaps proportionally there's less CPU usage (e.g. on a fast system with a fast GPU and CPU, the CPU load will be lower at any given time than on a slow system).
But if one has a slow CPU and slow GPU, large WUs cause more overhead (CPU- and RAM-wise).
That's my understanding of it, anyway.
@BobWilliams:
Sometimes larger GPUs run at lower frequencies.
It is entirely possible to run an RTX 2080 Ti at frequencies equivalent to a GTX 1660, meaning above 2 GHz, on small WUs.
However, if the GPU is stock, the GTX 1660 may boost to 2 GHz, while the 2080 Ti may stay at 1935-1995 MHz.
And if the WU only uses, say, 1000 shader cores and runs at the same frequency, the PPD difference between the 2080 Ti and the 1660 may be negligible on smaller WUs.
Which brings up the next issue: with upcoming GPUs being a lot larger, we're going to want much higher atom count WUs.
Not sure if a higher atom count is going to be beneficial, or just going to waste GPU cycles when investigating a structure.
FAH will probably need to put some research into running 2 smaller WUs per GPU.
Like in the above example where only 1000 shaders are used: the 2080 Ti has 4352, so those small WUs could be stacked up 4x.
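Just to put rough numbers on that stacking idea (the 1000-shader footprint is the hypothetical figure from the post; 4352 is the actual CUDA core count of a 2080 Ti):

```python
# Back-of-the-envelope arithmetic for the stacking idea above.
wu_shaders = 1000      # shaders a small WU can keep busy (assumed figure)
gpu_shaders = 4352     # CUDA cores on an RTX 2080 Ti

concurrent = gpu_shaders // wu_shaders
print(f"Roughly {concurrent} such WUs could run side by side before "
      f"the card runs out of shaders.")
```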
MeeLee wrote:I would presume that FAH won't assign large WUs to those GPUs in the first place.
And that it looks at whatever RAM is available.
In my case, that was not particularly helpful. In a Windows machine with a GT 740, a GTX 960, and a CPU:6 slot running, and with (only) 6 GiB of RAM, the GT 740 was assigned a WU with over 400K atoms. I don't remember what was active in the other two slots, but it spent several days thrashing the paging file. At my request, they did impose a minimum RAM requirement for that project, but it could still be assigned to a slow GPU.
Doing some RAM testing on the systems, so I have upgraded from Single Channel to Dual Channel and put more RAM in both systems:
System 1 – G4400 and 5 cards
From 4GB Single Channel to 8GB Dual Channel
System 2 – i5-6600 and 7 cards
From 4GB Single Channel to 16GB Dual Channel
System 1 got a 10-15% PPD increase when running 5 cards, depending on project, with the projects using more RAM getting the biggest increase.
System 2 got a 5-10% PPD increase when running 7 cards, depending on project, with the projects using more RAM getting the biggest increase.
System 1 (2 cores / 2 threads) showed a slightly lower overall CPU load than before, so it can now run 6 cards.
System 2 (4 cores / 4 threads) showed a significantly lower overall CPU load than before, so it can now run all 9 cards.
Most projects don’t seem to grab more RAM than before, even if it is available, so the small to mid atom count projects still only grab the 200-400MB they did before, even though there are several more GB available.
High atom count projects, like P14253 and P14201, now run more smoothly and grab 1GB or more while running, with the occasional peak where CPU usage maxes out and RAM use nearly doubles for a few seconds.
Activating additional cards lowered the PPD of each running card in each system, more on System 1 than on System 2, but the decrease in single-card PPD was more than recovered by the additional cards, so there was a healthy overall PPD increase in each system compared to running fewer cards at a somewhat higher per-card PPD.
Using a Dual Channel memory configuration clearly helps when running multiple cards and FAH processes, and it seems to help more the more cores the CPU is running at the same time, resulting in a lower overall CPU load compared to Single Channel.
More RAM helps overall throughput, with fewer WUs rejected, which is especially true for the high atom count projects that grab more RAM, but it changes little for the lower atom count projects.
All the numbers show the same thing they have since the start: the high atom count projects use more RAM and give more PPD while putting less load on the CPU.
Getting very low atom count projects, like the test project P13413 with 4082 atoms, which use a lot of CPU while running but have a very low PPD on GPU, also diminishes the total system throughput for the rest of the running processes; it can drop by as much as 40%, depending on the number of low atom count projects running at the same time in the system.
Seeing as the low atom count projects use A LOT of CPU to run on GPU systems, it might be more useful to just assign those projects to CPU in the first place, since the atom count is clearly known up front, before the project is assigned to CPU or GPU.
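As a rough sanity check on the Dual Channel point above, assuming both boards run DDR4-2133 (the stock speed for a G4400 and an i5-6600), the theoretical bandwidth per channel is just the transfer rate times the 8-byte bus width:

```python
# Theoretical memory bandwidth, assuming DDR4-2133 modules.
transfer_rate_mts = 2133          # mega-transfers per second
bus_width_bytes = 8               # one 64-bit channel

per_channel_gbs = transfer_rate_mts * bus_width_bytes / 1000   # ~17 GB/s
print(f"Single Channel: ~{per_channel_gbs:.0f} GB/s")
print(f"Dual Channel:   ~{2 * per_channel_gbs:.0f} GB/s")
# With 5-9 FAH core processes all streaming data to their GPUs at once,
# doubling this ceiling is what takes the pressure off the memory controller.
```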
Sparkly wrote:Seeing as the low atom count projects use A LOT of CPU to run on GPU systems, it might be more useful to just assign those projects to CPU in the first place, since the atom count is clearly known up front, before the project is assigned to CPU or GPU.
What do you mean by assigning those projects to CPU in the first place? What characteristics should a system have to get that assignment?
bruce wrote:What do you mean by assigning those projects to CPU in the first place? What characteristics should a system have to get that assignment?
Clearly the atom count is known by the people making the projects, so the low atom count projects could be assigned to the a7 (or equivalent) core exclusively from the start, instead of also adding in Core 22 or equivalent, thus preventing the low atom count projects from ending up on GPUs at all. It might be as simple as saying that projects with fewer than 20k atoms, or whatever the right number is, are CPU-only projects when making the WUs.
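Purely to illustrate how simple the proposed rule would be (the 20k cut-off and the core names are just the example from this post, not anything the assignment servers actually do):

```python
# Hypothetical sketch of the rule proposed above -- NOT how the real
# assignment servers work. Threshold and core names are illustrative only.
ATOM_THRESHOLD = 20_000

def eligible_cores(atom_count: int) -> list[str]:
    """Route tiny projects to the CPU core, everything else to the GPU cores."""
    if atom_count < ATOM_THRESHOLD:
        return ["GRO_A7"]                 # a7, CPU only
    return ["OPENMM_21", "OPENMM_22"]     # GPU cores

print(eligible_cores(4_082))     # e.g. P13413 -> CPU only
print(eligible_cores(400_000))   # big WU -> GPU cores
```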
True, and that makes sense to me. In fact, I suspect that's the way it normally works.
There is a certain amount of cross-pollination, though. The number of CPU projects needs to be correlated with the number of active CPU slots, and the number of GPU projects needs to be correlated with the number of active GPUs, so that all projects have a reasonable chance of being processed. I have heard of some projects that were cloned and released to both CPUs and GPUs when things got out of balance.
Right now, there's a certain amount of benchmarking going on to determine what to assign and what NOT to assign to your GPU. The suitability of a specific project for a specific platform is somewhat correlated with the atom count, but I suspect there are other factors that need to be considered.