debugging sudden low performance on RX5500
Re: debugging sudden low performance on RX5500
28 is still not the best choice, but the automatic thread count processing is complex and not 100% complete. Manually change it to 24 (temporarily) and see if performance changes.
Posting FAH's log:
How to provide enough info to get helpful support.
Re: debugging sudden low performance on RX5500
sorry i left you hanging. the high CPU usage was constant - around 70%. i don't remember AMD CPU threads ever being this CPU-hungry. on my windows machine the AMD CPU thread takes about 0.5% of one core. but the GPU there is very old (baffin) so it might not be a good comparison with the RX5500, and it sounds like the AMD linux drivers are messed up to a certain degree.

JohnChodera wrote:
@astrorob: We are currently working internally by testing out a wide assortment of systems, which have a wide variety of short WUs for different kinds of workloads.
The high CPU load is concerning. Is this constant throughout, or periodic (at checkpoints)? If constant, it is likely that the OpenMM CustomIntegrator we use is somehow eating up more CPU time than it should when sequencing many kernel launches. We may be able to do something about that.
If it's periodic, this is because the CPU vs GPU sanity checks that happen at every checkpoint (now every 5%, rather than 25%) try to use a number of CPU threads equal to the number of cores. They'll automatically load-balance between threads, but if you've got other cores churning away, this could potentially slow things down a lot. I've been meaning to find a way to (1) let you control how many threads to use, and (2) split the CPU sanity checks off onto an asynchronous thread so the GPU can continue chugging along.
The best way to figure out what is going on is to watch the `science.log` that's generated (or point me to one) to see if the time per % between 5-6% and 9-10% is significantly slower, and whether this improves if you stop CPU workloads on other threads. If you can help us focus in on the issue, we can quickly get this fixed in an updated core and have you folding happily again.
Thanks so much for the feedback, and for helping us with the COVID Moonshot (http://covid.postera.ai/covid)!
~ John Chodera // MSKCC
i will check science.log. the problem at the moment seems to be that for some reason i'm not getting any WUs on this machine anymore, for either GPU (GTX 1060 or RX5500), so there is nothing to check. not sure if this problem is on my end or F@H's end...
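For anyone who wants to follow the science.log suggestion above, one way to compare time-per-percent is to scrape the timestamped progress lines. The sketch below is a rough, hypothetical helper; it assumes "Completed X out of Y steps (Z%)" lines with the HH:MM:SS prefix used by the FAH log excerpts elsewhere in this thread, so adjust the regex if your science.log differs.

```python
import re
import sys
from datetime import datetime, timedelta

# Assumed line format (as in the client log excerpts in this thread):
#   21:22:32:WU02:FS01:0x22:Completed 125000 out of 2500000 steps (5%)
PROGRESS = re.compile(r"^(\d{2}:\d{2}:\d{2}).*Completed \d+ out of \d+ steps \((\d+)%\)")

def segment_times(path):
    last = None   # (timestamp, percent) of the previous progress line
    for line in open(path, errors="replace"):
        m = PROGRESS.match(line.strip())
        if not m:
            continue
        t = datetime.strptime(m.group(1), "%H:%M:%S")
        pct = int(m.group(2))
        if last is not None:
            dt = t - last[0]
            if dt < timedelta(0):        # timestamps wrapped past midnight
                dt += timedelta(days=1)
            print(f"{last[1]:3d}% -> {pct:3d}%: {dt}")
        last = (t, pct)

if __name__ == "__main__":
    segment_times(sys.argv[1])   # e.g. python segments.py science.log
```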
Re: debugging sudden low performance on RX5500
Comparing my two machines on the 13416 WU, when I get a "fast" one, the 3950x is at ~1% cpu utilization (constant) on core 22. Current PPD is 1.6 million+ occasionally going up to 1.75 million.
When I get a "slow" 13416 WU, on the 2600X, cpu utilization (constant) is ~9.5% on core 22. Current PPD is 900,000+.
Both machines are using 5700XT GPUs.
BTW, I've moved the 3950x back up to -1 on threads. It's using 30 of them now, so I occasionally get a "big" WU to crunch. The 2600x still is NOT doing CPU crunching.
As a complication, on Linux (Mint) using a 1070ti, the 13416 WUs are almost 900,000 PPD (which seems normal), but "top" reports FahCore_22 at 100% utilization (quad-core i5 2500).
When I get a "slow" 13416 WU, on the 2600X, cpu utilization (constant) is ~9.5% on core 22. Current PPD is 900,000+.
Both machines are using 5700XT GPUs.
BTW, I've moved the 3950x back up to -1 on threads. It's using 30 of them now, so I occasionally get a "big" WU to crunch. The 2600x still is NOT doing CPU crunching.
As a complication, on Linux (Mint) using a 1070ti, the 13416 WU's are almost 900,000 PPD (which seems normal), but "top" reports that FahCore_22 at 100% utilization (Quad core i5 2500).
Re: debugging sudden low performance on RX5500
i managed to get FAHClient to pull some WUs but so far all the 13416's that i've gotten have pretty good performance with TPFs in the 4m30s range. CPU utilization on the FahCore_22 thread is ~9% when this kind of 13416 is running.
Re: debugging sudden low performance on RX5500
The "fast" 13416 WUs on my 3950x have a TPF at 2.0 minutes using a 5700XT. CPU utilitization on FAHCore_22.exe is ~1.0%.
Update: I got a "slow" one on the same machine.
The "slow" 13416 WU has a TPF of 03:26 (same graphics card). CPU utilization of FAHCore_22.exe is ~3.0%.
Re: debugging sudden low performance on RX5500
Please report the PRCG numbers associated with "slow" and "fast" as well as the driver versions you're running.
Posting FAH's log:
How to provide enough info to get helpful support.
Re: debugging sudden low performance on RX5500
Haven't seen a 13416 WU today. Looking in the logs, I've found these.
Fast (3.30194 hours, 198.117 minutes)
21:22:32:WU02:FS01:0x22:Project: 13416 (Run 13, Clone 262, Gen 1)
00:40:33:WU02:FS01:0x22:Average performance: 144.966 ns/day
00:40:39:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:13416 run:13 clone:262 gen:1 core:0x22
followed by a slow one (6.92694 hours, 415.617 minutes)
00:40:40:WU01:FS01:0x22:Project: 13416 (Run 1070, Clone 294, Gen 1)
07:36:10:WU01:FS01:0x22:Average performance: 82.3642 ns/day
07:36:17:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:13416 run:1070 clone:294 gen:1 core:0x22
Radeon drivers 20.7.2
Windows 10 Pro 1909 (OS Build 18363.836)
FAHClient/FAHControl: 7.6.13
OpenMM_22 (0.11)
Gigabyte 5700XT Gaming OC 8Gb
AMD 3950x (with simultaneous CPU WUs 30 threads (-1))
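For collecting the PRCG numbers and run times requested earlier, the three log line formats quoted above can be scraped directly. This is a hedged sketch keyed to those exact lines from the client log; the log path at the bottom is an assumption.

```python
import re
from datetime import datetime, timedelta

# Regexes keyed to the exact line formats quoted above (FAHClient log style).
START = re.compile(r"(\d{2}:\d{2}:\d{2}).*0x22:Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
PERF = re.compile(r"0x22:Average performance: ([\d.]+) ns/day")
SENT = re.compile(r"(\d{2}:\d{2}:\d{2}).*Sending unit results:.*project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")

def hhmmss(s):
    return datetime.strptime(s, "%H:%M:%S")

def scan(path):
    starts, last_perf = {}, None
    for line in open(path, errors="replace"):
        if (m := START.search(line)):
            starts[m.group(2, 3, 4, 5)] = hhmmss(m.group(1))
        elif (m := PERF.search(line)):
            last_perf = float(m.group(1))
        elif (m := SENT.search(line)):
            key = m.group(2, 3, 4, 5)
            if key in starts:
                elapsed = hhmmss(m.group(1)) - starts[key]
                if elapsed < timedelta(0):   # wall clock wrapped past midnight
                    elapsed += timedelta(days=1)
                p, r, c, g = key
                print(f"Project {p} (Run {r}, Clone {c}, Gen {g}): {elapsed}, {last_perf} ns/day")

scan("log.txt")   # log path is an assumption; point it at your FAH log
```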
Re: debugging sudden low performance on RX5500
Just curious: How many CPU threads are idle when OpenMM_22 (0.11) decides it wants to use CPU resources? You MIGHT improve your throughput on some of the 134xx WUs if you further restricted the 30 threads.

Starman157 wrote:
Radeon drivers 20.7.2
Windows 10 Pro 1909 (OS Build 18363.836)
FAHClient/FAHControl: 7.6.13
OpenMM_22 (0.11)
Gigabyte 5700XT Gaming OC 8Gb
AMD 3950x (with simultaneous CPU WUs 30 threads (-1))
Posting FAH's log:
How to provide enough info to get helpful support.
Re: debugging sudden low performance on RX5500
As in 0% utilization over long periods? None. Every thread shows some "blips" along the way. With FAHCore_22 running, the 3950x is at about 4% utilization (depending on WU), so that 4% is spread out (unevenly) over all possible threads.
But when OpenMM_22 fires up, CPU WUs will take 30 threads. The one remaining core (2 threads) deals with FAHCore_22 (OpenMM_22) and all other Windows processes. In fact, I've set the affinity of the CPU slot (FAHCore_a7.exe) to NOT use the weakest core on the 3950x, and this results in LESS performance! I surmise that when FAHCore_a7 needs to do something, the Windows scheduler doesn't like that it finds everything busy and as such has to "delay". When FAHCore_a7 is given free rein over all threads, performance goes up.
As for improving throughput on some 13416 WUs? Sure. But at a cost. I found that the marginal increase was more than offset by the slowdown on the CPU WUs. This effect only happens on the 3950x. Yes, I have found the sweet spot at 29 threads for FAHCore_a7, so that FAHCore_22 isn't "starved" for cycles, but this now negates FAHCore_a7 being allowed to crunch the BIGADV WUs. Also, the cost of fewer threads for CPU crunching far outweighs the increase in GPU crunching. The overall PPD, and therefore WUs per day, is greater with 30 threads of CPU work (BIGADV unit effect) even though it starves the GPU scheduling somewhat.
Note: This is NOT the case when the same graphics card is used in the other machine, the AMD 2600x (6C/12T). There, the CPU crunching definitely starves the GPU work, and the PPD of CPU work does NOT make up for the loss in GPU work. So I run this machine GPU-only, using the CPU to keep the GPU fed with work. I've tried most possibilities for setting the threads on this machine (10T, 8T, 6T... etc.), but never found an instance where this did not result in a net loss of PPD (GPU starvation). I was guessing that since I was lowering the 3950x to 30 threads (out of 32 possible), the 2600x should be the same at 10 threads (out of a possible 12). It doesn't work that way, at least with the WUs it was being fed. It could still be the case that YMMV.
The only other complication is that the 3950x machine has its GPU connected via PCI Express 4.0 (X570 mobo). The 2600x is connected by PCI Express 3.0 (B450). I normally wouldn't think that this would make a difference, as the amount of data transferred over the PCI Express bus by the folding programs is minimal at best; I'm seeing VRAM usage (of 8Gb) in the mid "teens" (14-17%).
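For anyone who wants to experiment with affinity from a script instead of Task Manager, psutil can set it programmatically. A minimal sketch, assuming FahCore_a7.exe is the process name on your box and that logical CPUs 0-1 are the ones you want to keep free; it needs to run with sufficient privileges.

```python
import psutil

TARGET = "FahCore_a7.exe"   # process name is an assumption; check your Task Manager
EXCLUDE = {0, 1}            # logical CPUs to leave free (hypothetical choice)

allowed = [c for c in range(psutil.cpu_count(logical=True)) if c not in EXCLUDE]

for proc in psutil.process_iter(["name"]):
    if (proc.info["name"] or "").lower() == TARGET.lower():
        proc.cpu_affinity(allowed)   # same effect as Task Manager's "Set affinity"
        print(f"PID {proc.pid}: limited to {len(allowed)} logical CPUs")
```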
Re: debugging sudden low performance on RX5500
Seeing "blips" isn't what I'm asking about. The scheduler in your OS may assign work to any free thread at any time.
If you have 32 threads, Windows reports that CPU utilization may go as high as 100%, so if it reports between 90% and 97%, there are between 1 and 3 idle threads. (Linux reports it differently.) Assuming you do NOT set affinity, there's still a cost to starting a thread briefly and then having it wait. FAH will assume that one GPU needs one CPU, so running GROMACS and/or ROSETTA on a maximum of 30 threads is still too much because there are always other things going on. Moreover, FAHCore_2x does the sanity checks on the CPU, so even if the rate at which FAHCore_2x generates new kernels can be assumed to need only one thread, you need to leave more free.
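To make that arithmetic concrete, the conversion from reported overall utilization to approximate idle logical CPUs is just:

```python
def approx_idle_threads(total_threads, utilization_pct):
    # Reported overall CPU% -> approximate count of idle logical CPUs.
    return total_threads * (1.0 - utilization_pct / 100.0)

print(approx_idle_threads(32, 97))   # ~1 idle thread (0.96)
print(approx_idle_threads(32, 90))   # ~3 idle threads (3.2)
```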
Posting FAH's log:
How to provide enough info to get helpful support.
Re: debugging sudden low performance on RX5500
Yes, of course.
I already mentioned that FAHCore_2x runs best when I set the CPU threads to 29 (well, 28 actually, since 29 is a prime >3; a drop of 2 cores/4 threads from max), which allows more extraneous work to be spread around the available HT CPU cores. However, 28 is not a number of threads that allows the BIGADV WUs to be worked on. I've seen WUs with the threads set to 30 take upwards of 90 minutes to complete. When set to 28, the work units it gets take at most about 35 minutes. The PPD difference is noticeable, with BIG units around 550-600K and "small" units around 250-275K. Again, these vary depending on the WU given.
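For reference, the "29 is a prime >3" remark refers to the client stepping down from CPU thread counts that are primes larger than 3, since those decompose poorly for GROMACS. A tiny, hypothetical helper illustrating the rule as described here (not actual FAHClient code):

```python
def is_prime(n):
    # Trial division; fine for small thread counts.
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def effective_thread_count(requested):
    # Step down from counts that are primes greater than 3 (e.g. 29 -> 28),
    # mirroring the behaviour described in this post; not actual FAHClient logic.
    n = requested
    while n > 3 and is_prime(n):
        n -= 1
    return n

print(effective_thread_count(29))   # 28
print(effective_thread_count(30))   # 30
```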
Re: debugging sudden low performance on RX5500
The fact that one WU can be completed in 90 minutes and another can be completed in 35 means absolutely nothing. No two WUs in p134xx take the same amount of time. It's not a traditional project running a single protein. And 6.92694 hours is not really slow.
Avoid Covid if you choose, but a script like that would certainly not be in the spirit of FAH:
The fact that a specific protein has been placed in a project 134xx means it DOES have developer attention, including wanting to determine which GPUs it runs slow on and why.

zookeeny wrote:
I'm considering changing my FAHClient to run only non-Covid tasks to avoid these WUs. Or instead of that, maybe writing a script to auto-dump them if they hang the GPU for more than 5 minutes. It's a shame that's necessary, but like you said, AMD cards running on Linux are a rarity... I doubt they warrant much developer attention.
Posting FAH's log:
How to provide enough info to get helpful support.
Re: debugging sudden low performance on RX5500
I understand that. The credit given for work units when 30 threads are active is about 4x higher than if only 28 threads are.
Also, 6.9 hours to complete a WU, as you say, is not slow. But when a WU from the same project, 13416, takes 1/3 the time on the same hardware, the question remains: why is one WU 3 times slower than a fast work unit? This could be why 13416 has such a bad "rap" (causing others to possibly "auto-dump" them), because the perception is that the same project will create loads that are essentially the same.
In addition, there are others who've ranked various graphics cards in relation to FAH loads. These rankings are useless when there's such variability within the same project's WUs. This leads to consumer confusion as to which graphics card to buy ($/performance) when creating FAH compute boxes. Power/performance numbers are also useless, and ultimately, it is the cost of electricity that determines time on task.
Re: debugging sudden low performance on RX5500
We're not quite sure yet. The different RUNs do contain the same numbers of atoms, but the ligands (which are treated with a variety of special `CustomForce` kernels compiled specifically from alchemically-modified forces) do differ in size. We think this is likely the cause, and that the custom kernels, or the constraints associated with these ligands, lead to kernels that perform differently on different hardware.

> Also, 6.9 hours to complete a WU, as you say, is not slow. But when a WU from the same project, 13416, takes 1/3 the time on the same hardware, the question remains: why is one WU 3 times slower than a fast work unit? This could be why 13416 has such a bad "rap" (causing others to possibly "auto-dump" them), because the perception is that the same project will create loads that are essentially the same.
> In addition, there are others who've ranked various graphics cards in relation to FAH loads. These rankings are useless when there's such variability within the same project's WUs. This leads to consumer confusion as to which graphics card to buy ($/performance) when creating FAH compute boxes. Power/performance numbers are also useless, and ultimately, it is the cost of electricity that determines time on task.
We're looking into this in more detail over the next week or two now that we have some help on board, but we're still getting a ton of useful results to help prioritize ligands for the COVID Moonshot (http://covid.postera.ai/covid), and the science is demanding a lot of attention to make sure we can continue to help the chemists identify how to push toward potent inhibitors as we get closer to animal models in advance of clinical trials. We realize this isn't ideal, but we're not operating under ideal circumstances while trying to aid a rapidly-moving open science drug discovery project.
What we've been able to do is run a benchmark project (17100) that contains a variety of workloads in short WUs on a huge variety of GPUs on FAH, and we are lucky enough to have a data scientist working with us for a few months to help us refine our GPUSpecies categories so we can better restrict projects to swaths of GPUs that should perform well on the workload. This doesn't quite fix the within-project PPD variation, but he'll hopefully be able to help us with that too.
Thanks again for bearing with us. All of this will improve with each generation of project, but it will take some time while we onboard more folks to help with the infrastructure so we can focus on the science and drug discovery.
~ John Chodera // MSKCC
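For the curious, the alchemically-modified `CustomForce` kernels mentioned above follow the general pattern of softcore custom forces in OpenMM. The snippet below is purely illustrative of that pattern, not the actual Moonshot setup; the energy expression and parameter values are placeholders.

```python
import openmm   # on older installs this lives under "simtk.openmm"

# A generic softcore Lennard-Jones expression scaled by a global "lambda_sterics"
# parameter. OpenMM compiles the string into a GPU kernel when the Context is
# created, which is part of why ligand-specific custom forces can behave
# differently from one WU to the next.
energy = (
    "4*epsilon*lambda_sterics*(1/reff6^2 - 1/reff6);"
    "reff6 = alpha*(1-lambda_sterics) + (r/sigma)^6;"
    "epsilon = sqrt(epsilon1*epsilon2); sigma = 0.5*(sigma1+sigma2); alpha = 0.5"
)

force = openmm.CustomNonbondedForce(energy)
force.addGlobalParameter("lambda_sterics", 1.0)   # 1.0 = fully interacting ligand
force.addPerParticleParameter("sigma")
force.addPerParticleParameter("epsilon")

# Two placeholder particles, just to show the per-particle parameter plumbing.
force.addParticle([0.3, 0.6])   # sigma (nm), epsilon (kJ/mol)
force.addParticle([0.3, 0.6])
```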
Re: debugging sudden low performance on RX5500
Welcome to the F@H Forum Starman157,

Starman157 wrote:
...However, 28 is not a number of threads that allows the BIGADV WUs to be worked on...
Please note that bigadv WUs were discontinued many years ago: viewtopic.php?f=24&t=25598
Thus, there are no specific WUs that you will be assigned only by having X number of threads. Instead, CPU WUs can be assigned across a wide range of thread counts, and in some cases where the thread count is high, a WU can fail due to domain decomposition errors. However, work is being done to improve this so that donors don't have to worry about it and can contribute without issues.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues