13400 assigned to GPUs that are way too slow

Posted: Sat Apr 25, 2020 1:15 pm
by MaartenBaert
I'm running F@H on 3 machines with the following GPUs:

- Nvidia NVS 310 (very slow)
- Nvidia GTX 660
- Nvidia GTX 1060

Today all three GPUs were assigned project 13400 with a 2-day deadline. However, the GTX 660 needs ~2.4 days to complete this WU and the NVS 310 needs ~26.8 days! The GTX 1060 needs ~16.3 hours, so that's fine.
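For illustration, the comparison above can be written out as a quick check. This is only a sketch using the estimates quoted in this post; the `fits_deadline` helper is a made-up name and not part of the F@H client.

```python
# Estimated completion times quoted in the post, converted to hours.
DEADLINE_HOURS = 2 * 24  # 2-day deadline reported for project 13400

estimates_hours = {
    "NVS 310": 26.8 * 24,
    "GTX 660": 2.4 * 24,
    "GTX 1060": 16.3,
}


def fits_deadline(estimate_hours, deadline_hours=DEADLINE_HOURS):
    """Return True if the estimated run time is within the deadline."""
    return estimate_hours <= deadline_hours


results = {gpu: fits_deadline(hours) for gpu, hours in estimates_hours.items()}

for gpu, ok in results.items():
    verdict = "fits" if ok else "misses the deadline"
    print(f"{gpu}: {estimates_hours[gpu]:.1f} h, {verdict}")
```

Only the GTX 1060 comes in under the 48 hours.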

Are these WUs supposed to be that slow? If so, they probably shouldn't be assigned to weaker graphics cards ...

Re: 13400 assigned to GPUs that are way too slow

Posted: Sat Apr 25, 2020 2:45 pm
by Neil-B
Caveat: Estimated times in the early folds (pre 5%) of a new project on a slot can be fairly anomalous.

The NVS may be getting a tad slow even for smaller WUs. Its specs look to be 48 shaders and only OpenCL 1.1 (but with double-precision FP), which, going by recent threads, might put it fairly close to retirement.

The other two are fairly old generation but still doing well, considering that 13400 (iirc) uses the latest GPU core and (again iirc) uses parts of OpenMM not used previously, so it doesn't surprise me that it pushes the cards a bit. You are right that the 660 probably shouldn't be sent this type of WU, but I'm not sure how much granularity of control the AS has on this (hopefully enough). I'm guessing folders with the latest GPUs are loving them.

What OS is each of the GPUs running under? It might be relevant, going by something I read earlier today (viewtopic.php?f=19&t=34745&p=329658#p329658). This is obviously a very quickly changing scenario, so that post may be outdated. Also see viewtopic.php?f=19&t=34745&p=329702#p329737 which explains that even with the issues they are having it is still helping the science.

It may be that one of the team can walk you through the best way to "dump" the two WUs on the slower cards so that they get flagged for "immediate" reassignment.

Re: 13400 assigned to GPUs that are way too slow

Posted: Sat Apr 25, 2020 3:48 pm
by MaartenBaert
The GTX 660 is running on Arch Linux, kernel version 5.6.6, Nvidia driver version 440.82, CPU is Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz.
The NVS 310 is running on CentOS 7, kernel 3.10.0, Nvidia driver version 390.116, CPU is Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz.
The GTX 1060 is running on CentOS 7, kernel 3.10.0, Nvidia driver version 440.64, CPU is Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz.

I should probably just disable the NVS 310, it has even less processing power than the CPU.

Edit: Would it be feasible to transfer the unfinished WU from the GTX 660 to the GTX 1060? Is it just a matter of copying the files or will that break things?

Re: 13400 assigned to GPUs that are way too slow

Posted: Sat Apr 25, 2020 4:39 pm
by Joe_H
It is more than just copying the files. A WU and the necessary other files can be moved to a similar enough machine, but it is complicated if you are already processing WUs on that machine. It is not something I can recommend.

Re: 13400 assigned to GPUs that are way too slow

Posted: Sun Apr 26, 2020 3:01 am
by Nuitari
Got 13400 (42, 21, 4) assigned to a Radeon Baffin XT RX 560 (not Ellesmere) and the TPF is at 24 minutes, 28 secs.
Likely going to complete midway between the timeout and the expiration.
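As a rough sanity check on that estimate, the TPF can be scaled up to a whole WU. This sketch assumes the WU is split into 100 frames (one frame per 1% of progress), which is an assumption here and not something stated in this thread.

```python
# Time per frame (TPF) reported above: 24 minutes 28 seconds.
TPF_SECONDS = 24 * 60 + 28

# Assumption: 100 frames per WU, i.e. one frame per percent of progress.
FRAMES_PER_WU = 100

total_seconds = TPF_SECONDS * FRAMES_PER_WU
total_hours = total_seconds / 3600
total_days = total_hours / 24

print(f"Estimated run time: {total_hours:.1f} h ({total_days:.2f} days)")
```

Under that assumption the RX 560 needs a bit under 41 hours, which is still inside the 2-day deadline mentioned in the first post.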

Re: 13400 assigned to GPUs that are way too slow

Posted: Sun Apr 26, 2020 4:50 am
by lazyacevw
Do you have your 310 or 660 set to client-type advanced? If so, you might want to remove them. Not sure but looks like 13400 is a beta task. My 1080 TI is 4 min 60 sec per fold, so about 6 or 7 hours.

https://stats.foldingathome.org/project?p=13400

Re: 13400 assigned to GPUs that are way too slow

Posted: Sun Apr 26, 2020 6:08 am
by bruce
Yes, the settings for project 13400 are being adjusted. It is a very large project and probably should be restricted to hardware faster than any of yours except the GTX 1060 which should be able to handle it.

Re: 13400 assigned to GPUs that are way too slow

Posted: Sun Apr 26, 2020 6:41 am
by lazyacevw
I like the larger/longer projects. Keeps the GPUs gainfully employed and reduces the number of connection requests to the work and collection servers.

Re: 13400 assigned to GPUs that are way too slow

Posted: Sun Apr 26, 2020 11:27 am
by MaartenBaert
All my clients are using default settings. The GTX 1060 has indeed completed the WU without issues.

Re: 13400 assigned to GPUs that are way too slow

Posted: Mon Apr 27, 2020 3:12 am
by lazyacevw
If you are running stock settings, I guess the devs need to figure out a way to blacklist less powerful GPUs to avoid waste. All of my WUs so far today are 7 hour tasks on my 1080 Tis. Not blacklisting or making them beta tasks will just cause the WUs to time out on less powerful GPUs.

Re: 13400 assigned to GPUs that are way too slow

Posted: Mon Apr 27, 2020 5:12 am
by JohnChodera
Apologies for the extremely short deadline/timeout for 13400. This is a brand new type of workload for us: relative binding free energy calculations using a new nonequilibrium integrator that exploits features just rolled out in core22 0.0.5.
We're still learning how to improve things, and there will be some hiccups in the first few projects (like 13400).
I've changed 13400 to collect-only, and we'll be making modifications to future iterations of this workload.
We collected a ton of useful data in this first trial that we'll use to make improvements.
Thanks again for bearing with us!

~ John Chodera // MSKCC

Re: 13400 assigned to GPUs that are way too slow

Posted: Mon Apr 27, 2020 5:56 am
by Nuitari
The 1050 Ti is just a tad too slow to do it within the timeout at about 1.2 days.

Re: 13400 assigned to GPUs that are way too slow

Posted: Mon Apr 27, 2020 6:53 am
by Theonlycure
I have an RTX 2080 Ti and WU 13400 is way too slow on it as well. I usually plow through the work units and get near max credit. This one, however, is a slug: only 45% finished and an estimated 9 hr 42 min left. This would not bother me except for the fact that the points don't reflect how much time and electricity I am expending. Estimated credit 317635. Very sad.

Re: 13400 assigned to GPUs that are way too slow

Posted: Mon Apr 27, 2020 10:53 am
by HaloJones
Theonlycure wrote: I have an RTX 2080 Ti and WU 13400 is way too slow on it as well. I usually plow through the work units and get near max credit. This one, however, is a slug: only 45% finished and an estimated 9 hr 42 min left. This would not bother me except for the fact that the points don't reflect how much time and electricity I am expending. Estimated credit 317635. Very sad.
Can you provide a little detail?

What OS?
What "client-type" do you have set? Advanced? Beta?

Re: 13400 assigned to GPUs that are way too slow

Posted: Mon Apr 27, 2020 11:09 am
by Basti
Ran into this today, too.
Did not change anything in config.

Code:

# Project
13400

# os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"

# CPU
model name      : Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz

# GPU
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Polaris11] (rev ff) (prog-if 00 [VGA controller])
        Subsystem: Sapphire Technology Limited Baffin [Radeon RX 550 640SP / RX 560/560X] (Radeon RX 550 640SP)

# uptime
 13:02:42 up 15 days, 19:33,  1 user,  load average: 0,11, 0,08, 0,09

Code:

# log
06:42:10:WU01:FS01:0x22:Completed 1860000 out of 2000000 steps (93%)
07:13:58:WU01:FS01:0x22:Completed 1880000 out of 2000000 steps (94%)
07:45:44:WU01:FS01:0x22:Completed 1900000 out of 2000000 steps (95%)
07:45:53:WARNING:WU01:FS01:Past final deadline 2020-04-27T07:45:52Z, dumping
07:45:53:WU01:FS01:Shutting core down
07:45:53:WU01:FS01:0x22:Caught signal SIGINT(2) on PID 1642
07:45:53:WU01:FS01:0x22:Exiting, please wait. . .
07:45:53:WU01:FS01:0x22:Folding@home Core Shutdown: INTERRUPTED
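For anyone wanting to see ahead of time whether a WU will be dumped like this, the progress lines can be turned into a rough run-time estimate. The following is a minimal sketch, not a client feature: the `estimate_total_hours` helper and its regular expression are written just for the three progress lines in the log above, and it does not handle timestamps that wrap past midnight.

```python
import re

# Progress lines copied from the log excerpt above.
LOG = """\
06:42:10:WU01:FS01:0x22:Completed 1860000 out of 2000000 steps (93%)
07:13:58:WU01:FS01:0x22:Completed 1880000 out of 2000000 steps (94%)
07:45:44:WU01:FS01:0x22:Completed 1900000 out of 2000000 steps (95%)
"""

# Matches "HH:MM:SS:WUxx:FSxx:0xNN:Completed <done> out of <total> steps".
PROGRESS = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):WU\d+:FS\d+:0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps"
)


def estimate_total_hours(log_text):
    """Extrapolate the full WU run time from the first and last progress line."""
    samples = []
    for line in log_text.splitlines():
        match = PROGRESS.match(line)
        if match:
            hh, mm, ss, done, total = map(int, match.groups())
            samples.append((hh * 3600 + mm * 60 + ss, done, total))
    if len(samples) < 2:
        raise ValueError("need at least two progress lines")
    (t0, done0, total), (t1, done1, _) = samples[0], samples[-1]
    seconds_per_step = (t1 - t0) / (done1 - done0)
    return seconds_per_step * total / 3600


total_hours = estimate_total_hours(LOG)
print(f"Estimated full run time: {total_hours:.1f} h")
```

At the pace shown in this log the whole WU would take roughly 53 hours, which is longer than the 2-day deadline mentioned in the first post and fits with the WU being dumped at 95%.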

Code:

# config.xml
<config>
  <!-- Client Control -->
  <fold-anon v='true'/>

  <!-- HTTP Server -->
  <allow v='10.20.30.20'/>

  <!-- Network -->
  <proxy v=':8080'/>

  <!-- Remote Command Server -->
  <password v='*'/>

  <!-- Slot Control -->
  <power v='full'/>

  <!-- User Information -->
  <passkey v='*'/>
  <team v='*'/>
  <user v='*'/>

  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'/>