
Project: 17201 (Run 0, Clone 2568, Gen 185) reduced threads

Posted: Mon Sep 21, 2020 6:25 pm
by HendricksSA
Have we made a change to the server code to reduce threads to avoid domain decomposition errors? I noticed the fans on my 48 thread machine were idle and found it was only using 9 threads to process this Project: 17201 (Run 0, Clone 2568, Gen 185). Looking through the log, I found this. It is the first time I've noticed it.

17:29:44:WU00:FS00:Connecting to assign1.foldingathome.org:80
17:29:45:WARNING:WU00:FS00:Failed to get assignment from 'assign1.foldingathome.org:80': No WUs available for this configuration
17:29:45:WU00:FS00:Connecting to assign2.foldingathome.org:80
17:29:45:WU00:FS00:Assigned to work server 128.252.203.10
17:29:45:WU00:FS00:Requesting new work unit for slot 00: READY cpu:48 from 128.252.203.10
17:29:45:WU00:FS00:Connecting to 128.252.203.10:8080
17:29:46:WU00:FS00:Downloading 2.22MiB
17:29:47:WU00:FS00:Download complete
17:29:47:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:17201 run:0 clone:2568 gen:185 core:0xa7 unit:0x000000d480fccb0a5f32fed6c8584138
17:29:47:WU00:FS00:Starting
17:29:47:WARNING:WU00:FS00:AS lowered CPUs from 48 to 9
17:29:47:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 706 -lifeline 1876 -checkpoint 15 -np 9
17:29:47:WU00:FS00:Started FahCore on PID 10403
17:29:47:WU00:FS00:Core PID:10407
17:29:47:WU00:FS00:FahCore 0xa7 started
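
For anyone who wants to check their own logs for the same thing, here is a minimal sketch that scans log text for the "AS lowered CPUs" warning. The pattern is based only on the single warning line in the excerpt above, so treat the exact message format as an assumption; the function name `find_thread_reductions` is made up for this example.

```python
import re

# Assumed format, taken from the one warning line in the log excerpt above:
#   ...:WU00:FS00:AS lowered CPUs from 48 to 9
LOWERED = re.compile(r"(WU\d+):(FS\d+):AS lowered CPUs from (\d+) to (\d+)")


def find_thread_reductions(log_text):
    """Return (work_unit, slot, requested, assigned) for each matching line."""
    results = []
    for line in log_text.splitlines():
        # search() rather than match(), so a timestamp or leftover color
        # code at the start of the line does not matter.
        match = LOWERED.search(line)
        if match:
            wu, slot, requested, assigned = match.groups()
            results.append((wu, slot, int(requested), int(assigned)))
    return results


sample = "17:29:47:WARNING:WU00:FS00:AS lowered CPUs from 48 to 9"
print(find_thread_reductions(sample))
```

Feeding it the contents of a client log file would list every work unit where the thread count was lowered.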

Re: Project: 17201 (Run 0, Clone 2568, Gen 185) reduced threads

Posted: Mon Sep 21, 2020 7:17 pm
by bruce
Yes. GROMACS has some severe limitations on the number of threads that can be used on a project. The version used in FAHCore_a7 uses some hack-like corrections to enable it to work. The upcoming version that will be in FAHCore_a8 will change all that, but there will still be some similar issues. Domain decomposition was designed back when CPUs had 1, 2, 4, 8, 12, or 16 cores. Nobody could conceive of trying to use as many threads as you have.

The OpenMM code used on GPUs is entirely different ... and if it reduces the parallelism to avoid specific problems, it doesn't tell you about it.

FAH is considering some improvements for CPUs.

I recommend using several CPU slots while avoiding any thread counts with large prime factors.
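
To make the prime-factor advice concrete, here is a rough sketch that lists the even ways to split a machine's threads across slots and keeps only the per-slot counts whose largest prime factor is small. The cutoff `max_prime=5` is an assumption chosen for illustration, not an official FAH rule, and the function names are made up for this example.

```python
def largest_prime_factor(n):
    """Return the largest prime factor of n (returns 1 for n == 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:
        # Whatever remains after trial division is itself prime.
        largest = n
    return largest


def even_splits(total_threads, max_prime=5):
    """List (slots, threads_per_slot) pairs that divide total_threads evenly
    and whose per-slot count has no prime factor above max_prime.

    max_prime=5 is an illustrative threshold, not a documented limit.
    """
    splits = []
    for slots in range(1, total_threads + 1):
        if total_threads % slots == 0:
            per_slot = total_threads // slots
            if largest_prime_factor(per_slot) <= max_prime:
                splits.append((slots, per_slot))
    return splits


for slots, per_slot in even_splits(48):
    print(f"{slots} slot(s) x {per_slot} threads")
```

On a 48 thread machine this includes options such as 2 slots of 24 or 3 slots of 16. For a count like 28, the same check rejects 28, 14, and 7 threads per slot because of the factor 7.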

Re: Project: 17201 (Run 0, Clone 2568, Gen 185) reduced threads

Posted: Mon Sep 21, 2020 7:23 pm
by Joe_H
The code has been there for some time. Provided the server settings are all set correctly, if a WU is not available for the requested CPU thread count, a WU that will use fewer threads will be assigned.

What is unusual here is that the AS went so far down; usually there are WUs available for somewhat higher thread counts. It would be more normal to see a reduction to 32 or 24, for example.

The A8 folding core is less tied to domain decomposition numbers, and there are projects waiting to be created once new servers are ready. Some smaller projects may not use a large number of threads as efficiently, but they will still process. For right now, though, there is a bit of a shortage of CPU WUs, especially for higher thread counts.

Re: Project: 17201 (Run 0, Clone 2568, Gen 185) reduced threads

Posted: Tue Sep 22, 2020 3:47 am
by PantherX
There were discussions about what to do when a machine with X CPUs requests work and none is available. Instead of leaving the CPU idle, the idea was to assign a WU for Y CPUs, where Y < X, so you can still contribute. As Joe_H mentioned, this is due to a shortage of CPU WUs under some conditions, which will hopefully be resolved soon.
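
The fallback idea can be sketched in a few lines. This is only an illustration of the "Y < X" rule as described in this thread, not the actual assignment server code; the function name and the list of available thread counts are hypothetical.

```python
def pick_thread_count(requested, available_counts):
    """Return the largest available thread count that does not exceed the
    request, or None if nothing fits.

    Illustration only: the real assignment server logic is not shown in
    this thread, and available_counts is a made-up input.
    """
    candidates = [count for count in available_counts if count <= requested]
    return max(candidates) if candidates else None


# Hypothetical situation: a 48 thread request when only WUs configured
# for 4 or 9 threads are on offer.
print(pick_thread_count(48, [4, 9]))
```

With more WUs available at higher counts, for example 24 or 32, the same rule would pick the higher number, which matches what Joe_H describes as the more usual outcome.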