
There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 1:24 pm
by bamapolitics
I keep getting WUs that won't fold on 24 cores, so they just sit there doing nothing all night while I am asleep. How can I avoid these?

Code:

13:22:48:WU00:FS00:0xa7:*********************** Log Started 2020-04-21T13:22:47Z ***********************
13:22:48:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
13:22:48:WU00:FS00:0xa7:       Type: 0xa7
13:22:48:WU00:FS00:0xa7:       Core: Gromacs
13:22:48:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 129135 -checkpoint 5 -np
13:22:48:WU00:FS00:0xa7:             24
13:22:48:WU00:FS00:0xa7:************************************ CBang *************************************
13:22:48:WU00:FS00:0xa7:       Date: Nov 5 2019
13:22:48:WU00:FS00:0xa7:       Time: 06:06:57
13:22:48:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
13:22:48:WU00:FS00:0xa7:     Branch: master
13:22:48:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
13:22:48:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
13:22:48:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
13:22:48:WU00:FS00:0xa7:       Bits: 64
13:22:48:WU00:FS00:0xa7:       Mode: Release
13:22:48:WU00:FS00:0xa7:************************************ System ************************************
13:22:48:WU00:FS00:0xa7:        CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
13:22:48:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
13:22:48:WU00:FS00:0xa7:       CPUs: 32
13:22:48:WU00:FS00:0xa7:     Memory: 15.55GiB
13:22:48:WU00:FS00:0xa7:Free Memory: 500.07MiB
13:22:48:WU00:FS00:0xa7:    Threads: POSIX_THREADS
13:22:48:WU00:FS00:0xa7: OS Version: 5.5
13:22:48:WU00:FS00:0xa7:Has Battery: false
13:22:48:WU00:FS00:0xa7: On Battery: false
13:22:48:WU00:FS00:0xa7: UTC Offset: -5
13:22:48:WU00:FS00:0xa7:        PID: 129139
13:22:48:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
13:22:48:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
13:22:48:WU00:FS00:0xa7:    Version: 0.0.18
13:22:48:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
13:22:48:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
13:22:48:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
13:22:48:WU00:FS00:0xa7:       Date: Nov 5 2019
13:22:48:WU00:FS00:0xa7:       Time: 06:13:26
13:22:48:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
13:22:48:WU00:FS00:0xa7:     Branch: master
13:22:48:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
13:22:48:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
13:22:48:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
13:22:48:WU00:FS00:0xa7:       Bits: 64
13:22:48:WU00:FS00:0xa7:       Mode: Release
13:22:48:WU00:FS00:0xa7:************************************ Build *************************************
13:22:48:WU00:FS00:0xa7:       SIMD: avx_256
13:22:48:WU00:FS00:0xa7:********************************************************************************
13:22:48:WU00:FS00:0xa7:Project: 14576 (Run 0, Clone 770, Gen 67)
13:22:48:WU00:FS00:0xa7:Unit: 0x00000053287234c95e792335830607aa
13:22:48:WU00:FS00:0xa7:Reading tar file core.xml
13:22:48:WU00:FS00:0xa7:Reading tar file frame67.tpr
13:22:48:WU00:FS00:0xa7:Digital signatures verified
13:22:48:WU00:FS00:0xa7:Calling: mdrun -s frame67.tpr -o frame67.trr -x frame67.xtc -cpt 5 -nt 24
13:22:48:WU00:FS00:0xa7:Steps: first=33500000 total=500000
13:22:48:WU00:FS00:0xa7:ERROR:
13:22:48:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
13:22:48:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
13:22:48:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
13:22:48:WU00:FS00:0xa7:ERROR:
13:22:48:WU00:FS00:0xa7:ERROR:Fatal error:
13:22:48:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
13:22:48:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
13:22:48:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
13:22:48:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
13:22:48:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
13:22:48:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
13:22:52:WU00:FS00:0xa7:WARNING:Unexpected exit() call
13:22:52:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
13:22:52:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
13:22:52:WU00:FS00:0xa7:Saving result file md.log
13:22:52:WU00:FS00:0xa7:Saving result file science.log
13:22:53:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 1:29 pm
by JimboPalmer
In the taskbar at the lower right of the screen, you should see an F@H molecule icon; click it (you may need to click the Up Arrow ^ to see it)

The second item in this menu is Advanced Control, click it

On this screen to the left is a Configure button, click it

Now you get a screen with a Slots tab, click it

In the white field there should be a cpu item; click it and then click Edit

By default F@H sets the number of CPUs to -1, meaning let the software decide.
You can enter any number from 1 to the number of threads your CPU supports.

If you have GPUs, F@H reserves one CPU per GPU to feed it data across the PCIE bus.

F@H has difficulty when the number of CPUs is a large prime or a multiple of a large prime.
7 is always large, 5 is sometimes large, and 3 is never large. Try to choose a number that is a multiple of 2 and/or 3.
2, 3, 4, 6, 8, 9, 12, 16, 18, 20, 21, 24, 27, 30, 32 are usually good numbers of CPUs to choose. (_r2w_ben has advised me of more good numbers)
5, 10, 15, 20, 25, 28 may work most of the time. Other numbers will bite you.

Type the number you want and click Save. (I would try 18 or 16; if you want, you can add another slot with 12 CPUs, since you have 32.)
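
The rule of thumb above (7 is always large, 5 is sometimes large, 3 is never large) can be sketched as a small check on the largest prime factor of a thread count. This is only an illustration of that rule; the function names and the three labels are made up for this example and are not part of the F@H client.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (1 when n == 1)."""
    if n < 1:
        raise ValueError("thread count must be positive")
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:  # whatever is left over is itself prime
        largest = n
    return largest


def rate_thread_count(n: int) -> str:
    """Apply the rule of thumb from the post (hypothetical labels)."""
    lpf = largest_prime_factor(n)
    if lpf <= 3:
        return "usually fine"
    if lpf == 5:
        return "may work"
    return "risky"


if __name__ == "__main__":
    for count in (16, 18, 20, 24, 28):
        print(count, rate_thread_count(count))
```

Applied strictly, the rule rates 16, 18 and 24 as "usually fine", 20 as "may work" and 28 as "risky". It looks only at the slot's thread count, not at what a particular project does with it.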

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 1:45 pm
by bamapolitics
I've already done that and have it set to 24. Yet it is downloading WUs that won't work on 24 and just sits there. Is there no way I can just avoid those WUs?

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 2:08 pm
by ajm
This looks like a bad configuration of the WU. Quite a number of these errors have been reported in the forum, and they are spread across the whole range of CPUs, from 4 to 96 threads.
Do you use the client-type advanced (or beta)?

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 3:14 pm
by Neil-B

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 4:27 pm
by foldy
The message "There is no domain decomposition for 20 ranks" means you need to reduce the CPU slot to 20 threads. The same work unit will then run after a pause and resume. You can create a second CPU slot for the remaining threads.

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 4:40 pm
by Neil-B
The message means that for this WU the Gromacs core has split the work into two parts (the main part and a PME part). Unfortunately this has left the main part with 20 cores, a multiple of 5, which is not good and in some circumstances (such as this one) can cause issues. Reducing the core count of the slot (going from 24 to 21 might actually work) should resolve this, and the WU should complete.
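
To make the error message concrete, here is a rough sketch of the idea behind it: the main part has to be cut into an nx x ny x nz grid of cells, and no cell may be thinner than the minimum cell size. The 1.37225 nm value comes from the log above, but the 6.0 nm cubic box is purely an assumed figure for illustration (the real box of project 14576 is not given in this thread), and the real GROMACS check takes more into account than this.

```python
from itertools import product


def feasible_grids(ranks, box_nm, min_cell_nm):
    """List (nx, ny, nz) grids with nx*ny*nz == ranks whose cells are
    at least min_cell_nm wide along every box dimension."""
    # Most cells that fit along each axis without going below the minimum.
    max_cells = [max(1, int(length // min_cell_nm)) for length in box_nm]
    grids = []
    for nx, ny, nz in product(*(range(1, m + 1) for m in max_cells)):
        if nx * ny * nz == ranks:
            grids.append((nx, ny, nz))
    return grids


if __name__ == "__main__":
    box = (6.0, 6.0, 6.0)   # ASSUMED box size in nm, not from the WU
    min_cell = 1.37225      # minimum cell size reported in the log
    for ranks in (16, 18, 20):
        print(ranks, feasible_grids(ranks, box, min_cell))
```

With this assumed box at most 4 cells fit along each axis, so 20 ranks (which needs a 5 somewhere) has no valid grid, while 16 and 18 do.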

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 5:55 pm
by Tohya
Once you have downloaded a WU you can never increase the number of cores it can use, only lower them. In this case you would want to try 18, or lower, and have the client finish the WU and then change your cores back to 24 or 21.

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 8:15 pm
by Neil-B
Correct me if I am wrong, but this was downloaded for a 24-core slot (which has split 4 PME off) - won't that mean that 21 is a reduction in this case? … granted, gromacs may then use a PME of 5 with 16, but that should still be ok.
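
The arithmetic in this post can be written out as a short sketch. The PME counts (4 for a 24-thread slot, 5 for a 21-thread slot) are the figures quoted in this thread, the second one being a guess; the code does not try to reproduce how gromacs picks them, and the function names are made up for this example.

```python
def prime_factors(n: int) -> list:
    """Return the prime factors of n in ascending order."""
    factors = []
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            factors.append(factor)
            n //= factor
        factor += 1
    if n > 1:
        factors.append(n)
    return factors


def describe_split(total_threads: int, pme_ranks: int):
    """Return the main (non-PME) rank count and its prime factors."""
    main_ranks = total_threads - pme_ranks
    return main_ranks, prime_factors(main_ranks)


if __name__ == "__main__":
    # (slot threads, PME ranks) pairs as quoted in the thread
    for total, pme in ((24, 4), (21, 5)):
        print(total, pme, describe_split(total, pme))
```

The 24-thread case leaves 20 main ranks with a factor of 5, while the guessed 21-thread case leaves 16, which is a pure power of 2.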

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 8:58 pm
by bamapolitics
So I obviously should not set it to 24 if I cannot monitor it 24/7? I have a 1950X but need some free cores for Plex and day-to-day work.
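
For an unattended machine, one option is to scan the client log for this specific error. Below is a minimal sketch that only parses the message text shown in the logs in this thread; the function name is made up, and where the log lives and what to do after a match are left open because they depend on the install.

```python
import re

# Matches the GROMACS error line as it appears in the logs in this thread.
DOMDEC_ERROR = re.compile(
    r"There is no domain decomposition for (\d+) ranks .*?"
    r"minimum cell size of ([0-9.]+) nm"
)


def find_domdec_errors(log_text: str):
    """Return a list of (ranks, min_cell_nm) for each matching log line."""
    results = []
    for line in log_text.splitlines():
        match = DOMDEC_ERROR.search(line)
        if match:
            results.append((int(match.group(1)), float(match.group(2))))
    return results


if __name__ == "__main__":
    sample = (
        "13:22:48:WU00:FS00:0xa7:ERROR:There is no domain decomposition "
        "for 20 ranks that is compatible with the given box and a minimum "
        "cell size of 1.37225 nm"
    )
    print(find_domdec_errors(sample))
```

A script like this could be run periodically against the log text, so that a stuck WU is noticed without watching the client all day.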

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 9:01 pm
by Joe_H
I will pass on the report to the researcher running this project. It may have to be restricted from running on systems configured for 24 CPU threads.

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 9:09 pm
by Neil-B
The really odd thing is that I have run 24-core and 32-core slots on advanced/beta for the last month or so and haven't had a failure from the myriad projects they have folded - this is a Project that went through Beta before I re-joined folding last month, and I haven't had one of these on my 24-core slot … Normally 24 would have been a "safe bet" :(

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 9:46 pm
by parkut
Here is another report of what looks like the same issue.

Project: 14576 (Run 0, Clone 284, Gen 77) - domain decom 20 ranks
Four Quad Xeon X7460's @ 2.66GHz in Dell Poweredge R900, Total of 24 cores available

This particular remote machine has the domain decomp issue. There were at least two other Work Units that had the same issue.

Code:

*********************** Log Started 2020-04-20T14:42:25Z ***********************
14:42:25:************************* Folding@home Client *************************
14:42:25:    Website: https://foldingathome.org/
14:42:25:  Copyright: (c) 2009-2018 foldingathome.org
14:42:25:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
14:42:25:       Args: --child --lifeline 27058 /etc/fahclient/config.xml --run-as
14:42:25:             fahclient --pid-file=/var/run/fahclient.pid --daemon
14:42:25:     Config: /etc/fahclient/config.xml
14:42:25:******************************** Build ********************************
14:42:25:    Version: 7.5.1
14:42:25:       Date: May 11 2018
14:42:25:       Time: 19:59:04
14:42:25: Repository: Git
14:42:25:   Revision: 4705bf53c635f88b8fe85af7675557e15d491ff0
14:42:25:     Branch: master
14:42:25:   Compiler: GNU 6.3.0 20170516
14:42:25:    Options: -std=gnu++98 -O3 -funroll-loops
14:42:25:   Platform: linux2 4.14.0-3-amd64
14:42:25:       Bits: 64
14:42:25:       Mode: Release
14:42:25:******************************* System ********************************
14:42:25:        CPU: Intel(R) Xeon(R) CPU X7460 @ 2.66GHz
14:42:25:     CPU ID: GenuineIntel Family 6 Model 29 Stepping 1
14:42:25:       CPUs: 24
14:42:25:     Memory: 78.65GiB
14:42:25:Free Memory: 77.50GiB
14:42:25:    Threads: POSIX_THREADS
14:42:25: OS Version: 4.15
14:42:25:Has Battery: false
14:42:25: On Battery: false
14:42:25: UTC Offset: 0
14:42:25:        PID: 27061
14:42:25:        CWD: /var/lib/fahclient
14:42:25:         OS: Linux 4.15.0-96-generic x86_64
14:42:25:    OS Arch: AMD64
14:42:25:       GPUs: 0
14:42:25:       CUDA: Not detected: Failed to open dynamic library 'libcuda.so':
14:42:25:             libcuda.so: cannot open shared object file: No such file or
14:42:25:             directory
14:42:25:     OpenCL: Not detected: Failed to open dynamic library 'libOpenCL.so':
14:42:25:             libOpenCL.so: cannot open shared object file: No such file or
14:42:25:             directory
14:42:25:***********************************************************************

Code:

20:29:50:WU00:FS00:Starting
20:29:50:WU00:FS00:Removing old file './work/00/logfile_01-20200421-195750.txt'
20:29:50:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 27061 -checkpoint 15 -np 24
20:29:50:WU00:FS00:Started FahCore on PID 9496
20:29:50:WU00:FS00:Core PID:9500
20:29:50:WU00:FS00:FahCore 0xa7 started
20:29:51:WU00:FS00:0xa7:*********************** Log Started 2020-04-21T20:29:50Z ***********************
20:29:51:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
20:29:51:WU00:FS00:0xa7:       Type: 0xa7
20:29:51:WU00:FS00:0xa7:       Core: Gromacs
20:29:51:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 9496 -checkpoint 15 -np
20:29:51:WU00:FS00:0xa7:             24
20:29:51:WU00:FS00:0xa7:************************************ CBang *************************************
20:29:51:WU00:FS00:0xa7:       Date: Nov 5 2019
20:29:51:WU00:FS00:0xa7:       Time: 05:57:01
20:29:51:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
20:29:51:WU00:FS00:0xa7:     Branch: master
20:29:51:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
20:29:51:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
20:29:51:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
20:29:51:WU00:FS00:0xa7:       Bits: 64
20:29:51:WU00:FS00:0xa7:       Mode: Release
20:29:51:WU00:FS00:0xa7:************************************ System ************************************
20:29:51:WU00:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU X7460 @ 2.66GHz
20:29:51:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 29 Stepping 1
20:29:51:WU00:FS00:0xa7:       CPUs: 24
20:29:51:WU00:FS00:0xa7:     Memory: 78.65GiB
20:29:51:WU00:FS00:0xa7:Free Memory: 77.31GiB
20:29:51:WU00:FS00:0xa7:    Threads: POSIX_THREADS
20:29:51:WU00:FS00:0xa7: OS Version: 4.15
20:29:51:WU00:FS00:0xa7:Has Battery: false
20:29:51:WU00:FS00:0xa7: On Battery: false
20:29:51:WU00:FS00:0xa7: UTC Offset: 0
20:29:51:WU00:FS00:0xa7:        PID: 9500
20:29:51:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
20:29:51:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
20:29:51:WU00:FS00:0xa7:    Version: 0.0.18
20:29:51:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
20:29:51:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
20:29:51:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
20:29:51:WU00:FS00:0xa7:       Date: Nov 5 2019
20:29:51:WU00:FS00:0xa7:       Time: 06:13:26
20:29:51:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
20:29:51:WU00:FS00:0xa7:     Branch: master
20:29:51:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
20:29:51:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
20:29:51:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
20:29:51:WU00:FS00:0xa7:       Bits: 64
20:29:51:WU00:FS00:0xa7:       Mode: Release
20:29:51:WU00:FS00:0xa7:************************************ Build *************************************
20:29:51:WU00:FS00:0xa7:       SIMD: sse2
20:29:51:WU00:FS00:0xa7:********************************************************************************
20:29:51:WU00:FS00:0xa7:Project: 14576 (Run 0, Clone 284, Gen 77)
20:29:51:WU00:FS00:0xa7:Unit: 0x00000062287234c95e7923dcf8465b19
20:29:51:WU00:FS00:0xa7:Reading tar file core.xml
20:29:51:WU00:FS00:0xa7:Reading tar file frame77.tpr
20:29:51:WU00:FS00:0xa7:Digital signatures verified
20:29:51:WU00:FS00:0xa7:Calling: mdrun -s frame77.tpr -o frame77.trr -x frame77.xtc -cpt 15 -nt 24
20:29:51:WU00:FS00:0xa7:Steps: first=38500000 total=500000
20:29:51:WU00:FS00:0xa7:ERROR:
20:29:51:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
20:29:51:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
20:29:51:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-sse-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
20:29:51:WU00:FS00:0xa7:ERROR:
20:29:51:WU00:FS00:0xa7:ERROR:Fatal error:
20:29:51:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
20:29:51:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
20:29:51:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
20:29:51:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
20:29:51:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
20:29:51:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
20:29:55:WU00:FS00:0xa7:WARNING:Unexpected exit() call
20:29:55:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
20:29:55:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
20:29:55:WU00:FS00:0xa7:Saving result file md.log
20:29:55:WU00:FS00:0xa7:Saving result file science.log
20:29:56:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

Re: There is no domain decomposition for 20 ranks

Posted: Tue Apr 21, 2020 9:48 pm
by Joe_H
@Neil-B

Well, in one case that _r2w_ben analyzed in his topic on domain decomposition, two different runs of one project had different usable thread counts. That could be happening here, or something else.

Strangely enough, there are some projects that are good with a thread count of 21 (3x7). But so many others fail once a multiple of 7 gets used that it is usually not an allowed setting.