Page 1 of 3

_a7 core crashing in Gromacs

Posted: Fri Apr 10, 2020 10:15 pm
by kyleedwardsny
I'm seeing the following error in my logs:

Code: Select all

22:11:48:WU01:FS00:0xa7:ERROR:
22:11:48:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
22:11:48:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
22:11:48:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
22:11:48:WU01:FS00:0xa7:ERROR:
22:11:48:WU01:FS00:0xa7:ERROR:Fatal error:
22:11:48:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
22:11:48:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
22:11:48:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
22:11:48:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
22:11:48:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
22:11:48:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
22:11:53:WU01:FS00:0xa7:WARNING:Unexpected exit() call
22:11:53:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
22:11:53:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
22:11:53:WU01:FS00:0xa7:Saving result file md.log
22:11:53:WU01:FS00:0xa7:Saving result file science.log
22:11:53:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
I am running F@H on a 12-core (24-thread) processor, and a quick look at the Gromacs FAQ (I can't post the link due to being a new user) suggests it's trying to do too much parallelization. What can I do to fix this?

Re: _a7 core crashing in Gromacs

Posted: Fri Apr 10, 2020 10:41 pm
by JimboPalmer
Welcome to Folding@Home!

This project is objecting to 20 ranks. It sometimes happens that CPU counts divisible by 5 are not handled well.

I am going to suggest that you edit your existing CPU slot down to 16 CPUs and then add another CPU slot with 8 CPUs.

Here is a generic walkthrough of how to change the number of threads (F@H calls them CPUs) used.

In the taskbar at the lower right of the screen you should see an F@H molecule icon; click it (you may need to click an up arrow ^ to see it).

The second item in this menu is Advanced Control; click it.

On this screen, to the left, is a Configure button; click it.

Now you get a screen with a Slots tab; click it.

In this white field should be a cpu item; click it and then click Edit.

By default F@H sets the number of CPUs to -1, meaning the software decides.
You can enter any number from 1 up to the number of threads your CPU supports.
If you have GPUs, F@H reserves one CPU per GPU to feed it data across the PCIe bus.

F@H has difficulty with CPU counts that are large primes or multiples of large primes.
7 is always large, 5 is sometimes large, and 3 is never large. Try to choose a number that is a multiple of 2 and/or 3.
2, 3, 4, 6, 8, 9, 12, 16, 18, etc. are good numbers of CPUs to choose.
5, 10, 15, 20, etc. may work most of the time. Other numbers will bite you.

Type the number you want, and click save.

You can also add a second CPU slot on this same screen; the same advice applies. 6 or 8 are good choices (6 if you have a GPU that folds).
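
If you end up running the client headless (no GUI), the same setting lives in config.xml as a cpus option inside each CPU slot. Here is a minimal sketch of what that section might look like - the counts are just examples, and stop FAHClient before editing the file:

Code: Select all

<config>
  <!-- Folding Slots -->
  <slot id='0' type='CPU'>
    <cpus v='16'/>  <!-- example value; -1 (the default) lets the client decide -->
  </slot>
  <slot id='1' type='CPU'>
    <cpus v='8'/>
  </slot>
</config>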

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 1:29 am
by kyleedwardsny
Hi JimboPalmer, thanks for the reply.

I am not using Windows or a graphical application. This is running on a Linux server inside a Docker container. I do not have a GPU. Could you please point me to the config files I need to edit?

I also don't understand why it tries to use only 20 ranks when I have 24 CPUs.

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 1:39 am
by PantherX
Can you please post your log file? Ensure that you have copied the system configuration, which is present at the start of the log file (viewtopic.php?f=72&t=26036).

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 2:22 am
by _r2w_ben
If you catch one of these actively failing, can you also post this section of md.log located deep within the work folder? Search for "Domain Decomposition".

Code: Select all

Initializing Domain Decomposition on 2 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.370 nm, LJC-14 q, atoms 13 15
  multi-body bonded interactions: 0.370 nm, Proper Dih., atoms 13 15
Minimum cell size due to bonded interactions: 0.407 nm
Maximum distance for 13 constraints, at 120 deg. angles, all-trans: 0.204 nm
Estimated maximum distance required for P-LINCS: 0.204 nm
Using 0 separate PME ranks, as there are too few total
 ranks for efficient splitting
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 2 cells with a minimum initial size of 0.509 nm
The maximum allowed number of cells is: X 6 Y 6 Z 6
Domain decomposition grid 2 x 1 x 1, separate PME ranks 0
PME domain decomposition: 2 x 1 x 1
Domain decomposition rank 0, coordinates 0 0 0

Using 2 MPI threads

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 2:42 am
by kyleedwardsny
The contents of /config/log.txt are the following, over and over again:

Code: Select all

02:35:31:WU01:FS00:Starting
02:35:31:WU01:FS00:Removing old file './work/01/logfile_01-20200411-020331.txt'
02:35:31:WU01:FS00:Running FahCore: /app/usr/bin/FAHCoreWrapper /config/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 257 -checkpoint 15 -np 24
02:35:31:WU01:FS00:Started FahCore on PID 8340
02:35:31:WU01:FS00:Core PID:8344
02:35:31:WU01:FS00:FahCore 0xa7 started
02:35:32:WU01:FS00:0xa7:*********************** Log Started 2020-04-11T02:35:31Z ***********************
02:35:32:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
02:35:32:WU01:FS00:0xa7:       Type: 0xa7
02:35:32:WU01:FS00:0xa7:       Core: Gromacs
02:35:32:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 8340 -checkpoint 15 -np
02:35:32:WU01:FS00:0xa7:             24
02:35:32:WU01:FS00:0xa7:************************************ CBang *************************************
02:35:32:WU01:FS00:0xa7:       Date: Nov 5 2019
02:35:32:WU01:FS00:0xa7:       Time: 06:06:57
02:35:32:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
02:35:32:WU01:FS00:0xa7:     Branch: master
02:35:32:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
02:35:32:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
02:35:32:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
02:35:32:WU01:FS00:0xa7:       Bits: 64
02:35:32:WU01:FS00:0xa7:       Mode: Release
02:35:32:WU01:FS00:0xa7:************************************ System ************************************
02:35:32:WU01:FS00:0xa7:        CPU: AMD Ryzen Threadripper 1920X 12-Core Processor
02:35:32:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
02:35:32:WU01:FS00:0xa7:       CPUs: 24
02:35:32:WU01:FS00:0xa7:     Memory: 31.34GiB
02:35:32:WU01:FS00:0xa7:Free Memory: 26.81GiB
02:35:32:WU01:FS00:0xa7:    Threads: POSIX_THREADS
02:35:32:WU01:FS00:0xa7: OS Version: 4.15
02:35:32:WU01:FS00:0xa7:Has Battery: false
02:35:32:WU01:FS00:0xa7: On Battery: false
02:35:32:WU01:FS00:0xa7: UTC Offset: -4
02:35:32:WU01:FS00:0xa7:        PID: 8344
02:35:32:WU01:FS00:0xa7:        CWD: /config/work
02:35:32:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
02:35:32:WU01:FS00:0xa7:    Version: 0.0.18
02:35:32:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
02:35:32:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
02:35:32:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
02:35:32:WU01:FS00:0xa7:       Date: Nov 5 2019
02:35:32:WU01:FS00:0xa7:       Time: 06:13:26
02:35:32:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
02:35:32:WU01:FS00:0xa7:     Branch: master
02:35:32:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
02:35:32:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
02:35:32:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
02:35:32:WU01:FS00:0xa7:       Bits: 64
02:35:32:WU01:FS00:0xa7:       Mode: Release
02:35:32:WU01:FS00:0xa7:************************************ Build *************************************
02:35:32:WU01:FS00:0xa7:       SIMD: avx_256
02:35:32:WU01:FS00:0xa7:********************************************************************************
02:35:32:WU01:FS00:0xa7:Project: 16417 (Run 803, Clone 2, Gen 27)
02:35:32:WU01:FS00:0xa7:Unit: 0x0000001f96880e6e5e8a61cb0e4c6ac6
02:35:32:WU01:FS00:0xa7:Reading tar file core.xml
02:35:32:WU01:FS00:0xa7:Reading tar file frame27.tpr
02:35:32:WU01:FS00:0xa7:Digital signatures verified
02:35:32:WU01:FS00:0xa7:Calling: mdrun -s frame27.tpr -o frame27.trr -x frame27.xtc -cpt 15 -nt 24
02:35:32:WU01:FS00:0xa7:Steps: first=6750000 total=250000
02:35:32:WU01:FS00:0xa7:ERROR:
02:35:32:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
02:35:32:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
02:35:32:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
02:35:32:WU01:FS00:0xa7:ERROR:
02:35:32:WU01:FS00:0xa7:ERROR:Fatal error:
02:35:32:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
02:35:32:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
02:35:32:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
02:35:32:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
02:35:32:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
02:35:32:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
02:35:37:WU01:FS00:0xa7:WARNING:Unexpected exit() call
02:35:37:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
02:35:37:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
02:35:37:WU01:FS00:0xa7:Saving result file md.log
02:35:37:WU01:FS00:0xa7:Saving result file science.log
02:35:37:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
The relevant contents of /config/work/01/01/md.log are the following:

Code: Select all

Initializing Domain Decomposition on 24 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.425 nm, LJ-14, atoms 3095 3104
  multi-body bonded interactions: 0.425 nm, Proper Dih., atoms 3095 3104
Minimum cell size due to bonded interactions: 0.468 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.138 nm
Estimated maximum distance required for P-LINCS: 1.138 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.17
Will use 20 particle-particle and 4 PME only ranks
This is a guess, check the performance at the end of the log file
Using 4 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 20 cells with a minimum initial size of 1.423 nm
The maximum allowed number of cells is: X 4 Y 4 Z 4
Please bear with me, I am brand new to F@H and don't know much about it :)

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 6:47 am
by PantherX
The start of the log shows how your FAHClient is configured. It looks something like this:

Code: Select all

*********************** Log Started 2020-04-11T03:29:02Z ***********************
03:29:02:************************* Folding@home Client *************************
03:29:02:        Website: https://foldingathome.org/
03:29:02:      Copyright: (c) 2009-2018 foldingathome.org
03:29:02:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
03:29:02:           Args: 
03:29:02:         Config: C:\Users\PantherX-H\AppData\Roaming\FAHClient\config.xml
03:29:02:******************************** Build ********************************
03:29:02:        Version: 7.5.1
03:29:02:           Date: May 11 2018
03:29:02:           Time: 13:06:32
03:29:02:     Repository: Git
03:29:02:       Revision: 4705bf53c635f88b8fe85af7675557e15d491ff0
03:29:02:         Branch: master
03:29:02:       Compiler: Visual C++ 2008
03:29:02:        Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
03:29:02:       Platform: win32 10
03:29:02:           Bits: 32
03:29:02:           Mode: Release
03:29:02:******************************* System ********************************
03:29:02:            CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
03:29:02:         CPU ID: GenuineIntel Family 6 Model 94 Stepping 3
03:29:02:           CPUs: 8
03:29:02:         Memory: 31.94GiB
03:29:02:    Free Memory: 27.89GiB
03:29:02:        Threads: WINDOWS_THREADS
03:29:02:     OS Version: 6.2
03:29:02:    Has Battery: false
03:29:02:     On Battery: false
03:29:02:     UTC Offset: 12
03:29:02:            PID: 1532
03:29:02:            CWD: C:\Users\PantherX-H\AppData\Roaming\FAHClient
03:29:02:             OS: Windows 10 Enterprise
03:29:02:        OS Arch: AMD64
03:29:02:           GPUs: 1
03:29:02:          GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:8 GP102 [GeForce GTX 1080 Ti] 11380
03:29:02:  CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:6.1 Driver:10.2
03:29:02:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:442.19
03:29:02:  Win32 Service: false
03:29:02:***********************************************************************
03:29:02:<config>
03:29:02:  <!-- Network -->
03:29:02:  <proxy v=':8080'/>
03:29:02:
03:29:02:  <!-- Slot Control -->
03:29:02:  <power v='full'/>
03:29:02:
03:29:02:  <!-- User Information -->
03:29:02:  <passkey v='********************************'/>
03:29:02:  <team v='69411'/>
03:29:02:  <user v='PantherX'/>
03:29:02:
03:29:02:  <!-- Folding Slots -->
03:29:02:  <slot id='1' type='GPU'>
03:29:02:    <next-unit-percentage v='100'/>
03:29:02:    <pause-on-start v='true'/>
03:29:02:  </slot>
03:29:02:</config>
03:29:02:Trying to access database...
03:29:02:Successfully acquired database lock
03:29:02:Enabled folding slot 01: PAUSED gpu:0:GP102 [GeForce GTX 1080 Ti] 11380 (by user)
03:30:05:FS01:Unpaused
03:30:05:WU01:FS01:Starting
03:30:05:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\PantherX-H\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 1532 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
03:30:05:WU01:FS01:Started FahCore on PID 8052
03:30:06:WU01:FS01:Core PID:1800
03:30:06:WU01:FS01:FahCore 0x22 started
03:30:06:WU01:FS01:0x22:*********************** Log Started 2020-04-11T03:30:06Z ***********************
03:30:06:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
03:30:06:WU01:FS01:0x22:       Type: 0x22
03:30:06:WU01:FS01:0x22:       Core: Core22
03:30:06:WU01:FS01:0x22:    Website: https://foldingathome.org/
03:30:06:WU01:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
03:30:06:WU01:FS01:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
03:30:06:WU01:FS01:0x22:             <rafal.wiewiora@choderalab.org>
03:30:06:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 705 -lifeline 8052 -checkpoint 15
03:30:06:WU01:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
03:30:06:WU01:FS01:0x22:             0 -gpu 0
03:30:06:WU01:FS01:0x22:     Config: <none>
03:30:06:WU01:FS01:0x22:************************************ Build *************************************
03:30:06:WU01:FS01:0x22:    Version: 0.0.2
03:30:06:WU01:FS01:0x22:       Date: Dec 6 2019
03:30:06:WU01:FS01:0x22:       Time: 21:30:31
03:30:06:WU01:FS01:0x22: Repository: Git
03:30:06:WU01:FS01:0x22:   Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
03:30:06:WU01:FS01:0x22:     Branch: HEAD
03:30:06:WU01:FS01:0x22:   Compiler: Visual C++ 2008
03:30:06:WU01:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
03:30:06:WU01:FS01:0x22:   Platform: win32 10
03:30:06:WU01:FS01:0x22:       Bits: 64
03:30:06:WU01:FS01:0x22:       Mode: Release
03:30:06:WU01:FS01:0x22:************************************ System ************************************
03:30:06:WU01:FS01:0x22:        CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
03:30:06:WU01:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 94 Stepping 3
03:30:06:WU01:FS01:0x22:       CPUs: 8
03:30:06:WU01:FS01:0x22:     Memory: 31.94GiB
03:30:06:WU01:FS01:0x22:Free Memory: 27.29GiB
03:30:06:WU01:FS01:0x22:    Threads: WINDOWS_THREADS
03:30:06:WU01:FS01:0x22: OS Version: 6.2
03:30:06:WU01:FS01:0x22:Has Battery: false
03:30:06:WU01:FS01:0x22: On Battery: false
03:30:06:WU01:FS01:0x22: UTC Offset: 12
03:30:06:WU01:FS01:0x22:        PID: 1800
03:30:06:WU01:FS01:0x22:        CWD: C:\Users\PantherX-H\AppData\Roaming\FAHClient\work
03:30:06:WU01:FS01:0x22:         OS: Windows 10 Pro
03:30:06:WU01:FS01:0x22:    OS Arch: AMD64
03:30:06:WU01:FS01:0x22:********************************************************************************
03:30:06:WU01:FS01:0x22:Project: 11761 (Run 0, Clone 7538, Gen 10)
03:30:06:WU01:FS01:0x22:Unit: 0x0000001a80fccb0a5e7001958353bbf1
03:30:06:WU01:FS01:0x22:Digital signatures verified
03:30:06:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
03:30:06:WU01:FS01:0x22:Version 0.0.2
03:30:06:WU01:FS01:0x22:  Found a checkpoint file
03:30:11:WU01:FS01:0x22:Completed 1050000 out of 2000000 steps (52%)
I can see from your log that you have allocated 24 CPUs to fold. However, the assigned WU cannot be folded using 24 CPUs. I will message the project owner and inform them of this so that they can place restrictions on the F@H servers to prevent this from happening again.

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 1:29 pm
by _r2w_ben
kyleedwardsny wrote:Please bear with me, I am brand new to F@H and don't know much about it :)
You're doing great! I've asked other people for md.log and they generally don't provide it. Yours confirms a theory I've had.

Code: Select all

Will use 20 particle-particle and 4 PME only ranks
GROMACS, the molecular dynamics code, is splitting your 24 threads into 20 PP (particle-particle) and 4 PME (particle mesh Ewald) threads/ranks. Lower thread counts use only PP threads, which makes it easier to guess what the domain decomposition will be.

Code: Select all

Optimizing the DD grid for 20 cells with a minimum initial size of 1.423 nm
GROMACS is now attempting to split the simulation volume into cells, one per PP thread. There is a limit to how small a cell can be for it to still be efficient.

Code: Select all

The maximum allowed number of cells is: X 4 Y 4 Z 4
Based on the size limit, each axis can be split into at most 4 sections. I think this should allow for a maximum of 4x4x4 = 64 PP threads/ranks.

On your machine, it tries to factor 20 ranks into that grid. 20 factors to 2x2x5. From the source code, it looks like GROMACS combines repeated instances of the same factor because the grid needs to be a combination of 2 or 3 numbers, so 2x2x5 becomes 4x5x1. Notice that 5 is greater than the limit of 4, so that cell would be too small. This causes the domain decomposition to fail with the message about 20 ranks.

GROMACS does not retry with a lower rank count because it was requested to use all 24 threads. It would be good if FAHClient did this automatically.
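
Here is a rough Python sketch of that factor check - my own approximation of the behaviour described above, not the actual GROMACS code:

Code: Select all

from itertools import combinations_with_replacement

def can_decompose(pp_ranks, max_cells=(4, 4, 4)):
    """Rough check: can pp_ranks be arranged as an X x Y x Z grid where no
    axis exceeds its maximum allowed cell count? (Approximation, not GROMACS.)"""
    limits = sorted(max_cells, reverse=True)
    for grid in combinations_with_replacement(range(1, max(limits) + 1), 3):
        if grid[0] * grid[1] * grid[2] != pp_ranks:
            continue
        # Match the largest factor against the largest axis limit, and so on.
        if all(f <= m for f, m in zip(sorted(grid, reverse=True), limits)):
            return True
    return False

print(can_decompose(20))  # False - needs a factor of 5, but no axis allows more than 4 cells
print(can_decompose(16))  # True  - 16 = 4 x 4 x 1 fits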

For example, the following combinations might work, depending on the ratio GROMACS uses for PP vs. PME. The GROMACS manual mentions 20-33%, but your log shows 4/24 = 17%. These breakdowns account for the maximum grid of 4x4x4.

Code: Select all

23 threads = 18 PP (2x3x3) + 5 PME (22%)
22 threads = 18 PP (2x3x3) + 4 PME (18%)
21 threads = 18 PP (2x3x3) + 3 PME (14% might be too low)
21 threads = 16 PP (4x4x1) + 5 PME (23%)
20 threads = 18 PP (2x3x3) + 2 PME (10% might be too low)
20 threads = 16 PP (4x4x1) + 4 PME (25% PME is equal to one of the factors, which might be ideal)
19 threads = 16 PP (4x4x1) + 3 PME (15% might be too low)
18 threads = 18 PP (2x3x3) + 0 PME (0% PME is optional)
18 threads = 16 PP (4x4x1) + 2 PME (11% might be too low)
17 threads = 16 PP (4x4x1) + 1 PME (6% too low)
17 threads = 12 PP (4x3x1) + 5 PME (29% might be too high)
16 threads = 16 PP (4x4x1) + 0 PME (0% PME is optional)
16 threads = 12 PP (4x3x1) + 4 PME (25%)
15 threads = 12 PP (4x3x1) + 3 PME (20%)
14 threads = 12 PP (4x3x1) + 2 PME (14% might be too low)
13 threads = 12 PP (4x3x1) + 1 PME (7% too low)
13 threads =  9 PP (3x3x1) + 4 PME (31% might be too high)
12 threads = 12 PP (4x3x1) + 0 PME (0% PME is optional)
12 threads =  9 PP (3x3x1) + 3 PME (25% PME is equal to one of the factors, which might be ideal)
I'm not sure what the minimum thread count is before GROMACS starts to omit PME ranks.

PantherX, are the researchers able to limit specific CPU counts or is it only a maximum? The response to these reports generally seems to be to set a max and eliminate multiples of 5. If someone could confirm my breakdowns, it might be possible to limit thread counts more precisely and provide more work for high thread count machines.

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 1:52 pm
by PantherX
_r2w_ben wrote:...PantherX, are the researchers able to limit specific CPU counts or is it only a maximum? The response to these reports generally seems to be to set a max and eliminate multiples of 5. If someone could confirm my breakdowns, it might be possible to limit thread counts more precisely and provide more work for high thread count machines.
AFAIK, the researchers can set an upper limit of CPUs and also exclude certain numbers from it. However, I am not sure how quick/easy it is.

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 3:19 pm
by kyleedwardsny
Hi PantherX, sorry about the log file confusion on my part. Here is the system configuration portion of /config/log.txt:

Code: Select all

*********************** Log Started 2020-04-10T22:16:21Z ***********************
22:16:21:************************* Folding@home Client *************************
22:16:21:    Website: https://foldingathome.org/
22:16:21:  Copyright: (c) 2009-2018 foldingathome.org
22:16:21:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
22:16:21:       Args: --http-addresses 0.0.0.0:7396 --allow 0/0 --web-allow 0/0
22:16:21:             --command-allow-no-pass 0/0
22:16:21:     Config: /config/config.xml
22:16:21:******************************** Build ********************************
22:16:21:    Version: 7.5.1
22:16:21:       Date: May 11 2018
22:16:21:       Time: 19:59:04
22:16:21: Repository: Git
22:16:21:   Revision: 4705bf53c635f88b8fe85af7675557e15d491ff0
22:16:21:     Branch: master
22:16:21:   Compiler: GNU 6.3.0 20170516
22:16:21:    Options: -std=gnu++98 -O3 -funroll-loops
22:16:21:   Platform: linux2 4.14.0-3-amd64
22:16:21:       Bits: 64
22:16:21:       Mode: Release
22:16:21:******************************* System ********************************
22:16:21:        CPU: AMD Ryzen Threadripper 1920X 12-Core Processor
22:16:21:     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
22:16:21:       CPUs: 24
22:16:21:     Memory: 31.34GiB
22:16:21:Free Memory: 26.81GiB
22:16:21:    Threads: POSIX_THREADS
22:16:21: OS Version: 4.15
22:16:21:Has Battery: false
22:16:21: On Battery: false
22:16:21: UTC Offset: -4
22:16:21:        PID: 257
22:16:21:        CWD: /config
22:16:21:         OS: Linux 4.15.0-91-generic x86_64
22:16:21:    OS Arch: AMD64
22:16:21:       GPUs: 1
22:16:21:      GPU 0: Bus:65 Slot:0 Func:0 AMD:4 Caicos [AMD RADEON HD 6450]
22:16:21:       CUDA: Not detected: cuInit() returned 100
22:16:21:     OpenCL: Not detected: clGetPlatformIDs() returned -1001
22:16:21:***********************************************************************
22:16:21:<config>
22:16:21:  <!-- Slot Control -->
22:16:21:  <power v='FULL'/>
22:16:21:
22:16:21:  <!-- User Information -->
22:16:21:  <user v='kyleedwardsny'/>
22:16:21:
22:16:21:  <!-- Folding Slots -->
22:16:21:  <slot id='0' type='CPU'/>
22:16:21:</config>
22:16:21:Trying to access database...
22:16:21:Successfully acquired database lock
22:16:21:Enabled folding slot 00: READY cpu:24

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 3:27 pm
by _r2w_ben
_r2w_ben wrote:I'm not sure what the minimum thread count is before GROMACS starts to omit PME ranks.
When using automatic configuration, as FAH does, there have to be at least 20 threads before separate PME ranks are used.

The code to calculate the number of PME ranks is far more complex than I expected. Based on reading the code rather than actively debugging it, I'm revising the chart as follows with my best estimates (4x4x4 grid and 0.17 PME load):

Code: Select all

33 threads = 6 PME + 27 PP (3x3x3)
32 threads = bad (chooses 6 PME and then can't factor 26)
31 threads = bad (chooses 6 PME and then can't factor 25)
30 threads = bad (chooses 5 PME and then can't factor 25)
29 threads = 5 PME + 24 PP (4x3x2)
28 threads = bad (chooses 5 PME and then can't factor 23)
27 threads = bad (chooses 5 PME and then can't factor 22)
26 threads = bad (chooses 5 PME and then can't factor 21)
25 threads = bad (chooses 5 PME and then can't factor 20)
24 threads = bad (chooses 4 PME and then can't factor 20)
23 threads = bad (chooses 4 PME and then can't factor 19)
22 threads = 4 PME + 18 PP (3x3x2)
21 threads = bad (chooses 4 PME and then can't factor 17)
20 threads = 4 PME + 16 PP (4x4x1)
19 threads = bad
18 threads = 18 PP (2x3x3)
17 threads = bad
16 threads = 16 PP (4x4x1)
15 threads = bad
14 threads = bad
13 threads = bad
12 threads = 12 PP (4x3x1)
11 threads = bad
10 threads = bad
 9 threads =  9 PP (3x3x1)
 8 threads =  8 PP (4x2x1)

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 3:30 pm
by Neil-B
Intriguing … my 32-core and 24-core slots (I also used to run a 20-core slot) have had only one issue in some 650 WUs … by this chart, should they have mostly/all failed?

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 4:22 pm
by _r2w_ben
Neil-B wrote:Intriguing … my 32-core and 24-core slots (I also used to run a 20-core slot) have had only one issue in some 650 WUs … by this chart, should they have mostly/all failed?
No, this analysis is based on a maximum 4x4x4 decomposition.

The limits are different per project. I'm running p14542, which has a maximum 6x6x5.

Code: Select all

The maximum allowed number of cells is: X 6 Y 6 Z 5
The estimated PME load could differ as well, affecting the number of threads used for PME. It's only written to md.log if you have 20 or more threads, so I'm going to assume it's 0.17, like p16417.

Code: Select all

33 threads = 6 PME + 27 PP (3x3x3)
32 threads = bad (chooses 6 PME and then can't factor 26)
31 threads = 6 PME + 25 PP (5x5x1)
30 threads = 5 PME + 25 PP (5x5x1)
29 threads = 5 PME + 24 PP (4x3x2)
28 threads = bad (chooses 5 PME and then can't factor 23)
27 threads = bad (chooses 5 PME and then can't factor 22)
26 threads = bad (chooses 5 PME and then can't factor 21)
25 threads = 5 PME + 20 PP (5x4x1)
24 threads = 4 PME + 20 PP (5x4x1)
23 threads = bad (chooses 4 PME and then can't factor 19)
22 threads = 4 PME + 18 PP (3x3x2)
21 threads = bad (chooses 4 PME and then can't factor 17)
20 threads = 4 PME + 16 PP (4x4x1)
19 threads = bad
18 threads = 18 PP (2x3x3)
17 threads = bad
16 threads = 16 PP (4x4x1)
15 threads = 15 PP (5x3x1)
14 threads = bad
13 threads = bad
12 threads = 12 PP (4x3x1)
11 threads = bad
10 threads = 10 PP (5x2x1)
 9 threads =  9 PP (3x3x1)
 8 threads =  8 PP (4x2x1)
You can see how a lot more combinations will work. FAH also compensates for some primes and lowers the thread count to improve the likelihood of success.
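
As a quick sanity check, here is the same rough factor test from my earlier post run against both grid limits (again just my approximation, not GROMACS itself):

Code: Select all

from itertools import combinations_with_replacement

def can_decompose(pp_ranks, max_cells):
    # Same rough check as before: is there an X x Y x Z grid for pp_ranks
    # with each factor within the per-axis cell limit?
    limits = sorted(max_cells, reverse=True)
    return any(
        x * y * z == pp_ranks
        and all(f <= m for f, m in zip(sorted((x, y, z), reverse=True), limits))
        for x, y, z in combinations_with_replacement(range(1, max(limits) + 1), 3)
    )

for pp in (20, 25):
    print(pp, can_decompose(pp, (4, 4, 4)), can_decompose(pp, (6, 6, 5)))
# 20 False True  -> fails p16417's 4x4x4 limit but fits under 6x6x5
# 25 False True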

If you can post the domain decomposition from some of the work units on your machines, I may be able to further improve my model and make a spreadsheet to do some of these calculations.

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 5:08 pm
by Neil-B
Here are a couple of excerpts from the two current WUs - both happen to be project 16500.

Code: Select all

Initializing Domain Decomposition on 32 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.433 nm, LJ-14, atoms 9545 9552
  multi-body bonded interactions: 0.433 nm, Proper Dih., atoms 9545 9552
Minimum cell size due to bonded interactions: 0.476 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
Estimated maximum distance required for P-LINCS: 0.819 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.10
Will use 24 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 8 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 24 cells with a minimum initial size of 1.024 nm
The maximum allowed number of cells is: X 11 Y 11 Z 10
Domain decomposition grid 8 x 3 x 1, separate PME ranks 8
PME domain decomposition: 8 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.

Code: Select all

Initializing Domain Decomposition on 24 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.431 nm, LJ-14, atoms 6162 6170
  multi-body bonded interactions: 0.431 nm, Proper Dih., atoms 6162 6170
Minimum cell size due to bonded interactions: 0.474 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.819 nm
Estimated maximum distance required for P-LINCS: 0.819 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.10
Will use 20 particle-particle and 4 PME only ranks
This is a guess, check the performance at the end of the log file
Using 4 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 20 cells with a minimum initial size of 1.024 nm
The maximum allowed number of cells is: X 11 Y 11 Z 10
Domain decomposition grid 4 x 5 x 1, separate PME ranks 4
PME domain decomposition: 4 x 1 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 8:03 pm
by kyleedwardsny
Alright, so, is there anything I can do to skip over this work unit, or otherwise get it to work? It's been crashing over and over since yesterday afternoon and keeping my computer idle :(