
Issues with box with very large number of CPU cores

Posted: Sat Mar 28, 2020 10:29 am
by xandm
I have a server running CentOS 7 which presents 96 CPU cores - 4 x 12-core CPUs with hyper-threading. It is dedicated to running fahclient. Several different work units will not run on it, failing with errors such as:

Code:

10:20:32:WU00:FS00:Starting
10:20:32:WU00:FS00:Removing old file './work/00/logfile_01-20200328-094831.txt'
10:20:32:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 30256 -checkpoint 15 -np 96
10:20:32:WU00:FS00:Started FahCore on PID 52244
10:20:32:WU00:FS00:Core PID:52248
10:20:32:WU00:FS00:FahCore 0xa7 started
10:20:32:WU00:FS00:0xa7:*********************** Log Started 2020-03-28T10:20:32Z ***********************
10:20:32:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
10:20:32:WU00:FS00:0xa7:       Type: 0xa7
10:20:32:WU00:FS00:0xa7:       Core: Gromacs
10:20:32:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 52244 -checkpoint 15 -np
10:20:32:WU00:FS00:0xa7:             96
10:20:32:WU00:FS00:0xa7:************************************ CBang *************************************
10:20:32:WU00:FS00:0xa7:       Date: Nov 5 2019
10:20:32:WU00:FS00:0xa7:       Time: 06:06:57
10:20:32:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
10:20:32:WU00:FS00:0xa7:     Branch: master
10:20:32:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
10:20:32:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
10:20:32:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
10:20:32:WU00:FS00:0xa7:       Bits: 64
10:20:32:WU00:FS00:0xa7:       Mode: Release
10:20:32:WU00:FS00:0xa7:************************************ System ************************************
10:20:32:WU00:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz
10:20:32:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 4
10:20:32:WU00:FS00:0xa7:       CPUs: 96
10:20:32:WU00:FS00:0xa7:     Memory: 251.63GiB
10:20:32:WU00:FS00:0xa7:Free Memory: 246.44GiB
10:20:32:WU00:FS00:0xa7:    Threads: POSIX_THREADS
10:20:32:WU00:FS00:0xa7: OS Version: 3.10
10:20:32:WU00:FS00:0xa7:Has Battery: false
10:20:32:WU00:FS00:0xa7: On Battery: false
10:20:32:WU00:FS00:0xa7: UTC Offset: 0
10:20:32:WU00:FS00:0xa7:        PID: 52248
10:20:32:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
10:20:32:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
10:20:32:WU00:FS00:0xa7:    Version: 0.0.18
10:20:32:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:20:32:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
10:20:32:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
10:20:32:WU00:FS00:0xa7:       Date: Nov 5 2019
10:20:32:WU00:FS00:0xa7:       Time: 06:13:26
10:20:32:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
10:20:32:WU00:FS00:0xa7:     Branch: master
10:20:32:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
10:20:32:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
10:20:32:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
10:20:32:WU00:FS00:0xa7:       Bits: 64
10:20:32:WU00:FS00:0xa7:       Mode: Release
10:20:32:WU00:FS00:0xa7:************************************ Build *************************************
10:20:32:WU00:FS00:0xa7:       SIMD: avx_256
10:20:32:WU00:FS00:0xa7:********************************************************************************
10:20:32:WU00:FS00:0xa7:Project: 14574 (Run 0, Clone 1428, Gen 2)
10:20:32:WU00:FS00:0xa7:Unit: 0x00000004287234c95e792514c7d8ef20
10:20:32:WU00:FS00:0xa7:Reading tar file core.xml
10:20:32:WU00:FS00:0xa7:Reading tar file frame2.tpr
10:20:32:WU00:FS00:0xa7:Digital signatures verified
10:20:32:WU00:FS00:0xa7:Calling: mdrun -s frame2.tpr -o frame2.trr -x frame2.xtc -cpt 15 -nt 96
10:20:32:WU00:FS00:0xa7:Steps: first=1000000 total=500000
10:20:32:WU00:FS00:0xa7:ERROR:
10:20:32:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
10:20:32:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
10:20:32:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
10:20:32:WU00:FS00:0xa7:ERROR:
10:20:32:WU00:FS00:0xa7:ERROR:Fatal error:
10:20:32:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 72 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
10:20:32:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
10:20:32:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
10:20:32:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
10:20:32:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
10:20:32:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
and

Code:

19:15:10:WU01:FS00:Starting
19:15:10:WU01:FS00:Removing old file './work/01/logfile_01-20200327-184309.txt'
19:15:10:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 87143 -checkpoint 15 -np 96
19:15:10:WU01:FS00:Started FahCore on PID 30108
19:15:10:WU01:FS00:Core PID:30112
19:15:10:WU01:FS00:FahCore 0xa7 started
19:15:11:WU01:FS00:0xa7:*********************** Log Started 2020-03-27T19:15:10Z ***********************
19:15:11:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
19:15:11:WU01:FS00:0xa7:       Type: 0xa7
19:15:11:WU01:FS00:0xa7:       Core: Gromacs
19:15:11:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 30108 -checkpoint 15 -np
19:15:11:WU01:FS00:0xa7:             96
19:15:11:WU01:FS00:0xa7:************************************ CBang *************************************
19:15:11:WU01:FS00:0xa7:       Date: Nov 5 2019
19:15:11:WU01:FS00:0xa7:       Time: 06:06:57
19:15:11:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
19:15:11:WU01:FS00:0xa7:     Branch: master
19:15:11:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
19:15:11:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
19:15:11:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
19:15:11:WU01:FS00:0xa7:       Bits: 64
19:15:11:WU01:FS00:0xa7:       Mode: Release
19:15:11:WU01:FS00:0xa7:************************************ System ************************************
19:15:11:WU01:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz
19:15:11:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 4
19:15:11:WU01:FS00:0xa7:       CPUs: 96
19:15:11:WU01:FS00:0xa7:     Memory: 251.63GiB
19:15:11:WU01:FS00:0xa7:Free Memory: 246.45GiB
19:15:11:WU01:FS00:0xa7:    Threads: POSIX_THREADS
19:15:11:WU01:FS00:0xa7: OS Version: 3.10
19:15:11:WU01:FS00:0xa7:Has Battery: false
19:15:11:WU01:FS00:0xa7: On Battery: false
19:15:11:WU01:FS00:0xa7: UTC Offset: 0
19:15:11:WU01:FS00:0xa7:        PID: 30112
19:15:11:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
19:15:11:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
19:15:11:WU01:FS00:0xa7:    Version: 0.0.18
19:15:11:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:15:11:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
19:15:11:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
19:15:11:WU01:FS00:0xa7:       Date: Nov 5 2019
19:15:11:WU01:FS00:0xa7:       Time: 06:13:26
19:15:11:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
19:15:11:WU01:FS00:0xa7:     Branch: master
19:15:11:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
19:15:11:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
19:15:11:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
19:15:11:WU01:FS00:0xa7:       Bits: 64
19:15:11:WU01:FS00:0xa7:       Mode: Release
19:15:11:WU01:FS00:0xa7:************************************ Build *************************************
19:15:11:WU01:FS00:0xa7:       SIMD: avx_256
19:15:11:WU01:FS00:0xa7:********************************************************************************
19:15:11:WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 226, Gen 5)
19:15:11:WU01:FS00:0xa7:Unit: 0x00000008287234c95e7923efa2c4aea3
19:15:11:WU01:FS00:0xa7:Reading tar file core.xml
19:15:11:WU01:FS00:0xa7:Reading tar file frame5.tpr
19:15:11:WU01:FS00:0xa7:Digital signatures verified
19:15:11:WU01:FS00:0xa7:Calling: mdrun -s frame5.tpr -o frame5.trr -x frame5.xtc -cpt 15 -nt 96
19:15:11:WU01:FS00:0xa7:Steps: first=2500000 total=500000
19:15:11:WU01:FS00:0xa7:ERROR:
19:15:11:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
19:15:11:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
19:15:11:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
19:15:11:WU01:FS00:0xa7:ERROR:
19:15:11:WU01:FS00:0xa7:ERROR:Fatal error:
19:15:11:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 80 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
19:15:11:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
19:15:11:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
19:15:11:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
19:15:11:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
19:15:11:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
What's the best way to make use of this box for folding? Should I consider splitting it up by running VMs on it, with say 8 CPU cores each? Currently it needs constant attention: I have to delete non-working WUs and then hope to get one that works. Other servers with fewer (40) CPUs have the same issue, but less frequently.

Re: Issues with box with very large number of CPU cores

Posted: Sat Mar 28, 2020 3:01 pm
by Joe_H
The researchers setting up the folding projects have limited access to systems with many CPU cores, so they often have no data for setting an upper limit on CPU threads for assignment. Sometimes pausing the slot and changing the thread count to a lower value will work; I will mention usable thread counts below.

They do have some ability to test up to about 32 threads on a regular basis, and there are CPU projects that will assign to 16+ threads and work up to some maximum number. What I would suggest is to set up several CPU folding slots, with thread counts that you have seen working on your servers. The total number of threads across the slots should not exceed what your CPUs can provide.

Usable thread counts will be multiples of 2, 3 and sometimes 5. The Gromacs code used in the folding cores has problems with thread counts that are multiples of primes greater than or equal to 7. Multiples of 5 sometimes cause problems too. The researchers try to identify projects where that is an issue, but some are only found after the project has been out for a while.
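
As a rough illustration of that rule, the sketch below lists the thread counts up to 96 whose prime factors are all 2 or 3, optionally allowing 5. The function names are made up for this example and are not part of any Folding@home tool. Passing this check is no guarantee: the first log above failed at 72 ranks even though 72 = 2 x 2 x 2 x 3 x 3.

```python
def largest_prime_factor(n):
    """Return the largest prime factor of n (expects n >= 2)."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:
        # Whatever is left after trial division is itself prime.
        largest = n
    return largest


def usable_thread_counts(max_threads, allow_five=False):
    """Thread counts whose prime factors are all <= 3 (or <= 5 if allow_five)."""
    limit = 5 if allow_five else 3
    return [n for n in range(2, max_threads + 1)
            if largest_prime_factor(n) <= limit]


if __name__ == "__main__":
    print(usable_thread_counts(96))
    print(usable_thread_counts(96, allow_five=True))
```

With the default settings 84 (a multiple of 7) and 80 (a multiple of 5) are excluded; 80 reappears when `allow_five=True`.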

If you do run into WUs from a project that give the decomposition error message, please report the project number and the thread count it failed at, and we can forward that information to the researcher so the limits can be adjusted.
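
To collect those details from a log, something like the following Python sketch could be used. It assumes the log lines look like the ones posted at the top of this thread; the regular expressions and the function name are my own and only look at the first match of each kind in the text they are given.

```python
import re

# Patterns modelled on the log excerpts posted in this thread.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
THREADS_RE = re.compile(r"Calling: mdrun .*-nt (\d+)")
RANKS_RE = re.compile(r"no domain decomposition for (\d+) ranks")


def summarize_failure(log_text):
    """Return project, thread count and failing rank count, or None if
    the text contains no domain decomposition error."""
    ranks = RANKS_RE.search(log_text)
    if ranks is None:
        return None
    project = PROJECT_RE.search(log_text)
    threads = THREADS_RE.search(log_text)
    return {
        "project": int(project.group(1)) if project else None,
        "threads": int(threads.group(1)) if threads else None,
        "failed_ranks": int(ranks.group(1)),
    }


sample = """\
10:20:32:WU00:FS00:0xa7:Project: 14574 (Run 0, Clone 1428, Gen 2)
10:20:32:WU00:FS00:0xa7:Calling: mdrun -s frame2.tpr -o frame2.trr -x frame2.xtc -cpt 15 -nt 96
10:20:32:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 72 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
"""

if __name__ == "__main__":
    print(summarize_failure(sample))
```

For the sample lines this reports project 14574, 96 threads and 72 failing ranks.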

Re: Issues with box with very large number of CPU cores

Posted: Sat Mar 28, 2020 3:08 pm
by Neil-B
My guess would be that one client with, say, three 32-thread slots or four 24-thread slots should work (these use the available threads nicely) … Going smaller helps avoid some of the availability issues, but the bigger the slots, the faster the WUs turn round … Finding the sweet spot whilst avoiding core counts that are divisible by larger primes (say 7 or above) is a sport :) … There are a number of threads which go into this in detail (searching for "prime" should find you most of them) … I am currently picking up WUs at 24 cores and 30 cores back to back - but I have the advanced client-type flag set, which may explain why I am constantly getting WUs.
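
For anyone wanting to enumerate layouts like these, here is a small Python sketch. It only considers equal-sized slots that use every thread and whose size has no prime factor other than 2 or 3; both restrictions, and the function names, are assumptions made for the example.

```python
def has_only_small_factors(n, allowed=(2, 3)):
    """True if n can be written using only the primes in `allowed`."""
    for p in allowed:
        while n % p == 0:
            n //= p
    return n == 1


def equal_slot_layouts(total_threads, max_slot_threads):
    """Return (slot_count, threads_per_slot) pairs, largest slots first,
    where the slots are equal in size and together use every thread."""
    layouts = []
    for per_slot in range(max_slot_threads, 1, -1):
        if total_threads % per_slot == 0 and has_only_small_factors(per_slot):
            layouts.append((total_threads // per_slot, per_slot))
    return layouts


if __name__ == "__main__":
    for slots, threads in equal_slot_layouts(96, 32):
        print(f"{slots} slots x {threads} threads")
```

For 96 threads with a cap of 32 per slot, the first two layouts listed are three 32-thread slots and four 24-thread slots.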

Re: Issues with box with very large number of CPU cores

Posted: Sat Mar 28, 2020 3:20 pm
by EXT64
For a lot of projects it is a minefield above 24 threads. On new projects I'm trying to help test up to 128t, but testing every combination can be tedious. And unfortunately since the bigadv and big flags are deprecated, we have no way to discriminate.

At this time there are only a couple of options. Option 1 is to run without HT, which reduces the "core" count with minimal performance loss. Some projects will likely still fail on larger systems.

Option 2 is to run multiple slots (24 or fewer threads each) per system. Bear in mind that although this will increase throughput, QRB will punish you and your PPD will likely plummet. Also, in many cases skipping HT can help the scheduler keep things from getting too messy. I would also recommend running Linux with numactl if you have a multi-socket system.

Typically core counts of 64, 48 and 32 work OK on many projects, but even those will explode on some of the smaller projects.

Re: Issues with box with very large number of CPU cores

Posted: Sat Mar 28, 2020 3:52 pm
by _r2w_ben
On an average desktop CPU, GROMACS splits the work up across the cores using PP ranks. In these high-core-count scenarios, the cores are divided into two sets: PP ranks and PME ranks. (See page 63 of the manual. PME is 25-33% of the load.)

From your log, 96 was split into 72 PP ranks and 24 PME ranks.

If you want to aim for maximum threads on one work unit, I would try for 64 PP ranks. Keeping the 25-33% in mind, this would mean a core count between 80 and 86 on the slot.
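
The arithmetic can be sketched in Python as follows. The 96 = 72 + 24 split comes from the log. The 80 to 86 range is reproduced here by assuming the 25-33% is taken relative to the PP rank count and rounded up, which is one reading of the numbers rather than a description of how GROMACS actually chooses the number of PME ranks. The function names are hypothetical.

```python
from fractions import Fraction
import math


def pme_ranks(total_threads, pp_ranks):
    """PME ranks are whatever is left once the PP ranks are taken out."""
    return total_threads - pp_ranks


def slot_size_range(pp_ranks, low=Fraction(1, 4), high=Fraction(1, 3)):
    """Slot sizes that would give `pp_ranks` PP ranks, ASSUMING the number
    of PME ranks is between `low` and `high` of the PP rank count."""
    smallest = pp_ranks + math.ceil(pp_ranks * low)
    largest = pp_ranks + math.ceil(pp_ranks * high)
    return smallest, largest


if __name__ == "__main__":
    print(pme_ranks(96, 72))    # 24, as in the first log
    print(Fraction(24, 72))     # 1/3 of the PP rank count
    print(slot_size_range(64))  # (80, 86)
```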