Issues with a box with a very large number of CPU cores
Posted: Sat Mar 28, 2020 10:29 am
I have a server running CentOS 7 which presents 96 CPU cores (4 x 12-core CPUs with hyper-threading). It is dedicated to running fahclient. Several different work units will not run on it, failing with GROMACS domain decomposition errors. Two example logs are included below.
What's the best way to make use of this box for folding? Should I consider splitting it up by running VMs on it, with say 8 CPU cores each? Currently it needs constant attention: I have to delete non-working WUs and then hope to get one that works. Other servers with fewer (40) CPUs have the same issue, but less frequently.
10:20:32:WU00:FS00:Starting
10:20:32:WU00:FS00:Removing old file './work/00/logfile_01-20200328-094831.txt'
10:20:32:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 30256 -checkpoint 15 -np 96
10:20:32:WU00:FS00:Started FahCore on PID 52244
10:20:32:WU00:FS00:Core PID:52248
10:20:32:WU00:FS00:FahCore 0xa7 started
10:20:32:WU00:FS00:0xa7:*********************** Log Started 2020-03-28T10:20:32Z ***********************
10:20:32:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
10:20:32:WU00:FS00:0xa7: Type: 0xa7
10:20:32:WU00:FS00:0xa7: Core: Gromacs
10:20:32:WU00:FS00:0xa7: Args: -dir 00 -suffix 01 -version 705 -lifeline 52244 -checkpoint 15 -np
10:20:32:WU00:FS00:0xa7: 96
10:20:32:WU00:FS00:0xa7:************************************ CBang *************************************
10:20:32:WU00:FS00:0xa7: Date: Nov 5 2019
10:20:32:WU00:FS00:0xa7: Time: 06:06:57
10:20:32:WU00:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
10:20:32:WU00:FS00:0xa7: Branch: master
10:20:32:WU00:FS00:0xa7: Compiler: GNU 8.3.0
10:20:32:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
10:20:32:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
10:20:32:WU00:FS00:0xa7: Bits: 64
10:20:32:WU00:FS00:0xa7: Mode: Release
10:20:32:WU00:FS00:0xa7:************************************ System ************************************
10:20:32:WU00:FS00:0xa7: CPU: Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz
10:20:32:WU00:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 63 Stepping 4
10:20:32:WU00:FS00:0xa7: CPUs: 96
10:20:32:WU00:FS00:0xa7: Memory: 251.63GiB
10:20:32:WU00:FS00:0xa7:Free Memory: 246.44GiB
10:20:32:WU00:FS00:0xa7: Threads: POSIX_THREADS
10:20:32:WU00:FS00:0xa7: OS Version: 3.10
10:20:32:WU00:FS00:0xa7:Has Battery: false
10:20:32:WU00:FS00:0xa7: On Battery: false
10:20:32:WU00:FS00:0xa7: UTC Offset: 0
10:20:32:WU00:FS00:0xa7: PID: 52248
10:20:32:WU00:FS00:0xa7: CWD: /var/lib/fahclient/work
10:20:32:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
10:20:32:WU00:FS00:0xa7: Version: 0.0.18
10:20:32:WU00:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:20:32:WU00:FS00:0xa7: Copyright: 2019 foldingathome.org
10:20:32:WU00:FS00:0xa7: Homepage: https://foldingathome.org/
10:20:32:WU00:FS00:0xa7: Date: Nov 5 2019
10:20:32:WU00:FS00:0xa7: Time: 06:13:26
10:20:32:WU00:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
10:20:32:WU00:FS00:0xa7: Branch: master
10:20:32:WU00:FS00:0xa7: Compiler: GNU 8.3.0
10:20:32:WU00:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
10:20:32:WU00:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
10:20:32:WU00:FS00:0xa7: Bits: 64
10:20:32:WU00:FS00:0xa7: Mode: Release
10:20:32:WU00:FS00:0xa7:************************************ Build *************************************
10:20:32:WU00:FS00:0xa7: SIMD: avx_256
10:20:32:WU00:FS00:0xa7:********************************************************************************
10:20:32:WU00:FS00:0xa7:Project: 14574 (Run 0, Clone 1428, Gen 2)
10:20:32:WU00:FS00:0xa7:Unit: 0x00000004287234c95e792514c7d8ef20
10:20:32:WU00:FS00:0xa7:Reading tar file core.xml
10:20:32:WU00:FS00:0xa7:Reading tar file frame2.tpr
10:20:32:WU00:FS00:0xa7:Digital signatures verified
10:20:32:WU00:FS00:0xa7:Calling: mdrun -s frame2.tpr -o frame2.trr -x frame2.xtc -cpt 15 -nt 96
10:20:32:WU00:FS00:0xa7:Steps: first=1000000 total=500000
10:20:32:WU00:FS00:0xa7:ERROR:
10:20:32:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
10:20:32:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
10:20:32:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
10:20:32:WU00:FS00:0xa7:ERROR:
10:20:32:WU00:FS00:0xa7:ERROR:Fatal error:
10:20:32:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 72 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
10:20:32:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
10:20:32:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
10:20:32:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
10:20:32:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
10:20:32:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
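To make sense of the error above, here is a rough sketch of the constraint as I understand it: the ranks have to be arranged on an nx x ny x nz grid, and every cell has to be at least the minimum cell size wide. This is a heavily simplified model, not what GROMACS actually does internally. The box size is not shown anywhere in the log, so `BOX_NM` is a made-up value for illustration only, and `find_grid` is my own helper name.

```python
from itertools import product
from math import floor


def find_grid(ranks, box_nm, min_cell_nm):
    """Return an (nx, ny, nz) grid with nx*ny*nz == ranks whose cells are all
    at least min_cell_nm wide, or None if no such grid exists."""
    # Largest number of cells that fits along each box edge.
    limits = [max(1, floor(edge / min_cell_nm)) for edge in box_nm]
    for nx, ny, nz in product(*(range(1, lim + 1) for lim in limits)):
        if nx * ny * nz == ranks:
            return (nx, ny, nz)
    return None


MIN_CELL_NM = 1.37225      # taken from the error message
BOX_NM = (6.0, 6.0, 6.0)   # hypothetical: the real box size is not in the log

for ranks in (72, 80, 64, 48, 24):
    print(ranks, find_grid(ranks, BOX_NM, MIN_CELL_NM))
```

With this assumed box only 4 cells fit along each edge, so at most 64 ranks can be placed. In this toy model 72 and 80 ranks both fail while smaller counts such as 64, 48 or 24 succeed, which would be consistent with the servers with fewer CPUs hitting the problem less often.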
19:15:10:WU01:FS00:Starting
19:15:10:WU01:FS00:Removing old file './work/01/logfile_01-20200327-184309.txt'
19:15:10:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 87143 -checkpoint 15 -np 96
19:15:10:WU01:FS00:Started FahCore on PID 30108
19:15:10:WU01:FS00:Core PID:30112
19:15:10:WU01:FS00:FahCore 0xa7 started
19:15:11:WU01:FS00:0xa7:*********************** Log Started 2020-03-27T19:15:10Z ***********************
19:15:11:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
19:15:11:WU01:FS00:0xa7: Type: 0xa7
19:15:11:WU01:FS00:0xa7: Core: Gromacs
19:15:11:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 705 -lifeline 30108 -checkpoint 15 -np
19:15:11:WU01:FS00:0xa7: 96
19:15:11:WU01:FS00:0xa7:************************************ CBang *************************************
19:15:11:WU01:FS00:0xa7: Date: Nov 5 2019
19:15:11:WU01:FS00:0xa7: Time: 06:06:57
19:15:11:WU01:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
19:15:11:WU01:FS00:0xa7: Branch: master
19:15:11:WU01:FS00:0xa7: Compiler: GNU 8.3.0
19:15:11:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
19:15:11:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
19:15:11:WU01:FS00:0xa7: Bits: 64
19:15:11:WU01:FS00:0xa7: Mode: Release
19:15:11:WU01:FS00:0xa7:************************************ System ************************************
19:15:11:WU01:FS00:0xa7: CPU: Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz
19:15:11:WU01:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 63 Stepping 4
19:15:11:WU01:FS00:0xa7: CPUs: 96
19:15:11:WU01:FS00:0xa7: Memory: 251.63GiB
19:15:11:WU01:FS00:0xa7:Free Memory: 246.45GiB
19:15:11:WU01:FS00:0xa7: Threads: POSIX_THREADS
19:15:11:WU01:FS00:0xa7: OS Version: 3.10
19:15:11:WU01:FS00:0xa7:Has Battery: false
19:15:11:WU01:FS00:0xa7: On Battery: false
19:15:11:WU01:FS00:0xa7: UTC Offset: 0
19:15:11:WU01:FS00:0xa7: PID: 30112
19:15:11:WU01:FS00:0xa7: CWD: /var/lib/fahclient/work
19:15:11:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
19:15:11:WU01:FS00:0xa7: Version: 0.0.18
19:15:11:WU01:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
19:15:11:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
19:15:11:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
19:15:11:WU01:FS00:0xa7: Date: Nov 5 2019
19:15:11:WU01:FS00:0xa7: Time: 06:13:26
19:15:11:WU01:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
19:15:11:WU01:FS00:0xa7: Branch: master
19:15:11:WU01:FS00:0xa7: Compiler: GNU 8.3.0
19:15:11:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
19:15:11:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
19:15:11:WU01:FS00:0xa7: Bits: 64
19:15:11:WU01:FS00:0xa7: Mode: Release
19:15:11:WU01:FS00:0xa7:************************************ Build *************************************
19:15:11:WU01:FS00:0xa7: SIMD: avx_256
19:15:11:WU01:FS00:0xa7:********************************************************************************
19:15:11:WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 226, Gen 5)
19:15:11:WU01:FS00:0xa7:Unit: 0x00000008287234c95e7923efa2c4aea3
19:15:11:WU01:FS00:0xa7:Reading tar file core.xml
19:15:11:WU01:FS00:0xa7:Reading tar file frame5.tpr
19:15:11:WU01:FS00:0xa7:Digital signatures verified
19:15:11:WU01:FS00:0xa7:Calling: mdrun -s frame5.tpr -o frame5.trr -x frame5.xtc -cpt 15 -nt 96
19:15:11:WU01:FS00:0xa7:Steps: first=2500000 total=500000
19:15:11:WU01:FS00:0xa7:ERROR:
19:15:11:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
19:15:11:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
19:15:11:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
19:15:11:WU01:FS00:0xa7:ERROR:
19:15:11:WU01:FS00:0xa7:ERROR:Fatal error:
19:15:11:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 80 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
19:15:11:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
19:15:11:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
19:15:11:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
19:15:11:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
19:15:11:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
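One thing I noticed: both cores were started with -nt 96, but the errors mention 72 and 80 ranks rather than 96. The small sketch below just does the arithmetic on those numbers. My guess, which nothing in the logs confirms, is that the remaining threads are set aside as separate PME ranks. `prime_factors` is my own helper, not part of any FAH or GROMACS tool.

```python
def prime_factors(n):
    """Return the prime factorization of n as a sorted list, e.g. 12 -> [2, 2, 3]."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


TOTAL_THREADS = 96  # from "-nt 96" in both logs

for reported_ranks in (72, 80):  # rank counts named in the two error messages
    remaining = TOTAL_THREADS - reported_ranks
    print(reported_ranks, prime_factors(reported_ranks), "remaining threads:", remaining)
```

So 72 = 2*2*2*3*3 with 24 threads left over, and 80 = 2*2*2*2*5 with 16 left over. Either way the grid for the reported ranks has to be built from those factors, which may be why the decomposition cannot be fitted into the box.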