Project 16404 (0, 4835, 72) -- no domain decomposition
Posted: Sun Apr 19, 2020 9:55 pm
Hello,
I received a WU that repeatedly fails with the error message reproduced below, which says there is no domain decomposition for 20 ranks that is compatible with the given box.
So, that CPU slot was stuck in a loop, attempting to run the WU, erroring out, and then trying again.
I manually reduced my number of usable threads to 18 and that seems to have gotten the unit running again.
I just wanted to make sure this issue is known. It seems this WU should not have been assigned to my configuration.
Thanks in advance. Details follow.
Machine:
21:36:50:WU01:FS00:Started FahCore on PID 378401
21:36:50:WU01:FS00:Core PID:378405
21:36:50:WU01:FS00:FahCore 0xa7 started
21:36:50:WU01:FS00:0xa7:*********************** Log Started 2020-04-19T21:36:50Z ***********************
21:36:50:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
21:36:50:WU01:FS00:0xa7: Type: 0xa7
21:36:50:WU01:FS00:0xa7: Core: Gromacs
21:36:50:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 705 -lifeline 378401 -checkpoint 15 -np
21:36:50:WU01:FS00:0xa7: 29
21:36:50:WU01:FS00:0xa7:************************************ CBang *************************************
21:36:50:WU01:FS00:0xa7: Date: Nov 5 2019
21:36:50:WU01:FS00:0xa7: Time: 06:06:57
21:36:50:WU01:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
21:36:50:WU01:FS00:0xa7: Branch: master
21:36:50:WU01:FS00:0xa7: Compiler: GNU 8.3.0
21:36:50:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
21:36:50:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
21:36:50:WU01:FS00:0xa7: Bits: 64
21:36:50:WU01:FS00:0xa7: Mode: Release
21:36:50:WU01:FS00:0xa7:************************************ System ************************************
21:36:50:WU01:FS00:0xa7: CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
21:36:50:WU01:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 63 Stepping 2
21:36:50:WU01:FS00:0xa7: CPUs: 32
21:36:50:WU01:FS00:0xa7: Memory: 503.81GiB
21:36:50:WU01:FS00:0xa7:Free Memory: 417.90GiB
21:36:50:WU01:FS00:0xa7: Threads: POSIX_THREADS
21:36:50:WU01:FS00:0xa7: OS Version: 5.5
21:36:50:WU01:FS00:0xa7:Has Battery: false
21:36:50:WU01:FS00:0xa7: On Battery: false
21:36:50:WU01:FS00:0xa7: UTC Offset: -4
21:36:50:WU01:FS00:0xa7: PID: 378405
21:36:50:WU01:FS00:0xa7: CWD: /opt/fah/work
21:36:50:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
21:36:50:WU01:FS00:0xa7: Version: 0.0.18
21:36:50:WU01:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:36:50:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
21:36:50:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
21:36:50:WU01:FS00:0xa7: Date: Nov 5 2019
21:36:50:WU01:FS00:0xa7: Time: 06:13:26
21:36:50:WU01:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
21:36:50:WU01:FS00:0xa7: Branch: master
21:36:50:WU01:FS00:0xa7: Compiler: GNU 8.3.0
21:36:50:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
21:36:50:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
21:36:50:WU01:FS00:0xa7: Bits: 64
21:36:50:WU01:FS00:0xa7: Mode: Release
21:36:50:WU01:FS00:0xa7:************************************ Build *************************************
21:36:50:WU01:FS00:0xa7: SIMD: avx_256
21:36:50:WU01:FS00:0xa7:********************************************************************************
Error Message:
21:36:50:WU01:FS00:0xa7:Project: 16404 (Run 0, Clone 4835, Gen 72)
21:36:50:WU01:FS00:0xa7:Unit: 0x0000004fa8f5c67d5e7eb9072a30cb57
21:36:50:WU01:FS00:0xa7:Reading tar file core.xml
21:36:50:WU01:FS00:0xa7:Reading tar file frame72.tpr
21:36:50:WU01:FS00:0xa7:Digital signatures verified
21:36:50:WU01:FS00:0xa7:Reducing thread count from 29 to 28 to avoid domain decomposition by a prime number > 3
21:36:50:WU01:FS00:0xa7:Calling: mdrun -s frame72.tpr -o frame72.trr -x frame72.xtc -cpt 15 -nt 28
21:36:50:WU01:FS00:0xa7:Steps: first=36000000 total=500000
21:36:50:WU01:FS00:0xa7:ERROR:
21:36:50:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
21:36:50:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
21:36:50:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
21:36:50:WU01:FS00:0xa7:ERROR:
21:36:50:WU01:FS00:0xa7:ERROR:Fatal error:
21:36:50:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
21:36:50:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
21:36:50:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
21:36:50:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
21:36:50:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
21:36:50:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
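For anyone hitting the same error, here is a rough sketch of the constraint the message describes, not the actual GROMACS logic. It assumes a rectangular box split into an nx x ny x nz grid of equal cells, one cell per rank, where every cell edge must be at least the minimum cell size. The function name `feasible_grids` and the 6.0 nm cubic box are made up for illustration; the real box dimensions are not in the log. Only the 1.37225 nm minimum cell size and the rank counts come from the output above. The error reports 20 ranks rather than the 28 threads requested, presumably because some ranks were set aside for other work such as PME, which this sketch does not model.

```python
def feasible_grids(ranks, box_nm, min_cell_nm):
    """Return every (nx, ny, nz) with nx*ny*nz == ranks whose cells
    are at least min_cell_nm long along each box dimension.

    Simplified illustration only: real mdrun also accounts for
    non-rectangular boxes, PME ranks, and load-balancing margins.
    """
    grids = []
    for nx in range(1, ranks + 1):
        if ranks % nx:
            continue
        rest = ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny:
                continue
            nz = rest // ny
            cells = (box_nm[0] / nx, box_nm[1] / ny, box_nm[2] / nz)
            if all(c >= min_cell_nm for c in cells):
                grids.append((nx, ny, nz))
    return grids


# Hypothetical box; the real dimensions are not shown in the log.
BOX_NM = (6.0, 6.0, 6.0)
MIN_CELL_NM = 1.37225  # taken from the error message

for ranks in (20, 18, 16):
    print(ranks, feasible_grids(ranks, BOX_NM, MIN_CELL_NM))
```

With this made-up box, at most 4 cells fit along each dimension, so 20 ranks (which always needs a factor of 5 or more) has no valid grid, while 18 ranks can be arranged as 2 x 3 x 3 and its permutations. That would be consistent with the unit running after the thread count was lowered to 18, though the real box and rank split may differ.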