segfaults: GROMACS "There is no domain decomposition for..."
Posted: Wed Mar 04, 2020 8:54 am
After a few hours of normal operation my fah client segfaults constantly (once every minute).
I wonder why FahCore segfaults. The GROMACS error itself doesn't look that bad: it seems a different argument, such as "-rdd 9", might fix the underlying issue. But do I really need to set this myself? I have no clue why 10 ranks is bad for domain decomposition while a different number might work. Even worse, FAH already changed the initial value from 11 to 10 on its own, so I guess that's the real issue here.
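To make the 11 vs. 10 question concrete, here is a small sketch that lists the prime factors of thread counts around 11. It is only an illustration: `prime_factors` is a helper made up for this post, and I don't know how FahCore actually picks its reduced count.

```python
def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n in ascending order, with repeats."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever is left is itself prime
    return factors


# Thread counts near the 11 that the client passed via -np.
for threads in range(8, 13):
    print(threads, prime_factors(threads))
```

11 is prime, which matches the "prime number > 3" message, but 10 = 2 x 5 still contains the factor 5, so one grid dimension would have to be split into 5 (or 10) cells.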
Here's one error cycle in the log file:
08:21:37:WU01:FS00:Starting
08:21:37:WU01:FS00:Removing old file './work/01/logfile_01-20200304-074936.txt'
08:21:37:WU01:FS00:Running FahCore: /opt/fah/FAHCoreWrapper /var/lib/private/fah/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 15548 -checkpoint 15 -np 11
08:21:37:WU01:FS00:Started FahCore on PID 87796
08:21:37:WU01:FS00:Core PID:87800
08:21:37:WU01:FS00:FahCore 0xa7 started
08:21:37:WU01:FS00:0xa7:*********************** Log Started 2020-03-04T08:21:37Z ***********************
08:21:37:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
08:21:37:WU01:FS00:0xa7: Type: 0xa7
08:21:37:WU01:FS00:0xa7: Core: Gromacs
08:21:37:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 705 -lifeline 87796 -checkpoint 15 -np
08:21:37:WU01:FS00:0xa7: 11
08:21:37:WU01:FS00:0xa7:************************************ CBang *************************************
08:21:37:WU01:FS00:0xa7: Date: Nov 5 2019
08:21:37:WU01:FS00:0xa7: Time: 06:06:57
08:21:37:WU01:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
08:21:37:WU01:FS00:0xa7: Branch: master
08:21:37:WU01:FS00:0xa7: Compiler: GNU 8.3.0
08:21:37:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
08:21:37:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
08:21:37:WU01:FS00:0xa7: Bits: 64
08:21:37:WU01:FS00:0xa7: Mode: Release
08:21:37:WU01:FS00:0xa7:************************************ System ************************************
08:21:37:WU01:FS00:0xa7: CPU: AMD Ryzen 5 1600X Six-Core Processor
08:21:37:WU01:FS00:0xa7: CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
08:21:37:WU01:FS00:0xa7: CPUs: 12
08:21:37:WU01:FS00:0xa7: Memory: 15.65GiB
08:21:37:WU01:FS00:0xa7:Free Memory: 9.72GiB
08:21:37:WU01:FS00:0xa7: Threads: POSIX_THREADS
08:21:37:WU01:FS00:0xa7: OS Version: 5.5
08:21:37:WU01:FS00:0xa7:Has Battery: false
08:21:37:WU01:FS00:0xa7: On Battery: false
08:21:37:WU01:FS00:0xa7: UTC Offset: 1
08:21:37:WU01:FS00:0xa7: PID: 87800
08:21:37:WU01:FS00:0xa7: CWD: /var/lib/private/fah/work
08:21:37:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
08:21:37:WU01:FS00:0xa7: Version: 0.0.18
08:21:37:WU01:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:21:37:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
08:21:37:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
08:21:37:WU01:FS00:0xa7: Date: Nov 5 2019
08:21:37:WU01:FS00:0xa7: Time: 06:13:26
08:21:37:WU01:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
08:21:37:WU01:FS00:0xa7: Branch: master
08:21:37:WU01:FS00:0xa7: Compiler: GNU 8.3.0
08:21:37:WU01:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
08:21:37:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
08:21:37:WU01:FS00:0xa7: Bits: 64
08:21:37:WU01:FS00:0xa7: Mode: Release
08:21:37:WU01:FS00:0xa7:************************************ Build *************************************
08:21:37:WU01:FS00:0xa7: SIMD: avx_256
08:21:37:WU01:FS00:0xa7:********************************************************************************
08:21:37:WU01:FS00:0xa7:Project: 14318 (Run 2, Clone 39, Gen 14)
08:21:37:WU01:FS00:0xa7:Unit: 0x000000160002894b5df7b4e0fec692f3
08:21:37:WU01:FS00:0xa7:Reading tar file core.xml
08:21:37:WU01:FS00:0xa7:Reading tar file frame14.tpr
08:21:37:WU01:FS00:0xa7:Digital signatures verified
08:21:37:WU01:FS00:0xa7:Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3
08:21:37:WU01:FS00:0xa7:Calling: mdrun -s frame14.tpr -o frame14.trr -cpt 15 -nt 10
08:21:37:WU01:FS00:0xa7:Steps: first=7000000 total=500000
08:21:37:WU01:FS00:0xa7:ERROR:
08:21:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
08:21:37:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:21:37:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
08:21:37:WU01:FS00:0xa7:ERROR:
08:21:37:WU01:FS00:0xa7:ERROR:Fatal error:
08:21:37:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.61758 nm
08:21:37:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rdd or -dds
08:21:37:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:21:37:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:21:37:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:21:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
08:21:42:WU01:FS00:0xa7:WARNING:Unexpected exit() call
08:21:42:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
08:21:42:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
08:21:42:WU01:FS00:0xa7:Saving result file md.log
08:21:42:WU01:FS00:0xa7:Saving result file science.log
08:21:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
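As a rough way to see why the error complains about 10 ranks specifically, the sketch below enumerates the nx x ny x nz grids whose cells stay at or above the minimum cell size from the error message. The box size is an assumption (the log does not show it; the real one is in the .tpr / md.log), `feasible_grids` is a made-up helper, and the real GROMACS search considers more than this (non-rectangular boxes, separate PME ranks, the -dds margin), so treat it as an illustration only.

```python
import math
from itertools import product

MIN_CELL_NM = 1.61758      # minimum cell size quoted in the GROMACS error
BOX_NM = (6.0, 6.0, 6.0)   # ASSUMPTION: hypothetical rectangular box, not from the log


def feasible_grids(ranks, box=BOX_NM, min_cell=MIN_CELL_NM):
    """List (nx, ny, nz) grids with nx*ny*nz == ranks and every cell edge >= min_cell."""
    # Largest number of cells that fit along each box edge.
    max_cells = [max(1, math.floor(edge / min_cell)) for edge in box]
    return [
        grid
        for grid in product(*(range(1, m + 1) for m in max_cells))
        if grid[0] * grid[1] * grid[2] == ranks
    ]


for ranks in (8, 9, 10, 11, 12):
    print(ranks, feasible_grids(ranks))
```

With the assumed 6 nm box only 3 cells fit along each edge, so 10 and 11 ranks have no valid grid while 8, 9 and 12 do. A different real box would change which counts work, but the pattern (counts with a large prime factor are the hard ones) would stay the same.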