segfaults: GROMACS "There is no domain decomposition for..."

Posted: Wed Mar 04, 2020 8:54 am
by thorstenhirsch
After a few hours of normal operation my fah client segfaults constantly (once every minute).
Here's one error cycle in the log file:

08:21:37:WU01:FS00:Starting
08:21:37:WU01:FS00:Removing old file './work/01/logfile_01-20200304-074936.txt'
08:21:37:WU01:FS00:Running FahCore: /opt/fah/FAHCoreWrapper /var/lib/private/fah/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 15548 -checkpoint 15 -np 11
08:21:37:WU01:FS00:Started FahCore on PID 87796
08:21:37:WU01:FS00:Core PID:87800
08:21:37:WU01:FS00:FahCore 0xa7 started
08:21:37:WU01:FS00:0xa7:*********************** Log Started 2020-03-04T08:21:37Z ***********************
08:21:37:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
08:21:37:WU01:FS00:0xa7:       Type: 0xa7
08:21:37:WU01:FS00:0xa7:       Core: Gromacs
08:21:37:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 87796 -checkpoint 15 -np
08:21:37:WU01:FS00:0xa7:             11
08:21:37:WU01:FS00:0xa7:************************************ CBang *************************************
08:21:37:WU01:FS00:0xa7:       Date: Nov 5 2019
08:21:37:WU01:FS00:0xa7:       Time: 06:06:57
08:21:37:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
08:21:37:WU01:FS00:0xa7:     Branch: master
08:21:37:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
08:21:37:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
08:21:37:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:21:37:WU01:FS00:0xa7:       Bits: 64
08:21:37:WU01:FS00:0xa7:       Mode: Release
08:21:37:WU01:FS00:0xa7:************************************ System ************************************
08:21:37:WU01:FS00:0xa7:        CPU: AMD Ryzen 5 1600X Six-Core Processor
08:21:37:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 1 Stepping 1
08:21:37:WU01:FS00:0xa7:       CPUs: 12
08:21:37:WU01:FS00:0xa7:     Memory: 15.65GiB
08:21:37:WU01:FS00:0xa7:Free Memory: 9.72GiB
08:21:37:WU01:FS00:0xa7:    Threads: POSIX_THREADS
08:21:37:WU01:FS00:0xa7: OS Version: 5.5
08:21:37:WU01:FS00:0xa7:Has Battery: false
08:21:37:WU01:FS00:0xa7: On Battery: false
08:21:37:WU01:FS00:0xa7: UTC Offset: 1
08:21:37:WU01:FS00:0xa7:        PID: 87800
08:21:37:WU01:FS00:0xa7:        CWD: /var/lib/private/fah/work
08:21:37:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
08:21:37:WU01:FS00:0xa7:    Version: 0.0.18
08:21:37:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:21:37:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
08:21:37:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
08:21:37:WU01:FS00:0xa7:       Date: Nov 5 2019
08:21:37:WU01:FS00:0xa7:       Time: 06:13:26
08:21:37:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
08:21:37:WU01:FS00:0xa7:     Branch: master
08:21:37:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
08:21:37:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
08:21:37:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:21:37:WU01:FS00:0xa7:       Bits: 64
08:21:37:WU01:FS00:0xa7:       Mode: Release
08:21:37:WU01:FS00:0xa7:************************************ Build *************************************
08:21:37:WU01:FS00:0xa7:       SIMD: avx_256
08:21:37:WU01:FS00:0xa7:********************************************************************************
08:21:37:WU01:FS00:0xa7:Project: 14318 (Run 2, Clone 39, Gen 14)
08:21:37:WU01:FS00:0xa7:Unit: 0x000000160002894b5df7b4e0fec692f3
08:21:37:WU01:FS00:0xa7:Reading tar file core.xml
08:21:37:WU01:FS00:0xa7:Reading tar file frame14.tpr
08:21:37:WU01:FS00:0xa7:Digital signatures verified
08:21:37:WU01:FS00:0xa7:Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3
08:21:37:WU01:FS00:0xa7:Calling: mdrun -s frame14.tpr -o frame14.trr -cpt 15 -nt 10
08:21:37:WU01:FS00:0xa7:Steps: first=7000000 total=500000
08:21:37:WU01:FS00:0xa7:ERROR:
08:21:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
08:21:37:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:21:37:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
08:21:37:WU01:FS00:0xa7:ERROR:
08:21:37:WU01:FS00:0xa7:ERROR:Fatal error:
08:21:37:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.61758 nm
08:21:37:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rdd or -dds
08:21:37:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:21:37:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:21:37:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:21:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
08:21:42:WU01:FS00:0xa7:WARNING:Unexpected exit() call
08:21:42:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
08:21:42:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
08:21:42:WU01:FS00:0xa7:Saving result file md.log
08:21:42:WU01:FS00:0xa7:Saving result file science.log
08:21:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
I wonder why FahCore segfaults. The GROMACS error doesn't look so bad after all; it seems a different argument such as "-rdd 9" might fix the underlying issue. But do I really need to set this myself? I have no clue why 10 ranks is bad for domain decomposition while a different number might work. Even worse, fah itself already changed the initial value from 11 to 10, so I guess that's the real issue here.

Re: segfaults: GROMACS "There is no domain decomposition for

Posted: Wed Mar 04, 2020 2:32 pm
by bruce
Gromacs has always had difficulties when the number of threads contains prime factors larger than 3. The FAHCore code adjusts the number of threads to avoid things like -nt 11. The factor 5 is often acceptable and isn't always avoided. (Yes, 10 contains the factors 2 and 5.)
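
The adjustment described above can be sketched roughly as follows. This is only an illustration, not the actual FAHCore code: the function names and the `max_prime` parameter are made up for this example. It steps the thread count down until its largest prime factor is acceptable. Allowing 5 turns 11 into 10, as in the log; allowing only 2 and 3 turns 11 into 9.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (1 for n <= 1)."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:
        # Whatever is left after trial division is itself prime.
        largest = n
    return largest


def reduce_thread_count(requested: int, max_prime: int = 5) -> int:
    """Hypothetical helper: lower the thread count until no prime factor
    exceeds max_prime. Not taken from the FAHCore source."""
    count = requested
    while count > 1 and largest_prime_factor(count) > max_prime:
        count -= 1
    return count


if __name__ == "__main__":
    print(reduce_thread_count(11))               # factor 5 tolerated
    print(reduce_thread_count(11, max_prime=3))  # only factors 2 and 3
```

With the stricter limit of 3, the counts up to 12 that survive unchanged are 1, 2, 3, 4, 6, 8, 9 and 12.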

Re: segfaults: GROMACS "There is no domain decomposition for

Posted: Wed Mar 04, 2020 2:33 pm
by JimboPalmer
I see you claiming it seg faults, but I missed any hint in the log that it did.

Are you using "seg fault" in some new sense, meaning that the program decided not to complete, as opposed to its traditional meaning of a segmentation fault (i.e. a memory access violation, https://en.wikipedia.org/wiki/Segmentation_fault)?

Your log reports 12 CPUs (a six-core Ryzen 5 1600X with 12 hardware threads) and you are using 11 of them. (Perhaps you have a video card, but you did not give us the first 200 lines of the log, where we could see your configuration and make better guesses.)
The program recognizes that 11 is bad and steps down to 10 CPUs, but the Work Unit dislikes 10 (2 x 5) and halts. You could try 12 or 9. Since I don't know why we are not already at 12, I would guess 9 CPUs is safe.

Link to how to post a log: viewtopic.php?f=24&t=26036

Re: segfaults: GROMACS "There is no domain decomposition for

Posted: Wed Mar 04, 2020 2:48 pm
by bruce
The error message "ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.61758 nm" can be resolved if the project owner adjusts the MinCellSize or the BoxSize. Unfortunately, you can't do anything about it, so a new WU needs to be assigned.

This is not a Segfault.
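
To illustrate why a rank count can fail against a given box and minimum cell size, here is a minimal sketch. It assumes a rectangular box and ignores everything else mdrun takes into account. The box dimensions below are invented for the example, since the real box of this WU is not shown in the log; only the minimum cell size of 1.61758 nm comes from the error message. The function name `valid_grids` is likewise hypothetical.

```python
from itertools import product


def valid_grids(ranks, box_nm, min_cell_nm):
    """Return every (nx, ny, nz) with nx * ny * nz == ranks whose cells are
    at least min_cell_nm wide along each axis of a rectangular box."""
    # Largest number of cells that fits along each axis (at least one).
    max_cells = [max(1, int(length // min_cell_nm)) for length in box_nm]
    grids = []
    for nx, ny in product(range(1, max_cells[0] + 1),
                          range(1, max_cells[1] + 1)):
        if ranks % (nx * ny):
            continue  # nx * ny does not divide the rank count
        nz = ranks // (nx * ny)
        if nz <= max_cells[2]:
            grids.append((nx, ny, nz))
    return grids


if __name__ == "__main__":
    box = (6.0, 6.0, 6.0)  # assumed box in nm, NOT taken from the WU
    for ranks in (9, 10, 12):
        print(ranks, valid_grids(ranks, box, 1.61758))
```

In this assumed box at most three cells fit along each axis, so 10 ranks would need five cells along one axis and no grid exists, whereas 9 (3 x 3 x 1) and 12 (2 x 2 x 3) both fit.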

Re: segfaults: GROMACS "There is no domain decomposition for

Posted: Wed Mar 04, 2020 5:56 pm
by thorstenhirsch
Thanks for your answers, guys. The segfault can be seen in dmesg. And in journalctl I can see both the fah log and the core dumps. This is what it looks like:

Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:Steps: first=7000000 total=500000
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/>
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:Fatal error:
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box>
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rdd or -dds
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
Mär 04 03:27:24 XXX FAHClient[15548]: 02:27:24:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
Mär 04 03:27:29 XXX kernel: FahCore_a7[31032]: segfault at 70 ip 000000000120aa3d sp 00007fff4fff3380 error 4 in FahCore_a7[406000+10cc000]
Mär 04 03:27:29 XXX kernel: Code: 73 08 0f 84 83 00 00 00 48 c7 44 24 30 00 00 00 00 4c 8d 74 24 20 4c 89 74 24 08 f3 0f 7e 64 24 08 66 0f 6c e4 0f 29 >
Mär 04 03:27:29 XXX systemd[1]: Created slice system-systemd\x2dcoredump.slice.
Mär 04 03:27:29 XXX systemd[1]: Started Process Core Dump (PID 31058/UID 0).
Mär 04 03:27:29 XXX FAHClient[15548]: 02:27:29:WU01:FS00:0xa7:WARNING:Unexpected exit() call
Mär 04 03:27:29 XXX FAHClient[15548]: 02:27:29:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
Mär 04 03:27:29 XXX FAHClient[15548]: 02:27:29:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
Mär 04 03:27:29 XXX FAHClient[15548]: 02:27:29:WU01:FS00:0xa7:Saving result file md.log
Mär 04 03:27:29 XXX FAHClient[15548]: 02:27:29:WU01:FS00:0xa7:Saving result file science.log
Mär 04 03:27:29 XXX systemd-coredump[31059]: Process 31032 (FahCore_a7) of user 62464 dumped core.
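
For reading the kernel line in the journal output above, a small sketch like the following may help. It assumes the usual x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode); the regular expression and the function name `decode_segfault` are written for this example only. For the line above it reports a user-mode read of an unmapped page at address 0x70, which would be consistent with an access through a null or near-null pointer, though the log alone does not prove that.

```python
import re

# Matches the kernel's "segfault at ..." message; all numbers are hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Decode a kernel segfault log line into a dict, or return None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    info = {
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        "access": "write" if err & 0x2 else "read",
        "mode": "user" if err & 0x4 else "kernel",
        "page_present": bool(err & 0x1),
    }
    if m["base"] is not None:
        info["object"] = m["obj"]
        # Offset of the instruction pointer inside the mapped region.
        info["offset_in_object"] = int(m["ip"], 16) - int(m["base"], 16)
    return info


if __name__ == "__main__":
    line = ("kernel: FahCore_a7[31032]: segfault at 70 ip 000000000120aa3d "
            "sp 00007fff4fff3380 error 4 in FahCore_a7[406000+10cc000]")
    for key, value in decode_segfault(line).items():
        print(key, hex(value) if key in ("fault_address", "offset_in_object") else value)
```

The crash is logged five seconds after the GROMACS fatal error, so it may well occur while the core shuts down after that error rather than during the simulation itself.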