
project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 cores

Posted: Wed Aug 05, 2020 6:31 pm
by UofM.MartinK
I found a 16-CPU-core slot on a 3900X Linux machine that had been stuck in the following endless loop for days:

Code:

15:49:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:50:24:WU01:FS00:Starting
15:50:24:WU01:FS00:Removing old file 'work/01/logfile_01-20200805-151402.txt'
15:50:24:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 706 -lifeline 1623069 -checkpoint 15 -np 16
15:50:24:WU01:FS00:Started FahCore on PID 1626543
15:50:24:WU01:FS00:Core PID:1626547
15:50:24:WU01:FS00:FahCore 0xa7 started
15:50:25:WU01:FS00:0xa7:*********************** Log Started 2020-08-05T15:50:24Z ***********************
15:50:25:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
15:50:25:WU01:FS00:0xa7:       Type: 0xa7
15:50:25:WU01:FS00:0xa7:       Core: Gromacs
15:50:25:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 706 -lifeline 1626543 -checkpoint 15
15:50:25:WU01:FS00:0xa7:             -np 16
15:50:25:WU01:FS00:0xa7:************************************ CBang *************************************
15:50:25:WU01:FS00:0xa7:       Date: Nov 27 2019
15:50:25:WU01:FS00:0xa7:       Time: 11:26:54
15:50:25:WU01:FS00:0xa7:   Revision: d25803215b59272441049dfa05a0a9bf7a6e3c48
15:50:25:WU01:FS00:0xa7:     Branch: master
15:50:25:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
15:50:25:WU01:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:50:25:WU01:FS00:0xa7:             -fno-pie -fPIC
15:50:25:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:50:25:WU01:FS00:0xa7:       Bits: 64
15:50:25:WU01:FS00:0xa7:       Mode: Release
15:50:25:WU01:FS00:0xa7:************************************ System ************************************
15:50:25:WU01:FS00:0xa7:        CPU: AMD Ryzen 9 3900X 12-Core Processor
15:50:25:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
15:50:25:WU01:FS00:0xa7:       CPUs: 24
15:50:25:WU01:FS00:0xa7:     Memory: 31.30GiB
15:50:25:WU01:FS00:0xa7:Free Memory: 5.31GiB
15:50:25:WU01:FS00:0xa7:    Threads: POSIX_THREADS
15:50:25:WU01:FS00:0xa7: OS Version: 5.4
15:50:25:WU01:FS00:0xa7:Has Battery: false
15:50:25:WU01:FS00:0xa7: On Battery: false
15:50:25:WU01:FS00:0xa7: UTC Offset: -4
15:50:25:WU01:FS00:0xa7:        PID: 1626547
15:50:25:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
15:50:25:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
15:50:25:WU01:FS00:0xa7:    Version: 0.0.19
15:50:25:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:50:25:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
15:50:25:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
15:50:25:WU01:FS00:0xa7:       Date: Nov 26 2019
15:50:25:WU01:FS00:0xa7:       Time: 00:41:42
15:50:25:WU01:FS00:0xa7:   Revision: d5b5c747532224f986b7cd02c968ed9a20c16d6e
15:50:25:WU01:FS00:0xa7:     Branch: master
15:50:25:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
15:50:25:WU01:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:50:25:WU01:FS00:0xa7:             -fno-pie
15:50:25:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:50:25:WU01:FS00:0xa7:       Bits: 64
15:50:25:WU01:FS00:0xa7:       Mode: Release
15:50:25:WU01:FS00:0xa7:************************************ Build *************************************
15:50:25:WU01:FS00:0xa7:       SIMD: avx_256
15:50:25:WU01:FS00:0xa7:********************************************************************************
15:50:25:WU01:FS00:0xa7:Project: 14217 (Run 1724, Clone 3, Gen 0)
15:50:25:WU01:FS00:0xa7:Unit: 0x00000004cedfaa925eab742a5d3e4286
15:50:25:WU01:FS00:0xa7:Digital signatures verified
15:50:25:WU01:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -x frame0.xtc -cpt 15 -nt 16
15:50:25:WU01:FS00:0xa7:Steps: first=0 total=62500
15:50:27:WU01:FS00:0xa7:Completed 1 out of 62500 steps (0%)
15:50:32:WU01:FS00:0xa7:ERROR:
15:50:32:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:32:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:50:32:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:50:32:WU01:FS00:0xa7:ERROR:
15:50:32:WU01:FS00:0xa7:ERROR:Fatal error:
15:50:32:WU01:FS00:0xa7:ERROR:22 particles communicated to PME rank 11 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension y.
15:50:32:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:50:32:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:50:32:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:50:32:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:37:WU01:FS00:0xa7:ERROR:
15:50:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:37:WU01:FS00:0xa7:ERROR:
15:50:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:37:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:50:37:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:50:37:WU01:FS00:0xa7:ERROR:
15:50:37:WU01:FS00:0xa7:ERROR:Fatal error:
15:50:37:WU01:FS00:0xa7:ERROR:59 particles communicated to PME rank 10 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
15:50:37:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:50:37:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:50:37:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:50:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:37:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:50:37:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:50:37:WU01:FS00:0xa7:ERROR:
15:50:37:WU01:FS00:0xa7:ERROR:Fatal error:
15:50:37:WU01:FS00:0xa7:ERROR:16 particles communicated to PME rank 11 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
15:50:37:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:50:37:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:50:37:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:50:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
The only variation across the hundreds of errors was the values of X and Y in
FS00:0xa7:ERROR:X particles communicated to PME rank Y are more than 2/3 times ... ,
always resulting in FahCore returned: INTERRUPTED (102 = 0x66)
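
To compare the varying values across many restarts, the error lines can be pulled out of the log with a regular expression. The following is a minimal sketch, assuming the log lines look exactly like the excerpt above; the names `PME_ERROR` and `parse_pme_errors` are made up for illustration and are not part of FAHClient or GROMACS.

```python
import re

# Matches the GROMACS PME error line as it appears in the log excerpt above.
# The pattern is an assumption based on that excerpt only.
PME_ERROR = re.compile(
    r"ERROR:(?P<particles>\d+) particles communicated to PME rank (?P<rank>\d+) "
    r"are more than 2/3 times the cut-off .* in dimension (?P<dim>[xyz])\."
)


def parse_pme_errors(lines):
    """Return a list of (particles, rank, dimension) for each PME error line."""
    found = []
    for line in lines:
        match = PME_ERROR.search(line)
        if match:
            found.append(
                (int(match["particles"]), int(match["rank"]), match["dim"])
            )
    return found


sample = [
    "15:50:32:WU01:FS00:0xa7:ERROR:22 particles communicated to PME rank 11 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension y.",
    "15:50:32:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.",
    "15:50:37:WU01:FS00:0xa7:ERROR:59 particles communicated to PME rank 10 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.",
]

print(parse_pme_errors(sample))  # [(22, 11, 'y'), (59, 10, 'x')]
```

In a real script, `sample` would be replaced by the lines of the client log file.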

I reduced the core count to 12, and the resulting errors varied a bit:

Code:

15:58:25:WU01:FS00:FahCore 0xa7 started
15:58:25:WU01:FS00:0xa7:*********************** Log Started 2020-08-05T15:58:25Z ***********************
15:58:25:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
15:58:25:WU01:FS00:0xa7:       Type: 0xa7
15:58:25:WU01:FS00:0xa7:       Core: Gromacs
15:58:25:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 706 -lifeline 1627903 -checkpoint 15
15:58:25:WU01:FS00:0xa7:             -np 12
15:58:25:WU01:FS00:0xa7:************************************ CBang *************************************
15:58:25:WU01:FS00:0xa7:       Date: Nov 27 2019
15:58:25:WU01:FS00:0xa7:       Time: 11:26:54
15:58:25:WU01:FS00:0xa7:   Revision: d25803215b59272441049dfa05a0a9bf7a6e3c48
15:58:25:WU01:FS00:0xa7:     Branch: master
15:58:25:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
15:58:25:WU01:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:58:25:WU01:FS00:0xa7:             -fno-pie -fPIC
15:58:25:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:58:25:WU01:FS00:0xa7:       Bits: 64
15:58:25:WU01:FS00:0xa7:       Mode: Release
15:58:25:WU01:FS00:0xa7:************************************ System ************************************
15:58:25:WU01:FS00:0xa7:        CPU: AMD Ryzen 9 3900X 12-Core Processor
15:58:25:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
15:58:25:WU01:FS00:0xa7:       CPUs: 24
15:58:25:WU01:FS00:0xa7:     Memory: 31.30GiB
15:58:25:WU01:FS00:0xa7:Free Memory: 5.25GiB
15:58:25:WU01:FS00:0xa7:    Threads: POSIX_THREADS
15:58:25:WU01:FS00:0xa7: OS Version: 5.4
15:58:25:WU01:FS00:0xa7:Has Battery: false
15:58:25:WU01:FS00:0xa7: On Battery: false
15:58:25:WU01:FS00:0xa7: UTC Offset: -4
15:58:25:WU01:FS00:0xa7:        PID: 1627907
15:58:25:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
15:58:25:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
15:58:25:WU01:FS00:0xa7:    Version: 0.0.19
15:58:25:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:58:25:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
15:58:25:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
15:58:25:WU01:FS00:0xa7:       Date: Nov 26 2019
15:58:25:WU01:FS00:0xa7:       Time: 00:41:42
15:58:25:WU01:FS00:0xa7:   Revision: d5b5c747532224f986b7cd02c968ed9a20c16d6e
15:58:25:WU01:FS00:0xa7:     Branch: master
15:58:25:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
15:58:25:WU01:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:58:25:WU01:FS00:0xa7:             -fno-pie
15:58:25:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:58:25:WU01:FS00:0xa7:       Bits: 64
15:58:25:WU01:FS00:0xa7:       Mode: Release
15:58:25:WU01:FS00:0xa7:************************************ Build *************************************
15:58:25:WU01:FS00:0xa7:       SIMD: avx_256
15:58:25:WU01:FS00:0xa7:********************************************************************************
15:58:25:WU01:FS00:0xa7:Project: 14217 (Run 1724, Clone 3, Gen 0)
15:58:25:WU01:FS00:0xa7:Unit: 0x00000004cedfaa925eab742a5d3e4286
15:58:25:WU01:FS00:0xa7:Digital signatures verified
15:58:25:WU01:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -x frame0.xtc -cpt 15 -nt 12
15:58:25:WU01:FS00:0xa7:Steps: first=0 total=62500
15:58:29:WU01:FS00:0xa7:Completed 1 out of 62500 steps (0%)
15:58:36:WU01:FS00:0xa7:ERROR:
15:58:36:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:36:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:58:36:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:58:36:WU01:FS00:0xa7:ERROR:
15:58:36:WU01:FS00:0xa7:ERROR:Fatal error:
15:58:36:WU01:FS00:0xa7:ERROR:
15:58:36:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:36:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:58:36:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:58:36:WU01:FS00:0xa7:ERROR:
15:58:36:WU01:FS00:0xa7:ERROR:Fatal error:
15:58:36:WU01:FS00:0xa7:ERROR:4 particles communicated to PME rank 3 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
15:58:36:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:58:36:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:58:36:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:58:36:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:36:WU01:FS00:0xa7:ERROR:2 particles communicated to PME rank 2 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
15:58:36:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:58:36:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:58:36:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:58:36:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:41:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:58:55:WU02:FS01:0xa7:Completed 57500 out of 250000 steps (23%)
15:59:25:WU01:FS00:Starting
15:59:25:WU01:FS00:Removing old file 'work/01/logfile_01-20200805-152202.txt'
15:59:25:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 706 -lifeline 1626803 -checkpoint 15 -np 12
15:59:25:WU01:FS00:Started FahCore on PID 1627925
15:59:25:WU01:FS00:Core PID:1627929
15:59:25:WU01:FS00:FahCore 0xa7 started
15:59:25:WU01:FS00:0xa7:*********************** Log Started 2020-08-05T15:59:25Z ***********************
15:59:25:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
15:59:25:WU01:FS00:0xa7:       Type: 0xa7
15:59:25:WU01:FS00:0xa7:       Core: Gromacs
15:59:25:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 706 -lifeline 1627925 -checkpoint 15
15:59:25:WU01:FS00:0xa7:             -np 12
15:59:25:WU01:FS00:0xa7:************************************ CBang *************************************
15:59:25:WU01:FS00:0xa7:       Date: Nov 27 2019
15:59:25:WU01:FS00:0xa7:       Time: 11:26:54
15:59:25:WU01:FS00:0xa7:   Revision: d25803215b59272441049dfa05a0a9bf7a6e3c48
15:59:25:WU01:FS00:0xa7:     Branch: master
15:59:25:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
15:59:25:WU01:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:59:25:WU01:FS00:0xa7:             -fno-pie -fPIC
15:59:25:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:59:25:WU01:FS00:0xa7:       Bits: 64
15:59:25:WU01:FS00:0xa7:       Mode: Release
15:59:25:WU01:FS00:0xa7:************************************ System ************************************
15:59:25:WU01:FS00:0xa7:        CPU: AMD Ryzen 9 3900X 12-Core Processor
15:59:25:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
15:59:25:WU01:FS00:0xa7:       CPUs: 24
15:59:25:WU01:FS00:0xa7:     Memory: 31.30GiB
15:59:25:WU01:FS00:0xa7:Free Memory: 5.24GiB
15:59:25:WU01:FS00:0xa7:    Threads: POSIX_THREADS
15:59:25:WU01:FS00:0xa7: OS Version: 5.4
15:59:25:WU01:FS00:0xa7:Has Battery: false
15:59:25:WU01:FS00:0xa7: On Battery: false
15:59:25:WU01:FS00:0xa7: UTC Offset: -4
15:59:25:WU01:FS00:0xa7:        PID: 1627929
15:59:25:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
15:59:25:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
15:59:25:WU01:FS00:0xa7:    Version: 0.0.19
15:59:25:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:59:25:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
15:59:25:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
15:59:25:WU01:FS00:0xa7:       Date: Nov 26 2019
15:59:25:WU01:FS00:0xa7:       Time: 00:41:42
15:59:25:WU01:FS00:0xa7:   Revision: d5b5c747532224f986b7cd02c968ed9a20c16d6e
15:59:25:WU01:FS00:0xa7:     Branch: master
15:59:25:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
15:59:25:WU01:FS00:0xa7:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:59:25:WU01:FS00:0xa7:             -fno-pie
15:59:25:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:59:25:WU01:FS00:0xa7:       Bits: 64
15:59:25:WU01:FS00:0xa7:       Mode: Release
15:59:25:WU01:FS00:0xa7:************************************ Build *************************************
15:59:25:WU01:FS00:0xa7:       SIMD: avx_256
15:59:25:WU01:FS00:0xa7:********************************************************************************
15:59:25:WU01:FS00:0xa7:Project: 14217 (Run 1724, Clone 3, Gen 0)
15:59:25:WU01:FS00:0xa7:Unit: 0x00000004cedfaa925eab742a5d3e4286
15:59:25:WU01:FS00:0xa7:Digital signatures verified
15:59:25:WU01:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -x frame0.xtc -cpt 15 -nt 12
15:59:25:WU01:FS00:0xa7:Steps: first=0 total=62500
15:59:29:WU01:FS00:0xa7:Completed 1 out of 62500 steps (0%)
15:59:35:WU01:FS00:0xa7:ERROR:
15:59:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:35:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:35:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:35:WU01:FS00:0xa7:ERROR:
15:59:35:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:35:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:35:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:35:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:35:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:35:WU01:FS00:0xa7:ERROR:
15:59:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:35:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:35:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:35:WU01:FS00:0xa7:ERROR:
15:59:35:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:35:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:35:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:35:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:35:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:40:WU01:FS00:0xa7:ERROR:
15:59:40:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:40:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:40:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:40:WU01:FS00:0xa7:ERROR:
15:59:40:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:40:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:40:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:40:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:40:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:40:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:40:WU01:FS00:0xa7:ERROR:
15:59:40:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:40:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:40:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:40:WU01:FS00:0xa7:ERROR:
15:59:40:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:40:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:40:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:40:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:40:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:40:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:45:WU01:FS00:0xa7:ERROR:
15:59:45:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:45:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:45:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:45:WU01:FS00:0xa7:ERROR:
15:59:45:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:45:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:45:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:45:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:45:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:45:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:45:WU01:FS00:0xa7:WARNING:Unexpected exit
15:59:45:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
15:59:50:WU02:FS01:0xa7:Completed 60000 out of 250000 steps (24%)
16:00:25:WU01:FS00:Starting
16:00:25:WU01:FS00:Removing old file 'work/01/logfile_01-20200805-152302.txt'
16:00:25:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 706 -lifeline 1626803 -checkpoint 15 -np 12
16:00:25:WU01:FS00:Started FahCore on PID 1628867
16:00:25:WU01:FS00:Core PID:1628871
16:00:25:WU01:FS00:FahCore 0xa7 started
So now, GROMACS sometimes failed with the usual
FS00:0xa7:ERROR:X particles communicated to PME rank Y are more than 2/3 times ... ,
but also sometimes with
An atom moved too far between two domain decomposition steps
which, fortunately, results in FahCore returned: EARLY_UNIT_END (123 = 0x7b)

The client registered 10 "EARLY_UNIT_END" results within 15 attempts:

Code:

15:53:25:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:53:45:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:54:46:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:55:41:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
15:56:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
15:57:51:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
15:58:41:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:59:45:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:00:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:01:46:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
16:02:41:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:03:51:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:04:41:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:05:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:06:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:06:40:WARNING:WU01:FS00:Too many errors, failing
And so this odyssey finally came to an end.
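
The count above can be reproduced by tallying the "FahCore returned" lines. This is a minimal sketch, assuming the line format shown in the excerpt; `RETURNED` and `tally_returns` are illustrative names, and the sample uses only a few of the lines above.

```python
import re
from collections import Counter

# Matches e.g. "FahCore returned: INTERRUPTED (102 = 0x66)" as seen in the log.
RETURNED = re.compile(
    r"FahCore returned: (?P<name>[A-Z_]+) \((?P<code>\d+) = 0x[0-9a-f]+\)"
)


def tally_returns(lines):
    """Count how often each FahCore return name occurs in the given lines."""
    counts = Counter()
    for line in lines:
        match = RETURNED.search(line)
        if match:
            counts[match["name"]] += 1
    return counts


sample = [
    "15:58:41:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "15:59:45:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)",
    "16:00:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)",
    "16:01:46:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
]

print(tally_returns(sample))
```

Run over the full 15-line excerpt, the same function should report 10 EARLY_UNIT_END and 5 INTERRUPTED results.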

The faulty unit was sent back 2+ hours ago:

Code:

16:06:40:WU01:FS00:Sending unit results: id:01 state:SEND error:FAILED project:14217 run:1724 clone:3 gen:0 core:0xa7 unit:0x00000004cedfaa925eab742a5d3e4286
16:06:40:WU01:FS00:Connecting to 206.223.170.146:8080
16:06:40:WU01:FS00:Server responded WORK_ACK (400)
But the WU is not reported here yet:

https://apps.foldingathome.org/wu#proje ... ne=3&gen=0

Code:

User       Team    CPUID             Credit  Assigned             Returned             Credited             Days   Code
APC2020    244369  D390AB5E44FC89F3  2.06    2020-07-27 06:33:20  2020-07-27 18:18:43  2020-07-27 06:37:43  0.003  Faulty
Anonymous  0       DBAEAC5EC838FAB5  4.01    2020-07-27 06:37:50  2020-07-27 18:18:46  2020-07-27 06:44:00  0.004  Faulty 2
Only two older attempts showed up as of submitting this post, and they are pretty dated as well.

I hope this WU didn't stall other slots for days (as it did for me), so another reason for reporting it here.

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Wed Aug 05, 2020 8:17 pm
by Joe_H
UofM.MartinK wrote:I hope this WU didn't stall other slots for days (as it did for me), so another reason for reporting it here.
The continuous cycling is an unfortunate bug in the Linux version of the client or folding core. The wrong level of error is being detected, so instead of stopping processing of a WU after several tries and reporting it as faulty, it just keeps cycling through. Not sure when it will get fixed. The check for this does work in Windows.

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Thu Aug 06, 2020 12:42 am
by anandhanju
Thanks for reporting this UofM.MartinK. The faulty unit has been flagged to the researchers for review and action.

The stats for the server that hosts this project are delayed and a few days out of sync. See viewtopic.php?f=18&t=35864#p340220. The necessary folks are aware of this and are looking into it.

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Thu Aug 06, 2020 1:36 am
by UofM.MartinK
Joe_H wrote: The continuous cycling is an unfortunate bug in the Linux version of the client or folding core. The wrong level of error is being detected, so instead of stopping processing of a WU after several tries and reporting it as faulty, it just keeps cycling through.
That's what I suspected, thanks for the confirmation! I might include some rules in the automated log analysis to learn about this type of glitch earlier, not after almost a week :)
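
One possible rule for such automated log analysis is sketched below: flag a work unit once it has returned a suspect code several times in a row without any progress beyond 0%. This is only a sketch under stated assumptions. The regular expressions are based on the log lines in this thread, the `SUSPECT` set lists just the two return names seen here, and `find_stuck_units` and its `threshold` parameter are hypothetical, not FAHClient features.

```python
import re
from collections import defaultdict

# Both patterns are assumptions derived from the log lines quoted in this thread.
RETURNED = re.compile(r":(WU\d+):(FS\d+):FahCore returned: ([A-Z_]+) ")
PROGRESS = re.compile(
    r":(WU\d+):(FS\d+):0x[0-9a-f]+:Completed \d+ out of \d+ steps \((\d+)%\)"
)

# Return names treated as failures; only the two seen in this thread.
SUSPECT = {"INTERRUPTED", "EARLY_UNIT_END"}


def find_stuck_units(lines, threshold=5):
    """Return the (WU, slot) pairs with `threshold` suspect returns in a row."""
    streak = defaultdict(int)
    stuck = set()
    for line in lines:
        progress = PROGRESS.search(line)
        if progress and int(progress.group(3)) > 0:
            # Real progress was made, so the unit is not looping at 0%.
            streak[(progress.group(1), progress.group(2))] = 0
            continue
        returned = RETURNED.search(line)
        if returned:
            key = (returned.group(1), returned.group(2))
            if returned.group(3) in SUSPECT:
                streak[key] += 1
                if streak[key] >= threshold:
                    stuck.add(key)
            else:
                streak[key] = 0
    return stuck


sample = [
    "15:50:27:WU01:FS00:0xa7:Completed 1 out of 62500 steps (0%)",
    "15:50:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "15:53:25:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "15:53:45:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "15:58:55:WU02:FS01:0xa7:Completed 57500 out of 250000 steps (23%)",
]

print(find_stuck_units(sample, threshold=3))  # {('WU01', 'FS00')}
```

Because INTERRUPTED can also appear when a slot is paused on purpose, the threshold would need tuning against real logs before relying on such a rule.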

anandhanju wrote:Thanks for reporting this UofM.MartinK. The faulty unit has been flagged to the researchers for review and action.

The stats for the server that hosts this project are delayed and a few days out of sync. See viewtopic.php?f=18&t=35864#p340220. The necessary folks are aware of this and are looking into it.
Thanks, glad it's flagged & properly reported!

Is there any "automatic" or "semi-automatic" mechanism to notify the researchers about WUs sent back to the WS/CS with the "faulty" flag?
If so, does that mechanism depend on the stats server analysis being completed?

Or are faulty WUs usually just dealt with when the bulk results are looked at for further processing for a particular project?

Also, which server keeps track of how often a unit was returned as faulty and then decides not to hand it out anymore?
Is it just the WS+CS, with no stats processing involved?
I am asking because I got the WU in question assigned on 2020-07-30 14:43:24, and
https://apps.foldingathome.org/wu#proje ... ne=3&gen=0
shows "Faulty" assigned 2020-07-27 06:33:20, returned 2020-07-27 18:18:43, credited 2020-07-27 06:37:43
and "Faulty 2" assigned 2020-07-27 06:37:50, returned 2020-07-27 18:18:46, credited 2020-07-27 06:44:00

Which raises two questions:
1) Why was this WU assigned twice within 4 minutes, before being returned once? Or are the values for "credited" and "returned" swapped?
2) Why was this WU "re"-assigned about 3 days after at least two allegedly independent fault reports?

Just curious and trying to understand how F@h works and perhaps finding "new" bugs :)

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Thu Aug 06, 2020 2:06 am
by anandhanju
The respective work servers keep track of how many times a WU has failed and been reissued. Most projects are set to reissue a WU 5 times if it has been returned as faulty. In most cases, faulty submissions are due to instabilities in hardware or a misconfiguration on the user's system (e.g., a missing or bad driver that spits out failures at 0%), and a reissue will produce a successful return. However, sometimes the WU instructions themselves can cause failures, and in such cases all 5 retries will fail.

The researchers keep an eye on the overall failure rate during the course of a project so they can pull it back if the failure rate is above nominal.

To answer your two questions:
1. You're right -- the values for "credited" and "returned" are swapped -- see https://github.com/FoldingAtHome/fah-issues/issues/1383
2. It will get reassigned 5 times before it is permanently failed. While reassignments do not necessarily have to happen soon after a previous failure (it depends on the assignment queue depth and assignment determination logic), this WU may have been reissued twice but never returned before it was assigned to you. Or it may have just sat in the queue for 3 days before it was reassigned.

To explain the first possibility further, p14217 has a timeout of 1.6 days (which means it gets reassigned if the work isn't returned in 1.6 days). The last recorded (failed) return was at 2020-07-27 06:44:00. Assuming it was reissued to someone else within the same hour of being returned, the timeout for that assignment expired around 2020-07-28 21:00:00. Assuming it was then reissued immediately, the timeout for that second assignment expired around 2020-07-30 13:00:00. At this point, assuming it was reissued to you, it aligns with the time you got the assignment at 2020-07-30 14:43:24. Of course, this is just speculation, as the WU log we saw records credits, not assignments that have no returns. There isn't a publicly available place for us to verify this.
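
The timeline in this scenario can be checked with a few lines of date arithmetic. This is a rough sketch under the simplest assumption that each reissue happens at the exact moment the previous assignment ends; the post's rounded times allow some slack for reissue delays, so the figures differ slightly. The variable names are illustrative only.

```python
from datetime import datetime, timedelta

# 1.6 days, written as whole hours and minutes to avoid fractional days.
TIMEOUT = timedelta(hours=38, minutes=24)

last_failed_return = datetime(2020, 7, 27, 6, 44, 0)
assigned_to_poster = datetime(2020, 7, 30, 14, 43, 24)

# Assumption: the WU is reissued immediately after the failed return and
# again immediately after the first timeout expires.
first_expiry = last_failed_return + TIMEOUT
second_expiry = first_expiry + TIMEOUT

print(first_expiry)                        # 2020-07-28 21:08:00
print(second_expiry)                       # 2020-07-30 11:32:00
print(assigned_to_poster - second_expiry)  # 3:11:24
```

Under these assumptions the second timeout expires a little over three hours before the reported assignment time, which is consistent with the two-unreturned-reissues explanation but does not prove it.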

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Thu Aug 06, 2020 8:10 am
by Neil-B
Given the server is not currently reporting WU returns (see a number of reports elsewhere), it may be that it isn't seeing either faulty or completed returns anyway and so is defaulting to the timeout reissue ... In the past, delays between WUs being received by the CS and arriving at the WS have caused this kind of delay.

If, as anandhanju and I speculate, this is what is occurring, then when the system catches up you may find that the WU Status will show your return in fifth place (when sorted by issued timestamp).

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Thu Aug 06, 2020 1:38 pm
by UofM.MartinK
BTW, I _think_ my WU was sent back to the original WS, not an associated CS:

Code:

14:42:48:WU01:FS00:Connecting to 206.223.170.146:8080
14:42:48:WU01:FS00:Downloading 24.07MiB
14:42:54:WU01:FS00:Download 85.41%
14:42:56:WU01:FS00:Download complete
14:42:56:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:14217 run:1724 clone:3 gen:0 core:0xa7 unit:0x00000004cedfaa925eab742a5d3e4286
...many days later, different logfile....
16:06:40:WARNING:WU01:FS00:Too many errors, failing
16:06:40:WU01:FS00:Sending unit results: id:01 state:SEND error:FAILED project:14217 run:1724 clone:3 gen:0 core:0xa7 unit:0x00000004cedfaa925eab742a5d3e4286
16:06:40:WU01:FS00:Connecting to 206.223.170.146:8080
16:06:40:WU01:FS00:Server responded WORK_ACK (400)
So most likely the WS kept track of the 'faulties' - even more curious now to see what the WU history looks like once the stats have caught up :)

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Thu Aug 06, 2020 1:49 pm
by Neil-B
It could be possible that the "receiving WUs" part of the WS is not talking properly to the "processing and generating new WUs" part of the server ... Hopefully someone will let the researcher know to look at this ... The Completed WUs and/or Reports of Failure should at some point catch up though ... and yes, I am curious as well to see what the history will end up looking like :eugeek:

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Thu Aug 06, 2020 3:51 pm
by bruce
The endless-loop problem needs to be fixed. Some versions of FAHClient/FAHCore do catch this loop and some don't, so somebody, somewhere knows how to fix it.

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Thu Aug 06, 2020 6:56 pm
by UofM.MartinK
Hm... above, Joe_H said Linux is affected, Windows not. Error/return code handling is a little different on these platforms, and it could even be the GROMACS package/library which behaves differently.

Using strace and some other tools on Linux to further debug the issue is certainly within my capabilities, but I should first know what others learned about the problem already.

At first glance, I didn't find a corresponding issue on https://github.com/FoldingAtHome/fah-issues/issues

Getting access to that "bad WU" again would also be very helpful; unfortunately, I didn't make a backup :) Hitting another one that behaves like that might take a while; after all, this was about 1 in 7000 for me...

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Fri Aug 07, 2020 3:54 pm
by bruce
Creating a list of applicable projects is one place to start. I doubt that this is the only project that has this problem.

Opening a ticket on GitHub that can collect our current understanding is another place to start. Is this best described as "Error ABCDE on Linux causes an infinite loop on projects XXXXX"? (We can refine the title later as we learn more about the problem.)

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Fri Aug 07, 2020 4:52 pm
by Joe_H
I thought that someone had already posted this problem in the past; the bug has been around for a while and dates back at least to 7.5.1. The report might have been against the folding core, and issues migrated to that area are not visible. When I get a chance, I will take a closer look at the visible issues and see if it is there.

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Fri Aug 07, 2020 8:25 pm
by UofM.MartinK

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Tue Aug 18, 2020 3:55 am
by UofM.MartinK
The stats for the workserver handling this WU finally got processed... :)

https://apps.foldingathome.org/wu#proje ... ne=3&gen=0

It turns out, hanging in the endless loop for six days was worth 1.5k points after all :)

BTW, over 2 Million WUs just got added to the database, I think this is the most impressive spike I've ever seen on the Credit Log:
https://apps.foldingathome.org/credit-log

Re: project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 c

Posted: Tue Aug 18, 2020 8:34 am
by bruce
All those missing credit reports wanting to be set free ... and the dam burst.