project:14217 run:1724 clone:3 gen:0 failed on 16 & 12 cores
Posted: Wed Aug 05, 2020 6:31 pm
I found a 16-CPU-core slot on a 3900X Linux machine stuck in the following endless loop for days:
The only variation across the hundreds of errors was the values of X and Y in
FS00:0xa7:ERROR:X particles communicated to PME rank Y are more than 2/3 times ... ,
always resulting in FahCore returned: INTERRUPTED (102 = 0x66)
I reduced the core count to 12, and the resulting errors varied a bit more:
So now, GROMACS sometimes failed with the usual
FS00:0xa7:ERROR:X particles communicated to PME rank Y are more than 2/3 times ... ,
but also sometimes with
An atom moved too far between two domain decomposition steps
which, fortunately, results in FahCore returned: EARLY_UNIT_END (123 = 0x7b)
The client logged 10 "EARLY_UNIT_END" results in the next 15 attempts before failing the unit:
And so this odyssey finally came to an end.
The faulty unit was sent back 2+ hours ago:
But the WU is not reported here yet:
https://apps.foldingathome.org/wu#proje ... ne=3&gen=0
Only two older attempts had shown up when I submitted this post, and both date back to 2020-07-27.
I hope this WU didn't stall other slots for days (as it did mine), which is another reason for reporting it here.
Code:
15:49:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:50:24:WU01:FS00:Starting
15:50:24:WU01:FS00:Removing old file 'work/01/logfile_01-20200805-151402.txt'
15:50:24:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 706 -lifeline 1623069 -checkpoint 15 -np 16
15:50:24:WU01:FS00:Started FahCore on PID 1626543
15:50:24:WU01:FS00:Core PID:1626547
15:50:24:WU01:FS00:FahCore 0xa7 started
15:50:25:WU01:FS00:0xa7:*********************** Log Started 2020-08-05T15:50:24Z ***********************
15:50:25:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
15:50:25:WU01:FS00:0xa7: Type: 0xa7
15:50:25:WU01:FS00:0xa7: Core: Gromacs
15:50:25:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 706 -lifeline 1626543 -checkpoint 15
15:50:25:WU01:FS00:0xa7: -np 16
15:50:25:WU01:FS00:0xa7:************************************ CBang *************************************
15:50:25:WU01:FS00:0xa7: Date: Nov 27 2019
15:50:25:WU01:FS00:0xa7: Time: 11:26:54
15:50:25:WU01:FS00:0xa7: Revision: d25803215b59272441049dfa05a0a9bf7a6e3c48
15:50:25:WU01:FS00:0xa7: Branch: master
15:50:25:WU01:FS00:0xa7: Compiler: GNU 8.3.0
15:50:25:WU01:FS00:0xa7: Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:50:25:WU01:FS00:0xa7: -fno-pie -fPIC
15:50:25:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
15:50:25:WU01:FS00:0xa7: Bits: 64
15:50:25:WU01:FS00:0xa7: Mode: Release
15:50:25:WU01:FS00:0xa7:************************************ System ************************************
15:50:25:WU01:FS00:0xa7: CPU: AMD Ryzen 9 3900X 12-Core Processor
15:50:25:WU01:FS00:0xa7: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
15:50:25:WU01:FS00:0xa7: CPUs: 24
15:50:25:WU01:FS00:0xa7: Memory: 31.30GiB
15:50:25:WU01:FS00:0xa7:Free Memory: 5.31GiB
15:50:25:WU01:FS00:0xa7: Threads: POSIX_THREADS
15:50:25:WU01:FS00:0xa7: OS Version: 5.4
15:50:25:WU01:FS00:0xa7:Has Battery: false
15:50:25:WU01:FS00:0xa7: On Battery: false
15:50:25:WU01:FS00:0xa7: UTC Offset: -4
15:50:25:WU01:FS00:0xa7: PID: 1626547
15:50:25:WU01:FS00:0xa7: CWD: /var/lib/fahclient/work
15:50:25:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
15:50:25:WU01:FS00:0xa7: Version: 0.0.19
15:50:25:WU01:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:50:25:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
15:50:25:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
15:50:25:WU01:FS00:0xa7: Date: Nov 26 2019
15:50:25:WU01:FS00:0xa7: Time: 00:41:42
15:50:25:WU01:FS00:0xa7: Revision: d5b5c747532224f986b7cd02c968ed9a20c16d6e
15:50:25:WU01:FS00:0xa7: Branch: master
15:50:25:WU01:FS00:0xa7: Compiler: GNU 8.3.0
15:50:25:WU01:FS00:0xa7: Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:50:25:WU01:FS00:0xa7: -fno-pie
15:50:25:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
15:50:25:WU01:FS00:0xa7: Bits: 64
15:50:25:WU01:FS00:0xa7: Mode: Release
15:50:25:WU01:FS00:0xa7:************************************ Build *************************************
15:50:25:WU01:FS00:0xa7: SIMD: avx_256
15:50:25:WU01:FS00:0xa7:********************************************************************************
15:50:25:WU01:FS00:0xa7:Project: 14217 (Run 1724, Clone 3, Gen 0)
15:50:25:WU01:FS00:0xa7:Unit: 0x00000004cedfaa925eab742a5d3e4286
15:50:25:WU01:FS00:0xa7:Digital signatures verified
15:50:25:WU01:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -x frame0.xtc -cpt 15 -nt 16
15:50:25:WU01:FS00:0xa7:Steps: first=0 total=62500
15:50:27:WU01:FS00:0xa7:Completed 1 out of 62500 steps (0%)
15:50:32:WU01:FS00:0xa7:ERROR:
15:50:32:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:32:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:50:32:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:50:32:WU01:FS00:0xa7:ERROR:
15:50:32:WU01:FS00:0xa7:ERROR:Fatal error:
15:50:32:WU01:FS00:0xa7:ERROR:22 particles communicated to PME rank 11 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension y.
15:50:32:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:50:32:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:50:32:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:50:32:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:37:WU01:FS00:0xa7:ERROR:
15:50:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:37:WU01:FS00:0xa7:ERROR:
15:50:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:37:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:50:37:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:50:37:WU01:FS00:0xa7:ERROR:
15:50:37:WU01:FS00:0xa7:ERROR:Fatal error:
15:50:37:WU01:FS00:0xa7:ERROR:59 particles communicated to PME rank 10 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
15:50:37:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:50:37:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:50:37:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:50:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:37:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:50:37:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:50:37:WU01:FS00:0xa7:ERROR:
15:50:37:WU01:FS00:0xa7:ERROR:Fatal error:
15:50:37:WU01:FS00:0xa7:ERROR:16 particles communicated to PME rank 11 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
15:50:37:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:50:37:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:50:37:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:50:37:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:50:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
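For anyone who wants to see how X and Y vary across a long log, here is a minimal Python sketch that pulls the particle count, PME rank and dimension out of these error lines. The function name pme_errors and the regular expression are my own assumptions, built only from the log lines quoted above; they are not part of FAHClient or GROMACS.

```python
import re

# Pattern assumed from the error lines quoted in this post; not an official format.
PME_RE = re.compile(
    r"ERROR:(\d+) particles communicated to PME rank (\d+) "
    r"are more than 2/3 times the cut-off .* in dimension ([xyz])"
)

def pme_errors(lines):
    """Return a (particles, rank, dimension) tuple for each PME error line."""
    found = []
    for line in lines:
        m = PME_RE.search(line)
        if m:
            found.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return found

# Two error lines and one unrelated line, copied from the log above.
sample = [
    "15:50:32:WU01:FS00:0xa7:ERROR:22 particles communicated to PME rank 11 "
    "are more than 2/3 times the cut-off out of the domain decomposition cell "
    "of their charge group in dimension y.",
    "15:50:37:WU01:FS00:0xa7:ERROR:59 particles communicated to PME rank 10 "
    "are more than 2/3 times the cut-off out of the domain decomposition cell "
    "of their charge group in dimension x.",
    "15:50:37:WU01:FS00:0xa7:ERROR:This usually means that your system is not "
    "well equilibrated.",
]

print(pme_errors(sample))
```

Feeding it the full log.txt instead of the sample list should give one tuple per failed rank, which makes it easy to see whether the same ranks keep failing.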
Code:
15:58:25:WU01:FS00:FahCore 0xa7 started
15:58:25:WU01:FS00:0xa7:*********************** Log Started 2020-08-05T15:58:25Z ***********************
15:58:25:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
15:58:25:WU01:FS00:0xa7: Type: 0xa7
15:58:25:WU01:FS00:0xa7: Core: Gromacs
15:58:25:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 706 -lifeline 1627903 -checkpoint 15
15:58:25:WU01:FS00:0xa7: -np 12
15:58:25:WU01:FS00:0xa7:************************************ CBang *************************************
15:58:25:WU01:FS00:0xa7: Date: Nov 27 2019
15:58:25:WU01:FS00:0xa7: Time: 11:26:54
15:58:25:WU01:FS00:0xa7: Revision: d25803215b59272441049dfa05a0a9bf7a6e3c48
15:58:25:WU01:FS00:0xa7: Branch: master
15:58:25:WU01:FS00:0xa7: Compiler: GNU 8.3.0
15:58:25:WU01:FS00:0xa7: Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:58:25:WU01:FS00:0xa7: -fno-pie -fPIC
15:58:25:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
15:58:25:WU01:FS00:0xa7: Bits: 64
15:58:25:WU01:FS00:0xa7: Mode: Release
15:58:25:WU01:FS00:0xa7:************************************ System ************************************
15:58:25:WU01:FS00:0xa7: CPU: AMD Ryzen 9 3900X 12-Core Processor
15:58:25:WU01:FS00:0xa7: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
15:58:25:WU01:FS00:0xa7: CPUs: 24
15:58:25:WU01:FS00:0xa7: Memory: 31.30GiB
15:58:25:WU01:FS00:0xa7:Free Memory: 5.25GiB
15:58:25:WU01:FS00:0xa7: Threads: POSIX_THREADS
15:58:25:WU01:FS00:0xa7: OS Version: 5.4
15:58:25:WU01:FS00:0xa7:Has Battery: false
15:58:25:WU01:FS00:0xa7: On Battery: false
15:58:25:WU01:FS00:0xa7: UTC Offset: -4
15:58:25:WU01:FS00:0xa7: PID: 1627907
15:58:25:WU01:FS00:0xa7: CWD: /var/lib/fahclient/work
15:58:25:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
15:58:25:WU01:FS00:0xa7: Version: 0.0.19
15:58:25:WU01:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:58:25:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
15:58:25:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
15:58:25:WU01:FS00:0xa7: Date: Nov 26 2019
15:58:25:WU01:FS00:0xa7: Time: 00:41:42
15:58:25:WU01:FS00:0xa7: Revision: d5b5c747532224f986b7cd02c968ed9a20c16d6e
15:58:25:WU01:FS00:0xa7: Branch: master
15:58:25:WU01:FS00:0xa7: Compiler: GNU 8.3.0
15:58:25:WU01:FS00:0xa7: Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:58:25:WU01:FS00:0xa7: -fno-pie
15:58:25:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
15:58:25:WU01:FS00:0xa7: Bits: 64
15:58:25:WU01:FS00:0xa7: Mode: Release
15:58:25:WU01:FS00:0xa7:************************************ Build *************************************
15:58:25:WU01:FS00:0xa7: SIMD: avx_256
15:58:25:WU01:FS00:0xa7:********************************************************************************
15:58:25:WU01:FS00:0xa7:Project: 14217 (Run 1724, Clone 3, Gen 0)
15:58:25:WU01:FS00:0xa7:Unit: 0x00000004cedfaa925eab742a5d3e4286
15:58:25:WU01:FS00:0xa7:Digital signatures verified
15:58:25:WU01:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -x frame0.xtc -cpt 15 -nt 12
15:58:25:WU01:FS00:0xa7:Steps: first=0 total=62500
15:58:29:WU01:FS00:0xa7:Completed 1 out of 62500 steps (0%)
15:58:36:WU01:FS00:0xa7:ERROR:
15:58:36:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:36:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:58:36:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:58:36:WU01:FS00:0xa7:ERROR:
15:58:36:WU01:FS00:0xa7:ERROR:Fatal error:
15:58:36:WU01:FS00:0xa7:ERROR:
15:58:36:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:36:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:58:36:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
15:58:36:WU01:FS00:0xa7:ERROR:
15:58:36:WU01:FS00:0xa7:ERROR:Fatal error:
15:58:36:WU01:FS00:0xa7:ERROR:4 particles communicated to PME rank 3 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
15:58:36:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:58:36:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:58:36:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:58:36:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:36:WU01:FS00:0xa7:ERROR:2 particles communicated to PME rank 2 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
15:58:36:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
15:58:36:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:58:36:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:58:36:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:41:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:58:55:WU02:FS01:0xa7:Completed 57500 out of 250000 steps (23%)
15:59:25:WU01:FS00:Starting
15:59:25:WU01:FS00:Removing old file 'work/01/logfile_01-20200805-152202.txt'
15:59:25:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 706 -lifeline 1626803 -checkpoint 15 -np 12
15:59:25:WU01:FS00:Started FahCore on PID 1627925
15:59:25:WU01:FS00:Core PID:1627929
15:59:25:WU01:FS00:FahCore 0xa7 started
15:59:25:WU01:FS00:0xa7:*********************** Log Started 2020-08-05T15:59:25Z ***********************
15:59:25:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
15:59:25:WU01:FS00:0xa7: Type: 0xa7
15:59:25:WU01:FS00:0xa7: Core: Gromacs
15:59:25:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 706 -lifeline 1627925 -checkpoint 15
15:59:25:WU01:FS00:0xa7: -np 12
15:59:25:WU01:FS00:0xa7:************************************ CBang *************************************
15:59:25:WU01:FS00:0xa7: Date: Nov 27 2019
15:59:25:WU01:FS00:0xa7: Time: 11:26:54
15:59:25:WU01:FS00:0xa7: Revision: d25803215b59272441049dfa05a0a9bf7a6e3c48
15:59:25:WU01:FS00:0xa7: Branch: master
15:59:25:WU01:FS00:0xa7: Compiler: GNU 8.3.0
15:59:25:WU01:FS00:0xa7: Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:59:25:WU01:FS00:0xa7: -fno-pie -fPIC
15:59:25:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
15:59:25:WU01:FS00:0xa7: Bits: 64
15:59:25:WU01:FS00:0xa7: Mode: Release
15:59:25:WU01:FS00:0xa7:************************************ System ************************************
15:59:25:WU01:FS00:0xa7: CPU: AMD Ryzen 9 3900X 12-Core Processor
15:59:25:WU01:FS00:0xa7: CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
15:59:25:WU01:FS00:0xa7: CPUs: 24
15:59:25:WU01:FS00:0xa7: Memory: 31.30GiB
15:59:25:WU01:FS00:0xa7:Free Memory: 5.24GiB
15:59:25:WU01:FS00:0xa7: Threads: POSIX_THREADS
15:59:25:WU01:FS00:0xa7: OS Version: 5.4
15:59:25:WU01:FS00:0xa7:Has Battery: false
15:59:25:WU01:FS00:0xa7: On Battery: false
15:59:25:WU01:FS00:0xa7: UTC Offset: -4
15:59:25:WU01:FS00:0xa7: PID: 1627929
15:59:25:WU01:FS00:0xa7: CWD: /var/lib/fahclient/work
15:59:25:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
15:59:25:WU01:FS00:0xa7: Version: 0.0.19
15:59:25:WU01:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:59:25:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
15:59:25:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
15:59:25:WU01:FS00:0xa7: Date: Nov 26 2019
15:59:25:WU01:FS00:0xa7: Time: 00:41:42
15:59:25:WU01:FS00:0xa7: Revision: d5b5c747532224f986b7cd02c968ed9a20c16d6e
15:59:25:WU01:FS00:0xa7: Branch: master
15:59:25:WU01:FS00:0xa7: Compiler: GNU 8.3.0
15:59:25:WU01:FS00:0xa7: Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
15:59:25:WU01:FS00:0xa7: -fno-pie
15:59:25:WU01:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
15:59:25:WU01:FS00:0xa7: Bits: 64
15:59:25:WU01:FS00:0xa7: Mode: Release
15:59:25:WU01:FS00:0xa7:************************************ Build *************************************
15:59:25:WU01:FS00:0xa7: SIMD: avx_256
15:59:25:WU01:FS00:0xa7:********************************************************************************
15:59:25:WU01:FS00:0xa7:Project: 14217 (Run 1724, Clone 3, Gen 0)
15:59:25:WU01:FS00:0xa7:Unit: 0x00000004cedfaa925eab742a5d3e4286
15:59:25:WU01:FS00:0xa7:Digital signatures verified
15:59:25:WU01:FS00:0xa7:Calling: mdrun -s frame0.tpr -o frame0.trr -x frame0.xtc -cpt 15 -nt 12
15:59:25:WU01:FS00:0xa7:Steps: first=0 total=62500
15:59:29:WU01:FS00:0xa7:Completed 1 out of 62500 steps (0%)
15:59:35:WU01:FS00:0xa7:ERROR:
15:59:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:35:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:35:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:35:WU01:FS00:0xa7:ERROR:
15:59:35:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:35:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:35:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:35:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:35:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:35:WU01:FS00:0xa7:ERROR:
15:59:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:35:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:35:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:35:WU01:FS00:0xa7:ERROR:
15:59:35:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:35:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:35:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:35:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:35:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:35:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:40:WU01:FS00:0xa7:ERROR:
15:59:40:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:40:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:40:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:40:WU01:FS00:0xa7:ERROR:
15:59:40:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:40:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:40:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:40:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:40:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:40:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:40:WU01:FS00:0xa7:ERROR:
15:59:40:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:40:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:40:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:40:WU01:FS00:0xa7:ERROR:
15:59:40:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:40:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:40:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:40:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:40:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:40:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:45:WU01:FS00:0xa7:ERROR:
15:59:45:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:45:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:59:45:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 4390
15:59:45:WU01:FS00:0xa7:ERROR:
15:59:45:WU01:FS00:0xa7:ERROR:Fatal error:
15:59:45:WU01:FS00:0xa7:ERROR:An atom moved too far between two domain decomposition steps
15:59:45:WU01:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated
15:59:45:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:59:45:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:59:45:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:59:45:WU01:FS00:0xa7:WARNING:Unexpected exit
15:59:45:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
15:59:50:WU02:FS01:0xa7:Completed 60000 out of 250000 steps (24%)
16:00:25:WU01:FS00:Starting
16:00:25:WU01:FS00:Removing old file 'work/01/logfile_01-20200805-152302.txt'
16:00:25:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx-256/a7-0.0.19/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 706 -lifeline 1626803 -checkpoint 15 -np 12
16:00:25:WU01:FS00:Started FahCore on PID 1628867
16:00:25:WU01:FS00:Core PID:1628871
16:00:25:WU01:FS00:FahCore 0xa7 started
Code:
15:53:25:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:53:45:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:54:46:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:55:41:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
15:56:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
15:57:51:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
15:58:41:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
15:59:45:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:00:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:01:46:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
16:02:41:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:03:51:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:04:41:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:05:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:06:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)
16:06:40:WARNING:WU01:FS00:Too many errors, failing
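The count above can be reproduced with a short Python sketch that tallies the "FahCore returned" lines by result name and also checks that the decimal and hex codes agree (102 = 0x66, 123 = 0x7b). The name tally_returns and the pattern are assumptions derived from the log lines in this post, not anything provided by the client.

```python
import re
from collections import Counter

# Pattern assumed from the client log lines quoted in this post.
RETURN_RE = re.compile(r"FahCore returned: (\w+) \((\d+) = (0x[0-9a-fA-F]+)\)")

def tally_returns(lines):
    """Count FahCore return results by name, verifying decimal == hex code."""
    counts = Counter()
    for line in lines:
        m = RETURN_RE.search(line)
        if not m:
            continue
        name, dec, hexa = m.group(1), int(m.group(2)), int(m.group(3), 16)
        if dec != hexa:
            raise ValueError(f"code mismatch in line: {line!r}")
        counts[name] += 1
    return counts

# A few lines copied from the log above.
sample = [
    "15:58:41:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "15:59:45:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)",
    "16:00:40:WARNING:WU01:FS00:FahCore returned: EARLY_UNIT_END (123 = 0x7b)",
    "16:06:40:WARNING:WU01:FS00:Too many errors, failing",
]

print(tally_returns(sample))
```

Run against the 15 attempts listed above, it should report the EARLY_UNIT_END and INTERRUPTED totals separately.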
Code:
16:06:40:WU01:FS00:Sending unit results: id:01 state:SEND error:FAILED project:14217 run:1724 clone:3 gen:0 core:0xa7 unit:0x00000004cedfaa925eab742a5d3e4286
16:06:40:WU01:FS00:Connecting to 206.223.170.146:8080
16:06:40:WU01:FS00:Server responded WORK_ACK (400)
Code:
User       Team    CPUID             Credit  Assigned             Returned             Credited             Days   Code
APC2020    244369  D390AB5E44FC89F3  2.06    2020-07-27 06:33:20  2020-07-27 18:18:43  2020-07-27 06:37:43  0.003  Faulty
Anonymous  0       DBAEAC5EC838FAB5  4.01    2020-07-27 06:37:50  2020-07-27 18:18:46  2020-07-27 06:44:00  0.004  Faulty 2