ERROR: BAD_WORK_UNIT
Posted: Thu Apr 23, 2020 11:30 am
Hi all,
I've been running F@H for about three years now, and this is the first time I've come across this problem. I've just upgraded my client to 7.6.9, but I think the timing of the upgrade and the error is coincidental.
The system spec and the actual error are in the two log excerpts further down.
I've tried changing the 'cause' preference, but the client is stuck on COVID-19 project 16417 (which is CPU only; I'd like to change that if possible, i.e., how does one get GPU-compatible projects?)
The computational chemist in me thinks that the gmx grompp of the project topology and coordinate data was designed for fewer cores (hence a domain decomposition issue). However, I would be very surprised if this work unit had got through to a client if that were the case. Hence, I would like to defer to your better judgement regarding the F@H software.
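To make the domain decomposition suspicion concrete, here is a rough sketch in Python. It is not GROMACS's actual algorithm: it assumes a rectangular box, ignores separate PME ranks, and uses a made-up box size (`BOX_NM`), since the real box lives in frame62.tpr and is not shown in the log. The function name `dd_grids` is hypothetical too. Only the 1.4227 nm minimum cell size comes from the error message.

```python
def dd_grids(n_ranks, box_nm, min_cell_nm):
    """Return every (nx, ny, nz) grid with nx*ny*nz == n_ranks whose cells
    are at least min_cell_nm wide along each box axis."""
    grids = []
    for nx in range(1, n_ranks + 1):
        if n_ranks % nx:
            continue
        rest = n_ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny:
                continue
            counts = (nx, ny, rest // ny)
            # Each cell must be no narrower than the minimum cell size.
            if all(length / n >= min_cell_nm
                   for length, n in zip(box_nm, counts)):
                grids.append(counts)
    return grids


MIN_CELL_NM = 1.4227        # taken from the error message
BOX_NM = (6.0, 6.0, 6.0)    # ASSUMPTION: hypothetical box, not from the WU

for ranks in (15, 12, 16):
    fits = bool(dd_grids(ranks, BOX_NM, MIN_CELL_NM))
    print(ranks, "ranks:", "some grid fits" if fits else "no grid fits")
```

With this made-up 6 nm box, at most four cells fit along any axis, so 15 ranks (which always needs a factor of 5 or 15 along some axis) has no valid grid, while 12 and 16 do. If the real box behaves similarly, a CPU slot with a thread count that has no factor of 5 might avoid the error, though that is a guess rather than a confirmed fix.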
Quick note: I've left this running for 3 days hoping it would right itself, but it continues to pull down the same project WU and fail. Another note: this is my own workhorse machine, running my own comp chem calculations, so I'd welcome suggestions on how to restart the F@H core without restarting the machine, if that's all it takes. Thanks.
Quick system spec:
11:16:12:WU02:FS00:0xa7:*********************** Log Started 2020-04-23T11:16:12Z ***********************
11:16:12:WU02:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
11:16:12:WU02:FS00:0xa7: Type: 0xa7
11:16:12:WU02:FS00:0xa7: Core: Gromacs
11:16:12:WU02:FS00:0xa7: Args: -dir 02 -suffix 01 -version 704 -lifeline 10318 -checkpoint 15 -np
11:16:12:WU02:FS00:0xa7: 15
11:16:12:WU02:FS00:0xa7:************************************ CBang *************************************
11:16:12:WU02:FS00:0xa7: Date: Nov 5 2019
11:16:12:WU02:FS00:0xa7: Time: 06:06:57
11:16:12:WU02:FS00:0xa7: Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
11:16:12:WU02:FS00:0xa7: Branch: master
11:16:12:WU02:FS00:0xa7: Compiler: GNU 8.3.0
11:16:12:WU02:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
11:16:12:WU02:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
11:16:12:WU02:FS00:0xa7: Bits: 64
11:16:12:WU02:FS00:0xa7: Mode: Release
11:16:12:WU02:FS00:0xa7:************************************ System ************************************
11:16:12:WU02:FS00:0xa7: CPU: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
11:16:12:WU02:FS00:0xa7: CPU ID: GenuineIntel Family 6 Model 45 Stepping 7
11:16:12:WU02:FS00:0xa7: CPUs: 32
11:16:12:WU02:FS00:0xa7: Memory: 125.88GiB
11:16:12:WU02:FS00:0xa7:Free Memory: 43.49GiB
11:16:12:WU02:FS00:0xa7: Threads: POSIX_THREADS
11:16:12:WU02:FS00:0xa7: OS Version: 4.10
11:16:12:WU02:FS00:0xa7:Has Battery: false
11:16:12:WU02:FS00:0xa7: On Battery: false
11:16:12:WU02:FS00:0xa7: UTC Offset: 1
11:16:12:WU02:FS00:0xa7: PID: 10322
11:16:12:WU02:FS00:0xa7: CWD: /var/lib/fahclient/work
11:16:12:WU02:FS00:0xa7:******************************** Build - libFAH ********************************
11:16:12:WU02:FS00:0xa7: Version: 0.0.18
11:16:12:WU02:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
11:16:12:WU02:FS00:0xa7: Copyright: 2019 foldingathome.org
11:16:12:WU02:FS00:0xa7: Homepage: https://foldingathome.org/
11:16:12:WU02:FS00:0xa7: Date: Nov 5 2019
11:16:12:WU02:FS00:0xa7: Time: 06:13:26
11:16:12:WU02:FS00:0xa7: Revision: 490c9aa2957b725af319379424d5c5cb36efb656
11:16:12:WU02:FS00:0xa7: Branch: master
11:16:12:WU02:FS00:0xa7: Compiler: GNU 8.3.0
11:16:12:WU02:FS00:0xa7: Options: -std=c++11 -O3 -funroll-loops -fno-pie
11:16:12:WU02:FS00:0xa7: Platform: linux2 4.19.0-5-amd64
11:16:12:WU02:FS00:0xa7: Bits: 64
11:16:12:WU02:FS00:0xa7: Mode: Release
11:16:12:WU02:FS00:0xa7:************************************ Build *************************************
11:16:12:WU02:FS00:0xa7: SIMD: avx_256
11:16:12:WU02:FS00:0xa7:********************************************************************************
Now the actual error:
11:16:12:WU02:FS00:0xa7:Project: 16417 (Run 1322, Clone 1, Gen 62)
11:16:12:WU02:FS00:0xa7:Unit: 0x0000004996880e6e5e8a61572b189804
11:16:12:WU02:FS00:0xa7:Reading tar file core.xml
11:16:12:WU02:FS00:0xa7:Reading tar file frame62.tpr
11:16:12:WU02:FS00:0xa7:Digital signatures verified
11:16:12:WU02:FS00:0xa7:Calling: mdrun -s frame62.tpr -o frame62.trr -x frame62.xtc -cpt 15 -nt 15
11:16:12:WU02:FS00:0xa7:Steps: first=15500000 total=250000
11:16:12:WU02:FS00:0xa7:ERROR:
11:16:12:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
11:16:12:WU02:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
11:16:12:WU02:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
11:16:12:WU02:FS00:0xa7:ERROR:
11:16:12:WU02:FS00:0xa7:ERROR:Fatal error:
11:16:12:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
11:16:12:WU02:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
11:16:12:WU02:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
11:16:12:WU02:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
11:16:12:WU02:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
11:16:12:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
11:16:17:WU02:FS00:0xa7:WARNING:Unexpected exit() call
11:16:17:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
11:16:17:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
11:16:17:WU02:FS00:0xa7:Saving result file md.log
11:16:17:WU02:FS00:0xa7:Saving result file science.log
11:16:17:WU02:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
11:16:17:WU02:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
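For anyone who wants to pull the key numbers out of a log like this automatically, here is a small Python sketch. The two sample lines are copied from the log above; the regular expressions and the `summarize` helper are illustrative assumptions about the line format, not part of any F@H or GROMACS tooling.

```python
import re

# Two lines copied verbatim from the log above.
LOG = """\
11:16:12:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
11:16:17:WU02:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
"""

# ASSUMPTION: patterns inferred from these two lines only.
DD_ERROR = re.compile(
    r"no domain decomposition for (\d+) ranks.*?minimum cell size of ([\d.]+) nm"
)
CORE_EXIT = re.compile(r"FahCore returned: (\w+) \((\d+) = 0x[0-9a-fA-F]+\)")


def summarize(log_text):
    """Collect the rank count, minimum cell size, and core exit status."""
    summary = {}
    for line in log_text.splitlines():
        match = DD_ERROR.search(line)
        if match:
            summary["ranks"] = int(match.group(1))
            summary["min_cell_nm"] = float(match.group(2))
        match = CORE_EXIT.search(line)
        if match:
            summary["exit_status"] = match.group(1)
            summary["exit_code"] = int(match.group(2))
    return summary


print(summarize(LOG))
```

The rank count (15) matches the `-nt 15` passed to mdrun, which is the number the domain decomposition has to factor into a grid.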