ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 11:30 am
by acnash
Hi all,

I've been running F@H for about three years now and this is the first time I've come across this problem. I've just upgraded my client to 7.6.9, but I think the timing of the upgrade and the error is coincidental.

Quick system spec:

11:16:12:WU02:FS00:0xa7:*********************** Log Started 2020-04-23T11:16:12Z ***********************
11:16:12:WU02:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
11:16:12:WU02:FS00:0xa7:       Type: 0xa7
11:16:12:WU02:FS00:0xa7:       Core: Gromacs
11:16:12:WU02:FS00:0xa7:       Args: -dir 02 -suffix 01 -version 704 -lifeline 10318 -checkpoint 15 -np
11:16:12:WU02:FS00:0xa7:             15
11:16:12:WU02:FS00:0xa7:************************************ CBang *************************************
11:16:12:WU02:FS00:0xa7:       Date: Nov 5 2019
11:16:12:WU02:FS00:0xa7:       Time: 06:06:57
11:16:12:WU02:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
11:16:12:WU02:FS00:0xa7:     Branch: master
11:16:12:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
11:16:12:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
11:16:12:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
11:16:12:WU02:FS00:0xa7:       Bits: 64
11:16:12:WU02:FS00:0xa7:       Mode: Release
11:16:12:WU02:FS00:0xa7:************************************ System ************************************
11:16:12:WU02:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
11:16:12:WU02:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 45 Stepping 7
11:16:12:WU02:FS00:0xa7:       CPUs: 32
11:16:12:WU02:FS00:0xa7:     Memory: 125.88GiB
11:16:12:WU02:FS00:0xa7:Free Memory: 43.49GiB
11:16:12:WU02:FS00:0xa7:    Threads: POSIX_THREADS
11:16:12:WU02:FS00:0xa7: OS Version: 4.10
11:16:12:WU02:FS00:0xa7:Has Battery: false
11:16:12:WU02:FS00:0xa7: On Battery: false
11:16:12:WU02:FS00:0xa7: UTC Offset: 1
11:16:12:WU02:FS00:0xa7:        PID: 10322
11:16:12:WU02:FS00:0xa7:        CWD: /var/lib/fahclient/work
11:16:12:WU02:FS00:0xa7:******************************** Build - libFAH ********************************
11:16:12:WU02:FS00:0xa7:    Version: 0.0.18
11:16:12:WU02:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
11:16:12:WU02:FS00:0xa7:  Copyright: 2019 foldingathome.org
11:16:12:WU02:FS00:0xa7:   Homepage: https://foldingathome.org/
11:16:12:WU02:FS00:0xa7:       Date: Nov 5 2019
11:16:12:WU02:FS00:0xa7:       Time: 06:13:26
11:16:12:WU02:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
11:16:12:WU02:FS00:0xa7:     Branch: master
11:16:12:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
11:16:12:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
11:16:12:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
11:16:12:WU02:FS00:0xa7:       Bits: 64
11:16:12:WU02:FS00:0xa7:       Mode: Release
11:16:12:WU02:FS00:0xa7:************************************ Build *************************************
11:16:12:WU02:FS00:0xa7:       SIMD: avx_256
11:16:12:WU02:FS00:0xa7:********************************************************************************
Now the actual error:

11:16:12:WU02:FS00:0xa7:Project: 16417 (Run 1322, Clone 1, Gen 62)
11:16:12:WU02:FS00:0xa7:Unit: 0x0000004996880e6e5e8a61572b189804
11:16:12:WU02:FS00:0xa7:Reading tar file core.xml
11:16:12:WU02:FS00:0xa7:Reading tar file frame62.tpr
11:16:12:WU02:FS00:0xa7:Digital signatures verified
11:16:12:WU02:FS00:0xa7:Calling: mdrun -s frame62.tpr -o frame62.trr -x frame62.xtc -cpt 15 -nt 15
11:16:12:WU02:FS00:0xa7:Steps: first=15500000 total=250000
11:16:12:WU02:FS00:0xa7:ERROR:
11:16:12:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
11:16:12:WU02:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
11:16:12:WU02:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
11:16:12:WU02:FS00:0xa7:ERROR:
11:16:12:WU02:FS00:0xa7:ERROR:Fatal error:
11:16:12:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 15 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
11:16:12:WU02:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
11:16:12:WU02:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
11:16:12:WU02:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
11:16:12:WU02:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
11:16:12:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
11:16:17:WU02:FS00:0xa7:WARNING:Unexpected exit() call
11:16:17:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
11:16:17:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
11:16:17:WU02:FS00:0xa7:Saving result file md.log
11:16:17:WU02:FS00:0xa7:Saving result file science.log
11:16:17:WU02:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
11:16:17:WU02:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
I've tried changing the 'cause' preference, but the client is stuck on Covid-19 project 16417 (which is CPU only; I'd like to change that if possible, i.e., how does one get GPU-compatible projects?).

The Computational Chemist in me thinks that the gmx grompp of the project topology and coordinate data was designed for fewer cores (hence a domain decomposition issue). However, I would be very surprised if this work unit got through to a client if such was the case. Hence, I would like to hand over to your better judgement in terms of the F@H software.

Quick note: I've left this running for 3 days hoping it would right itself, but it continues to pull down the same project WU and fail. Another note: this is my own workhorse of a machine, running my own comp chem calculations, so I'd welcome suggestions on how to restart the F@H Core without restarting the machine, if that's all it takes. Thanks.

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 11:45 am
by Neil-B
Looks like you may have a 15-core CPU slot ... try backing that down to 12 and it might clear.

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 12:15 pm
by acnash
I hope it's that simple!

Sorry if I sound dense, but how do I drop to 12? The Client Control provides "Light = 15, Medium=30, Full=31 & with GPU running". It is currently set on "Light".

Nudging it up to "Full" I get:

12:14:14:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 25 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
Which again is an issue with Gromacs trying to split the unit cell over that number of ranks.
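
To illustrate the constraint behind that error, here is a minimal sketch under loud simplifying assumptions: a rectangular box, every rank used for a spatial domain (no separate PME ranks), and a hypothetical 6 nm cubic box, since the real box size for project 16417 is not given in this thread. Only the 1.4227 nm minimum cell size comes from the log. The function name is made up for illustration and is not part of GROMACS.

```python
from itertools import product

def feasible_grids(ranks, box_nm, min_cell_nm):
    """Return all (nx, ny, nz) grids with nx*ny*nz == ranks whose cells
    are at least min_cell_nm wide along every box dimension."""
    # Maximum number of cells that fit along each dimension.
    max_cells = [int(length // min_cell_nm) for length in box_nm]
    grids = []
    for nx, ny in product(range(1, max_cells[0] + 1),
                          range(1, max_cells[1] + 1)):
        if ranks % (nx * ny) != 0:
            continue
        nz = ranks // (nx * ny)
        if nz <= max_cells[2]:
            grids.append((nx, ny, nz))
    return grids

MIN_CELL_NM = 1.4227        # from the error message in the log
BOX_NM = (6.0, 6.0, 6.0)    # hypothetical box, NOT taken from the work unit

for ranks in (4, 12, 15, 16, 25):
    grids = feasible_grids(ranks, BOX_NM, MIN_CELL_NM)
    print(ranks, "ok" if grids else "no decomposition", grids[:1])
```

With this assumed box at most four cells fit along each axis, so 15 and 25 ranks (which both need a factor of 5 along some axis) have no valid grid, while 4, 12 and 16 do. The real mdrun logic is more involved, so treat this only as a picture of why some thread counts fail.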

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 12:49 pm
by Neil-B
With Linux I am not sure, but if you can find/open Advanced Control, click on Configure, select the Slots tab, click on the CPU folding slot, then click Edit, then change the number of CPU threads to 12.

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 12:57 pm
by acnash
Thank you! I hadn't spotted that option before.

I've set it to 4, so the command now comes up as "-nt 4".

It looks to be holding and running without interruption. I'm now just wondering when can I change it back to "-1" for the client to decide. I've never had to manually set the cpu count.

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 1:08 pm
by Neil-B
right click on the slot in status and mark as finish ... then once it has finished reset the cpu slot to whatever.

On the topic of number of cores ... -1 is the default but can cause some issues/challenges ... how many cores do you actually have and what/how many slots do you fold on (CPU & GPU)? ... Do you use the power slider (which doesn't actually work the way you may think it does) or are you happy to just set a stable set of folding slots? ... If you let us know your priorities/usage style, setting the slot to a specific number of cores might avoid a few challenges in the future - and choosing what core count to use can make quite a difference.

I believe the way the slots currently work is that High uses the total number of cores your system has, less one for each GPU slot/card ... Medium actually just reduces the core count by one from High ... Light halves the core count and pauses the GPU slot(s) - it also makes the GPU slots look as if they are paused waiting for idle, which can be confusing.
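
A rough sketch of that slider behaviour, as described in this thread rather than taken from the client source. The function name is hypothetical, "Full" is the label the client control uses for what is called High above, and treating Light as half of Full rounded down is an assumption chosen because it reproduces the 15/30/31 values reported earlier for a 32-core machine with one GPU.

```python
def slider_threads(total_cores, gpu_slots):
    """CPU thread count per power-slider setting, per the description in this thread."""
    full = total_cores - gpu_slots   # one core reserved per GPU slot
    medium = full - 1                # one fewer than Full
    light = full // 2                # assumption: half of Full, rounded down
    return {"Light": light, "Medium": medium, "Full": full}

print(slider_threads(32, 1))
```

For 32 cores and one GPU slot this gives Light = 15, Medium = 30 and Full = 31, matching the values shown in the client control.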

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 1:15 pm
by acnash
Thanks for the information. I kind of figured out the slider behaviour by checking the mdrun commands in the system log.

There are 32 cores. If there is any way of running 15/16 of them whilst using the GPU, that would be awesome. The other half I need for my own work.

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 1:18 pm
by Neil-B
Where you set 4, set it to 16 once this WU has finished (a better count than 15 - if you are interested, search the forums for "large primes") ... that will limit/run that CPU slot at 16 cores ... however, the FAH cores run at low priority, so you shouldn't have issues with something higher, as your other work should easily take priority (say 24 - if it does impact, you can step the WU down to 16 by the method you have just used, but it shouldn't). GPU is a bit tougher, as the way OSs work GPU is pretty much non-prioritised - if you find you are getting lags, then sometimes turning off hardware acceleration can help ... but most importantly, make sure you choose something you are happy with :)
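
As a quick way to compare candidate thread counts along the "large primes" line of thinking, here is a small sketch that prints the largest prime factor of each count. The idea that counts whose largest prime factor is 5 or more are the riskier ones is only the rule of thumb suggested in this thread, not a documented GROMACS or F@H rule, and the function name is made up for illustration.

```python
def largest_prime_factor(n):
    """Largest prime factor of n (n >= 2), found by trial division."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:
        # Whatever remains is itself a prime larger than any factor found so far.
        largest = n
    return largest

for threads in (12, 15, 16, 24, 25, 30, 31):
    print(threads, largest_prime_factor(threads))
```

Counts such as 12, 16 and 24 factor into 2s and 3s only, whereas 15, 25 and 30 contain a 5 and 31 is itself prime.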

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 1:41 pm
by acnash
Thanks for the information. The WU finished and now at 16 cores everything seems fine.

During a break from my own calculations, I changed it to 30 cores and the TEP per day obviously increased; however, my GPU is still idle. The project (14366) is CPU only. Is there any way of forcing the F@H client to pick only CPU+GPU based projects?

Thanks

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 1:49 pm
by Neil-B
30 cores is relatively stable (for the most part) but is a multiple of 5, so it may have an occasional issue.

CPU and GPU are totally separate and independent - you will get CPU WUs (Gromacs based) for the CPU Slot and GPU WUs (OpenMM based) for the GPU and the client will quite happily run both at once (if the WUs are available) ... There has been a bit of a scarcity of GPU WUs recently but I believe this is improving and chances of getting them are increasing ... I believe most people are now back to 24/7 CPU folding ... GPU is still showing periods of waiting.

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 3:06 pm
by acnash
Thanks for that information, that all makes sense. Also, it's interesting to know that F@H is also folding using OpenMM; I always thought it was Gromacs only.

Thanks again.
Anthony

Re: ERROR: BAD_WORK_UNIT

Posted: Thu Apr 23, 2020 3:54 pm
by Neil-B
Someone from the "FAH Forum History Department" aka an "Old Timer" will no doubt be able to give chapter and verse on when FAH embraced OpenMM ... Think I spotted somewhere that Gromacs itself is moving to embracing OpenMM in some way - but that may be a fantasy world in my grey matter.

Re: ERROR: BAD_WORK_UNIT

Posted: Fri Apr 24, 2020 1:58 am
by PantherX
acnash wrote:...Is there any way of forcing the F@H client to pick only CPU+GPU based projects?...
There's no single WU that can utilize both the CPU and GPU simultaneously yet (it's a cool idea but not sure if it's feasible or not). Instead, you can get WUs for CPU only or GPU only. You can fold on both CPU and GPU, but they would be processing different WUs, not the same one.

Re: ERROR: BAD_WORK_UNIT

Posted: Fri Apr 24, 2020 2:19 am
by NoMoreQuarantine
PantherX wrote:There's no single WU that can utilize both the CPU and GPU simultaneously yet (it's a cool idea but not sure if it's feasible or not). Instead, you can get WUs for CPU only or GPU only. You can fold on both CPU and GPU, but they would be processing different WUs, not the same one.
I've been wondering for a while now why GPU acceleration isn't used for the Gromacs core: http://www.gromacs.org/Documentation/Ac ... lelization

Re: ERROR: BAD_WORK_UNIT

Posted: Fri Apr 24, 2020 2:46 am
by bruce
"Old Timer" here.

History: FAH was started about 20 years ago at Stanford University. Early simulations were done with a variety of open software analysis packages, but gradually GROMACS dominated the field. It was updated from x86 code to SSE and 3DNow!, and before long SSE2 was adopted as a requirement. Meanwhile, the GPU had been mostly idle, and a team was put together at Stanford to support the new hardware with OpenMM. A lot has changed since then, but FAH has adopted GROMACS exclusively for CPUs and OpenMM exclusively for GPUs. (The standalone versions for individual scientists are not exclusive.)

The OpenMM project is still at Stanford, while FAH has moved out to a Consortium of many Universities.
https://foldingathome.org/about/the-fol ... onsortium/
https://foldingathome.org/project-timeline/