Possible bad WU?

Moderators: Site Moderators, FAHC Science Team

Kebast
Posts: 386
Joined: Thu Aug 06, 2015 5:21 pm

Possible bad WU?

Post by Kebast »

I got this error, and am not sure what it means:

Code:

20:45:31:WU00:FS00:Downloading 2.16MiB
20:45:31:WU00:FS00:Download complete
20:45:31:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13833 run:0 clone:3814 gen:10 core:0xa7 unit:0x0000000b80fccb095e6e55b22adad799
20:45:31:WU00:FS00:Downloading core from http://cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah
20:45:31:WU00:FS00:Connecting to cores.foldingathome.org:80
20:45:32:WU00:FS00:FahCore a7: Downloading 6.71MiB
20:45:32:WU00:FS00:FahCore a7: Download complete
20:45:32:WU00:FS00:Valid core signature
20:45:32:WU00:FS00:Unpacked 19.85MiB to cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe
20:45:32:WU00:FS00:Unpacked 2.64MiB to cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/libfftw3f-3.dll
20:45:32:WU00:FS00:Starting
20:45:32:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\xxxxxx\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 00 -suffix 01 -version 705 -lifeline 8808 -checkpoint 15 -np 10
20:45:32:WU00:FS00:Started FahCore on PID 15272
20:45:32:WU00:FS00:Core PID:16916
20:45:32:WU00:FS00:FahCore 0xa7 started
20:45:33:WU00:FS00:0xa7:*********************** Log Started 2020-03-17T20:45:32Z ***********************
20:45:33:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
20:45:33:WU00:FS00:0xa7:       Type: 0xa7
20:45:33:WU00:FS00:0xa7:       Core: Gromacs
20:45:33:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 15272 -checkpoint 15 -np
20:45:33:WU00:FS00:0xa7:             10
20:45:33:WU00:FS00:0xa7:************************************ CBang *************************************
20:45:33:WU00:FS00:0xa7:       Date: Oct 26 2019
20:45:33:WU00:FS00:0xa7:       Time: 01:38:25
20:45:33:WU00:FS00:0xa7:   Revision: c46a1a011a24143739ac7218c5a435f66777f62f
20:45:33:WU00:FS00:0xa7:     Branch: master
20:45:33:WU00:FS00:0xa7:   Compiler: Visual C++ 2008
20:45:33:WU00:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
20:45:33:WU00:FS00:0xa7:   Platform: win32 10
20:45:33:WU00:FS00:0xa7:       Bits: 64
20:45:33:WU00:FS00:0xa7:       Mode: Release
20:45:33:WU00:FS00:0xa7:************************************ System ************************************
20:45:33:WU00:FS00:0xa7:        CPU: AMD Ryzen 7 3800X 8-Core Processor
20:45:33:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
20:45:33:WU00:FS00:0xa7:       CPUs: 16
20:45:33:WU00:FS00:0xa7:     Memory: 31.95GiB
20:45:33:WU00:FS00:0xa7:Free Memory: 25.72GiB
20:45:33:WU00:FS00:0xa7:    Threads: WINDOWS_THREADS
20:45:33:WU00:FS00:0xa7: OS Version: 6.2
20:45:33:WU00:FS00:0xa7:Has Battery: false
20:45:33:WU00:FS00:0xa7: On Battery: false
20:45:33:WU00:FS00:0xa7: UTC Offset: -4
20:45:33:WU00:FS00:0xa7:        PID: 16916
20:45:33:WU00:FS00:0xa7:        CWD: C:\Users\xxxxx\AppData\Roaming\FAHClient\work
20:45:33:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
20:45:33:WU00:FS00:0xa7:    Version: 0.0.18
20:45:33:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
20:45:33:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
20:45:33:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
20:45:33:WU00:FS00:0xa7:       Date: Oct 26 2019
20:45:33:WU00:FS00:0xa7:       Time: 01:52:30
20:45:33:WU00:FS00:0xa7:   Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
20:45:33:WU00:FS00:0xa7:     Branch: master
20:45:33:WU00:FS00:0xa7:   Compiler: Visual C++ 2008
20:45:33:WU00:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
20:45:33:WU00:FS00:0xa7:   Platform: win32 10
20:45:33:WU00:FS00:0xa7:       Bits: 64
20:45:33:WU00:FS00:0xa7:       Mode: Release
20:45:33:WU00:FS00:0xa7:************************************ Build *************************************
20:45:33:WU00:FS00:0xa7:       SIMD: avx_256
20:45:33:WU00:FS00:0xa7:********************************************************************************
20:45:33:WU00:FS00:0xa7:Project: 13833 (Run 0, Clone 3814, Gen 10)
20:45:33:WU00:FS00:0xa7:Unit: 0x0000000b80fccb095e6e55b22adad799
20:45:33:WU00:FS00:0xa7:Reading tar file core.xml
20:45:33:WU00:FS00:0xa7:Reading tar file frame10.tpr
20:45:33:WU00:FS00:0xa7:Digital signatures verified
20:45:33:WU00:FS00:0xa7:Calling: mdrun -s frame10.tpr -o frame10.trr -x frame10.xtc -cpt 15 -nt 10
20:45:33:WU00:FS00:0xa7:Steps: first=2500000 total=250000
20:45:33:WU00:FS00:0xa7:ERROR:
20:45:33:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
20:45:33:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
20:45:33:WU00:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
20:45:33:WU00:FS00:0xa7:ERROR:
20:45:33:WU00:FS00:0xa7:ERROR:Fatal error:
20:45:33:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
20:45:33:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
20:45:33:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
20:45:33:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
20:45:33:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
20:45:33:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
20:45:38:WU00:FS00:0xa7:WARNING:Unexpected exit() call
20:45:38:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
20:45:38:WU00:FS00:0xa7:Saving result file ..\logfile_01.txt
20:45:38:WU00:FS00:0xa7:Saving result file md.log
20:45:38:WU00:FS00:0xa7:Saving result file science.log
20:45:38:WU00:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: The process cannot access the file because it is being used by another process: "01/md.log"
20:45:38:WU00:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
20:45:38:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
20:45:38:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:13833 run:0 clone:3814 gen:10 core:0xa7 unit:0x0000000b80fccb095e6e55b22adad799
20:45:38:WU00:FS00:Uploading 19.00KiB to 128.252.203.9
20:45:38:WU00:FS00:Connecting to 128.252.203.9:8080
20:45:38:WU02:FS00:Connecting to 65.254.110.245:8080
20:45:38:WARNING:WU02:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
20:45:38:WU02:FS00:Connecting to 18.218.241.186:80
20:45:39:WARNING:WU02:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:45:39:ERROR:WU02:FS00:Exception: Could not get an assignment
20:45:39:WU02:FS00:Connecting to 65.254.110.245:8080
20:45:39:WARNING:WU02:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
20:45:39:WU02:FS00:Connecting to 18.218.241.186:80
20:45:39:WARNING:WU02:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:45:39:ERROR:WU02:FS00:Exception: Could not get an assignment
20:45:42:WU00:FS00:Upload complete
20:45:42:WU00:FS00:Server responded WORK_ACK (400)
20:45:42:WU00:FS00:Cleaning up
Ryzen 5900x 12T - RTX 4070 TI
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Possible bad WU?

Post by bruce »

FAH uses analysis code from a couple of open software sites, including GROMACS.

Gromacs works well on corporate computers, where the hardware being allocated and the scientist/researcher are tightly coordinated. FAH has adapted it for home computers. Your computer is donating the resources of 10 CPU threads, but this particular project doesn't like that configuration. The project owner needs to reconfigure the assignment logic so that this protein doesn't get assigned to people who happen to have 10 CPU threads. (I'll bet it would work with either 8 or 9.) Other proteins will work with 10.
Joe_H
Site Admin
Posts: 7990
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: Possible bad WU?

Post by Joe_H »

The Gromacs code used in the CPU folding core has problems with "large" prime numbers and their multiples. Usually "large" means 7 and larger, but some projects have issues with 5 and its multiples. That appears to be the problem here: it was trying to run on 10 CPU threads and failed.

During internal and beta testing, before projects are released to all, they try to identify which projects have problems with CPU thread counts like this and set them to only assign to other CPU settings. However, sometimes this is missed, or the problem only affects some small number of WUs within a project. As Bruce mentions, if there are enough reports of this for a particular project, the project owner will be requested to adjust assignment to avoid this.

P.S. For those who notice their system requesting, for example, 7 threads of an 8-core CPU: the servers and the current client have code to change that and provide a WU that will use 6 CPU threads instead. Older versions of the client don't have that code.
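
To make the rule of thumb above concrete, here is a small Python sketch that factors a thread count and labels it according to its largest prime factor. The function names and the three labels are invented for this illustration; they are not part of the FAH client or of GROMACS, and real projects may behave differently.

```python
def prime_factors(n: int) -> list:
    """Return the prime factors of n in ascending order (empty for n < 2)."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)  # whatever is left is itself prime
    return factors


def classify(threads: int) -> str:
    """Hypothetical labels based on the largest prime factor of the thread count."""
    largest = max(prime_factors(threads), default=1)
    if largest <= 3:
        return "usually fine"
    if largest == 5:
        return "project dependent"
    return "often fails"  # 7 and larger, per the rule of thumb above


if __name__ == "__main__":
    for threads in range(2, 17):
        print(threads, prime_factors(threads), classify(threads))
```

Under these assumptions, 10 threads (2*5) lands in the "project dependent" group, which matches the failure reported in this thread, while 8 and 12 contain only factors of 2 and 3.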
Kebast
Posts: 386
Joined: Thu Aug 06, 2015 5:21 pm

Re: Possible bad WU?

Post by Kebast »

Thanks. I lowered the thread count to 8 and immediately got another project. It's folding fine.

So what is the best number of threads to assign? I don't really need 8 free. Would assigning 12 work as well as 8, or should I just leave it alone for now?
Ryzen 5900x 12T - RTX 4070 TI
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Possible bad WU?

Post by bruce »

GROMACS says the number of threads should have only factors of 2 or 3. FAH has attempted to handle that by backing off for known problem factors: 14=2*7 will be reduced to 13 (which also fails) and then to 12, which should work. The factor 5 is often acceptable, but not always, so 10=2*5 may or may not fold, but it doesn't automatically get reduced to 9=3*3.
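
The back-off described above can be sketched roughly as follows. This is only an illustration, not the actual FAH client or core code: the function names and the `max_factor` parameter are made up for the example, and which factors are tolerated may differ per project.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n, or 1 if n < 2."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:
        largest = n  # the remainder is a prime larger than any factor found
    return largest


def reduce_threads(requested: int, max_factor: int = 3) -> int:
    """Step down one thread at a time until no prime factor exceeds max_factor.

    max_factor is an assumption of this sketch: 3 is the strict rule,
    5 models projects that tolerate a factor of 5.
    """
    count = requested
    while count > 1 and largest_prime_factor(count) > max_factor:
        count -= 1
    return count


if __name__ == "__main__":
    print(reduce_threads(14, max_factor=5))  # steps 14 -> 13 -> 12
    print(reduce_threads(10, max_factor=5))  # 10 is kept when 5 is tolerated
    print(reduce_threads(10, max_factor=3))  # steps 10 -> 9
```

With `max_factor=5` the sketch reproduces the 14 to 13 to 12 sequence and leaves 10 alone; with `max_factor=3` it would also push 10 down to 9, which is the reduction that does not happen automatically.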
Kebast
Posts: 386
Joined: Thu Aug 06, 2015 5:21 pm

Re: Possible bad WU?

Post by Kebast »

Thanks for the great explanation!
Ryzen 5900x 12T - RTX 4070 TI
Darth_Peter_dualxeon
Posts: 46
Joined: Fri Mar 20, 2020 3:13 am
Hardware configuration: EVGA SR-2 motherboard
2x Xeon x5670 CPU
64 GB ECC DDR3
Nvidia RTX 2070

Re: Possible bad WU?

Post by Darth_Peter_dualxeon »

Hi,
I have a server workstation with lots of threads (2 CPUs, 24 threads altogether, but I keep 1 for the GPU).
This is what I see:

Code:

12:47:32:WU03:FS00:0xa7:Reducing thread count from 23 to 22 to avoid domain decomposition by a prime number > 3
12:47:32:WU03:FS00:0xa7:Reducing thread count from 22 to 21 to avoid domain decomposition with large prime factor 11

So I'm a bit confused. 23 is a prime > 3, so it goes to 22. Then, 22 is 2*11, and 11 is a large prime, so it goes to 21.

In the end it is at 21, which is 3*7. And 7 is a prime > 3, which is what it wanted to avoid in the first place ...

So, now what? Should I set it to 2*3*3=18 threads? Then I'd have a lot of unused threads. (By the way, I have not gotten any errors due to this yet. Everything my PC does seems to get accepted.)
Joe_H
Site Admin
Posts: 7990
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: Possible bad WU?

Post by Joe_H »

The auto-reducing works, to a point. What some do is set up a second CPU slot to use the "extra" threads on a separate project. That may result in reduced overall throughput; YMMV. Some projects will work with 20 threads by being large enough or having the right geometry, but it can take some testing to find which ones can, so it is a bit hard to automate.
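
One way to explore the two-slot idea is to search for a pair of thread counts that both avoid large prime factors. The following is a rough Python sketch, assuming only factors of 2 and 3 are safe; the helper names are hypothetical and nothing here comes from the FAH client itself.

```python
def is_smooth(n: int, allowed=(2, 3)) -> bool:
    """True if n >= 1 and n has no prime factors outside `allowed`."""
    if n < 1:
        return False
    for p in allowed:
        while n % p == 0:
            n //= p
    return n == 1


def best_two_slot_split(total: int, allowed=(2, 3)):
    """Find two slot sizes (each >= 2) that use as many of `total` threads as possible.

    Returns a (larger, smaller) tuple, or None if no such pair exists.
    Ties are resolved in favour of the larger first slot.
    """
    best = None
    for first in range(total - 1, 1, -1):
        if not is_smooth(first, allowed):
            continue
        # Largest acceptable second slot that still fits next to `first`.
        for second in range(min(first, total - first), 1, -1):
            if is_smooth(second, allowed):
                if best is None or first + second > sum(best):
                    best = (first, second)
                break
    return best


if __name__ == "__main__":
    print(best_two_slot_split(23))  # (18, 4)
```

For the 23 threads mentioned above, this suggests slots of 18 and 4, using 22 threads in total. Whether that actually beats a single 21-thread slot is not something the sketch can answer; as noted, overall throughput may go down.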