Issues, perhaps bad WU? - 16417

Posted: Tue Apr 07, 2020 3:55 pm
by MMaatttt
I'm seeing issues recently. Could this be a bad WU? My system is stable and not overclocked and I've not seen this issue before.

Code:

15:37:20:WU01:FS00:Starting
15:37:20:WU01:FS00:Removing old file './work/01/logfile_01-20200407-150655.txt'
15:37:20:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 2089 -checkpoint 15 -np 24
15:37:20:WU01:FS00:Started FahCore on PID 8993
15:37:20:WU01:FS00:Core PID:8997
15:37:20:WU01:FS00:FahCore 0xa7 started
15:37:21:WU01:FS00:0xa7:*********************** Log Started 2020-04-07T15:37:20Z ***********************
15:37:21:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
15:37:21:WU01:FS00:0xa7:       Type: 0xa7
15:37:21:WU01:FS00:0xa7:       Core: Gromacs
15:37:21:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 8993 -checkpoint 15 -np
15:37:21:WU01:FS00:0xa7:             24
15:37:21:WU01:FS00:0xa7:************************************ CBang *************************************
15:37:21:WU01:FS00:0xa7:       Date: Nov 5 2019
15:37:21:WU01:FS00:0xa7:       Time: 06:06:57
15:37:21:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
15:37:21:WU01:FS00:0xa7:     Branch: master
15:37:21:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
15:37:21:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
15:37:21:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:37:21:WU01:FS00:0xa7:       Bits: 64
15:37:21:WU01:FS00:0xa7:       Mode: Release
15:37:21:WU01:FS00:0xa7:************************************ System ************************************
15:37:21:WU01:FS00:0xa7:        CPU: AMD Ryzen 9 3900X 12-Core Processor
15:37:21:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
15:37:21:WU01:FS00:0xa7:       CPUs: 24
15:37:21:WU01:FS00:0xa7:     Memory: 31.37GiB
15:37:21:WU01:FS00:0xa7:Free Memory: 9.22GiB
15:37:21:WU01:FS00:0xa7:    Threads: POSIX_THREADS
15:37:21:WU01:FS00:0xa7: OS Version: 5.3
15:37:21:WU01:FS00:0xa7:Has Battery: false
15:37:21:WU01:FS00:0xa7: On Battery: false
15:37:21:WU01:FS00:0xa7: UTC Offset: 1
15:37:21:WU01:FS00:0xa7:        PID: 8997
15:37:21:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
15:37:21:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
15:37:21:WU01:FS00:0xa7:    Version: 0.0.18
15:37:21:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:37:21:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
15:37:21:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
15:37:21:WU01:FS00:0xa7:       Date: Nov 5 2019
15:37:21:WU01:FS00:0xa7:       Time: 06:13:26
15:37:21:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
15:37:21:WU01:FS00:0xa7:     Branch: master
15:37:21:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
15:37:21:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
15:37:21:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:37:21:WU01:FS00:0xa7:       Bits: 64
15:37:21:WU01:FS00:0xa7:       Mode: Release
15:37:21:WU01:FS00:0xa7:************************************ Build *************************************
15:37:21:WU01:FS00:0xa7:       SIMD: avx_256
15:37:21:WU01:FS00:0xa7:********************************************************************************
15:37:21:WU01:FS00:0xa7:Project: 16417 (Run 1957, Clone 3, Gen 11)
15:37:21:WU01:FS00:0xa7:Unit: 0x0000000b96880e6e5e8a605c572322d6
15:37:21:WU01:FS00:0xa7:Reading tar file core.xml
15:37:21:WU01:FS00:0xa7:Reading tar file frame11.tpr
15:37:21:WU01:FS00:0xa7:Digital signatures verified
15:37:21:WU01:FS00:0xa7:Calling: mdrun -s frame11.tpr -o frame11.trr -x frame11.xtc -cpt 15 -nt 24
15:37:22:WU01:FS00:0xa7:Steps: first=2750000 total=250000
15:37:22:WU01:FS00:0xa7:ERROR:
15:37:22:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:37:22:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:37:22:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
15:37:22:WU01:FS00:0xa7:ERROR:
15:37:22:WU01:FS00:0xa7:ERROR:Fatal error:
15:37:22:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
15:37:22:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
15:37:22:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
15:37:22:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:37:22:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:37:22:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
15:37:27:WU01:FS00:0xa7:WARNING:Unexpected exit() call
15:37:27:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
15:37:27:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
15:37:27:WU01:FS00:0xa7:Saving result file md.log
15:37:27:WU01:FS00:0xa7:Saving result file science.log
15:37:27:WU01:FS00:0xa7:Caught signal SIGSEGV(11) on PID 8997
15:37:27:WU01:FS00:0xa7:Caught signal SIGSEGV(11) on PID 8997
15:37:27:WU01:FS00:0xa7:Caught signal SIGSEGV(11) on PID 8997
15:37:27:WU01:FS00:0xa7:Caught signal SIGSEGV(11) on PID 8997
15:37:27:WU01:FS00:0xa7:Caught signal SIGSEGV(11) on PID 8997
15:37:27:WU01:FS00:0xa7:Caught signal SIGSEGV(11) on PID 8997
15:37:27:WU01:FS00:0xa7:Caught signal SIGSEGV(11) on PID 8997
15:37:27:WU01:FS00:0xa7:Caught signal SIGSEGV(11) on PID 8997
15:37:28:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

Re: Issues, perhaps bad WU?

Posted: Tue Apr 07, 2020 4:23 pm
by Neil-B
If you search on 16417 you will see there are a number of threads … Looks like it may be an issue with higher core count slots for this WU

Re: Issues, perhaps bad WU?

Posted: Tue Apr 07, 2020 5:39 pm
by MMaatttt
I've seen a thread saying the WU didn't work on a >64-thread machine, with someone later saying the WU had been fixed. After that post, some people were saying that smaller machines may still have problems. I've not seen anything about a fix for them, and clearly I'm still having problems.

It looks to me like this WU is still broken. I'll try limiting my thread count, but that would be a workaround and not a fix.

OK, so after trying some things:

Setting cores to 10 still fails. Setting to 8 seems to be running.

Code:

17:35:24:WU01:FS00:0xa7:Calling: mdrun -s frame11.tpr -o frame11.trr -x frame11.xtc -cpt 15 -nt 8
17:35:24:WU01:FS00:0xa7:Steps: first=2750000 total=250000
17:35:25:WU01:FS00:0xa7:Completed 1 out of 250000 steps (0%)
17:36:03:WU01:FS00:0xa7:Completed 2500 out of 250000 steps (1%)
17:36:37:WU01:FS00:0xa7:Completed 5000 out of 250000 steps (2%)
Could this please be reported back to the WU owner?

Re: Issues, perhaps bad WU?

Posted: Tue Apr 07, 2020 6:49 pm
by Joe_H
It seems this project has an issue with multiples of 5 (the 10 and 20 in your tries). I can pass that on to the researcher.

As for other reports, I did respond to one where a Project 16417 WU failed on 96 CPU threads but worked on 64. It is possible the WU would work for you on 18, 16, or 12 threads.

Re: Issues, perhaps bad WU?

Posted: Tue Apr 07, 2020 7:31 pm
by MMaatttt
My original run (the one that failed) was started with 24 CPUs (12 cores, 24 threads).

I tried setting different core counts to see if I could confirm your suspicion, but after testing 1-6 and then 24 threads, it seems to run happily. I think it may be capping threads at the number I first got the job to run at: even though I've taken the thread limit off, it's still running with this command:

Code:

Calling: mdrun -s frame10.tpr -o frame10.trr -x frame10.xtc -cpi state.cpt -cpt 15 -nt 8

Re: Issues, perhaps bad WU?

Posted: Tue Apr 07, 2020 9:41 pm
by Joe_H
Possibly you finished the WU in the original post and downloaded another. Once a WU has been downloaded at a specific thread setting, it can be decreased and even returned to that number of threads, but it cannot be set to a number of threads higher than when it was downloaded.

The researcher who is running this project did take a look at your log, and is doing some checking. What is there shows that the WU in your first post was set to run at 24, but once started it immediately switched to 20. Not what it should have done, so possibly a setting is off somewhere.

P.S. He spotted that, I missed it having seen the "20 ranks" decomposition error first.

Re: Issues, perhaps bad WU?

Posted: Tue Apr 07, 2020 9:49 pm
by bruce
MMaatttt wrote:I think it may be capping threads at the number I first got the job to run at as even though I've taken the thread limit off, it's running with this command:

Code:

Calling: mdrun -s frame10.tpr -o frame10.trr -x frame10.xtc -cpi state.cpt -cpt 15 -nt 8
When you download a WU, FAH reports your equipment, and it caps the number of threads at whatever the slot that was seeking work reported. You can decrease the number of threads but cannot increase it above that initial value.
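The cap bruce describes can be modelled in a few lines of Python (a hypothetical sketch for illustration, not actual FAHClient code):

```python
class WorkUnit:
    """Models the thread cap: a WU downloaded at N threads can run at <= N."""

    def __init__(self, downloaded_at):
        # The slot's thread setting at download time becomes a hard ceiling.
        self.cap = downloaded_at

    def set_threads(self, requested):
        # Decreasing or restoring the count is allowed; exceeding the
        # download-time cap is not.
        return min(requested, self.cap)


wu = WorkUnit(8)           # this WU was downloaded while the slot was set to 8
print(wu.set_threads(24))  # -> 8: raising the slot limit later has no effect
print(wu.set_threads(4))   # -> 4: lowering it still works
```

This matches the `-nt 8` seen in the mdrun call above even after the slot limit was removed.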

Re: Issues, perhaps bad WU?

Posted: Tue Apr 07, 2020 10:09 pm
by MMaatttt
Thanks for all the help.

Re: Issues, perhaps bad WU?

Posted: Wed Apr 08, 2020 1:59 am
by _r2w_ben
Joe_H wrote:The researcher who is running this project did take a look at your log, and is doing some checking. What is there shows that the WU in your first post was set to run at 24, but once started it immediately switched to 20. Not what it should have done, so possibly a setting is off somewhere.

P.S. He spotted that, I missed it having seen the "20 ranks" decomposition error first.
GROMACS may have thought it was worthwhile to use PME ranks. It sounds like it split 24 threads into 20 PP ranks and 4 PME ranks.
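As a back-of-envelope illustration of that split (my own arithmetic, not GROMACS's actual heuristic, which also weighs the box and interaction cutoffs), reserving roughly a sixth of the threads for PME reproduces the numbers in the log:

```python
def split_pp_pme(total, pme_fraction=1 / 6):
    """Naive split of total ranks into PP (particle-particle) and PME ranks.

    GROMACS uses a more elaborate estimate; this only shows the arithmetic.
    """
    pme = round(total * pme_fraction)
    return total - pme, pme


pp, pme = split_pp_pme(24)
print(pp, pme)  # 20 PP ranks + 4 PME ranks
```

Note that 20 = 2 x 2 x 5, which drags the awkward prime 5 into the PP decomposition even though 24 itself factors cleanly.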

Re: Issues, perhaps bad WU?

Posted: Wed Apr 08, 2020 12:15 pm
by MMaatttt
I came across another 16417. This time I set threads to 23 and it works.

Code:

12:09:23:WU00:FS00:0xa7:Reducing thread count from 23 to 22 to avoid domain decomposition by a prime number > 3
12:09:23:WU00:FS00:0xa7:Reducing thread count from 22 to 21 to avoid domain decomposition with large prime factor 11
12:09:23:WU00:FS00:0xa7:Calling: mdrun -s frame27.tpr -o frame27.trr -x frame27.xtc -cpt 15 -nt 21

Re: Issues, perhaps bad WU? - 16417

Posted: Wed Apr 08, 2020 3:17 pm
by Neil-B
Unfortunately 23 is not a great number - hopefully you will get away with it but in general:

(cropped from another post) F@H has difficulty with numbers of CPUs that are large primes or their multiples .. 7 is always large (as is any prime above 7), 5 is sometimes large, and 3 is never large .. Try to choose a number whose only prime factors are 2 and/or 3 .. 2, 3, 4, 6, 8, 9, 12, 16, 18, 24, 27, etc. are good numbers of CPUs to choose .. 5, 10, 15, 20, etc. may work most of the time .. Other numbers may/will bite you.
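To make that rule concrete, here is a small Python sketch (my own illustration, not anything from the F@H client) that classifies a CPU count by its prime factors:

```python
def prime_factors(n):
    """Return the prime factorization of n as a sorted list."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def classify(n):
    """'good' = only 2s and 3s, 'risky' = worst factor is 5, 'bad' = factor >= 7."""
    worst = max(prime_factors(n))
    if worst <= 3:
        return "good"
    if worst == 5:
        return "risky"
    return "bad"


for n in [8, 12, 18, 20, 21, 23, 24]:
    print(n, prime_factors(n), classify(n))
```

By this rule 23 (a bare prime) lands in the "bad" bucket, which is why the core stepped the count down in the log above.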

Sometimes the software will be able to save you by reducing the thread count, but there are limits to how clever it is, and some WUs seem more sensitive to certain primes than others.

Re: Issues, perhaps bad WU? - 16417

Posted: Wed Apr 08, 2020 5:20 pm
by Joe_H
At 23 it stepped down to actually using 21:

Code:

12:09:23:WU00:FS00:0xa7:Calling: mdrun -s frame27.tpr -o frame27.trr -x frame27.xtc -cpt 15 -nt 21
Which is normally not allowed, as it is a multiple of 7. But the geometry of some of the projects being worked on happens to work with a 3x7x1 decomposition. More often than not, a multiple of 7 just does not work, or it produces too many errors.

Issues, perhaps bad WU? - 14523

Posted: Thu Apr 09, 2020 3:13 am
by WA2RKN
Caught the 14523 WU repeatedly starting, then immediately faulting. Repeat.
LOTS of FS00:0xa7:Caught signal SIGSEGV(11) on PID ...

Code:

02:36:04:WU02:FS00:0xa7:*********************** Log Started 2020-04-09T02:36:04Z ***********************
02:36:04:WU02:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
02:36:04:WU02:FS00:0xa7:       Type: 0xa7
02:36:04:WU02:FS00:0xa7:       Core: Gromacs
02:36:04:WU02:FS00:0xa7:       Args: -dir 02 -suffix 01 -version 705 -lifeline 8694 -checkpoint 15 -np
02:36:04:WU02:FS00:0xa7:             10
02:36:04:WU02:FS00:0xa7:************************************ CBang *************************************
02:36:04:WU02:FS00:0xa7:       Date: Nov 5 2019
02:36:04:WU02:FS00:0xa7:       Time: 06:06:57
02:36:04:WU02:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
02:36:04:WU02:FS00:0xa7:     Branch: master
02:36:04:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
02:36:04:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
02:36:04:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
02:36:04:WU02:FS00:0xa7:       Bits: 64
02:36:04:WU02:FS00:0xa7:       Mode: Release
02:36:04:WU02:FS00:0xa7:************************************ System ************************************
02:36:04:WU02:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
02:36:04:WU02:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 2
02:36:04:WU02:FS00:0xa7:       CPUs: 12
02:36:04:WU02:FS00:0xa7:     Memory: 31.27GiB
02:36:04:WU02:FS00:0xa7:Free Memory: 21.35GiB
02:36:04:WU02:FS00:0xa7:    Threads: POSIX_THREADS
02:36:04:WU02:FS00:0xa7: OS Version: 5.3
02:36:04:WU02:FS00:0xa7:Has Battery: false
02:36:04:WU02:FS00:0xa7: On Battery: false
02:36:04:WU02:FS00:0xa7: UTC Offset: -4
02:36:04:WU02:FS00:0xa7:        PID: 8698
02:36:04:WU02:FS00:0xa7:        CWD: /var/lib/fahclient/work
02:36:04:WU02:FS00:0xa7:******************************** Build - libFAH ********************************
02:36:04:WU02:FS00:0xa7:    Version: 0.0.18
02:36:04:WU02:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
02:36:04:WU02:FS00:0xa7:  Copyright: 2019 foldingathome.org
02:36:04:WU02:FS00:0xa7:   Homepage: https://foldingathome.org/
02:36:04:WU02:FS00:0xa7:       Date: Nov 5 2019
02:36:04:WU02:FS00:0xa7:       Time: 06:13:26
02:36:04:WU02:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
02:36:04:WU02:FS00:0xa7:     Branch: master
02:36:04:WU02:FS00:0xa7:   Compiler: GNU 8.3.0
02:36:04:WU02:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
02:36:04:WU02:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
02:36:04:WU02:FS00:0xa7:       Bits: 64
02:36:04:WU02:FS00:0xa7:       Mode: Release
02:36:04:WU02:FS00:0xa7:************************************ Build *************************************
02:36:04:WU02:FS00:0xa7:       SIMD: avx_256
02:36:04:WU02:FS00:0xa7:********************************************************************************
02:36:04:WU02:FS00:0xa7:Project: 14523 (Run 846, Clone 0, Gen 11)
02:36:04:WU02:FS00:0xa7:Unit: 0x0000001580fccb0a5e459bc77ec4f6fc
02:36:04:WU02:FS00:0xa7:Reading tar file core.xml
02:36:04:WU02:FS00:0xa7:Reading tar file frame11.tpr
02:36:04:WU02:FS00:0xa7:Digital signatures verified
02:36:04:WU02:FS00:0xa7:Calling: mdrun -s frame11.tpr -o frame11.trr -x frame11.xtc -cpt 15 -nt 10
02:36:04:WU02:FS00:0xa7:Steps: first=2750000 total=250000
02:36:04:WU02:FS00:0xa7:ERROR:
02:36:04:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
02:36:04:WU02:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
02:36:04:WU02:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
02:36:04:WU02:FS00:0xa7:ERROR:
02:36:04:WU02:FS00:0xa7:ERROR:Fatal error:
02:36:04:WU02:FS00:0xa7:ERROR:There is no domain decomposition for 10 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
02:36:04:WU02:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
02:36:04:WU02:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
02:36:04:WU02:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
02:36:04:WU02:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
02:36:04:WU02:FS00:0xa7:ERROR:-------------------------------------------------------
02:36:09:WU02:FS00:0xa7:WARNING:Unexpected exit() call
02:36:09:WU02:FS00:0xa7:WARNING:Unexpected exit from science code
02:36:09:WU02:FS00:0xa7:Saving result file ../logfile_01.txt
02:36:09:WU02:FS00:0xa7:Saving result file md.log
02:36:09:WU02:FS00:0xa7:Saving result file science.log
02:36:09:WU01:FS01:0x22:Completed 890000 out of 1000000 steps (89%)
02:36:09:WU02:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
02:36:14:FS00:Paused

Re: Issues, perhaps bad WU? - 14523

Posted: Thu Apr 09, 2020 4:32 am
by WA2RKN
New here; I found the reference to setting the core count lower, in my case to 8. That allowed my system to process the WU and finally get rid of it. When I first discovered the problem, the ETA was at 5 days...
Funny that by the time my post was approved, the WU had just finished. Glad it's resolved now (for me, anyway). I'll remember that trick for the future.

Re: Issues, perhaps bad WU?

Posted: Thu Apr 09, 2020 11:38 am
by _r2w_ben
MMaatttt wrote:I came across another 16417. This time I set threads to 23 and it works.
Deep inside the work folder is a file named md.log. Next time you get one of these scenarios, can you search that file for "decomposition" and post this section?

Code:

Initializing Domain Decomposition on 2 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.434 nm, LJ-14, atoms 4251 4256
  multi-body bonded interactions: 0.434 nm, Proper Dih., atoms 4251 4256
Minimum cell size due to bonded interactions: 0.477 nm
Maximum distance for 13 constraints, at 120 deg. angles, all-trans: 0.219 nm
Estimated maximum distance required for P-LINCS: 0.219 nm
Using 0 separate PME ranks, as there are too few total
 ranks for efficient splitting
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 2 cells with a minimum initial size of 0.596 nm
The maximum allowed number of cells is: X 13 Y 13 Z 13
Domain decomposition grid 2 x 1 x 1, separate PME ranks 0
PME domain decomposition: 2 x 1 x 1
Domain decomposition rank 0, coordinates 0 0 0

Using 2 MPI threads
The domain decomposition grid shows how the work unit's particles were split among ranks. If separate PME ranks is greater than 0, then some threads are dedicated to a portion of the energy calculations.

21 threads might be 3x7x1 as Joe_H mentioned. It could also be 2x3x3 (18) + 3 PME ranks = 21.
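A quick way to see which 3-D grids are even arithmetically possible for a given rank count is to enumerate ordered factor triples (an illustrative sketch; the real GROMACS choice also depends on box shape and minimum cell size):

```python
from itertools import product


def grids(ranks, max_dim=64):
    """All ordered (x, y, z) grids with x * y * z == ranks."""
    return [(x, y, z)
            for x, y, z in product(range(1, max_dim + 1), repeat=3)
            if x * y * z == ranks]


print(grids(21))  # includes (3, 7, 1), the grid Joe_H mentioned
print(grids(18))  # includes (2, 3, 3), the PP grid if 3 of 21 threads do PME
```

Enumerating both candidates side by side is an easy way to sanity-check which interpretation the md.log excerpt supports.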