Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Fri Apr 17, 2020 6:06 pm
by HendricksSA
Here we go again. I believe this is probably a bad WU. Details follow:

Code:

17:43:34:WU00:FS00:Server responded WORK_ACK (400)
17:43:34:WU00:FS00:Final credit estimate, 8603.00 points
17:43:34:WU00:FS00:Cleaning up
17:44:26:WU01:FS00:Connecting to 65.254.110.245:8080
17:44:27:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
17:44:27:WU01:FS00:Connecting to 18.218.241.186:80
17:44:27:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
17:44:27:ERROR:WU01:FS00:Exception: Could not get an assignment
17:46:03:WU01:FS00:Connecting to 65.254.110.245:8080
17:46:04:WARNING:WU01:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
17:46:04:WU01:FS00:Connecting to 18.218.241.186:80
17:46:04:WARNING:WU01:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
17:46:04:ERROR:WU01:FS00:Exception: Could not get an assignment
17:48:41:WU01:FS00:Connecting to 65.254.110.245:8080
17:48:41:WU01:FS00:Assigned to work server 40.114.52.201
17:48:41:WU01:FS00:Requesting new work unit for slot 00: READY cpu:48 from 40.114.52.201
17:48:41:WU01:FS00:Connecting to 40.114.52.201:8080
17:48:42:WU01:FS00:Downloading 1.23MiB
17:48:42:WU01:FS00:Download complete
17:48:42:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:14576 run:0 clone:2096 gen:48 core:0xa7 unit:0x0000003e287234c95e7b86f1156830e7
17:48:42:WU01:FS00:Starting
17:48:42:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 1869 -checkpoint 15 -np 48
17:48:42:WU01:FS00:Started FahCore on PID 33838
17:48:42:WU01:FS00:Core PID:33842
17:48:42:WU01:FS00:FahCore 0xa7 started
17:48:43:WU01:FS00:0xa7:*********************** Log Started 2020-04-17T17:48:42Z ***********************
17:48:43:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
17:48:43:WU01:FS00:0xa7:       Type: 0xa7
17:48:43:WU01:FS00:0xa7:       Core: Gromacs
17:48:43:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 33838 -checkpoint 15 -np
17:48:43:WU01:FS00:0xa7:             48
17:48:43:WU01:FS00:0xa7:************************************ CBang *************************************
17:48:43:WU01:FS00:0xa7:       Date: Nov 5 2019
17:48:43:WU01:FS00:0xa7:       Time: 06:06:57
17:48:43:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
17:48:43:WU01:FS00:0xa7:     Branch: master
17:48:43:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
17:48:43:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
17:48:43:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
17:48:43:WU01:FS00:0xa7:       Bits: 64
17:48:43:WU01:FS00:0xa7:       Mode: Release
17:48:43:WU01:FS00:0xa7:************************************ System ************************************
17:48:43:WU01:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
17:48:43:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 2
17:48:43:WU01:FS00:0xa7:       CPUs: 48
17:48:43:WU01:FS00:0xa7:     Memory: 62.80GiB
17:48:43:WU01:FS00:0xa7:Free Memory: 59.03GiB
17:48:43:WU01:FS00:0xa7:    Threads: POSIX_THREADS
17:48:43:WU01:FS00:0xa7: OS Version: 5.3
17:48:43:WU01:FS00:0xa7:Has Battery: false
17:48:43:WU01:FS00:0xa7: On Battery: false
17:48:43:WU01:FS00:0xa7: UTC Offset: -5
17:48:43:WU01:FS00:0xa7:        PID: 33842
17:48:43:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
17:48:43:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
17:48:43:WU01:FS00:0xa7:    Version: 0.0.18
17:48:43:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
17:48:43:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
17:48:43:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
17:48:43:WU01:FS00:0xa7:       Date: Nov 5 2019
17:48:43:WU01:FS00:0xa7:       Time: 06:13:26
17:48:43:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
17:48:43:WU01:FS00:0xa7:     Branch: master
17:48:43:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
17:48:43:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
17:48:43:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
17:48:43:WU01:FS00:0xa7:       Bits: 64
17:48:43:WU01:FS00:0xa7:       Mode: Release
17:48:43:WU01:FS00:0xa7:************************************ Build *************************************
17:48:43:WU01:FS00:0xa7:       SIMD: avx_256
17:48:43:WU01:FS00:0xa7:********************************************************************************
17:48:43:WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 2096, Gen 48)
17:48:43:WU01:FS00:0xa7:Unit: 0x0000003e287234c95e7b86f1156830e7
17:48:43:WU01:FS00:0xa7:Reading tar file core.xml
17:48:43:WU01:FS00:0xa7:Reading tar file frame48.tpr
17:48:43:WU01:FS00:0xa7:Digital signatures verified
17:48:43:WU01:FS00:0xa7:Calling: mdrun -s frame48.tpr -o frame48.trr -x frame48.xtc -cpt 15 -nt 48
17:48:43:WU01:FS00:0xa7:Steps: first=24000000 total=500000
17:48:43:WU01:FS00:0xa7:ERROR:
17:48:43:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
17:48:43:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
17:48:43:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
17:48:43:WU01:FS00:0xa7:ERROR:
17:48:43:WU01:FS00:0xa7:ERROR:Fatal error:
17:48:43:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 40 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
17:48:43:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
17:48:43:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
17:48:43:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
17:48:43:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
17:48:43:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
17:48:48:WU01:FS00:0xa7:WARNING:Unexpected exit() call
17:48:48:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
17:48:48:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
17:48:48:WU01:FS00:0xa7:Saving result file md.log
17:48:48:WU01:FS00:0xa7:Saving result file science.log
17:48:48:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
17:48:48:WU01:FS00:Starting
17:48:48:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 705 -lifeline 1869 -checkpoint 15 -np 48
17:48:48:WU01:FS00:Started FahCore on PID 33894
17:48:48:WU01:FS00:Core PID:33898
17:48:48:WU01:FS00:FahCore 0xa7 started
17:48:49:WU01:FS00:0xa7:*********************** Log Started 2020-04-17T17:48:48Z ***********************
17:48:49:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
17:48:49:WU01:FS00:0xa7:       Type: 0xa7
17:48:49:WU01:FS00:0xa7:       Core: Gromacs
17:48:49:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 33894 -checkpoint 15 -np
17:48:49:WU01:FS00:0xa7:             48
17:48:49:WU01:FS00:0xa7:************************************ CBang *************************************
17:48:49:WU01:FS00:0xa7:       Date: Nov 5 2019
17:48:49:WU01:FS00:0xa7:       Time: 06:06:57
17:48:49:WU01:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
17:48:49:WU01:FS00:0xa7:     Branch: master
17:48:49:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
17:48:49:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
17:48:49:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
17:48:49:WU01:FS00:0xa7:       Bits: 64
17:48:49:WU01:FS00:0xa7:       Mode: Release
17:48:49:WU01:FS00:0xa7:************************************ System ************************************
17:48:49:WU01:FS00:0xa7:        CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
17:48:49:WU01:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 2
17:48:49:WU01:FS00:0xa7:       CPUs: 48
17:48:49:WU01:FS00:0xa7:     Memory: 62.80GiB
17:48:49:WU01:FS00:0xa7:Free Memory: 59.03GiB
17:48:49:WU01:FS00:0xa7:    Threads: POSIX_THREADS
17:48:49:WU01:FS00:0xa7: OS Version: 5.3
17:48:49:WU01:FS00:0xa7:Has Battery: false
17:48:49:WU01:FS00:0xa7: On Battery: false
17:48:49:WU01:FS00:0xa7: UTC Offset: -5
17:48:49:WU01:FS00:0xa7:        PID: 33898
17:48:49:WU01:FS00:0xa7:        CWD: /var/lib/fahclient/work
17:48:49:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
17:48:49:WU01:FS00:0xa7:    Version: 0.0.18
17:48:49:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
17:48:49:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
17:48:49:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
17:48:49:WU01:FS00:0xa7:       Date: Nov 5 2019
17:48:49:WU01:FS00:0xa7:       Time: 06:13:26
17:48:49:WU01:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
17:48:49:WU01:FS00:0xa7:     Branch: master
17:48:49:WU01:FS00:0xa7:   Compiler: GNU 8.3.0
17:48:49:WU01:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
17:48:49:WU01:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
17:48:49:WU01:FS00:0xa7:       Bits: 64
17:48:49:WU01:FS00:0xa7:       Mode: Release
17:48:49:WU01:FS00:0xa7:************************************ Build *************************************
17:48:49:WU01:FS00:0xa7:       SIMD: avx_256
17:48:49:WU01:FS00:0xa7:********************************************************************************
17:48:49:WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 2096, Gen 48)
17:48:49:WU01:FS00:0xa7:Unit: 0x0000003e287234c95e7b86f1156830e7
17:48:49:WU01:FS00:0xa7:Reading tar file core.xml
17:48:49:WU01:FS00:0xa7:Reading tar file frame48.tpr
17:48:49:WU01:FS00:0xa7:Digital signatures verified
17:48:49:WU01:FS00:0xa7:Calling: mdrun -s frame48.tpr -o frame48.trr -x frame48.xtc -cpt 15 -nt 48
17:48:49:WU01:FS00:0xa7:Steps: first=24000000 total=500000
17:48:49:WU01:FS00:0xa7:ERROR:
17:48:49:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
17:48:49:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
17:48:49:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
17:48:49:WU01:FS00:0xa7:ERROR:
17:48:49:WU01:FS00:0xa7:ERROR:Fatal error:
17:48:49:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 40 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
17:48:49:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
17:48:49:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
17:48:49:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
17:48:49:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
17:48:49:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
17:48:53:WU01:FS00:0xa7:WARNING:Unexpected exit() call
17:48:53:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
17:48:53:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
17:48:53:WU01:FS00:0xa7:Saving result file md.log
17:48:53:WU01:FS00:0xa7:Saving result file science.log
17:48:54:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
17:49:48:WU01:FS00:Starting

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Fri Apr 17, 2020 6:27 pm
by foldy
I know that error. The work unit has issues with your CPU slot using 48 threads and fails with "ERROR: There is no domain decomposition for 40 ranks".

The workaround is to limit your CPU slot to 32 threads and create a second CPU slot with 16 threads, or run two CPU slots with 24 threads each.

Then this work unit will run, and so will the ones that come later.
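
For reference, here is a minimal sketch of that two-slot setup in FAHClient's config.xml (assuming the standard v7 slot syntax; the slot ids and thread counts are examples to adjust for your install):

Code:

<config>
  <!-- two CPU slots of 24 threads each, instead of one 48-thread slot -->
  <slot id='0' type='CPU'>
    <cpus v='24'/>
  </slot>
  <slot id='1' type='CPU'>
    <cpus v='24'/>
  </slot>
</config>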

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Fri Apr 17, 2020 6:37 pm
by HendricksSA
foldy, all true. The last time this happened to me was Project: 16417 (Run 1642, Clone 2, Gen 2) a few days ago. My CPU machine is generally unattended, and I lost 4 days of folding while the client hung, endlessly repeating the same thing. There should be a smarter solution than dropping thread counts.

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Fri Apr 17, 2020 6:51 pm
by bruce
I agree, there should be, but that issue belongs to gromacs.org. It's open-source software, and FAH uses it with very minor alterations.

The part of the issue that I'm pursuing right now is why 48 was reduced to 40 before it failed. The official word from gromacs is that the number of threads must have factors of 2 or 3 only. The value of 40 contains a factor of 5, which apparently is the problem, but 48 would not have it. I'd like an explanation. It's somehow related to PME.
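
A quick illustrative check of that factor claim (plain Python, nothing FAH-specific):

Code:

def prime_factors(n):
    """Return the prime factorization of n as a list, e.g. 40 -> [2, 2, 2, 5]."""
    out, p = [], 2
    while p * p <= n:
        while n % p == 0:
            out.append(p)
            n //= p
        p += 1
    if n > 1:
        out.append(n)
    return out

print(prime_factors(40))  # [2, 2, 2, 5]    <- contains a 5
print(prime_factors(48))  # [2, 2, 2, 2, 3] <- only 2s and 3s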

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Fri Apr 17, 2020 10:10 pm
by _r2w_ben
bruce wrote:I agree, there should be, but that issue belongs to gromacs.org. It's open-source software, and FAH uses it with very minor alterations.

The part of the issue that I'm pursuing right now is why 48 was reduced to 40 before it failed. The official word from gromacs is that the number of threads must have factors of 2 or 3 only. The value of 40 contains a factor of 5, which apparently is the problem, but 48 would not have it. I'd like an explanation. It's somehow related to PME.
I don't believe this is a GROMACS issue. The parameters FAHClient passes to mdrun result in PME being used, which allows for better utilization of high thread counts. PME could be disabled by passing -npme 0, but that would cause this problem to occur more often.

For this work unit, 48 was split into 40 PP ranks and 8 PME. Since 40 = 2 x 2 x 2 x 5, every way of factoring it into a 3D grid (e.g. 8x5x1 or 4x5x2) includes a factor of 5 or more. If the work unit is small enough that 4 is the largest cell count allowed along any axis, it will fail. The limits on the grid size are recorded in md.log:

Code:

The maximum allowed number of cells is: X 4 Y 4 Z 3
FAHClient needs to be smarter: detect the domain decomposition error and retry with n - 1 threads until the work unit runs to completion. The next work unit can then be assigned the full number of threads set on the slot; it might succeed outright, or it might again step the thread count down until it runs.
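
As a minimal sketch of that feasibility check (illustrative Python, not FAHClient code), one could test whether a PP rank count can be laid out within the per-axis cell limits that md.log reports:

Code:

from itertools import product

def grid_fits(pp_ranks, max_cells):
    """True if pp_ranks factors as nx * ny * nz within the per-axis cell limits."""
    mx, my, mz = max_cells
    for nx, ny in product(range(1, mx + 1), range(1, my + 1)):
        if pp_ranks % (nx * ny) == 0 and pp_ranks // (nx * ny) <= mz:
            return True
    return False

print(grid_fits(40, (4, 4, 3)))  # False: every 3D grid for 40 has a cell count of 5 or more
print(grid_fits(36, (4, 4, 3)))  # True: 4 x 3 x 3 fits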

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Sat Apr 18, 2020 5:52 pm
by Joe_H
The code to do that would probably need to be in FAHCoreWrapper, as that handles running the FAHCore_nn process.

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Sat Apr 18, 2020 6:28 pm
by _r2w_ben
Joe_H wrote:The code to do that would probably need to be in FAHCoreWrapper, as that handles running the FAHCore_nn process.
It could potentially go there, or even within FahCore_a7.exe. Wherever the current logic that outputs the following lines lives would be ideal:

Code:

12:09:23:WU00:FS00:0xa7:Reducing thread count from 23 to 22 to avoid domain decomposition by a prime number > 3
12:09:23:WU00:FS00:0xa7:Reducing thread count from 22 to 21 to avoid domain decomposition with large prime factor 11
12:09:23:WU00:FS00:0xa7:Calling: mdrun -s frame27.tpr -o frame27.trr -x frame27.xtc -cpt 15 -nt 21
This happens prior to calling GROMACS (mdrun) and is not documented in FAH's fork. On failure, the core could retry with nt - 1, reducing further if necessary due to primes, and then call mdrun again.
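
Something like this hypothetical sketch (not actual FAHCoreWrapper or FahCore_a7 code; the "large prime" cutoff of 11 is inferred from the log messages above and is an assumption):

Code:

def largest_prime_factor(n):
    f, p = 1, 2
    while p * p <= n:
        while n % p == 0:
            f, n = p, n // p
        p += 1
    return n if n > 1 else f

def acceptable(nt):
    """Mirror the two reductions above: skip a count that is itself a prime > 3 or has a large prime factor."""
    lpf = largest_prime_factor(nt)
    if nt > 3 and lpf == nt:  # nt itself is a prime > 3, e.g. 23
        return False
    return lpf < 11           # assumed cutoff: 22 = 2 x 11 was rejected, 21 = 3 x 7 ran

def reduced_thread_counts(nt):
    """Candidate thread counts to retry with after a decomposition failure."""
    while nt > 1:
        nt -= 1
        if acceptable(nt):
            yield nt

print(next(reduced_thread_counts(23)))  # 21, matching the 23 -> 22 -> 21 reduction above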

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Sat Apr 18, 2020 6:49 pm
by HendricksSA
md.log contents follow. They show what you expected to see.

Code:

Initializing Domain Decomposition on 48 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.406 nm, LJ-14, atoms 876 884
  multi-body bonded interactions: 0.406 nm, Proper Dih., atoms 876 884
Minimum cell size due to bonded interactions: 0.447 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.098 nm
Estimated maximum distance required for P-LINCS: 1.098 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.17
Will use 40 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 8 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 40 cells with a minimum initial size of 1.372 nm
The maximum allowed number of cells is: X 4 Y 4 Z 3

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Sat Apr 18, 2020 9:18 pm
by _r2w_ben
Thank you for the log. I have a 4x4x3 case with a PME load of 0.18 in my data set.
At that load, 48 threads were split as 36 + 12 PME. 36 factors as 4x3x3, so it stayed within the limits.

If you still have the work unit, or you come across this scenario again, change the number of CPUs assigned to the slot.
Try each of these until it starts folding: 45, 44, 42, 40, 35, 32.
After the work unit completes, you can set it back to 48 CPUs.
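
To make that concrete with the grid_fits() sketch from earlier in the thread, scanning PP rank counts against the 4x4x3 limit shows why a 36 + 12 split works while 40 + 8 does not:

Code:

print([pp for pp in range(1, 49) if grid_fits(pp, (4, 4, 3))])
# [1, 2, 3, 4, 6, 8, 9, 12, 16, 18, 24, 27, 32, 36, 48]
# 36 (4 x 3 x 3) is in the list; 40 is not.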

Edit: I now have data for p14576.

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Sun Apr 19, 2020 1:32 am
by HendricksSA
_r2w_ben, I will give that a try next time I get stuck on a 14576. In case it is useful to you for your research, here is the relevant md.log portion of the 16417 (Run 1642, Clone 2, Gen 2) WU that failed on April 9th.

Code:

Initializing Domain Decomposition on 48 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.426 nm, LJ-14, atoms 4153 4162
  multi-body bonded interactions: 0.426 nm, Proper Dih., atoms 4153 4162
Minimum cell size due to bonded interactions: 0.469 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.138 nm
Estimated maximum distance required for P-LINCS: 1.138 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.17
Will use 40 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 8 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 40 cells with a minimum initial size of 1.423 nm
The maximum allowed number of cells is: X 4 Y 4 Z 4

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Mon Apr 20, 2020 2:16 am
by Frogging101
My md.log for 24 cores on Project: 14576 (Run 0, Clone 4596, Gen 47):

Code:

Initializing Domain Decomposition on 24 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.419 nm, LJ-14, atoms 876 884
  multi-body bonded interactions: 0.419 nm, Proper Dih., atoms 876 884
Minimum cell size due to bonded interactions: 0.460 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.098 nm
Estimated maximum distance required for P-LINCS: 1.098 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.17
Will use 20 particle-particle and 4 PME only ranks
This is a guess, check the performance at the end of the log file
Using 4 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 20 cells with a minimum initial size of 1.372 nm
The maximum allowed number of cells is: X 4 Y 4 Z 3
Sometimes it would segfault too. My logs show that it caught signal SIGSEGV(11) 9 times out of 306 attempted runs.

I then changed my slot to 23 CPU threads, and it began folding (after FAHClient automatically reduced the threads to 21):

Code:

02:14:53:WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 4596, Gen 47)
02:14:53:WU01:FS00:0xa7:Unit: 0x0000003d287234c95e7b867920c4ee21
02:14:53:WU01:FS00:0xa7:Reading tar file core.xml
02:14:53:WU01:FS00:0xa7:Reading tar file frame47.tpr
02:14:53:WU01:FS00:0xa7:Digital signatures verified
02:14:53:WU01:FS00:0xa7:Reducing thread count from 23 to 22 to avoid domain decomposition by a prime number > 3
02:14:53:WU01:FS00:0xa7:Reducing thread count from 22 to 21 to avoid domain decomposition with large prime factor 11
02:14:53:WU01:FS00:0xa7:Calling: mdrun -s frame47.tpr -o frame47.trr -x frame47.xtc -cpt 15 -nt 21
02:14:53:WU01:FS00:0xa7:Steps: first=23500000 total=500000
02:14:54:WU01:FS00:0xa7:Completed 1 out of 500000 steps (0%)
Edit: Completed successfully https://apps.foldingathome.org/wu#proje ... 596&gen=47

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Mon Apr 20, 2020 5:23 am
by HendricksSA
Picked up another 14576 and it failed with 48 threads. I tried 45 threads as suggested, and Project: 14576 (Run 0, Clone 2358, Gen 53) then ran perfectly. md.log has more information at the end of the file than it did during the previous failure. Also of note, the WU seems to be processing slightly faster: I seem to remember 18-second frames, and now they are down to 15 seconds. md.log:

Code:

Initializing Domain Decomposition on 45 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.415 nm, LJ-14, atoms 876 884
  multi-body bonded interactions: 0.415 nm, Proper Dih., atoms 876 884
Minimum cell size due to bonded interactions: 0.456 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.098 nm
Estimated maximum distance required for P-LINCS: 1.098 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.17
Will use 36 particle-particle and 9 PME only ranks
This is a guess, check the performance at the end of the log file
Using 9 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 36 cells with a minimum initial size of 1.372 nm
The maximum allowed number of cells is: X 4 Y 4 Z 3
Domain decomposition grid 3 x 4 x 3, separate PME ranks 9
PME domain decomposition: 3 x 3 x 1
Interleaving PP and PME ranks
This rank does only particle-particle work.

Domain decomposition rank 0, coordinates 0 0 0

Using 45 MPI threads

Detecting CPU SIMD instructions.
Present hardware specification:
Vendor: GenuineIntel
Brand:  Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Family:  6  Model: 63  Stepping:  2
Features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX_256

Binary not matching hardware - you might be losing performance.
SIMD instructions most likely to fit this hardware: AVX2_256
SIMD instructions selected at GROMACS compile time: AVX_256

The current CPU can measure timings more accurately than the code in
GROMACS was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding GROMACS with the GMX_USE_RDTSCP=OFF CMake option.

Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen 
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.288146 nm for Ewald
Cut-off's:   NS: 1.1   Coulomb: 0.9   LJ: 0.9
Long Range LJ corr.: <C6> 3.2708e-04
System total charge: 0.000
Generated table with 1050 data points for Ewald.
Tabscale = 500 points/nm
Generated table with 1050 data points for LJ6.
Tabscale = 500 points/nm
Generated table with 1050 data points for LJ12.
Tabscale = 500 points/nm
Generated table with 1050 data points for 1-4 COUL.
Tabscale = 500 points/nm
Generated table with 1050 data points for 1-4 LJ6.
Tabscale = 500 points/nm
Generated table with 1050 data points for 1-4 LJ12.
Tabscale = 500 points/nm

Using AVX_256 4x4 non-bonded kernels

Using Lorentz-Berthelot Lennard-Jones combination rule

Potential shift: LJ r^-12: -3.541e+00 r^-6: -1.882e+00, Ewald -1.000e-05
Initialized non-bonded Ewald correction tables, spacing: 8.85e-04 size: 2374

NOTE: The number of threads is not equal to the number of (logical) cores
      and the -pin option is set to auto: will not pin thread to cores.
      This can lead to significant performance degradation.
      Consider using -pin on (and -pinoffset in case you run multiple jobs).

Initializing Parallel LINear Constraint Solver

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
B. Hess
P-LINCS: A Parallel Linear Constraint Solver for molecular simulation
J. Chem. Theory Comput. 4 (2008) pp. 116-122
-------- -------- --- Thank You --- -------- --------

The number of constraints is 2563
There are inter charge-group constraints,
will communicate selected coordinates each lincs iteration

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
S. Miyamoto and P. A. Kollman
SETTLE: An Analytical Version of the SHAKE and RATTLE Algorithms for Rigid
Water Models
J. Comp. Chem. 13 (1992) pp. 952-962
-------- -------- --- Thank You --- -------- --------

Setting the maximum number of constraint warnings to -1
maxwarn < 0, will not stop on constraint errors

Linking all bonded interactions to atoms
There are 35002 inter charge-group exclusions,
will use an extra communication step for exclusion forces for PME

The initial number of communication pulses is: X 1 Y 1 Z 1
The initial domain decomposition cell size is: X 1.89 nm Y 1.41 nm Z 1.63 nm

The maximum allowed distance for charge groups involved in interactions is:
                 non-bonded interactions           1.100 nm
(the following are initial values, they could change due to box deformation)
            two-body bonded interactions  (-rdd)   1.100 nm
          multi-body bonded interactions  (-rdd)   1.100 nm
  atoms separated by up to 7 constraints  (-rcon)  1.415 nm

When dynamic load balancing gets turned on, these settings will change to:
The maximum number of communication pulses is: X 1 Y 1 Z 1
The minimum size for domain decomposition cells is 1.100 nm
The requested allowed shrink of DD cells (option -dds) is: 0.80
The allowed shrink of domain decomposition cells is: X 0.58 Y 0.78 Z 0.67
The maximum allowed distance for charge groups involved in interactions is:
                 non-bonded interactions           1.100 nm
            two-body bonded interactions  (-rdd)   1.100 nm
          multi-body bonded interactions  (-rdd)   1.100 nm
  atoms separated by up to 7 constraints  (-rcon)  1.100 nm

Making 3D domain decomposition grid 3 x 4 x 3, home cell index 0 0 0

WARNING: Changing nstcomm from 5 to 10

Center of mass motion removal mode is Linear
We have the following groups for center of mass motion removal:
  0:  rest

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
G. Bussi, D. Donadio and M. Parrinello
Canonical sampling through velocity rescaling
J. Chem. Phys. 126 (2007) pp. 014101
-------- -------- --- Thank You --- -------- --------

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Tue Apr 21, 2020 12:19 pm
by Ahnilated
Picked up PRCG 14576 (0,4259,74). This won't run correctly on an AMD 2950X with the CPUs set to -1.

12:18:48:WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 4259, Gen 74)
12:18:48:WU01:FS00:0xa7:Unit: 0x0000005c287234c95e7b867f290d69d3
12:18:48:WU01:FS00:0xa7:Reading tar file core.xml
12:18:48:WU01:FS00:0xa7:Reading tar file frame74.tpr
12:18:48:WU01:FS00:0xa7:Digital signatures verified
12:18:48:WU01:FS00:0xa7:Reducing thread count from 29 to 28 to avoid domain decomposition by a prime number > 3
12:18:48:WU01:FS00:0xa7:Calling: mdrun -s frame74.tpr -o frame74.trr -x frame74.xtc -cpt 5 -nt 28
12:18:48:WU01:FS00:0xa7:Steps: first=37000000 total=500000
12:18:48:WU01:FS00:0xa7:ERROR:
12:18:48:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
12:18:48:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
12:18:48:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
12:18:48:WU01:FS00:0xa7:ERROR:
12:18:48:WU01:FS00:0xa7:ERROR:Fatal error:
12:18:48:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
12:18:48:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
12:18:48:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
12:18:48:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
12:18:48:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
12:18:48:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
12:18:52:WU01:FS00:0xa7:WARNING:Unexpected exit() call
12:18:52:WU01:FS00:0xa7:WARNING:Unexpected exit from science code
12:18:52:WU01:FS00:0xa7:Saving result file ../logfile_01.txt
12:18:52:WU01:FS00:0xa7:Saving result file md.log
12:18:52:WU01:FS00:0xa7:Saving result file science.log

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Wed Apr 22, 2020 12:10 am
by _r2w_ben
Ahnilated wrote:Picked up PRCG 14576 (0,4259,74). This won't run correctly on an AMD 2950X with the CPUs set to -1.
This particular project will run on the following core counts: 1,2,3,4,6,8,9,12,16,18,20,21,27,32,40,42,44,45,64

If you still have the work unit, please change the slot configuration from -1 CPUs to 27; it should then finish the work unit. You can change it back to -1 when it's complete to use all cores again.
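
A tiny illustrative helper (not an FAH feature; the list comes straight from above) shows how 27 falls out: the log above has the client starting from 29 threads, and 27 is the largest known-good count that does not exceed that.

Code:

GOOD_COUNTS = [1, 2, 3, 4, 6, 8, 9, 12, 16, 18, 20, 21, 27, 32, 40, 42, 44, 45, 64]

def best_count(available_threads):
    """Largest known-good core count not exceeding what the slot can use."""
    return max(c for c in GOOD_COUNTS if c <= available_threads)

print(best_count(29))  # 27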

Re: Project: 14576 (Run 0, Clone 2096, Gen 48)

Posted: Wed May 13, 2020 10:00 pm
by Zzyzx
Ran into one of these today. Folded on 45 until it passed: Project: 14576 (Run 0, Clone 3581, Gen 171). I do wonder if PantherX could work with the project owner to get this excluded from problematic core counts, like with Project 16417.