
Project: 17201 Domain Decomposition Errors

Posted: Wed Jul 08, 2020 1:06 am
by HendricksSA
Just letting you know that Project: 17201 (Run 0, Clone 345, Gen 0) is throwing domain decomposition errors with 48 threads. Lowered it to 45 and it is running fine. MD.LOG contents follow:

Code: Select all

Initializing Domain Decomposition on 48 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.433 nm, LJ-14, atoms 307 316
  multi-body bonded interactions: 0.433 nm, Proper Dih., atoms 307 316
Minimum cell size due to bonded interactions: 0.477 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.134 nm
Estimated maximum distance required for P-LINCS: 1.134 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.18
Will use 40 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 8 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 40 cells with a minimum initial size of 1.417 nm
The maximum allowed number of cells is: X 4 Y 4 Z 4

Re: Project: 17201 (Run 0, Clone 345, Gen 0) Domain Decomp Errors

Posted: Wed Jul 08, 2020 8:36 am
by Neil-B
All Project 17201 WUs throw decomp errors on 24-thread slots, so this is possibly not RCG specific - you may find decomp errors on all Project 17201 WUs :(

Re: Project: 17201 Domain Decomposition Errors

Posted: Wed Jul 08, 2020 8:46 pm
by HendricksSA
Project: 17201 (Run 0, Clone 492, Gen 4) is also throwing domain decomp errors with 48 threads. It runs fine with 45 threads. MD.LOG extract follows:

Code: Select all

Initializing Domain Decomposition on 48 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.428 nm, LJ-14, atoms 2164 2172
  multi-body bonded interactions: 0.428 nm, Proper Dih., atoms 2164 2172
Minimum cell size due to bonded interactions: 0.470 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.134 nm
Estimated maximum distance required for P-LINCS: 1.134 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.17
Will use 40 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 8 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 40 cells with a minimum initial size of 1.417 nm
The maximum allowed number of cells is: X 4 Y 4 Z 4

Re: Project: 17201 Domain Decomposition Errors

Posted: Wed Jul 08, 2020 8:58 pm
by HendricksSA
Project: 17201 (Run 0, Clone 15, Gen 83) also fails with 48 threads. It runs fine with 45 threads. This project was identified during beta testing by Neil-B as having problems with domain decomposition. I have not figured out the math to determine which thread counts should be avoided. Perhaps _r2w_ben could identify them so that the information could be passed to the folks who run the work servers. Hopefully this can be done soon, as I have lost about 16 hours of folding time in two days from hanging on these work units. In case it is useful here, the MD.LOG contents follow:

Code: Select all

Initializing Domain Decomposition on 48 ranks
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.426 nm, LJ-14, atoms 687 696
  multi-body bonded interactions: 0.426 nm, Proper Dih., atoms 687 696
Minimum cell size due to bonded interactions: 0.468 nm
Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.134 nm
Estimated maximum distance required for P-LINCS: 1.134 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.17
Will use 40 particle-particle and 8 PME only ranks
This is a guess, check the performance at the end of the log file
Using 8 separate PME ranks, as guessed by mdrun
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 40 cells with a minimum initial size of 1.417 nm
The maximum allowed number of cells is: X 4 Y 4 Z 4

Re: Project: 17201 Domain Decomposition Errors

Posted: Wed Jul 08, 2020 9:04 pm
by bruce
Domain decomposition errors are a pernicious problem for GROMACS (used in FAHCore_a7), particularly for large numbers of CPU threads. Thanks for the report. We do keep fixing these problems when they're reported, but the only real fix is to prevent the individual project from being assigned to a client that chooses to invoke a specific number of threads.

I understand there's a newer version of GROMACS in the works which reduces the scope of the problem, but I have no prediction for when we will see that new FAHCore.

Re: Project: 17201 Domain Decomposition Errors

Posted: Wed Jul 08, 2020 9:05 pm
by HendricksSA
I'm assuming Project 17202 is similar to Project 17201. Project: 17202 (Run 0, Clone 105, Gen 23) ran perfectly with 48 threads, and I did not notice any domain decomp errors. Thanks for the update, Bruce. Any improvement will be appreciated ... I hate doing nothing when I could be folding.

Re: Project: 17201 Domain Decomposition Errors

Posted: Wed Jul 08, 2020 9:19 pm
by bruce
Notice that your clip from the science log says: Guess for relative PME load: 0.17

We're pretty good at excluding thread counts for small numbers, but above about 16 or 18 threads GROMACS starts making guesses about PME loads, which changes the entire spectrum of bad thread counts.
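
As a rough illustration of why 48 threads trips on this project (a sketch of the arithmetic only - mdrun's real PME-rank guess is more involved):

Code: Select all

# Sketch only; mdrun's actual PME-rank guess is more elaborate than a plain round().
nranks, pme_load = 48, 0.17
npme = round(nranks * pme_load)   # 8 PME-only ranks, matching the md.log lines above
npp  = nranks - npme              # 40 particle-particle ranks = 2*2*2*5
# A factor of 5 cannot fit when each dimension is capped at 4 cells
# ("The maximum allowed number of cells is: X 4 Y 4 Z 4"), so 48 threads
# has no valid PP grid. For 45 threads mdrun settles on 36 PP + 9 PME
# (see the tables below), and 36 = 4*3*3 does fit.
print(npme, npp)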

Re: Project: 17201 Domain Decomposition Errors

Posted: Wed Jul 08, 2020 9:42 pm
by _r2w_ben
p17201 - max 4x4x4 - PME load 0.17

Code: Select all

  2 = 2x1x1
  3 = 3x1x1
  4 = 4x1x1
  6 = 3x2x1
  8 = 4x2x1
  9 = 3x3x1
 12 = 4x3x1
 16 = 4x4x1
 18 = 3x3x2
 20 = 4x4x1  16 +  4 PME
 21 = 4x4x1  16 +  5 PME
 27 = 2x3x3  18 +  9 PME
 32 = 3x4x2  24 +  8 PME
 40 = 4x4x2  32 +  8 PME
 42 = 4x4x2  32 + 10 PME
 44 = 3x4x3  36 +  8 PME
 45 = 4x3x3  36 +  9 PME
 64 = 4x4x3  48 + 16 PME
 78 = 4x4x4  64 + 14 PME
 80 = 4x4x4  64 + 16 PME
p17202 - max 4x4x3 - PME load 0.18

Code: Select all

  2 = 2x1x1
  3 = 3x1x1
  4 = 4x1x1
  6 = 3x2x1
  8 = 4x2x1
  9 = 3x3x1
 12 = 4x3x1
 16 = 4x4x1
 18 = 3x3x2
 20 = 4x4x1  16 +  4 PME
 21 = 4x4x1  16 +  5 PME
 24 = 3x2x3  18 +  6 PME
 27 = 2x3x3  18 +  9 PME
 30 = 4x3x2  24 +  6 PME
 32 = 3x4x2  24 +  8 PME
 36 = 3x3x3  27 +  9 PME
 40 = 4x4x2  32 +  8 PME
 42 = 4x4x2  32 + 10 PME
 44 = 3x4x3  36 +  8 PME
 45 = 4x3x3  36 +  9 PME
 48 = 4x3x3  36 + 12 PME
 54 = 4x3x3  36 + 18 PME
 60 = 4x4x3  48 + 12 PME
 64 = 4x4x3  48 + 16 PME
17202 is a bit smaller physically and can't use 80 threads, but it is more flexible in the 24-60 range.
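
If anyone wants to check other thread counts, here is a quick sketch of the constraint behind these tables (a simplification, not the actual mdrun code - the real heuristic for choosing the PME split is fancier, so treat it as illustrative): once the PME-only ranks are split off, the remaining particle-particle ranks must factor into an x*y*z grid no larger than the maximum cell counts reported in md.log.

Code: Select all

from itertools import product

def pp_grids(npp, max_cells=(4, 4, 4)):
    """All x*y*z factorizations of npp that fit within max_cells."""
    mx, my, mz = max_cells
    return [(x, y, z)
            for x, y, z in product(range(1, mx + 1),
                                   range(1, my + 1),
                                   range(1, mz + 1))
            if x * y * z == npp]

# p17201, 48 threads: mdrun splits off 8 PME ranks, leaving 40 PP ranks.
print(pp_grids(40))              # [] -> 40 = 2*2*2*5, and the 5 cannot fit
# p17201, 45 threads: 9 PME ranks leave 36 PP ranks.
print(pp_grids(36))              # includes (4, 3, 3) -> matches the 45 = 4x3x3 row
# p17202, 24 threads: 6 PME ranks leave 18 PP ranks (max grid is 4x4x3).
print(pp_grids(18, (4, 4, 3)))   # includes (3, 2, 3) -> matches the 24 = 3x2x3 row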

Re: Project: 17201 Domain Decomposition Errors

Posted: Thu Jul 09, 2020 3:43 am
by as666
Seeing issues here as well. The client actually seems to be segfaulting.

From /var/lib/log.txt:

Code: Select all

03:34:59:WU01:FS00:0xa7:Project: 17201 (Run 0, Clone 1436, Gen 1)
03:34:59:WU01:FS00:0xa7:Unit: 0x00000002031532b95efd3db64fceb316
03:34:59:WU01:FS00:0xa7:Reading tar file core.xml
03:34:59:WU01:FS00:0xa7:Reading tar file frame1.tpr
03:34:59:WU01:FS00:0xa7:Digital signatures verified
03:34:59:WU01:FS00:0xa7:Calling: mdrun -s frame1.tpr -o frame1.trr -x frame1.xtc -cpt 15 -nt 24
03:34:59:WU01:FS00:0xa7:Steps: first=250000 total=250000
03:34:59:WU01:FS00:0xa7:ERROR:
03:34:59:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
03:34:59:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
03:34:59:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
03:34:59:WU01:FS00:0xa7:ERROR:
03:34:59:WU01:FS00:0xa7:ERROR:Fatal error:
03:34:59:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.41717 nm
From dmesg:

Code: Select all

[  621.137379] FahCore_a7[116995]: segfault at 50 ip 000000000120aa3d sp 00007ffceaca99c0 error 4 in FahCore_a7[406000+10cc000]
[  621.137384] Code: 73 08 0f 84 83 00 00 00 48 c7 44 24 30 00 00 00 00 4c 8d 74 24 20 4c 89 74 24 08 f3 0f 7e 64 24 08 66 0f 6c e4 0f 29 64 24 20 <48> 8b 16 4c 39 f2 0f 84 07 01 00 00 4c 39 f6 0f 84 fe 00 00 00 4c
Running on AMD Ryzen 9 3900X 12-Core Processor, Ubuntu 20.04

Re: Project: 17201 Domain Decomposition Errors

Posted: Thu Jul 09, 2020 4:09 pm
by HendricksSA
as666, welcome to the fold. Wow, you just joined and already ran into the domain decomposition problem. I usually follow the advice of the experts here to get past the offending work unit. Following _r2w_ben's advice, you should be able to lower your CPU count temporarily to 21, process the stalled work unit, and then raise the CPU count back to your normal setting (or the default -1). Then you can process on until you encounter this again. I usually see a cluster of problem work units, then staffers adjust the work servers and the situation clears up. I can process for weeks or months without running into this problem, so don't give up. We appreciate your contribution!

Re: Project: 17201 Domain Decomposition Errors

Posted: Thu Jul 09, 2020 7:13 pm
by the901
The issue has been seen weekly for the past couple of months. I could only dream of going multiple weeks without seeing the problem. The issue needs to be prioritized.

Re: Project: 17201 Domain Decomposition Errors

Posted: Thu Jul 09, 2020 7:23 pm
by Neil-B
Part of the decomposition issue is a simple process issue ... with new-style projects and attempts to maximise the science, the potential for these errors exists ... and with the rush to release new projects, the chance to spot issues either in internal testing or beta/advanced testing is reduced - when they are reported, the researchers have a chance to block the affected thread counts ... 24-thread slots used to be fine and rarely had issues, but recently a number of projects using PME slightly differently have not liked 24, or as with this project 48 ... I was only able to report the 24-thread issue after the project had been released to advanced, so a bit late for the researcher to block it before it was issued to lots of folders, and I am not sure how easy it is to block thread counts once released, but hopefully efforts are being made to block this project from 24 and 48 thread slots asap

Re: Project: 17201 Domain Decomposition Errors

Posted: Thu Jul 09, 2020 9:07 pm
by as666
Thanks Hendricks, I joined the forums yesterday, but I've been folding for a couple of days (CPU-wise) now and this is the first issue I've run into. I followed some advice that I read in another post and deleted the work unit from /var/lib/fahclient/work. That unblocked me; hopefully it's not too detrimental to the work. I did find the FAH service to be particularly finicky to start/restart on Ubuntu.

Re: Project: 17201 Domain Decomposition Errors

Posted: Sat Jul 11, 2020 2:41 am
by HendricksSA
as666, that method of solving the domain decomposition problem will work ... but it is not so great for a couple of reasons. One, it slows down the science, because the work unit will wait until the timeout is reached before being reassigned. Depending on the work unit, that could be from one day to perhaps as long as a month. The second reason to avoid that solution when you are starting out is that the number of successful returns affects when you start receiving the quick return bonus (QRB). The method for changing your CPU count is pretty easy if your Linux install has the advanced control. I am not sure if you can do it using the web-based control. If you are using the command-line interface, it is a bit more complicated. You will find all those solutions here, or you can just ask. We will be happy to assist.
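
For reference, if your install uses the standard config.xml (typically /etc/fahclient/config.xml on a Linux package install), the CPU count lives in the slot definition. The snippet below is only a sketch of the usual layout - your slot id and count will differ, and you should stop the client before editing and restart it afterwards; FAHControl or the advanced control does the same thing through the GUI.

Code: Select all

<!-- Sketch of a CPU slot entry in config.xml; the id and cpus value are examples. -->
<config>
  <slot id='0' type='CPU'>
    <cpus v='21'/>
  </slot>
</config>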

Re: Project: 17201 Domain Decomposition Errors

Posted: Sat Jul 11, 2020 6:29 am
by bruce
There's a new version of FAHCore_a* being beta tested which may be released soon. It won't eliminate the DD errors but it's supposed to reduce their frequency. In fact, some core counts that were previously excluded will now work. I expect that'll take some settling time before it's a smooth process, though.