Project: 14576 (0,3055,4) domain decomposition

Posted: Tue Mar 31, 2020 11:28 pm
by tedder
This is a CPU WU; the project and unit info is at the top of the log.

Code:

WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:0xa7:Unit: 0x00000009287234c95e7b86cfe9c5549b
WU01:FS00:0xa7:Reading tar file core.xml
WU01:FS00:0xa7:Reading tar file frame4.tpr 
WU01:FS00:0xa7:Digital signatures verified 
WU01:FS00:0xa7:Calling: mdrun -s frame4.tpr -o frame4.trr -x frame4.xtc -cpt 15 -nt 24
WU01:FS00:0xa7:Steps: first=2000000 total=500000
WU01:FS00:0xa7:ERROR:
WU01:FS00:0xa7:ERROR:-------------------------------------------------------
WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
WU01:FS00:0xa7:ERROR:
WU01:FS00:0xa7:ERROR:Fatal error:
WU01:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
It's likely a high-core-count issue. I'm curious whether there's something I should do (I see references to 'excluding the unit'), whether something needs to be done on FAH's side, or whether it's a no-op. I can also post the full log.

Re: Project: 14576 (0,3055,4) domain decomposition

Posted: Tue Mar 31, 2020 11:34 pm
by tedder
Hmm, I must need to do something to work past it.

Code:

$ cat log | egrep "Project.*Run.*Clone|INTERRUPTED" | cut -c 21-
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)
WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)

Re: Project: 14576 (0,3055,4) domain decomposition

Posted: Wed Apr 01, 2020 3:30 am
by Joe_H
Try pausing the folding process and setting the CPU thread count to 18 or 16, then see if the WU goes ahead. Some projects have problems with decompositions into multiples of 5, and the log shows it was trying 20. Sometimes WUs in prerelease testing do work at that setting, and this problem only pops up later.

I will report this and the assignment to 20 threads can be restricted.
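
To illustrate why a rank count of 20 can fail where 16 or 18 succeed, here is a minimal sketch. It is not GROMACS's actual algorithm, which considers more than this simple check. The function name `feasible_grids` and the 6 nm cubic box are assumptions for illustration only (the real box size for this project is not in the log); the minimum cell size is taken from the error message.

```python
from itertools import product

def feasible_grids(ranks, box_nm, min_cell_nm):
    """Return every (nx, ny, nz) grid with nx*ny*nz == ranks whose cells
    are at least min_cell_nm wide along each box edge."""
    # Largest number of cells that fits along each edge (at least 1).
    max_cells = [max(1, int(edge // min_cell_nm)) for edge in box_nm]
    grids = []
    for nx, ny in product(range(1, max_cells[0] + 1),
                          range(1, max_cells[1] + 1)):
        if ranks % (nx * ny):
            continue  # nx*ny does not divide the rank count
        nz = ranks // (nx * ny)
        if nz <= max_cells[2]:
            grids.append((nx, ny, nz))
    return grids

# ASSUMPTION: hypothetical cubic box, not the real box of project 14576.
BOX_NM = (6.0, 6.0, 6.0)
MIN_CELL_NM = 1.37225  # from the error message in the log

for ranks in (16, 18, 20):
    print(ranks, feasible_grids(ranks, BOX_NM, MIN_CELL_NM))
```

With this assumed box, at most 4 cells fit along each edge, so any grid for 20 ranks would need 5 or more cells in some direction and none is feasible, while 16 and 18 both factor into grids that fit.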

Re: Project: 14576 (0,3055,4) domain decomposition

Posted: Wed Apr 01, 2020 3:36 am
by tedder
Joe_H wrote: Try pausing the folding process and setting the CPU thread count to 18 or 16, then see if the WU goes ahead. Some projects have problems with decompositions into multiples of 5, and the log shows it was trying 20. Sometimes WUs in prerelease testing do work at that setting, and this problem only pops up later.

I will report this and the assignment to 20 threads can be restricted.
Thanks much! After realizing it had been failing for days on end, I killed the process. I'll shuffle it around if it happens again; I didn't know that was an option.

Re: Project: 14576 (0,3055,4) domain decomposition

Posted: Wed Apr 01, 2020 4:41 am
by tedder
I went back and looked in my logs: I attempted that WU 2700 times over the past three days. Doh!
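
For anyone who wants to spot a stuck WU sooner, a count like the one above can be scripted. This is a rough sketch that assumes the log lines look like the excerpts in this thread; the function name `count_attempts` is made up for illustration, and the sample lines are copied from the posts above.

```python
import re
from collections import Counter

# Matches the work-unit header as it appears in the log excerpts above.
WU_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")

def count_attempts(lines):
    """Count how often each (project, run, clone, gen) header appears."""
    counts = Counter()
    for line in lines:
        match = WU_RE.search(line)
        if match:
            counts[tuple(int(g) for g in match.groups())] += 1
    return counts

# Sample lines taken from the log excerpt earlier in the thread.
sample = [
    "WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)",
    "WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
    "WU01:FS00:0xa7:Project: 14576 (Run 0, Clone 3055, Gen 4)",
]

for wu, n in count_attempts(sample).most_common():
    print(wu, n)
```

To run it against a real log, pass the open file object instead of `sample`; a WU that shows up far more often than the others is probably being retried in a loop.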

Re: Project: 14576 (0,3055,4) domain decomposition

Posted: Wed Apr 01, 2020 9:17 am
by bruce
Doh.

FAH will exclude future assignments to configurations with thread-counts that are multiples of 5.