Domain Decomp - the new norm?

Posted: Mon Apr 20, 2020 4:45 pm
by HendricksSA
I have been folding with a 48 thread server for about 4 years. I do not remember seeing frequent domain decomposition problems until recently. Anyone know what has changed in the Folding World to cause this headache? I used to let my computer run for weeks to months without much intervention. Now, I have to check it constantly to make sure it is not stalled with a decomp problem.

Re: Domain Decomp - the new norm?

Posted: Mon Apr 20, 2020 7:25 pm
by Joe_H
I think it is a combination of factors. First, new projects are spending less time in internal and beta testing before being released to wider distribution. Included in that is limited capacity to test on core counts higher than 32.

Second, the researchers are working on different protein assemblages than they normally do. So they are also on a learning curve of what works and what causes problems.

There are probably other factors involved.

I do know that the researchers take notice of any reports we receive here and pass on to them, and that they use those reports to refine which CPU thread counts are assigned on the active projects.

Re: Domain Decomp - the new norm?

Posted: Thu Apr 23, 2020 6:32 am
by foldy
As a workaround, I would recommend splitting your 48 threads into 3 CPU slots with 16 threads each.

Re: Domain Decomp - the new norm?

Posted: Thu Apr 23, 2020 7:16 am
by Neil-B
32 and a 16 would be much better

Re: Domain Decomp - the new norm?

Posted: Wed Jul 08, 2020 1:47 pm
by the901
As a ~10M point a day folder, I'm about ready to jump ship, I'm so frustrated. I have 29 clients and a good % of those are running 54 cores. I've been seeing the issue with 58 cores as well. Today has been horrible for this issue. It's just been getting worse over time. I wish the client would adjust for the issue more than it does already. I don't have time to babysit it.

Re: Domain Decomp - the new norm?

Posted: Wed Jul 08, 2020 5:50 pm
by foldy
Yes, FAHClient should solve this issue itself. But as a workaround, just split your 54-core machines into 3 CPU slots with 16 threads each.

Re: Domain Decomp - the new norm?

Posted: Wed Jul 08, 2020 6:10 pm
by Neil-B
... and as previously mentioned a 32 and a 16 would be better :) ... but yes it would be nice if the cores could handle decomp issues a bit more elegantly :)
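
For anyone who wants to script the arithmetic behind these suggestions, here is a minimal Python sketch. It carves a machine's thread count into slots of preferred sizes (32 first, then 16 by default) and reports what is left over. The name `split_threads` and the default size list are made up for illustration. Nothing here talks to FAHClient, and the slots still have to be set up in the client itself.

```python
def split_threads(total, preferred=(32, 16)):
    """Greedily split `total` threads into slots of the preferred sizes.

    Hypothetical helper, not part of FAHClient.
    Returns (slots, leftover), where leftover is the number of threads
    that did not fit into any preferred slot size.
    """
    if total < 0 or any(size <= 0 for size in preferred):
        raise ValueError("thread counts must be positive")
    slots = []
    remaining = total
    # Try the largest slot size first, using it as often as it fits.
    for size in sorted(preferred, reverse=True):
        while remaining >= size:
            slots.append(size)
            remaining -= size
    return slots, remaining


print(split_threads(48))           # the 32 + 16 suggestion
print(split_threads(54))           # same slots, with threads left unused
print(split_threads(54, (16,)))    # the 3 x 16 suggestion
```

On a 54-thread machine both layouts leave 6 threads idle, which is the price of staying on slot sizes that are known to work.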

Re: Domain Decomp - the new norm?

Posted: Wed Jul 08, 2020 9:11 pm
by bruce
There's a new version of GROMACS in the works which reduces the scope of the Domain Decomposition errors, but it's not predicted to prevent them entirely. I can't predict when we might see that new version. For the time being, we can only fix one project at a time, for a specific number of threads at a time, as failures are reported.

As foldy and Neil-B have said, we've pretty much solved the problem for smaller numbers of threads.

Re: Domain Decomp - the new norm?

Posted: Wed Jul 08, 2020 9:30 pm
by _r2w_ben
the901 wrote: As a ~10M point a day folder, I'm about ready to jump ship, I'm so frustrated. I have 29 clients and a good % of those are running 54 cores. I've been seeing the issue with 58 cores as well. Today has been horrible for this issue. It's just been getting worse over time. I wish the client would adjust for the issue more than it does already. I don't have time to babysit it.
Please post the following log portion when one of these is assigned. Then the researchers can confirm that there are appropriate constraints in place for that project. There can be variations between runs within a project, so thread counts that were tested initially and worked may need to be adjusted.

09:38:18:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:4 from 128.252.203.1
09:38:18:WU01:FS00:Connecting to 128.252.203.1:8080
09:38:18:WU01:FS00:Downloading 5.77MiB
09:38:20:WU01:FS00:Download complete
09:38:20:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16446 run:189 clone:0 gen:33 core:0xa7 unit:0x0000002e80fccb015eb9f92b47f30ae5
09:38:20:WU01:FS00:Starting
09:38:20:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\[]\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 01 -suffix 01 -version 705 -lifeline 1168 -checkpoint 15 -np 4
09:38:20:WU01:FS00:Started FahCore on PID 8788
09:38:20:WU01:FS00:Core PID:6892
09:38:20:WU01:FS00:FahCore 0xa7 started
09:38:20:WU01:FS00:0xa7:*********************** Log Started 2020-06-17T09:38:20Z ***********************
09:38:20:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
09:38:20:WU01:FS00:0xa7:       Type: 0xa7
09:38:20:WU01:FS00:0xa7:       Core: Gromacs
09:38:20:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 8788 -checkpoint 15 -np 4
09:38:20:WU01:FS00:0xa7:************************************ CBang *************************************
09:38:20:WU01:FS00:0xa7:       Date: Oct 26 2019
09:38:20:WU01:FS00:0xa7:       Time: 01:38:25
09:38:20:WU01:FS00:0xa7:   Revision: c46a1a011a24143739ac7218c5a435f66777f62f
09:38:20:WU01:FS00:0xa7:     Branch: master
09:38:20:WU01:FS00:0xa7:   Compiler: Visual C++ 2008
09:38:20:WU01:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:38:20:WU01:FS00:0xa7:   Platform: win32 10
09:38:20:WU01:FS00:0xa7:       Bits: 64
09:38:20:WU01:FS00:0xa7:       Mode: Release
09:38:20:WU01:FS00:0xa7:************************************ System ************************************
09:38:20:WU01:FS00:0xa7:        CPU: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G
09:38:20:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 21 Model 48 Stepping 1
09:38:20:WU01:FS00:0xa7:       CPUs: 4
09:38:20:WU01:FS00:0xa7:     Memory: 6.94GiB
09:38:20:WU01:FS00:0xa7:Free Memory: 4.94GiB
09:38:20:WU01:FS00:0xa7:    Threads: WINDOWS_THREADS
09:38:20:WU01:FS00:0xa7: OS Version: 6.2
09:38:20:WU01:FS00:0xa7:Has Battery: false
09:38:20:WU01:FS00:0xa7: On Battery: false
09:38:20:WU01:FS00:0xa7: UTC Offset: -4
09:38:20:WU01:FS00:0xa7:        PID: 6892
09:38:20:WU01:FS00:0xa7:        CWD: C:\Users\[]\AppData\Roaming\FAHClient\work
09:38:20:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
09:38:20:WU01:FS00:0xa7:    Version: 0.0.18
09:38:20:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
09:38:20:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
09:38:20:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
09:38:20:WU01:FS00:0xa7:       Date: Oct 26 2019
09:38:20:WU01:FS00:0xa7:       Time: 01:52:30
09:38:20:WU01:FS00:0xa7:   Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
09:38:20:WU01:FS00:0xa7:     Branch: master
09:38:20:WU01:FS00:0xa7:   Compiler: Visual C++ 2008
09:38:20:WU01:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:38:20:WU01:FS00:0xa7:   Platform: win32 10
09:38:20:WU01:FS00:0xa7:       Bits: 64
09:38:20:WU01:FS00:0xa7:       Mode: Release
09:38:20:WU01:FS00:0xa7:************************************ Build *************************************
09:38:20:WU01:FS00:0xa7:       SIMD: avx_256
09:38:20:WU01:FS00:0xa7:********************************************************************************
09:38:20:WU01:FS00:0xa7:Project: 16446 (Run 189, Clone 0, Gen 33)
09:38:20:WU01:FS00:0xa7:Unit: 0x0000002e80fccb015eb9f92b47f30ae5
09:38:20:WU01:FS00:0xa7:Reading tar file core.xml
09:38:20:WU01:FS00:0xa7:Reading tar file frame33.tpr
09:38:20:WU01:FS00:0xa7:Digital signatures verified
09:38:20:WU01:FS00:0xa7:Calling: mdrun -s frame33.tpr -o frame33.trr -x frame33.xtc -cpt 15 -nt 4
09:38:20:WU01:FS00:0xa7:Steps: first=8250000 total=250000
09:38:22:WU01:FS00:0xa7:Completed 1 out of 250000 steps (0%)
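
If you are collecting these from many clients, a small script can pull out the fields worth reporting. The Python sketch below assumes the log looks like the sample above, with a `Project: ... (Run ..., Clone ..., Gen ...)` line and a `Calling: mdrun ... -nt N` line. The function name `summarize_wu` is hypothetical, and the patterns have not been checked against other core versions.

```python
import re

# Patterns modelled on the sample log above; other cores may log differently.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
THREADS_RE = re.compile(r"Calling: mdrun .*-nt (\d+)")


def summarize_wu(log_text):
    """Extract project/run/clone/gen and the mdrun thread count from log text.

    Fields that cannot be found are left as None.
    """
    info = {"project": None, "run": None, "clone": None, "gen": None, "threads": None}
    match = PROJECT_RE.search(log_text)
    if match:
        keys = ("project", "run", "clone", "gen")
        info.update(zip(keys, map(int, match.groups())))
    match = THREADS_RE.search(log_text)
    if match:
        info["threads"] = int(match.group(1))
    return info


sample = """\
09:38:20:WU01:FS00:0xa7:Project: 16446 (Run 189, Clone 0, Gen 33)
09:38:20:WU01:FS00:0xa7:Calling: mdrun -s frame33.tpr -o frame33.trr -x frame33.xtc -cpt 15 -nt 4
"""
print(summarize_wu(sample))
```

Posting the project, run, clone, gen and thread count together is what lets a failing combination be matched to a constraint on the project.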

Re: Domain Decomp - the new norm?

Posted: Thu Jul 09, 2020 1:08 am
by the901
I just shifted everything to 32 and 16 core slots. We'll see how it goes.

Re: Domain Decomp - the new norm?

Posted: Thu Jul 09, 2020 1:43 am
by bruce
12 and 18 are also expected to be acceptable values if you really want to use all of your threads.

See the numbers provided here, and notice that some numbers above 18 are modified by the diversion of threads to PME.
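
To get a feel for why some thread counts are awkward, remember that GROMACS domain decomposition arranges its particle-particle ranks on an nx x ny x nz grid whose product equals the rank count. The Python sketch below only enumerates those grids. It is an illustration, not what the core actually does: `dd_grids` is a made-up name, it ignores the minimum cell size checks, and it assumes every thread is a particle-particle rank, which stops being true once threads are diverted to PME.

```python
def dd_grids(ranks):
    """List every (nx, ny, nz) with nx <= ny <= nz and nx * ny * nz == ranks.

    Hypothetical helper for illustration; it does not reproduce the
    checks GROMACS performs when choosing a decomposition.
    """
    grids = []
    for nx in range(1, ranks + 1):
        if ranks % nx:
            continue
        rest = ranks // nx
        for ny in range(nx, rest + 1):
            if rest % ny:
                continue
            nz = rest // ny
            if nz >= ny:  # keep each unordered grid only once
                grids.append((nx, ny, nz))
    return grids


for count in (16, 18, 58):
    print(count, dd_grids(count))
```

A count such as 58 can only be laid out with 29 or 58 cells along one axis, while 16 and 18 both allow compact grids. Whether a particular grid is usable still depends on the size of the simulated system in each project.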