Domain Decomp - the new norm?

Moderators: Site Moderators, FAHC Science Team

HendricksSA
Posts: 336
Joined: Fri Jun 26, 2009 4:34 am

Domain Decomp - the new norm?

Post by HendricksSA »

I have been folding with a 48-thread server for about 4 years. I do not remember seeing frequent domain decomposition problems until recently. Does anyone know what has changed in the Folding World to cause this headache? I used to let my computer run for weeks to months without much intervention. Now I have to check it constantly to make sure it has not stalled with a decomp problem.
Joe_H
Site Admin
Posts: 7936
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Domain Decomp - the new norm?

Post by Joe_H »

I think it is a combination of factors. First, new projects are spending less time in internal and beta testing before being released to wider distribution. Included in that is limited capacity to test on core counts higher than 32.

Second, the researchers are working on different protein assemblages than they normally do. So they are also on a learning curve of what works and what causes problems.

There are probably other factors involved.

I do know that the researchers take notice of any reports we get here and pass on to them, and they use those reports to refine which CPU thread counts are assigned on the active projects.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Domain Decomp - the new norm?

Post by foldy »

As a workaround, I would recommend splitting your 48 threads into 3 CPU slots with 16 threads each.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Domain Decomp - the new norm?

Post by Neil-B »

A 32 and a 16 would be much better.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
the901
Posts: 3
Joined: Mon Apr 27, 2020 8:55 pm

Re: Domain Decomp - the new norm?

Post by the901 »

As a ~10M point a day folder, I'm so frustrated that I'm about ready to jump ship. I have 29 clients and a good % of those are running 54 cores. I've been seeing the issue with 58 cores as well. Today has been horrible for this issue. It's just been getting worse over time. I wish the client would adjust for the issue more than it does already. I don't have time to babysit it.
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Domain Decomp - the new norm?

Post by foldy »

Yes, FahClient should solve this issue itself. But as a workaround, just split your 54-core machines into 3 CPU slots with 16 threads each.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Domain Decomp - the new norm?

Post by Neil-B »

... and as previously mentioned a 32 and a 16 would be better :) ... but yes it would be nice if the cores could handle decomp issues a bit more elegantly :)
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Domain Decomp - the new norm?

Post by bruce »

There's a new version of GROMACS in the works which reduces the scope of the Domain Decomposition errors, but it's not predicted to prevent them entirely. I can't predict when we might see that new version, but for the time being we can only fix one project at a time, for a specific number of threads, as problems are reported.

As foldy and Neil-B have said, we've pretty much solved the problem for smaller numbers of threads.
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Re: Domain Decomp - the new norm?

Post by _r2w_ben »

the901 wrote: As a ~10M point a day folder, I'm about ready to jump ship I'm so frustrated. I have 29 clients and a good % of those are running 54 cores. I've been seeing the issue with 58 cores as well. Today has been horrible for this issue. It's just been getting worse over time. I wish the client would adjust for the issue more than it does already. I don't have time to babysit it.

Please post the following log portion when one of these is assigned. Then the researchers can confirm that there are appropriate constraints in place for that project. There can be variations between runs within a project, so thread counts that were tested initially and worked may need to be adjusted.

Code:

09:38:18:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:4 from 128.252.203.1
09:38:18:WU01:FS00:Connecting to 128.252.203.1:8080
09:38:18:WU01:FS00:Downloading 5.77MiB
09:38:20:WU01:FS00:Download complete
09:38:20:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16446 run:189 clone:0 gen:33 core:0xa7 unit:0x0000002e80fccb015eb9f92b47f30ae5
09:38:20:WU01:FS00:Starting
09:38:20:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\[]\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 01 -suffix 01 -version 705 -lifeline 1168 -checkpoint 15 -np 4
09:38:20:WU01:FS00:Started FahCore on PID 8788
09:38:20:WU01:FS00:Core PID:6892
09:38:20:WU01:FS00:FahCore 0xa7 started
09:38:20:WU01:FS00:0xa7:*********************** Log Started 2020-06-17T09:38:20Z ***********************
09:38:20:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
09:38:20:WU01:FS00:0xa7:       Type: 0xa7
09:38:20:WU01:FS00:0xa7:       Core: Gromacs
09:38:20:WU01:FS00:0xa7:       Args: -dir 01 -suffix 01 -version 705 -lifeline 8788 -checkpoint 15 -np 4
09:38:20:WU01:FS00:0xa7:************************************ CBang *************************************
09:38:20:WU01:FS00:0xa7:       Date: Oct 26 2019
09:38:20:WU01:FS00:0xa7:       Time: 01:38:25
09:38:20:WU01:FS00:0xa7:   Revision: c46a1a011a24143739ac7218c5a435f66777f62f
09:38:20:WU01:FS00:0xa7:     Branch: master
09:38:20:WU01:FS00:0xa7:   Compiler: Visual C++ 2008
09:38:20:WU01:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:38:20:WU01:FS00:0xa7:   Platform: win32 10
09:38:20:WU01:FS00:0xa7:       Bits: 64
09:38:20:WU01:FS00:0xa7:       Mode: Release
09:38:20:WU01:FS00:0xa7:************************************ System ************************************
09:38:20:WU01:FS00:0xa7:        CPU: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G
09:38:20:WU01:FS00:0xa7:     CPU ID: AuthenticAMD Family 21 Model 48 Stepping 1
09:38:20:WU01:FS00:0xa7:       CPUs: 4
09:38:20:WU01:FS00:0xa7:     Memory: 6.94GiB
09:38:20:WU01:FS00:0xa7:Free Memory: 4.94GiB
09:38:20:WU01:FS00:0xa7:    Threads: WINDOWS_THREADS
09:38:20:WU01:FS00:0xa7: OS Version: 6.2
09:38:20:WU01:FS00:0xa7:Has Battery: false
09:38:20:WU01:FS00:0xa7: On Battery: false
09:38:20:WU01:FS00:0xa7: UTC Offset: -4
09:38:20:WU01:FS00:0xa7:        PID: 6892
09:38:20:WU01:FS00:0xa7:        CWD: C:\Users\[]\AppData\Roaming\FAHClient\work
09:38:20:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
09:38:20:WU01:FS00:0xa7:    Version: 0.0.18
09:38:20:WU01:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
09:38:20:WU01:FS00:0xa7:  Copyright: 2019 foldingathome.org
09:38:20:WU01:FS00:0xa7:   Homepage: https://foldingathome.org/
09:38:20:WU01:FS00:0xa7:       Date: Oct 26 2019
09:38:20:WU01:FS00:0xa7:       Time: 01:52:30
09:38:20:WU01:FS00:0xa7:   Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
09:38:20:WU01:FS00:0xa7:     Branch: master
09:38:20:WU01:FS00:0xa7:   Compiler: Visual C++ 2008
09:38:20:WU01:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:38:20:WU01:FS00:0xa7:   Platform: win32 10
09:38:20:WU01:FS00:0xa7:       Bits: 64
09:38:20:WU01:FS00:0xa7:       Mode: Release
09:38:20:WU01:FS00:0xa7:************************************ Build *************************************
09:38:20:WU01:FS00:0xa7:       SIMD: avx_256
09:38:20:WU01:FS00:0xa7:********************************************************************************
09:38:20:WU01:FS00:0xa7:Project: 16446 (Run 189, Clone 0, Gen 33)
09:38:20:WU01:FS00:0xa7:Unit: 0x0000002e80fccb015eb9f92b47f30ae5
09:38:20:WU01:FS00:0xa7:Reading tar file core.xml
09:38:20:WU01:FS00:0xa7:Reading tar file frame33.tpr
09:38:20:WU01:FS00:0xa7:Digital signatures verified
09:38:20:WU01:FS00:0xa7:Calling: mdrun -s frame33.tpr -o frame33.trr -x frame33.xtc -cpt 15 -nt 4
09:38:20:WU01:FS00:0xa7:Steps: first=8250000 total=250000
09:38:22:WU01:FS00:0xa7:Completed 1 out of 250000 steps (0%)
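
If you have many clients to check, the details the researchers need (project, run, clone, gen and the thread count handed to mdrun) can be pulled out of a log excerpt like the one above with a few lines of Python. This is only a rough sketch: the function name `summarize_wu` and both patterns are my own, written against the line layout shown in this post, and other client or core versions may log these lines differently.

```python
import re

# Matches the "Received Unit" line as it appears in the excerpt above.
UNIT_RE = re.compile(r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")
# Matches the thread count passed to mdrun ("-nt N") on the "Calling:" line.
NT_RE = re.compile(r"Calling: mdrun .*?-nt (\d+)")


def summarize_wu(log_text):
    """Return project/run/clone/gen and thread count found in a log excerpt.

    Hypothetical helper: keys are only present if the matching line was found.
    """
    summary = {}
    unit = UNIT_RE.search(log_text)
    if unit:
        keys = ("project", "run", "clone", "gen")
        summary.update(zip(keys, map(int, unit.groups())))
    nt = NT_RE.search(log_text)
    if nt:
        summary["threads"] = int(nt.group(1))
    return summary


# Two lines copied from the log above.
sample = (
    "09:38:20:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR "
    "project:16446 run:189 clone:0 gen:33 core:0xa7 "
    "unit:0x0000002e80fccb015eb9f92b47f30ae5\n"
    "09:38:20:WU01:FS00:0xa7:Calling: mdrun -s frame33.tpr -o frame33.trr "
    "-x frame33.xtc -cpt 15 -nt 4\n"
)
print(summarize_wu(sample))
```

Posting the project/run/clone/gen together with the thread count is what lets a failing combination be tied to a specific work unit.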
the901
Posts: 3
Joined: Mon Apr 27, 2020 8:55 pm

Re: Domain Decomp - the new norm?

Post by the901 »

I just shifted everything to 32 and 16 core slots. We'll see how it goes.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Domain Decomp - the new norm?

Post by bruce »

12 and 18 are also expected to be acceptable values if you really want to use all of your threads.

See the numbers provided here and notice that some numbers above 18 are modified by the diversion of threads to PME.
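
As a rough way to compare candidate slot sizes, one can look at the prime factors of the thread count. This rests on an assumption not spelled out in this thread: GROMACS has to split the simulation box into a grid of cells, and a count with a large prime factor leaves few ways to do that. The helper name `largest_prime_factor` is my own, and the result is only a first filter, because for counts above 18 some threads may be diverted to PME, so the number left for decomposition is not the number you configured.

```python
def largest_prime_factor(n):
    """Return the largest prime factor of n (for n >= 2)."""
    largest = 1
    p = 2
    while p * p <= n:
        # Divide out each prime p completely before moving on.
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:
        # Whatever remains is itself a prime factor larger than any p tried.
        largest = n
    return largest


# Thread counts mentioned in this thread.
for threads in (12, 16, 18, 32, 48, 54, 58):
    print(threads, "threads -> largest prime factor", largest_prime_factor(threads))
```

By this measure 12, 16, 18 and 32 only involve the factors 2 and 3, while 58 contains 29. Note that 54 also looks harmless here even though it was reported as failing, which is consistent with the PME caveat above and shows the limits of the heuristic.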