Domain Decomp - the new norm?
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 336
- Joined: Fri Jun 26, 2009 4:34 am
Domain Decomp - the new norm?
I have been folding with a 48 thread server for about 4 years. I do not remember seeing frequent domain decomposition problems until recently. Anyone know what has changed in the Folding World to cause this headache? I used to let my computer run for weeks to months without much intervention. Now, I have to check it constantly to make sure it is not stalled with a decomp problem.
-
- Site Admin
- Posts: 7922
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
- Location: W. MA
Re: Domain Decomp - the new norm?
I think it is a combination of factors. First, new projects are spending less time in internal and beta testing before being released to wider distribution. Included in that is limited capacity to test on core counts higher than 32.
Second, the researchers are working on different protein assemblages than they normally do. So they are also on a learning curve of what works and what causes problems.
There are probably other factors involved.
I do know that the researchers take notice of the reports we receive here and pass on to them, and that they use those reports to refine which CPU thread assignments are allowed on the active projects.
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
-
- Posts: 2040
- Joined: Sat Dec 01, 2012 3:43 pm
- Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441
Re: Domain Decomp - the new norm?
As a workaround, I would recommend splitting your 48 threads into 3 CPU slots with 16 threads each.
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
- Location: UK
Re: Domain Decomp - the new norm?
One 32-thread slot and one 16-thread slot would be much better.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: Domain Decomp - the new norm?
As a ~10M point a day folder, I'm so frustrated that I'm about ready to jump ship. I have 29 clients and a good % of those are running 54 cores. I've been seeing the issue with 58 cores as well. Today has been horrible for this issue. It's just been getting worse over time. I wish the client would adjust for the issue more than it does already. I don't have time to babysit it.
-
- Posts: 2040
- Joined: Sat Dec 01, 2012 3:43 pm
- Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441
Re: Domain Decomp - the new norm?
Yes, FAHClient should solve this issue itself. But as a workaround, just split your 54-core machines into 3 CPU slots with 16 threads each.
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
- Location: UK
Re: Domain Decomp - the new norm?
... and as previously mentioned a 32 and a 16 would be better ... but yes it would be nice if the cores could handle decomp issues a bit more elegantly
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: Domain Decomp - the new norm?
There's a new version of GROMACS in the works that reduces the scope of the Domain Decomposition errors, but it's not expected to prevent them entirely. I can't predict when we might see that new version, so for the time being we can only fix one project at a time, for a specific number of threads, as problems are reported.
As foldy and neil-B have said, we've pretty much solved the problem for smaller numbers of threads.
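For anyone wondering why only certain thread counts fail, here is a rough Python sketch of the idea behind domain decomposition: GROMACS has to cut the simulation box into a grid of cells, one per rank, and it gives up when no grid keeps every cell above a minimum size. This is a toy model only. The function name `dd_grids`, the 6 nm cubic box and the 1 nm minimum cell size are made-up values for illustration, and the real code also accounts for PME ranks and other constraints that are ignored here.

```python
from itertools import product

def dd_grids(ranks, box=(6.0, 6.0, 6.0), min_cell=1.0):
    """List (nx, ny, nz) grids with nx*ny*nz == ranks whose cells are all
    at least `min_cell` wide. `box` and `min_cell` (nm) are hypothetical."""
    # Largest number of cells that fits along each axis.
    limits = [int(side // min_cell) for side in box]
    grids = []
    for nx, ny in product(range(1, limits[0] + 1), range(1, limits[1] + 1)):
        nz, rem = divmod(ranks, nx * ny)
        # Keep the grid only if nx*ny divides ranks and nz fits on the z axis.
        if rem == 0 and 1 <= nz <= limits[2]:
            grids.append((nx, ny, nz))
    return grids

for n in (16, 48, 58):
    print(n, "ok" if dd_grids(n) else "no valid grid")
```

In this toy box, 16 and 48 have usable grids, while 58 (2 x 29) does not, because the factor 29 would need 29 cells along one axis. Thread counts with a large prime factor run into this kind of wall much sooner than counts built from small factors.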
Posting FAH's log:
How to provide enough info to get helpful support.
Re: Domain Decomp - the new norm?
the901 wrote: As a ~10M point a day folder, I'm about ready to jump ship I'm so frustrated. I have 29 clients and a good % of those are running 54 cores. I've been seeing the issue with 58 cores as well. Today has been horrible for this issue. It's just been getting worse over time. I wish the client would adjust for the issue more than it does already. I don't have time to babysit it.
Please post the following log portion when one of these is assigned. Then the researchers can confirm that there are appropriate constraints in place for that project. There can be variations between runs within a project, so thread counts that were tested initially and worked may need to be adjusted.
Code:
09:38:18:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:4 from 128.252.203.1
09:38:18:WU01:FS00:Connecting to 128.252.203.1:8080
09:38:18:WU01:FS00:Downloading 5.77MiB
09:38:20:WU01:FS00:Download complete
09:38:20:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16446 run:189 clone:0 gen:33 core:0xa7 unit:0x0000002e80fccb015eb9f92b47f30ae5
09:38:20:WU01:FS00:Starting
09:38:20:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\[]\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 01 -suffix 01 -version 705 -lifeline 1168 -checkpoint 15 -np 4
09:38:20:WU01:FS00:Started FahCore on PID 8788
09:38:20:WU01:FS00:Core PID:6892
09:38:20:WU01:FS00:FahCore 0xa7 started
09:38:20:WU01:FS00:0xa7:*********************** Log Started 2020-06-17T09:38:20Z ***********************
09:38:20:WU01:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
09:38:20:WU01:FS00:0xa7: Type: 0xa7
09:38:20:WU01:FS00:0xa7: Core: Gromacs
09:38:20:WU01:FS00:0xa7: Args: -dir 01 -suffix 01 -version 705 -lifeline 8788 -checkpoint 15 -np 4
09:38:20:WU01:FS00:0xa7:************************************ CBang *************************************
09:38:20:WU01:FS00:0xa7: Date: Oct 26 2019
09:38:20:WU01:FS00:0xa7: Time: 01:38:25
09:38:20:WU01:FS00:0xa7: Revision: c46a1a011a24143739ac7218c5a435f66777f62f
09:38:20:WU01:FS00:0xa7: Branch: master
09:38:20:WU01:FS00:0xa7: Compiler: Visual C++ 2008
09:38:20:WU01:FS00:0xa7: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:38:20:WU01:FS00:0xa7: Platform: win32 10
09:38:20:WU01:FS00:0xa7: Bits: 64
09:38:20:WU01:FS00:0xa7: Mode: Release
09:38:20:WU01:FS00:0xa7:************************************ System ************************************
09:38:20:WU01:FS00:0xa7: CPU: AMD A8-7600 Radeon R7, 10 Compute Cores 4C+6G
09:38:20:WU01:FS00:0xa7: CPU ID: AuthenticAMD Family 21 Model 48 Stepping 1
09:38:20:WU01:FS00:0xa7: CPUs: 4
09:38:20:WU01:FS00:0xa7: Memory: 6.94GiB
09:38:20:WU01:FS00:0xa7:Free Memory: 4.94GiB
09:38:20:WU01:FS00:0xa7: Threads: WINDOWS_THREADS
09:38:20:WU01:FS00:0xa7: OS Version: 6.2
09:38:20:WU01:FS00:0xa7:Has Battery: false
09:38:20:WU01:FS00:0xa7: On Battery: false
09:38:20:WU01:FS00:0xa7: UTC Offset: -4
09:38:20:WU01:FS00:0xa7: PID: 6892
09:38:20:WU01:FS00:0xa7: CWD: C:\Users\[]\AppData\Roaming\FAHClient\work
09:38:20:WU01:FS00:0xa7:******************************** Build - libFAH ********************************
09:38:20:WU01:FS00:0xa7: Version: 0.0.18
09:38:20:WU01:FS00:0xa7: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
09:38:20:WU01:FS00:0xa7: Copyright: 2019 foldingathome.org
09:38:20:WU01:FS00:0xa7: Homepage: https://foldingathome.org/
09:38:20:WU01:FS00:0xa7: Date: Oct 26 2019
09:38:20:WU01:FS00:0xa7: Time: 01:52:30
09:38:20:WU01:FS00:0xa7: Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
09:38:20:WU01:FS00:0xa7: Branch: master
09:38:20:WU01:FS00:0xa7: Compiler: Visual C++ 2008
09:38:20:WU01:FS00:0xa7: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
09:38:20:WU01:FS00:0xa7: Platform: win32 10
09:38:20:WU01:FS00:0xa7: Bits: 64
09:38:20:WU01:FS00:0xa7: Mode: Release
09:38:20:WU01:FS00:0xa7:************************************ Build *************************************
09:38:20:WU01:FS00:0xa7: SIMD: avx_256
09:38:20:WU01:FS00:0xa7:********************************************************************************
09:38:20:WU01:FS00:0xa7:Project: 16446 (Run 189, Clone 0, Gen 33)
09:38:20:WU01:FS00:0xa7:Unit: 0x0000002e80fccb015eb9f92b47f30ae5
09:38:20:WU01:FS00:0xa7:Reading tar file core.xml
09:38:20:WU01:FS00:0xa7:Reading tar file frame33.tpr
09:38:20:WU01:FS00:0xa7:Digital signatures verified
09:38:20:WU01:FS00:0xa7:Calling: mdrun -s frame33.tpr -o frame33.trr -x frame33.xtc -cpt 15 -nt 4
09:38:20:WU01:FS00:0xa7:Steps: first=8250000 total=250000
09:38:22:WU01:FS00:0xa7:Completed 1 out of 250000 steps (0%)
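If you have many clients to check, a small script can pull the project, run, clone, gen and thread count out of a log portion like the one above before you post it. This is a minimal Python sketch; the function name `summarize_log` is hypothetical, and the patterns are modeled only on the "Project:" and "Calling: mdrun" lines shown in this log, so other core versions may format them differently.

```python
import re

# Patterns modeled on the log lines shown above (assumption: other cores may differ).
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
THREADS_RE = re.compile(r"Calling: mdrun .*-nt (\d+)")

def summarize_log(lines):
    """Return project/run/clone/gen and thread count found in FAH log lines."""
    summary = {}
    for line in lines:
        m = PRCG_RE.search(line)
        if m:
            keys = ("project", "run", "clone", "gen")
            summary.update(zip(keys, map(int, m.groups())))
        m = THREADS_RE.search(line)
        if m:
            summary["threads"] = int(m.group(1))
    return summary

sample = [
    "09:38:20:WU01:FS00:0xa7:Project: 16446 (Run 189, Clone 0, Gen 33)",
    "09:38:20:WU01:FS00:0xa7:Calling: mdrun -s frame33.tpr -o frame33.trr "
    "-x frame33.xtc -cpt 15 -nt 4",
]
print(summarize_log(sample))
```

For the two sample lines this yields project 16446, run 189, clone 0, gen 33 and 4 threads, which is the combination the researchers need to check the constraints for a project.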
Re: Domain Decomp - the new norm?
I just shifted everything to 32 and 16 core slots. We'll see how it goes.
Re: Domain Decomp - the new norm?
12 and 18 are also expected to be acceptable values if you really want to use all of your threads.
See the numbers provided here and notice that some numbers above 18 are modified by the diversion of threads to PME.
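To work out a split for a given machine, something like the following Python sketch may help. It only considers the slot sizes mentioned in this thread (32, 18, 16 and 12), which is not an official list, and the function name `split_threads` and the `max_slots` limit of 4 are hypothetical choices for illustration. It picks the combination that uses the most threads, preferring fewer slots on a tie.

```python
from itertools import combinations_with_replacement

# Slot sizes named as workable in this thread; not an official list.
SIZES = (32, 18, 16, 12)

def split_threads(total, sizes=SIZES, max_slots=4):
    """Pick slot sizes that use as many of `total` threads as possible,
    preferring fewer slots when two choices use the same number."""
    best = ()
    ordered = sorted(sizes, reverse=True)
    for n in range(1, max_slots + 1):
        for combo in combinations_with_replacement(ordered, n):
            used = sum(combo)
            if used > total:
                continue
            # Strictly more threads wins; on a tie the earlier,
            # fewer-slot combination is kept.
            if used > sum(best):
                best = combo
    return list(best)

for total in (48, 54, 58):
    print(total, split_threads(total))
```

With these assumptions, 48 threads come out as one 32-thread and one 16-thread slot, and 54 threads as three 18-thread slots. Whether a given size actually works still depends on the project, so treat the result as a starting point and keep reporting failures.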
Posting FAH's log:
How to provide enough info to get helpful support.