There is no domain decomposition for 50 ranks that is compat

craftit
Posts: 2
Joined: Thu Apr 23, 2020 8:30 am

Post by craftit »

My CPU client keeps getting caught in an infinite restart loop with the following error. The only way to get things working again is to delete the 'work' directory and restart. Is there a way to prevent this particular type of work unit from running? Others work fine.

Code:

08:33:22:WU03:FS00:0xa7:ERROR:-------------------------------------------------------
08:33:22:WU03:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:33:22:WU03:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
08:33:22:WU03:FS00:0xa7:ERROR:
08:33:22:WU03:FS00:0xa7:ERROR:Fatal error:
08:33:22:WU03:FS00:0xa7:ERROR:There is no domain decomposition for 50 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
08:33:22:WU03:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
08:33:22:WU03:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
08:33:22:WU03:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:33:22:WU03:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:33:22:WU03:FS00:0xa7:ERROR:-------------------------------------------------------
My system:

Code:

08:33:22:WU03:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
08:33:22:WU03:FS00:0xa7:       Type: 0xa7
08:33:22:WU03:FS00:0xa7:       Core: Gromacs
08:33:22:WU03:FS00:0xa7:       Args: -dir 03 -suffix 01 -version 705 -lifeline 16575 -checkpoint 15 -np
08:33:22:WU03:FS00:0xa7:             62
08:33:22:WU03:FS00:0xa7:************************************ CBang *************************************
08:33:22:WU03:FS00:0xa7:       Date: Nov 5 2019
08:33:22:WU03:FS00:0xa7:       Time: 06:06:57
08:33:22:WU03:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
08:33:22:WU03:FS00:0xa7:     Branch: master
08:33:22:WU03:FS00:0xa7:   Compiler: GNU 8.3.0
08:33:22:WU03:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
08:33:22:WU03:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:33:22:WU03:FS00:0xa7:       Bits: 64
08:33:22:WU03:FS00:0xa7:       Mode: Release
08:33:22:WU03:FS00:0xa7:************************************ System ************************************
08:33:22:WU03:FS00:0xa7:        CPU: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
08:33:22:WU03:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
08:33:22:WU03:FS00:0xa7:       CPUs: 64
08:33:22:WU03:FS00:0xa7:     Memory: 125.53GiB
08:33:22:WU03:FS00:0xa7:Free Memory: 115.55GiB
08:33:22:WU03:FS00:0xa7:    Threads: POSIX_THREADS
08:33:22:WU03:FS00:0xa7: OS Version: 4.15
08:33:22:WU03:FS00:0xa7:Has Battery: false
08:33:22:WU03:FS00:0xa7: On Battery: false
08:33:22:WU03:FS00:0xa7: UTC Offset: 1
08:33:22:WU03:FS00:0xa7:        PID: 16579
08:33:22:WU03:FS00:0xa7:        CWD: /home/charles/work
08:33:22:WU03:FS00:0xa7:******************************** Build - libFAH ********************************
08:33:22:WU03:FS00:0xa7:    Version: 0.0.18
08:33:22:WU03:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:33:22:WU03:FS00:0xa7:  Copyright: 2019 foldingathome.org
08:33:22:WU03:FS00:0xa7:   Homepage: https://foldingathome.org/
08:33:22:WU03:FS00:0xa7:       Date: Nov 5 2019
08:33:22:WU03:FS00:0xa7:       Time: 06:13:26
08:33:22:WU03:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
08:33:22:WU03:FS00:0xa7:     Branch: master
08:33:22:WU03:FS00:0xa7:   Compiler: GNU 8.3.0
08:33:22:WU03:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
08:33:22:WU03:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:33:22:WU03:FS00:0xa7:       Bits: 64
08:33:22:WU03:FS00:0xa7:       Mode: Release
08:33:22:WU03:FS00:0xa7:************************************ Build *************************************
08:33:22:WU03:FS00:0xa7:       SIMD: avx_256
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Post by Neil-B »

Actually, if you reduce the number of cores on the slot you don't have to delete the work unit; it will finish once you find an acceptable core count … It looks like you are running a 62-core slot, so trying 60 looks like a good choice … See this thread for some research into which core counts might or might not work well: viewtopic.php?f=72&t=34350&p=328632&hil ... on#p328109.

If you regularly hit this issue then a permanent shift from 62 cores to 60 cores might help … or, even though I am an advocate for running the biggest slots possible, you may find that two smaller slots are necessary to keep you on WUs that don't have issues.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
anandhanju
Posts: 522
Joined: Mon Dec 03, 2007 4:33 am
Location: Australia

Post by anandhanju »

Can you provide the entirety of the log that contains
a) the number of cores you've allocated to CPU folding (to confirm this is 50) and
b) the project number and Run, Clone, Gen identifiers for the work unit?

This will help the researchers block this project from getting allocated to clients that are using that many cores.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Location: UK

Post by Neil-B »

I think you may find this is a 62-core slot which has offloaded 12 cores to PME and is having problems with the remaining 50 … but let's see the full log.
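
To illustrate why a particular rank count can fail, here is a minimal Python sketch of the constraint the error message describes: the box has to be split into nx × ny × nz cells, one per rank, and no cell may be narrower than the minimum cell size. It is a simplified model that assumes a rectangular box and ignores everything else GROMACS takes into account. The minimum cell size of 1.37225 nm comes from the log above; the box dimensions are hypothetical, since the log does not give them, and `valid_grids` is an illustrative helper, not a GROMACS function.

```python
from itertools import product


def valid_grids(n_ranks, box_nm, min_cell_nm):
    """Return every (nx, ny, nz) with nx*ny*nz == n_ranks whose cells
    are at least min_cell_nm wide along each box edge."""
    # Largest number of cells that fits along each edge of the box.
    max_cells = [int(edge // min_cell_nm) for edge in box_nm]
    grids = []
    for nx, ny in product(range(1, max_cells[0] + 1),
                          range(1, max_cells[1] + 1)):
        if n_ranks % (nx * ny) == 0:
            nz = n_ranks // (nx * ny)
            if nz <= max_cells[2]:
                grids.append((nx, ny, nz))
    return grids


MIN_CELL_NM = 1.37225        # from the error message in the log
BOX_NM = (6.5, 6.5, 6.5)     # hypothetical box, NOT taken from the log

print(valid_grids(50, BOX_NM, MIN_CELL_NM))  # 50 = 2 x 5 x 5
print(valid_grids(48, BOX_NM, MIN_CELL_NM))  # 48 = 3 x 4 x 4
```

With this made-up box at most four cells fit along each edge, so 50 ranks cannot be arranged (every factorisation needs a 5 or larger somewhere) while 48 can. The real box for this work unit may differ, which is why the workable core count has to be found by trial.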
craftit
Posts: 2
Joined: Thu Apr 23, 2020 8:30 am

Post by craftit »

Yes, there are 62 cores allocated, though I see no obvious way of changing this. I will try to find more information when it happens again.
_r2w_ben
Posts: 285
Joined: Wed Apr 23, 2008 3:11 pm

Post by _r2w_ben »

craftit wrote: Yes, there are 62 cores allocated, though I see no obvious way of changing this ? I will try and find more information when it happens again.

On Linux, you'll need to edit /etc/fahclient/config.xml.

Replace this part

Code:

<slot id='0' type='CPU' />
with this

Code:

<slot id='0' type='CPU'>
    <cpus v='60'/>
</slot>

If you get another domain decomposition error in the future, change 60 to 45 and let the work unit finish. Then change it back to 60 so that the following work unit uses all 60 cores again.
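
If you end up switching between core counts often, the same edit can be scripted. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; it assumes the slot elements look like the ones shown above and that they sit under a single root element (named `<config>` in the sample, which is an assumption). `set_slot_cpus` is a hypothetical helper, and the sketch works on a string so that nothing on disk is touched.

```python
import xml.etree.ElementTree as ET


def set_slot_cpus(config_xml, slot_id, cpus):
    """Return config_xml with <cpus v='...'/> set on the matching CPU slot."""
    root = ET.fromstring(config_xml)
    for slot in root.iter("slot"):
        if slot.get("id") == str(slot_id) and slot.get("type") == "CPU":
            node = slot.find("cpus")
            if node is None:
                # The slot had no explicit core count yet, so add one.
                node = ET.SubElement(slot, "cpus")
            node.set("v", str(cpus))
    return ET.tostring(root, encoding="unicode")


# Sample input only; the <config> root element is an assumption.
SAMPLE = "<config><slot id='0' type='CPU' /></config>"

updated = set_slot_cpus(SAMPLE, 0, 60)
print(updated)

# Dropping to 45 for a problem work unit reuses the existing <cpus> element.
print(set_slot_cpus(updated, 0, 45))
```

To apply it for real you would read /etc/fahclient/config.xml into a string, pass it through the function, and write the result back, keeping a backup copy of the original file.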
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Post by foldy »

Another option is to create two CPU slots with 30 threads each.