Project 14584 - Fatal Error on High CPU Count Machine


bpitts2
Posts: 2
Joined: Sat Mar 28, 2020 2:09 pm


Post by bpitts2 »

Hello,

My apologies if this isn't the correct place to post. I ran into an issue overnight with one of my servers, which runs dual Intel Xeon Gold 5220s. Each time the machine tried to run the work unit, it hit this fatal error: "There is no domain decomposition for 54 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm"

I have been able to work around the issue by reducing the original CPU slot to 36 cores and then creating a second slot of 36 cores. I am hoping someone can recommend best practices for this type of compute environment. Is it better to have one slot with all 72 cores, two with 36, or four with 18? Should I turn off hyper-threading and only allow the system to run on physical CPU cores?

We have 5 more of these machines sitting idle that I'd like to put to work towards F@H. Your help is appreciated!

Logs below - Thanks!


05:59:12:WU00:FS00:0xa7:*********************** Log Started 2020-03-28T05:59:12Z ***********************
05:59:12:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
05:59:12:WU00:FS00:0xa7:       Type: 0xa7
05:59:12:WU00:FS00:0xa7:       Core: Gromacs
05:59:12:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 704 -lifeline 1427 -checkpoint 15 -np
05:59:12:WU00:FS00:0xa7:             72
05:59:12:WU00:FS00:0xa7:************************************ CBang *************************************
05:59:12:WU00:FS00:0xa7:       Date: Nov 5 2019
05:59:12:WU00:FS00:0xa7:       Time: 06:06:57
05:59:12:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
05:59:12:WU00:FS00:0xa7:     Branch: master
05:59:12:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
05:59:12:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
05:59:12:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
05:59:12:WU00:FS00:0xa7:       Bits: 64
05:59:12:WU00:FS00:0xa7:       Mode: Release
05:59:12:WU00:FS00:0xa7:************************************ System ************************************
05:59:12:WU00:FS00:0xa7:        CPU: Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz
05:59:12:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 85 Stepping 7
05:59:12:WU00:FS00:0xa7:       CPUs: 72
05:59:12:WU00:FS00:0xa7:     Memory: 15.65GiB
05:59:12:WU00:FS00:0xa7:Free Memory: 15.13GiB
05:59:12:WU00:FS00:0xa7:    Threads: POSIX_THREADS
05:59:12:WU00:FS00:0xa7: OS Version: 4.19
05:59:12:WU00:FS00:0xa7:Has Battery: false
05:59:12:WU00:FS00:0xa7: On Battery: false
05:59:12:WU00:FS00:0xa7: UTC Offset: 0
05:59:12:WU00:FS00:0xa7:        PID: 1431
05:59:12:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
05:59:12:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
05:59:12:WU00:FS00:0xa7:    Version: 0.0.18
05:59:12:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
05:59:12:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
05:59:12:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
05:59:12:WU00:FS00:0xa7:       Date: Nov 5 2019
05:59:12:WU00:FS00:0xa7:       Time: 06:13:26
05:59:12:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
05:59:12:WU00:FS00:0xa7:     Branch: master
05:59:12:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
05:59:12:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
05:59:12:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
05:59:12:WU00:FS00:0xa7:       Bits: 64
05:59:12:WU00:FS00:0xa7:       Mode: Release
05:59:12:WU00:FS00:0xa7:************************************ Build *************************************
05:59:12:WU00:FS00:0xa7:       SIMD: avx_256
05:59:12:WU00:FS00:0xa7:********************************************************************************
05:59:12:WU00:FS00:0xa7:Project: 14584 (Run 0, Clone 548, Gen 32)
05:59:12:WU00:FS00:0xa7:Unit: 0x000000210d5262775e7a6b6d7026244c
05:59:12:WU00:FS00:0xa7:Reading tar file core.xml
05:59:12:WU00:FS00:0xa7:Reading tar file frame32.tpr
05:59:12:WU00:FS00:0xa7:Digital signatures verified
05:59:12:WU00:FS00:0xa7:Calling: mdrun -s frame32.tpr -o frame32.trr -x frame32.xtc -cpt 15 -nt 72
05:59:12:WU00:FS00:0xa7:Steps: first=8000000 total=250000
05:59:12:WU00:FS00:0xa7:ERROR:
05:59:12:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
05:59:12:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
05:59:12:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
05:59:12:WU00:FS00:0xa7:ERROR:
05:59:12:WU00:FS00:0xa7:ERROR:Fatal error:
05:59:12:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 54 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
05:59:12:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
05:59:12:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
05:59:12:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
05:59:12:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
05:59:12:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
Joe_H
Site Admin
Posts: 8226
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: Project 14584 - Fatal Error on High CPU Count Machine

Post by Joe_H »

Please see this topic: viewtopic.php?f=106&t=33511. I gave some information and suggestions there to another person with a high CPU count system.
JimboPalmer
Posts: 2521
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: Project 14584 - Fatal Error on High CPU Count Machine

Post by JimboPalmer »

I have step-by-step instructions for Windows. I am not a Linux user, sorry. I am going to post them in the hope that you can adapt them.

In the taskbar at the lower right of the screen, you should see an F@H molecule icon. Click it (you may need to click the up arrow ^ to see it).

The second item in this menu is Advanced Control. Click it.

On the left of this screen is a Configure button. Click it.

Now you get a screen with a Slots tab. Click it.

In the white field there should be a cpu item. Click it, and then click Edit.

By default, F@H sets the number of CPUs to -1, meaning "let the software decide."

You can enter any number from 1 to the number of threads your CPU supports.

If you have GPUs, F@H reserves one CPU per GPU to feed it data across the PCIe bus.

F@H has difficulty when the number of CPUs is a large prime or a multiple of one.
7 is always large, 5 is sometimes large, and 3 is never large. Try to choose a number whose only prime factors are 2 and/or 3.
2, 3, 4, 6, 8, 9, 12, 16, 18, 24, 27, etc. are good numbers of CPUs to choose.
5, 10, 15, 20, etc. may work most of the time. Other numbers will bite you.

Type the number you want, and click save.

F@H may have issues with CPU counts over 32. If I had more than 32 CPUs, I would make multiple CPU slots. There is an Add button on the Slots screen.
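
For anyone who wants to check a count before typing it in, here is a rough Python sketch of the rule of thumb above. It is only my reading of that rule, not anything taken from the F@H client or GROMACS: the function names, the "good" / "maybe" / "risky" labels, and the `soft_limit=32` parameter are all made up for illustration.

```python
def largest_prime_factor(n: int) -> int:
    """Return the largest prime factor of n (returns 1 for n == 1)."""
    largest = 1
    p = 2
    while p * p <= n:
        while n % p == 0:
            largest = p
            n //= p
        p += 1
    if n > 1:  # whatever is left over is itself prime
        largest = n
    return largest


def rate_cpu_count(n: int) -> str:
    """Label a CPU count using the forum rule of thumb (hypothetical labels).

    "good"  -> only prime factors 2 and/or 3
    "maybe" -> largest prime factor is 5
    "risky" -> has a prime factor of 7 or more
    """
    if n < 1:
        raise ValueError("CPU count must be at least 1")
    lpf = largest_prime_factor(n)
    if lpf <= 3:
        return "good"
    if lpf == 5:
        return "maybe"
    return "risky"


def exceeds_soft_limit(n: int, soft_limit: int = 32) -> bool:
    """True if the count is above the 'over 32 may have issues' guideline."""
    return n > soft_limit


if __name__ == "__main__":
    for count in (72, 36, 24, 20, 18, 14):
        note = " (over 32, consider splitting)" if exceeds_soft_limit(count) else ""
        print(f"{count}: {rate_cpu_count(count)}{note}")
```

Note that 72 passes the prime-factor check (it is 2 x 2 x 2 x 3 x 3) and is only flagged by the over-32 guideline, which is why the two checks are kept separate.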
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
bpitts2
Posts: 2
Joined: Sat Mar 28, 2020 2:09 pm

Re: Project 14584 - Fatal Error on High CPU Count Machine

Post by bpitts2 »

Thank you both! I think I understand what I need to do now!