I’m not sure whether this is the right subforum, but let’s give it a try.
I’m running F@H as a container in my Kubernetes cluster, since there is some spare capacity, especially at night. I used CentOS 8 as the base image and installed the provided rpm into it. I wrote a Helm chart and deployed the client onto my cluster. So far, so good…
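For context, the image is basically just the stock base plus the client rpm. A rough sketch of the Dockerfile is below; the rpm file name is a placeholder for whatever version you download, the FAHClient paths are just the defaults the rpm installs to as far as I know, and the rpm’s post-install scripts may need some extra care in a container build:

Code:
FROM centos:8

# Folding@home client rpm downloaded from the project site
# (file name/version here is a placeholder)
COPY fahclient.x86_64.rpm /tmp/fahclient.rpm
RUN rpm -i /tmp/fahclient.rpm && rm -f /tmp/fahclient.rpm

# Run the client in the foreground so the container stays up;
# config.xml is mounted into the Pod by the Helm chart
ENTRYPOINT ["/usr/bin/FAHClient", "--config", "/etc/fahclient/config.xml"]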
I started with only 12 CPU cores per Pod (roughly Kubernetes’ equivalent of a VM) and a few gigabytes of memory. That seemed to work fine, so I increased the Pod size to over 30 cores and also scaled up the number of Pods considerably. I quickly noticed that my disk usage went critical, climbing to almost 100%. After a bit of debugging I found a lot of core dumps in the work folder. I checked the logs and saw this error:
Code:
17:51:39:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
17:51:39:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
17:51:39:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
17:51:39:WU01:FS00:0xa7:ERROR:
17:51:39:WU01:FS00:0xa7:ERROR:Fatal error:
17:51:39:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 70 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
17:51:39:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
17:51:39:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
17:51:39:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
17:51:39:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
17:51:39:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
That pointed me to the explanation on the GROMACS errors page linked in the log:
This means you tried to run a parallel calculation, and when mdrun tried to partition your simulation cell into chunks for each processor, it couldn't. The minimum cell size is controlled by the size of the largest charge group or bonded interaction and the largest of rvdw, rlist and rcoulomb, some other effects of bond constraints, and a safety margin. Thus it is not possible to run a small simulation with large numbers of processors. So, if grompp warned you about a large charge group, pay attention and reconsider its size. mdrun prints a breakdown of how it computed this minimum size in the .log file, so you can perhaps find a cause there.
If you didn't think you were running a parallel calculation, be aware that from 4.5, GROMACS uses thread-based parallelism by default. To prevent this, you can either give mdrun the "-nt 1" command line option, or build GROMACS so that it will not use threads. Otherwise, you might be using an MPI-enabled GROMACS and not be aware of the fact.
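My guess is that I need to cap the CPU count per slot so the core never tries to split a work unit across 30+ threads, and keep that in line with the Pod’s CPU limit. If I remember the config syntax right, it would be something like this in config.xml (24 is only an example value, not a recommendation):

Code:
<config>
  <!-- user/team/passkey options omitted -->

  <!-- Folding Slots: cap the CPU slot at a fixed thread count instead of
       letting it grab everything the Pod exposes. 24 is just an example,
       matched to the Pod's CPU limit. -->
  <slot id='0' type='CPU'>
    <cpus v='24'/>
  </slot>
</config>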
Any idea how to solve this?