I’m not sure whether this is the right subforum, but let’s give it a try.
I’m running F@H as a container in my Kubernetes cluster, since there is some spare capacity, especially at night. I used CentOS 8 as the base image and installed the provided rpm into it. I wrote a Helm chart and deployed the client onto my cluster. So far, so good…
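For context, the image is basically just the stock base plus the client rpm. A rough sketch of the Dockerfile is below; the rpm file name is a placeholder for whatever version you download, the FAHClient paths are just the defaults the rpm installs to as far as I know, and the rpm’s post-install scripts may need some extra care in a container build:

Code:
FROM centos:8

# Folding@home client rpm downloaded from the project site
# (file name/version here is a placeholder)
COPY fahclient.x86_64.rpm /tmp/fahclient.rpm
RUN rpm -i /tmp/fahclient.rpm && rm -f /tmp/fahclient.rpm

# Run the client in the foreground so the container stays up;
# config.xml is mounted into the Pod by the Helm chart
ENTRYPOINT ["/usr/bin/FAHClient", "--config", "/etc/fahclient/config.xml"]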
I started with only 12 CPU cores per Pod (roughly Kubernetes’ equivalent of a VM) and a few gigabytes of memory. That seemed to work fine, so I increased the Pod size to over 30 cores and also scaled up the number of Pods considerably. I quickly noticed that my disk usage went critical, climbing to almost 100%. After a bit of debugging I found a lot of core dumps in the work folder. I checked the logs and saw this error:
Code:
17:51:39:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
17:51:39:WU01:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
17:51:39:WU01:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
17:51:39:WU01:FS00:0xa7:ERROR:
17:51:39:WU01:FS00:0xa7:ERROR:Fatal error:
17:51:39:WU01:FS00:0xa7:ERROR:There is no domain decomposition for 70 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
17:51:39:WU01:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
17:51:39:WU01:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
17:51:39:WU01:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
17:51:39:WU01:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
17:51:39:WU01:FS00:0xa7:ERROR:-------------------------------------------------------
That pointed me to the explanation on the GROMACS errors page linked in the log:
This means you tried to run a parallel calculation, and when mdrun tried to partition your simulation cell into chunks for each processor, it couldn't. The minimum cell size is controlled by the size of the largest charge group or bonded interaction and the largest of rvdw, rlist and rcoulomb, some other effects of bond constraints, and a safety margin. Thus it is not possible to run a small simulation with large numbers of processors. So, if grompp warned you about a large charge group, pay attention and reconsider its size. mdrun prints a breakdown of how it computed this minimum size in the .log file, so you can perhaps find a cause there.
If you didn't think you were running a parallel calculation, be aware that from 4.5, GROMACS uses thread-based parallelism by default. To prevent this, you can either give mdrun the "-nt 1" command line option, or build GROMACS so that it will not use threads. Otherwise, you might be using an MPI-enabled GROMACS and not be aware of the fact.
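My guess is that I need to cap the CPU count per slot so the core never tries to split a work unit across 30+ threads, and keep that in line with the Pod’s CPU limit. If I remember the config syntax right, it would be something like this in config.xml (24 is only an example value, not a recommendation):

Code:
<config>
  <!-- user/team/passkey options omitted -->

  <!-- Folding Slots: cap the CPU slot at a fixed thread count instead of
       letting it grab everything the Pod exposes. 24 is just an example,
       matched to the Pod's CPU limit. -->
  <slot id='0' type='CPU'>
    <cpus v='24'/>
  </slot>
</config>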
Any idea how to solve this?