18601 checkpoints too often

Post by **Alex_Atkin** » Wed Nov 16, 2022 2:09 am

I'm noticing a waste of GPU resources on 18601, as it checkpoints every 25000 steps. On a 4090 that's under a minute; on a 3080 it's about every 2 minutes. Each checkpoint seems to take a few seconds, which surely adds up to a lot of wasted time over 24 hours.

Why have the option to set the checkpointing frequency if it's ignored?
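To put a rough number on that overhead, here is a minimal sketch of the arithmetic. The 60 s and 120 s intervals come from the post above; the 3 s pause per checkpoint is an assumption standing in for "a few seconds", not a measured value, and `checkpoint_overhead` is a hypothetical helper name.

```python
def checkpoint_overhead(interval_s: float, pause_s: float, hours: float = 24.0):
    """Return (checkpoint count, seconds lost, fraction lost) over a run.

    interval_s: wall-clock time from one checkpoint to the next, pause included.
    pause_s:    assumed stall per checkpoint (hypothetical, not measured).
    """
    total_s = hours * 3600.0
    count = int(total_s // interval_s)
    lost_s = count * pause_s
    return count, lost_s, lost_s / total_s


# Intervals taken from the post: ~1 minute on a 4090, ~2 minutes on a 3080.
for name, interval in (("4090", 60.0), ("3080", 120.0)):
    count, lost_s, frac = checkpoint_overhead(interval, pause_s=3.0)
    print(f"{name}: {count} checkpoints, {lost_s / 60:.0f} min lost ({frac:.1%})")
```

Under those assumptions the pause costs roughly 72 minutes per day on the 4090 and 36 minutes on the 3080; a different pause length scales the result linearly.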

Conversely, I'm not seeing any checkpointing in the logs at all for aarch64 WUs although looking in the data folder they do seem to be written.

Post by **Joe_H** » Wed Nov 16, 2022 2:37 am

The checkpoints on GPU projects are set by the researcher. They happen when important data is collected and retained for later analysis after the WU is returned. That is also when a sanity check is run on the CPU against the data produced up to that point, to verify that the GPU is calculating properly. This was found necessary because unstable GPUs may not give any indication that there are errors in the processing of the WU data.

The algorithms used in the CPU processing cores based on GROMACS are different and can be interrupted on a timed basis. In the latest versions they also will attempt to write out a checkpoint when folding is paused. The OpenMM code used in the GPU folding core needs to be interrupted at certain points to be able to write out a usable checkpoint.

Post by **Alex_Atkin** » Wed Nov 16, 2022 6:28 am

Thanks, that's obviously more important than getting it done a little faster.

Post by **toTOW** » Wed Nov 16, 2022 7:21 pm

The checkpoints are usually set so that not too much compute time is wasted when low-end GPUs are interrupted ...

Post by **PaulTV** » Thu Nov 17, 2022 10:46 am

Would be nice if:
- Checkpoints could be written without interrupting calculations (dunno how hard that would be if at all possible), or
- There would be something like 'if last checkpoint was within x minutes, skip this one', with default of 5 or 15m, configurable with an advanced setting - that way there are still checkpoints on whole percentages but it would auto adjust to the speed of the card
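
The second suggestion can be sketched as a small time-based throttle. This is only an illustration of the proposed rule, not an existing client or core feature: `CheckpointThrottle`, `min_interval_s`, and `should_write` are hypothetical names, and the 300 s default mirrors the 5-minute figure suggested above.

```python
import time


class CheckpointThrottle:
    """Skip a scheduled checkpoint if the last one was written too recently."""

    def __init__(self, min_interval_s: float = 300.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock          # injectable so the rule can be tested
        self._last_written = None    # time of the last checkpoint actually written

    def should_write(self) -> bool:
        """Call at each scheduled checkpoint; True means write this one."""
        now = self._clock()
        if self._last_written is None or now - self._last_written >= self.min_interval_s:
            self._last_written = now
            return True
        return False


# A fast card reaching a scheduled checkpoint every 60 s (simulated clock).
ticks = iter([0, 60, 120, 180, 240, 300, 360])
throttle = CheckpointThrottle(min_interval_s=300.0, clock=lambda: next(ticks))
decisions = [throttle.should_write() for _ in range(7)]
print(decisions)
```

With a 60 s schedule and a 300 s minimum, only every fifth scheduled checkpoint is written, so a slower card that already takes longer than the minimum between checkpoints would be unaffected.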

I know, most effort is put in building the new client, so this might end up somewhere on the backlog with lower priority

Post by **toTOW** » Wed Nov 23, 2022 8:50 pm

Unfortunately, the OpenMM core used on GPUs doesn't support triggered checkpoints: it only works at a predefined frequency. OpenMM also performs checks (we call them sanity checks) between data computed on the GPU and data computed on the CPU before it writes a checkpoint, which explains why there are some interruptions in GPU load.

The Gromacs core used on CPUs is more flexible: you can set the checkpoint frequency, and it can write a checkpoint when the core is interrupted.
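
The trade-off between the two behaviours can be illustrated with a small sketch of how much work is lost on an interruption when checkpoints exist only at a fixed step interval. The 25000-step interval is the one reported for 18601 earlier in the thread; `steps_lost` is a hypothetical helper and the interruption points are made-up examples.

```python
def steps_lost(interrupt_step: int, checkpoint_interval: int) -> int:
    """Steps that must be recomputed after an interruption when checkpoints
    are written only at multiples of checkpoint_interval."""
    return interrupt_step % checkpoint_interval


# Fixed-frequency scheme: everything since the last multiple of 25000 is redone.
for step in (60_000, 74_999, 75_000):
    print(step, "->", steps_lost(step, 25_000), "steps to redo")
```

A core that can also write a checkpoint at the moment it is interrupted loses close to nothing in the same situation, which is why the fixed interval matters more for GPUs that are slow or interrupted often.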
