18601 checkpoints too often


Alex_Atkin
Posts: 39
Joined: Mon Oct 24, 2022 4:32 am

18601 checkpoints too often

Post by Alex_Atkin »

I'm noticing a waste of GPU resources on 18601, as it checkpoints every 25000 steps. On a 4090 that's under a minute; on a 3080 it's about every 2 minutes. Each checkpoint seems to take a few seconds, which surely adds up to a lot of wasted time over 24 hours.
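To put a rough number on that, here is a back-of-the-envelope sketch. The inputs are assumptions read off the post, not measurements: 55 s stands in for "under a minute", 120 s for "about every 2 minutes", and 3 s for "a few seconds" per checkpoint. The interval is treated as compute time between two pauses, and `checkpoint_overhead` is a made-up helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def checkpoint_overhead(interval_s: float, pause_s: float) -> tuple[float, float]:
    """Return (fraction of wall time spent paused, minutes lost per day).

    interval_s: compute time between two checkpoints (assumed value)
    pause_s:    time each checkpoint stalls the GPU (assumed value)
    """
    cycle_s = interval_s + pause_s      # one compute stretch plus one pause
    fraction = pause_s / cycle_s
    minutes_per_day = fraction * SECONDS_PER_DAY / 60
    return fraction, minutes_per_day

# Assumed inputs taken loosely from the post, see the note above.
for label, interval_s in (("4090", 55.0), ("3080", 120.0)):
    fraction, minutes = checkpoint_overhead(interval_s, pause_s=3.0)
    print(f"{label}: {fraction:.1%} of wall time, about {minutes:.0f} min/day")
```

With these guesses the pause costs roughly 5% of wall time on the faster card and about 2% on the slower one, so the faster the GPU, the larger the relative overhead of a fixed step interval.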

Why have the option to set the checkpointing frequency if it's ignored?

Conversely, I'm not seeing any checkpointing in the logs at all for aarch64 WUs, although looking in the data folder, the checkpoints do seem to be written.
Joe_H
Site Admin
Posts: 7922
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: 18601 checkpoints too often

Post by Joe_H »

The checkpoints on GPU projects are set by the researcher. They happen when important data is collected and retained for later analysis after the WU is returned. That is also when a sanity check is run on the CPU, against the data produced up to that point, to verify that the GPU is calculating properly. This was found to be necessary because unstable GPUs may not give any indication that there are errors in the processing of the WU data.

The algorithms used in the CPU processing cores based on GROMACS are different and can be interrupted on a timed basis. In the latest versions they also will attempt to write out a checkpoint when folding is paused. The OpenMM code used in the GPU folding core needs to be interrupted at certain points to be able to write out a usable checkpoint.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Alex_Atkin
Posts: 39
Joined: Mon Oct 24, 2022 4:32 am

Re: 18601 checkpoints too often

Post by Alex_Atkin »

Thanks, that's obviously more important than getting it done a little faster.
toTOW
Site Moderator
Posts: 6349
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 18601 checkpoints too often

Post by toTOW »

The checkpoint interval is usually chosen so that low-end GPUs don't lose too much compute time when they are interrupted ...

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
PaulTV
Posts: 201
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Re: 18601 checkpoints too often

Post by PaulTV »

Would be nice if:
- Checkpoints could be written without interrupting calculations (dunno how hard that would be if at all possible), or
- There would be something like 'if last checkpoint was within x minutes, skip this one', with default of 5 or 15m, configurable with an advanced setting - that way there are still checkpoints on whole percentages but it would auto adjust to the speed of the card
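The second suggestion could look roughly like the sketch below. It is only an illustration of the idea: `CheckpointThrottle`, `min_interval_s` and the 5-minute default are hypothetical names and values, not settings of any existing client or core. As noted earlier in the thread, GPU checkpoints also mark data collection and sanity checks, so skipping them may not actually be acceptable.

```python
class CheckpointThrottle:
    """Skip a scheduled checkpoint if the previous one is too recent."""

    def __init__(self, min_interval_s: float = 300.0):
        self.min_interval_s = min_interval_s  # hypothetical setting, default 5 min
        self.last_written_s = None            # time of the last checkpoint written

    def should_write(self, now_s: float) -> bool:
        recent = (
            self.last_written_s is not None
            and now_s - self.last_written_s < self.min_interval_s
        )
        if recent:
            return False                      # too soon, skip this one
        self.last_written_s = now_s
        return True

# Simulated fast card: a scheduled checkpoint every 60 s for 15 minutes.
throttle = CheckpointThrottle(min_interval_s=300.0)
written = [t for t in range(60, 901, 60) if throttle.should_write(float(t))]
print(written)  # [60, 360, 660]
```

A slow card whose scheduled checkpoints are already more than five minutes apart would write every one of them, which is the auto-adjusting behaviour described in the suggestion.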

I know, most effort is put in building the new client, so this might end up somewhere on the backlog with lower priority

Ryzen 5800X / RTX 4090 / Windows 11
Ryzen 5600X / RTX 3070 Ti / Ubuntu 22.04
Ryzen 5600 / RTX 3060 Ti / Windows 11
toTOW
Site Moderator
Posts: 6349
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 18601 checkpoints too often

Post by toTOW »

Unfortunately, the OpenMM core used on GPUs doesn't support triggered checkpoints: it only works at a predefined frequency. OpenMM also performs checks (we call them sanity checks) between data computed on the GPU and data computed on the CPU before it writes a checkpoint, which explains why there are some interruptions in GPU load.
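As a toy illustration of that flow (checkpoints only at a fixed step interval, each one gated by a GPU-versus-CPU comparison), consider the sketch below. It is not the folding core's code and does not use OpenMM's API. The function names, the choice of comparing a single energy value, and the tolerance are all assumptions made for the example; only the 25000-step interval comes from this thread.

```python
import math

CHECKPOINT_EVERY = 25_000  # steps, the interval reported for project 18601

def sanity_ok(gpu_value: float, cpu_value: float, rel_tol: float = 1e-4) -> bool:
    """Hypothetical check: GPU and CPU results agree within a tolerance."""
    return math.isclose(gpu_value, cpu_value, rel_tol=rel_tol)

def run(total_steps: int, gpu_energy, cpu_energy) -> list[int]:
    """Return the steps at which a checkpoint would be written."""
    written = []
    for step in range(CHECKPOINT_EVERY, total_steps + 1, CHECKPOINT_EVERY):
        # The GPU sits idle here while the CPU recomputes a reference value.
        if not sanity_ok(gpu_energy(step), cpu_energy(step)):
            raise RuntimeError(f"GPU/CPU mismatch at step {step}")
        written.append(step)
    return written

# Stand-in callables; a real core would evaluate the simulation state.
steps = run(100_000, gpu_energy=lambda s: -1234.5, cpu_energy=lambda s: -1234.5)
print(steps)  # [25000, 50000, 75000, 100000]
```

Because the schedule is counted in steps rather than seconds, nothing in this loop can react to a pause request between two multiples of the interval, which matches the limitation described above.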

The GROMACS core used on CPUs is more flexible: you can set the checkpoint frequency, and it can write a checkpoint when the core is interrupted.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.