What about checkpoints?
Posted: Mon May 21, 2012 2:57 pm
by iceman1992
bruce wrote:The fundamental concept is that you need to record all the information about the current state of the protein. That's a list of numbers. They have to describe the exact condition of everything that's happening so that, if you need to interrupt the folding process, you can later restart it at the exact point where you stopped and continue processing as if it had never been interrupted.
(A bit off topic) How many checkpoints does the client keep? I mean, if the latest checkpoint somehow gets corrupted, can it roll back to the previous checkpoint? And if I pause the client, why doesn't it take a snapshot right at that moment so I don't lose any progress?
bruce wrote:If you turn that list of numbers into a series of graphic images and run them as a video, you'll see a movie of the folding process, so in that sense, a checkpoint is similar to a real snapshot but it actually contains more information.
Speaking of the folding process movie, any new developments on the FAHViewer?
Re: Runs, Clones, Gens
Posted: Mon May 21, 2012 3:52 pm
by bruce
The client doesn't actually keep the checkpoints, the FahCore does. That's significant because various projects may use various FahCores, and therefore the answer can be different for different projects.
Nevertheless, the short answer is: one.
In the past, that was the only answer. A couple of recent cores have started keeping two checkpoints, but the second one isn't used yet: the code that would restart from the previous checkpoint when the primary is found to be corrupt hasn't been written. Nevertheless, there's hope for a future version.
There's a certain amount of data that is also retained and uploaded in the final result but it's probably not the entire contents of each checkpoint -- just enough to contain the important scientific data plus resume work from the final state.
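To illustrate the "two checkpoints with fallback" idea discussed above, here is a minimal sketch in Python. It is not how any FahCore or Gromacs stores checkpoints. The function names, the `.prev` suffix, the JSON payload and the SHA-256 check are all assumptions made for the example.

```python
import hashlib
import json
import os
from pathlib import Path


def save_checkpoint(state: dict, path: Path) -> None:
    """Write `state` to `path`, keeping the previous checkpoint as `<path>.prev`."""
    payload = json.dumps(state, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        # First line is the checksum, the rest is the serialized state.
        f.write(digest + "\n" + payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the cached data to disk
    if path.exists():
        # Rotate: the current checkpoint becomes the secondary one.
        os.replace(path, path.with_name(path.name + ".prev"))
    os.replace(tmp, path)


def _read_valid(path: Path):
    """Return the stored state, or None if the file is missing or corrupt."""
    try:
        digest, payload = path.read_text(encoding="utf-8").split("\n", 1)
    except (OSError, ValueError):
        return None
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != digest:
        return None
    return json.loads(payload)


def load_checkpoint(path: Path):
    """Prefer the primary checkpoint; fall back to the previous one."""
    for candidate in (path, path.with_name(path.name + ".prev")):
        state = _read_valid(candidate)
        if state is not None:
            return state
    return None  # nothing usable: the work would restart from the beginning
```

With this layout, a corrupt primary file costs one checkpoint interval of work instead of the whole WU, which is the benefit being asked for in this thread.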
Re: Runs, Clones, Gens
Posted: Tue May 22, 2012 4:33 am
by Joe_H
As for which cores keep two checkpoints, from looking at the work directories I know the SMP A3 and A4 cores do. Since the A5 core was developed from the A3 code, I would assume it does as well. Hopefully they will add the code or settings in the near future to use the secondary checkpoint if the primary is corrupt. I don't run into that often, but losing only 15 minutes of processing time instead of starting over from the beginning would be far preferable.
Re: Runs, Clones, Gens
Posted: Wed May 23, 2012 6:02 pm
by iceman1992
How much processing time is used when writing a checkpoint? If it is negligible, can't it write checkpoints more often (say, every time I pause the client)? That way I can pause and not lose any work.
Re: Runs, Clones, Gens
Posted: Wed May 23, 2012 6:11 pm
by jimerickson
It's not about processing time; I believe it is about disk activity. Writing lots of checkpoints requires lots of disk activity. That's why it defaults to 15 minutes. At least that's how I understand it.
Re: Runs, Clones, Gens
Posted: Wed May 23, 2012 6:14 pm
by bruce
The time to write a checkpoint depends on several factors, but it's not a large number so you can add more frequent checkpoints if you choose to.
The biggest disadvantage of too many checkpoints is that there's always a small risk of checkpoint corruption every time you stop and resume a WU. If you do that too often, you'll lose some work to corrupted checkpoints. This seems to be related to how quickly you either restart FAH or you shut down your computer. The active WU does take some time to shut down.
Re: What about checkpoints?
Posted: Wed May 23, 2012 6:24 pm
by iceman1992
jimerickson wrote:It's not about processing time; I believe it is about disk activity. Writing lots of checkpoints requires lots of disk activity. That's why it defaults to 15 minutes. At least that's how I understand it.
Well other than the risk of corrupting the checkpoint that bruce mentioned, is there any other reason not to shorten the interval?
bruce wrote:The biggest disadvantage of too many checkpoints is that there's always a small risk of checkpoint corruption every time you stop and resume a WU.
That's why we need the fahcore to keep more than 1 checkpoint
bruce wrote:This seems to be related to how quickly you either restart FAH or you shut down your computer. The active WU does take some time to shut down.
I always make sure the log says "Interrupted" and I wait for the log to finish showing things before turning off my computer. Does that reduce the risk?
Re: What about checkpoints?
Posted: Wed May 23, 2012 6:38 pm
by bruce
iceman1992 wrote:bruce wrote:This seems to be related to how quickly you either restart FAH or you shut down your computer. The active WU does take some time to shut down.
I always make sure the log says "Interrupted" and I wait for the log to finish showing things before turning off my computer. Does that reduce the risk?
Yes. (I don't remember the last time I had a corrupt checkpoint, but I rarely shut down and I'm careful to wait.)
Re: What about checkpoints?
Posted: Wed May 23, 2012 6:55 pm
by 7im
iceman1992 wrote:Well other than the risk of corrupting the checkpoint that bruce mentioned, is there any other reason not to shorten the interval?
Yes, back in the day, I tested checkpoints at 3 minutes and checkpoints at 30 minutes. Over a full day, it can add one to several minutes (depending on WU size) if you write a lot of checkpoints. Minutes add up to a long time when you are folding thousands of WUs. If you don't stop and start your client very often, then you don't need checkpoints every 3 minutes. 15 is the recommended setting for a reason.
iceman1992 wrote:That's why we need the fahcore to keep more than 1 checkpoint
And that's why we've had that as a feature request for a VERY long time.
Gromacs only recently added support for that feature. FAH has yet to incorporate it.
Re: What about checkpoints?
Posted: Thu May 24, 2012 5:49 am
by Joe_H
iceman1992 wrote:I always make sure the log says "Interrupted" and I wait for the log to finish showing things before turning off my computer. Does that reduce the risk?
It helps with some FAHcores, but the log entry does not actually have a direct relationship with whether a checkpoint was written correctly. Beyond that, from what I have noticed over the years of folding, different cores write checkpoints based on different criteria. Most follow the setting in the client, i.e. every 15 minutes; others write a checkpoint after a set number of steps. Some WUs and cores will correctly process an interrupt and write out a checkpoint before exiting, and others don't.
To be more certain on whether a checkpoint has been done and written to disk you need to look into the work directory for the WU and watch the last modification time of the checkpoint file. Since files are first written to memory cache from the running core, you do need to add a bit of time to be sure the entire file has been flushed to disk. That extra bit of time will depend on which OS you are folding on; Linux, Windows and OS X all handle flushing the cached file writes slightly differently and OS settings can modify that action. But generally waiting an extra minute or two past the file modification time is enough.
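The check described above (look at the checkpoint file's last modification time, then allow extra time for the flush) can be sketched in Python as follows. The function names and the 120-second default margin are assumptions for illustration; the margin simply reflects the "extra minute or two" rule of thumb. The path to the checkpoint file has to be supplied by you, since the file name depends on the core.

```python
import time
from pathlib import Path
from typing import Optional


def seconds_since_modified(path: Path, now: Optional[float] = None) -> float:
    """Seconds elapsed since the file at `path` was last modified."""
    current = time.time() if now is None else now
    return current - path.stat().st_mtime


def looks_flushed(path: Path, margin_s: float = 120.0,
                  now: Optional[float] = None) -> bool:
    """True if the file has been untouched for at least `margin_s` seconds.

    This is only a heuristic: it cannot prove the OS has flushed its cache,
    it just applies the waiting period suggested in the post above.
    """
    return seconds_since_modified(path, now) >= margin_s
```

Calling `looks_flushed(Path("/path/to/checkpoint/file"))` before shutting down gives a quick yes/no answer instead of comparing timestamps by eye.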
Re: What about checkpoints?
Posted: Thu May 24, 2012 5:57 am
by Jesse_V
Joe_H's recommendation really is an effective way to see if the checkpoint has been written. I've rarely had a problem with corrupted checkpoints; it seems like a pretty rare event. But if you want to be absolutely sure that it doesn't occur for you, use that method.
Re: What about checkpoints?
Posted: Fri May 25, 2012 1:02 pm
by iceman1992
7im wrote:Yes, back in the day, I tested checkpoints at 3 minutes and checkpoints at 30 minutes. Over a full day, it can add one to several minutes (depending on WU size) if you write a lot of checkpoints. Minutes add up to a long time when you are folding thousands of WUs. If you don't stop and start your client very often, then you don't need checkpoints every 3 minutes. 15 is the recommended setting for a reason.
But if the client is paused a lot, minutes of lost work add up too. Thanks for the info, I'll keep it at 15 for now.
Joe_H wrote:To be more certain on whether a checkpoint has been done and written to disk you need to look into the work directory for the WU and watch the last modification time of the checkpoint file. Since files are first written to memory cache from the running core, you do need to add a bit of time to be sure the entire file has been flushed to disk. That extra bit of time will depend on which OS you are folding on; Linux, Windows and OS X all handle flushing the cached file writes slightly differently and OS settings can modify that action. But generally waiting an extra minute or two past the file modification time is enough.
How much time is generally required to write a checkpoint?
bruce wrote:Yes. (I don't remember the last time I had a corrupt checkpoint, but I rarely shut down and I'm careful to wait.)
Well I don't shut down often either. I usually hibernate.
To Jesse_V and Joe_H, where do I find the work directory on Linux? I tried and couldn't find it.
Re: What about checkpoints?
Posted: Fri May 25, 2012 3:00 pm
by Joe_H
iceman1992 wrote:How much time is generally required to write a checkpoint?
From the small to medium sized WU's that I have been doing recently, just a few seconds. But that is written to cache in system RAM. It may take a bit longer when dealing with large and bigadv WU's. The vulnerable time is between when the checkpoint is written to RAM and when the entire contents are flushed to the hard drive's media. There is also caching going on in the drive between the interface and the media, which depends on the drive settings and can also be an issue sometimes. Some of the file metadata is already updated at the time the checkpoint is written and in cache.
As for the time before cache contents are flushed to disk, as I said, that depends. A recent version of OS X, for instance, had a system process that once a minute flushed all cached writes that were still pending. I have not checked if that has been changed in current versions. Windows and Linux have similar processes; it just has been a while since I looked up what their defaults were. With Linux, the choice of filesystem type and its settings will vary that even more.
The gap in time between a checkpoint write and it being fully flushed to disk is another reason to not do them too frequently. That increases the chance that you will interrupt and corrupt the checkpoint. To use the current minimum setting of 3 minutes as an example, each time it is done there would be up to a minute period on that version of OS X I mentioned where it was not on disk. Even if the average was 30 seconds, with 20 checkpoints an hour that adds up to potentially 1/6th of the time during which you could corrupt the checkpoint. With 15 minutes it is reduced to about 1 in 30. Anyway, when I do check first before shutting down folding or my machine, I have yet to get a corrupted checkpoint. I have had a few corrupted when I did not check.
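The arithmetic in the paragraph above can be reproduced with a few lines of Python. The function name is made up for this example, and the 30-second average unflushed window is the assumption taken from the post, not a measured value.

```python
def vulnerable_fraction(interval_min: float, avg_unflushed_s: float) -> float:
    """Fraction of each hour during which the newest checkpoint is not yet on disk.

    interval_min:     minutes between checkpoints (e.g. 3 or 15)
    avg_unflushed_s:  assumed average seconds between the write and the flush
    """
    checkpoints_per_hour = 60.0 / interval_min
    return checkpoints_per_hour * avg_unflushed_s / 3600.0


if __name__ == "__main__":
    for minutes in (3, 15):
        print(minutes, "min interval:", vulnerable_fraction(minutes, 30.0))
```

For a 3-minute interval this gives 20 x 30 s = 600 s per hour, or 1/6. For a 15-minute interval it gives 4 x 30 s = 120 s per hour, or 1/30.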
iceman1992 wrote:To Jesse_V and Joe_H, where do I find the work directory on linux? Tried and couldn't find it.
It has been a while since I tried a Linux install in a VM, so no idea where current clients stick the work directory. At one time they used similar paths to what is used for OS X, but that may have changed.
Re: What about checkpoints?
Posted: Fri May 25, 2012 3:19 pm
by iceman1992
I forgot to check last night. I was in a rush but fortunately the time it took me to type in the "sudo pm-hibernate" command and the password was enough. The checkpoint was not corrupted. That was like 6-7 seconds after hitting pause. I fold normal WUs, not big or bigadv. But yeah I'm keeping it at 15.
Joe_H wrote:It has been a while since I tried a Linux install in a VM, so no idea where current clients stick the work directory. At one time they used similar paths to what is used for OS X, but that may have changed.
Ah okay then I'll find it one day lol
Re: What about checkpoints?
Posted: Fri May 25, 2012 11:57 pm
by Stonecold
iceman1992 wrote:Joe_H wrote:It has been a while since I tried a Linux install in a VM, so no idea where current clients stick the work directory. At one time they used similar paths to what is used for OS X, but that may have changed.
Ah okay then I'll find it one day lol
For the *buntus, it is in the folder ".FAHClient" in your home directory (it's a hidden folder, so you'll have to turn on "show hidden files"). Under .FAHClient is "work", which is the work directory. .FAHClient also has all the configs, logs, cores, etc. I assume it's the same for other Linux distros, but I'm not sure.
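As a small sketch of the location described above, the following Python snippet builds the path to the work directory and lists whatever files are in it. The `.FAHClient/work` path comes from this post. The function names are made up for the example, and on other distros or client versions the directory may live elsewhere, in which case the listing simply comes back empty.

```python
from pathlib import Path
from typing import List, Optional


def work_dir(home: Optional[Path] = None) -> Path:
    """Path to the work directory under the hidden .FAHClient folder."""
    base = Path.home() if home is None else home
    return base / ".FAHClient" / "work"


def list_work_files(home: Optional[Path] = None) -> List[str]:
    """Relative paths of all files under the work directory (empty if absent)."""
    directory = work_dir(home)
    if not directory.is_dir():
        return []
    return sorted(
        str(p.relative_to(directory))
        for p in directory.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    print(work_dir())
    for name in list_work_files():
        print("  ", name)
```

Because the folder name starts with a dot, it does not show up in a plain `ls` or in a file manager with default settings, which is probably why it was hard to find.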