manually trigger checkpoint save

Post by **anonymoussuomynona** » Thu Jan 23, 2014 9:28 pm

On a particular Mac Mini, FAH 7.3.6 runs on full most of the time, but we cannot leave it on full or even medium constantly, because that keeps the Mini from doing its day job: keeping other computers on a network happy. A few times a day a free App Store app, LaunchOnTime, automatically runs AppleScripts that change the folding power, e.g. do shell script "FAHClient --send-command 'option power full'"

The scripts work great, but when I looked at the FAH log it appears that each time there is a new setting, the client needs to shut down & restart with a new config file. Without the client, FAHCore shuts down until the client is back running & restarts it. Each time we were losing up to 30 minutes of work it had completed. I had the checkpoint frequency set at 30 minutes, before I realized that we were losing a lot of work.

Is there a way to set things up so that FAH exits more gracefully & saves its work before it shuts down? On this machine there is no hurry, it can take its time to quit. I assume there is no way to trigger client to ask for a save before FAHCore dies & takes several minutes of computations with it, since any change to client causes it first to restart to run off of a new config file. Is it possible to trigger FAHCore directly?

This Mini has an uninterruptible power supply & normally runs for months 24/7 unless we need to restart it for some reason. I'd love to set the checkpoint frequency back to 30 minutes, since FAHControl says that setting a short checkpoint frequency decreases performance. I already tried pause instead of changing the power level, but either way FAHCore is shut down & computations are lost.

Alternatively, is there a way to know when the client saves to the HD, so I could change a setting or restart immediately after? Also, if I need to upgrade some software or something & restart, I'd like to time it so not to lose computations already done. I looked at logging options & didn't see anything that would be of help.

If there is no way to manually trigger a saving of data to the HD, how do we calculate how much performance we are losing each time there is a checkpoint save to the HD, so that that loss can be compared to the average number of minutes being lost each time the client restarts?

I assume there must be more to it than a quick write to the HD of some of the contents of RAM. FAHCore uses very little RAM & the drives in the Mini are 7200 rpm, 6 Gb/s. Is it safe to assume that FAHCore needs to do some number crunching to get things into a form that it can save, or why would there be a concern about performance? Since this is a Mac & therefore there is no GPU involvement, does that make any difference in how much performance is reduced every time FAH needs to save to the HD?

I'm sure I'm not even asking the right questions. Any insights would be most appreciated. Thanks in advance.

Post by **Joe_H** » Thu Jan 23, 2014 9:52 pm

At this time, checkpoints and the code that decides when to write them are handled by the folding core process, FAHCore_A3 or A4 on your Mac Mini. There is no way to manually trigger a checkpoint; those two cores write one each time the interval passed to them from the FAHControl settings elapses. However, the difference in performance between the default setting of 15 minutes and 30 minutes is negligible from my tests a couple years ago. That said, I would not set the checkpoint frequency too low, then you will start to see performance issues from the frequent disk writes and file creation.

As for detecting when the checkpoint is written, I have determined that from examining the file modification time of the checkpoint file in the work directory. If you then wait about a minute to make certain the contents are fully flushed to the drive, then pausing the client will have the least effect on progress when restarted.

Post by **calxalot** » Thu Jan 23, 2014 9:53 pm

Every time you pause or change power level, the client will stop the core and core wrapper, losing all work since the last checkpoint.
There is no way to force a checkpoint to be written. The cores write one whenever they are in the right state.
You can set a requested checkpoint interval in FAHControl, but it will be ignored by most cores.

If you don't want to lose work, leave power level at full, and use

FAHClient --send-finish

instead of pause or power level light.

Send

FAHClient --send-unpause
FAHClient --send-finish

to resume folding/start a new work unit and pause when that unit is done.

Edit: I would only do the unpause+finish routine once per day, at end of business. If a WU is not done the next morning, do a pause.

Post by **calxalot** » Thu Jan 23, 2014 10:00 pm

If you want to check file mod times to guess when a checkpoint was written, your tool needs to be smart enough to know the queue numbers will change.
E.g., you can't just have launchd watch the directory ../FAHClient/work/00/.
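
The file-modification-time approach described by Joe_H, combined with calxalot's warning about changing queue numbers, might be sketched as follows in Python. This is only a sketch under stated assumptions: the thread does not name the checkpoint file, so the file name pattern is a caller-supplied parameter, and the function names and the 60/180 second thresholds are illustrative rather than anything FAH provides. The search covers every queue subdirectory under the work directory instead of a fixed work/00/.

```python
import time
from pathlib import Path
from typing import Optional, Tuple


def newest_checkpoint(work_dir: Path, pattern: str) -> Optional[Tuple[Path, float]]:
    """Return (path, mtime) of the most recently modified file matching
    `pattern` anywhere below `work_dir`, or None if nothing matches.

    Searching recursively avoids hard-coding a queue number such as 00.
    `pattern` is an assumption: pass whatever the checkpoint file is
    actually called on your machine.
    """
    newest = None
    for path in work_dir.rglob(pattern):
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime
        if newest is None or mtime > newest[1]:
            newest = (path, mtime)
    return newest


def safe_to_pause(work_dir: Path, pattern: str,
                  settle_seconds: float = 60.0,
                  max_age_seconds: float = 180.0,
                  now: Optional[float] = None) -> bool:
    """True when the newest checkpoint is old enough to have been flushed
    to disk (about a minute, per the advice above) but still recent enough
    that little work has accumulated since it was written."""
    found = newest_checkpoint(work_dir, pattern)
    if found is None:
        return False
    current = time.time() if now is None else now
    age = current - found[1]
    return settle_seconds <= age <= max_age_seconds
```

A scheduled script could call safe_to_pause in a loop and only send the pause or power-level command once it returns True. Matching on a file name pattern rather than "newest file in the directory" matters because other files in the work directory may be written more often than the checkpoint.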

Post by **anonymoussuomynona** » Thu Jan 23, 2014 10:37 pm

Joe_H wrote: the difference in performance between the default setting of 15 minutes and 30 minutes is negligible from my tests a couple years ago. That said, I would not set the checkpoint frequency too low, then you will start to see performance issues from the frequent disk writes and file creation.

How low is too low when the client shuts down FAHCore_a4 usually 2 to 4 times a day? The client runs full at night & medium or light during the day depending on the day of the week, except when it needs to be turned off or paused for an hour.

calxalot, I did use send-finish before I started using the AppleScripts. It did work great, but often that one WU would finish within a few hours & then the Mini would only use 35-60% of one of its 4 CPUs the rest of the night doing its regular job.

Post by **bruce** » Thu Jan 23, 2014 10:58 pm

The checkpoint interval can be set to values between 3 and 30 minutes. If we assume that we're talking about a FahCore which accepts that setting (some cores ignore that setting, but not these) then the average amount of time lost is going to be half of the checkpoint interval. Since the events are asynchronous, if you have it set to 30, half of the core shutdowns will lose less than 15 minutes and half will lose more than 15.
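
The half-the-interval estimate can be checked with a small simulation. A minimal sketch in Python, assuming checkpoints land exactly every `interval_min` minutes and that a shutdown arrives at a uniformly random moment of the day; the function name is illustrative and not part of any FAH tool.

```python
import random

MINUTES_PER_DAY = 24 * 60


def simulate_mean_loss(interval_min: float, trials: int = 100_000, seed: int = 0) -> float:
    """Average minutes of work lost when the core is stopped at a random time.

    Assumes a checkpoint is written exactly every `interval_min` minutes,
    so the work lost is the time elapsed since the most recent checkpoint.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        stop_time = rng.uniform(0.0, MINUTES_PER_DAY)
        # Time since the last checkpoint at the moment of the shutdown.
        total += stop_time % interval_min
    return total / trials


if __name__ == "__main__":
    for interval in (3, 15, 30):
        mean = simulate_mean_loss(interval)
        print(f"{interval:>2} minute interval: about {mean:.2f} minutes lost per stop")
```

Real cores only checkpoint when they reach a suitable point in the simulation, so actual losses will scatter around these idealized averages.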

Even if there were a way to pass a setting of "Checkpoint ASAP and shut down" to the FahCore, you'd have to shut it down and restart it in order to give it that instruction, so it can't be a useful setting.

This happens to be an issue dear to my heart. The best I can offer is an enhancement request which conceivably would give us access to the information. Ticket #1085. Unfortunately there's no way to predict when such an enhancement might appear in a future client.

Post by **anonymoussuomynona** » Thu Jan 23, 2014 11:44 pm

How involved is a checkpoint save? If it were to take a fraction of a second of the core's time, then I assume that they would simply set the core to make the save often & eliminate the option to change the checkpoint from the configuration menu. If we knew how time consuming the save was, then we could balance that against the average data lost each time the core restarted.

Post by **anonymoussuomynona** » Fri Jan 24, 2014 12:00 am

Concerning future enhancements: According to control's log, the core shuts down when "client is no longer detected." Wouldn't it be nice if the core, instead of immediately shutting down & abandoning work already done, would first save?

Post by **bruce** » Fri Jan 24, 2014 12:18 am

The idea of immediately doing a save may seem reasonable, but it's really not. First, a save can only be done at specific times when the FahCore reaches specific points within the analysis. The FahClient doesn't know how to predict when the core will be in that state again, so allowing the simulation to run until the next one may take a "long" time. Second, the time to actually write a checkpoint once that state has been reached depends on a number of "other" factors, so the time until the checkpoint is completed is even "longer."

The worst combination of events would be the request for a FahCore to shut down received from the OS (e.g., during the first phase of a reboot). If the OS requests the FahCore to stop and the FahCore initiates a save, there's a good chance that the save will be incomplete when the OS finally times out and kills the FahCore because it has not responded yet. An incomplete save results in a corrupt checkpoint, and recovery is to restart the WU from zero since a valid checkpoint no longer exists.

Post by **sortofageek** » Fri Jan 24, 2014 12:33 am

anonymoussuomynona, please check your private messages.

Thanks.

Post by **anonymoussuomynona** » Fri Jan 24, 2014 2:03 am

Bruce, I appreciate your comments. Could you be more specific? If the save can only be done at specific times, then how does that tie into the checkpoint frequency? Currently I have the checkpoint set at 3 & I only lose a minute or two of data each time the client restarts, compared to an average of 15 minutes when I had the frequency set to 30. This specific time must come around fairly often.

I'm sure you are completely correct in saying that the time to write a checkpoint depends on a number of factors, more than just the size of the WU & the speed of the computer. It would be interesting to know even roughly what is involved & how long it takes. With everything average, does it steal a second of cpu time from the core's folding computations? Does it steal a minute? The minimum the frequency can be set is 3 minutes, so I assume under the worst case, a large WU, power setting: light, on a slow machine, a save takes less than 3 minutes. Even if a save does take up to three minutes, is most of that farmed off to another process running parallel, that uses little cpu time, or does the core spend a lot of cpu time crunching things into some format that can be saved? In other words, how is performance affected?

It has been decades since I was writing code for an Apple II being used to control equipment, but even back then we never trashed the old save until a new save was completed & verified. I find it unlikely that the writers of the code for FAH would let the core be confused by an incomplete save. This is assuming FAHCore cannot tell the difference between the OS telling it to stop & FAHCore finding that the "client is no longer detected" & take the time for a more orderly shut down. I'm not trying to belittle your assertion that what I'm suggesting is complex. I'm sure it is. If it were easy, I assume they would have done it already. Maybe I shouldn't have mentioned the ideal, it has nothing to do with how to configure things today.

Even if someday the core can do a more orderly shutdown, for today we know that the core does not save before shutting down. How do I calculate what the most effective checkpoint frequency is? I know what the average loss is each time the client restarts depending on the checkpoint frequency, but I have no info on how much each save slows down the core, even in the most general terms. Anyone have any info that would shine a light on this?

Post by **bruce** » Fri Jan 24, 2014 2:19 am

I think it does come around fairly often but as I implied, that frequency changes based on the complexity of the protein. If you specify 3 minutes, and if the next opportunity after 3 minutes isn't until 3.5 minutes, you might not notice if it just keeps running anyway. On the other hand, if you paused folding or asked the OS to shut down and it didn't start writing the checkpoint for half a minute, you would conceivably restart the client or power off the computer just about the time the previous invocation started writing the checkpoint. Add that to up to another minute before the FahCore is notified that a pause has been requested and you might notice it's "hung."

Good programming suggests that a Pause or Shutdown must be avoided between the time a checkpoint Starts being written and it Finishes writing it. Adding additional delays before actually Starting is not a good programming practice. Set the checkpoint interval shorter than the 30 minutes you were using (average loss, 15 minutes) and if possible, minimize the number of times per day you have to pause/resume.

Assuming the default of 15 minutes (average loss 7.5), you're talking about half a percent productivity loss for those people who pause/resume twice a day, and less for those who run longer periods of time.

Post by **anonymoussuomynona** » Fri Jan 24, 2014 3:10 am

Being that it is now 24 hours since I switched to 3 minute checkpoints, I just compared the last 24 hours with 4 restarts, to the days before with 30 minute checkpoints & 4 restarts. It is very hard to compare, especially on this computer since sometimes its day job keeps it busier on some days than others, but it is possible that 3 minute checkpoints may be worse than having 30 minute checkpoints even with client restarting 4 times a day. There isn't a drastic reduction. The difference could be because of other factors.

Bruce, currently the core stops almost instantaneously. Since no one so far has been able to say whether a save would add a second or a minute, I do not understand the purpose of your theorizing. Here is the math I did before I started this thread: a day has 1,440 minutes, & I follow Joe_H's advice that a 15 minute checkpoint has a negligible effect on performance compared to 30 minutes. Four restarts, averaging 7.5 minutes lost each, equals 30 minutes. 30/1440 = a little over 2%.
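
The arithmetic in this post can be written out as a quick check. A minimal Python sketch, assuming (as stated earlier in the thread) that each restart loses half the checkpoint interval on average; the function name is illustrative only.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440


def daily_restart_loss(checkpoint_interval_min: float, restarts_per_day: int):
    """Return (minutes_lost, percent_of_day) from restarts alone.

    Assumes the average loss per restart is half the checkpoint interval.
    The cost of writing the checkpoints themselves is not included.
    """
    minutes_lost = restarts_per_day * checkpoint_interval_min / 2
    percent_of_day = 100 * minutes_lost / MINUTES_PER_DAY
    return minutes_lost, percent_of_day


if __name__ == "__main__":
    minutes, percent = daily_restart_loss(15, 4)
    print(f"{minutes:.1f} minutes lost per day, {percent:.2f}% of the day")
```

For 15 minute checkpoints and four restarts this gives 30 minutes, or about 2.08% of a day, matching the figure above.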

I've realized from the beginning that losing a few percent is not going to end the world. A programmer thought that being able to set the checkpoint frequency was worth the work of adding it to the things we can reconfigure. If anyone out there can shed some light on how this option changes performance, I would be interested to hear what they have to say.

Post by **7im** » Fri Jan 24, 2014 7:09 am

My PC runs for weeks without a shutdown. Why are four restarts needed? That's more wasteful than any checkpoint method being discussed.

Yes, I did some checkpoint testing back in the day using a single core client. So take that with a grain of salt for this newer client with more data to write. I ran the same work unit at 3 minute intervals, and then again at 30. I could not see the difference watching a few frames, but I could measure it over a full day's time. The resulting difference was a few minutes a day. Maybe a couple of points difference. Not worth much at first glance. But a few minutes a day, over many months, multiplied by 200,000 active clients really starts to add up to a lot of wasted time by setting the checkpoint to 3 minutes if it's not really needed.

For example, if a checkpoint adds 1 second of delay, a 3 minute setting adds 8 minutes of delay each day. At 15 minutes, that's only 1 minute 36 seconds delay.
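
That example generalizes to any interval and can be set against the restart loss discussed earlier in the thread. A rough Python sketch: the 1 second per checkpoint figure is only the illustrative value used above (the real cost is not known from this thread), the half-interval loss per restart is an assumption, and all function names are made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def checkpoint_overhead_min(interval_min: float, seconds_per_checkpoint: float) -> float:
    """Minutes per day spent writing checkpoints at the given interval."""
    checkpoints_per_day = MINUTES_PER_DAY / interval_min
    return checkpoints_per_day * seconds_per_checkpoint / 60


def restart_loss_min(interval_min: float, restarts_per_day: float) -> float:
    """Minutes per day lost to restarts, assuming half an interval each."""
    return restarts_per_day * interval_min / 2


def total_daily_cost_min(interval_min: float, seconds_per_checkpoint: float,
                         restarts_per_day: float) -> float:
    """Checkpoint overhead plus restart loss, in minutes per day."""
    return (checkpoint_overhead_min(interval_min, seconds_per_checkpoint)
            + restart_loss_min(interval_min, restarts_per_day))


if __name__ == "__main__":
    # Assumed inputs: 1 second per checkpoint, 4 restarts per day.
    for interval in (3, 7, 15, 30):
        overhead = checkpoint_overhead_min(interval, 1.0)
        total = total_daily_cost_min(interval, 1.0, 4)
        print(f"{interval:>2} min: overhead {overhead:.1f}, total {total:.1f} min/day")
```

With 1 second per checkpoint the overhead is 8 minutes a day at 3 minutes and 1.6 minutes (1 minute 36 seconds) at 15 minutes, as in the example above. Which interval minimizes the total depends entirely on the assumed checkpoint cost and restart count, so a costlier checkpoint or fewer restarts shifts the balance toward longer intervals.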

If I restarted once a day, I would lose about the same amount of time, on average, with either setting. But a checkpoint can add more than 1 second.

The new GPU fahcore, as mentioned above, has no user configurable checkpoint. And the researcher can set the interval at each 1, 2, 5, or 10% completed, because those checkpoints last half a minute or longer. So fewer checkpoints are better for the science getting done faster, assuming you don't shut down the client very often.

On machines that fold full time, I set it to 30 minutes. On machines that don't fold full time, I set it to 7 minutes. It's a good compromise. But each situation is a little different. You can choose how often you reboot, and you can choose your interval. I never go lower than 7 min. 3 min seems like a waste to me. And like you said, the daily use of the computer likely has more impact on performance than worrying about a few minutes lost on a client restart. No need to put a lot of effort into testing all this again, unless you have more than enough time to spare chasing minutiae.

I used to have that kind of time. Now I've learned the default setting is a good compromise for most people. And there are better ways of getting more performance than worrying about a few lost minutes on a restart, in my opinion. You do what you think best.

Folding Forum
