My computer was working on project 7647 (Win7 host, Ubuntu VM) when the VirtualBox window went all black, though FAH was still running.
Because I hadn't clicked finish yet, I had to power off and restart the VM, and I think the checkpoint got corrupted or something.
It's unfortunate, since 7647 is one of the longer projects and I was about to get around 14K to 16K points from the unit.
Yes. I'm sure these things are not rare (although it was the first and only time for me), which makes me think a secondary checkpoint at longer intervals might be a good idea.
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
7im wrote: Gromacs already supports using multiple checkpoints. FAH is yet to use this feature.
I'll bet that's a Gromacs issue, not a FAH issue.
FAH cores are built with whichever version of Gromacs is current at the time, and unless a new version of Gromacs provides a scientific upgrade that FAH needs, they probably don't get updated from the older version. It seems likely that Gromacs started writing multiple checkpoints but had not yet started reading multiple checkpoints. If somebody follows the Gromacs change-log, they might even find enough information to be able to say that today's Core-X is based on Gromacs version Y and needs to use Gromacs version Y+1.
From Vijay Pande in 2008:
viewtopic.php?f=16&t=1912&p=16570#p16570
VijayPande wrote:
314159 wrote:
I want to log a formal protest:
1. Error trapping - why not save a checkpoint (i.e. two vs. one) and have the client attempt continuation from the former point if the error is as you described?
Note that I have baby-sat MANY 0x? SMP's to completion on their second run simply by backing up, exiting client, restarting client (several times). 100% success to date. If that additional checkpoint had been available, I would bet that the stack error or whatever it was that caused the 0x? would NOT have occurred. This technique might be something that could be applied to all classes of WUs besides the SMPs (perhaps not the PS3's). Coding would be trivial.
This has come up in our dev discussions. However, getting SMP/Windows to run more stably has taken precedence here. We've been putting dev time into this, since there are so many potential SMP/Windows clients out there, but we can't tap them until SMP/Windows becomes more stable. If you feel like this issue is as significant in donor base and donor interest as a stable SMP/Windows client, let me know and we can consider rearranging priorities.
One would hope the SMP fahcores are considered much more stable now, 4 years later, with V7 and BigAdv rocking on Gromacs versions that already support multiple checkpoints. The Gromacs code in the fahcores has supported 2 checkpoints going back several revisions of the fahcores already. Like I said, FAH is yet to implement this feature.
Maybe V7 will bring in more processing power than having 2 checkpoints would save. It's also funny to give stability priority when having 2 checkpoints to fall back on would also help save the science. 2 checkpoints would have been a great band-aid 4 years ago while they worked out the stability problems.
Since the recent cores already keep 2 checkpoints, my guess is that an update to them might be enough to enable recovery from the older checkpoint if the newest is not usable. But that is only a guess, based on the cores using recent enough Gromacs code to write double checkpoints. I can see that by looking at the files in my work directory: there is a current checkpoint file and one marked as "prev" that is 15 minutes older. I see that with the A4 WU in process now, and saw it in the past when working on A3 WUs.
No, the cores write a checkpoint and rename the previous one. I would have to be able to look at the actual code to see how they have sequenced these operations as it happens too quickly to be certain from watching the directory. But it should be doing the rename and then writing the new checkpoint.
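For anyone curious what a rename-then-write sequence could look like, here is a minimal Python sketch. It is not the fahcore or Gromacs code, which I haven't seen; the function name and the file names `checkpoint.cpt` and `checkpoint_prev.cpt` are made up for illustration. It also writes the new data to a temporary file first, which is my own assumption about how to keep a crash mid-write from damaging either existing file.

```python
import os
import tempfile


def write_checkpoint(directory, data,
                     name="checkpoint.cpt", prev_name="checkpoint_prev.cpt"):
    """Keep the last checkpoint as prev_name, then publish the new one.

    All names here are hypothetical placeholders, not real fahcore names.
    """
    current = os.path.join(directory, name)
    previous = os.path.join(directory, prev_name)

    # Write the new state to a temporary file in the same directory and
    # force it to disk, so neither existing checkpoint is touched yet.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())

    # Rename the current checkpoint to "prev" first...
    if os.path.exists(current):
        os.replace(current, previous)
    # ...then move the fully written new file into place.
    os.replace(tmp_path, current)
```

With this ordering, at any moment there is at least one complete checkpoint on disk: either the old current one, the "prev" copy, or the finished temporary file.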
Ah, okay then. So it's just a case of the core reading the previous checkpoint? It doesn't sound too difficult to implement, so let's hope it comes soon.
It's never that simple... Assuming the first checkpoint is bad, for whatever reason, just reading the previous checkpoint is not enough. It probably should be tested extra well for corruption to determine what caused the previous checkpoint to not work. A bad work unit is a bad work unit, even if you can go back 1 frame before it went bad. Then you have to make sure it doesn't get stuck in a loop, etc.
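To make those concerns concrete, here is a rough Python sketch of a fallback loader that verifies each checkpoint before trusting it and gives up after a fixed number of restarts instead of looping forever. Everything in it is an assumption for illustration: the SHA-256 prefix, the `pack`/`unpack`/`load_with_fallback` names, and the restart limit of 3 are mine, not anything FAH or Gromacs is known to do.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def pack(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest (hypothetical format)."""
    return hashlib.sha256(payload).digest() + payload


def unpack(blob: bytes):
    """Return the payload if the digest matches, otherwise None."""
    if len(blob) < DIGEST_SIZE:
        return None
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        return None
    return payload


def load_with_fallback(paths, restarts_so_far, max_restarts=3):
    """Try checkpoints newest first; refuse to retry indefinitely."""
    if restarts_so_far >= max_restarts:
        # A unit that keeps failing is treated as bad rather than retried.
        raise RuntimeError("too many restarts; giving up on this work unit")
    for path in paths:
        try:
            with open(path, "rb") as f:
                payload = unpack(f.read())
        except OSError:
            continue  # missing or unreadable file: try the next one
        if payload is not None:
            return path, payload
    raise RuntimeError("no usable checkpoint found")
```

A checksum only catches a damaged file. It cannot tell whether the simulation state itself had already gone bad before the checkpoint was written, which is the harder part of the problem.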