Lost work at >95%

iceman1992 · Post by **iceman1992** » Tue Aug 07, 2012 6:24 am

My computer was working on project 7647 (Win7 host, Ubuntu VM) when the VirtualBox window went all black, although FAH was still running.
Because I hadn't clicked finish yet, I had to power off and restart the VM. I think the checkpoint got corrupted or something.
It's unfortunate, since 7647 is one of the longer projects and I was about to get around 14K to 16K points from the unit.


*********************** Log Started 2012-08-06T17:21:42Z ***********************

17:21:42:************************* Folding@home Client *************************

17:21:42:    Website: http://folding.stanford.edu/

17:21:42:  Copyright: (c) 2009-2012 Stanford University

17:21:42:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>

17:21:42:       Args: --lifeline 1749 --command-port=36330

17:21:42:     Config: /home/ubuntu/.FAHClient/config.xml

17:21:42:******************************** Build ********************************

17:21:42:    Version: 7.1.52

17:21:42:       Date: Mar 20 2012

17:21:42:       Time: 13:19:11

17:21:42:    SVN Rev: 3515

17:21:42:     Branch: fah/trunk/client

17:21:42:   Compiler: GNU 4.6.2

17:21:42:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math

17:21:42:             -fno-unsafe-math-optimizations -msse2

17:21:42:   Platform: linux2 3.2.0-1-amd64

17:21:42:       Bits: 64

17:21:42:       Mode: Release

17:21:42:******************************* System ********************************

17:21:42:        CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz

17:21:42:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7

17:21:42:       CPUs: 4

17:21:42:     Memory: 995.25MiB

17:21:42:Free Memory: 298.65MiB

17:21:42:    Threads: POSIX_THREADS

17:21:42: On Battery: false

17:21:42: UTC offset: 7

17:21:42:        PID: 1756

17:21:42:        CWD: /home/ubuntu/.FAHClient

17:21:42:         OS: Linux 3.2.0-27-generic x86_64

17:21:42:    OS Arch: AMD64

17:21:42:       GPUs: 0

17:21:42:       CUDA: Not detected

17:21:42:***********************************************************************

17:21:42:<config>

17:21:42:  <!-- FahCore Control -->

17:21:42:  <core-priority v='low'/>

17:21:42:

17:21:42:  <!-- Network -->

17:21:42:  <proxy v=':8080'/>

17:21:42:

17:21:42:  <!-- User Information -->

17:21:42:  <passkey v='********************************'/>

17:21:42:  <team v='40051'/>

17:21:42:  <user v='iceman2992'/>

17:21:42:

17:21:42:  <!-- Folding Slots -->

17:21:42:  <slot id='0' type='SMP'/>

17:21:42:</config>

17:21:42:Trying to access database...

17:21:42:Successfully acquired database lock

17:21:42:Enabled folding slot 00: READY smp:4

17:21:42:WU00:FS00:Starting

17:21:42:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /home/ubuntu/.FAHClient/cores/www.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 701 -lifeline 1756 -checkpoint 15 -np 4

17:21:42:WU00:FS00:Started FahCore on PID 1764

17:21:42:WU00:FS00:Core PID:1768

17:21:42:WU00:FS00:FahCore 0xa4 started

17:21:43:WU00:FS00:0xa4:

17:21:43:WU00:FS00:0xa4:*------------------------------*

17:21:43:WU00:FS00:0xa4:Folding@Home Gromacs GB Core

17:21:43:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)

17:21:43:WU00:FS00:0xa4:

17:21:43:WU00:FS00:0xa4:Preparing to commence simulation

17:21:43:WU00:FS00:0xa4:- Ensuring status. Please wait.

17:21:47:Server connection id=1 on 0.0.0.0:36330 from 127.0.0.1

17:21:52:WU00:FS00:0xa4:- Looking at optimizations...

17:21:52:WU00:FS00:0xa4:- Working with standard loops on this execution.

17:21:52:WU00:FS00:0xa4:- Previous termination of core was improper.

17:21:52:WU00:FS00:0xa4:- Files status OK

17:21:52:WU00:FS00:0xa4:- Expanded 547904 -> 847524 (decompressed 154.6 percent)

17:21:52:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=547904 data_size=847524, decompressed_data_size=847524 diff=0

17:21:52:WU00:FS00:0xa4:- Digital signature verified

17:21:52:WU00:FS00:0xa4:

17:21:52:WU00:FS00:0xa4:Project: 7647 (Run 67, Clone 0, Gen 22)

17:21:52:WU00:FS00:0xa4:

17:21:52:WU00:FS00:0xa4:Entering M.D.

17:21:58:WU00:FS00:0xa4:Using Gromacs checkpoints

17:21:58:WU00:FS00:0xa4:mdrun returned 255

17:21:58:WU00:FS00:0xa4:Going to send back what have done -- stepsTotalG=0

17:21:58:WU00:FS00:0xa4:Work fraction=0.0000 steps=0.

17:22:02:WU00:FS00:0xa4:logfile size=35885 infoLength=35885 edr=25 trr=1

17:22:02:WU00:FS00:0xa4:logfile size: 35885 info=35885 bed=25 hdr=1

17:22:02:WU00:FS00:0xa4:- Writing 36423 bytes of core data to disk...

17:22:02:WU00:FS00:0xa4:Done: 35911 -> 7237 (compressed to 20.1 percent)

17:22:02:WU00:FS00:0xa4:  ... Done.

17:22:10:WU00:FS00:0xa4:

17:22:10:WU00:FS00:0xa4:Folding@home Core Shutdown: UNSTABLE_MACHINE

17:22:10:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)

17:22:10:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:7647 run:67 clone:0 gen:22 core:0xa4 unit:0x0000001e664f2dcd4fa7fe55c3135bc2

17:22:10:WU00:FS00:Uploading 7.57KiB to 171.64.65.101

17:22:10:WU00:FS00:Connecting to 171.64.65.101:8080

17:22:11:WU01:FS00:Connecting to assign3.stanford.edu:8080

17:22:31:WARNING:WU01:FS00:Failed to get assignment from 'assign3.stanford.edu:8080': Could not get IP address for assign3.stanford.edu: No address associated with hostname

17:22:31:WU01:FS00:Connecting to assign4.stanford.edu:80

gwildperson · Post by **gwildperson** » Tue Aug 07, 2012 9:36 am

There's nothing that can be done about a corrupted checkpoint. Science can't accept results containing bad data.

iceman1992 · Post by **iceman1992** » Wed Aug 08, 2012 6:59 am

Yes. I'm sure these things are not rare (although it was the first and only for me), which makes me think a secondary checkpoint at longer intervals might be a good idea.

7im · Post by **7im** » Wed Aug 08, 2012 5:54 pm

Gromacs already supports using multiple checkpoints. FAH is yet to use this feature.

gwildperson · Post by **gwildperson** » Thu Aug 09, 2012 12:20 am

7im wrote:Gromacs already supports using multiple checkpoints. FAH is yet to use this feature.

I'll bet that's a Gromacs issue, not a FAH issue.

FAH Cores are built with a version of Gromacs that's active at the time and unless a new version of Gromacs provides a scientific upgrade that FAH needs, they probably don't update from an older version. It seems likely that Gromacs started writing multiple checkpoints but had not yet started reading multiple checkpoints. If somebody follows the Gromacs change-log, you might even find enough information to be able to say that today's Core-X is based on Gromacs version Y and needs to use Gromacs version Y+1.

Post by **Jesse_V** » Thu Aug 09, 2012 12:58 am

gwildperson wrote:
7im wrote:Gromacs already supports using multiple checkpoints. FAH is yet to use this feature.
I'll bet that's a Gromacs issue, not a FAH issue.

FAH Cores are built with a version of Gromacs that's active at the time and unless a new version of Gromacs provides a scientific upgrade that FAH needs, they probably don't update from an older version. It seems likely that Gromacs started writing multiple checkpoints but had not yet started reading multiple checkpoints. If somebody follows the Gromacs change-log, you might even find enough information to be able to say that today's Core-X is based on Gromacs version Y and needs to use Gromacs version Y+1.

From Vijay Pande in 2008:

viewtopic.php?f=16&t=1912&p=16570#p16570

VijayPande wrote:
314159 wrote:
I want to log a formal protest:
1. Error trapping - why not save a checkpoint (i.e. two vs. one) and have the client attempt continuation from the former point if the error is as you described?
Note that I have baby-sat MANY 0x? SMP's to completion on their second run simply by backing up, exiting client, restarting client (several times). 100% success to date. If that additional checkpoint had been available, I would bet that the stack error or whatever it was that caused the 0x? would NOT have occurred. This technique might be something that could be applied to all classes of WUs besides the SMPs (perhaps not the PS3's). Coding would be trivial.

This has come up in our dev discussions. However, getting SMP/Windows to run more stably has taken precedence here. We've been putting dev time into this, since there are so many potential SMP/Windows clients out there, but we can't tap them until SMP/Windows becomes more stable. If you feel like this issue is as significant in donor base and donor interest as a stable SMP/Windows client, let me know and we can consider rearranging priorities.

7im · Post by **7im** » Thu Aug 09, 2012 1:58 am

One would hope the SMP fahcores are considered much more stable now, 4 years later, with V7 and BigAdv rocking on Gromacs versions that already support multiple checkpoints. The Gromacs code in the fahcores has supported 2 checkpoints going back several revisions of the fahcores already. Like I said, FAH is yet to implement this feature.

Maybe V7 will bring in more processing power than having 2 checkpoints would save. It's also funny that stability was given priority when having 2 checkpoints to fall back on would also have helped save the science. 2 checkpoints would have been a great band-aid 4 years ago while they worked out the stability problems.

iceman1992 · Post by **iceman1992** » Thu Aug 09, 2012 6:46 am

For fah to use this feature, do we have to wait for a new core (a6 ?) or can it be implemented in a3, a4 and a5?

Post by **Joe_H** » Thu Aug 09, 2012 3:52 pm

Since the recent cores are already keeping 2 checkpoints, my guess is that an update to them might be enough to enable recovery from the older checkpoint if the newest was not usable. But that is only a guess based on the cores using recent enough Gromacs code to write double checkpoints. That much I can see by looking at the files in my work directory: there is a current checkpoint file and one marked as "prev" that is 15 minutes older. I can see that with the A4 WU in process, and saw it in the past when working on A3 WUs.

iceman1992 · Post by **iceman1992** » Thu Aug 09, 2012 4:06 pm

And I'm wondering if the cores write both checkpoints at the same time? Because then if something happens, both will get corrupted, right?

Post by **Joe_H** » Thu Aug 09, 2012 4:31 pm

No, the cores write a checkpoint and rename the previous one. I would have to be able to look at the actual code to see how they have sequenced these operations as it happens too quickly to be certain from watching the directory. But it should be doing the rename and then writing the new checkpoint.
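
The rename-then-write sequence described above can be sketched roughly as follows. This is only an illustration, not the actual FahCore or Gromacs code: the function name `write_checkpoint` and the file names `state.cpt` and `state_prev.cpt` are assumptions made for the example. The sketch also goes one step further than the post describes by writing the new data to a temporary file first, so that a crash in the middle of a write cannot damage either existing checkpoint.

```python
import os
import tempfile


def write_checkpoint(directory, data,
                     name="state.cpt", prev_name="state_prev.cpt"):
    """Write `data` as the current checkpoint and keep the old one as 'prev'.

    File names are hypothetical placeholders, not the real core's names.
    """
    current = os.path.join(directory, name)
    previous = os.path.join(directory, prev_name)

    # Write the new checkpoint to a temporary file in the same directory,
    # so an interrupted write never touches an existing checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
    except BaseException:
        os.unlink(tmp_path)
        raise

    # Rotate: the old current checkpoint becomes the previous one...
    if os.path.exists(current):
        os.replace(current, previous)
    # ...and the fully written temporary file becomes the current one.
    os.replace(tmp_path, current)
    return current, previous
```

Calling `write_checkpoint(work_dir, payload)` every checkpoint interval would leave two files on disk, the newest and the one before it, which matches what can be seen in the work directory.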

iceman1992 · Post by **iceman1992** » Thu Aug 09, 2012 5:18 pm

Ah okay then.. So it's just a case of the core reading the previous checkpoint? It doesn't sound too difficult to implement so let's hope it comes soon

7im · Post by **7im** » Thu Aug 09, 2012 5:40 pm

It's never that simple... Assuming the first checkpoint is bad, for whatever reason, just reading the previous checkpoint is not enough. It probably should be tested extra well for corruption to determine what caused the previous checkpoint to not work. A bad work unit is a bad work unit, even if you can go back 1 frame before it went bad. Then you have to make sure it doesn't get stuck in a loop, etc.
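
To make those concerns concrete, here is a minimal sketch of what a fallback could look like, assuming each checkpoint carries a SHA-256 digest of its payload at the end of the file. That file layout, the helper names `seal`, `is_valid` and `pick_checkpoint`, and the retry limit are all hypothetical; nothing here reflects how the real cores store or verify checkpoints.

```python
import hashlib

MAX_RESUME_ATTEMPTS = 2  # hypothetical limit so a bad unit cannot loop forever


def seal(payload):
    """Append a SHA-256 digest to the payload (assumed checkpoint layout)."""
    return payload + hashlib.sha256(payload).digest()


def is_valid(path):
    """Return True if the file's trailing digest matches its payload."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except OSError:
        return False
    if len(blob) < 32:
        return False
    payload, digest = blob[:-32], blob[-32:]
    return hashlib.sha256(payload).digest() == digest


def pick_checkpoint(candidates, attempts):
    """Pick the newest checkpoint that verifies and still has retries left.

    candidates: checkpoint paths ordered newest first.
    attempts:   dict mapping path -> number of earlier resume attempts.
    Returns None when nothing usable remains, in which case the unit
    would have to be reported as faulty.
    """
    for path in candidates:
        if attempts.get(path, 0) >= MAX_RESUME_ATTEMPTS:
            continue  # already retried from here too often
        if is_valid(path):
            attempts[path] = attempts.get(path, 0) + 1
            return path
    return None
```

A digest check only catches a damaged file. It cannot tell whether the simulation state inside an intact checkpoint is itself bad, which is why the retry counter is there to stop the loop.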
