Lost work at >95%

Moderators: Site Moderators, FAHC Science Team

iceman1992
Posts: 523
Joined: Fri Mar 23, 2012 5:16 pm

Lost work at >95%

Post by iceman1992 »

My computer was working on project 7647 (Win7 host, Ubuntu VM) when the VirtualBox window went completely black, although FAH was still running.
Because I hadn't clicked Finish yet, I had to power off and restart the VM, and I think the checkpoint got corrupted in the process.
It's unfortunate, since 7647 is one of the longer projects and I was about to get around 14K to 16K points from the unit :(

Code:

*********************** Log Started 2012-08-06T17:21:42Z ***********************

17:21:42:************************* Folding@home Client *************************

17:21:42:    Website: http://folding.stanford.edu/

17:21:42:  Copyright: (c) 2009-2012 Stanford University

17:21:42:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>

17:21:42:       Args: --lifeline 1749 --command-port=36330

17:21:42:     Config: /home/ubuntu/.FAHClient/config.xml

17:21:42:******************************** Build ********************************

17:21:42:    Version: 7.1.52

17:21:42:       Date: Mar 20 2012

17:21:42:       Time: 13:19:11

17:21:42:    SVN Rev: 3515

17:21:42:     Branch: fah/trunk/client

17:21:42:   Compiler: GNU 4.6.2

17:21:42:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math

17:21:42:             -fno-unsafe-math-optimizations -msse2

17:21:42:   Platform: linux2 3.2.0-1-amd64

17:21:42:       Bits: 64

17:21:42:       Mode: Release

17:21:42:******************************* System ********************************

17:21:42:        CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz

17:21:42:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7

17:21:42:       CPUs: 4

17:21:42:     Memory: 995.25MiB

17:21:42:Free Memory: 298.65MiB

17:21:42:    Threads: POSIX_THREADS

17:21:42: On Battery: false

17:21:42: UTC offset: 7

17:21:42:        PID: 1756

17:21:42:        CWD: /home/ubuntu/.FAHClient

17:21:42:         OS: Linux 3.2.0-27-generic x86_64

17:21:42:    OS Arch: AMD64

17:21:42:       GPUs: 0

17:21:42:       CUDA: Not detected

17:21:42:***********************************************************************

17:21:42:<config>

17:21:42:  <!-- FahCore Control -->

17:21:42:  <core-priority v='low'/>

17:21:42:

17:21:42:  <!-- Network -->

17:21:42:  <proxy v=':8080'/>

17:21:42:

17:21:42:  <!-- User Information -->

17:21:42:  <passkey v='********************************'/>

17:21:42:  <team v='40051'/>

17:21:42:  <user v='iceman2992'/>

17:21:42:

17:21:42:  <!-- Folding Slots -->

17:21:42:  <slot id='0' type='SMP'/>

17:21:42:</config>

17:21:42:Trying to access database...

17:21:42:Successfully acquired database lock

17:21:42:Enabled folding slot 00: READY smp:4

17:21:42:WU00:FS00:Starting

17:21:42:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /home/ubuntu/.FAHClient/cores/www.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 701 -lifeline 1756 -checkpoint 15 -np 4

17:21:42:WU00:FS00:Started FahCore on PID 1764

17:21:42:WU00:FS00:Core PID:1768

17:21:42:WU00:FS00:FahCore 0xa4 started

17:21:43:WU00:FS00:0xa4:

17:21:43:WU00:FS00:0xa4:*------------------------------*

17:21:43:WU00:FS00:0xa4:Folding@Home Gromacs GB Core

17:21:43:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)

17:21:43:WU00:FS00:0xa4:

17:21:43:WU00:FS00:0xa4:Preparing to commence simulation

17:21:43:WU00:FS00:0xa4:- Ensuring status. Please wait.

17:21:47:Server connection id=1 on 0.0.0.0:36330 from 127.0.0.1

17:21:52:WU00:FS00:0xa4:- Looking at optimizations...

17:21:52:WU00:FS00:0xa4:- Working with standard loops on this execution.

17:21:52:WU00:FS00:0xa4:- Previous termination of core was improper.

17:21:52:WU00:FS00:0xa4:- Files status OK

17:21:52:WU00:FS00:0xa4:- Expanded 547904 -> 847524 (decompressed 154.6 percent)

17:21:52:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=547904 data_size=847524, decompressed_data_size=847524 diff=0

17:21:52:WU00:FS00:0xa4:- Digital signature verified

17:21:52:WU00:FS00:0xa4:

17:21:52:WU00:FS00:0xa4:Project: 7647 (Run 67, Clone 0, Gen 22)

17:21:52:WU00:FS00:0xa4:

17:21:52:WU00:FS00:0xa4:Entering M.D.

17:21:58:WU00:FS00:0xa4:Using Gromacs checkpoints

17:21:58:WU00:FS00:0xa4:mdrun returned 255

17:21:58:WU00:FS00:0xa4:Going to send back what have done -- stepsTotalG=0

17:21:58:WU00:FS00:0xa4:Work fraction=0.0000 steps=0.

17:22:02:WU00:FS00:0xa4:logfile size=35885 infoLength=35885 edr=25 trr=1

17:22:02:WU00:FS00:0xa4:logfile size: 35885 info=35885 bed=25 hdr=1

17:22:02:WU00:FS00:0xa4:- Writing 36423 bytes of core data to disk...

17:22:02:WU00:FS00:0xa4:Done: 35911 -> 7237 (compressed to 20.1 percent)

17:22:02:WU00:FS00:0xa4:  ... Done.

17:22:10:WU00:FS00:0xa4:

17:22:10:WU00:FS00:0xa4:Folding@home Core Shutdown: UNSTABLE_MACHINE

17:22:10:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)

17:22:10:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:7647 run:67 clone:0 gen:22 core:0xa4 unit:0x0000001e664f2dcd4fa7fe55c3135bc2

17:22:10:WU00:FS00:Uploading 7.57KiB to 171.64.65.101

17:22:10:WU00:FS00:Connecting to 171.64.65.101:8080

17:22:11:WU01:FS00:Connecting to assign3.stanford.edu:8080

17:22:31:WARNING:WU01:FS00:Failed to get assignment from 'assign3.stanford.edu:8080': Could not get IP address for assign3.stanford.edu: No address associated with hostname

17:22:31:WU01:FS00:Connecting to assign4.stanford.edu:80
gwildperson
Posts: 450
Joined: Tue Dec 04, 2007 8:36 pm

Re: Lost work at >95%

Post by gwildperson »

There's nothing that can be done about a corrupted checkpoint. Science can't accept results containing bad data.
iceman1992
Posts: 523
Joined: Fri Mar 23, 2012 5:16 pm

Re: Lost work at >95%

Post by iceman1992 »

Yes. I'm sure these things are not rare (although it was the first and only time for me), which makes me think a secondary checkpoint at longer intervals might be a good idea.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Lost work at >95%

Post by 7im »

Gromacs already supports using multiple checkpoints. FAH is yet to use this feature.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
gwildperson
Posts: 450
Joined: Tue Dec 04, 2007 8:36 pm

Re: Lost work at >95%

Post by gwildperson »

7im wrote:Gromacs already supports using multiple checkpoints. FAH is yet to use this feature.
I'll bet that's a Gromacs issue, not a FAH issue.

FAH Cores are built with a version of Gromacs that's active at the time and unless a new version of Gromacs provides a scientific upgrade that FAH needs, they probably don't update from an older version. It seems likely that Gromacs started writing multiple checkpoints but had not yet started reading multiple checkpoints. If somebody follows the Gromacs change-log, you might even find enough information to be able to say that today's Core-X is based on Gromacs version Y and needs to use Gromacs version Y+1.
Jesse_V
Site Moderator
Posts: 2850
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: Lost work at >95%

Post by Jesse_V »

gwildperson wrote:
7im wrote:Gromacs already supports using multiple checkpoints. FAH is yet to use this feature.
I'll bet that's a Gromacs issue, not a FAH issue.

FAH Cores are built with a version of Gromacs that's active at the time and unless a new version of Gromacs provides a scientific upgrade that FAH needs, they probably don't update from an older version. It seems likely that Gromacs started writing multiple checkpoints but had not yet started reading multiple checkpoints. If somebody follows the Gromacs change-log, you might even find enough information to be able to say that today's Core-X is based on Gromacs version Y and needs to use Gromacs version Y+1.
From Vijay Pande in 2008:

viewtopic.php?f=16&t=1912&p=16570#p16570
VijayPande wrote:
314159 wrote:
I want to log a formal protest:
1. Error trapping - why not save a checkpoint (i.e. two vs. one) and have the client attempt continuation from the former point if the error is as you described?
Note that I have baby-sat MANY 0x? SMP's to completion on their second run simply by backing up, exiting client, restarting client (several times). 100% success to date. If that additional checkpoint had been available, I would bet that the stack error or whatever it was that caused the 0x? would NOT have occurred. This technique might be something that could be applied to all classes of WUs besides the SMPs (perhaps not the PS3's). Coding would be trivial.
This has come up in our dev discussions. However, getting SMP/Windows to run more stably has taken precedence here. We've been putting dev time into this, since there are so many potential SMP/Windows clients out there, but we can't tap them until SMP/Windows becomes more stable. If you feel like this issue is as significant in donor base and donor interest as a stable SMP/Windows client, let me know and we can consider rearranging priorities.
F@h is now the top computing platform on the planet and nothing unites people like a dedicated fight against a common enemy. This virus affects all of us. Lets end it together.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Lost work at >95%

Post by 7im »

One would hope the SMP fahcores are considered much more stable now, 4 years later, with V7 and BigAdv rocking on Gromacs versions that already support multiple checkpoints. The Gromacs code in the fahcores has supported 2 checkpoints going back several revisions of the fahcores already. Like I said, FAH has yet to implement this feature.

Maybe V7 will bring in more processing power than having 2 checkpoints would save. It's also funny that stability was given priority when having 2 checkpoints to fall back on would also have helped save the science. 2 checkpoints would have been a great band-aid 4 years ago while they worked out the stability problems. 8-)
iceman1992
Posts: 523
Joined: Fri Mar 23, 2012 5:16 pm

Re: Lost work at >95%

Post by iceman1992 »

For FAH to use this feature, do we have to wait for a new core (a6?), or can it be implemented in a3, a4 and a5?
Joe_H
Site Admin
Posts: 7990
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: Lost work at >95%

Post by Joe_H »

Since the recent cores are already keeping 2 checkpoints, my guess is that an update to them might be enough to enable recovery from the older checkpoint if the newest is not usable. But that is only a guess, based on the cores using recent enough Gromacs code to write double checkpoints. That much I can see by looking at the files in my work directory: there is a current checkpoint file and one marked as "prev" that is 15 minutes older. I see that with the A4 WU in progress, and saw it in the past while working on A3 WUs.
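For anyone who wants to check their own work directory the same way, here is a minimal Python sketch that lists the files in a directory with their age in minutes, newest first. The function name `list_by_age` is made up for this example, and it makes no assumption about what the core actually names its checkpoint files; it just shows whatever is there.

```python
import os
import time

def list_by_age(directory):
    """Return (name, age_in_minutes) for each regular file, newest first."""
    now = time.time()
    entries = []
    for entry in os.scandir(directory):
        if entry.is_file():
            # Age is measured from the file's last modification time.
            age_min = (now - entry.stat().st_mtime) / 60.0
            entries.append((entry.name, age_min))
    # Smallest age first, so the most recently written file leads the list.
    return sorted(entries, key=lambda item: item[1])
```

Calling `list_by_age` on a slot's work directory and printing the pairs should show the current checkpoint near the top and the "prev" one roughly one checkpoint interval behind it, if the core is keeping both.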
iceman1992
Posts: 523
Joined: Fri Mar 23, 2012 5:16 pm

Re: Lost work at >95%

Post by iceman1992 »

And I'm wondering if the cores write both checkpoints at the same time? Because then if something happens, both will get corrupted, right?
Joe_H
Site Admin
Posts: 7990
Joined: Tue Apr 21, 2009 4:41 pm
Location: W. MA

Re: Lost work at >95%

Post by Joe_H »

No, the cores write a checkpoint and rename the previous one. I would have to look at the actual code to see how these operations are sequenced, as it happens too quickly to be certain from watching the directory. But it should be doing the rename first and then writing the new checkpoint.
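To illustrate the rename-then-write ordering, here is a rough Python sketch. This is not the FahCore or Gromacs code, which I have not seen; the file names `checkpoint.dat` and `checkpoint_prev.dat` and the function `rotate_and_write` are placeholders invented for the example. The temporary file plus final rename is an extra safety step added here, not something the cores are known to do.

```python
import os
import tempfile

def rotate_and_write(directory, data,
                     name="checkpoint.dat", prev_name="checkpoint_prev.dat"):
    """Keep the last checkpoint as prev_name, then write a new one.

    File names are hypothetical placeholders, not the real core's names.
    """
    current = os.path.join(directory, name)
    previous = os.path.join(directory, prev_name)

    # Step 1: rename the existing checkpoint first, so an intact copy
    # survives even if the write below is interrupted.
    if os.path.exists(current):
        os.replace(current, previous)

    # Step 2: write the new checkpoint to a temporary file in the same
    # directory, force it to disk, then move it into place in one step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=name + ".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, current)
```

With this ordering, a crash during step 2 leaves at worst a stray temporary file and a missing current checkpoint, while the renamed previous checkpoint is untouched.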
iceman1992
Posts: 523
Joined: Fri Mar 23, 2012 5:16 pm

Re: Lost work at >95%

Post by iceman1992 »

Ah, okay then. So it's just a matter of the core reading the previous checkpoint? It doesn't sound too difficult to implement, so let's hope it comes soon.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Location: Arizona

Re: Lost work at >95%

Post by 7im »

It's never that simple... Assuming the first checkpoint is bad, for whatever reason, just reading the previous checkpoint is not enough. It probably should be tested extra well for corruption to determine what caused the previous checkpoint to not work. A bad work unit is a bad work unit, even if you can go back 1 frame before it went bad. Then you have to make sure it doesn't get stuck in a loop, etc.
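To make those concerns concrete, here is a rough Python sketch of a fallback that verifies a checkpoint before trusting it and limits how many times it will fall back, so a bad work unit cannot restart forever. Everything in it is an assumption for illustration: the file names, the SHA-256-prefixed file format, the counter file, and the functions `pack`, `read_valid` and `resume` are all made up and are not how the real cores store or check their data.

```python
import hashlib
import os

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes

def pack(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest (hypothetical format)."""
    return hashlib.sha256(payload).digest() + payload

def read_valid(path):
    """Return the payload if the file is readable and its digest matches."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except OSError:
        return None
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if len(digest) < DIGEST_LEN or hashlib.sha256(payload).digest() != digest:
        return None
    return payload

def resume(directory, max_fallbacks=1):
    """Choose a checkpoint to resume from.

    Returns (payload, "current"), (payload, "previous"), or
    (None, "give_up") when nothing is usable or the fallback budget
    has already been spent.
    """
    current = os.path.join(directory, "checkpoint.dat")        # placeholder name
    previous = os.path.join(directory, "checkpoint_prev.dat")  # placeholder name
    counter = os.path.join(directory, "fallback_count.txt")    # placeholder name

    payload = read_valid(current)
    if payload is not None:
        return payload, "current"

    # The newest checkpoint is unusable: check how often we already fell back.
    used = 0
    if os.path.exists(counter):
        with open(counter) as f:
            used = int(f.read().strip() or 0)
    if used >= max_fallbacks:
        return None, "give_up"

    payload = read_valid(previous)
    if payload is None:
        return None, "give_up"

    # Record the fallback so a repeat failure does not loop indefinitely.
    with open(counter, "w") as f:
        f.write(str(used + 1))
    return payload, "previous"
```

A checksum only catches a damaged file. It says nothing about whether the simulation state inside is scientifically sound, which is the harder part of the problem.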