Folding Forum

Posted: **Sat Sep 04, 2010 9:34 pm**

I am quite new to Folding@Home, so perhaps my question has been raised before....

Today I had a power failure and needed to restart my system, which runs 24/7 just to contribute to science. However, I noticed that my work did not continue because of an error, which I believe is already known. So my question is: how can I avoid this happening again, besides having no power failures, of course? Is the work I had done (68%) completely lost? Is there a way to restart at a previous checkpoint? It's a pity that even the partial results couldn't be uploaded either....

Thanks in advance for your answer.

Below is my log:

[12:49:44] Loaded queue successfully.
[12:49:44] 
[12:49:44] + Processing work unit
[12:49:44] Core required: FahCore_b4.exe
[12:49:44] Core found.
[12:49:44] Working on queue slot 03 [September 5 12:49:44 UTC]
[12:49:44] + Working ...
[12:49:56] *********************** Log Started 05/Sep/2010 12:49:55 ***********************
[12:49:56] ************************** ProtoMol Folding@Home Core **************************
[12:49:56]   Version: 25
[12:49:56]      Type: 180
[12:49:56]      Core: ProtoMol
[12:49:56]   Website: http://folding.stanford.edu/
[12:49:56] Copyright: (c) 2009 Stanford University
[12:49:56]    Author: Joseph Coffland <joseph@cauldrondevelopment.com>
[12:49:56]      Args: -dir work/ -suffix 03 -cpu 90 -checkpoint 15 -service -lifeline 972
[12:49:56]            -version 623
[12:49:56] ************************************ Build *************************************
[12:49:56]      Date: May 18 2010
[12:49:56]      Time: 23:43:52
[12:49:56]  Revision: 1819
[12:49:56]  Compiler: Intel(R) C++ MSVC 1500 mode 1110
[12:49:56]   Options: /TP /nologo /EHsc /wd4297 /wd4103 /wd1786 /arch:IA32 /Ox
[12:49:56]            /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qrestrict /MT
[12:49:56]   Defines: _CRT_SECURE_NO_WARNINGS NDEBUG HAVE_GEEKINFO BOOST_ALL_NO_LIB
[12:49:56]            XML_STATIC HAVE_EXPAT HAVE_OPENSSL HAVE_LIBFAH HAVE_SIMTK_LAPACK
[12:49:56]  Platform: Windows XP
[12:49:56]      Bits: 32
[12:49:56]      Mode: Release
[12:49:56] ************************************ System ************************************
[12:49:56]        OS: Microsoft Windows XP Professional
[12:49:56]       CPU: AMD Sempron(tm) Processor 2800+
[12:49:56]    CPU ID: AuthenticAMD Family 15 Model 44 Stepping 2
[12:49:56]      CPUs: 1 Logical, 1 Physical
[12:49:56]    Memory: 2.00 GB
[12:49:56]   Threads: Windows
[12:49:56] ********************************************************************************
[12:49:56] Project: 10031 (Run 14, Clone 0, Gen 39)
[12:49:56] Unit: 0x000000420001329c4bd49ced0000ea7d
[12:49:56] User: 0x00000000000000000000000000000000
[12:49:56] Machine: 1
[12:49:56] Digital signatures verified
[12:50:05] Completed 341900 out of 499375 steps (68%)
[12:52:56] ERROR: ProtoMol ERROR: Corrupt DCD file. Size is 3275268, should be >= 3281652.
[12:52:56] Saving result file logfile_03.txt
[12:52:56] Saving result file checkpt
[12:52:56] Saving result file checkpt.crc
[12:52:56] Saving result file log.txt
[12:53:05] Saving result file protomol.conf
[12:53:05] Saving result file ww.3839.pos
[12:53:05] Saving result file ww.3839.vel
[12:53:05] Saving result file ww.dcd
[12:53:07] WARNING: While cleaning up: 0: Failed to remove directory '03': boost::filesystem::remove: The process cannot access the file because it is being used by another process: "03\ww.dcd"
[12:53:07] Folding@home Core Shutdown: BAD_WORK_UNIT
[12:53:09] CoreStatus = 72 (114)
[12:53:09] Sending work to server
[12:53:09] Project: 10031 (Run 14, Clone 0, Gen 39)


[12:53:09] + Attempting to send results [September 5 12:53:09 UTC]
[12:56:42] - Couldn't send HTTP request to server
[12:56:42] + Could not connect to Work Server (results)
[12:56:42]     (129.74.85.15:8080)
[12:56:42] + Retrying using alternative port
[13:00:15] - Couldn't send HTTP request to server
[13:00:15] + Could not connect to Work Server (results)
[13:00:15]     (129.74.85.15:80)
[13:00:15] - Error: Could not transmit unit 03 (completed September 5) to work server.
[13:00:15]   Keeping unit 03 in queue.
[13:00:15] Project: 10031 (Run 14, Clone 0, Gen 39)


[13:00:15] + Attempting to send results [September 5 13:00:15 UTC]
[13:03:48] - Couldn't send HTTP request to server
[13:03:48] + Could not connect to Work Server (results)
[13:03:48]     (129.74.85.15:8080)
[13:03:48] + Retrying using alternative port
[13:04:09] - Couldn't send HTTP request to server
[13:04:09] + Could not connect to Work Server (results)
[13:04:09]     (129.74.85.15:80)
[13:04:09] - Error: Could not transmit unit 03 (completed September 5) to work server.


[13:04:09] + Attempting to send results [September 5 13:04:09 UTC]
[13:04:40] - Couldn't send HTTP request to server
[13:04:40] + Could not connect to Work Server (results)
[13:04:40]     (129.74.85.16:8080)
[13:04:40] + Retrying using alternative port
[13:04:50] - Couldn't send HTTP request to server
[13:04:50] + Could not connect to Work Server (results)
[13:04:50]     (129.74.85.16:80)
[13:04:50]   Could not transmit unit 03 to Collection server; keeping in queue.
[13:04:50] - Preparing to get new work unit...

Posted: **Sun Sep 05, 2010 1:15 am**

Welcome to the F@H Forum Dusty82,

Dusty82 wrote: ...Is the work I had done (68%) completely lost? Is there a way to restart at a previous checkpoint? It's a pity that even the partial results couldn't be uploaded either....

As long as the data isn't corrupted and is scientifically valid, it will be useful. However, since you got an error while processing the WU, the WU will be resent to some other donor.

The Client automatically reverts to the last valid checkpoint. You can't manually force it (at least I am not aware of any method for doing so).

The partial results will be uploaded: (from your FAHlog)
[13:04:50] Could not transmit unit 03 to Collection server; keeping in queue.
The Client will automatically retry sending the result every 6 hours. Once it has uploaded to the Servers and the Official Stats are updated, you may get partial credit depending on the percentage of work you have done. An Admin/Mod can look up the WU using the PRCG and tell you how much credit you got.
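
For anyone who wants to pull the PRCG out of a FAHlog before quoting it in a post, here is a minimal Python sketch. The pattern is based only on the `Project: ... (Run ..., Clone ..., Gen ...)` lines visible in the log above; the function name `extract_prcg` is made up for illustration and is not part of the client.

```python
import re

# Matches lines such as: "[12:49:56] Project: 10031 (Run 14, Clone 0, Gen 39)"
PRCG_LINE = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)

def extract_prcg(log_text):
    """Return unique (project, run, clone, gen) tuples in order of first appearance."""
    seen = []
    for match in PRCG_LINE.finditer(log_text):
        prcg = tuple(int(group) for group in match.groups())
        if prcg not in seen:
            seen.append(prcg)
    return seen

# Two lines copied from the log in the first post.
sample = """[12:49:56] Project: 10031 (Run 14, Clone 0, Gen 39)
[12:53:09] Project: 10031 (Run 14, Clone 0, Gen 39)"""

print(extract_prcg(sample))  # [(10031, 14, 0, 39)]
```

The same WU is logged several times (at start, on error, and on each send attempt), which is why duplicates are collapsed.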

Posted: **Sun Sep 05, 2010 5:27 am**

There is a bug that needs to be fixed:
http://en.fah-addict.net/news/news-0-21 ... -10039.php

"Warning: These projects use the Protomol v23 core, which is affected by a known bug that wipes out all progress on the unit if it or the F@H client is stopped and restarted. The development teams are working to resolve this problem, but aside from that, the projects are entirely completable. If you do not have a dedicated machine, then we advise you to restart your client with no -advmethods flag, until a fix is released."

Posted: **Sun Sep 05, 2010 8:07 am**

AFAIK, ProtoMol v25 fixes it in the majority of cases. I have restarted my Classic Client a couple of times; it didn't give me any kind of error and resumed from where it left off. Part of my FAHlog:

[12:09:13] Loaded queue successfully.
[12:09:13] 
[12:09:13] + Processing work unit
[12:09:13] Core required: FahCore_b4.exe
[12:09:13] Core found.
[12:09:13] - Autosending finished units... [September 3 12:09:13 UTC]
[12:09:13] Trying to send all finished work units
[12:09:13] + No unsent completed units remaining.
[12:09:13] - Autosend completed
[12:09:13] Working on queue slot 06 [September 3 12:09:13 UTC]
[12:09:13] + Working ...
[12:09:13] - Calling '.\FahCore_b4.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 4248 -version 623'

[12:09:17] *********************** Log Started 03/Sep/2010 12:09:17 ***********************
[12:09:17] ************************** ProtoMol Folding@Home Core **************************
[12:09:17]   Version: 25
[12:09:17]      Type: 180
[12:09:17]      Core: ProtoMol
[12:09:17]   Website: http://folding.stanford.edu/
[12:09:17] Copyright: (c) 2009 Stanford University
[12:09:17]    Author: Joseph Coffland <joseph@cauldrondevelopment.com>
[12:09:17]      Args: -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 4248 -version
[12:09:17]            623
[12:09:17] ************************************ Build *************************************
[12:09:17]      Date: May 18 2010
[12:09:17]      Time: 23:43:52
[12:09:17]  Revision: 1819
[12:09:17]  Compiler: Intel(R) C++ MSVC 1500 mode 1110
[12:09:17]   Options: /TP /nologo /EHsc /wd4297 /wd4103 /wd1786 /arch:IA32 /Ox
[12:09:17]            /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qrestrict /MT
[12:09:17]   Defines: _CRT_SECURE_NO_WARNINGS NDEBUG HAVE_GEEKINFO BOOST_ALL_NO_LIB
[12:09:17]            XML_STATIC HAVE_EXPAT HAVE_OPENSSL HAVE_LIBFAH HAVE_SIMTK_LAPACK
[12:09:17]  Platform: Windows XP
[12:09:17]      Bits: 32
[12:09:17]      Mode: Release
[12:09:17] ************************************ System ************************************
[12:09:17]        OS: Microsoft Windows 7 Ultimate
[12:09:17]       CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
[12:09:17]    CPU ID: GenuineIntel Family 6 Model 23 Stepping 6
[12:09:17]      CPUs: 2 Logical, 1 Physical
[12:09:17]    Memory: 3.00 GB
[12:09:17]   Threads: Windows
[12:09:17] ********************************************************************************
[12:09:17] Project: 10057 (Run 954, Clone 0, Gen 7)
[12:09:17] Unit: 0x000000090001329c4c61917800003e55
[12:09:17] User: 0x00000000000000000000000000000000
[12:09:17] Machine: 1
[12:09:17] Digital signatures verified
[12:09:17] GUI Server started
[12:09:17] Completed 175700 out of 499375 steps (35%)
[12:16:47] Completed 179800 out of 499375 steps (36%)
SNIP
[21:31:20] Completed 484400 out of 499375 steps (97%)
[21:40:25] Completed 489400 out of 499375 steps (98%)
[21:49:28] Completed 494400 out of 499375 steps (99%)
[21:58:37] Completed 499300 out of 499375 steps (99%)
[21:58:47] GUI Server closing
[21:58:47] GUI Server exiting
[21:58:47] Saving result file logfile_06.txt
[21:58:47] Saving result file checkpt
[21:58:47] Saving result file checkpt.crc
[21:58:47] Saving result file log.txt
[21:58:51] Saving result file protomol.conf
[21:58:51] Saving result file ww.538.pos
[21:58:51] Saving result file ww.538.vel
[21:58:51] Saving result file ww.dcd
[21:58:52] Folding@home Core Shutdown: FINISHED_UNIT
[21:58:56] CoreStatus = 64 (100)
[21:58:56] Unit 6 finished with 97 percent of time to deadline remaining.
[21:58:56] Updated performance fraction: 0.980017
[21:58:56] Sending work to server
[21:58:56] Project: 10057 (Run 954, Clone 0, Gen 7)
[21:58:56] - Read packet limit of 540015616... Set to 524286976.


[21:58:56] + Attempting to send results [September 3 21:58:56 UTC]
[21:58:56] - Reading file work/wuresults_06.dat from core
[21:58:56]   (Read 4236306 bytes from disk)
[21:58:56] Connecting to http://129.74.85.15:8080/
[22:06:38] Posted data.
[22:06:38] Initial: 0000; - Uploaded at ~8 kB/s
[22:06:38] - Averaged speed for that direction ~6 kB/s
[22:06:38] + Results successfully sent
[22:06:38] Thank you for your contribution to Folding@Home.
[22:06:38] + Number of Units Completed: 52

Posted: **Sun Sep 05, 2010 9:58 am**

This might interest you:

John_Weatherman wrote:This from j coffland in reply to my asking about the b4 core problems,

I have a new version of the b4 core which I believe solves this problem. We were hoping to also solve a problem with SSE2 instructions in the 32-bit core. This issue is important but is not yet at the top of the list, although you should expect a new b4 release in about a month. Sorry for the delays. Developer bandwidth is our limiting factor.

Joseph

Source

Posted: **Sun Sep 05, 2010 12:31 pm**

Checkpoints have been fixed in v24: http://en.fah-addict.net/news/news-0-22 ... -fixed.php

v25 fixed another kind of problem: http://en.fah-addict.net/news/news-0-22 ... ilable.php

Posted: **Sun Sep 05, 2010 1:49 pm**

If you have a power failure, you'll probably lose your work, regardless of what type of WU you're working on. Often it'll start again from the beginning; if you're very lucky it'll carry on from the crash. The Protomols will dump the WU and send in the results.
The reported error with the b4 core is from a normal Windows closedown, which produces the same error message.

Posted: **Mon Sep 06, 2010 5:34 pm**

toTOW wrote:Checkpoints have been fixed in v24 : http://en.fah-addict.net/news/news-0-22 ... -fixed.php

If it was fixed, then some wouldn't be experiencing the checkpoint issue.

Posted: **Wed Sep 08, 2010 1:07 am**

lanbrown wrote:
toTOW wrote:Checkpoints have been fixed in v24 : http://en.fah-addict.net/news/news-0-22 ... -fixed.php
If it was fixed, then some wouldn't be experiencing the checkpoint issue.

You're talking about two different things. Checkpoints which precede an orderly shutdown are inherently different from checkpoints which precede a power failure. If the OS is unable to complete an orderly shutdown, the checkpoint that has been "written" by the software might still be in cache, and only the part of it that was actually written to disk is available after a power failure.
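
To illustrate the difference between "written by the software" and "actually on disk", here is a minimal Python sketch of a checkpoint write that asks the OS to flush its cache before the new file replaces the old one. This is a generic technique, not how the b4 core writes its checkpoints (the thread does not say), and `write_checkpoint_durably` is a hypothetical name. Note that `os.fsync` only requests the flush; a drive with its own volatile write cache can still lose data.

```python
import os
import tempfile

def write_checkpoint_durably(path, data):
    """Write data to path so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its cache to the disk
        os.replace(tmp_path, path)  # swap the new checkpoint in over the old one
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Without the `os.fsync` call, the write can return successfully while the bytes are still only in RAM, which is the situation described above.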

Posted: **Wed Sep 08, 2010 4:11 am**

gwildperson wrote:
lanbrown wrote:
toTOW wrote:Checkpoints have been fixed in v24 : http://en.fah-addict.net/news/news-0-22 ... -fixed.php
If it was fixed, then some wouldn't be experiencing the checkpoint issue.
You're talking about two different things. Checkpoints which precede an orderly shutdown are inherently different from checkpoints which precede a power failure. If the OS is unable to complete an orderly shutdown, the checkpoint that has been "written" by the software might still be in cache, and only the part of it that was actually written to disk is available after a power failure.

Uhhh, no. The default is to write a checkpoint every 15 minutes. A power failure means that while the client was not shut down gracefully, it still has a checkpoint it can go back to. The Protomol seems to be the only core that has this issue. It doesn't happen all of the time, but it DOES happen. It can even happen at a graceful shutdown and restart of JUST the client and NOT the system. The OS is not going to keep disk I/O in cache for any length of time. It may use cache to speed the request up. To put further holes in your cache explanation, it happens on a box running multiple clients whose checkpoints are not written at the same time; they are minutes apart. No OS will keep the write cache that long, as it would lead to a very inconsistent file system.
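
The "it still has a checkpoint it can go back to" argument assumes an older checkpoint survives and that the core can tell a damaged one from a good one. A minimal Python sketch of that idea follows, using a CRC-32 stored next to each checkpoint. The logs in this thread show `checkpt` and `checkpt.crc` files, but their real format is not documented here, so the data layout and the name `pick_valid_checkpoint` are assumptions for illustration only.

```python
import zlib

def crc_of(data):
    """CRC-32 of a bytes object as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF

def pick_valid_checkpoint(candidates):
    """Return the first checkpoint whose stored CRC matches its data.

    candidates is a list of (data, stored_crc) pairs, newest first.
    Returns None when every candidate fails the check.
    """
    for data, stored_crc in candidates:
        if crc_of(data) == stored_crc:
            return data
    return None

older = b"state at step 341900"
newer = b"state at step 342000"

candidates = [
    (newer[:10], crc_of(newer)),  # newest checkpoint, truncated by a power loss
    (older, crc_of(older)),       # previous checkpoint, intact
]

# The truncated newest checkpoint fails its CRC, so the older one is chosen.
print(pick_valid_checkpoint(candidates))
```

A scheme like this only helps if at least one earlier generation is kept; a core that overwrites a single checkpoint in place has nothing to fall back to.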

Posted: **Wed Sep 08, 2010 5:38 am**

lanbrown wrote: A power failure means that while the client was not shut down gracefully, it still has a checkpoint it can go back to. The Protomol seems to be the only core that has this issue. It doesn't happen all of the time, but it DOES happen. It can even happen at a graceful shutdown and restart of JUST the client and NOT the system.

I just lost a week's work on a core 78 WU after a crash, so it's not just b4 core WUs. Protomol WUs are more sensitive, and keep working for a while even after the client is closed correctly.

Posted: **Wed Sep 08, 2010 8:48 pm**

lanbrown wrote: No OS will keep the write cache that long, as it would lead to a very inconsistent file system.

I don't know how you can claim to know about all possible variations in Operating Systems. Let's assume we have a write-back ramdisk with a harddisk backup. All reads and writes go to the ramdisk, so the application always sees a consistent file system. During a shutdown, everything in RAM is synced to disk, so the disk is consistent whenever the OS is shut down.

An increasing number of filesystems are keeping more and more information in RAM for longer and longer periods of time. Your assumptions about what constitutes "that long" may very well be correct, but they're on shaky ground as more and more file-system designers come up with more and more reasons to delay bothering the harddisk subsystem with "unnecessary" operations, as long as they can assume that your UPS will provide enough time to write the last version of the data to disk.

10031 (Run 14, Clone 0, Gen 39) Protomol
