10031 (Run 14, Clone 0, Gen 39) Protomol

Moderators: Site Moderators, FAHC Science Team

Dusty82
Posts: 1
Joined: Sat Sep 04, 2010 8:46 pm

10031 (Run 14, Clone 0, Gen 39) Protomol

Post by Dusty82 »

I am quite new to Folding@Home, so perhaps my question has been raised before.

Today I had a power failure and needed to restart my system, which runs 24/7 just to contribute to science. However, I noticed that my work wasn't continuing because of an error that I believe is already known. So my question is: how can I keep this from happening again, besides avoiding power failures of course? Is the work I had done (68%) completely lost? Is there a way to restart at a previous checkpoint? It's a pity that even the partial results weren't able to upload either....

Thanks in advance for your answer.

Below is my log:


[12:49:44] Loaded queue successfully.
[12:49:44] 
[12:49:44] + Processing work unit
[12:49:44] Core required: FahCore_b4.exe
[12:49:44] Core found.
[12:49:44] Working on queue slot 03 [September 5 12:49:44 UTC]
[12:49:44] + Working ...
[12:49:56] *********************** Log Started 05/Sep/2010 12:49:55 ***********************
[12:49:56] ************************** ProtoMol Folding@Home Core **************************
[12:49:56]   Version: 25
[12:49:56]      Type: 180
[12:49:56]      Core: ProtoMol
[12:49:56]   Website: http://folding.stanford.edu/
[12:49:56] Copyright: (c) 2009 Stanford University
[12:49:56]    Author: Joseph Coffland <joseph@cauldrondevelopment.com>
[12:49:56]      Args: -dir work/ -suffix 03 -cpu 90 -checkpoint 15 -service -lifeline 972
[12:49:56]            -version 623
[12:49:56] ************************************ Build *************************************
[12:49:56]      Date: May 18 2010
[12:49:56]      Time: 23:43:52
[12:49:56]  Revision: 1819
[12:49:56]  Compiler: Intel(R) C++ MSVC 1500 mode 1110
[12:49:56]   Options: /TP /nologo /EHsc /wd4297 /wd4103 /wd1786 /arch:IA32 /Ox
[12:49:56]            /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qrestrict /MT
[12:49:56]   Defines: _CRT_SECURE_NO_WARNINGS NDEBUG HAVE_GEEKINFO BOOST_ALL_NO_LIB
[12:49:56]            XML_STATIC HAVE_EXPAT HAVE_OPENSSL HAVE_LIBFAH HAVE_SIMTK_LAPACK
[12:49:56]  Platform: Windows XP
[12:49:56]      Bits: 32
[12:49:56]      Mode: Release
[12:49:56] ************************************ System ************************************
[12:49:56]        OS: Microsoft Windows XP Professional
[12:49:56]       CPU: AMD Sempron(tm) Processor 2800+
[12:49:56]    CPU ID: AuthenticAMD Family 15 Model 44 Stepping 2
[12:49:56]      CPUs: 1 Logical, 1 Physical
[12:49:56]    Memory: 2.00 GB
[12:49:56]   Threads: Windows
[12:49:56] ********************************************************************************
[12:49:56] Project: 10031 (Run 14, Clone 0, Gen 39)
[12:49:56] Unit: 0x000000420001329c4bd49ced0000ea7d
[12:49:56] User: 0x00000000000000000000000000000000
[12:49:56] Machine: 1
[12:49:56] Digital signatures verified
[12:50:05] Completed 341900 out of 499375 steps (68%)
[12:52:56] ERROR: ProtoMol ERROR: Corrupt DCD file. Size is 3275268, should be >= 3281652.
[12:52:56] Saving result file logfile_03.txt
[12:52:56] Saving result file checkpt
[12:52:56] Saving result file checkpt.crc
[12:52:56] Saving result file log.txt
[12:53:05] Saving result file protomol.conf
[12:53:05] Saving result file ww.3839.pos
[12:53:05] Saving result file ww.3839.vel
[12:53:05] Saving result file ww.dcd
[12:53:07] WARNING: While cleaning up: 0: Failed to remove directory '03': boost::filesystem::remove: The process cannot access the file because it is being used by another process: "03\ww.dcd"
[12:53:07] Folding@home Core Shutdown: BAD_WORK_UNIT
[12:53:09] CoreStatus = 72 (114)
[12:53:09] Sending work to server
[12:53:09] Project: 10031 (Run 14, Clone 0, Gen 39)


[12:53:09] + Attempting to send results [September 5 12:53:09 UTC]
[12:56:42] - Couldn't send HTTP request to server
[12:56:42] + Could not connect to Work Server (results)
[12:56:42]     (129.74.85.15:8080)
[12:56:42] + Retrying using alternative port
[13:00:15] - Couldn't send HTTP request to server
[13:00:15] + Could not connect to Work Server (results)
[13:00:15]     (129.74.85.15:80)
[13:00:15] - Error: Could not transmit unit 03 (completed September 5) to work server.
[13:00:15]   Keeping unit 03 in queue.
[13:00:15] Project: 10031 (Run 14, Clone 0, Gen 39)


[13:00:15] + Attempting to send results [September 5 13:00:15 UTC]
[13:03:48] - Couldn't send HTTP request to server
[13:03:48] + Could not connect to Work Server (results)
[13:03:48]     (129.74.85.15:8080)
[13:03:48] + Retrying using alternative port
[13:04:09] - Couldn't send HTTP request to server
[13:04:09] + Could not connect to Work Server (results)
[13:04:09]     (129.74.85.15:80)
[13:04:09] - Error: Could not transmit unit 03 (completed September 5) to work server.


[13:04:09] + Attempting to send results [September 5 13:04:09 UTC]
[13:04:40] - Couldn't send HTTP request to server
[13:04:40] + Could not connect to Work Server (results)
[13:04:40]     (129.74.85.16:8080)
[13:04:40] + Retrying using alternative port
[13:04:50] - Couldn't send HTTP request to server
[13:04:50] + Could not connect to Work Server (results)
[13:04:50]     (129.74.85.16:80)
[13:04:50]   Could not transmit unit 03 to Collection server; keeping in queue.
[13:04:50] - Preparing to get new work unit...
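
For anyone reading the log above, the key line is the DCD size error: the trajectory file `ww.dcd` on disk is shorter than the core expected. A minimal sketch of pulling the two sizes out of that line and computing the shortfall follows. The function name `dcd_shortfall` and the regular expression are illustrative assumptions, not part of the F@H client or core.

```python
import re

# Matches the core's error line as it appears in the log above.
DCD_ERROR = re.compile(r"Corrupt DCD file\. Size is (\d+), should be >= (\d+)\.")


def dcd_shortfall(log_line):
    """Return how many bytes the file is short, or None for any other log line."""
    match = DCD_ERROR.search(log_line)
    if match is None:
        return None
    actual, expected = (int(group) for group in match.groups())
    return expected - actual


line = ("[12:52:56] ERROR: ProtoMol ERROR: Corrupt DCD file. "
        "Size is 3275268, should be >= 3281652.")
print(dcd_shortfall(line))  # 6384
```

So the file was 6384 bytes short of the minimum the core expected, which fits data that had not yet reached the disk when the power went out.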
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by PantherX »

Welcome to the F@H Forum Dusty82,
Dusty82 wrote:...Is the work I had done (68%) completely lost? Is there a way to restart at a previous checkpoint? It's a pity that even the partial results weren't able to upload either....
As long as the data isn't corrupted and is scientifically valid, it will be useful. However, since you got an error while processing the WU, the WU will be reissued to another donor.

The Client automatically reverts to the last valid checkpoint. You can't force it manually (at least I am not aware of any method for doing so).
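
To illustrate what "reverting to the last valid checkpoint" can look like in general, here is a rough sketch that checks each checkpoint against a stored CRC32 and falls back to the newest one that matches. It is not the client's actual logic: the function names, the idea of keeping several checkpoint generations, and the assumption that the `.crc` file holds a hexadecimal CRC32 as text are all hypothetical.

```python
import zlib
from pathlib import Path


def crc32_of(path):
    """CRC32 of a file's contents as an unsigned 32-bit integer."""
    return zlib.crc32(Path(path).read_bytes()) & 0xFFFFFFFF


def latest_valid_checkpoint(candidates):
    """Return the first checkpoint whose CRC matches, or None.

    candidates: list of (data_path, crc_path) pairs, newest first.
    Assumption: each crc file contains the CRC32 as hexadecimal text.
    """
    for data_path, crc_path in candidates:
        try:
            stored = int(Path(crc_path).read_text().strip(), 16)
            if crc32_of(data_path) == stored:
                return data_path
        except (OSError, ValueError):
            # Missing or unreadable file, or a garbled CRC: try an older one.
            continue
    return None
```

If no candidate passes the check, the function returns `None`, which corresponds to the case where the work unit has to be abandoned or restarted.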

The partial results will be uploaded: (from your FAHlog)
[13:04:50] Could not transmit unit 03 to Collection server; keeping in queue.
The Client will automatically retry sending the result every 6 hours. Once it has uploaded to the servers and the official stats are updated, you may get partial credit depending on the percentage of work you have done. An Admin/Mod can look up the WU using the PRCG and tell you how much credit you got.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
lanbrown
Posts: 104
Joined: Thu Jul 09, 2009 1:21 am

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by lanbrown »

There is a bug that needs to be fixed:
http://en.fah-addict.net/news/news-0-21 ... -10039.php

"Warning: These projects use the Protomol v23 core, which is affected by a known bug that wipes out all progress on the unit if it or the F@H client is stopped and restarted. The development teams are working to resolve this problem, but aside from that, the projects are entirely completable. If you do not have a dedicated machine, then we advise you to restart your client with no -advmethods flag, until a fix is released."
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by PantherX »

AFAIK, ProtoMol v25 fixes it in the majority of cases. I have restarted my Classic Client a couple of times; it didn't give me any kind of error and resumed from where it left off. Part of my FAHlog:


[12:09:13] Loaded queue successfully.
[12:09:13] 
[12:09:13] + Processing work unit
[12:09:13] Core required: FahCore_b4.exe
[12:09:13] Core found.
[12:09:13] - Autosending finished units... [September 3 12:09:13 UTC]
[12:09:13] Trying to send all finished work units
[12:09:13] + No unsent completed units remaining.
[12:09:13] - Autosend completed
[12:09:13] Working on queue slot 06 [September 3 12:09:13 UTC]
[12:09:13] + Working ...
[12:09:13] - Calling '.\FahCore_b4.exe -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 4248 -version 623'

[12:09:17] *********************** Log Started 03/Sep/2010 12:09:17 ***********************
[12:09:17] ************************** ProtoMol Folding@Home Core **************************
[12:09:17]   Version: 25
[12:09:17]      Type: 180
[12:09:17]      Core: ProtoMol
[12:09:17]   Website: http://folding.stanford.edu/
[12:09:17] Copyright: (c) 2009 Stanford University
[12:09:17]    Author: Joseph Coffland <joseph@cauldrondevelopment.com>
[12:09:17]      Args: -dir work/ -suffix 06 -checkpoint 15 -verbose -lifeline 4248 -version
[12:09:17]            623
[12:09:17] ************************************ Build *************************************
[12:09:17]      Date: May 18 2010
[12:09:17]      Time: 23:43:52
[12:09:17]  Revision: 1819
[12:09:17]  Compiler: Intel(R) C++ MSVC 1500 mode 1110
[12:09:17]   Options: /TP /nologo /EHsc /wd4297 /wd4103 /wd1786 /arch:IA32 /Ox
[12:09:17]            /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qrestrict /MT
[12:09:17]   Defines: _CRT_SECURE_NO_WARNINGS NDEBUG HAVE_GEEKINFO BOOST_ALL_NO_LIB
[12:09:17]            XML_STATIC HAVE_EXPAT HAVE_OPENSSL HAVE_LIBFAH HAVE_SIMTK_LAPACK
[12:09:17]  Platform: Windows XP
[12:09:17]      Bits: 32
[12:09:17]      Mode: Release
[12:09:17] ************************************ System ************************************
[12:09:17]        OS: Microsoft Windows 7 Ultimate
[12:09:17]       CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
[12:09:17]    CPU ID: GenuineIntel Family 6 Model 23 Stepping 6
[12:09:17]      CPUs: 2 Logical, 1 Physical
[12:09:17]    Memory: 3.00 GB
[12:09:17]   Threads: Windows
[12:09:17] ********************************************************************************
[12:09:17] Project: 10057 (Run 954, Clone 0, Gen 7)
[12:09:17] Unit: 0x000000090001329c4c61917800003e55
[12:09:17] User: 0x00000000000000000000000000000000
[12:09:17] Machine: 1
[12:09:17] Digital signatures verified
[12:09:17] GUI Server started
[12:09:17] Completed 175700 out of 499375 steps (35%)
[12:16:47] Completed 179800 out of 499375 steps (36%)
SNIP
[21:31:20] Completed 484400 out of 499375 steps (97%)
[21:40:25] Completed 489400 out of 499375 steps (98%)
[21:49:28] Completed 494400 out of 499375 steps (99%)
[21:58:37] Completed 499300 out of 499375 steps (99%)
[21:58:47] GUI Server closing
[21:58:47] GUI Server exiting
[21:58:47] Saving result file logfile_06.txt
[21:58:47] Saving result file checkpt
[21:58:47] Saving result file checkpt.crc
[21:58:47] Saving result file log.txt
[21:58:51] Saving result file protomol.conf
[21:58:51] Saving result file ww.538.pos
[21:58:51] Saving result file ww.538.vel
[21:58:51] Saving result file ww.dcd
[21:58:52] Folding@home Core Shutdown: FINISHED_UNIT
[21:58:56] CoreStatus = 64 (100)
[21:58:56] Unit 6 finished with 97 percent of time to deadline remaining.
[21:58:56] Updated performance fraction: 0.980017
[21:58:56] Sending work to server
[21:58:56] Project: 10057 (Run 954, Clone 0, Gen 7)
[21:58:56] - Read packet limit of 540015616... Set to 524286976.


[21:58:56] + Attempting to send results [September 3 21:58:56 UTC]
[21:58:56] - Reading file work/wuresults_06.dat from core
[21:58:56]   (Read 4236306 bytes from disk)
[21:58:56] Connecting to http://129.74.85.15:8080/
[22:06:38] Posted data.
[22:06:38] Initial: 0000; - Uploaded at ~8 kB/s
[22:06:38] - Averaged speed for that direction ~6 kB/s
[22:06:38] + Results successfully sent
[22:06:38] Thank you for your contribution to Folding@Home.
[22:06:38] + Number of Units Completed: 52
PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by PantherX »

This might interest you:
John_Weatherman wrote:This from j coffland in reply to my asking about the b4 core problems,

I have a new version of the b4 core which I believe solves this problem. We were hoping to also solve a problem with SSE2 instructions in the 32-bit core. This issue is important but is not yet at the top of the list. Although you should expect a new b4 release in about a month. Sorry for the delays. Developer bandwidth is our limiting factor.

Joseph
Source
toTOW
Site Moderator
Posts: 6455
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by toTOW »

Checkpoints have been fixed in v24: http://en.fah-addict.net/news/news-0-22 ... -fixed.php

v25 fixed another kind of problem: http://en.fah-addict.net/news/news-0-22 ... ilable.php

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
John_Weatherman
Posts: 289
Joined: Sun Dec 02, 2007 4:31 am
Location: Carrizo Plain National Monument, California

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by John_Weatherman »

If you have a power failure, you'll probably lose your work, regardless of what type of WU you're working on. Often it'll start again from the beginning; if you're very lucky, it'll carry on from the crash; and the Protomols will dump the WU and send in the results.
The reported error with the b4 core is from a normal Windows closedown, which produces the same error message.
lanbrown
Posts: 104
Joined: Thu Jul 09, 2009 1:21 am

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by lanbrown »

toTOW wrote:Checkpoints have been fixed in v24 : http://en.fah-addict.net/news/news-0-22 ... -fixed.php
If it was fixed, then some wouldn't be experiencing the checkpoint issue.
gwildperson
Posts: 450
Joined: Tue Dec 04, 2007 8:36 pm

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by gwildperson »

lanbrown wrote:
toTOW wrote:Checkpoints have been fixed in v24 : http://en.fah-addict.net/news/news-0-22 ... -fixed.php
If it was fixed, then some wouldn't be experiencing the checkpoint issue.
You're talking about two different things. Checkpoints which precede an orderly shutdown are inherently different from checkpoints which precede a power failure. If the OS is unable to complete an orderly shutdown, the checkpoint that has been "written" by the software might still be in cache, and only the part of it that was actually written to disk is available after a power failure.
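
The usual way for an application to guard against that gap between "written" and "on disk" is to flush explicitly and then swap the new file into place atomically. A minimal sketch of that technique, assuming nothing about how the b4 core actually writes its files (the function name `write_checkpoint_durably` and the `.tmp` suffix are made up for illustration):

```python
import os


def write_checkpoint_durably(path, data):
    """Write data so that, after a crash, path holds either the old or the new version."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its cache to the device
    # Atomic swap: readers never see a half-written checkpoint.
    os.replace(tmp_path, path)
    if os.name == "posix":
        # On POSIX, also sync the directory so the rename itself is durable.
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Even with this, a drive that reports data as written while it is still in its own volatile cache can lose it on power failure, so this narrows the window rather than closing it completely.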
lanbrown
Posts: 104
Joined: Thu Jul 09, 2009 1:21 am

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by lanbrown »

gwildperson wrote:
lanbrown wrote:
toTOW wrote:Checkpoints have been fixed in v24 : http://en.fah-addict.net/news/news-0-22 ... -fixed.php
If it was fixed, then some wouldn't be experiencing the checkpoint issue.
You're talking about two different things. Checkpoints which precede an orderly shutdown are inherently different from checkpoints which precede a power failure. If the OS is unable to complete an orderly shutdown, the checkpoint that has been "written" by the software might still be in cache, and only the part of it that was actually written to disk is available after a power failure.
Uhhh no. The default is to write a checkpoint every 15 minutes. A power failure means that while the client was not shut down gracefully, it still has a checkpoint it can go back to. The Protomol seems to be the only core that has this issue. It doesn't happen all of the time, but it DOES happen. It can even happen at a graceful shutdown and restart of JUST the client and NOT the system. The OS is not going to keep DISK IO in cache for any length of time. It may use cache to speed the request up. To poke further holes in your cache explanation, it happens on a box running multiple clients, and the checkpoints are not at the same time; they are minutes apart. No OS will keep the write cache that long, as it would lead to a very inconsistent file system.
John_Weatherman
Posts: 289
Joined: Sun Dec 02, 2007 4:31 am
Location: Carrizo Plain National Monument, California

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by John_Weatherman »

lanbrown wrote: A power failure means that while the client was not shut down gracefully, it still has a checkpoint it can go back to. The Protomol seems to be the only core that has this issue. It doesn't happen all of the time, but it DOES happen. It can even happen at a graceful shutdown and restart of JUST the client and NOT the system.
I just lost a week's work with a core 78 WU after a crash, so it's not just b4 core WUs. Protomol WUs are more sensitive, and keep working for a while even after the client is closed correctly.
gwildperson
Posts: 450
Joined: Tue Dec 04, 2007 8:36 pm

Re: 10031 (Run 14, Clone 0, Gen 39) Protomol

Post by gwildperson »

lanbrown wrote:No OS will keep the write cache that long as it would lead to a very inconsistent file system.
I don't know how you can claim to know about all possible variations in operating systems. Let's assume we have a write-back ramdisk with a harddisk backup. All reads and writes go to the ramdisk, so the application always sees a consistent file system. During a shutdown, everything in RAM is synced to disk, so the disk is consistent whenever the OS is shut down.

An increasing number of filesystems are keeping more and more information in RAM for longer and longer periods of time. Your assumptions about what constitutes "that long" may very well be correct, but they're on shaky ground as more and more file-system designers come up with more and more reasons to delay bothering the harddisk subsystem with "unnecessary" operations, as long as they can assume that your UPS will provide enough time to write the last version of the data to disk.
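
The ramdisk scenario above can be mocked up in a few lines. This is only a toy model to make the argument concrete, not a description of any real filesystem; the class `RamCachedDisk` and its method names are invented for the example.

```python
class RamCachedDisk:
    """Toy model: writes land in RAM and reach the disk only on sync()."""

    def __init__(self):
        self.ram = {}    # pending writes the application already "sees"
        self.disk = {}   # what would survive a power failure

    def write(self, name, data):
        self.ram[name] = data

    def read(self, name):
        # The application always sees the newest version, cached or not.
        return self.ram.get(name, self.disk.get(name))

    def sync(self):
        # Orderly shutdown (or periodic flush): everything reaches the disk.
        self.disk.update(self.ram)
        self.ram.clear()

    def power_failure(self):
        # Anything not yet synced is simply gone.
        self.ram.clear()


fs = RamCachedDisk()
fs.write("checkpt", b"older checkpoint")
fs.sync()
fs.write("checkpt", b"newer checkpoint")
print(fs.read("checkpt"))  # b'newer checkpoint'
fs.power_failure()
print(fs.read("checkpt"))  # b'older checkpoint'
```

How long real systems hold data in RAM before flushing is exactly the point under dispute in this thread; the model only shows why the answer matters.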