Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Moderators: Site Moderators, FAHC Science Team

Stonecold
Posts: 332
Joined: Sun Dec 25, 2011 9:20 pm

Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by Stonecold »

In summary: I was running p6097 (Run 0, Clone 40, Gen 132) on client v6.34 on Linux, but after the WU finished, the client started running the same WU again. Here are the details of what happened:

Once the WU hit 100%, it stayed at "Done" without finishing or sending any data to Stanford. Here's a copy of the log:

Code: Select all

[18:10:24] Completed 420000 out of 500000 steps  (84%)
[18:23:00] Completed 425000 out of 500000 steps  (85%)
[18:35:35] Completed 430000 out of 500000 steps  (86%)
[18:48:11] Completed 435000 out of 500000 steps  (87%)
[19:00:46] Completed 440000 out of 500000 steps  (88%)
[19:13:26] Completed 445000 out of 500000 steps  (89%)
[19:26:02] Completed 450000 out of 500000 steps  (90%)
[19:38:38] Completed 455000 out of 500000 steps  (91%)
[19:51:15] Completed 460000 out of 500000 steps  (92%)
[20:03:58] Completed 465000 out of 500000 steps  (93%)
[20:17:39] Completed 470000 out of 500000 steps  (94%)
[20:32:03] Completed 475000 out of 500000 steps  (95%)
[20:47:01] Completed 480000 out of 500000 steps  (96%)
[21:01:46] Completed 485000 out of 500000 steps  (97%)
[21:16:58] Completed 490000 out of 500000 steps  (98%)
[21:31:20] Completed 495000 out of 500000 steps  (99%)
[21:45:16] Completed 500000 out of 500000 steps  (100%)
[21:45:18] DynamicWrapper: Finished Work Unit: sleep=10000
[21:45:28] 
[21:45:28] Finished Work Unit:
[21:45:28] - Reading up to 12102336 from "work/wudata_08.trr": Read 12102336
[21:45:28] trr file hash check passed.
[21:45:28] edr file hash check passed.
[21:45:28] logfile size: 61806
[21:45:28] Leaving Run
[21:45:29] - Writing 12197818 bytes of core data to disk...
[21:45:31] Done: 12197306 -> 11286509 (compressed to 92.5 percent)
[21:45:31]   ... Done.
It stayed at "Done" for way too long without doing anything. Eventually I closed the application and started it up again to see if that would work. This is what I got that time:

Code: Select all

--- Opening Log file [March 1 21:59:21 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/alex/Folding@home
Executable: ./fah6
Arguments: -smp 8 -verbosity 9 -advmethods 

[21:59:21] - Ask before connecting: No
[21:59:21] - User name: Stonecold (Team 214237)
[21:59:21] - User ID: 716CF0C42F1B31C2
[21:59:21] - Machine ID: 1
[21:59:21] 
[21:59:22] Loaded queue successfully.
[21:59:22] 
[21:59:22] + Processing work unit
[21:59:22] - Autosending finished units... [21:59:22]
[21:59:22] Core required: FahCore_a3.exe
[21:59:22] Trying to send all finished work units
[21:59:22] Core found.
[21:59:22] + No unsent completed units remaining.
[21:59:22] - Autosend completed
[21:59:22] Working on queue slot 08 [March 1 21:59:22 UTC]
[21:59:22] + Working ...
[21:59:22] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 08 -np 8 -checkpoint 10 -verbose -lifeline 29313 -version 634'

[21:59:22] 
[21:59:22] *------------------------------*
[21:59:22] Folding@Home Gromacs SMP Core
[21:59:22] Version 2.27 (Dec. 15, 2010)
[21:59:22] 
[21:59:22] Preparing to commence simulation
[21:59:22] - Ensuring status. Please wait.
[21:59:32] - Looking at optimizations...
[21:59:32] - Working with standard loops on this execution.
It stayed at "Working with standard loops on this execution" without changing, so I shut the client down again and rebooted my computer. This time it actually started folding, but it was folding the exact same WU.

Code: Select all

--- Opening Log file [March 1 22:05:40 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/alex/Folding@home
Executable: ./fah6
Arguments: -smp 8 -verbosity 9 -advmethods 

[22:05:40] - Ask before connecting: No
[22:05:40] - User name: Stonecold (Team 214237)
[22:05:40] - User ID: 716CF0C42F1B31C2
[22:05:40] - Machine ID: 1
[22:05:40] 
[22:05:40] Loaded queue successfully.
[22:05:40] 
[22:05:40] - Autosending finished units... [March 1 22:05:40 UTC]
[22:05:40] + Processing work unit
[22:05:40] Trying to send all finished work units
[22:05:40] Core required: FahCore_a3.exe
[22:05:40] + No unsent completed units remaining.
[22:05:40] Core found.
[22:05:40] - Autosend completed
[22:05:41] Working on queue slot 08 [March 1 22:05:41 UTC]
[22:05:41] + Working ...
[22:05:41] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 08 -np 8 -checkpoint 10 -verbose -lifeline 2760 -version 634'

[22:05:42] 
[22:05:42] *------------------------------*
[22:05:42] Folding@Home Gromacs SMP Core
[22:05:42] Version 2.27 (Dec. 15, 2010)
[22:05:42] 
[22:05:42] Preparing to commence simulation
[22:05:42] - Ensuring status. Please wait.
[22:05:51] - Looking at optimizations...
[22:05:51] - Working with standard loops on this execution.
[22:12:47] - Created dyn
[22:12:47] - Files status OK
[22:12:47] - Expanded 3811216 -> 4169428 (decompressed 109.3 percent)
[22:12:47] Called DecompressByteArray: compressed_data_size=3811216 data_size=4169428, decompressed_data_size=4169428 diff=0
[22:12:47] - Digital signature verified
[22:12:47] 
[22:12:47] Project: 6097 (Run 0, Clone 40, Gen 132)
[22:12:47] 
[22:12:47] Entering M.D.
[22:12:53] Mapping NT from 8 to 8 
[22:12:54] Completed 0 out of 500000 steps  (0%)

NOTE: Turning on dynamic load balancing

[22:27:02] Completed 5000 out of 500000 steps  (1%)
So what caused this? Is it something about this specific WU, or just a one-time thing? I'll report back on whether this second run finishes correctly, but right now it's only at 1%. I don't think it re-downloaded the WU, though; it just started over from the beginning.
Jesse_V
Site Moderator
Posts: 2850
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by Jesse_V »

Hmm. My guess is that some sort of file corruption went on. After the "Done" step, it's been my observation that the client does some file-closing operations, shuts down the core, and then sends the unit. I've had WUs corrupted by interrupting these operations, so you're just recrunching the same WU. It must have deleted the checkpoints, or couldn't read them.
F@h is now the top computing platform on the planet, and nothing unites people like a dedicated fight against a common enemy. This virus affects all of us. Let's end it together.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by bruce »

What happened was that you got impatient.

Code: Select all

[10:32:17] ... Done.
[10:32:18] - Shutting down core
[10:32:18]
[10:32:18] Folding@home Core Shutdown: FINISHED_UNIT
[10:32:21] CoreStatus = 64 (100)
Notice that even after the results of a WU are calculated and the log says "... Done.", the core still has not shut down cleanly and closed all the files until you see a CoreStatus message. Since the files had not yet been prepared for upload, the WU was not finished. When you restarted the client, it discovered the unfinished WU, but with most of the results missing (they had never been packaged for upload), so it started the WU from the beginning.
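One rough way to tell when it's actually safe to stop the client is to watch the log for that CoreStatus line rather than the "... Done." line. A sketch; the log path assumes the default v6 client directory, so adjust it to wherever your client runs:

```shell
# Follow the client log and exit as soon as the core reports its
# shutdown status (i.e. it has finished closing files cleanly).
# ~/Folding@home/FAHlog.txt is the default v6 log location.
tail -f ~/Folding@home/FAHlog.txt | grep --line-buffered -m1 'CoreStatus'
```

Once that command prints a line, the core is done and restarting the client is safe.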

Do you happen to be running on an ext4 filesystem?
Stonecold
Posts: 332
Joined: Sun Dec 25, 2011 9:20 pm

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by Stonecold »

Jesse_V wrote:Hmm. My guess is that some sort of file corruption went on. After the "Done" step, it's been my observation that the client does some file-closing operations, shuts down the core, and then sends the unit. I've had WUs corrupted by interrupting these operations, so you're just recrunching the same WU. It must have deleted the checkpoints, or couldn't read them.
Is there any way to prevent this? I don't believe I was doing anything that could have caused data corruption. As far as I know there wasn't any significant disk activity at the time (I was just reading a web page), at least nothing I started manually.
Stonecold
Posts: 332
Joined: Sun Dec 25, 2011 9:20 pm

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by Stonecold »

bruce wrote:What happened was that you got impatient.

Code: Select all

[10:32:17] ... Done.
[10:32:18] - Shutting down core
[10:32:18]
[10:32:18] Folding@home Core Shutdown: FINISHED_UNIT
[10:32:21] CoreStatus = 64 (100)
Notice that even after the results of a WU are calculated and the log says "... Done.", the core still has not shut down cleanly and closed all the files until you see a CoreStatus message. Since the files had not yet been prepared for upload, the WU was not finished. When you restarted the client, it discovered the unfinished WU, but with most of the results missing (they had never been packaged for upload), so it started the WU from the beginning.

Do you happen to be running on an ext4 filesystem?
Yes, I am. This is what I got from the "df" command in a terminal:

Code: Select all

alex@ubuntu:~$ df -T
Filesystem    Type   1K-blocks      Used Available Use% Mounted on
/dev/loop0    ext4    48061028  17231100  28388524  38% /
udev      devtmpfs     4058372        12   4058360   1% /dev
tmpfs        tmpfs     1628056      1088   1626968   1% /run
none         tmpfs        5120         0      5120   0% /run/lock
none         tmpfs     4070132      2532   4067600   1% /run/shm
/dev/sda2  fuseblk   717197308 637191620  80005688  89% /host
And how patient do I have to be? I waited 15 minutes the first time (it usually takes several seconds), and 7 minutes the second time. It's never taken this long before, so I assumed something was wrong. Should I have waited even longer?
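For completeness, the mount options actually in effect can also be checked from /proc/mounts. A sketch; /dev/loop0 is this system's root device per the df output above:

```shell
# Show every ext4 mount and the options it was mounted with.
grep ext4 /proc/mounts
```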
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by 7im »

Depending on the size of the WU, and whether you were using the PC at the same time, yes, waiting is always better, especially when the client is at the end of a work unit. The very end of a work unit is the worst possible time to restart the client.

As a comparison, go back and look at your logs for similar WUs. How long did those take to finish?
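That comparison can be pulled straight out of an old log by grepping for the end-of-unit markers; the gap between the "... Done." timestamp and the Core Shutdown timestamp is how long the finishing step took. A sketch; FAHlog-Prev.txt is the v6 client's rotated log name, so adjust if yours differs:

```shell
# Print the end-of-unit lines from the previous log; compare their
# timestamps to see how long the finish/shutdown phase took.
grep -E '\.\.\. Done\.|Folding@home Core Shutdown' ~/Folding@home/FAHlog-Prev.txt
```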
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Stonecold
Posts: 332
Joined: Sun Dec 25, 2011 9:20 pm

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by Stonecold »

7im wrote:Depending on the size of the WU, and whether you were using the PC at the same time, yes, waiting is always better, especially when the client is at the end of a work unit. The very end of a work unit is the worst possible time to restart the client.

As a comparison, go back and look at your logs for similar WUs. How long did those take to finish?
Okay, you're right. For other WUs it took between 10 and 20 minutes; I must have been thinking of a different step when I thought it was quicker. Does Windows take as long to finish? I often use Windows for FAH, and I haven't noticed anything taking that long there (it might have, but if so I never noticed).
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by bruce »

For large WUs, I've seen reports of it taking two hours to close the files.

There's a long topic on the subject: viewtopic.php?f=55&t=17972
Stonecold
Posts: 332
Joined: Sun Dec 25, 2011 9:20 pm

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by Stonecold »

bruce wrote:For large WUs, I've seen reports of it taking two hours to close the files.

There's a long topic on the subject: foldingforum.org/viewtopic.php?f=55&t=17972
Okay, thank you. And just a question: if it takes so long but doesn't use much CPU, would it be possible for a second WU to start during that time? That would seem much more productive, if it's even possible. If the issue is just that the last WU needs to "clean up" the work directory before another can run, couldn't the new WU be sent to a temporary directory until then? And what is the core even doing that takes so much time (you said "closing the files", but what exactly does that mean)? Sorry for all the questions, but some of this is kind of confusing to me. :? Is there somewhere I can go to learn the finer details of how FAH works (the Wiki doesn't seem to have been updated in quite a while) so I don't keep making these mistakes?
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by bruce »

You can't start a second WU within the same directory, because both would try to use the same file names and corruption would be guaranteed. You can run another client/slot, since it would run in its own directory. Coordinating the two would be a nightmare.

The expert on what might be called the ext4 problem is "Tear", and he doesn't contribute to the Wiki. He has leveled some pretty strong criticisms at the way the FahCore writes its files, and for all I know he's probably right. I see no evidence that the Pande Group is doing anything about it. Frankly, I don't understand why it's a problem on Linux ext4 but not for ext3 or for Windows NTFS, since both of those are journaled filesystems too, but there are lots of differences between filesystems and I'm certainly not an expert.

Tear recommends using ext3, or adding a barrier=0 option to the filesystem mount command.
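Spelled out, that mount-option suggestion would look something like the following. A sketch, not a recommendation: the ext4 option is written barrier=0 (or nobarrier), the device and mount point here are just the ones from this thread, and disabling barriers trades some crash-safety for fewer stalls on fsync-heavy writes:

```shell
# One-off: remount the filesystem holding the FAH directory with
# write barriers disabled (requires root; takes effect immediately).
sudo mount -o remount,barrier=0 /

# Persistent: add barrier=0 to the options column in /etc/fstab, e.g.
# /dev/loop0  /  ext4  errors=remount-ro,barrier=0  0  1
```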
Stonecold
Posts: 332
Joined: Sun Dec 25, 2011 9:20 pm

Re: Project: 6097 (Run 0, Clone 40, Gen 132) running twice

Post by Stonecold »

bruce wrote:You can't start a second WU within the same directory, because both would try to use the same file names and corruption would be guaranteed. You can run another client/slot, since it would run in its own directory. Coordinating the two would be a nightmare.

The expert on what might be called the ext4 problem is "Tear", and he doesn't contribute to the Wiki. He has leveled some pretty strong criticisms at the way the FahCore writes its files, and for all I know he's probably right. I see no evidence that the Pande Group is doing anything about it. Frankly, I don't understand why it's a problem on Linux ext4 but not for ext3 or for Windows NTFS, since both of those are journaled filesystems too, but there are lots of differences between filesystems and I'm certainly not an expert.

Tear recommends using ext3, or adding a barrier=0 option to the filesystem mount command.
Would switching to ext3 come with any downsides (that is, is ext3 an outdated version of ext4, or are they just two different filesystems)? Also, my Linux installation was done through "Wubi" on Windows, so it's not on a separate partition; the entire Linux system is contained in a file called "root.disk". Since Windows uses NTFS, would it be possible to set FAH's working directory to somewhere on my drive other than "root.disk" without doing any harm?