
Project: 2665 (Run 2, Clone 39, Gen 30) *hung* at 99%

Posted: Sun Jul 20, 2008 9:42 am
by Great_Gig
Project: 2665 (Run 2, Clone 39, Gen 30) hung at 99%

OK guys, the above WU halted at 99%. I tried restarting the system, but it didn't resume. I was running the 64-bit SMP client from notfred's folding CD. The log states that the core shut down, and I know of no reason why this should have happened. Could a power spike have caused it?

The part of FAHlog.txt from the time it occurred is below, and the full log is at the bottom. Can anything be done to salvage this WU? I see there is no FahCore_a1.exe in the folder on the USB flash drive anymore. Does it get removed when the core shuts down?

This is the log up to the time it shut down:
[01:15:03] Completed 237500 out of 250000 steps (95 percent)
[01:29:49] Writing local files
[01:29:49] Completed 240000 out of 250000 steps (96 percent)
[01:44:36] Writing local files
[01:44:37] Completed 242500 out of 250000 steps (97 percent)
[01:48:38] - Autosending finished units...
[01:48:38] Trying to send all finished work units
[01:48:38] + No unsent completed units remaining.
[01:48:38] - Autosend completed
[01:59:22] Writing local files
[01:59:23] Completed 245000 out of 250000 steps (98 percent)
[02:14:09] Writing local files
[02:14:09] Completed 247500 out of 250000 steps (99 percent)
[02:17:15]
[02:17:15] Folding@home Core Shutdown: INTERRUPTED
[07:48:38] - Autosending finished units...
[07:48:38] Trying to send all finished work units
[07:48:38] + No unsent completed units remaining.
[07:48:38] - Autosend completed

Additionally, I had this error message being displayed:

FahCore_a1.exe [549]: segfault at 9db840 rip 5cc669 rsp 407Fdd60 error 4

Any ideas - can this be salvaged or is it lost?

Full FAHlog.txt for info.


--- Opening Log file [July 19 02:48:37] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /etc/folding/1
Executable: ./fah6
Arguments: -local -forceasm -verbosity 9 -smp 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[02:48:37] - Ask before connecting: No
[02:48:37] - User name: GreatGig (Team 132987)
[02:48:37] - User ID not found locally
[02:48:37] + Requesting User ID from server
[02:48:37] - Getting ID from AS: 
[02:48:37] Connecting to http://assign.stanford.edu:8080/
[02:48:38] Posted data.
[02:48:38] Initial: 8F5D; - Received User ID = 5D8F39DD2CC5500F
[02:48:38] - Machine ID: 1
[02:48:38] 
[02:48:38] Work directory not found. Creating...
[02:48:38] Could not open work queue, generating new queue...
[02:48:38] - Autosending finished units...
[02:48:38] Trying to send all finished work units
[02:48:38] + No unsent completed units remaining.
[02:48:38] - Autosend completed
[02:48:38] - Preparing to get new work unit...
[02:48:38] + Attempting to get work packet
[02:48:38] - Will indicate memory of 2001 MB
[02:48:38] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 11
[02:48:38] - Connecting to assignment server
[02:48:38] Connecting to http://assign.stanford.edu:8080/
[02:48:39] Posted data.
[02:48:39] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[02:48:39] + News From Folding@Home: Welcome to Folding@Home
[02:48:39] Loaded queue successfully.
[02:48:39] Connecting to http://171.64.65.64:8080/
[02:48:44] Posted data.
[02:48:44] Initial: 0000; - Receiving payload (expected size: 4818659)
[01:49:04] - Downloaded at ~4194302 kB/s
[01:49:04] - Averaged speed for that direction ~4194302 kB/s
[01:49:04] + Received work.
[01:49:04] + Closed connections
[01:49:04] 
[01:49:04] + Processing work unit
[01:49:04] Core required: FahCore_a1.exe
[01:49:04] Core not found.
[01:49:04] - Core is not present or corrupted.
[01:49:04] - Attempting to download new core...
[01:49:04] + Downloading new core: FahCore_a1.exe
[01:49:04] Downloading core (/~pande/Linux/x86/Core_a1.fah from www.stanford.edu)
[01:49:04] Initial: AFDE; + 10240 bytes downloaded

*EDIT*

[01:49:14] Initial: 3A56; + 1484800 bytes downloaded
[01:49:14] Initial: D4FE; + 1490945 bytes downloaded
[01:49:14] Verifying core Core_a1.fah...
[01:49:14] Signature is VALID
[01:49:14] 
[01:49:14] Trying to unzip core FahCore_a1.exe
[01:49:15] Decompressed FahCore_a1.exe (3625104 bytes) successfully
[01:49:15] + Core successfully engaged
[01:49:20] 
[01:49:20] + Processing work unit
[01:49:20] Core required: FahCore_a1.exe
[01:49:20] Core found.
[01:49:20] Working on Unit 01 [July 19 01:49:20]
[01:49:20] + Working ...
[01:49:20] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 30 -forceasm -verbose -lifeline 525 -version 602'

[01:49:20] 
[01:49:20] *------------------------------*
[01:49:20] Folding@Home Gromacs SMP Core
[01:49:20] Version 1.74 (November 27, 2006)
[01:49:20] 
[01:49:20] Preparing to commence simulation
[01:49:20] - Ensuring status. Please wait.
[01:49:21] - Starting from initial work packet
[01:49:21] 
[01:49:21] Project: 2665 (Run 2, Clone 39, Gen 30)
[01:49:21] 
[01:49:21] Assembly optimizations on if available.
[01:49:21] Entering M.D.
[01:49:38]  on if available.
[01:49:38] Entering M.D.
[01:49:45] G with glycosylations
[01:49:45] osylations
[01:49:45] Writing local files
[01:49:45] Extra SSE boost OK.
[01:49:46] al files
[01:49:46] Completed 0 out of 250000 steps  (0 percent)
[02:04:34] Writing local files
[02:04:34] Completed 2500 out of 250000 steps  (1 percent)
[02:19:23] Writing local files
[02:19:23] Completed 5000 out of 250000 steps  (2 percent)
[02:34:11] Writing local files

*EDIT*

[01:15:03] Completed 237500 out of 250000 steps  (95 percent)
[01:29:49] Writing local files
[01:29:49] Completed 240000 out of 250000 steps  (96 percent)
[01:44:36] Writing local files
[01:44:37] Completed 242500 out of 250000 steps  (97 percent)
[01:48:38] - Autosending finished units...
[01:48:38] Trying to send all finished work units
[01:48:38] + No unsent completed units remaining.
[01:48:38] - Autosend completed
[01:59:22] Writing local files
[01:59:23] Completed 245000 out of 250000 steps  (98 percent)
[02:14:09] Writing local files
[02:14:09] Completed 247500 out of 250000 steps  (99 percent)
[02:17:15] 
[02:17:15] Folding@home Core Shutdown: INTERRUPTED
[07:48:38] - Autosending finished units...
[07:48:38] Trying to send all finished work units
[07:48:38] + No unsent completed units remaining.
[07:48:38] - Autosend completed

Re: Project: 2665 (Run 2, Clone 39, Gen 30) *hung* at 99%

Posted: Mon Jul 21, 2008 7:08 pm
by Great_Gig
Any ideas about this problem?

Re: Project: 2665 (Run 2, Clone 39, Gen 30) *hung* at 99%

Posted: Mon Jul 21, 2008 7:34 pm
by bruce
Great_Gig wrote:Any ideas about this problem?
Bumping is prohibited in our forums. People will answer when they're here if they have anything to offer (and sometimes when they don't have anything to offer). There are a lot more people who can answer questions during daytime hours in the USA. (The time in California is 8 hours earlier than yours so you asked at 3am for many of us.)

When there's a segfault, the core dies and the client tries to discard the WU and move on. How that interrupt is handled depends a great deal on the OS, and I'm not sure exactly how the code notfred is using deals with it. Most likely the WU was deleted and there's no chance of recovery.
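For anyone who wants to check a log for this pattern, here is a minimal Python sketch that reports the last progress line and any core shutdown status. It only assumes the line formats quoted in this thread; the function name `summarize_log` and the regular expressions are made up for illustration and are not part of the FAH client.

```python
import re

# Patterns based only on the FAHlog.txt lines quoted in this thread.
STEP_RE = re.compile(
    r"^\[(\d\d:\d\d:\d\d)\] Completed (\d+) out of (\d+) steps\s+\((\d+) percent\)"
)
SHUTDOWN_RE = re.compile(
    r"^\[(\d\d:\d\d:\d\d)\] Folding@home Core Shutdown: (\w+)"
)


def summarize_log(lines):
    """Return (last_progress, shutdown) found in an iterable of log lines.

    Either element is None if no matching line was seen.
    """
    last_progress = None
    shutdown = None
    for line in lines:
        m = STEP_RE.match(line)
        if m:
            last_progress = {
                "time": m.group(1),
                "done": int(m.group(2)),
                "total": int(m.group(3)),
                "percent": int(m.group(4)),
            }
            continue
        m = SHUTDOWN_RE.match(line)
        if m:
            shutdown = {"time": m.group(1), "status": m.group(2)}
    return last_progress, shutdown


# Sample lines copied from the log excerpt in the first post.
sample = [
    "[02:14:09] Writing local files",
    "[02:14:09] Completed 247500 out of 250000 steps (99 percent)",
    "[02:17:15]",
    "[02:17:15] Folding@home Core Shutdown: INTERRUPTED",
]
progress, shutdown = summarize_log(sample)
print(progress)
print(shutdown)
```

To run it on a real log, pass an open file instead of the sample list, e.g. `summarize_log(open("FAHlog.txt"))`. It only tells you where the core stopped; it cannot tell you whether the WU is still on disk.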

Even if there were some manual operation that could help, I wouldn't know how to do it on notfred's system. One of the original goals of FAH was a client that runs unattended. That means the client has to pick the best available option to keep folding. In many cases, that means discarding problematic work and moving on to something new, on the assumption that a new assignment will work correctly.

I'm going to assume that the segfault was a software bug even though there's a chance that it's an issue in your hardware. There still are some unresolved issues with FahCore_a1, and the segfaults are one of them. I don't know when fixes will be available.
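The kernel message quoted in the first post can be decoded a little further. The sketch below is a rough Python helper (the name `decode_segfault` and the regular expression are my own, written against the one line posted here) that splits the message into its fields and interprets the low three bits of the x86 page-fault error code: bit 0 is set when the page was present, bit 1 when the access was a write, and bit 2 when the fault came from user mode.

```python
import re

# Matches the segfault line format quoted in this thread; other kernel
# versions word the message differently, so treat this as an assumption.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\s*\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-fA-F]+) "
    r"rip (?P<rip>[0-9a-fA-F]+) rsp (?P<rsp>[0-9a-fA-F]+) "
    r"error (?P<err>[0-9a-fA-F]+)"
)


def decode_segfault(line):
    """Parse a kernel segfault line; return a dict, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the kernel prints the error code in hex
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "instruction_pointer": int(m.group("rip"), 16),
        "stack_pointer": int(m.group("rsp"), 16),
        "page_present": bool(err & 1),               # bit 0
        "access": "write" if err & 2 else "read",    # bit 1
        "mode": "user" if err & 4 else "kernel",     # bit 2
    }


line = "FahCore_a1.exe [549]: segfault at 9db840 rip 5cc669 rsp 407Fdd60 error 4"
info = decode_segfault(line)
print(info)
```

For the line posted here, error 4 means a user-mode read of a page that was not mapped. That describes what the core was doing when it died, but it does not by itself distinguish a software bug from bad hardware.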

Re: Project: 2665 (Run 2, Clone 39, Gen 30) *hung* at 99%

Posted: Tue Jul 22, 2008 10:30 pm
by Great_Gig
Thanks for the reply, didn't realise what I did was a problem and had to Google 'Bumping' to find out what you meant! Now I know and apologies for breaking the rules :oops: Won't happen again.

Guess I was unlucky with the WU, seems to happen to me..... but that's the price of SMP I guess, higher gain but higher risk of failure?

Re: Project: 2665 (Run 2, Clone 39, Gen 30) *hung* at 99%

Posted: Wed Jul 23, 2008 7:17 am
by bruce
Great_Gig wrote:..... but that's the price of SMP I guess, higher gain but higher risk of failure?
Yes . . . quite true.