Project 4433 (R84, C12, G16) & (R85, C2, G12)

Moderators: Site Moderators, FAHC Science Team

BrokenWolf
Posts: 126
Joined: Sat Aug 02, 2008 3:08 am

Project 4433 (R84, C12, G16) & (R85, C2, G12)

Post by BrokenWolf »

I had to reboot my host system, and on restart this is what I got on two systems running 4433 WUs. I shut down the clients the way I normally do (Ctrl-C), waited for any messages, and then shut the machines down.

The OS is RHEL4.5 U4 running in VMware Workstation v6.5.1, with 2 vCPUs, 2GB RAM and a 12GB HDD for each system. The host is a Vista x64 2.4GHz quad core with 8GB RAM, which also runs the GPU client on an nVidia 8800 GTS.

Code:

--- Opening Log file [March 16 04:22:41 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/timk/Fold
Executable: ./fah6
Arguments: -smp -advmethods -verbosity 9 

[04:22:41] - Ask before connecting: No
[04:22:41] - User name: BrokenWolf (Team 1971)
[04:22:41] - User ID: 5E4960574243B992
[04:22:41] - Machine ID: 1
[04:22:41] 
[04:22:42] Loaded queue successfully.
[04:22:42] 
[04:22:42] + Processing work unit
[04:22:42] At least 4 processors must be requested.Core required: FahCore_a2.exe
[04:22:42] Core found.
[04:22:42] - Autosending finished units... [March 16 04:22:42 UTC]
[04:22:42] Trying to send all finished work units
[04:22:42] + No unsent completed units remaining.
[04:22:42] - Autosend completed
[04:22:42] Working on queue slot 02 [March 16 04:22:42 UTC]
[04:22:42] + Working ...
[04:22:42] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 5523 -version 624'

[04:22:42] 
[04:22:42] *------------------------------*
[04:22:42] Folding@Home Gromacs SMP Core
[04:22:42] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[04:22:42] 
[04:22:42] Preparing to commence simulation
[04:22:42] - Ensuring status. Please wait.
[04:22:42] Working with standard loops on this execution.
[04:22:42] - Files status OK
[04:22:42] percent)
[04:22:42] Called DecompressByteArray: compressed_data_size=229076 data_size=1114469, decompressed_data_size=1114469 diff=0
[04:22:42] - Digital signature verified
[04:22:42] 
[04:22:42] Project: 4433 (Run 84, Clone 12, Gen 16)
[04:22:42] 
[04:22:42] Assembly optimizations on if available.
[04:22:42] Entering M.D.
[04:22:42] ing M.D.
[04:22:48] me from checkpoint file
[04:22:48] int file
[04:22:57] ill resume from checkpoint file
[04:22:58] Resuming from checkpoint
[04:22:58] fcSaveRestoreState: I/O failed dir=0, var=0000000000A0AB70, varsize=51372
[04:22:58] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore state.

And here is the shutdown and restart on the R85 WU:

Code:

[01:58:38] Completed 1150002 out of 2500000 steps  (46%)
[02:08:18] Completed 1175002 out of 2500000 steps  (47%)
[02:17:58] Completed 1200002 out of 2500000 steps  (48%)
[02:27:37] Completed 1225002 out of 2500000 steps  (49%)
[02:37:18] Completed 1250002 out of 2500000 steps  (50%)
[02:44:15] ***** Got an Activate signal (2)
[02:44:15] Killing all core threads

Folding@Home Client Shutdown.


--- Opening Log file [March 16 04:23:28 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.24beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/timk/Folding
Executable: ./fah6
Arguments: -smp -verbosity 9 

[04:23:28] - Ask before connecting: No
[04:23:28] - User name: BrokenWolf (Team 1971)
[04:23:29] - User ID: 442758B329BB4AD7
[04:23:29] - Machine ID: 2
[04:23:29] 
[04:23:29] Loaded queue successfully.
[04:23:29] 
[04:23:29] + Processing work unit
[04:23:29] At least 4 processors must be requested.Core required: FahCore_a2.exe
[04:23:29] Core found.
[04:23:29] - Autosending finished units... [March 16 04:23:29 UTC]
[04:23:29] Trying to send all finished work units
[04:23:29] + No unsent completed units remaining.
[04:23:29] - Autosend completed
[04:23:29] Working on queue slot 00 [March 16 04:23:29 UTC]
[04:23:29] + Working ...
[04:23:29] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -priority 96 -checkpoint 15 -verbose -lifeline 5455 -version 624'

[04:23:29] 
[04:23:29] *------------------------------*
[04:23:29] Folding@Home Gromacs SMP Core
[04:23:29] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[04:23:29] 
[04:23:29] Preparing to commence simulation
[04:23:29] - Ensuring status. Please wait.
[04:23:29] Files status OK
[04:23:30] - Expanded 229041 -> 1114469 (decompressed 486.5 percent)
[04:23:30] Called DecompressByteArray: compressed_data_size=229041 data_size=1114469, decompressed_data_size=1114469 diff=0
[04:23:30] - Digital signature verified
[04:23:30] 
[04:23:30] Project: 4433 (Run 85, Clone 2, Gen 12)
[04:23:30] 
[04:23:30] Assembly optimizations on if available.
[04:23:30] Entering M.D.
[04:23:36] Will resume from checkpoint file
[04:23:39] ng M.D.
[04:23:45] Will resume from checkpoint file
[04:23:46] te: I/O failed dir=0, var=0000000000A0AB70, varsize=49704
[04:23:46] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore state.
susato
Site Moderator
Posts: 513
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: Project 4433 (R84, C12, G16) & (R85, C2, G12)

Post by susato »

The Project: 4433 (Run 84, Clone 12, Gen 16) WU has not been completed by anyone else, but the previous generation was completed at 9:40 a.m. on 3/15, shortly before you got it. Similar results for the Project: 4433 (Run 85, Clone 2, Gen 12) WU.

I doubt that the problem is the work units, as they don't have a history of failure, but I'll pass along the URL of this thread to the developer of the Linux client, who has been trying to fix checkpointing failures in the clients.
BrokenWolf
Posts: 126
Joined: Sat Aug 02, 2008 3:08 am

Re: Project 4433 (R84, C12, G16) & (R85, C2, G12)

Post by BrokenWolf »

Thank you, my dear. :) My main concern was that the WUs did not resume from their checkpoints.

Broken
susato
Site Moderator
Posts: 513
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: Project 4433 (R84, C12, G16) & (R85, C2, G12)

Post by susato »

That's my chief concern too (my dear :) ) and I presented the info on the units themselves in order to rule them out as the problem.

As far as I can tell, you seem to have the latest (most robust) versions of the clients, so I can't even advise you to upgrade! :wink:

Please post again if you see more checkpointing failures.
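
If it helps with spotting these, here is a minimal sketch (not an official tool) of a Python script that scans client log text for the failure line quoted above and reports which Project/Run/Clone/Gen it belongs to. The function name find_checkpoint_failures and the SAMPLE text are my own illustration; the patterns are copied from the log lines in this thread, and I am assuming other failures are logged with the same wording.

```python
import re

# Pattern for the "Project: 4433 (Run 84, Clone 12, Gen 16)" lines seen in the logs above.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
# Failure text copied from the logs in this thread; other wordings are not covered.
FAILURE_MARKER = "fcCheckPointResume: failure"


def find_checkpoint_failures(log_text):
    """Return (project, run, clone, gen) tuples for WUs whose checkpoint resume failed."""
    failures = []
    current = None  # most recent Project line seen so far
    for line in log_text.splitlines():
        match = PROJECT_RE.search(line)
        if match:
            current = tuple(int(g) for g in match.groups())
        elif FAILURE_MARKER in line and current is not None:
            if current not in failures:  # report each WU once
                failures.append(current)
    return failures


# A few lines taken from the first log in this thread, used as sample input.
SAMPLE = """\
[04:22:42] Project: 4433 (Run 84, Clone 12, Gen 16)
[04:22:58] Resuming from checkpoint
[04:22:58] fcSaveRestoreState: I/O failed dir=0, var=0000000000A0AB70, varsize=51372
[04:22:58] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore state.
"""

print(find_checkpoint_failures(SAMPLE))  # prints [(4433, 84, 12, 16)]
```

To run it against a real log, read the client's log file into a string and pass that in place of SAMPLE. The exact file name and location depend on your install, so check your launch directory.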