Project: 2669 (Run 12, Clone 165, Gen 106)
Posted: Mon Apr 06, 2009 10:23 pm
by alpha754293
I have NO idea what the problem is with this one.
The process listing (ps -ef | grep Fah) showed the core processes were "<defunct>".
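For anyone who wants to script that check, here is a minimal Python sketch that filters a `ps -ef` listing for Fah processes marked `<defunct>`. The function names are made up for illustration, and it assumes a *nix `ps` that accepts `-ef`.

```python
import subprocess


def defunct_fah_processes(ps_output):
    """Return the lines of a `ps -ef` listing that name a Fah process
    and are marked <defunct> (i.e. zombie processes)."""
    return [
        line
        for line in ps_output.splitlines()
        if "Fah" in line and "<defunct>" in line
    ]


def list_processes():
    """Run `ps -ef` and return its output as text."""
    result = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    for line in defunct_fah_processes(list_processes()):
        print(line)
```

An empty result only means no Fah process is currently a zombie; it does not prove that the core is making progress.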
Here's the FAHlog output. No other messages were printed to the console.
[14:58:54]
[14:58:54] *------------------------------*
[14:58:54] Folding@Home Gromacs SMP Core
[14:58:54] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[14:58:54]
[14:58:54] Preparing to commence simulation
[14:58:54] - Ensuring status. Please wait.
[14:59:04] - Looking at optimizations...
[14:59:04] - Working with standard loops on this execution.
[14:59:04] - Files status OK
[14:59:05] - Expanded 4835356 -> 23974209 (decompressed 495.8 percent)
[14:59:05] Called DecompressByteArray: compressed_data_size=4835356 data_size=23974209, decompressed_data_size=23974209 diff=0
[14:59:05] - Digital signature verified
[14:59:05]
[14:59:05] Project: 2669 (Run 12, Clone 165, Gen 106)
[14:59:05]
[20:58:54] - Autosending finished units... [April 6 20:58:54 UTC]
[20:58:54] Trying to send all finished work units
[20:58:54] + No unsent completed units remaining.
You can see that it sat there for about 6 hours without doing anything.
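A stall like this is easy to miss in a long log. As a rough sketch, the following Python scans the bracketed `[HH:MM:SS]` timestamps and reports any jump larger than a threshold. `find_gaps` and the one-hour default are assumptions for illustration. Most log lines carry no date, so a backwards jump is treated as a single midnight rollover.

```python
import re
from datetime import timedelta

# Matches the leading "[HH:MM:SS]" stamp on a FAHlog line.
STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")


def find_gaps(lines, threshold=timedelta(hours=1)):
    """Return (previous_line, next_line, gap) for every pair of
    consecutive timestamped lines at least `threshold` apart."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # banner lines and warnings have no timestamp
        hours, minutes, seconds = map(int, match.groups())
        now = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        if prev_time is not None:
            gap = now - prev_time
            if gap < timedelta(0):
                # Assumption: the clock wrapped past midnight exactly once.
                gap += timedelta(days=1)
            if gap >= threshold:
                gaps.append((prev_line, line, gap))
        prev_time, prev_line = now, line
    return gaps


if __name__ == "__main__":
    # Lines copied from the log above.
    sample = [
        "[14:59:05] Project: 2669 (Run 12, Clone 165, Gen 106)",
        "[14:59:05]",
        "[20:58:54] - Autosending finished units... [April 6 20:58:54 UTC]",
    ]
    for before, after, gap in find_gaps(sample):
        print(f"{gap} of silence between {before!r} and {after!r}")
```

Note that a healthy work unit also logs progress lines only every several minutes, so a threshold well above that interval avoids false alarms.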
Re: Project: 2669 (Run 12, Clone 165, Gen 106)
Posted: Mon Apr 06, 2009 11:04 pm
by bruce
If this were MS Windows, I'd suggest that the installation of MPI had not been done properly and reinstalling the 3rd party app would probably clear the problem. Since this is *nix, that explanation doesn't apply.
Did you recently get a new copy of FahCore_a2 ?
Re: Project: 2669 (Run 12, Clone 165, Gen 106)
Posted: Mon Apr 06, 2009 11:55 pm
by road-runner
bruce wrote: If this were MS Windows, I'd suggest that the installation of MPI had not been done properly and reinstalling the 3rd party app would probably clear the problem. Since this is *nix, that explanation doesn't apply.
Did you recently get a new copy of FahCore_a2 ?
Not sure when 2.04 got in there, Bruce. Most of my rigs are still using 2.01, but I see this one isn't. I deleted everything in the work folder and the queue, and it got another WU and is working fine.
Re: Project: 2669 (Run 12, Clone 165, Gen 106)
Posted: Tue Apr 07, 2009 1:19 am
by alpha754293
bruce wrote: If this were MS Windows, I'd suggest that the installation of MPI had not been done properly and reinstalling the 3rd party app would probably clear the problem. Since this is *nix, that explanation doesn't apply.
Did you recently get a new copy of FahCore_a2 ?
No. It's been the same 2.04 since the last update was posted on here.
(ha ha...I guess I missed the 2.05 update that was posted on here).
There was nothing particularly "unusual" about that run except that I did restart it once prior to that and it had the same issue.
Re: Project: 2669 (Run 12, Clone 165, Gen 106)
Posted: Thu Apr 09, 2009 4:11 pm
by dmearns
I started seeing this same issue yesterday on 2 machines. I restarted folding, and they started reprocessing the same unit from the beginning. The unit was not downloaded again; the client reprocessed the existing unit. Today, these 2 machines were stuck in exactly the same way, and 2 more machines had joined them. I started over in a new directory, and it downloaded the 2.04 version of the core (I had been using 2.01). I am hoping the new units will process normally. All 4 machines have been running SMP folding for many months without incident.
Affected units:
Project: 2669 (Run 11, Clone 8, Gen 108)
Project: 2669 (Run 15, Clone 148, Gen 104)
Project: 2677 (Run 36, Clone 86, Gen 4)
Project: 2669 (Run 10, Clone 136, Gen 41)
Log from the restart:
--- Opening Log file [April 8 05:11:04]
# SMP Client ##################################################################
###############################################################################
Folding@Home Client Version 6.02
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: /home/dmearns/FAH
Executable: /home/dmearns/FAH/fah6
Arguments: -smp -verbosity 9 -forceasm
Warning:
By using the -forceasm flag, you are overriding
safeguards in the program. If you did not intend to
do this, please restart the program without -forceasm.
If work units are not completing fully (and particularly
if your machine is overclocked), then please discontinue
use of the flag.
[05:11:04] - Ask before connecting: No
[05:11:04] - User name: chiana (Team 13149)
[05:11:04] - User ID: 2B758B140504A7C3
[05:11:04] - Machine ID: 1
[05:11:04]
[05:11:04] Loaded queue successfully.
[05:11:04] - Autosending finished units...
[05:11:04] Trying to send all finished work units
[05:11:04] + No unsent completed units remaining.
[05:11:04] - Autosend completed
[05:11:04]
[05:11:04] + Processing work unit
[05:11:04] Core required: FahCore_a2.exe
[05:11:04] Core found.
[05:11:04] Working on Unit 00 [April 8 05:11:04]
[05:11:04] + Working ...
[05:11:04] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 31156 -version 602'
[05:11:05]
[05:11:05] *------------------------------*
[05:11:05] Folding@Home Gromacs SMP Core
[05:11:05] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[05:11:05]
[05:11:05] Preparing to commence simulation
[05:11:05] - Ensuring status. Please wait.
[05:11:14] - Assembly optimizations manually forced on.
[05:11:14] - Not checking prior termination.
[05:11:14] Need version 206
[05:11:14] Error: Work unit read from disk is invalid
[05:11:16] - Expanded 4838141 -> 23979009 (decompressed 495.6 percent)
[05:11:16] Called DecompressByteArray: compressed_data_size=4838141 data_size=23979009, decompressed_data_size=23979009 diff=0
[05:11:16] - Digital signature verified
[05:11:16]
[05:11:16] Project: 2669 (Run 10, Clone 136, Gen 41)
[05:11:16]
[05:11:16] Assembly optimizations on if available.
[05:11:16] Entering M.D.
[05:20:51] Completed 2509 out of 249999 steps (1%)
...
[20:57:38] Completed 247509 out of 249999 steps (99%)
[21:07:08] Completed 249999 out of 249999 steps (100%)
[21:08:10]
[21:08:10] Finished Work Unit:
[21:08:25] - Reading up to 17602080 from "work/wudata_00.trr": Read 17602080
[21:08:25] trr file hash check passed.
[21:08:25] - Reading up to 4414924 from "work/wudata_00.xtc": Read 4414924
[21:08:25] xtc file hash check passed.
[21:08:25] edr file hash check passed.
[21:08:25] logfile size: 179443
[21:08:25] Leaving Run
[21:08:25] - Writing 22423711 bytes of core data to disk...
[21:08:25] ... Done.
[21:08:25] - Shutting down core
[23:11:04] - Autosending finished units...
[23:11:04] Trying to send all finished work units
[23:11:04] + No unsent completed units remaining.
[23:11:04] - Autosend completed
[05:11:04] - Autosending finished units...
[05:11:04] Trying to send all finished work units
[05:11:04] + No unsent completed units remaining.
[05:11:04] - Autosend completed
[11:11:04] - Autosending finished units...
[11:11:04] Trying to send all finished work units
[11:11:04] + No unsent completed units remaining.
[11:11:04] - Autosend completed
[15:11:30] ***** Got a SIGTERM signal (15)
[15:11:30] Killing all core threads
Folding@Home Client Shutdown.
Re: Project: 2669 (Run 12, Clone 165, Gen 106)
Posted: Thu Apr 09, 2009 4:30 pm
by bruce
dmearns wrote:
[05:11:05] Folding@Home Gromacs SMP Core
[05:11:05] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[05:11:14] Need version 206
The best approach in this situation is to stop Fah-SMP (at the end of a WU, if possible), delete the FahCore manually, and restart. That should make sure that you're using the current version. Updates to the FahCore are supposed to happen automatically, but if they don't, you can always force an update using this method.
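As a minimal sketch of that procedure in Python: the first function checks a log for the mismatch quoted above, and the second deletes the core file so the client can fetch a fresh one on restart. Both function names are hypothetical. The comparison assumes that "Need version 206" corresponds to core version 2.06 (so 2.01 maps to 201), which the log suggests but does not state. The core file name and launch directory are the ones shown in dmearns's log. Stop the client before running the delete step.

```python
import re
from pathlib import Path

# The core banner line looks like "Version 2.01 (Wed Aug 13 ...)".
# The trailing " (" keeps this from matching "Client Version 6.02".
CORE_VERSION = re.compile(r"\bVersion (\d+)\.(\d+) \(")
NEEDED_VERSION = re.compile(r"Need version (\d+)")


def core_is_outdated(log_text):
    """Return True if the log reports a core older than the version
    the work unit asks for. Assumes 'Need version 206' means 2.06."""
    have = CORE_VERSION.search(log_text)
    need = NEEDED_VERSION.search(log_text)
    if not (have and need):
        return False  # nothing to compare
    installed = int(have.group(1)) * 100 + int(have.group(2))
    return installed < int(need.group(1))


def remove_core(launch_dir, core_name="FahCore_a2.exe"):
    """Delete the core file from the client's launch directory.
    Returns True if a file was removed, False if none was present."""
    core = Path(launch_dir) / core_name
    if not core.is_file():
        return False
    core.unlink()
    return True


if __name__ == "__main__":
    launch_dir = Path("/home/dmearns/FAH")  # launch directory from the log
    log_file = launch_dir / "FAHlog.txt"
    if log_file.is_file() and core_is_outdated(log_file.read_text(errors="replace")):
        print("Core removed." if remove_core(launch_dir) else "No core file found.")
```

The check only looks at the first version banner and the first "Need version" line in the file, so on a long log it reflects the oldest run recorded there, not necessarily the latest one.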
Re: Project: 2669 (Run 12, Clone 165, Gen 106)
Posted: Thu Apr 09, 2009 6:34 pm
by susato
Checking those affected units:
Project: 2669 (Run 11, Clone 8, Gen 108)
No data in the database, prior generation finished 2009-04-07 10:21:03
Project: 2669 (Run 15, Clone 148, Gen 104)
Finished successfully on 2009-04-08 04:09:41
Project: 2677 (Run 36, Clone 86, Gen 4)
No data in the database, prior generation was completed 2009-04-06 16:06:09
Project: 2669 (Run 10, Clone 136, Gen 41)
Finished successfully on 2009-04-07 00:05:58