Project: 2674 (Run 2, Clone 185, Gen 69)

Posted: Fri Dec 12, 2008 7:05 pm
by error10
I got one of these about two hours ago and so far it has printed no frames. It also hasn't updated unitinfo.txt, so I have no idea if it's making any progress, though I can see it chewing up CPU in 'top'. (A normal 1920-point WU does about one frame in 9:25 or so on this particular computer.) How can I find out the status of this work unit?

Re: Project: 2674 (Run 2, Clone 185, Gen 69)

Posted: Fri Dec 12, 2008 10:57 pm
by toTOW
Try restarting the client to see if that produces some log or screen output.

Re: Project: 2674 (Run 2, Clone 185, Gen 69)

Posted: Sat Dec 13, 2008 6:22 am
by GTron
I am having a problem with this WU as well. It is running SLOW. A section of FAHlog.txt follows, starting with the end of the previous WU (also a 2674, but Run 1, Clone 42, Gen 86) for comparison. This system has been stable for some time now. I am going to kill and restart the client to see if it makes a difference, and will report back.
Greg

Code:

[00:13:59] Completed 225000 out of 250000 steps  (90%)
[00:19:31] Completed 227500 out of 250000 steps  (91%)
[00:25:03] Completed 230000 out of 250000 steps  (92%)
[00:30:36] Completed 232500 out of 250000 steps  (93%)
[00:36:09] Completed 235000 out of 250000 steps  (94%)
[00:41:42] Completed 237500 out of 250000 steps  (95%)
[00:47:14] Completed 240000 out of 250000 steps  (96%)
[00:52:47] Completed 242500 out of 250000 steps  (97%)
[00:58:20] Completed 245000 out of 250000 steps  (98%)
[01:03:52] Completed 247500 out of 250000 steps  (99%)
[01:09:24] Completed 250000 out of 250000 steps  (100%)
[01:10:25] 
[01:10:25] Finished Work Unit:
[01:10:25] - Reading up to 21144528 from "work/wudata_08.trr": Read 21144528
[01:10:25] trr file hash check passed.
[01:10:25] - Reading up to 4509196 from "work/wudata_08.xtc": Read 4509196
[01:10:25] xtc file hash check passed.
[01:10:25] edr file hash check passed.
[01:10:25] logfile size: 177178
[01:10:25] Leaving Run
[01:10:25] - Writing 26030806 bytes of core data to disk...
[01:10:25]   ... Done.
[01:10:28] - Shutting down core
[01:10:28] 
[01:10:28] Folding@home Core Shutdown: FINISHED_UNIT
[01:13:47] CoreStatus = 64 (100)
[01:13:47] Unit 8 finished with 87 percent of time to deadline remaining.
[01:13:47] Updated performance fraction: 0.869598
[01:13:47] Sending work to server


[01:13:47] + Attempting to send results
[01:13:47] - Reading file work/wuresults_08.dat from core
[01:13:47]   (Read 26030806 bytes from disk)
[01:13:47] Connecting to http://171.67.108.24:8080/
[01:27:16] Posted data.
[01:27:16] Initial: 0000; - Uploaded at ~31 kB/s
[01:27:24] - Averaged speed for that direction ~31 kB/s
[01:27:24] + Results successfully sent
[01:27:24] Thank you for your contribution to Folding@Home.
[01:27:24] + Number of Units Completed: 366

[01:27:30] - Warning: Could not delete all work unit files (8): Core file absent
[01:27:30] Trying to send all finished work units
[01:27:30] + No unsent completed units remaining.
[01:27:30] - Preparing to get new work unit...
[01:27:30] + Attempting to get work packet
[01:27:30] - Will indicate memory of 1536 MB
[01:27:30] - Connecting to assignment server
[01:27:30] Connecting to http://assign.stanford.edu:8080/
[01:27:30] Posted data.
[01:27:30] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[01:27:30] + News From Folding@Home: Welcome to Folding@Home
[01:27:30] Loaded queue successfully.
[01:27:30] Connecting to http://171.67.108.24:8080/
[01:27:36] Posted data.
[01:27:36] Initial: 0000; - Receiving payload (expected size: 4846156)
[01:27:49] - Downloaded at ~364 kB/s
[01:27:49] - Averaged speed for that direction ~425 kB/s
[01:27:49] + Received work.
[01:27:49] Trying to send all finished work units
[01:27:49] + No unsent completed units remaining.
[01:27:49] + Closed connections
[01:27:49] 
[01:27:49] + Processing work unit
[01:27:49] Core required: FahCore_a2.exe
[01:27:49] Core found.
[01:27:49] Working on Unit 09 [December 12 01:27:49]
[01:27:49] + Working ...
[01:27:49] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 09 -checkpoint 20 -verbose -lifeline 7297 -version 602'

[01:27:49] 
[01:27:49] *------------------------------*
[01:27:49] Folding@Home Gromacs SMP Core
[01:27:49] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[01:27:49] 
[01:27:49] Preparing to commence simulation
[01:27:49] - Ensuring status. Please wait.
[01:27:50] Called DecompressByteArray: compressed_data_size=4845644 data_size=24004849, decompressed_data_size=24004849 diff=0
[01:27:50] - Digital signature verified
[01:27:50] 
[01:27:50] Project: 2674 (Run 2, Clone 185, Gen 69)
[01:27:50] 
[01:27:50] Assembly optimizations on if available.
[01:27:50] Entering M.D.
[01:28:00] Run 2, Clone 185, Gen 69)
[01:28:00] 
[01:28:00] Entering M.D.
[06:09:25] - Autosending finished units...
[06:09:25] Trying to send all finished work units
[06:09:25] + No unsent completed units remaining.
[06:09:25] - Autosend completed
[06:16:16] 1%)
[11:04:23] Completed 255008 out of 12750000 steps  (2%)
[12:09:25] - Autosending finished units...
[12:09:25] Trying to send all finished work units
[12:09:25] + No unsent completed units remaining.
[12:09:25] - Autosend completed
[15:52:51] Completed 382508 out of 12750000 steps  (3%)
[18:09:25] - Autosending finished units...
[18:09:25] Trying to send all finished work units
[18:09:25] + No unsent completed units remaining.
[18:09:25] - Autosend completed
[20:41:11] Completed 510008 out of 12750000 steps  (4%)
[00:09:25] - Autosending finished units...
[00:09:25] Trying to send all finished work units
[00:09:25] + No unsent completed units remaining.
[00:09:25] - Autosend completed
[01:29:33] Completed 637508 out of 12750000 steps  (5%)

Re: Project: 2674 (Run 2, Clone 185, Gen 69)

Posted: Sat Dec 13, 2008 10:08 am
by error10
I had already restarted it, an hour after I initially started it, and nothing happened. I let it run overnight, and woke up to this:

Code:

[17:58:21] Project: 2674 (Run 2, Clone 185, Gen 69)
[17:58:21] 
[17:58:21] Entering M.D.
[17:58:27] Will resume from checkpoint file
[17:58:28] Resuming from checkpoint
[17:58:29] Verified work/wudata_02.log
[17:58:29] Verified work/wudata_02.trr
[17:58:29] Verified work/wudata_02.xtc
[17:58:29] Verified work/wudata_02.edr
[01:59:02] Completed 127508 out of 12750000 steps  (1%)
[09:55:40] Completed 255008 out of 12750000 steps  (2%)

I'm deleting this one.

Re: Project: 2674 (Run 2, Clone 185, Gen 69)

Posted: Sat Dec 13, 2008 10:51 am
by Ivoshiee
What are your system specifications, and is anything else CPU-demanding running alongside the FAH client?

Re: Project: 2674 (Run 2, Clone 185, Gen 69)

Posted: Sat Dec 13, 2008 4:12 pm
by GTron
My restart of this WU has not made a difference (FAHlog.txt of the restart is below). Nothing besides the cores shows above 0% on the process list for all users. The 4 cores push total CPU utilization close to 90%, so they are getting and using the CPU they should. The system has a Q6600 @ 2.88GHz and 2GB memory, running Ubuntu 8.04.

This WU will NOT come anywhere close to meeting the deadline (local time is UTC-7):
issue: Thu Dec 11 18:26:47 2008; begin: Thu Dec 11 18:27:49 2008
expect: Wed Dec 31 21:47:19 2008; due: Sun Dec 14 18:27:49 2008 (3 days)
preferred: Sun Dec 14 18:27:49 2008 (3 days)
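
As a rough cross-check on the dates above, the total runtime can be estimated from the "Completed ... steps" lines in FAHlog.txt. The sketch below is not part of the FAH client; the function name `estimate_total_runtime` and the parsing regex are assumptions for illustration. Since the log timestamps carry no date, it assumes consecutive progress lines are less than 24 hours apart and treats a backwards jump in the time of day as a midnight rollover.

```python
import re
from datetime import timedelta

# Matches client progress lines such as:
# [11:04:23] Completed 255008 out of 12750000 steps  (2%)
PROGRESS = re.compile(
    r"\[(\d\d):(\d\d):(\d\d)\] Completed (\d+) out of (\d+) steps"
)


def estimate_total_runtime(log_lines):
    """Project the runtime of the whole work unit from its progress lines."""
    samples = []        # (elapsed seconds, steps completed)
    day_offset = 0      # seconds added for each midnight rollover seen
    last_clock = None
    total_steps = None
    for line in log_lines:
        match = PROGRESS.search(line)
        if match is None:
            continue    # skip autosend messages and other log lines
        hours, minutes, seconds, done, total_steps = map(int, match.groups())
        clock = hours * 3600 + minutes * 60 + seconds
        if last_clock is not None and clock < last_clock:
            day_offset += 86400   # assumption: the clock wrapped past midnight
        last_clock = clock
        samples.append((clock + day_offset, done))
    if len(samples) < 2:
        raise ValueError("need at least two progress lines")
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    seconds_per_step = (t1 - t0) / (s1 - s0)
    return timedelta(seconds=seconds_per_step * total_steps)


# Progress lines copied from the first log of this WU
SAMPLE = [
    "[11:04:23] Completed 255008 out of 12750000 steps  (2%)",
    "[15:52:51] Completed 382508 out of 12750000 steps  (3%)",
    "[20:41:11] Completed 510008 out of 12750000 steps  (4%)",
    "[01:29:33] Completed 637508 out of 12750000 steps  (5%)",
]

if __name__ == "__main__":
    print("Projected total runtime:", estimate_total_runtime(SAMPLE))
```

Fed the 2% through 5% lines from the first log, this projects roughly 20 days for the whole unit, against a 3-day deadline.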

Perhaps the Pande Group should pull this WU back in house for investigation, or at least alert the researcher.

Greg

Code:

--- Opening Log file [December 13 06:28:33] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/smpfold/foldingathome/CPU1
Executable: /home/smpfold/foldingathome/CPU1/fah6
Arguments: -smp -verbosity 9 

[06:28:33] - Ask before connecting: No
[06:28:33] - User name: GTron (Team 0)
[06:28:33] - User ID: 76E5E3D439736F7C
[06:28:33] - Machine ID: 5
[06:28:33] 
[06:28:33] Loaded queue successfully.
[06:28:33] - Autosending finished units...
[06:28:33] Trying to send all finished work units
[06:28:33] + No unsent completed units remaining.
[06:28:33] - Autosend completed
[06:28:33] 
[06:28:33] + Processing work unit
[06:28:33] Core required: FahCore_a2.exe
[06:28:33] Core found.
[06:28:33] Working on Unit 09 [December 13 06:28:33]
[06:28:33] + Working ...
[06:28:33] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 09 -checkpoint 20 -verbose -lifeline 6105 -version 602'

[06:28:33] 
[06:28:33] *------------------------------*
[06:28:33] Folding@Home Gromacs SMP Core
[06:28:33] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[06:28:33] 
[06:28:33] Preparing to commence simulation
[06:28:33] - Ensuring status. Please wait.
[06:28:33] Files status OK
[06:28:34] - Expanded 4845644 -> 24004849 (decompressed 495.3 percent)
[06:28:34] Called DecompressByteArray: compressed_data_size=4845644 data_size=24004849, decompressed_data_size=24004849 diff=0
[06:28:34] - Digital signature verified
[06:28:34] 
[06:28:34] Project: 2674 (Run 2, Clone 185, Gen 69)
[06:28:34] 
[06:28:34] Assembly optimizations on if available.
[06:28:34] Entering M.D.
[06:28:40] Will resume from checkpoint file
[06:28:44] ng M.D.
[06:28:50] Will resume from checkpoint file
[06:28:51] Resuming from checkpoint
[06:28:51] Verified work/wudata_09.log
[06:28:52] Verified work/wudata_09.trr
[06:28:52] Verified work/wudata_09.xtc
[06:28:52] Verified work/wudata_09.edr
[06:28:52] Completed 765018 out of 12750000 steps  (6%)
[11:17:48] Completed 892508 out of 12750000 steps  (7%)
[12:28:33] - Autosending finished units...
[12:28:33] Trying to send all finished work units
[12:28:33] + No unsent completed units remaining.
[12:28:33] - Autosend completed
[16:05:22] Completed 1020008 out of 12750000 steps  (8%)
(edit for OS)

Re: Project: 2674 (Run 2, Clone 185, Gen 69)

Posted: Sun Dec 14, 2008 12:28 am
by kasson
This one has too many steps. We fixed a problem relating to this in the past, but one seems to have snuck past our checks. I'll pull and reformulate the work unit in the next day or two.
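
The logs earlier in the thread seem consistent with this: the time per step barely changes between the previous 2674 unit and this one, while the step count goes from 250000 to 12750000. A quick sketch of that arithmetic, using timestamps copied from GTron's first log (the helper names `hms` and `seconds_per_step` are made up for illustration):

```python
def hms(text):
    """Convert an 'HH:MM:SS' log timestamp to seconds since midnight."""
    hours, minutes, seconds = map(int, text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_per_step(start, end, steps):
    """Average wall-clock seconds per step between two progress lines."""
    # The modulo covers a single midnight rollover between the two lines.
    elapsed = (hms(end) - hms(start)) % 86400
    return elapsed / steps


# Previous WU (Run 1, Clone 42, Gen 86): 90% to 100% of 250000 steps
prev_rate = seconds_per_step("00:13:59", "01:09:24", 250000 - 225000)

# This WU (Run 2, Clone 185, Gen 69): 2% to 4% of 12750000 steps
slow_rate = seconds_per_step("11:04:23", "20:41:11", 510008 - 255008)

# How many times more steps this WU has than the previous one
step_ratio = 12750000 / 250000

if __name__ == "__main__":
    print(f"previous WU: {prev_rate:.3f} s/step")
    print(f"this WU:     {slow_rate:.3f} s/step")
    print(f"step count ratio: {step_ratio:.0f}x")
```

Both units come out at roughly 0.13 seconds per step, so the machine is not running any slower; the unit simply has 51 times as many steps.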