One of my machines downloaded this unit earlier this evening. After more than an hour, it had not gotten past "Writing local files". Task Manager shows that the core is active and taking all of my CPU cycles, but nothing indicates that it is making any progress, even with -verbosity 9. The computer is a 1.8GHz Pentium M running XP SP2.
Dave
Project: 2494 (Run 0, Clone 1, Gen 0)
Re: Project: 2494 (Run 0, Clone 1, Gen 0)
This is one of the monster WUs. What checkpoints do you have? It would also help to post the log.
(Edit: sorry, "monster" as in "takes a long time to complete on my old machines"!)
Last edited by John_Weatherman on Thu Apr 09, 2009 10:52 am, edited 1 time in total.
Re: Project: 2494 (Run 0, Clone 1, Gen 0)
No one has posted results for this WU yet, though P2494 Run 0, Clones 2, 3 and 4 in Gen 0 completed back in October 2008.
Monster WUs? John_Weatherman, do you suspect that this is one of the units that comes up 10 to 100x the normal size?
Note that the WUs you had such trouble with last fall were p2492s with EUEs, and this one is a p2494 which stalls.
If you are getting nothing after "Writing local files", try stopping and restarting Folding, then post the section of the log where it starts up. If this is your very first work unit on the machine, it may be that you have the wrong client for your hardware. Without the log, it's hard to tell.
Daveb, are you getting any checkpoints? Here's a section of one of my old logs which shows what to look for:
Code: Select all
[21:49:19]
[21:49:19] Project: 2484 (Run 2, Clone 30, Gen 1)
[21:49:19]
[21:49:20] Assembly optimizations on if available.
[21:49:20] Entering M.D.
[21:49:30] Protein: system
[21:49:30]
[21:49:30] Writing local files
[21:49:38] Extra SSE boost OK.
[21:49:41] Writing local files
[21:49:41] Completed 0 out of 250000 steps (0%)
[21:52:15] - Autosending finished units... [November 20 21:52:15 UTC]
[21:52:15] Trying to send all finished work units
[21:52:15] + No unsent completed units remaining.
[21:52:15] - Autosend completed
[21:52:15] + Working...
[22:04:43] Timered checkpoint triggered.
[22:20:45] Timered checkpoint triggered.
[22:35:48] Timered checkpoint triggered.
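As a rough illustration of "what to look for", the check can be sketched in a few lines of Python. The marker strings are copied from the log excerpt above; the function name and the sample lists are hypothetical, and other cores may word these messages differently.

```python
import re

# Markers copied from the log excerpt above (assumption: other cores may differ).
STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps")
CHECKPOINT_TEXT = "Timered checkpoint triggered"
WRITE_TEXT = "Writing local files"


def progress_after_write(log_lines):
    """Return the progress lines that follow the first 'Writing local files' line."""
    seen_write = False
    progress = []
    for line in log_lines:
        if not seen_write:
            # Skip everything up to and including the first write marker.
            seen_write = WRITE_TEXT in line
            continue
        if STEP_RE.search(line) or CHECKPOINT_TEXT in line:
            progress.append(line.strip())
    return progress


healthy = [
    "[21:49:30] Writing local files",
    "[21:49:38] Extra SSE boost OK.",
    "[21:49:41] Writing local files",
    "[21:49:41] Completed 0 out of 250000 steps (0%)",
    "[22:04:43] Timered checkpoint triggered.",
]
# Hypothetical excerpt of a log that stops right after the write marker.
stalled = [
    "[12:00:00] Protein: system",
    "[12:00:01] Writing local files",
]

print(len(progress_after_write(healthy)))  # 2
print(progress_after_write(stalled))       # []
```

An empty result long after the write marker is only a hint that the unit has stalled; a slow machine on a large unit can also take a while to reach the first step line.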
Re: Project: 2494 (Run 0, Clone 1, Gen 0)
This machine has completed hundreds of units, and I have never noticed it take an hour to hit the first checkpoint on a normal unit. Judging by the other 249x units I have completed, I would have expected this machine to have completed the first 1% by then. I do remember problems with other badly formed units in the past where the core did not appear to make any progress for long periods. In those cases, Task Manager showed the core consuming more and more memory (up to >1000MB), the machine became unresponsive, and eventually an error was generated. That is not the case here, as the memory allocation stayed under 100MB on this unit. If this unit behaved similarly when previously issued, I can understand why there were no returns in the past 6 months.
As for this being a monster unit, I noticed that the download payload was only ~1.3 MB versus the >2 MB I remember seeing with other similar units. After expansion it was only 14 MB and, as I mentioned above, the memory allocation for the core appeared normal.
More than 35 minutes after "Writing local files", there was still no sign of the unit starting.
After more than 1 hour, nothing appears to have happened yet.
Dave
Code: Select all
[05:12:53] Loaded queue successfully.
[05:12:53] + Benchmarking ...
[05:12:57] The benchmark result is 3612
[05:12:57] - Preparing to get new work unit...
[05:12:57] - Presenting message box asking to network.
[05:12:57] - Autosending finished units...
[05:12:57] Trying to send all finished work units
[05:12:57] + No unsent completed units remaining.
[05:12:57] - Autosend completed
[05:12:58] + Attempting to get work packet
[05:12:58] - Will indicate memory of 1015 MB
[05:12:58] - Connecting to assignment server
[05:12:58] Connecting to http://assign.stanford.edu:8080/
[05:12:59] Posted data.
[05:12:59] Initial: 41AB; - Successful: assigned to (171.65.103.160).
[05:12:59] + News From Folding@Home: Welcome to Folding@Home
[05:12:59] Loaded queue successfully.
[05:12:59] Connecting to http://171.65.103.160:8080/
[05:13:05] Posted data.
[05:13:05] Initial: 0000; - Receiving payload (expected size: 1318390)
[05:13:24] - Downloaded at ~67 kB/s
[05:13:24] - Averaged speed for that direction ~198 kB/s
[05:13:24] + Received work.
[05:13:26] + Connections closed: You may now disconnect
[05:13:26]
[05:13:26] + Processing work unit
[05:13:26] Core required: FahCore_78.exe
[05:13:26] Core found.
[05:13:26] Working on Unit 09 [April 8 05:13:26]
[05:13:26] + Working ...
[05:13:26] - Calling 'FahCore_78.exe -dir work/ -suffix 09 -checkpoint 15 -verbose -lifeline 3796 -version 504'
[05:13:26]
[05:13:26] *------------------------------*
[05:13:26] Folding@Home Gromacs Core
[05:13:26] Version 1.90 (March 8, 2006)
[05:13:26]
[05:13:26] Preparing to commence simulation
[05:13:26] - Looking at optimizations...
[05:13:26] - Created dyn
[05:13:26] - Files status OK
[05:13:35] - Expanded 1317878 -> 13819905 (decompressed 1048.6 percent)
[05:13:36] - Starting from initial work packet
[05:13:36]
[05:13:36] Project: 2494 (Run 0, Clone 1, Gen 0)
[05:13:36]
[05:13:47] Assembly optimizations on if available.
[05:13:47] Entering M.D.
[05:13:57] Protein: system
[05:13:57]
[05:13:58] Writing local files
[05:49:12] ***** Got a SIGTERM signal (2)
[05:49:12] Killing all core threads
Folding@Home Client Shutdown.
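The payload sizes can be read straight off the "Expanded" line in the log above. The following is a minimal sketch, assuming decimal megabytes and assuming the core's percentage is simply the expanded size divided by the packed size; the function name is hypothetical.

```python
import re

# Pattern matches the "Expanded A -> B" line as it appears in the log above.
EXPANDED_RE = re.compile(r"Expanded (\d+) -> (\d+)")


def expansion(line):
    """Return (packed_bytes, unpacked_bytes, percent), or None if the line does not match."""
    match = EXPANDED_RE.search(line)
    if match is None:
        return None
    packed, unpacked = int(match.group(1)), int(match.group(2))
    return packed, unpacked, 100.0 * unpacked / packed


packed, unpacked, percent = expansion(
    "[05:13:35] - Expanded 1317878 -> 13819905 (decompressed 1048.6 percent)"
)
# Assumption: 1 MB = 1,000,000 bytes.
print(f"{packed / 1e6:.1f} MB -> {unpacked / 1e6:.1f} MB ({percent:.1f} percent)")
# 1.3 MB -> 13.8 MB (1048.6 percent)
```

The computed figure agrees with the percentage the core printed, and with the ~1.3 MB download and roughly 14 MB expanded size mentioned earlier in the thread.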
Code: Select all
[05:59:30] Loaded queue successfully.
[05:59:30] + Benchmarking ...
[05:59:34] The benchmark result is 3940
[05:59:34]
[05:59:34] + Processing work unit
[05:59:34] Core required: FahCore_78.exe
[05:59:34] Core found.
[05:59:34] - Autosending finished units...
[05:59:34] Trying to send all finished work units
[05:59:34] + No unsent completed units remaining.
[05:59:34] - Autosend completed
[05:59:34] Working on Unit 09 [April 8 05:59:34]
[05:59:34] + Working ...
[05:59:34] - Calling 'FahCore_78.exe -dir work/ -suffix 09 -checkpoint 15 -verbose -lifeline 3740 -version 504'
[05:59:34]
[05:59:34] *------------------------------*
[05:59:34] Folding@Home Gromacs Core
[05:59:34] Version 1.90 (March 8, 2006)
[05:59:34]
[05:59:34] Preparing to commence simulation
[05:59:34] - Looking at optimizations...
[05:59:34] - Files status OK
[05:59:45] - Expanded 1317878 -> 13819905 (decompressed 1048.6 percent)
[05:59:46]
[05:59:46] Project: 2494 (Run 0, Clone 1, Gen 0)
[05:59:46]
[05:59:57] Assembly optimizations on if available.
[05:59:57] Entering M.D.
[06:00:07] Protein: system
[06:00:07]
[06:00:08] Writing local files
[07:05:19] ***** Got a SIGTERM signal (2)
[07:05:19] Killing all core threads
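How long the core sat after "Writing local files" in each run can be worked out from the bracketed timestamps. This is a small sketch, assuming every line starts with an [HH:MM:SS] stamp and that a run crosses midnight at most once; the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_stamp(line):
    """Parse the leading [HH:MM:SS] stamp of a client log line."""
    return datetime.strptime(line[1:9], "%H:%M:%S")


def gap(earlier_line, later_line):
    """Elapsed time between two log lines, allowing one midnight rollover."""
    delta = parse_stamp(later_line) - parse_stamp(earlier_line)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta


first_run = gap("[05:13:58] Writing local files",
                "[05:49:12] ***** Got a SIGTERM signal (2)")
second_run = gap("[06:00:08] Writing local files",
                 "[07:05:19] ***** Got a SIGTERM signal (2)")
print(first_run)   # 0:35:14
print(second_run)  # 1:05:11
```

Those gaps correspond to the "more than 35 minutes" and "more than 1 hour" waits reported for the two attempts on the Pentium M.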
I transferred the unit to another computer (an old PowerMac G5), and it generated an EUE with an NaN almost immediately which it returned.
Code: Select all
[13:34:44]
[13:34:44] *------------------------------*
[13:34:44] Folding@Home Gromacs Core
[13:34:44] Version 1.90 (March 8, 2006)
[13:34:44]
[13:34:44] Preparing to commence simulation
[13:34:44] - Looking at optimizations...
[13:34:44] - Files status OK
[13:34:45] - Expanded 1317878 -> 13819905 (decompressed 1048.6 percent)
[13:34:46] - Checksums don't match (work/wudata_09.bed)
[13:34:46] - Starting from initial work packet
[13:34:46]
[13:34:46] Project: 2494 (Run 0, Clone 1, Gen 0)
[13:34:46]
[13:34:46] Assembly optimizations on if available.
[13:34:46] Entering M.D.
Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.
[13:34:58] Protein: system
[13:34:58]
[13:34:58] Writing local files
[13:35:01] Testing CPU type...
[13:35:01] Done testing.
[13:35:01] Extra AltiVec boost OK.
[13:35:04] Writing local files
[13:35:05] Completed 0 out of 250000 steps (0)
[13:35:05] Quit 101 - Fatal error: NaN detected: (ener[13])
[13:35:05]
[13:35:05] Simulation instability has been encountered. The run has entered a
[13:35:05] state from which no further progress can be made.
[13:35:05] This may be the correct result of the simulation, however if you
[13:35:05] often see other project units terminating early like this
[13:35:05] too, you may wish to check the stability of your computer (issues
[13:35:05] such as high temperature, overclocking, etc.).
[13:35:05] Going to send back what have done.
[13:35:05] logfile size: 25992
[13:35:05] - Writing 26555 bytes of core data to disk...
[13:35:05] ... Done.
[13:35:09]
[13:35:09] Folding@home Core Shutdown: EARLY_UNIT_END
[13:35:19] CoreStatus = 72 (114)
[13:35:19] Sending work to server
[13:35:19] + Attempting to send results
[13:35:19] - Reading file work/wuresults_09.dat from core
[13:35:19] (Read 26555 bytes from disk)
[13:35:19] > Press "c" to connect to the server
c[13:38:49] - Establishing connection
[13:38:50] + Results successfully sent
[13:38:50] Thank you for your contribution to Folding@Home.
Last edited by daveb on Wed Apr 08, 2009 2:00 pm, edited 1 time in total.
Re: Project: 2494 (Run 0, Clone 1, Gen 0)
Another one of my machines just downloaded the same unit, and appears to be stuck in the same place.
Dave
Re: Project: 2494 (Run 0, Clone 1, Gen 0)
I've notified the Pande Group about this issue. I think there may be issues here that need more research than the Forum Mods can provide.
Posting FAH's log: How to provide enough info to get helpful support.
Re: Project: 2494 (Run 0, Clone 1, Gen 0)
Hi Dave and all,
Thanks for these reports. You encountered a bad WU. Thanks for letting us know. I have stopped it from trying to grow.
Hopefully no one will get it again. It might be reassigned once more if it is still in the stack, but it will clear out in the next few minutes. Thanks once again.
Paula