Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Sat Jun 13, 2009 7:03 am
by Foxbat
I picked it up one last time before it was regenerated. It was right after the 1,000th Work Unit on my Mac Pro, too! Thanks for taking care of this, kasson!

Project: 2671 (Run 3, Clone 82, Gen 42) URGENT!

Posted: Sat Jun 13, 2009 8:43 am
by 314159
I have had this SAME #$&## project assigned to EIGHT of my Quads over the past 30-40 hours or so.
These are ALL stock clocked, stable machines, completing two or three a2's per day, and operating in a temperature controlled environment. Their last EUEs were experienced ages ago.

This R/C/G fails immediately with CoreStatus = FF (255).
Several of the runs have stalled at their third attempt for several hours until I was able to detect the failures and dump the WU.

PLEASE NOTE THIS "STALL" AS A NASTY CLIENT/CORE BUG. :!:
(it is intermittent in nature - most of the 8 have immediately received new work after receiving the "bad packet" "nastygram" from the server.)

If "[06:43:51] Initial: 0000; - Error: Bad packet type from server, expected work assignment" affects future assignments to these machines in ANY way, I would consider that a gross inequity, given its cause. :!:

PLEASE MARK THIS ONE AS A TRULY BAD, BAD, BAD WU AND REMOVE IT FROM CIRCULATION ASAP.

If you don't do this, I will post ALL eight logs. :D :D :egeek:

I would also be interested in why this particular WU is being assigned to MY machines with this frequency. :?:

Code:

[06:41:52] + Number of Units Completed: 132
[06:41:53] - Warning: Could not delete all work unit files (0): Core file absent
[06:41:53] Trying to send all finished work units
[06:41:53] + No unsent completed units remaining.
[06:41:53] - Preparing to get new work unit...
[06:41:53] + Attempting to get work packet
[06:41:53] - Will indicate memory of 1000 MB
[06:41:53] - Connecting to assignment server
[06:41:53] Connecting to http://assign.stanford.edu:8080/
[06:41:53] Posted data.
[06:41:53] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:41:53] + News From Folding@Home: Welcome to Folding@Home
[06:41:54] Loaded queue successfully.
[06:41:54] Connecting to http://171.67.108.24:8080/
[06:41:59] Posted data.
[06:41:59] Initial: 0000; - Receiving payload (expected size: 4842125)
[06:42:03] - Downloaded at ~1182 kB/s
[06:42:03] - Averaged speed for that direction ~1205 kB/s
[06:42:03] + Received work.
[06:42:03] Trying to send all finished work units
[06:42:03] + No unsent completed units remaining.
[06:42:03] + Closed connections
[06:42:03] 
[06:42:03] + Processing work unit
[06:42:03] Core required: FahCore_a2.exe
[06:42:03] Core found.
[06:42:03] Working on queue slot 01 [June 13 06:42:03 UTC]
[06:42:03] + Working ...
[06:42:03] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 10585 -version 624'

[06:42:03] 
[06:42:03] *------------------------------*
[06:42:03] Folding@Home Gromacs SMP Core
[06:42:03] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[06:42:03] 
[06:42:03] Preparing to commence simulation
[06:42:03] - Ensuring status. Please wait.
[06:42:13] - Assembly optimizations manually forced on.
[06:42:13] - Not checking prior termination.
[06:42:13] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[06:42:14] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[06:42:14] - Digital signature verified
[06:42:14] 
[06:42:14] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:42:14] 
[06:42:14] Assembly optimizations on if available.
[06:42:14] Entering M.D.
[06:42:22] Completed 0 out of 250000 steps  (0%)
[06:42:29] CoreStatus = FF (255)
[06:42:29] Sending work to server
[06:42:29] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:42:29] - Error: Could not get length of results file work/wuresults_01.dat
[06:42:29] - Error: Could not read unit 01 file. Removing from queue.
[06:42:29] Trying to send all finished work units
[06:42:29] + No unsent completed units remaining.
[06:42:29] - Preparing to get new work unit...
[06:42:29] + Attempting to get work packet
[06:42:29] - Will indicate memory of 1000 MB
[06:42:29] - Connecting to assignment server
[06:42:29] Connecting to http://assign.stanford.edu:8080/
[06:42:29] Posted data.
[06:42:29] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:42:29] + News From Folding@Home: Welcome to Folding@Home
[06:42:29] Loaded queue successfully.
[06:42:29] Connecting to http://171.67.108.24:8080/
[06:42:35] Posted data.
[06:42:35] Initial: 0000; - Receiving payload (expected size: 4842125)
[06:42:39] - Downloaded at ~1182 kB/s
[06:42:39] - Averaged speed for that direction ~1201 kB/s
[06:42:39] + Received work.
[06:42:39] Trying to send all finished work units
[06:42:39] + No unsent completed units remaining.
[06:42:39] + Closed connections
[06:42:44] 
[06:42:44] + Processing work unit
[06:42:44] Core required: FahCore_a2.exe
[06:42:44] Core found.
[06:42:44] Working on queue slot 02 [June 13 06:42:44 UTC]
[06:42:44] + Working ...
[06:42:44] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -forceasm -verbose -lifeline 10585 -version 624'

[06:42:44] 
[06:42:44] *------------------------------*
[06:42:44] Folding@Home Gromacs SMP Core
[06:42:44] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[06:42:44] 
[06:42:44] Preparing to commence simulation
[06:42:44] - Ensuring status. Please wait.
[06:42:54] - Assembly optimizations manually forced on.
[06:42:54] - Not checking prior termination.
[06:42:54] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[06:42:55] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[06:42:55] - Digital signature verified
[06:42:55] 
[06:42:55] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:42:55] 
[06:42:55] Assembly optimizations on if available.
[06:42:55] Entering M.D.
[06:43:03] Completed 0 out of 250000 steps  (0%)
[06:43:09] CoreStatus = FF (255)
[06:43:09] Sending work to server
[06:43:09] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:43:09] - Error: Could not get length of results file work/wuresults_02.dat
[06:43:09] - Error: Could not read unit 02 file. Removing from queue.
[06:43:09] Trying to send all finished work units
[06:43:09] + No unsent completed units remaining.
[06:43:09] - Preparing to get new work unit...
[06:43:09] + Attempting to get work packet
[06:43:09] - Will indicate memory of 1000 MB
[06:43:09] - Connecting to assignment server
[06:43:09] Connecting to http://assign.stanford.edu:8080/
[06:43:10] Posted data.
[06:43:10] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:43:10] + News From Folding@Home: Welcome to Folding@Home
[06:43:10] Loaded queue successfully.
[06:43:10] Connecting to http://171.67.108.24:8080/
[06:43:16] Posted data.
[06:43:16] Initial: 0000; - Receiving payload (expected size: 4842125)
[06:43:20] - Downloaded at ~1182 kB/s
[06:43:20] - Averaged speed for that direction ~1197 kB/s
[06:43:20] + Received work.
[06:43:20] Trying to send all finished work units
[06:43:20] + No unsent completed units remaining.
[06:43:20] + Closed connections
[06:43:25] 
[06:43:25] + Processing work unit
[06:43:25] Core required: FahCore_a2.exe
[06:43:25] Core found.
[06:43:25] Working on queue slot 03 [June 13 06:43:25 UTC]
[06:43:25] + Working ...
[06:43:25] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 15 -forceasm -verbose -lifeline 10585 -version 624'

[06:43:25] 
[06:43:25] *------------------------------*
[06:43:25] Folding@Home Gromacs SMP Core
[06:43:25] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[06:43:25] 
[06:43:25] Preparing to commence simulation
[06:43:25] - Ensuring status. Please wait.
[06:43:35] - Assembly optimizations manually forced on.
[06:43:35] - Not checking prior termination.
[06:43:35] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[06:43:36] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[06:43:36] - Digital signature verified
[06:43:36] 
[06:43:36] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:43:36] 
[06:43:36] Assembly optimizations on if available.
[06:43:36] Entering M.D.
[06:43:44] Completed 0 out of 250000 steps  (0%)
[06:43:50] CoreStatus = FF (255)
[06:43:50] Sending work to server
[06:43:50] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:43:50] - Error: Could not get length of results file work/wuresults_03.dat
[06:43:50] - Error: Could not read unit 03 file. Removing from queue.
[06:43:50] Trying to send all finished work units
[06:43:50] + No unsent completed units remaining.
[06:43:50] - Preparing to get new work unit...
[06:43:50] + Attempting to get work packet
[06:43:50] - Will indicate memory of 1000 MB
[06:43:50] - Connecting to assignment server
[06:43:50] Connecting to http://assign.stanford.edu:8080/
[06:43:51] Posted data.
[06:43:51] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:43:51] + News From Folding@Home: Welcome to Folding@Home
[06:43:51] Loaded queue successfully.
[06:43:51] Connecting to http://171.67.108.24:8080/
[06:43:51] Posted data.
[06:43:51] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[06:43:52] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[06:43:58] + Attempting to get work packet
[06:43:58] - Will indicate memory of 1000 MB
[06:43:58] - Connecting to assignment server
[06:43:58] Connecting to http://assign.stanford.edu:8080/
[06:43:58] Posted data.
[06:43:58] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:43:58] + News From Folding@Home: Welcome to Folding@Home
[06:43:58] Loaded queue successfully.
[06:43:58] Connecting to http://171.67.108.24:8080/
[06:44:04] Posted data.
[06:44:04] Initial: 0000; - Receiving payload (expected size: 4837172)
[06:44:08] - Downloaded at ~1180 kB/s
[06:44:08] - Averaged speed for that direction ~1194 kB/s
[06:44:08] + Received work.
[06:44:08] Trying to send all finished work units
[06:44:08] + No unsent completed units remaining.
[06:44:08] + Closed connections
[06:44:13] 
[06:44:13] + Processing work unit
[06:44:13] Core required: FahCore_a2.exe
[06:44:13] Core found.
[06:44:13] Working on queue slot 04 [June 13 06:44:13 UTC]
[06:44:13] + Working ...
[06:44:13] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 04 -checkpoint 15 -forceasm -verbose -lifeline 10585 -version 624'

[06:44:13] 
[06:44:13] *------------------------------*
[06:44:13] Folding@Home Gromacs SMP Core
[06:44:13] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[06:44:13] 
[06:44:13] Preparing to commence simulation
[06:44:13] - Ensuring status. Please wait.
[06:44:22] - Assembly optimizations manually forced on.
[06:44:22] - Not checking prior termination.
[06:44:23] - Expanded 4836660 -> 24032501 (decompressed 496.8 percent)
[06:44:23] Called DecompressByteArray: compressed_data_size=4836660 data_size=24032501, decompressed_data_size=24032501 diff=0
[06:44:23] - Digital signature verified
[06:44:23] 
[06:44:23] Project: 2671 (Run 40, Clone 42, Gen 45)
[06:44:23] 
[06:44:23] Assembly optimizations on if available.
[06:44:23] Entering M.D.
[06:44:32] Completed 0 out of 250000 steps  (0%)
[06:50:52] Completed 2500 out of 250000 steps  (1%)
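
For anyone who wants to spot this pattern across several machines without reading every log by hand, here is a rough sketch (not an official tool) that counts CoreStatus = FF exits per Project/Run/Clone/Gen. The two regular expressions are assumptions based only on the line formats visible in the log above, and the function name `count_ff_failures` is made up for illustration; other client versions may log differently.

```python
import re
from collections import Counter

# Line formats assumed from the FAHlog.txt excerpt above.
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def count_ff_failures(log_lines):
    """Count CoreStatus = FF exits per (project, run, clone, gen)."""
    failures = Counter()
    current = None  # most recent P/R/C/G seen in the log
    for line in log_lines:
        prcg = PRCG_RE.search(line)
        if prcg:
            current = tuple(int(g) for g in prcg.groups())
            continue
        status = STATUS_RE.search(line)
        if status and current is not None and status.group(1).upper() == "FF":
            failures[current] += 1
    return failures
```

Feeding it the lines of a FAHlog.txt (for example `count_ff_failures(open("FAHlog.txt"))`) gives a counter keyed by P/R/C/G, so a WU that fails three times in a row on several machines stands out immediately.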

Here is a snippet from the actual console from one of the other failures if that will help.
My recollection is that all failures were similar if not identical in nature.

Code:

[19:17:28] *------------------------------*
[19:17:28] Folding@Home Gromacs SMP Core
[19:17:28] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[19:17:28] 
[19:17:28] Preparing to commence simulation
[19:17:28] - Ensuring status. Please wait.
[19:17:37] - Assembly optimizations manually forced on.
[19:17:37] - Not checking prior termination.
[19:17:38] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[19:17:38] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[19:17:38] - Digital signature verified
[19:17:38] 
[19:17:38] Project: 2671 (Run 3, Clone 82, Gen 42)
[19:17:38] 
[19:17:38] Assembly optimizations on if available.
[19:17:38] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=L28QSMP
NNODES=4, MYRANK=2, HOSTNAME=L28QSMP
NNODES=4, MYRANK=3, HOSTNAME=L28QSMP
NNODES=4, MYRANK=1, HOSTNAME=L28QSMP
NODEID=0 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
NODEID=1 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
10750000 steps,  21500.0 ps (continuing from step 10500000,  21000.0 ps).
[19:17:47] Completed 0 out of 250000 steps  (0%)

t = 21000.005 ps: Water molecule starting at atom 95476 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 21000.007 ps: Water molecule starting at atom 46285 can not be settled.
Check for bad contacts and/or reduce the timestep.

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

<snip>

Variable ci has value -2147483503. It should have been within [ 0 .. 2312 ]
<snip>

Variable ci has value -2147483519. It should have been within [ 0 .. 1800 ]
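
A side note on those two `ci` values: both sit just above the smallest 32-bit signed integer, and the leftover offsets fall inside the ranges the error message asks for. That is the kind of number one tends to see when a non-finite or enormous floating-point coordinate has been converted to an integer (on x86 such a conversion typically yields 0x80000000), which would fit with the "can not be settled" warnings just before it. This is an inference from the numbers, not something the log states. A small check of the arithmetic, with a helper name invented for illustration:

```python
INT32_MIN = -2**31  # -2147483648, i.e. 0x80000000 as a signed 32-bit integer


def offset_from_int32_min(value):
    """Distance of a reported value above the smallest 32-bit signed integer."""
    return value - INT32_MIN


# The two ci values reported in the console output above.
for ci in (-2147483503, -2147483519):
    print(ci, "= INT32_MIN +", offset_from_int32_min(ci))
# -2147483503 = INT32_MIN + 145
# -2147483519 = INT32_MIN + 129
```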

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Sat Jun 13, 2009 8:59 am
by 314159
I picked up this thread subsequent to posting above. Sorry!

Note that the last failure that I had was for one of these monsters that was issued on Saturday, June 13 at 02:36:15 Eastern Daylight Time.

This is subsequent to your post of Fri Jun 12, 2009 4:05 pm, Sir kasson.

Please get out the hammer and try to fix it once again. :ewink:

Can it be withdrawn until such time as you are able to run it on a P.G. machine to verify its integrity? :!:

It is killing my contribution to science since my machines appear to have an affinity for it. :e(

Re: Project: 2671 (Run 3, Clone 82, Gen 42) URGENT!

Posted: Sat Jun 13, 2009 11:42 am
by parkut
I can confirm this particular WU has failed three times in a row on startup on several of my machines as well. I have not had an instance where any significant time has been lost, except for the time to process the failure and download over again.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) URGENT!

Posted: Sat Jun 13, 2009 6:02 pm
by 314159
I absolutely do not believe this!!

Since my original posts, not too many hours ago, THREE more of my Quads have been assigned this identical WU with the same results!
(This is for my reference: L23/L30/L21)

Two progressed properly to another assignment after a three run attempt. No harm, no foul, many questions......

The other "stalled" for several hours at 0% (per FAHlog.txt) until I killed it - but had actually crashed per shell info at 0%.

The most recent one was issued at 12:46 PM EDT (9:46 AM PDT); Saturday 13 June 2009.

I am now checking my refrigerator, microwave, TV's etc. to see if they have also been assigned this BAD, BAD, BAD WU. :roll:

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Sat Jun 13, 2009 9:29 pm
by codysluder
314159 wrote: Please get out the hammer and try to fix it once again. :ewink:

Can it be withdrawn until such time as you are able to run it on a P.G. machine to verify its integrity? :!:
You probably know this, but if they do run it on a P.G. machine, you'll never see it again. If it runs, you'll only see Gen 43 or greater.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) URGENT!

Posted: Sat Jun 13, 2009 9:35 pm
by codysluder
This WU is already being discussed BY YOU in another topic. The Mods don't like duplicated posts or threads.
viewtopic.php?f=19&t=10270

EDIT by Mod:
Threads merged.

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Sat Jun 13, 2009 10:26 pm
by kasson
The WU has been removed.
There is a client bug we need to fix--when the core returns status FF, the client tries to report the WU as bad, but since there isn't a results file it doesn't upload. I think uncle_fungus was going to help us with this a while back--let me try to ping him on that. If that worked, all the FF returns would eventually lead the server to automatically mark the WU as bad. What happens right now is the server doesn't hear anything back, so it assigns it again.
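
To make the loop described above concrete, here is a toy model of it. This is only a sketch of the behaviour as described in this post, not the real client or server code: every name (`Server`, `handle_core_exit`, `fixed_client`) and the threshold of three failure reports are hypothetical.

```python
from dataclasses import dataclass, field

BAD_WU_THRESHOLD = 3  # hypothetical number of failure reports before a WU is pulled


@dataclass
class Server:
    failure_reports: dict = field(default_factory=dict)
    bad_wus: set = field(default_factory=set)

    def report_failure(self, wu):
        """Record one failure report and mark the WU bad once enough arrive."""
        self.failure_reports[wu] = self.failure_reports.get(wu, 0) + 1
        if self.failure_reports[wu] >= BAD_WU_THRESHOLD:
            self.bad_wus.add(wu)

    def can_assign(self, wu):
        return wu not in self.bad_wus


def handle_core_exit(server, wu, core_status, results_file_exists, fixed_client):
    """Return True if the server heard about the failure."""
    if core_status != 0xFF:
        return False
    if not results_file_exists and not fixed_client:
        # Behaviour described in the post: with no results file there is
        # nothing to upload, so the server never learns about the failure.
        return False
    server.report_failure(wu)
    return True
```

With `fixed_client=False` and no results file, the server's count never moves and the WU stays assignable no matter how often it fails. With `fixed_client=True`, three FF returns are enough for `can_assign` to turn false in this toy model.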

Re: Project: 2671 (Run 3, Clone 82, Gen 42) URGENT!

Posted: Sat Jun 13, 2009 11:41 pm
by tear
314159 wrote: I absolutely do not believe this!!
Take it easy, we heard you the first time.


tear

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Sun Jun 14, 2009 12:22 am
by 314159
Hey codysluder,
I did communicate this when the other thread popped up.
314159 wrote: I picked up this thread subsequent to posting above. Sorry!
:ewink:
I had also attempted to locate a thread discussing this issue.
Missed locating it since it was somewhat buried due to its last post date. :oops:
Someone then posted in it and I saw it when "new posts" was refreshed.
codysluder wrote: The Mods don't like duplicated posts or threads.
Nor do the members, but in a Forum as active as this one, stuff like this happens. :ewink:
codysluder wrote: You probably know this, but if they do run it on a P.G. machine, you'll never see it again. If it runs, you'll only see Gen 43 or greater.
Yup! I am TOTALLY aware of this. I am looking forward to it occurring. :ewink:
Note that I was also trying to convey info on the client bug.
That, to me, is a more important issue.

I would also question the AS and WS algorithms. The recent experience, at least for me, is unprecedented.

I "is" a good boy guys. I have also been around a looong time. (check my stats sometime) :)

Thank you Dr. Kasson! :!:

Re: Project: 2671 (Run 3, Clone 82, Gen 42) WMCNBS 100% reprod.

Posted: Sun Jun 14, 2009 3:15 pm
by Foxbat
Yep, I wondered why it was cool in the bedroom. I picked the bad WU up one more time. Let's hope the third time's the charm. Thanks (again) kasson!

The silver lining is you've discovered a failure mode that might make the Mac/Linux SMP Client more reliable in the future!