I have had this SAME #$&## project assigned to EIGHT of my Quads over the past 30-40 hours or so.
These are ALL stock-clocked, stable machines, completing two or three a2 WUs per day and operating in a temperature-controlled environment. Their last EUEs were ages ago.
This R/C/G (Project 2671, Run 3, Clone 82, Gen 42) fails immediately with CoreStatus = FF (255).
Several of the runs stalled at their third attempt for several hours, until I detected the failures and dumped the WU.
PLEASE NOTE THIS "STALL" AS A NASTY CLIENT/CORE BUG.
(It is intermittent in nature: most of the 8 immediately received new work after receiving the "bad packet" "nastygram" from the server.)
If "[06:43:51] Initial: 0000; - Error: Bad packet type from server, expected work assignment" affects future assignments to these machines in ANY way, I would consider that a gross inequity, given its cause.
PLEASE MARK THIS ONE AS A TRULY BAD, BAD, BAD WU AND REMOVE IT FROM CIRCULATION ASAP.
If you don't do this, I will post ALL eight logs.
I would also be interested in why this particular WU is being assigned to MY machines with this frequency.
[06:41:52] + Number of Units Completed: 132
[06:41:53] - Warning: Could not delete all work unit files (0): Core file absent
[06:41:53] Trying to send all finished work units
[06:41:53] + No unsent completed units remaining.
[06:41:53] - Preparing to get new work unit...
[06:41:53] + Attempting to get work packet
[06:41:53] - Will indicate memory of 1000 MB
[06:41:53] - Connecting to assignment server
[06:41:53] Connecting to http://assign.stanford.edu:8080/
[06:41:53] Posted data.
[06:41:53] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:41:53] + News From Folding@Home: Welcome to Folding@Home
[06:41:54] Loaded queue successfully.
[06:41:54] Connecting to http://171.67.108.24:8080/
[06:41:59] Posted data.
[06:41:59] Initial: 0000; - Receiving payload (expected size: 4842125)
[06:42:03] - Downloaded at ~1182 kB/s
[06:42:03] - Averaged speed for that direction ~1205 kB/s
[06:42:03] + Received work.
[06:42:03] Trying to send all finished work units
[06:42:03] + No unsent completed units remaining.
[06:42:03] + Closed connections
[06:42:03]
[06:42:03] + Processing work unit
[06:42:03] Core required: FahCore_a2.exe
[06:42:03] Core found.
[06:42:03] Working on queue slot 01 [June 13 06:42:03 UTC]
[06:42:03] + Working ...
[06:42:03] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 10585 -version 624'
[06:42:03]
[06:42:03] *------------------------------*
[06:42:03] Folding@Home Gromacs SMP Core
[06:42:03] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[06:42:03]
[06:42:03] Preparing to commence simulation
[06:42:03] - Ensuring status. Please wait.
[06:42:13] - Assembly optimizations manually forced on.
[06:42:13] - Not checking prior termination.
[06:42:13] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[06:42:14] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[06:42:14] - Digital signature verified
[06:42:14]
[06:42:14] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:42:14]
[06:42:14] Assembly optimizations on if available.
[06:42:14] Entering M.D.
[06:42:22] Completed 0 out of 250000 steps (0%)
[06:42:29] CoreStatus = FF (255)
[06:42:29] Sending work to server
[06:42:29] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:42:29] - Error: Could not get length of results file work/wuresults_01.dat
[06:42:29] - Error: Could not read unit 01 file. Removing from queue.
[06:42:29] Trying to send all finished work units
[06:42:29] + No unsent completed units remaining.
[06:42:29] - Preparing to get new work unit...
[06:42:29] + Attempting to get work packet
[06:42:29] - Will indicate memory of 1000 MB
[06:42:29] - Connecting to assignment server
[06:42:29] Connecting to http://assign.stanford.edu:8080/
[06:42:29] Posted data.
[06:42:29] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:42:29] + News From Folding@Home: Welcome to Folding@Home
[06:42:29] Loaded queue successfully.
[06:42:29] Connecting to http://171.67.108.24:8080/
[06:42:35] Posted data.
[06:42:35] Initial: 0000; - Receiving payload (expected size: 4842125)
[06:42:39] - Downloaded at ~1182 kB/s
[06:42:39] - Averaged speed for that direction ~1201 kB/s
[06:42:39] + Received work.
[06:42:39] Trying to send all finished work units
[06:42:39] + No unsent completed units remaining.
[06:42:39] + Closed connections
[06:42:44]
[06:42:44] + Processing work unit
[06:42:44] Core required: FahCore_a2.exe
[06:42:44] Core found.
[06:42:44] Working on queue slot 02 [June 13 06:42:44 UTC]
[06:42:44] + Working ...
[06:42:44] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 02 -checkpoint 15 -forceasm -verbose -lifeline 10585 -version 624'
[06:42:44]
[06:42:44] *------------------------------*
[06:42:44] Folding@Home Gromacs SMP Core
[06:42:44] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[06:42:44]
[06:42:44] Preparing to commence simulation
[06:42:44] - Ensuring status. Please wait.
[06:42:54] - Assembly optimizations manually forced on.
[06:42:54] - Not checking prior termination.
[06:42:54] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[06:42:55] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[06:42:55] - Digital signature verified
[06:42:55]
[06:42:55] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:42:55]
[06:42:55] Assembly optimizations on if available.
[06:42:55] Entering M.D.
[06:43:03] Completed 0 out of 250000 steps (0%)
[06:43:09] CoreStatus = FF (255)
[06:43:09] Sending work to server
[06:43:09] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:43:09] - Error: Could not get length of results file work/wuresults_02.dat
[06:43:09] - Error: Could not read unit 02 file. Removing from queue.
[06:43:09] Trying to send all finished work units
[06:43:09] + No unsent completed units remaining.
[06:43:09] - Preparing to get new work unit...
[06:43:09] + Attempting to get work packet
[06:43:09] - Will indicate memory of 1000 MB
[06:43:09] - Connecting to assignment server
[06:43:09] Connecting to http://assign.stanford.edu:8080/
[06:43:10] Posted data.
[06:43:10] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:43:10] + News From Folding@Home: Welcome to Folding@Home
[06:43:10] Loaded queue successfully.
[06:43:10] Connecting to http://171.67.108.24:8080/
[06:43:16] Posted data.
[06:43:16] Initial: 0000; - Receiving payload (expected size: 4842125)
[06:43:20] - Downloaded at ~1182 kB/s
[06:43:20] - Averaged speed for that direction ~1197 kB/s
[06:43:20] + Received work.
[06:43:20] Trying to send all finished work units
[06:43:20] + No unsent completed units remaining.
[06:43:20] + Closed connections
[06:43:25]
[06:43:25] + Processing work unit
[06:43:25] Core required: FahCore_a2.exe
[06:43:25] Core found.
[06:43:25] Working on queue slot 03 [June 13 06:43:25 UTC]
[06:43:25] + Working ...
[06:43:25] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 03 -checkpoint 15 -forceasm -verbose -lifeline 10585 -version 624'
[06:43:25]
[06:43:25] *------------------------------*
[06:43:25] Folding@Home Gromacs SMP Core
[06:43:25] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[06:43:25]
[06:43:25] Preparing to commence simulation
[06:43:25] - Ensuring status. Please wait.
[06:43:35] - Assembly optimizations manually forced on.
[06:43:35] - Not checking prior termination.
[06:43:35] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[06:43:36] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[06:43:36] - Digital signature verified
[06:43:36]
[06:43:36] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:43:36]
[06:43:36] Assembly optimizations on if available.
[06:43:36] Entering M.D.
[06:43:44] Completed 0 out of 250000 steps (0%)
[06:43:50] CoreStatus = FF (255)
[06:43:50] Sending work to server
[06:43:50] Project: 2671 (Run 3, Clone 82, Gen 42)
[06:43:50] - Error: Could not get length of results file work/wuresults_03.dat
[06:43:50] - Error: Could not read unit 03 file. Removing from queue.
[06:43:50] Trying to send all finished work units
[06:43:50] + No unsent completed units remaining.
[06:43:50] - Preparing to get new work unit...
[06:43:50] + Attempting to get work packet
[06:43:50] - Will indicate memory of 1000 MB
[06:43:50] - Connecting to assignment server
[06:43:50] Connecting to http://assign.stanford.edu:8080/
[06:43:51] Posted data.
[06:43:51] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:43:51] + News From Folding@Home: Welcome to Folding@Home
[06:43:51] Loaded queue successfully.
[06:43:51] Connecting to http://171.67.108.24:8080/
[06:43:51] Posted data.
[06:43:51] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[06:43:52] - Attempt #1 to get work failed, and no other work to do.
Waiting before retry.
[06:43:58] + Attempting to get work packet
[06:43:58] - Will indicate memory of 1000 MB
[06:43:58] - Connecting to assignment server
[06:43:58] Connecting to http://assign.stanford.edu:8080/
[06:43:58] Posted data.
[06:43:58] Initial: 43AB; - Successful: assigned to (171.67.108.24).
[06:43:58] + News From Folding@Home: Welcome to Folding@Home
[06:43:58] Loaded queue successfully.
[06:43:58] Connecting to http://171.67.108.24:8080/
[06:44:04] Posted data.
[06:44:04] Initial: 0000; - Receiving payload (expected size: 4837172)
[06:44:08] - Downloaded at ~1180 kB/s
[06:44:08] - Averaged speed for that direction ~1194 kB/s
[06:44:08] + Received work.
[06:44:08] Trying to send all finished work units
[06:44:08] + No unsent completed units remaining.
[06:44:08] + Closed connections
[06:44:13]
[06:44:13] + Processing work unit
[06:44:13] Core required: FahCore_a2.exe
[06:44:13] Core found.
[06:44:13] Working on queue slot 04 [June 13 06:44:13 UTC]
[06:44:13] + Working ...
[06:44:13] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 04 -checkpoint 15 -forceasm -verbose -lifeline 10585 -version 624'
[06:44:13]
[06:44:13] *------------------------------*
[06:44:13] Folding@Home Gromacs SMP Core
[06:44:13] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[06:44:13]
[06:44:13] Preparing to commence simulation
[06:44:13] - Ensuring status. Please wait.
[06:44:22] - Assembly optimizations manually forced on.
[06:44:22] - Not checking prior termination.
[06:44:23] - Expanded 4836660 -> 24032501 (decompressed 496.8 percent)
[06:44:23] Called DecompressByteArray: compressed_data_size=4836660 data_size=24032501, decompressed_data_size=24032501 diff=0
[06:44:23] - Digital signature verified
[06:44:23]
[06:44:23] Project: 2671 (Run 40, Clone 42, Gen 45)
[06:44:23]
[06:44:23] Assembly optimizations on if available.
[06:44:23] Entering M.D.
[06:44:32] Completed 0 out of 250000 steps (0%)
[06:50:52] Completed 2500 out of 250000 steps (1%)
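The failure pattern in the log above (the same Project/Run/Clone/Gen announced, then `CoreStatus = FF (255)`, three times in a row) can be picked out mechanically instead of by watching the console. The following is only a minimal sketch: the regular expressions are written against the log lines quoted in this post and are an assumption about the format in general, and `count_core_failures` is a hypothetical helper name, not part of the client.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this post (assumption:
# other client/core versions may format these lines differently).
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def count_core_failures(log_text, bad_status=255):
    """Count how often each (project, run, clone, gen) ended with bad_status."""
    failures = Counter()
    current = None  # work unit most recently announced in the log
    for line in log_text.splitlines():
        project = PROJECT_RE.search(line)
        if project:
            current = tuple(int(g) for g in project.groups())
            continue
        status = STATUS_RE.search(line)
        if status and current is not None and int(status.group(2)) == bad_status:
            failures[current] += 1
            current = None  # a later status line needs a fresh Project line
    return failures
```

Run over the text of a client log, any work unit whose count reaches 2 or 3 is a candidate for the kind of repeat failure reported here; the threshold is a judgment call, not something the client defines.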
Here is a snippet of the actual console output from one of the other failures, in case it helps.
My recollection is that all failures were similar if not identical in nature.
[19:17:28] *------------------------------*
[19:17:28] Folding@Home Gromacs SMP Core
[19:17:28] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[19:17:28]
[19:17:28] Preparing to commence simulation
[19:17:28] - Ensuring status. Please wait.
[19:17:37] - Assembly optimizations manually forced on.
[19:17:37] - Not checking prior termination.
[19:17:38] - Expanded 4841613 -> 24004881 (decompressed 495.8 percent)
[19:17:38] Called DecompressByteArray: compressed_data_size=4841613 data_size=24004881, decompressed_data_size=24004881 diff=0
[19:17:38] - Digital signature verified
[19:17:38]
[19:17:38] Project: 2671 (Run 3, Clone 82, Gen 42)
[19:17:38]
[19:17:38] Assembly optimizations on if available.
[19:17:38] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=L28QSMP
NNODES=4, MYRANK=2, HOSTNAME=L28QSMP
NNODES=4, MYRANK=3, HOSTNAME=L28QSMP
NNODES=4, MYRANK=1, HOSTNAME=L28QSMP
NODEID=0 argc=20
NODEID=2 argc=20
NODEID=3 argc=20
NODEID=1 argc=20
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 4.0.99_development_20090307 (-:
Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2008, The GROMACS development team,
check out http://www.gromacs.org for more information.
:-) mdrun (-:
Reading file work/wudata_02.tpr, VERSION 3.3.99_development_20070618 (single precision)
Note: tpx file_version 48, software version 64
NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp
Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22878 system in water'
10750000 steps, 21500.0 ps (continuing from step 10500000, 21000.0 ps).
[19:17:47] Completed 0 out of 250000 steps (0%)
t = 21000.005 ps: Water molecule starting at atom 95476 can not be settled.
Check for bad contacts and/or reduce the timestep.
t = 21000.007 ps: Water molecule starting at atom 46285 can not be settled.
Check for bad contacts and/or reduce the timestep.
-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357
<snip>
Variable ci has value -2147483503. It should have been within [ 0 .. 2312 ]
<snip>
Variable ci has value -2147483519. It should have been within [ 0 .. 1800 ]
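One detail worth noting about the two `ci` values: both sit just above the smallest 32-bit signed integer, far outside the allowed grid-cell ranges. The arithmetic below only shows that distance. Any explanation of why (for example, a coordinate that had already blown up before being converted to a cell index, which would fit the "can not be settled" water warnings) is an assumption and is not stated anywhere in the log. The helper name is hypothetical.

```python
INT32_MIN = -2**31  # -2147483648, smallest 32-bit signed integer


def offset_from_int32_min(value):
    """Distance of a reported value above INT32_MIN."""
    return value - INT32_MIN


# (reported ci, upper bound of the allowed range), copied from the console output
reported = [(-2147483503, 2312), (-2147483519, 1800)]

for ci, upper in reported:
    print(f"ci={ci}: INT32_MIN + {offset_from_int32_min(ci)}, allowed 0..{upper}")
```

The offsets come out as 145 and 129, so in both cases the index is a small number added to the minimum integer rather than a value slightly out of range.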