Re: Bigadv Collection and or Assignment server is broken
Posted: Mon Oct 01, 2012 9:34 am
We are continuing to investigate but think it may be a WS-CS communication issue. More updates to come.
Community driven support forum for Folding@home
https://foldingforum.org/
[04:41:29] - Preparing to get new work unit...
[04:41:29] Cleaning up work directory
[04:41:29] + Attempting to get work packet
[04:41:29] Passkey found
[04:41:29] - Connecting to assignment server
[04:41:30] - Successful: assigned to (128.143.231.201)
[04:41:30] + News From Folding@Home: Welcome to Folding@Home
[04:41:30] Loaded queue successfully.
[04:42:21] + Closed connections
[04:42:21]
[04:42:21] + Processing work unit
[04:42:21] Core required: FahCore_a5.exe
[04:42:21] Core found.
[04:42:21] Working on queue slot 02 [September 29 04:42:21 UTC]
[04:42:21] + Working ...
[04:42:21]
[04:42:21] *------------------------------*
[04:42:21] Folding@Home Gromacs SMP Core
[04:42:21] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[04:42:21]
[04:42:21] Preparing to commence simulation
[04:42:21] - Looking at optimizations...
[04:42:21] - Created dyn
[04:42:21] - Files status OK
[04:42:23] - Expanded 30305526 -> 33158020 (decompressed 109.4 percent)
[04:42:23] Called DecompressByteArray: compressed_data_size=30305526 data_size=33158020, decompressed_data_size=33158020 diff=0
[04:42:24] - Digital signature verified
[04:42:24]
[04:42:24] Project: 8101 (Run 24, Clone 1, Gen 39)
[04:42:24]
[04:42:24] Assembly optimizations on if available.
[04:42:24] Entering M.D.
[04:42:31] Mapping NT from 18 to 18
[04:42:48] Completed 0 out of 250000 steps (0%)
[05:15:17] Completed 2500 out of 250000 steps (1%)
[05:47:10] Completed 5000 out of 250000 steps (2%)
[06:19:00] Completed 7500 out of 250000 steps (3%)
[...]
[05:55:54] Completed 235000 out of 250000 steps (94%)
[06:25:57] Completed 237500 out of 250000 steps (95%)
[06:55:58] Completed 240000 out of 250000 steps (96%)
[07:26:05] Completed 242500 out of 250000 steps (97%)
[07:56:12] Completed 245000 out of 250000 steps (98%)
[08:26:16] Completed 247500 out of 250000 steps (99%)
[08:56:20] Completed 250000 out of 250000 steps (100%)
[08:56:30] DynamicWrapper: Finished Work Unit: sleep=10000
[08:56:40]
[08:56:40] Finished Work Unit:
[08:56:40] - Reading up to 64340496 from "work/wudata_02.trr": Read 64340496
[08:56:40] trr file hash check passed.
[08:56:40] - Reading up to 31618496 from "work/wudata_02.xtc": Read 31618496
[08:56:41] xtc file hash check passed.
[08:56:41] edr file hash check passed.
[08:56:41] logfile size: 219703
[08:56:41] Leaving Run
[08:56:44] - Writing 96339571 bytes of core data to disk...
[08:56:59] Done: 96339059 -> 91562584 (compressed to 5.8 percent)
[08:56:59] ... Done.
[08:57:10] - Shutting down core
[08:57:10]
[08:57:10] Folding@home Core Shutdown: FINISHED_UNIT
[08:57:12] CoreStatus = 64 (100)
[08:57:12] Sending work to server
[08:57:12] Project: 8101 (Run 24, Clone 1, Gen 39)
[08:57:12] + Attempting to send results October 1 08:57:12 UTC
[09:19:29] - Server reports problem with unit.
[09:19:29] - Preparing to get new work unit...
[09:19:29] Cleaning up work directory
[09:19:29] + Attempting to get work packet
[09:19:29] Passkey found
[09:19:29] - Connecting to assignment server
[09:19:30] - Successful: assigned to (128.143.199.96).
[09:19:30] + News From Folding@Home: Welcome to
freeloader1969 wrote: I've had two 8101s go bad for half a million points. I'll let this one finish and if it fails, I'll be shutting down my folding rigs until Stanford fixes their "problem". My latest one just failed this morning.
Grandpa_01 wrote: freeloader1969, what do you mean by fails? Are you getting the "server has a problem with the unit" message, or are they getting EUE errors or 0x8b errors? You should not be getting the server error. If you are, it needs to be reported over at the FF; they cannot fix an issue if they do not know about it. All of the messed-up WUs should have been completed by now.
freeloader1969 wrote: I got the "server has a problem with the unit" last night.
http://hardforum.com/showthread.php?t=1719949
[02:16:55] Completed 242500 out of 250000 steps (97%)
[02:45:20] Completed 245000 out of 250000 steps (98%)
[03:13:48] Completed 247500 out of 250000 steps (99%)
[03:42:18] Completed 250000 out of 250000 steps (100%)
[03:42:31] DynamicWrapper: Finished Work Unit: sleep=10000
[03:42:41]
[03:42:41] Finished Work Unit:
[03:42:41] - Reading up to 64340496 from "work/wudata_04.trr": Read 64340496
[03:42:42] trr file hash check passed.
[03:42:42] - Reading up to 31616784 from "work/wudata_04.xtc": Read 31616784
[03:42:42] xtc file hash check passed.
[03:42:42] edr file hash check passed.
[03:42:42] logfile size: 203100
[03:42:42] Leaving Run
[03:42:42] - Writing 96321256 bytes of core data to disk...
[03:43:14] Done: 96320744 -> 91568336 (compressed to 5.8 percent)
[03:43:14] ... Done.
[03:43:25] - Shutting down core
[03:43:25]
[03:43:25] Folding@home Core Shutdown: FINISHED_UNIT
[03:43:27] CoreStatus = 64 (100)
[03:43:27] Sending work to server
[03:43:27] Project: 8101 (Run 22, Clone 1, Gen 60)
[03:43:27] + Attempting to send results [October 2 03:43:27 UTC]
[04:01:56] - Server reports problem with unit.
[04:01:56] - Preparing to get new work unit...
[04:01:56] Cleaning up work directory
kasson wrote: This problem should be taken care of going forward; we are continuing to review the logs to analyze the impact of the problem on rejected work units.
That means several things.
Grandpa_01 wrote: It appears that one of the members over at the team I fold for may still be having the problem. Below is the quote from his report; someone may want to check it out.
There is insufficient information in the quoted material to determine whether the problem still exists. Estimating backwards from the TPF of the last couple of frames shown in the log excerpt, the WU could have been downloaded while the problem was still occurring. Or it could have been a re-download of one of the problem WUs issued during that period. If freeloader is still having this problem, have him post about it in this forum.
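For anyone who wants to repeat the backwards estimate, here is a rough sketch in Python. It assumes the TPF stayed constant for the whole WU, that the machine ran without interruption, and that the four progress lines all fall on October 2, 2012 UTC (the date comes from the "Attempting to send results" line, the year from this thread). The function names are made up for illustration and are not part of any F@H tool.

```python
import re
from datetime import datetime, timezone

# Last four progress lines from the quoted log excerpt.
LOG_LINES = [
    "[02:16:55] Completed 242500 out of 250000 steps (97%)",
    "[02:45:20] Completed 245000 out of 250000 steps (98%)",
    "[03:13:48] Completed 247500 out of 250000 steps (99%)",
    "[03:42:18] Completed 250000 out of 250000 steps (100%)",
]

# Assumption: every line above was logged on this UTC day (no midnight rollover).
DAY = datetime(2012, 10, 2, tzinfo=timezone.utc)

FRAME_RE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\] Completed \d+ out of \d+ steps \((\d+)%\)"
)


def parse_frames(lines, day):
    """Return a list of (percent, timestamp) pairs from progress lines."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            hour, minute, second, pct = (int(g) for g in match.groups())
            frames.append((pct, day.replace(hour=hour, minute=minute, second=second)))
    return frames


def estimate_start(frames):
    """Extrapolate back to 0% using the mean time per frame (TPF)."""
    first_pct, first_time = frames[0]
    last_pct, last_time = frames[-1]
    tpf = (last_time - first_time) / (last_pct - first_pct)
    return tpf, last_time - tpf * last_pct


tpf, start = estimate_start(parse_frames(LOG_LINES, DAY))
print("Mean TPF:", tpf)
print("Estimated start of WU:", start)
```

With these four lines the mean TPF is about 28.5 minutes, which puts the estimated start early on September 30 UTC. That is before the October 1 post saying the issue was still under investigation, so it fits the "downloaded while the problem was still occurring" possibility, but it does not rule out the re-download case.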
[17:41:29] - Preparing to get new work unit...
[17:41:29] Cleaning up work directory
[17:41:29] + Attempting to get work packet
[17:41:29] Passkey found
[17:41:29] - Connecting to assignment server
[17:41:39] - Successful: assigned to (128.143.231.201).
[17:41:39] + News From Folding@Home: Welcome to Folding@Home
[17:41:39] Loaded queue successfully.
[17:41:53] + Closed connections
[17:41:53]
[17:41:53] + Processing work unit
[17:41:53] Core required: FahCore_a5.exe
[17:41:53] Core found.
[17:41:53] Working on queue slot 03 [October 31 17:41:53 UTC]
[17:41:53] + Working ...
thekraken: The Kraken 0.7-pre15 (compiled Sun Oct 28 20:27:39 EDT 2012 by folding@Linux-Server)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 4582
thekraken: Logging to thekraken.log
[17:41:53]
[17:41:53] *------------------------------*
[17:41:53] Folding@Home Gromacs SMP Core
[17:41:53] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[17:41:53]
[17:41:53] Preparing to commence simulation
[17:41:53] - Looking at optimizations...
[17:41:53] - Created dyn
[17:41:53] - Files status OK
[17:41:57] - Expanded 30305865 -> 33158020 (decompressed 109.4 percent)
[17:41:57] Called DecompressByteArray: compressed_data_size=30305865 data_size=33158020, decompressed_data_size=33158020 diff=0
[17:41:58] - Digital signature verified
[17:41:58]
[17:41:58] Project: 8101 (Run 16, Clone 1, Gen 57)
[17:41:58]
[17:41:58] Assembly optimizations on if available.
[17:41:58] Entering M.D.
:-) G R O M A C S (-:
Groningen Machine for Chemical Simulation
:-) VERSION 4.5.3 (-:
Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra,
Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff,
Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz,
Michael Shirts, Alfons Sijbers, Peter Tieleman,
Berk Hess, David van der Spoel, and Erik Lindahl.
Copyright (c) 1991-2000, University of Groningen, The Netherlands.
Copyright (c) 2001-2010, The GROMACS development team at
Uppsala University & The Royal Institute of Technology, Sweden.
check out http://www.gromacs.org for more information.
:-) Gromacs (-:
Reading file work/wudata_03.tpr, VERSION 4.5.5-dev-20120903-d64b9e3 (single precision)
[17:42:05] Mapping NT from 48 to 48
Starting 48 threads
Making 2D domain decomposition 12 x 4 x 1
starting mdrun 'FP_membrane in water'
14500000 steps, 58000.0 ps (continuing from step 14250000, 57000.0 ps).
[17:42:10] Completed 0 out of 250000 steps (0%)
NOTE: Turning on dynamic load balancing
[17:53:09] - Couldn't send HTTP request to server
[17:53:09] + Could not connect to Work Server (results)
[17:53:09] (128.143.231.201:80)
[17:53:09] - Error: Could not transmit unit 01 (completed October 30) to work server.
[17:53:09] + Attempting to send results [October 31 17:53:09 UTC]
[17:53:09] - Couldn't send HTTP request to server
[17:53:09] + Could not connect to Work Server (results)
[17:53:09] (128.143.199.97:8080)
[17:53:09] + Retrying using alternative port
[17:53:09] - Couldn't send HTTP request to server
[17:53:09] + Could not connect to Work Server (results)
[17:53:09] (128.143.199.97:80)
[17:53:09] Could not transmit unit 01 to Collection server; keeping in queue.
[17:53:09] Project: 8101 (Run 3, Clone 5, Gen 90)
[17:53:09] + Attempting to send results [October 31 17:53:09 UTC]
[18:05:00] Completed 2500 out of 250000 steps (1%)
[18:10:22] - Couldn't send HTTP request to server
[18:10:22] + Could not connect to Work Server (results)
[18:10:22] (128.143.231.201:8080)
[18:10:22] + Retrying using alternative port
[18:27:37] - Couldn't send HTTP request to server
[18:27:37] + Could not connect to Work Server (results)
[18:27:37] (128.143.231.201:80)
[18:27:37] - Error: Could not transmit unit 02 (completed October 31) to work server.
[18:27:37] Keeping unit 02 in queue.
[18:47:21] Completed 5000 out of 250000 steps (2%)
[19:22:46] Completed 7500 out of 250000 steps (3%)
[20:03:08] Completed 10000 out of 250000 steps (4%)
[20:34:26] Completed 12500 out of 250000 steps (5%)
[21:06:07] Completed 15000 out of 250000 steps (6%)
[21:33:16] Completed 17500 out of 250000 steps (7%)
PinHead wrote: You seem to be dying one short. Missing pmks04.med.Virginia.EDU [128.143.231.201]. I get 97 ms on 128.143.222.92.
But they are in Virginia, which was also smacked by Sandy.
True, and I was concerned about that, too, but it doesn't seem to have taken the server out of service.
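A missing hop in a traceroute does not by itself show that the server is refusing uploads, since routers may drop the probe packets while still passing normal traffic. A more direct check is to try opening a TCP connection on the ports the client uses. The sketch below is only an illustration: the addresses and ports are copied from the log excerpt above, the function names are hypothetical, and a successful connection only shows that something is listening, not that the server will accept a result.

```python
import socket

# Addresses and ports copied from the client log in this thread.
# Whether these hosts still exist or answer today is unknown.
CANDIDATES = [
    ("128.143.231.201", 8080),
    ("128.143.231.201", 80),
    ("128.143.199.97", 8080),
    ("128.143.199.97", 80),
]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False


def first_reachable(candidates, timeout=5.0):
    """Return the first (host, port) that accepts a connection, or None."""
    for host, port in candidates:
        if can_connect(host, port, timeout):
            return (host, port)
    return None
```

Calling `first_reachable(CANDIDATES)` walks the list in the same order the log shows for each server (8080 first, then 80) and stops at the first address that answers.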