128.143.231.201 or Bigadv Collection server broken

Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III i7 970 4.3Ghz DDR3 2000 2-500GB Seagate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: Bigadv Collection and or Assignment server is broken

Post by Grandpa_01 »

bruce wrote:
kromberg wrote:
Grandpa_01 wrote:I would say it is resolved for those with faster machines,
I would say not. Missing 4 WUs. The "system", for what it is, is accepting newly completed WUs. WUs completed over the last 24 hours are SOL, it looks like. +1 for Stanford ......
Let's be clear about this. I think you guys are talking about two or three different things.

a) If a WU was discarded because the "server didn't like" it, it's gone. Fixing the problem won't find them.
b) If a new WU uploads successfully, the problem is fixed going forward, but not going backward.

The question has also been asked (but not answered) whether "new" means newly downloaded or newly completed. Until enough data is reported with the dates and times of both download and completion/upload, there's no way to tell.
If you ran a WU and it was rejected with the "Server reports problem with unit" message, you were then reassigned and downloaded the same WU to run again (same R/C/G), and the second time you complete it the server will accept it. If you downloaded a WU during the time frame the AS was borking WUs, the first time you complete it the server will give you the "Server reports problem with unit" message; that is why some slower machines are still getting the message, they have not completed the borked WU yet. Unfortunately, we do not know what the time frame was during which the AS was messing up WUs; if we did, the answer would be to just dump the WUs that were assigned during that time frame.
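
To illustrate that reasoning, a minimal sketch (not actual FAH code; the window boundaries below are made-up placeholders, since we do not know the real time frame):

Code:

# A minimal sketch of the reasoning above (not actual FAH code).
# The window boundaries are hypothetical placeholders.
from datetime import datetime

BROKEN_WINDOW_START = datetime(2012, 9, 28, 0, 0)   # hypothetical
BROKEN_WINDOW_END = datetime(2012, 9, 30, 6, 0)     # hypothetical

def first_return_rejected(assigned_at):
    """A WU assigned while the window was open draws 'Server reports
    problem with unit' on its first return; the same R/C/G is then
    reassigned and accepted the second time it is completed."""
    return BROKEN_WINDOW_START <= assigned_at <= BROKEN_WINDOW_END

# A fast machine assigned inside the window has already hit the rejection
# and been reassigned; a slower machine assigned in the same window may
# still be grinding through its first (doomed) pass.
print(first_return_rejected(datetime(2012, 9, 29, 11, 47)))   # True
print(first_return_rejected(datetime(2012, 10, 1, 6, 0)))     # False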
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
jcoffland
Site Admin
Posts: 1018
Joined: Fri Oct 10, 2008 6:42 pm
Location: Helsinki, Finland
Contact:

Re: Bigadv Collection and or Assignment server is broken

Post by jcoffland »

tear wrote:
kasson wrote:Yes--I see something weird going on. Nothing has changed with the work server, but I think some of the people at Stanford may have changed the assignment server without telling me. I'm investigating.
Peter, can you please let us know what the problem and resolution were, and what you are planning to do to avoid such situations in the future?

I'm sure many donors will appreciate you taking the time to answer these questions.
I could be wrong, but I believe it is unlikely this has anything to do with the recent AS (Assignment Server) changes. The error indicates data was lost or damaged either on the client or, much more likely, on the WS (Work Server), considering that it happened to many people, around the same time, all on project 8101, all using the same WS, and the problem did not occur on subsequently assigned WUs (Work Units). Have I got this right? I believe it is most likely that some data was unintentionally lost on the WS, probably during the process of working with the simulation data. This would at least explain the symptoms.

BTW, the scientific data is safe and the only penalty to the project is a slight loss in time to completion.
Cauldron Development LLC
http://cauldrondevelopment.com/
bollix47
Posts: 2953
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Bigadv Collection and or Assignment server is broken

Post by bollix47 »

There was at least one report of the same problem on project 8102 ... same work server.
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III i7 970 4.3Ghz DDR3 2000 2-500GB Seagate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: Bigadv Collection and or Assignment server is broken

Post by Grandpa_01 »

Yes, it was on 8102; I lost 3 of them (very first post in this thread) and one 8101. Some also lost 6901s, so yes, the AS server changes were most likely the cause.

PS: What data is safe, the rejected WUs?
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
jcoffland
Site Admin
Posts: 1018
Joined: Fri Oct 10, 2008 6:42 pm
Location: Helsinki, Finland
Contact:

Re: Bigadv Collection and or Assignment server is broken

Post by jcoffland »

Grandpa_01 wrote:Yes, it was on 8102; I lost 3 of them (very first post in this thread) and one 8101. Some also lost 6901s, so yes, the AS server changes were most likely the cause.
I'm not sure I agree that that is the obvious conclusion from the evidence.

EDIT: You can see on the psummary page that 8101 and 8102 are from the same WS. Unless the problem happened often on project 6901 that incident is unlikely to be related.
Cauldron Development LLC
http://cauldrondevelopment.com/
decali
Posts: 5
Joined: Sun Sep 30, 2012 8:22 pm

Re: Bigadv Collection and or Assignment server is broken

Post by decali »

bollix47 wrote:If you're still reporting errors, please add a couple of processing frames to your logs or let us know when it was downloaded. That might have a bearing on whether or not they fail.
Thanks.
Sure thing. This is the log from the entire 8101 that was lost (I posted about this previously, but didn't include enough info).

Code:

[11:46:49] - Preparing to get new work unit...
[11:46:49] Cleaning up work directory
[11:46:50] + Attempting to get work packet
[11:46:50] Passkey found
[11:46:50] - Connecting to assignment server
[11:46:51] - Successful: assigned to (128.143.231.201).
[11:46:51] + News From Folding@Home: Welcome to Folding@Home
[11:46:51] Loaded queue successfully.
[11:47:08] + Closed connections
[11:47:08] 
[11:47:08] + Processing work unit
[11:47:08] Core required: FahCore_a5.exe
[11:47:08] Core found.
[11:47:08] Working on queue slot 02 [September 29 11:47:08 UTC]
[11:47:08] + Working ...
[11:47:08] 
[11:47:08] *------------------------------*
[11:47:08] Folding@Home Gromacs SMP Core
[11:47:08] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[11:47:08] 
[11:47:08] Preparing to commence simulation
[11:47:08] - Looking at optimizations...
[11:47:08] - Created dyn
[11:47:08] - Files status OK
[11:47:11] - Expanded 30298350 -> 33158020 (decompressed 109.4 percent)
[11:47:11] Called DecompressByteArray: compressed_data_size=30298350 data_size=33158020, decompressed_data_size=33158020 diff=0
[11:47:11] - Digital signature verified
[11:47:11] 
[11:47:11] Project: 8101 (Run 5, Clone 2, Gen 85)
[11:47:11] 
[11:47:11] Assembly optimizations on if available.
[11:47:11] Entering M.D.
[11:47:19] Mapping NT from 32 to 32 
[11:47:23] Completed 0 out of 250000 steps  (0%)
[12:05:45] Completed 2500 out of 250000 steps  (1%)
[12:23:48] Completed 5000 out of 250000 steps  (2%)
[12:41:46] Completed 7500 out of 250000 steps  (3%)
[12:59:48] Completed 10000 out of 250000 steps  (4%)
[13:17:51] Completed 12500 out of 250000 steps  (5%)
[13:35:51] Completed 15000 out of 250000 steps  (6%)
[13:53:52] Completed 17500 out of 250000 steps  (7%)
[14:11:51] Completed 20000 out of 250000 steps  (8%)
[14:29:53] Completed 22500 out of 250000 steps  (9%)
[14:47:55] Completed 25000 out of 250000 steps  (10%)
[15:05:53] Completed 27500 out of 250000 steps  (11%)
[15:23:55] Completed 30000 out of 250000 steps  (12%)
[15:41:56] Completed 32500 out of 250000 steps  (13%)
[15:59:59] Completed 35000 out of 250000 steps  (14%)
[16:18:02] Completed 37500 out of 250000 steps  (15%)
[16:36:01] Completed 40000 out of 250000 steps  (16%)
[16:54:02] Completed 42500 out of 250000 steps  (17%)
[17:12:01] Completed 45000 out of 250000 steps  (18%)
[17:30:05] Completed 47500 out of 250000 steps  (19%)
[17:48:10] Completed 50000 out of 250000 steps  (20%)
[18:06:12] Completed 52500 out of 250000 steps  (21%)
[18:24:17] Completed 55000 out of 250000 steps  (22%)
[18:42:17] Completed 57500 out of 250000 steps  (23%)
[19:00:21] Completed 60000 out of 250000 steps  (24%)
[19:18:28] Completed 62500 out of 250000 steps  (25%)
[19:36:31] Completed 65000 out of 250000 steps  (26%)
[19:54:36] Completed 67500 out of 250000 steps  (27%)
[20:12:39] Completed 70000 out of 250000 steps  (28%)
[20:30:43] Completed 72500 out of 250000 steps  (29%)
[20:48:49] Completed 75000 out of 250000 steps  (30%)
[21:06:54] Completed 77500 out of 250000 steps  (31%)
[21:25:00] Completed 80000 out of 250000 steps  (32%)
[21:43:04] Completed 82500 out of 250000 steps  (33%)
[22:01:10] Completed 85000 out of 250000 steps  (34%)
[22:19:17] Completed 87500 out of 250000 steps  (35%)
[22:37:21] Completed 90000 out of 250000 steps  (36%)
[22:55:26] Completed 92500 out of 250000 steps  (37%)
[23:13:30] Completed 95000 out of 250000 steps  (38%)
[23:31:35] Completed 97500 out of 250000 steps  (39%)
[23:49:43] Completed 100000 out of 250000 steps  (40%)
[00:07:47] Completed 102500 out of 250000 steps  (41%)
[00:25:55] Completed 105000 out of 250000 steps  (42%)
[00:43:58] Completed 107500 out of 250000 steps  (43%)
[01:02:04] Completed 110000 out of 250000 steps  (44%)
[01:20:09] Completed 112500 out of 250000 steps  (45%)
[01:38:12] Completed 115000 out of 250000 steps  (46%)
[01:56:19] Completed 117500 out of 250000 steps  (47%)
[02:14:24] Completed 120000 out of 250000 steps  (48%)
[02:32:29] Completed 122500 out of 250000 steps  (49%)
[02:50:37] Completed 125000 out of 250000 steps  (50%)
[03:08:43] Completed 127500 out of 250000 steps  (51%)
[03:26:52] Completed 130000 out of 250000 steps  (52%)
[03:44:57] Completed 132500 out of 250000 steps  (53%)
[04:03:06] Completed 135000 out of 250000 steps  (54%)
[04:21:14] Completed 137500 out of 250000 steps  (55%)
[04:39:20] Completed 140000 out of 250000 steps  (56%)
[04:57:29] Completed 142500 out of 250000 steps  (57%)
[05:15:34] Completed 145000 out of 250000 steps  (58%)
[05:33:42] Completed 147500 out of 250000 steps  (59%)
[05:51:50] Completed 150000 out of 250000 steps  (60%)
[06:09:54] Completed 152500 out of 250000 steps  (61%)
[06:28:02] Completed 155000 out of 250000 steps  (62%)
[06:46:07] Completed 157500 out of 250000 steps  (63%)
[07:04:14] Completed 160000 out of 250000 steps  (64%)
[07:22:21] Completed 162500 out of 250000 steps  (65%)
[07:40:24] Completed 165000 out of 250000 steps  (66%)
[07:58:31] Completed 167500 out of 250000 steps  (67%)
[08:16:36] Completed 170000 out of 250000 steps  (68%)
[08:34:43] Completed 172500 out of 250000 steps  (69%)
[08:52:50] Completed 175000 out of 250000 steps  (70%)
[09:10:56] Completed 177500 out of 250000 steps  (71%)
[09:29:05] Completed 180000 out of 250000 steps  (72%)
[09:47:11] Completed 182500 out of 250000 steps  (73%)
[10:05:20] Completed 185000 out of 250000 steps  (74%)
[10:23:28] Completed 187500 out of 250000 steps  (75%)
[10:41:34] Completed 190000 out of 250000 steps  (76%)
[10:59:42] Completed 192500 out of 250000 steps  (77%)
[11:17:50] Completed 195000 out of 250000 steps  (78%)
[11:35:56] Completed 197500 out of 250000 steps  (79%)
[11:54:05] Completed 200000 out of 250000 steps  (80%)
[12:12:13] Completed 202500 out of 250000 steps  (81%)
[12:30:22] Completed 205000 out of 250000 steps  (82%)
[12:48:32] Completed 207500 out of 250000 steps  (83%)
[13:06:42] Completed 210000 out of 250000 steps  (84%)
[13:24:51] Completed 212500 out of 250000 steps  (85%)
[13:42:59] Completed 215000 out of 250000 steps  (86%)
[14:01:08] Completed 217500 out of 250000 steps  (87%)
[14:19:17] Completed 220000 out of 250000 steps  (88%)
[14:37:22] Completed 222500 out of 250000 steps  (89%)
[14:55:30] Completed 225000 out of 250000 steps  (90%)
[15:13:36] Completed 227500 out of 250000 steps  (91%)
[15:31:45] Completed 230000 out of 250000 steps  (92%)
[15:49:54] Completed 232500 out of 250000 steps  (93%)
[16:07:58] Completed 235000 out of 250000 steps  (94%)
[16:26:04] Completed 237500 out of 250000 steps  (95%)
[16:44:07] Completed 240000 out of 250000 steps  (96%)
[17:02:12] Completed 242500 out of 250000 steps  (97%)
[17:20:19] Completed 245000 out of 250000 steps  (98%)
[17:38:24] Completed 247500 out of 250000 steps  (99%)
[17:56:29] Completed 250000 out of 250000 steps  (100%)
[17:56:40] DynamicWrapper: Finished Work Unit: sleep=10000
[17:56:50] 
[17:56:50] Finished Work Unit:
[17:56:50] - Reading up to 64340496 from "work/wudata_02.trr": Read 64340496
[17:56:51] trr file hash check passed.
[17:56:51] - Reading up to 31618172 from "work/wudata_02.xtc": Read 31618172
[17:56:51] xtc file hash check passed.
[17:56:51] edr file hash check passed.
[17:56:51] logfile size: 190554
[17:56:51] Leaving Run
[17:56:54] - Writing 96310098 bytes of core data to disk...
[17:57:21] Done: 96309586 -> 91545145 (compressed to 5.8 percent)
[17:57:21]   ... Done.
[17:57:28] - Shutting down core
[17:57:28] 
[17:57:28] Folding@home Core Shutdown: FINISHED_UNIT
[17:57:29] CoreStatus = 64 (100)
[17:57:29] Sending work to server
[17:57:29] Project: 8101 (Run 5, Clone 2, Gen 85)


[17:57:29] + Attempting to send results [September 30 17:57:29 UTC]
[18:03:00] - Server reports problem with unit.
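
For anyone else digging through logs, here is a minimal parsing sketch (my own, not part of the FAH client) that pulls out the assignment, start, PRCG, upload-attempt and rejection lines; the patterns are taken from the v6 log above, and the file name is just the usual default, so adjust for your setup:

Code:

# Minimal sketch (not part of the FAH client) to extract the events
# bollix47 asked about from a v6-style log. Patterns are copied from
# the log pasted above; adjust the file name/path for your setup.
import re
import sys

PATTERNS = {
    "assigned":  re.compile(r"Successful: assigned to \(([\d.]+)\)"),
    "started":   re.compile(r"Working on queue slot \d+ \[(.+ UTC)\]"),
    "prcg":      re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"),
    "uploading": re.compile(r"Attempting to send results \[(.+ UTC)\]"),
    "rejected":  re.compile(r"Server reports problem with unit"),
}

def summarize(path):
    # Print each relevant event in the order it appears in the log.
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    print(name + ":", line.strip())

if __name__ == "__main__":
    summarize(sys.argv[1] if len(sys.argv) > 1 else "FAHlog.txt")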
bollix47
Posts: 2953
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Bigadv Collection and or Assignment server is broken

Post by bollix47 »

I didn't see any reports for 6901, but if that's the case then two different work servers were involved.

As far as timing goes, my download would have occurred around 9pm PDT Sep 28 and the failed upload would have been around 8pm PDT Sep 29.
PinHead
Posts: 285
Joined: Tue Jan 24, 2012 3:43 am
Hardware configuration: Quad Q9550 2.83 contains the GPU 57xx - running SMP and GPU
Quad Q6700 2.66 running just SMP
2P 32core Interlagos SMP on linux

Re: Bigadv Collection and or Assignment server is broken

Post by PinHead »

jcoffland wrote:I'm not sure I agree that that is the obvious conclusion from the evidence.

EDIT: You can see on the psummary page that 8101 and 8102 are from the same WS. Unless the problem happened often on project 6901 that incident is unlikely to be related.
Could you check servers that received the same patch/upgrade and are not BA to see if they are starting to report reduced WU/points?
jcoffland
Site Admin
Posts: 1018
Joined: Fri Oct 10, 2008 6:42 pm
Location: Helsinki, Finland
Contact:

Re: Bigadv Collection and or Assignment server is broken

Post by jcoffland »

PinHead wrote:
jcoffland wrote:I'm not sure I agree that that is the obvious conclusion from the evidence.

EDIT: You can see on the psummary page that 8101 and 8102 are from the same WS. Unless the problem happened often on project 6901 that incident is unlikely to be related.
Could you check servers that received the same patch/upgrade and are not BA to see if they are starting to report reduced WU/points?
This does not have anything to do with a software upgrade or patch.
Cauldron Development LLC
http://cauldrondevelopment.com/
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III i7 970 4.3Ghz DDR3 2000 2-500GB Seagate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: Bigadv Collection and or Assignment server is broken

Post by Grandpa_01 »

jcoffland wrote:
Grandpa_01 wrote:Yes, it was on 8102; I lost 3 of them (very first post in this thread) and one 8101. Some also lost 6901s, so yes, the AS server changes were most likely the cause.
I'm not sure I agree that that is the obvious conclusion from the evidence.

EDIT: You can see on the psummary page that 8101 and 8102 are from the same WS. Unless the problem happened often on project 6901 that incident is unlikely to be related.
Just curious as to how we explain that the slower machines are still getting their WUs that were assigned during that time period rejected, while other WUs assigned to faster machines since that time period are being accepted by the collection server. That does not compute in my mind for some reason. If it was the collection server that was borking the WUs, then all WUs returned after the fix should be accepted; yet they are not, some are accepted and some are not, depending on when they were assigned. At least that is the way it appears.

What am I missing here?

EDIT: I see what I am missing; it is not the AS that hands the WUs out, it is the WS. :e?:
Last edited by Grandpa_01 on Mon Oct 01, 2012 3:34 am, edited 2 times in total.
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
PinHead
Posts: 285
Joined: Tue Jan 24, 2012 3:43 am
Hardware configuration: Quad Q9550 2.83 contains the GPU 57xx - running SMP and GPU
Quad Q6700 2.66 running just SMP
2P 32core Interlagos SMP on linux

Re: Bigadv Collection and or Assignment server is broken

Post by PinHead »

One more observation for everyone's consumption.

The first 8101 on my faster machine that failed received full points (330K on my machine) on its second folding effort (1 day later); it should have received a 1-day penalty. To me, this implies that whichever server is responsible did not know that the WU was assigned in the first place. Just a little food for thought.
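
For context, a rough sketch of the bonus arithmetic behind that observation, assuming the commonly published quick-return-bonus formula; the base points, k-factor and deadline below are placeholder guesses, not the real 8101 constants:

Code:

# Rough sketch of the quick-return bonus arithmetic (assumed formula:
# points = base * max(1, sqrt(k * deadline_days / elapsed_days))).
# BASE, K and DEADLINE are placeholder guesses, not the real 8101 values.
import math

def qrb_points(base_points, k_factor, deadline_days, elapsed_days):
    # The bonus multiplier never drops below 1.
    bonus = math.sqrt(k_factor * deadline_days / elapsed_days)
    return base_points * max(1.0, bonus)

BASE, K, DEADLINE = 8955, 26.4, 6.0   # hypothetical 8101-like numbers

# Counting only the second pass vs. counting the day lost to the first,
# rejected pass: if the server had known about the first assignment,
# the extra day should have cut the payout noticeably.
print(round(qrb_points(BASE, K, DEADLINE, elapsed_days=1.8)))
print(round(qrb_points(BASE, K, DEADLINE, elapsed_days=2.8)))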
xposer
Posts: 10
Joined: Sun Nov 01, 2009 11:28 pm

Re: Bigadv Collection and or Assignment server is broken

Post by xposer »

So, this is the second WU this weekend, from a different 32-core (server) computer. One was from the v7 client; the second (this one) was from a 6.34 client, both doing bigadv WUs.

Code:

[01:11:23] Passkey found
[01:11:23] - Will indicate memory of 32159 MB
[01:11:23] - Connecting to assignment server
[01:11:23] Connecting to http://assign.stanford.edu:8080/
[01:11:23] Posted data.
[01:11:23] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[01:11:23] + News From Folding@Home: Welcome to Folding@Home
[01:11:24] Loaded queue successfully.
[01:11:24] Sent data
[01:11:24] Connecting to http://128.143.231.201:8080/
[01:11:30] Posted data.
[01:11:30] Initial: 0000; - Receiving payload (expected size: 30311815)
[01:11:50] - Downloaded at ~1480 kB/s
[01:11:50] - Averaged speed for that direction ~1518 kB/s
[01:11:50] + Received work.
[01:11:50] Trying to send all finished work units
[01:11:50] + No unsent completed units remaining.
[01:11:50] + Closed connections
[01:11:50] 
[01:11:50] + Processing work unit
[01:11:50] Core required: FahCore_a5.exe
[01:11:50] Core found.
[01:11:50] Working on queue slot 08 [September 29 01:11:50 UTC]
[01:11:50] + Working ...
[01:11:50] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 08 -np 32 -priority 96 -checkpoint 30 -forceasm -verbose -lifeline 2304 -version 634'

thekraken: The Kraken 0.7-pre15 (compiled Sun Sep  2 13:23:36 EDT 2012 by mike@server1)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 14769
thekraken: Logging to thekraken.log
[01:11:50] 
[01:11:50] *------------------------------*
[01:11:50] Folding@Home Gromacs SMP Core
[01:11:50] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[01:11:50] 
[01:11:50] Preparing to commence simulation
[01:11:50] - Assembly optimizations manually forced on.
[01:11:50] - Not checking prior termination.
[01:11:54] - Expanded 30311303 -> 33158020 (decompressed 109.3 percent)
[01:11:54] Called DecompressByteArray: compressed_data_size=30311303 data_size=33158020, decompressed_data_size=33158020 diff=0
[01:11:54] - Digital signature verified
[01:11:54] 
[01:11:54] Project: 8101 (Run 27, Clone 4, Gen 40)
[01:11:54] 
[01:11:54] Assembly optimizations on if available.
[01:11:54] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra, 
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff, 
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz, 
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_08.tpr, VERSION 4.5.5-dev-20120903-d64b9e3 (single precision)
[01:12:01] Mapping NT from 32 to 32 
Starting 32 threads
Making 2D domain decomposition 8 x 4 x 1
starting mdrun 'FP_membrane in water'
10250000 steps,  41000.0 ps (continuing from step 10000000,  40000.0 ps).
[01:12:08] Completed 0 out of 250000 steps  (0%)

NOTE: Turning on dynamic load balancing

[01:37:29] Completed 2500 out of 250000 steps  (1%)
[02:02:35] Completed 5000 out of 250000 steps  (2%)
[02:14:16] - Autosending finished units... [September 29 02:14:16 UTC]
[02:14:16] Trying to send all finished work units
[02:14:16] + No unsent completed units remaining.
[02:14:16] - Autosend completed
[02:27:40] Completed 7500 out of 250000 steps  (3%)
[02:52:46] Completed 10000 out of 250000 steps  (4%)
[03:17:52] Completed 12500 out of 250000 steps  (5%)
[03:42:58] Completed 15000 out of 250000 steps  (6%)
[04:08:02] Completed 17500 out of 250000 steps  (7%)
[04:33:09] Completed 20000 out of 250000 steps  (8%)
[04:58:14] Completed 22500 out of 250000 steps  (9%)
[05:23:20] Completed 25000 out of 250000 steps  (10%)
[05:48:25] Completed 27500 out of 250000 steps  (11%)
[06:13:30] Completed 30000 out of 250000 steps  (12%)
[06:38:31] Completed 32500 out of 250000 steps  (13%)
[07:03:37] Completed 35000 out of 250000 steps  (14%)
[07:28:43] Completed 37500 out of 250000 steps  (15%)
[07:53:49] Completed 40000 out of 250000 steps  (16%)
[08:14:16] - Autosending finished units... [September 29 08:14:16 UTC]
[08:14:16] Trying to send all finished work units
[08:14:16] + No unsent completed units remaining.
[08:14:16] - Autosend completed
[08:18:54] Completed 42500 out of 250000 steps  (17%)
[08:43:59] Completed 45000 out of 250000 steps  (18%)
[09:09:00] Completed 47500 out of 250000 steps  (19%)
[09:34:06] Completed 50000 out of 250000 steps  (20%)
[09:59:12] Completed 52500 out of 250000 steps  (21%)
[10:24:19] Completed 55000 out of 250000 steps  (22%)
[10:49:26] Completed 57500 out of 250000 steps  (23%)
[11:14:32] Completed 60000 out of 250000 steps  (24%)
[11:39:55] Completed 62500 out of 250000 steps  (25%)
[12:05:02] Completed 65000 out of 250000 steps  (26%)
[12:30:07] Completed 67500 out of 250000 steps  (27%)
[12:55:12] Completed 70000 out of 250000 steps  (28%)
[13:20:18] Completed 72500 out of 250000 steps  (29%)
[13:45:24] Completed 75000 out of 250000 steps  (30%)
[14:10:26] Completed 77500 out of 250000 steps  (31%)
[14:14:16] - Autosending finished units... [September 29 14:14:16 UTC]
[14:14:16] Trying to send all finished work units
[14:14:16] + No unsent completed units remaining.
[14:14:16] - Autosend completed
[14:35:32] Completed 80000 out of 250000 steps  (32%)
[15:00:37] Completed 82500 out of 250000 steps  (33%)
[15:25:47] Completed 85000 out of 250000 steps  (34%)
[15:50:52] Completed 87500 out of 250000 steps  (35%)
[16:15:57] Completed 90000 out of 250000 steps  (36%)
[16:41:00] Completed 92500 out of 250000 steps  (37%)
[17:06:07] Completed 95000 out of 250000 steps  (38%)
[17:31:13] Completed 97500 out of 250000 steps  (39%)
[17:56:22] Completed 100000 out of 250000 steps  (40%)
[18:21:29] Completed 102500 out of 250000 steps  (41%)
[18:46:38] Completed 105000 out of 250000 steps  (42%)
[19:11:41] Completed 107500 out of 250000 steps  (43%)
[19:36:50] Completed 110000 out of 250000 steps  (44%)
[20:01:58] Completed 112500 out of 250000 steps  (45%)
[20:14:16] - Autosending finished units... [September 29 20:14:16 UTC]
[20:14:16] Trying to send all finished work units
[20:14:16] + No unsent completed units remaining.
[20:14:16] - Autosend completed
[20:27:08] Completed 115000 out of 250000 steps  (46%)
[20:52:14] Completed 117500 out of 250000 steps  (47%)
[21:17:22] Completed 120000 out of 250000 steps  (48%)
[21:42:28] Completed 122500 out of 250000 steps  (49%)
[22:07:30] Completed 125000 out of 250000 steps  (50%)
[22:32:35] Completed 127500 out of 250000 steps  (51%)
[22:57:42] Completed 130000 out of 250000 steps  (52%)
[23:22:48] Completed 132500 out of 250000 steps  (53%)
[23:47:55] Completed 135000 out of 250000 steps  (54%)
[00:13:02] Completed 137500 out of 250000 steps  (55%)
[00:38:05] Completed 140000 out of 250000 steps  (56%)
[01:03:11] Completed 142500 out of 250000 steps  (57%)
[01:28:19] Completed 145000 out of 250000 steps  (58%)
[01:53:25] Completed 147500 out of 250000 steps  (59%)
[02:14:16] - Autosending finished units... [September 30 02:14:16 UTC]
[02:14:16] Trying to send all finished work units
[02:14:16] + No unsent completed units remaining.
[02:14:16] - Autosend completed
[02:18:32] Completed 150000 out of 250000 steps  (60%)
[02:43:38] Completed 152500 out of 250000 steps  (61%)
[03:08:40] Completed 155000 out of 250000 steps  (62%)
[03:33:45] Completed 157500 out of 250000 steps  (63%)
[03:58:51] Completed 160000 out of 250000 steps  (64%)
[04:23:58] Completed 162500 out of 250000 steps  (65%)
[04:49:05] Completed 165000 out of 250000 steps  (66%)
[05:14:11] Completed 167500 out of 250000 steps  (67%)
[05:39:16] Completed 170000 out of 250000 steps  (68%)
[06:04:24] Completed 172500 out of 250000 steps  (69%)
[06:29:32] Completed 175000 out of 250000 steps  (70%)
[06:54:40] Completed 177500 out of 250000 steps  (71%)
[07:19:46] Completed 180000 out of 250000 steps  (72%)
[07:44:53] Completed 182500 out of 250000 steps  (73%)
[08:10:00] Completed 185000 out of 250000 steps  (74%)
[08:14:16] - Autosending finished units... [September 30 08:14:16 UTC]
[08:14:16] Trying to send all finished work units
[08:14:16] + No unsent completed units remaining.
[08:14:16] - Autosend completed
[08:35:07] Completed 187500 out of 250000 steps  (75%)
[09:00:17] Completed 190000 out of 250000 steps  (76%)
[09:25:25] Completed 192500 out of 250000 steps  (77%)
[09:50:34] Completed 195000 out of 250000 steps  (78%)
[10:15:41] Completed 197500 out of 250000 steps  (79%)
[10:40:45] Completed 200000 out of 250000 steps  (80%)
[11:05:54] Completed 202500 out of 250000 steps  (81%)
[11:31:01] Completed 205000 out of 250000 steps  (82%)
[11:56:30] Completed 207500 out of 250000 steps  (83%)
[12:21:38] Completed 210000 out of 250000 steps  (84%)
[12:46:44] Completed 212500 out of 250000 steps  (85%)
[13:11:50] Completed 215000 out of 250000 steps  (86%)
[13:36:58] Completed 217500 out of 250000 steps  (87%)
[14:02:06] Completed 220000 out of 250000 steps  (88%)
[14:14:16] - Autosending finished units... [September 30 14:14:16 UTC]
[14:14:16] Trying to send all finished work units
[14:14:16] + No unsent completed units remaining.
[14:14:16] - Autosend completed
[14:27:14] Completed 222500 out of 250000 steps  (89%)
[14:52:21] Completed 225000 out of 250000 steps  (90%)
[15:17:28] Completed 227500 out of 250000 steps  (91%)
[15:42:35] Completed 230000 out of 250000 steps  (92%)
[16:07:39] Completed 232500 out of 250000 steps  (93%)
[16:32:49] Completed 235000 out of 250000 steps  (94%)
[16:57:57] Completed 237500 out of 250000 steps  (95%)
[17:23:05] Completed 240000 out of 250000 steps  (96%)
[17:48:14] Completed 242500 out of 250000 steps  (97%)
[18:13:24] Completed 245000 out of 250000 steps  (98%)
[18:38:30] Completed 247500 out of 250000 steps  (99%)
[19:03:40] Completed 250000 out of 250000 steps  (100%)

Writing final coordinates.

 Average load imbalance: 0.2 %
 Part of the total run time spent waiting due to load imbalance: 0.1 %
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 %


	Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time: 150707.879 150707.879    100.0
                       1d17h51:47
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:    721.351     38.084      0.573     41.863

Thanx for Using GROMACS - Have a Nice Day

[19:03:55] DynamicWrapper: Finished Work Unit: sleep=10000
[19:04:05] 
[19:04:05] Finished Work Unit:
[19:04:05] - Reading up to 64340496 from "work/wudata_08.trr": Read 64340496
[19:04:06] trr file hash check passed.
[19:04:06] - Reading up to 31557920 from "work/wudata_08.xtc": Read 31557920
[19:04:06] xtc file hash check passed.
[19:04:06] edr file hash check passed.
[19:04:06] logfile size: 192025
[19:04:06] Leaving Run
[19:04:09] - Writing 96251317 bytes of core data to disk...
[19:04:31] Done: 96250805 -> 91518069 (compressed to 5.8 percent)
[19:04:31]   ... Done.
[19:15:09] - Shutting down core
[19:15:09] 
[19:15:09] Folding@home Core Shutdown: FINISHED_UNIT
[19:16:26] CoreStatus = 64 (100)
[19:16:26] Unit 8 finished with 56 percent of time to deadline remaining.
[19:16:26] Updated performance fraction: 0.560960
[19:16:26] Sending work to server
[19:16:26] Project: 8101 (Run 27, Clone 4, Gen 40)


[19:16:26] + Attempting to send results [September 30 19:16:26 UTC]
[19:16:26] - Reading file work/wuresults_08.dat from core
[19:16:26]   (Read 91518581 bytes from disk)
[19:16:26] Connecting to http://128.143.231.201:8080/
[19:28:46] Posted data.
[19:28:46] Initial: 0000; - Uploaded at ~120 kB/s
[19:28:46] - Averaged speed for that direction ~118 kB/s
[19:28:46] - Server reports problem with unit.
[19:28:46] Trying to send all finished work units
[19:28:46] + No unsent completed units remaining.
[19:28:46] - Preparing to get new work unit...
[19:28:46] Cleaning up work directory
[19:30:07] + Attempting to get work packet
[19:30:07] Passkey found
[19:30:07] - Will indicate memory of 32159 MB
[19:30:07] - Connecting to assignment server
[19:30:07] Connecting to http://assign.stanford.edu:8080/
(1)X58a-UD3R . I7-970 (6/12core) @ 3.9 ghz (Windows 7) 12 gigs Mushkin 1333(V7)
(2)KGPE-D16 . 2 x 6274(32cores) @ 2.2 ghz (ubuntu 12.04.1) 16 gigs Kingston 1333(6.34)
(3)KGPE-D16 . 2 x 6274(32cores) @ 2.2 ghz (ubuntu 12.04.1) 32 gigs G Skill 1333(V7)
bollix47
Posts: 2953
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Bigadv Collection and or Assignment server is broken

Post by bollix47 »

Happy to report that the WU I received after the failed upload has just gone up and it was not dumped. It was a different 8101 WU from the failed one.
tear
Posts: 254
Joined: Sun Dec 02, 2007 4:08 am
Hardware configuration: None
Location: Rocky Mountains

Re: Bigadv Collection and or Assignment server is broken

Post by tear »

Joe, many thanks for looking into this over the weekend.

Agreed, AS doesn't seem to be part of the issue (per my understanding of the system, at least).

What we're seeing here are/were _massive_ issues (affecting tens, if not hundreds, of donors) with returning WUs to 128.143.231.201, all coinciding in time -- unlikely to be "random" failures.
There have been no reports (guys, correct me if I'm wrong, please) of issues returning P6901 WUs (130.237.232.237).

As many folks explained -- symptoms were as follows (in sequence):
- WU complete
- WU results upload
- Server reports problem with unit.
- WU download (same PRCG if upload and download were sequential)
- WU complete
- WU return
- Server accepts the unit.

It does seem like 128.143.231.201 needs to be looked at...
One man's ceiling is another man's floor.
Thomas R
Posts: 9
Joined: Thu Jul 19, 2012 9:19 pm

Re: Bigadv Collection and or Assignment server is broken

Post by Thomas R »

bollix47 wrote:Welcome to the folding support forum Thomas R.

Your problem is somewhat different from what is being discussed in this thread, but I can't say that it's not somehow related since it is the same server.

Your log shows the attempts were made around 12 hours ago. Has the situation changed since then or is that work unit still trying to upload?
No (as far as I can see). The next project is running, but I can't find the result in any stats.