128.143.231.201 or Bigadv Collection server broken

Moderators: Site Moderators, FAHC Science Team

texinga
Posts: 52
Joined: Sat Feb 05, 2011 4:42 pm

Re: Bigadv Collection and or Assignment server is broken

Post by texinga »

The 8102 that I spoke about earlier in this thread (R0,C16,G71) did process OK the second time I completed it today. Like Grandpa, I got the "Results successfully sent... Thank you for your contribution" message when it completed this time. The next WU I got was a different 8102 (R0,C16,G72), and it is running now. Just wanted to report back that (for me) the problem appears to be gone for now.
decali
Posts: 5
Joined: Sun Sep 30, 2012 8:22 pm

Re: Bigadv Collection and or Assignment server is broken

Post by decali »

Looks like I lost an 8101 as well today. I received this message:
[17:56:50] Finished Work Unit:
[17:56:50] - Reading up to 64340496 from "work/wudata_02.trr": Read 64340496
[17:56:51] trr file hash check passed.
[17:56:51] - Reading up to 31618172 from "work/wudata_02.xtc": Read 31618172
[17:56:51] xtc file hash check passed.
[17:56:51] edr file hash check passed.
[17:56:51] logfile size: 190554
[17:56:51] Leaving Run
[17:56:54] - Writing 96310098 bytes of core data to disk...
[17:57:21] Done: 96309586 -> 91545145 (compressed to 5.8 percent)
[17:57:21] ... Done.
[17:57:28] - Shutting down core
[17:57:28]
[17:57:28] Folding@home Core Shutdown: FINISHED_UNIT
[17:57:29] CoreStatus = 64 (100)
[17:57:29] Sending work to server
[17:57:29] Project: 8101 (Run 5, Clone 2, Gen 85)


[17:57:29] + Attempting to send results [September 30 17:57:29 UTC]
[18:03:00] - Server reports problem with unit.
I was then assigned the same 8101 (5, 2, 85).
bollix47
Posts: 2953
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Bigadv Collection and or Assignment server is broken

Post by bollix47 »

If you're still reporting errors please add a couple of processing frames to your logs or let us know when it was downloaded. That might have a bearing on whether or not they fail.

Thanks.
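For anyone who wants to pull the processing frames out of a v6 client log before posting, here is a minimal sketch. It assumes the `[HH:MM:SS] Completed N out of M steps` line format seen in the logs in this thread; the function name `frame_times` is made up for illustration, and the sample lines are copied from a log posted further down.

```python
import re
from datetime import datetime, timedelta

# Matches v6-style progress lines such as:
# [01:45:11] Completed 2500 out of 250000 steps  (1%)
FRAME_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Completed (\d+) out of (\d+) steps")

def frame_times(lines):
    """Return the seconds elapsed between consecutive progress lines."""
    stamps = []
    for line in lines:
        m = FRAME_RE.match(line.strip())
        if m:
            stamps.append(datetime.strptime(m.group(1), "%H:%M:%S"))
    deltas = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = cur - prev
        if delta < timedelta(0):
            # Log timestamps carry no date, so a negative gap is
            # assumed to be a single midnight rollover.
            delta += timedelta(days=1)
        deltas.append(int(delta.total_seconds()))
    return deltas

sample = [
    "[01:32:08] Completed 0 out of 250000 steps  (0%)",
    "[01:45:11] Completed 2500 out of 250000 steps  (1%)",
    "[01:58:15] Completed 5000 out of 250000 steps  (2%)",
]
print(frame_times(sample))  # [783, 784]
```

Roughly 13 minutes per frame in this sample. Posting a couple of these alongside the failed upload makes it easier to compare a rejected WU against a normal one.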
Grandpa_01
Posts: 1122
Joined: Wed Mar 04, 2009 7:36 am
Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
1 - Asus Rampage Gene III 17 970 4.3Ghz DDR3 2000 2-500GB Segate 7200.11 0-Raid Ubuntu 10.10
1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M

Re: Bigadv Collection and or Assignment server is broken

Post by Grandpa_01 »

I would say it is resolved for those with faster machines. I just sent my third one today and it was accepted as good by the collection server, and others have reported the same on different forums. My guess is that this was an Assignment Server issue, not a Collection Server issue. It appears that the AS may have messed up some WUs when it was assigning them; they would then need to be run, sent back, rejected, reassigned after the problem was cleared up, rerun, and turned in again with no errors. It does appear the WUs were actually bad due to an AS glitch. The people who are still seeing the issue seem to be on slower rigs that have not completed the first step yet, although I am not sure about patonb, since he did not provide any clue in his post.

But this is just a guess from looking at the available info.
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
jeanjean15
Posts: 7
Joined: Fri Aug 06, 2010 8:55 am

Re: Bigadv Collection and or Assignment server is broken

Post by jeanjean15 »

I have the same problem with two P8101 WUs:

1) Project: 8101 (Run 19, Clone 8, Gen 28)

Code:

# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /tmp/langouste-jeanjean15/3475/clientdir
Executable: /home/jeanjean15/fah/fah6
Arguments: -send 04 -smp -bigbeta 

[01:28:32] - Ask before connecting: No
[01:28:32] - Proxy: 127.0.0.1:8880
[01:28:32] - User name: [Inpact]jeanjean15 (Team 51)
[01:28:32] - User ID: 2F4FA5F87427C95C
[01:28:32] - Machine ID: 2
[01:28:32] 
[01:28:32] Loaded queue successfully.
[01:28:32] Attempting to return result(s) to server...
[01:28:32] Project: 8101 (Run 19, Clone 8, Gen 28)


[01:28:32] + Attempting to send results [September 30 01:28:32 UTC]
[01:43:17] - Server reports problem with unit.
[01:43:17] - Failed to send unit 04 to server

Folding@Home Client Shutdown.
all done
WARNING: unsent result files!
2) Project: 8101 (Run 3, Clone 3, Gen 82)

Code:

# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /tmp/langouste-jeanjean15/3962/clientdir
Executable: /home/jeanjean15/fah/fah6
Arguments: -send 00 -smp -bigadv 

[19:17:53] - Ask before connecting: No
[19:17:53] - Proxy: 127.0.0.1:8880
[19:17:53] - User name: [Inpact]jeanjean15 (Team 51)
[19:17:53] - User ID: 2F4FA5F87427C95C
[19:17:53] - Machine ID: 1
[19:17:53] 
[19:17:53] Loaded queue successfully.
[19:17:53] Attempting to return result(s) to server...
[19:17:53] Project: 8101 (Run 3, Clone 3, Gen 82)


[19:17:53] + Attempting to send results [September 30 19:17:53 UTC]
[19:32:18] - Server reports problem with unit.
[19:32:18] - Failed to send unit 00 to server

Folding@Home Client Shutdown.
all done
WARNING: unsent result files!
Thomas R
Posts: 9
Joined: Thu Jul 19, 2012 9:19 pm

Re: Bigadv Collection and or Assignment server is broken

Post by Thomas R »

Same problem with an 8102:

Code:

[11:15:56] Completed 245000 out of 250000 steps  (98%)
[11:25:53] Completed 247500 out of 250000 steps  (99%)
[11:35:51] Completed 250000 out of 250000 steps  (100%)
[11:36:06] DynamicWrapper: Finished Work Unit: sleep=10000
[11:36:16] 
[11:36:16] Finished Work Unit:
[11:36:16] - Reading up to 64407792 from "work/wudata_05.trr": Read 64407792
[11:36:16] trr file hash check passed.
[11:36:16] - Reading up to 31623544 from "work/wudata_05.xtc": Read 31623544
[11:36:17] xtc file hash check passed.
[11:36:17] edr file hash check passed.
[11:36:17] logfile size: 191066
[11:36:17] Leaving Run
[11:36:21] - Writing 96383278 bytes of core data to disk...
[11:36:49] Done: 96382766 -> 91636117 (compressed to 5.9 percent)
[11:36:49]   ... Done.
[11:37:50] - Shutting down core
[11:37:50] 
[11:37:50] Folding@home Core Shutdown: FINISHED_UNIT
[11:37:58] CoreStatus = 64 (100)
[11:37:58] Unit 5 finished with 83 percent of time to deadline remaining.
[11:37:58] Updated performance fraction: 0.793144
[11:37:58] Sending work to server
[11:37:58] Project: 8102 (Run 0, Clone 33, Gen 55)


[11:37:58] + Attempting to send results [September 30 11:37:58 UTC]
[11:37:58] - Reading file work/wuresults_05.dat from core
[11:37:58]   (Read 91636629 bytes from disk)
[11:37:58] Connecting to http://128.143.231.201:8080/
[11:37:58] - Couldn't send HTTP request to server
[11:37:58] + Could not connect to Work Server (results)
[11:37:58]     (128.143.231.201:8080)
[11:37:58] + Retrying using alternative port
[11:37:58] Connecting to http://128.143.231.201:80/
[11:37:58] - Couldn't send HTTP request to server
[11:37:58] + Could not connect to Work Server (results)
[11:37:58]     (128.143.231.201:80)
[11:37:58] - Error: Could not transmit unit 05 (completed September 30) to work server.
[11:37:58] - 1 failed uploads of this unit.
[11:37:58]   Keeping unit 05 in queue.
[11:37:58] Trying to send all finished work units
[11:37:58] Project: 8102 (Run 0, Clone 33, Gen 55)


[11:37:58] + Attempting to send results [September 30 11:37:58 UTC]
[11:37:58] - Reading file work/wuresults_05.dat from core
[11:37:58]   (Read 91636629 bytes from disk)
[11:37:58] Connecting to http://128.143.231.201:8080/
[11:37:58] - Couldn't send HTTP request to server
[11:37:58] + Could not connect to Work Server (results)
[11:37:58]     (128.143.231.201:8080)
[11:37:58] + Retrying using alternative port
[11:37:58] Connecting to http://128.143.231.201:80/
[11:37:58] - Couldn't send HTTP request to server
[11:37:58] + Could not connect to Work Server (results)
[11:37:58]     (128.143.231.201:80)
[11:37:58] - Error: Could not transmit unit 05 (completed September 30) to work server.
[11:37:58] - 2 failed uploads of this unit.
[11:37:58]   Keeping unit 05 in queue.
[11:37:58] + Sent 0 of 1 completed units to the server
[11:37:58] - Preparing to get new work unit...
[11:37:58] Cleaning up work directory
[11:37:58] + Attempting to get work packet
[11:37:58] Passkey found
[11:37:58] - Will indicate memory of 64410 MB
[11:37:58] - Connecting to assignment server
[11:37:58] Connecting to http://assign.stanford.edu:8080/
[11:37:59] Posted data.
[11:37:59] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[11:37:59] + News From Folding@Home: Welcome to Folding@Home
[11:37:59] Loaded queue successfully.
[11:37:59] Sent data
[11:37:59] Connecting to http://128.143.231.201:8080/
[11:38:06] Posted data.
[11:38:06] Initial: 0000; - Receiving payload (expected size: 30310114)
[11:40:15] - Downloaded at ~229 kB/s
[11:40:15] - Averaged speed for that direction ~210 kB/s
[11:40:15] + Received work.
[11:40:15] Trying to send all finished work units
[11:40:15] Project: 8102 (Run 0, Clone 33, Gen 55)


[11:40:15] + Attempting to send results [September 30 11:40:15 UTC]
[11:40:15] - Reading file work/wuresults_05.dat from core
[11:40:15]   (Read 91636629 bytes from disk)
[11:40:15] Connecting to http://128.143.231.201:8080/
[11:40:15] - Couldn't send HTTP request to server
[11:40:15] + Could not connect to Work Server (results)
[11:40:15]     (128.143.231.201:8080)
[11:40:15] + Retrying using alternative port
[11:40:15] Connecting to http://128.143.231.201:80/
[11:40:16] - Couldn't send HTTP request to server
[11:40:16] + Could not connect to Work Server (results)
[11:40:16]     (128.143.231.201:80)
[11:40:16] - Error: Could not transmit unit 05 (completed September 30) to work server.
[11:40:16] - 3 failed uploads of this unit.
[11:40:16]   Keeping unit 05 in queue.
[11:40:16] + Sent 0 of 1 completed units to the server
[11:40:16] + Closed connections
[11:40:16] 
bollix47
Posts: 2953
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Bigadv Collection and or Assignment server is broken

Post by bollix47 »

Welcome to the folding support forum, Thomas R.

Your problem is somewhat different than what is being discussed in this thread but I can't say that it's not somehow related since it is the same server.

Your log shows the attempts were made around 12 hours ago. Has the situation changed since then or is that work unit still trying to upload?
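To make the difference concrete, here is a rough sketch that sorts a v6 send attempt into one of three buckets based only on the messages that appear in the logs in this thread. The labels `accepted`, `rejected`, and `unreachable` are my own names, not anything the client reports.

```python
def classify_send(lines):
    """Classify one v6-style send attempt by the messages it logged.

    The returned labels are illustrative, not client terminology.
    """
    text = "\n".join(lines)
    if "Results successfully sent" in text:
        return "accepted"
    if "Server reports problem with unit" in text:
        # The upload finished, but the server refused the result.
        return "rejected"
    if ("Could not connect to Work Server" in text
            or "Couldn't send HTTP request" in text):
        # Nothing was uploaded; the client keeps the unit in its queue.
        return "unreachable"
    return "unknown"

thomas_r = [
    "[11:37:58] Connecting to http://128.143.231.201:8080/",
    "[11:37:58] - Couldn't send HTTP request to server",
    "[11:37:58] + Could not connect to Work Server (results)",
]
decali = [
    "[17:57:29] + Attempting to send results [September 30 17:57:29 UTC]",
    "[18:03:00] - Server reports problem with unit.",
]
print(classify_send(thomas_r))
print(classify_send(decali))
```

In the "unreachable" case the log shows "Keeping unit 05 in queue", so the result can still be resent later. The rejected units earlier in the thread were not kept.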
kromberg
Posts: 35
Joined: Sat Nov 07, 2009 4:36 pm

Re: Bigadv Collection and or Assignment server is broken

Post by kromberg »

Grandpa_01 wrote: I would say it is resolved for those with faster machines,
I would say not. I'm missing 4 WUs. The "system", for what it is, is accepting newly completed WUs. WUs completed over the last 24 hours are SOL, it looks like. +1 for Stanford ......
xposer
Posts: 10
Joined: Sun Nov 01, 2009 11:28 pm

Another: Server did not like results, dumping...?

Post by xposer »

Anyone know why after 1 day and 19 hours folding the server is refusing the work unit?

Code:

******************************** Date: 30/09/12 ********************************
06:55:59:WU00:FS00:0xa5:Completed 250000 out of 250000 steps  (100%)
06:56:00:WU01:FS00:Connecting to assign3.stanford.edu:8080
06:56:01:WU01:FS00:News: Welcome to Folding@Home
06:56:01:WU01:FS00:Assigned to work server 128.143.231.201
06:56:01:WU01:FS00:Requesting new work unit for slot 00: RUNNING smp:32 from 128.143.231.201
06:56:01:WU01:FS00:Connecting to 128.143.231.201:8080
06:56:07:WU01:FS00:Downloading 28.91MiB
06:56:13:WU01:FS00:Download 26.81%
06:56:14:WU00:FS00:0xa5:DynamicWrapper: Finished Work Unit: sleep=10000
06:56:19:WU01:FS00:Download 57.72%
06:56:24:WU00:FS00:0xa5:
06:56:24:WU00:FS00:0xa5:Finished Work Unit:
06:56:24:WU00:FS00:0xa5:- Reading up to 64340496 from "00/wudata_01.trr": Read 64340496
06:56:25:WU01:FS00:Download 88.86%
06:56:25:WU00:FS00:0xa5:trr file hash check passed.
06:56:25:WU00:FS00:0xa5:- Reading up to 31616144 from "00/wudata_01.xtc": Read 31616144
06:56:25:WU00:FS00:0xa5:xtc file hash check passed.
06:56:25:WU00:FS00:0xa5:edr file hash check passed.
06:56:25:WU00:FS00:0xa5:logfile size: 192330
06:56:25:WU00:FS00:0xa5:Leaving Run
06:56:25:WU00:FS00:0xa5:- Writing 96309846 bytes of core data to disk...
06:56:27:WU01:FS00:Download complete
06:56:27:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:OK project:8101 run:20 clone:3 gen:50 core:0xa5 unit:0x0000004e088988e14f999342788663a9
06:56:48:WU00:FS00:0xa5:Done: 96309334 -> 91578018 (compressed to 5.8 percent)
06:56:48:WU00:FS00:0xa5:  ... Done.
07:07:44:WU00:FS00:0xa5:- Shutting down core
07:07:44:WU00:FS00:0xa5:
07:07:44:WU00:FS00:0xa5:Folding@home Core Shutdown: FINISHED_UNIT
07:09:02:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
07:09:02:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:8101 run:12 clone:0 gen:115 core:0xa5 unit:0x000000d8088988e14f31ae4470b3b366
07:09:02:WU00:FS00:Uploading 87.34MiB to 128.143.231.201
07:09:02:WU00:FS00:Connecting to 128.143.231.201:8080
07:09:03:WU01:FS00:Starting
07:09:03:WU01:FS00:Running FahCore: /home/mike/fahclient_7.1.52-64bit-release/FAHCoreWrapper /home/mike/fahclient_7.1.52-64bit-release/cores/www.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 01 -suffix 01 -version 701 -lifeline 1190 -checkpoint 15 -np 32 forceasm
07:09:03:WU01:FS00:Started FahCore on PID 24272
07:09:03:Started thread 76 on PID 1190
07:09:03:WU01:FS00:Core PID:24276
07:09:03:WU01:FS00:FahCore 0xa5 started
07:09:03:WU01:FS00:0xa5:
07:09:03:WU01:FS00:0xa5:*------------------------------*
07:09:03:WU01:FS00:0xa5:Folding@Home Gromacs SMP Core
07:09:03:WU01:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
07:09:03:WU01:FS00:0xa5:
07:09:03:WU01:FS00:0xa5:Preparing to commence simulation
07:09:03:WU01:FS00:0xa5:- Looking at optimizations...
07:09:03:WU01:FS00:0xa5:- Created dyn
07:09:03:WU01:FS00:0xa5:- Files status OK
07:09:06:WU01:FS00:0xa5:- Expanded 30312921 -> 33158020 (decompressed 109.3 percent)
07:09:06:WU01:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30312921 data_size=33158020, decompressed_data_size=33158020 diff=0
07:09:06:WU01:FS00:0xa5:- Digital signature verified
07:09:06:WU01:FS00:0xa5:
07:09:06:WU01:FS00:0xa5:Project: 8101 (Run 20, Clone 3, Gen 50)
07:09:06:WU01:FS00:0xa5:
07:09:06:WU01:FS00:0xa5:Assembly optimizations on if available.
07:09:06:WU01:FS00:0xa5:Entering M.D.
07:09:08:WU00:FS00:Upload 1.00%
07:09:13:WU01:FS00:0xa5:Mapping NT from 32 to 32 
07:09:14:WU00:FS00:Upload 1.86%
07:09:18:WU01:FS00:0xa5:Completed 0 out of 250000 steps  (0%)
07:09:20:WU00:FS00:Upload 2.72%
07:09:26:WU00:FS00:Upload 3.36%
07:09:32:WU00:FS00:Upload 4.37%
07:09:39:WU00:FS00:Upload 5.15%
07:09:45:WU00:FS00:Upload 6.01%
07:09:51:WU00:FS00:Upload 6.87%
07:09:57:WU00:FS00:Upload 7.73%
07:10:03:WU00:FS00:Upload 8.59%
07:10:10:WU00:FS00:Upload 9.45%
07:10:16:WU00:FS00:Upload 10.09%
07:10:22:WU00:FS00:Upload 11.09%
07:10:28:WU00:FS00:Upload 11.95%
07:10:35:WU00:FS00:Upload 12.81%
07:10:41:WU00:FS00:Upload 13.67%
07:10:48:WU00:FS00:Upload 14.60%
07:10:54:WU00:FS00:Upload 15.46%
07:11:00:WU00:FS00:Upload 16.10%
07:11:06:WU00:FS00:Upload 17.03%
07:11:12:WU00:FS00:Upload 17.89%
07:11:19:WU00:FS00:Upload 18.82%
07:11:26:WU00:FS00:Upload 19.75%
07:11:32:WU00:FS00:Upload 20.61%
07:11:39:WU00:FS00:Upload 21.54%
07:11:45:WU00:FS00:Upload 22.33%
07:11:51:WU00:FS00:Upload 23.26%
07:11:57:WU00:FS00:Upload 23.97%
07:12:04:WU00:FS00:Upload 24.76%
07:12:11:WU00:FS00:Upload 25.76%
07:12:17:WU00:FS00:Upload 26.76%
07:12:24:WU00:FS00:Upload 27.62%
07:12:32:WU00:FS00:Upload 28.70%
07:12:39:WU00:FS00:Upload 29.56%
07:12:45:WU00:FS00:Upload 30.49%
07:12:51:WU00:FS00:Upload 31.34%
07:12:57:WU00:FS00:Upload 31.92%
07:13:03:WU00:FS00:Upload 32.78%
07:13:10:WU00:FS00:Upload 33.85%
07:13:17:WU00:FS00:Upload 34.78%
07:13:23:WU00:FS00:Upload 35.64%
07:13:29:WU00:FS00:Upload 36.50%
07:13:35:WU00:FS00:Upload 37.07%
07:13:41:WU00:FS00:Upload 38.07%
07:13:48:WU00:FS00:Upload 38.93%
07:13:54:WU00:FS00:Upload 39.65%
07:14:01:WU00:FS00:Upload 40.86%
07:14:07:WU00:FS00:Upload 41.43%
07:14:14:WU00:FS00:Upload 42.44%
07:14:20:WU00:FS00:Upload 43.44%
07:14:26:WU00:FS00:Upload 44.08%
07:14:34:WU00:FS00:Upload 45.16%
07:14:41:WU00:FS00:Upload 46.30%
07:14:47:WU00:FS00:Upload 47.02%
07:14:54:WU00:FS00:Upload 47.73%
07:15:02:WU00:FS00:Upload 49.09%
07:15:08:WU00:FS00:Upload 49.81%
07:15:15:WU00:FS00:Upload 50.88%
07:15:22:WU00:FS00:Upload 51.88%
07:15:28:WU00:FS00:Upload 52.60%
07:15:36:WU00:FS00:Upload 53.67%
07:15:43:WU00:FS00:Upload 54.75%
07:15:50:WU00:FS00:Upload 55.46%
07:15:57:WU00:FS00:Upload 56.32%
07:16:03:WU00:FS00:Upload 57.39%
07:16:12:WU00:FS00:Upload 58.47%
07:16:20:WU00:FS00:Upload 59.61%
07:16:27:WU00:FS00:Upload 60.76%
07:16:34:WU00:FS00:Upload 61.47%
07:16:42:WU00:FS00:Upload 62.69%
07:16:50:WU00:FS00:Upload 63.76%
07:16:57:WU00:FS00:Upload 64.48%
07:17:06:WU00:FS00:Upload 65.84%
07:17:14:WU00:FS00:Upload 67.13%
07:17:20:WU00:FS00:Upload 67.56%
07:17:29:WU00:FS00:Upload 68.84%
07:17:37:WU00:FS00:Upload 70.27%
07:17:43:WU00:FS00:Upload 70.70%
07:17:53:WU00:FS00:Upload 71.99%
07:17:59:WU00:FS00:Upload 73.28%
07:18:08:WU00:FS00:Upload 74.14%
07:18:17:WU00:FS00:Upload 75.43%
07:18:26:WU00:FS00:Upload 76.79%
07:18:33:WU00:FS00:Upload 77.65%
07:18:42:WU00:FS00:Upload 78.93%
07:18:48:WU00:FS00:Upload 79.79%
07:18:54:WU00:FS00:Upload 80.65%
07:19:01:WU00:FS00:Upload 81.58%
07:19:07:WU00:FS00:Upload 82.01%
07:19:14:WU00:FS00:Upload 83.37%
07:19:23:WU00:FS00:Upload 84.23%
07:19:33:WU00:FS00:Upload 85.73%
07:19:39:WU00:FS00:Upload 86.88%
07:19:50:WU00:FS00:Upload 87.88%
07:20:00:WU00:FS00:Upload 89.31%
07:20:06:WU00:FS00:Upload 90.46%
07:20:16:WU00:FS00:Upload 91.46%
07:20:27:WU00:FS00:Upload 92.96%
07:20:33:WU00:FS00:Upload 94.10%
07:20:43:WU00:FS00:Upload 95.11%
07:20:54:WU00:FS00:Upload 96.61%
07:21:02:WU00:FS00:Upload 98.11%
07:21:14:WU00:FS00:Upload 99.11%
07:21:23:WU00:FS00:Upload complete
07:21:23:WU00:FS00:Server responded WORK_QUIT (404)
07:21:23:WARNING:WU00:FS00:Server did not like results, dumping
07:21:23:WU00:FS00:Cleaning up
07:35:02:WU01:FS00:0xa5:Completed 2500 out of 250000 steps  (1%)
08:00:49:WU01:FS00:0xa5:Completed 5000 out of 250000 steps  (2%)
(1)X58a-UD3R . I7-970 (6/12core) @ 3.9 ghz (Windows 7) 12 gigs Mushkin 1333(V7)
(2)KGPE-D16 . 2 x 6274(32cores) @ 2.2 ghz (ubuntu 12.04.1) 16 gigs Kingston 1333(6.34)
(3)KGPE-D16 . 2 x 6274(32cores) @ 2.2 ghz (ubuntu 12.04.1) 32 gigs G Skill 1333(V7)
Joe_H
Site Admin
Posts: 7900
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Bigadv Collection and or Assignment server is broken

Post by Joe_H »

xposer wrote: Anyone know why after 1 day and 19 hours folding the server is refusing the work unit?
Merged your topic with the rest experiencing the same problem this weekend.
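Because the v7 client interleaves lines for the unit being uploaded (WU00) with lines for the newly downloaded one (WU01), it is easy to misread which unit was dumped. A small sketch, assuming the v7 line format shown in xposer's log and using a made-up helper name `server_responses`, that ties each server response back to the unit that was being sent:

```python
import re

# "Sending unit results" lines carry the project/run/clone/gen of the upload.
UNIT_RE = re.compile(
    r"WU(\d+):FS\d+:Sending unit results:.*?"
    r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)"
)
# e.g. "07:21:23:WU00:FS00:Server responded WORK_QUIT (404)"
RESP_RE = re.compile(r"WU(\d+):FS\d+:Server responded (\w+) \((\d+)\)")

def server_responses(lines):
    """Pair each 'Server responded' line with the unit sent from that WU slot."""
    units = {}
    results = []
    for line in lines:
        m = UNIT_RE.search(line)
        if m:
            units[m.group(1)] = tuple(int(g) for g in m.groups()[1:])
            continue
        m = RESP_RE.search(line)
        if m:
            results.append((units.get(m.group(1)), m.group(2), int(m.group(3))))
    return results

sample = [
    "06:56:27:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:OK project:8101 run:20 clone:3 gen:50 core:0xa5 unit:0x0000004e088988e14f999342788663a9",
    "07:09:02:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:8101 run:12 clone:0 gen:115 core:0xa5 unit:0x000000d8088988e14f31ae4470b3b366",
    "07:21:23:WU00:FS00:Server responded WORK_QUIT (404)",
]
for prcg, status, code in server_responses(sample):
    print(prcg, status, code)
```

For the log above this points at 8101 (Run 12, Clone 0, Gen 115) as the rejected unit, not the Run 20 unit that started folding at 07:09.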
PinHead
Posts: 285
Joined: Tue Jan 24, 2012 3:43 am
Hardware configuration: Quad Q9550 2.83 contains the GPU 57xx - running SMP and GPU
Quad Q6700 2.66 running just SMP
2P 32core Interlagos SMP on linux

Re: Bigadv Collection and or Assignment server is broken

Post by PinHead »

I had 2 8101's fail last night after folding to 100% and a clean core exit.

Here is the download / upload info for WU 1 along with the log where it completed today and successfully uploaded.

Code:

[01:31:51] Folding@Home Gromacs SMP Core
[01:31:51] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[01:31:51] 
[01:31:51] Preparing to commence simulation
[01:31:51] - Looking at optimizations...
[01:31:51] - Created dyn
[01:31:51] - Files status OK
[01:31:55] - Expanded 30304961 -> 33158020 (decompressed 109.4 percent)
[01:31:55] Called DecompressByteArray: compressed_data_size=30304961 data_size=33158020, decompressed_data_size=33158020 diff=0
[01:31:55] - Digital signature verified
[01:31:55] 
[01:31:55] Project: 8101 (Run 7, Clone 1, Gen 126)
[01:31:55] 
[01:31:55] Assembly optimizations on if available.
[01:31:55] Entering M.D.
[01:32:02] Mapping NT from 64 to 64 
[01:32:08] Completed 0 out of 250000 steps  (0%)
[01:45:11] Completed 2500 out of 250000 steps  (1%)
[01:58:15] Completed 5000 out of 250000 steps  (2%)
[02:11:19] Completed 7500 out of 250000 steps  (3%)
[02:24:22] Completed 10000 out of 250000 steps  (4%)
[02:37:25] Completed 12500 out of 250000 steps  (5%)
[02:50:28] Completed 15000 out of 250000 steps  (6%)
[03:03:31] Completed 17500 out of 250000 steps  (7%)
[03:16:31] Completed 20000 out of 250000 steps  (8%)
[03:26:39] - Autosending finished units... [September 29 03:26:39 UTC]
[03:26:39] Trying to send all finished work units
[03:26:39] + No unsent completed units remaining.
[03:26:39] - Autosend completed
[03:29:35] Completed 22500 out of 250000 steps  (9%)
.
.
.
[22:38:15] Completed 242500 out of 250000 steps  (97%)
[22:51:20] Completed 245000 out of 250000 steps  (98%)
[23:04:26] Completed 247500 out of 250000 steps  (99%)
[23:17:32] Completed 250000 out of 250000 steps  (100%)
[23:17:48] DynamicWrapper: Finished Work Unit: sleep=10000
[23:17:58] 
[23:17:58] Finished Work Unit:
[23:17:58] - Reading up to 64340496 from "work/wudata_05.trr": Read 64340496
[23:17:59] trr file hash check passed.
[23:17:59] - Reading up to 31620412 from "work/wudata_05.xtc": Read 31620412
[23:17:59] xtc file hash check passed.
[23:17:59] edr file hash check passed.
[23:17:59] logfile size: 186886
[23:17:59] Leaving Run
[23:18:01] - Writing 96308670 bytes of core data to disk...
[23:18:30] Done: 96308158 -> 91558907 (compressed to 5.8 percent)
[23:18:31]   ... Done.
[23:18:45] - Shutting down core
[23:18:45] 
[23:18:45] Folding@home Core Shutdown: FINISHED_UNIT
[23:18:47] CoreStatus = 64 (100)
[23:18:47] Unit 5 finished with 77 percent of time to deadline remaining.
[23:18:47] Updated performance fraction: 0.779235
[23:18:47] Sending work to server
[23:18:47] Project: 8101 (Run 7, Clone 1, Gen 126)


[23:18:47] + Attempting to send results [September 29 23:18:47 UTC]
[23:18:47] - Reading file work/wuresults_05.dat from core
[23:18:47]   (Read 91559419 bytes from disk)
[23:18:47] Connecting to http://128.143.231.201:8080/
[23:56:49] Posted data.
[23:56:49] Initial: 0000; - Uploaded at ~39 kB/s
[23:56:49] - Averaged speed for that direction ~38 kB/s
[23:56:49] - Server reports problem with unit.
[23:56:49] Trying to send all finished work units
[23:56:49] + No unsent completed units remaining.
[23:56:49] - Preparing to get new work unit...
[23:56:49] Cleaning up work directory
[23:56:50] + Attempting to get work packet



8101 ( 7,1,126 ) reprocessed and successfully sent today.


[21:03:40] Completed 245000 out of 250000 steps  (98%)
[21:16:35] Completed 247500 out of 250000 steps  (99%)
[21:26:39] - Autosending finished units... [September 30 21:26:39 UTC]
[21:26:39] Trying to send all finished work units
[21:26:39] + No unsent completed units remaining.
[21:26:39] - Autosend completed
[21:29:29] Completed 250000 out of 250000 steps  (100%)
[21:29:46] DynamicWrapper: Finished Work Unit: sleep=10000
[21:29:56] 
[21:29:56] Finished Work Unit:
[21:29:56] - Reading up to 64340496 from "work/wudata_06.trr": Read 64340496
[21:29:57] trr file hash check passed.
[21:29:57] - Reading up to 31620360 from "work/wudata_06.xtc": Read 31620360
[21:29:57] xtc file hash check passed.
[21:29:57] edr file hash check passed.
[21:29:57] logfile size: 186822
[21:29:57] Leaving Run
[21:30:01] - Writing 96308554 bytes of core data to disk...
[21:30:30] Done: 96308042 -> 91559986 (compressed to 5.8 percent)
[21:30:31]   ... Done.
[21:30:46] - Shutting down core
[21:30:46] 
[21:30:46] Folding@home Core Shutdown: FINISHED_UNIT
[21:30:47] CoreStatus = 64 (100)
[21:30:47] Unit 6 finished with 78 percent of time to deadline remaining.
[21:30:47] Updated performance fraction: 0.778510
[21:30:47] Sending work to server
[21:30:47] Project: 8101 (Run 7, Clone 1, Gen 126)


[21:30:47] + Attempting to send results [September 30 21:30:47 UTC]
[21:30:47] - Reading file work/wuresults_06.dat from core
[21:30:47]   (Read 91560498 bytes from disk)
[21:30:47] Connecting to http://128.143.231.201:8080/
[22:08:55] Posted data.
[22:08:55] Initial: 0000; - Uploaded at ~39 kB/s
[22:08:55] - Averaged speed for that direction ~38 kB/s
[22:08:55] + Results successfully sent
[22:08:55] Thank you for your contribution to Folding@Home.
[22:08:55] + Number of Units Completed: 63
2nd 8101 details shortly


8101 ( 1,5,74 ) - This one will no longer finish in the requested time frame.

Code:

[05:03:12] + Processing work unit
[05:03:12] Core required: FahCore_a5.exe
[05:03:12] Core found.
[05:03:12] Working on queue slot 05 [September 28 05:03:12 UTC]
[05:03:12] + Working ...
[05:03:12] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 05 -np 24 -priority 96 -checkpoint 10 -verbose -lifeline 1786 -version 634'

[05:03:12] 
[05:03:12] *------------------------------*
[05:03:12] Folding@Home Gromacs SMP Core
[05:03:12] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[05:03:12] 
[05:03:12] Preparing to commence simulation
[05:03:12] - Looking at optimizations...
[05:03:12] - Created dyn
[05:03:12] - Files status OK
[05:03:15] - Expanded 30309163 -> 33158020 (decompressed 109.3 percent)
[05:03:15] Called DecompressByteArray: compressed_data_size=30309163 data_size=33158020, decompressed_data_size=33158020 diff=0
[05:03:16] - Digital signature verified
[05:03:16] 
[05:03:16] Project: 8101 (Run 1, Clone 5, Gen 74)
[05:03:16] 
[05:03:16] Assembly optimizations on if available.
[05:03:16] Entering M.D.
[05:03:24] Mapping NT from 24 to 24 
[05:03:28] Completed 0 out of 250000 steps  (0%)
[05:29:58] Completed 2500 out of 250000 steps  (1%)
[05:56:29] Completed 5000 out of 250000 steps  (2%)
[06:22:55] Completed 7500 out of 250000 steps  (3%)
[06:49:25] Completed 10000 out of 250000 steps  (4%)
.
.
.
[20:19:54] Completed 222500 out of 250000 steps  (89%)
[20:46:19] Completed 225000 out of 250000 steps  (90%)
[21:12:43] Completed 227500 out of 250000 steps  (91%)
[21:21:51] - Autosending finished units... [September 29 21:21:51 UTC]
[21:21:51] Trying to send all finished work units
[21:21:51] + No unsent completed units remaining.
[21:21:51] - Autosend completed
[21:39:10] Completed 230000 out of 250000 steps  (92%)
[22:05:37] Completed 232500 out of 250000 steps  (93%)
[22:32:01] Completed 235000 out of 250000 steps  (94%)
[22:58:28] Completed 237500 out of 250000 steps  (95%)
[23:24:55] Completed 240000 out of 250000 steps  (96%)
[23:51:17] Completed 242500 out of 250000 steps  (97%)
[00:17:46] Completed 245000 out of 250000 steps  (98%)
[00:44:16] Completed 247500 out of 250000 steps  (99%)
[01:10:42] Completed 250000 out of 250000 steps  (100%)
[01:10:55] DynamicWrapper: Finished Work Unit: sleep=10000
[01:11:05] 
[01:11:05] Finished Work Unit:
[01:11:05] - Reading up to 64340496 from "work/wudata_05.trr": Read 64340496
[01:11:05] trr file hash check passed.
[01:11:05] - Reading up to 31616188 from "work/wudata_05.xtc": Read 31616188
[01:11:06] xtc file hash check passed.
[01:11:06] edr file hash check passed.
[01:11:06] logfile size: 198036
[01:11:06] Leaving Run
[01:11:09] - Writing 96315596 bytes of core data to disk...
[01:11:38] Done: 96315084 -> 91566638 (compressed to 5.8 percent)
[01:11:38]   ... Done.
[01:11:51] - Shutting down core
[01:11:51] 
[01:11:51] Folding@home Core Shutdown: FINISHED_UNIT
[01:11:53] CoreStatus = 64 (100)
[01:11:53] Unit 5 finished with 54 percent of time to deadline remaining.
[01:11:53] Updated performance fraction: 0.612862
[01:11:53] Sending work to server
[01:11:53] Project: 8101 (Run 1, Clone 5, Gen 74)


[01:11:53] + Attempting to send results [September 30 01:11:53 UTC]
[01:11:53] - Reading file work/wuresults_05.dat from core
[01:11:53]   (Read 91567150 bytes from disk)
[01:11:53] Connecting to http://128.143.231.201:8080/
[01:50:13] Posted data.
[01:50:13] Initial: 0000; - Uploaded at ~38 kB/s
[01:50:13] - Averaged speed for that direction ~36 kB/s
[01:50:13] - Server reports problem with unit.
[01:50:13] Trying to send all finished work units
[01:50:13] + No unsent completed units remaining.
[01:50:13] - Preparing to get new work unit...
[01:50:13] Cleaning up work directory


.
.
.

[01:50:13] - Connecting to assignment server
[01:50:13] Connecting to http://assign.stanford.edu:8080/
[01:50:14] Posted data.
[01:50:14] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[01:50:14] + News From Folding@Home: Welcome to Folding@Home
[01:50:14] Loaded queue successfully.
[01:50:14] Sent data
[01:50:14] Connecting to http://128.143.231.201:8080/
[01:50:20] Posted data.
[01:50:20] Initial: 0000; - Receiving payload (expected size: 30309675)
[01:51:53] - Downloaded at ~318 kB/s
[01:51:53] - Averaged speed for that direction ~346 kB/s
[01:51:53] + Received work.
[01:51:53] Trying to send all finished work units
[01:51:53] + No unsent completed units remaining.
[01:51:53] + Closed connections
[01:51:53] 
[01:51:53] + Processing work unit
[01:51:53] Core required: FahCore_a5.exe
[01:51:53] Core found.
[01:51:53] Working on queue slot 06 [September 30 01:51:53 UTC]
[01:51:53] + Working ...
[01:51:53] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 06 -np 24 -priority 96 -checkpoint 10 -verbose -lifeline 1786 -version 634'

[01:51:54] 
[01:51:54] *------------------------------*
[01:51:54] Folding@Home Gromacs SMP Core
[01:51:54] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[01:51:54] 
[01:51:54] Preparing to commence simulation
[01:51:54] - Looking at optimizations...
[01:51:54] - Created dyn
[01:51:54] - Files status OK
[01:51:57] - Expanded 30309163 -> 33158020 (decompressed 109.3 percent)
[01:51:57] Called DecompressByteArray: compressed_data_size=30309163 data_size=33158020, decompressed_data_size=33158020 diff=0
[01:51:58] - Digital signature verified
[01:51:58] 
[01:51:58] Project: 8101 (Run 1, Clone 5, Gen 74)
[01:51:58] 
[01:51:58] Assembly optimizations on if available.
[01:51:58] Entering M.D.
[01:52:05] Mapping NT from 24 to 24 
[01:52:10] Completed 0 out of 250000 steps  (0%)
[02:18:49] Completed 2500 out of 250000 steps  (1%)
[02:45:25] Completed 5000 out of 250000 steps  (2%)



Last edited by PinHead on Mon Oct 01, 2012 12:18 am, edited 1 time in total.
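As a sanity check on the upload numbers in these logs, here is a short worked calculation. It takes the "Connecting to" and "Posted data." timestamps as the start and end of the upload, and assumes the client's "kB" means 1024 bytes; both are assumptions on my part, and the function name is made up.

```python
from datetime import datetime, timedelta

def upload_rate_kib(start, end, size_bytes):
    """Return (elapsed seconds, rate in KiB/s) for one upload attempt."""
    fmt = "%H:%M:%S"
    elapsed = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if elapsed < timedelta(0):
        # Timestamps carry no date; assume a single midnight rollover.
        elapsed += timedelta(days=1)
    seconds = elapsed.total_seconds()
    return seconds, size_bytes / seconds / 1024

# First attempt for 8101 (7,1,126): 91559419 bytes, 23:18:47 to 23:56:49.
seconds, rate = upload_rate_kib("23:18:47", "23:56:49", 91559419)
print(f"{seconds:.0f} s, {rate:.1f} KiB/s")
```

That works out to about 38 minutes per attempt at a rate in line with the "~39 kB/s" the client printed, so each rejected result cost more than half an hour of upload time on top of the folding time.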
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Bigadv Collection and or Assignment server is broken

Post by bruce »

kromberg wrote:
Grandpa_01 wrote: I would say it is resolved for those with faster machines,
I would say not. I'm missing 4 WUs. The "system", for what it is, is accepting newly completed WUs. WUs completed over the last 24 hours are SOL, it looks like. +1 for Stanford ......
Let's be clear about this. I think you guys are talking about two or three different things.

a) If a WU was discarded because the "server didn't like" it, it's gone. Fixing the problem won't bring it back.
b) If a new WU uploads successfully, the problem is fixed going forward, but not going backward.

The question has also been asked (but not answered) whether "new" means newly downloaded or newly completed. Until enough data is reported with the dates and times of both download and completion/upload, there's no way to tell.
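A sketch of how that data could be pulled from a v6 log, for anyone willing to report it. It relies on the two line types that carry a full date in the logs posted here ("Working on queue slot ..." and "+ Attempting to send results ..."), and uses the queue-slot start time as a stand-in for the download time. The year 2012 and the helper names are assumptions, since the log omits the year.

```python
import re
from datetime import datetime

STAMP = r"\[(\w+ \d+ \d{2}:\d{2}:\d{2}) UTC\]"
START_RE = re.compile(r"Working on queue slot \d+ " + STAMP)
SEND_RE = re.compile(r"\+ Attempting to send results " + STAMP)
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")

def parse_stamp(text, year=2012):
    # The log omits the year; 2012 is assumed from the post dates in this thread.
    return datetime.strptime(f"{year} {text}", "%Y %B %d %H:%M:%S")

def summarize(lines):
    """Collect start time, send time, and outcome for one WU's log excerpt."""
    rec = {"prcg": None, "started": None, "sent": None, "outcome": None}
    for line in lines:
        if (m := PRCG_RE.search(line)):
            rec["prcg"] = tuple(int(g) for g in m.groups())
        elif (m := START_RE.search(line)):
            rec["started"] = parse_stamp(m.group(1))
        elif (m := SEND_RE.search(line)):
            rec["sent"] = parse_stamp(m.group(1))
        elif "Server reports problem with unit" in line:
            rec["outcome"] = "rejected"
        elif "Results successfully sent" in line:
            rec["outcome"] = "accepted"
    return rec

sample = [
    "[05:03:12] Working on queue slot 05 [September 28 05:03:12 UTC]",
    "[05:03:16] Project: 8101 (Run 1, Clone 5, Gen 74)",
    "[01:11:53] + Attempting to send results [September 30 01:11:53 UTC]",
    "[01:50:13] - Server reports problem with unit.",
]
rec = summarize(sample)
print(rec["prcg"], rec["outcome"], rec["sent"] - rec["started"])
```

A handful of these records from different rigs, some rejected and some accepted, would show whether the cutoff lines up with the start date or the send date.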
PinHead
Posts: 285
Joined: Tue Jan 24, 2012 3:43 am
Hardware configuration: Quad Q9550 2.83 contains the GPU 57xx - running SMP and GPU
Quad Q6700 2.66 running just SMP
2P 32core Interlagos SMP on linux

Re: Bigadv Collection and or Assighnment server is broken

Post by PinHead »

bruce wrote:The question has also been asked (but not answered) whether "new" means newly downloaded or newly completed. Until enough reports include the dates and times of both the download and the completion/upload, there's no way to tell.
I can't answer that question until around this time tomorrow. BA units take a while, but I'm folding a new one now. The 2nd rig will be about 3 days from now, as it's still rechewing the error WU.

Grandpa will be able to answer that question before I will!

But so far I have had success with newly recompleted WU.
Sneevly
Posts: 6
Joined: Mon Oct 01, 2012 12:51 am

Re: Bigadv Collection and or Assignment server is broken

Post by Sneevly »

I had two units get rejected, but this one in particular showed a major load imbalance, the first I've ever seen out of this computer. Frame times were ~16 minutes instead of ~13. Now it's doing just fine with a new 8101. No idea if this has anything to do with the failing units, just thought I'd mention it.

[21:44:16] Completed 250000 out of 250000 steps  (100%)

Writing final coordinates.

 Average load imbalance: 22.9 %
 Part of the total run time spent waiting due to load imbalance: 4.1 %


NOTE: 7 % of the run time was spent communicating energies,
      you might want to use the -gcom option of mdrun


    Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:  93897.514  93897.514    100.0
                       1d02h04:57
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:   1157.131     59.946      0.920     26.083

Thanx for Using GROMACS - Have a Nice Day

[21:50:28] DynamicWrapper: Finished Work Unit: sleep=10000
[21:50:38] 
[21:50:38] Finished Work Unit:
[21:50:38] - Reading up to 64340496 from "work/wudata_08.trr": Read 64340496
[21:57:56] trr file hash check passed.
[22:00:39] - Reading up to 31675528 from "work/wudata_08.xtc": Read 31675528
[22:01:11] xtc file hash check passed.
[22:01:11] edr file hash check passed.
[22:01:11] logfile size: 188004
[22:01:11] Leaving Run
[22:01:16] - Writing 96364904 bytes of core data to disk...
[22:23:20] Done: 96364392 -> 91638975 (compressed to 5.9 percent)
[22:31:33]   ... Done.
[22:33:46] - Shutting down core
[22:33:46] 
[22:33:46] Folding@home Core Shutdown: FINISHED_UNIT
[22:33:48] CoreStatus = 64 (100)
[22:33:48] Sending work to server
[22:33:48] Project: 8101 (Run 0, Clone 7, Gen 82)


[22:33:48] + Attempting to send results [September 30 22:33:48 UTC]
[22:36:25] - Server reports problem with unit.
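Frame times like the ones mentioned above can be read off the "Completed ... steps" lines of a log. A minimal Python sketch, assuming the timestamp and line format shown in the logs in this thread; `frame_times` and `SAMPLE` are made-up names for illustration, and the sample lines are copied from a log posted earlier in this thread. Since the log timestamps carry no date, a negative gap is assumed to be a single midnight rollover.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as "[02:18:49] Completed 2500 out of 250000 steps  (1%)".
STEP_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Completed (\d+) out of (\d+) steps")

def frame_times(log_text):
    """Return the time gaps between consecutive 'Completed' lines."""
    stamps = []
    for line in log_text.splitlines():
        m = STEP_RE.match(line.strip())
        if m:
            stamps.append(datetime.strptime(m.group(1), "%H:%M:%S"))
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = cur - prev
        if delta < timedelta(0):
            # Assumption: the clock wrapped past midnight exactly once.
            delta += timedelta(days=1)
        gaps.append(delta)
    return gaps

# Sample lines copied from a log posted earlier in this thread.
SAMPLE = """\
[01:52:10] Completed 0 out of 250000 steps  (0%)
[02:18:49] Completed 2500 out of 250000 steps  (1%)
[02:45:25] Completed 5000 out of 250000 steps  (2%)
"""

if __name__ == "__main__":
    for gap in frame_times(SAMPLE):
        print(gap)
```

Comparing the gaps from a rejected unit against those from a unit that uploaded cleanly on the same machine would show whether the slowdown is specific to the failing WUs.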
Folding Rigs - i7 970@4.0Ghz 2xGTX570@900/2000 GTX580 | 4p 4x6172@2.1Ghz | 4p 4x6180@2.5Ghz
tear
Posts: 254
Joined: Sun Dec 02, 2007 4:08 am
Hardware configuration: None
Location: Rocky Mountains

Re: Bigadv Collection and or Assignment server is broken

Post by tear »

Got one machine here that attempted to return a WU _after_ the server has been observed to accept re-issued WUs.
This suggests that all BA units "in flight" have effectively been discarded (== server lost context* of all BA units out there).

*) context required to accept results
kasson wrote:Yes--I see something weird going on. Nothing has changed with the work server, but I think some of the people at Stanford may have changed the assignment server without telling me. I'm investigating.
Peter, can you please let us know what the problem and resolution were, and what you guys are planning to do to avoid such situations in the future?

I'm sure many donors will appreciate your time to answer these questions.
One man's ceiling is another man's floor.