130.237.232.237
Posted: Fri Feb 25, 2011 2:34 am
by MIBW
Hi Guys,
Yesterday I updated the client and got three new 6901s. All have failed to upload since finishing ~7 hours ago. Musky from [H] is also reporting problems.
It's hard to tell what's wrong; there are lots of possible causes here. Is it Langouste, the work server, or a bug in the new core corrupting file data locally?
The units finish, Langouste forks them off, and the client moves on to the next one. The uploads then fail. I went back and tried to redo the uploads with FAH.exe -local -send all (this works when a unit has failed to send after a -oneunit shutdown), but this fails too.
Code:
[23:36:12] + Attempting to send results [February 24 23:36:12 UTC]
[23:51:15] - Couldn't send HTTP request to server
[23:51:15] + Could not connect to Work Server (results)
[23:51:15] (130.237.232.237:8080)
[23:51:15] + Retrying using alternative port
[00:06:16] - Couldn't send HTTP request to server
[00:06:16] + Could not connect to Work Server (results)
[00:06:16] (130.237.232.237:80)
[00:06:16] - Error: Could not transmit unit 04 (completed February 24) to work server.
[00:06:16] - Read packet limit of 540015616... Set to 524286976.
[00:06:16] + Attempting to send results [February 25 00:06:16 UTC]
[00:06:17] - Couldn't send HTTP request to server
[00:06:17] + Could not connect to Work Server (results)
[00:06:17] (130.237.165.141:8080)
[00:06:17] + Retrying using alternative port
[00:06:18] - Couldn't send HTTP request to server
[00:06:18] + Could not connect to Work Server (results)
[00:06:18] (130.237.165.141:80)
[00:06:18] Could not transmit unit 04 to Collection server; keeping in queue.
[00:06:18] - Failed to send all units to server
Folding@Home Client Shutdown.
On one machine, trying again just makes FAH exit immediately; the log shows that it can't read some of the files.
Code:
[23:14:34] Loaded queue successfully.
[23:14:34] Attempting to return result(s) to server...
[23:14:34] Project: 6900 (Run 17, Clone 10, Gen 25)
[23:14:34] - Read packet limit of 540015616... Set to 524286976.
[23:14:34] - Error: Could not get length of results file work/wuresults_05.dat
[23:14:34] - Error: Could not read unit 05 file. Removing from queue.
[23:14:34] - Failed to send all units to server
Folding@Home Client Shutdown.
EDIT: Reconfigured to eliminate Langouste; still failing.
Code:
# Windows CPU Console Edition #################################################
###############################################################################
Folding@Home Client Version 6.34
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\Users\dave\AppData\Local\Temp\langouste-dave\6729\clientdir
Executable: C:\Apps\FAH GPU Tracker V2\FAH.exe
Arguments: -local -send all
[02:53:38] - Ask before connecting: No
[02:53:38] - User name: DigitalFX (Team 33)
[02:53:38] - User ID: 1976C4B84F42DFEC
[02:53:38] - Machine ID: 3
[02:53:38]
[02:53:38] Loaded queue successfully.
[02:53:38] Attempting to return result(s) to server...
[02:53:38] Project: 6901 (Run 15, Clone 1, Gen 0)
[02:53:38] - Read packet limit of 540015616... Set to 524286976.
[02:53:38] + Attempting to send results [February 25 02:53:38 UTC]
[03:08:41] - Couldn't send HTTP request to server
[03:08:41] + Could not connect to Work Server (results)
[03:08:41] (130.237.232.237:8080)
[03:08:41] + Retrying using alternative port
[03:23:42] - Couldn't send HTTP request to server
[03:23:42] + Could not connect to Work Server (results)
[03:23:42] (130.237.232.237:80)
[03:23:42] - Error: Could not transmit unit 05 (completed February 24) to work server.
[03:23:42] - Read packet limit of 540015616... Set to 524286976.
[03:23:42] + Attempting to send results [February 25 03:23:42 UTC]
[03:23:43] - Couldn't send HTTP request to server
[03:23:43] + Could not connect to Work Server (results)
[03:23:43] (130.237.165.141:8080)
[03:23:43] + Retrying using alternative port
[03:23:44] - Couldn't send HTTP request to server
[03:23:44] + Could not connect to Work Server (results)
[03:23:44] (130.237.165.141:80)
[03:23:44] Could not transmit unit 05 to Collection server; keeping in queue.
[03:23:44] - Failed to send all units to server
Folding@Home Client Shutdown.
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 4:42 am
by bruce
I moved the topic from "Problems with a specific WU" to "issues with a specific server". Clearly you're not reporting an EUE problem with a single WU but a more general problem with multiple WUs associated with a single server.
Thank you for the report.
The topic "NOTICE: Before you post here, please read this" (above) is applicable. I was not able to get the "OK" message from this server so I will forward your request to those who own this server.
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 5:36 am
by MIBW
Sorry, I didn't know which forum to use. I posted it as a work unit problem because it seems to be having trouble with multiple servers/ports:
130.237.232.237:8080
130.237.232.237:80
130.237.165.141:8080
130.237.165.141:80
SR2#1:
Code:
[20:33:14] Completed 250000 out of 250000 steps (100%)
[20:33:25] DynamicWrapper: Finished Work Unit: sleep=10000
[20:33:35]
[20:33:35] Finished Work Unit:
[20:33:35] - Reading up to 52713120 from "work/wudata_05.trr": Read 52713120
[20:33:35] trr file hash check passed.
[20:33:35] - Reading up to 47065544 from "work/wudata_05.xtc": Read 47065544
[20:33:36] xtc file hash check passed.
[20:33:36] edr file hash check passed.
[20:33:36] logfile size: 218615
[20:33:36] Leaving Run
[20:33:37] - Writing 100165219 bytes of core data to disk...
[20:34:25] Done: 100164707 -> 96434316 (compressed to 10.5 percent)
[20:34:26] ... Done.
[20:34:41] - Shutting down core
[20:34:41]
[20:34:41] Folding@home Core Shutdown: FINISHED_UNIT
[20:34:45] CoreStatus = 64 (100)
[20:34:45] Unit 5 finished with 86 percent of time to deadline remaining.
[20:34:45] Updated performance fraction: 0.861236
[20:34:45] Sending work to server
[20:34:45] Project: 6901 (Run 16, Clone 1, Gen 0)
[20:34:45] + Attempting to send results [February 24 20:34:45 UTC]
[20:34:45] - Reading file work/wuresults_05.dat from core
[20:34:45] (Read 96434828 bytes from disk)
[20:34:45] Connecting to http://130.237.232.237:8080/
[20:34:45] - Couldn't send HTTP request to server
[20:34:45] + Could not connect to Work Server (results)
[20:34:45] (130.237.232.237:8080)
[20:34:45] + Retrying using alternative port
[20:34:45] Connecting to http://130.237.232.237:80/
[20:34:45] - Couldn't send HTTP request to server
[20:34:45] + Could not connect to Work Server (results)
[20:34:45] (130.237.232.237:80)
[20:34:45] - Error: Could not transmit unit 05 (completed February 24) to work server.
[20:34:45] - 1 failed uploads of this unit.
[20:34:45] Keeping unit 05 in queue.
[20:34:45] Trying to send all finished work units
[20:34:45] Project: 6901 (Run 16, Clone 1, Gen 0)
[20:34:45] + Attempting to send results [February 24 20:34:45 UTC]
[20:34:45] - Reading file work/wuresults_05.dat from core
[20:34:45] (Read 96434828 bytes from disk)
[20:34:45] Connecting to http://130.237.232.237:8080/
[20:34:45] - Couldn't send HTTP request to server
[20:34:45] + Could not connect to Work Server (results)
[20:34:45] (130.237.232.237:8080)
[20:34:45] + Retrying using alternative port
[20:34:45] Connecting to http://130.237.232.237:80/
[20:34:45] - Couldn't send HTTP request to server
[20:34:45] + Could not connect to Work Server (results)
[20:34:45] (130.237.232.237:80)
[20:34:45] - Error: Could not transmit unit 05 (completed February 24) to work server.
[20:34:45] - 2 failed uploads of this unit.
[20:34:45] + Attempting to send results [February 24 20:34:45 UTC]
[20:34:45] - Reading file work/wuresults_05.dat from core
[20:34:45] (Read 96434828 bytes from disk)
[20:34:45] Connecting to http://130.237.165.141:8080/
[20:34:45] - Couldn't send HTTP request to server
[20:34:45] + Could not connect to Work Server (results)
[20:34:45] (130.237.165.141:8080)
[20:34:45] + Retrying using alternative port
[20:34:45] Connecting to http://130.237.165.141:80/
[20:34:45] - Couldn't send HTTP request to server
[20:34:45] + Could not connect to Work Server (results)
[20:34:45] (130.237.165.141:80)
[20:34:45] Could not transmit unit 05 to Collection server; keeping in queue.
[20:34:45] + Sent 0 of 1 completed units to the server
[20:34:45] - Preparing to get new work unit...
[20:34:45] Cleaning up work directory
[20:34:45] + Attempting to get work packet
[20:34:45] Passkey found
[20:34:45] - Will indicate memory of 24567 MB
[20:34:45] - Connecting to assignment server
[20:34:45] Connecting to http://assign.stanford.edu:8080/
[20:34:46] Posted data.
[20:34:46] Initial: ED82; - Successful: assigned to (130.237.232.237).
[20:34:46] + News From Folding@Home: Welcome to Folding@Home
[20:34:46] Loaded queue successfully.
[20:34:46] Sent data
[20:34:46] Connecting to http://130.237.232.237:8080/
[20:34:52] Posted data.
[20:34:52] Initial: 0000; - Receiving payload (expected size: 20064950)
[20:36:18] - Downloaded at ~227 kB/s
[20:36:18] - Averaged speed for that direction ~169 kB/s
[20:36:18] + Received work.
[20:36:18] Trying to send all finished work units
[20:36:18] Project: 6901 (Run 16, Clone 1, Gen 0)
[20:36:18] + Attempting to send results [February 24 20:36:18 UTC]
[20:36:18] - Reading file work/wuresults_05.dat from core
[20:36:18] (Read 96434828 bytes from disk)
[20:36:18] Connecting to http://130.237.232.237:8080/
[20:36:18] - Couldn't send HTTP request to server
[20:36:18] + Could not connect to Work Server (results)
[20:36:18] (130.237.232.237:8080)
[20:36:18] + Retrying using alternative port
[20:36:18] Connecting to http://130.237.232.237:80/
[20:36:18] - Couldn't send HTTP request to server
[20:36:18] + Could not connect to Work Server (results)
[20:36:18] (130.237.232.237:80)
[20:36:18] - Error: Could not transmit unit 05 (completed February 24) to work server.
[20:36:18] - 3 failed uploads of this unit.
[20:36:18] + Attempting to send results [February 24 20:36:18 UTC]
[20:36:18] - Reading file work/wuresults_05.dat from core
[20:36:18] (Read 96434828 bytes from disk)
[20:36:18] Connecting to http://130.237.165.141:8080/
[20:36:18] - Couldn't send HTTP request to server
[20:36:18] + Could not connect to Work Server (results)
[20:36:18] (130.237.165.141:8080)
[20:36:18] + Retrying using alternative port
[20:36:18] Connecting to http://130.237.165.141:80/
[20:36:19] - Couldn't send HTTP request to server
[20:36:19] + Could not connect to Work Server (results)
[20:36:19] (130.237.165.141:80)
[20:36:19] Could not transmit unit 05 to Collection server; keeping in queue.
[20:36:19] + Sent 0 of 1 completed units to the server
[20:36:19] + Closed connections
[20:36:19]
SR2#2 - I won't paste all 131 upload attempts.
Code:
[19:15:14] Completed 250000 out of 250000 steps (100%)
[19:15:23] DynamicWrapper: Finished Work Unit: sleep=10000
[19:15:33]
[19:15:33] Finished Work Unit:
[19:15:33] - Reading up to 52713120 from "work/wudata_05.trr": Read 52713120
[19:15:33] trr file hash check passed.
[19:15:33] - Reading up to 47103408 from "work/wudata_05.xtc": Read 47103408
[19:15:33] xtc file hash check passed.
[19:15:33] edr file hash check passed.
[19:15:33] logfile size: 218724
[19:15:33] Leaving Run
[19:15:35] - Writing 100203192 bytes of core data to disk...
[19:16:21] Done: 100202680 -> 96469248 (compressed to 10.5 percent)
[19:16:22] ... Done.
[19:16:34] - Shutting down core
[19:16:34]
[19:16:34] Folding@home Core Shutdown: FINISHED_UNIT
[19:16:38] CoreStatus = 64 (100)
[19:16:38] Unit 5 finished with 87 percent of time to deadline remaining.
[19:16:38] Updated performance fraction: 0.864802
[19:16:38] Sending work to server
[19:16:38] Project: 6901 (Run 15, Clone 1, Gen 0)
[19:16:38] + Attempting to send results [February 24 19:16:38 UTC]
[19:16:38] - Reading file work/wuresults_05.dat from core
[19:16:38] (Read 96469760 bytes from disk)
[19:16:38] Connecting to http://130.237.232.237:8080/
[19:16:38] - Couldn't send HTTP request to server
[19:16:38] + Could not connect to Work Server (results)
[19:16:38] (130.237.232.237:8080)
[19:16:38] + Retrying using alternative port
[19:16:38] Connecting to http://130.237.232.237:80/
[19:16:39] - Couldn't send HTTP request to server
[19:16:39] + Could not connect to Work Server (results)
[19:16:39] (130.237.232.237:80)
[19:16:39] - Error: Could not transmit unit 05 (completed February 24) to work server.
[19:16:39] - 1 failed uploads of this unit.
[19:16:39] Keeping unit 05 in queue.
[19:16:39] Trying to send all finished work units
[19:16:39] Project: 6901 (Run 15, Clone 1, Gen 0)
[19:16:39] + Attempting to send results [February 24 19:16:39 UTC]
[19:16:39] - Reading file work/wuresults_05.dat from core
[19:16:39] (Read 96469760 bytes from disk)
[19:16:39] Connecting to http://130.237.232.237:8080/
[19:16:39] - Couldn't send HTTP request to server
[19:16:39] + Could not connect to Work Server (results)
[19:16:39] (130.237.232.237:8080)
[19:16:39] + Retrying using alternative port
[19:16:39] Connecting to http://130.237.232.237:80/
[19:16:39] - Couldn't send HTTP request to server
[19:16:39] + Could not connect to Work Server (results)
[19:16:39] (130.237.232.237:80)
[19:16:39] - Error: Could not transmit unit 05 (completed February 24) to work server.
[19:16:39] - 2 failed uploads of this unit.
[19:16:39] + Attempting to send results [February 24 19:16:39 UTC]
[19:16:39] - Reading file work/wuresults_05.dat from core
[19:16:39] (Read 96469760 bytes from disk)
[19:16:39] Connecting to http://130.237.165.141:8080/
[19:16:39] - Couldn't send HTTP request to server
[19:16:39] + Could not connect to Work Server (results)
[19:16:39] (130.237.165.141:8080)
[19:16:39] + Retrying using alternative port
[19:16:39] Connecting to http://130.237.165.141:80/
[19:16:39] - Couldn't send HTTP request to server
[19:16:39] + Could not connect to Work Server (results)
[19:16:39] (130.237.165.141:80)
[19:16:39] Could not transmit unit 05 to Collection server; keeping in queue.
[19:16:39] + Sent 0 of 1 completed units to the server
[19:17:09] Trying to send all finished work units
[19:17:09] Project: 6901 (Run 15, Clone 1, Gen 0)
[19:17:09] + Attempting to send results [February 24 19:17:09 UTC]
[19:17:09] - Reading file work/wuresults_05.dat from core
[19:17:09] (Read 96469760 bytes from disk)
[19:17:09] Connecting to http://130.237.232.237:8080/
[19:17:09] - Couldn't send HTTP request to server
[19:17:09] + Could not connect to Work Server (results)
[19:17:09] (130.237.232.237:8080)
[19:17:09] + Retrying using alternative port
[19:17:09] Connecting to http://130.237.232.237:80/
[19:17:09] - Couldn't send HTTP request to server
[19:17:09] + Could not connect to Work Server (results)
[19:17:09] (130.237.232.237:80)
[19:17:09] - Error: Could not transmit unit 05 (completed February 24) to work server.
[19:17:09] - 3 failed uploads of this unit.
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 10:29 am
by toTOW
Count me in ... deadlines are slowly passing, and I removed my machines from BigAdv to not waste power ...
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 11:46 am
by bollix47
toTOW wrote: Count me in ... deadlines are slowly passing, and I removed my machines from BigAdv to not waste power ...
OOC, are you also using the Linux client, or was this on a Windows system?
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 12:32 pm
by kasson
Thanks, we're looking at it. The server is up and receiving uploads from a number of users, but uploads from some users seem to be failing repeatedly. We're talking with the work server code developers to get a better sense of what is going on, and whether the problem is on the server side or the client side. Can you successfully upload 6900s? That machine is near-identical except for the server software version, and it sits next to the 6901 server in the machine room.
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 1:01 pm
by MIBW
I have shut down for today, but all three machines were humming along fine, eating plenty of 6900s recently. I updated all three to A5, and all three grabbed 6901s and failed. All are Windows 7 64-bit.
Musky reports his unit did upload fine eventually, so I guess I must be special!
That said, punching http://130.237.232.237:8080/ into a web browser has not once given me an OK; I tried every 2-3 hours over the last 14 hrs. 171.67.108.22:8080 works fine.
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 2:03 pm
by kasson
It shouldn't give you an ok. Bruce is quoting old data. The new work server doesn't work that way.
Good to know that the 6900's were working well; it may be an issue with the new work server code + bigadv. I've contacted the developer, and we're looking into it.
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 2:04 pm
by kasson
PS: What is the size of the upload file (wu_results.dat or something similar in the work directory)?
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 8:59 pm
by phoenicis.
I seem to be suffering from the same issue. The log mirrors one of MIBW's above: the WU uploads for 15 minutes (almost to the second) and then fails. Could the server be timing out? Pure conjecture, but it may explain why users with a fast connection are having no problems. The results file in the work folder is 94,256 KB. I have previously had no problems uploading 6900s.
Code:
--- Opening Log file [February 25 19:53:09 UTC]
# Windows SMP Console Edition #################################################
###############################################################################
Folding@Home Client Version 6.34
http://folding.stanford.edu
###############################################################################
###############################################################################
Launch directory: C:\FAH SMP
Executable: C:\FAH SMP\Folding@home-Win32-x86.exe
Arguments: -smp -bigadv -verbosity 9
[19:53:09] - Ask before connecting: No
[19:53:09] - User name: phoenicis (Team 35947)
[19:53:09] - User ID: 1465852054606F6E
[19:53:09] - Machine ID: 1
[19:53:09]
[19:53:09] Loaded queue successfully.
[19:53:09]
[19:53:09] - Autosending finished units... [February 25 19:53:09 UTC]
[19:53:09] + Processing work unit
[19:53:09] Trying to send all finished work units
[19:53:09] Core required: FahCore_a5.exe
[19:53:09] Project: 6901 (Run 9, Clone 4, Gen 0)
[19:53:09] Core found.
[19:53:09] + Attempting to send results [February 25 19:53:09 UTC]
[19:53:09] - Reading file work/wuresults_05.dat from core
[19:53:09] Working on queue slot 06 [February 25 19:53:09 UTC]
[19:53:09] (Read 96517922 bytes from disk)
[19:53:09] + Working ...
[19:53:09] Connecting to http://130.237.232.237:8080/
[19:53:09] - Calling '.\FahCore_a5.exe -dir work/ -nice 19 -suffix 06 -np 24 -checkpoint 30 -verbose -lifeline 1776 -version 634'
[19:53:09]
[19:53:09] *------------------------------*
[19:53:09] Folding@Home Gromacs SMP Core
[19:53:09] Version 2.27 (Mar 12, 2010)
[19:53:09]
[19:53:09] Preparing to commence simulation
[19:53:09] - Ensuring status. Please wait.
[19:53:18] - Looking at optimizations...
[19:53:18] - Working with standard loops on this execution.
[19:53:18] - Previous termination of core was improper.
[19:53:18] - Going to use standard loops.
[19:53:18] - Files status OK
[19:53:23] - Expanded 24874349 -> 30796292 (decompressed 123.8 percent)
[19:53:23] Called DecompressByteArray: compressed_data_size=24874349 data_size=30796292, decompressed_data_size=30796292 diff=0
[19:53:23] - Digital signature verified
[19:53:23]
[19:53:23] Project: 6901 (Run 5, Clone 0, Gen 2)
[19:53:23]
[19:53:23] Entering M.D.
[19:53:29] Using Gromacs checkpoints
[19:53:30] Mapping NT from 24 to 24
[19:53:36] Resuming from checkpoint
[19:53:37] Verified work/wudata_06.log
[19:53:37] Verified work/wudata_06.trr
[19:53:37] Verified work/wudata_06.xtc
[19:53:37] Verified work/wudata_06.edr
[19:53:38] Completed 74410 out of 250000 steps (29%)
[19:56:18] Completed 75000 out of 250000 steps (30%)
[20:07:37] Completed 77500 out of 250000 steps (31%)
[20:08:10] - Couldn't send HTTP request to server
[20:08:10] + Could not connect to Work Server (results)
[20:08:10] (130.237.232.237:8080)
[20:08:10] + Retrying using alternative port
[20:08:10] Connecting to http://130.237.232.237:80/
[20:18:56] Completed 80000 out of 250000 steps (32%)
[20:23:12] - Couldn't send HTTP request to server
[20:23:12] + Could not connect to Work Server (results)
[20:23:12] (130.237.232.237:80)
[20:23:12] - Error: Could not transmit unit 05 (completed February 25) to work server.
[20:23:12] - 7 failed uploads of this unit.
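For anyone who wants to check their own log for the same pattern, here is a rough Python sketch that measures how long each failed attempt ran before the client gave up. The function names are made up for illustration, and it assumes that an attempt starts at an "Attempting to send results" or "Retrying using alternative port" line and ends at the next "Couldn't send HTTP request" line, which is how the logs in this thread read.

```python
import re

# Matches the "[HH:MM:SS] message" prefix used by the v6 client log.
STAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$")

START_MARKERS = ("Attempting to send results", "Retrying using alternative port")
FAIL_MARKER = "Couldn't send HTTP request"


def parse_line(line):
    """Return (seconds since midnight, message), or None for unstamped lines."""
    m = STAMP.match(line.strip())
    if m is None:
        return None
    h, mi, s = int(m.group(1)), int(m.group(2)), int(m.group(3))
    return h * 3600 + mi * 60 + s, m.group(4)


def failed_attempt_durations(lines):
    """Seconds between each attempt start and the failure that follows it."""
    durations = []
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        t, msg = parsed
        if any(marker in msg for marker in START_MARKERS):
            start = t
        elif FAIL_MARKER in msg and start is not None:
            # Modulo handles an attempt that runs past midnight UTC.
            durations.append((t - start) % 86400)
            start = None
    return durations


sample = [
    "[19:53:09] + Attempting to send results [February 25 19:53:09 UTC]",
    "[19:53:09] Connecting to http://130.237.232.237:8080/",
    "[20:08:10] - Couldn't send HTTP request to server",
    "[20:08:10] + Retrying using alternative port",
    "[20:08:10] Connecting to http://130.237.232.237:80/",
    "[20:23:12] - Couldn't send HTTP request to server",
]
print(failed_attempt_durations(sample))
```

Lines from the core that are interleaved with the upload (the "Completed ... steps" progress lines) are simply ignored, since they match neither marker.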
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 9:34 pm
by kasson
Perfect--thanks. That helps a lot. There was a setting in the new WS code that defaulted to limiting connections to 15 min. It's now at 2 hrs; we'll see how that works and extend as necessary.
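As a back-of-the-envelope check on what those limits mean for donors, the Python sketch below works out the minimum sustained upload rate needed to finish inside a given connection limit. The file size is taken from the "(Read 96517922 bytes from disk)" line in the log above; the helper name is hypothetical, and the calculation ignores protocol overhead and any variation in line speed.

```python
# Size of the results file, from the client log earlier in this thread.
RESULTS_BYTES = 96_517_922

OLD_LIMIT_S = 15 * 60       # the 15-minute default
NEW_LIMIT_S = 2 * 60 * 60   # the new 2-hour setting


def min_rate_bytes_per_s(size_bytes, limit_s):
    """Lowest average upload rate that finishes within limit_s seconds."""
    return size_bytes / limit_s


for label, limit in (("15 min", OLD_LIMIT_S), ("2 hrs", NEW_LIMIT_S)):
    rate = min_rate_bytes_per_s(RESULTS_BYTES, limit)
    print(f"{label}: at least {rate / 1000:.1f} kB/s sustained")
```

Under these assumptions, a results file of this size needs somewhere above 100 kB/s of sustained upload to beat a 15-minute cutoff, which is more than many ADSL lines provide.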
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 10:29 pm
by k1wi
Is it the same server that was running 6900? I remember that a while ago some users were having problems uploading their 6900 work units, and I don't know whether that was resolved.
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 10:34 pm
by PantherX
When I used to upload the wuresult of ~100MB, it would take 2 hours 32 minutes on my connection.
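To put numbers on that, a quick Python sketch of the arithmetic follows. It assumes "~100MB" means 100,000,000 bytes and that the rate was roughly constant; the function names are just for illustration.

```python
import math


def implied_rate(size_bytes, seconds):
    """Average upload rate in bytes per second for a completed transfer."""
    return size_bytes / seconds


def whole_hours_needed(seconds):
    """Smallest whole-hour connection limit that would cover the transfer."""
    return math.ceil(seconds / 3600)


size_bytes = 100_000_000            # "~100MB", assumed decimal megabytes
duration_s = 2 * 3600 + 32 * 60     # 2 hours 32 minutes

print(f"average rate: {implied_rate(size_bytes, duration_s) / 1000:.1f} kB/s")
print(f"limit needed: {whole_hours_needed(duration_s)} hours")
```

So an upload like this one would still run past a 2-hour connection limit and would need the limit extended further.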
Re: 130.237.232.237
Posted: Fri Feb 25, 2011 10:54 pm
by MIBW
Ok, that sounds exactly like the smoking gun - my uploads always take 18 to 25 mins, and they would time out.
ADSL in Australia is not very fast at uploads.
Will try and upload them again...
EDIT: uploaded all three fine - getting credit of 80,000 each instead of 126,000, but better than nothing.
So problem solved! Thanks guys.