171.65.103.100 - Collection Server Down
Posted: Mon Jun 15, 2009 10:25 pm
by COOLDUDEGAMER
It seems that with my NVIDIA GPU, the client tries to send finished work but cannot. It gets new work, but leaves the unsent unit in the queue.
The server # is 171.65.103.100 and according to
http://fah-web.stanford.edu/serverstat.html, the server is down. Is anyone having the same problem?
Also, is there another collection server that can be used to send finished units to?
Thanks in advance,
Signed,
COOLDUDEGAMER
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Mon Jun 15, 2009 11:01 pm
by bruce
Yes, Collection Server 171.65.103.100 is DOWN, and I don't know why it is off-line or when it might be restored to service. In general, it's not all that important to know whether a Collection Server is off-line. They do help, but they're not as important as the other servers. In other words, every Collection Server is a backup for a primary Work Server and should only be needed if there's a problem with that Work Server.
Every WU has a primary Work Server and a backup Collection Server. There's no point in worrying about the backup unless you also look at the primary Work Server, and that information was not included in the portion of FAHlog that you posted. Which WS is involved and is there a problem with it?
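One rough way to check both servers yourself is to test whether each accepts TCP connections on the two ports the client uses (8080 and 80, per the log lines in this thread). The Python sketch below is only an illustration: the function names `port_is_open` and `check_server` are hypothetical, and a successful TCP connect does not prove the server would accept results.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_server(host: str, ports=(8080, 80), timeout: float = 5.0) -> dict:
    """Map each port to True (reachable) or False (not reachable)."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

Calling `check_server("171.64.122.70")` and `check_server("171.65.103.100")` would then report the Work Server and the Collection Server separately, which is the comparison asked for above.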
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Mon Jun 15, 2009 11:33 pm
by Oldhat
Hi Bruce,
I have units trying to send to both 171.65.103.100 and 171.64.122.70. So it appears that for at least five units both the collection servers are DOWN.
The status page shows:
Mon Jun 8 16:00:10 PDT 2009 171.65.103.100 - VSPMF33 - FAIL Reject
Sun Jun 14 22:45:10 PDT 2009 171.64.122.70 GPU VSP03 - full DOWN
Perhaps Stanford could kick-start 171.64.122.70, as the other server has been off-line for so long it appears only a miracle could save it.
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Mon Jun 15, 2009 11:42 pm
by bruce
OK. 171.64.122.70 is the primary Work Server. It has gone off-line a number of times but comes back for a while, so your last sentence isn't really accurate. I'll see if I can find somebody to deal with at least one of them.
Re: 171.65.103.100 - Collection Server Down
Posted: Mon Jun 15, 2009 11:46 pm
by 7im
Already have a thread running on this topic... here:
http://foldingforum.org/viewtopic.php?f=18&t=10385
Edit by Mod: Threads merged.
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Tue Jun 16, 2009 12:05 am
by VijayPande
This machine has gone down and the sysadmins have been working on it. I'm very sorry it's taken so long, especially since these WUs have short deadlines. I've sent another email to the sysadmins asking for a status update and I will relay the info once we know it.
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Tue Jun 16, 2009 2:53 am
by ElectricVehicle
Yup, I can confirm the servers are down at the moment, and I also have a unit pending upload, but you already knew that!
Pass on my best wishes to the sysadmins for a speedy server recovery!
http://fah-web.stanford.edu/serverstat.html
171.64.122.70 GPU VSP03 - full DOWN
171.65.103.100 - VSPMF33 - CS 1 DOWN -
Code:
Slot 06 Done
Project: 5905 (Run 9, Clone 512, Gen 34), Core: 14
Work server: 171.64.122.70:8080
Collection server: 171.65.103.100
Download date: June 14 18:35:11
Finished date: June 15 07:45:21
Failed uploads: 13
Code:
[01:40:47] - Couldn't send HTTP request to server
[01:40:47] + Could not connect to Work Server (results)
[01:40:47] (171.64.122.70:8080)
[01:40:47] + Retrying using alternative port
[01:40:47] Connecting to http://171.64.122.70:80/
[01:41:08] - Couldn't send HTTP request to server
[01:41:08] + Could not connect to Work Server (results)
[01:41:08] (171.64.122.70:80)
[01:41:08] - Error: Could not transmit unit 06 (completed June 15) to work server.
[01:41:08] - 13 failed uploads of this unit.
[01:41:08] - Read packet limit of 540015616... Set to 524286976.
[01:41:08] + Attempting to send results [June 16 01:41:08 UTC]
[01:41:08] - Reading file work/wuresults_06.dat from core
[01:41:08] (Read 67156 bytes from disk)
[01:41:08] Connecting to http://171.65.103.100:8080/
[01:41:29] - Couldn't send HTTP request to server
[01:41:29] + Could not connect to Work Server (results)
[01:41:29] (171.65.103.100:8080)
[01:41:29] + Retrying using alternative port
[01:41:29] Connecting to http://171.65.103.100:80/
[01:41:50] - Couldn't send HTTP request to server
[01:41:50] + Could not connect to Work Server (results)
[01:41:50] (171.65.103.100:80)
[01:41:50] Could not transmit unit 06 to Collection server; keeping in queue.
[01:41:50] + Sent 0 of 1 completed units to the server
[01:41:50] - Autosend completed
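The order of attempts visible in this log (Work Server on port 8080, then port 80, then the Collection Server on 8080, then 80) can be sketched in Python as follows. This mirrors only what the log shows; it is not the client's actual code, and `upload_order`, `send_results`, and the `try_send` callback are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Optional, Tuple

Endpoint = Tuple[str, int]


def upload_order(work_server: str, collection_server: str,
                 ports: Tuple[int, ...] = (8080, 80)) -> List[Endpoint]:
    """List endpoints in the order the log tries them: WS ports, then CS ports."""
    return [(host, port)
            for host in (work_server, collection_server)
            for port in ports]


def send_results(work_server: str, collection_server: str,
                 try_send: Callable[[str, int], bool]) -> Optional[Endpoint]:
    """Return the first endpoint that accepts the upload.

    Returns None when every attempt fails, which corresponds to the
    "keeping in queue" line in the log.
    """
    for host, port in upload_order(work_server, collection_server):
        if try_send(host, port):
            return (host, port)
    return None
```

With both servers down, `try_send` fails four times and the result is `None`, matching the "Sent 0 of 1 completed units" outcome above.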
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Tue Jun 16, 2009 5:28 am
by boscoj
yup yup . . .
Code:
[05:15:02] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[05:15:02] Build host: amoeba
[05:15:02] Board Type: Nvidia
[05:15:02] Core :
[05:15:02] Preparing to commence simulation
[05:15:02] - Looking at optimizations...
[05:15:02] - Files status OK
[05:15:02] - Expanded 98750 -> 492276 (decompressed 498.5 percent)
[05:15:02] Called DecompressByteArray: compressed_data_size=98750 data_size=492276, decompressed_data_size=492276 diff=0
[05:15:02] - Digital signature verified
[05:15:02]
[05:15:02] Project: 5756 (Run 2, Clone 21, Gen 324)
[05:15:02]
[05:15:02] Assembly optimizations on if available.
[05:15:02] Entering M.D.
[05:15:08] Will resume from checkpoint file
[05:15:09] Working on Protein
[05:15:11] Client config found, loading data.
[05:15:11] Starting GUI Server
[05:15:12] Resuming from checkpoint
[05:15:12] Verified work/wudata_04.log
[05:15:12] Verified work/wudata_04.edr
[05:15:12] Verified work/wudata_04.xtc
[05:15:12] Completed 76%
[05:15:23] - Couldn't send HTTP request to server
[05:15:23] + Could not connect to Work Server (results)
[05:15:23] (171.64.122.70:8080)
[05:15:23] + Retrying using alternative port
[05:15:44] - Couldn't send HTTP request to server
[05:15:44] + Could not connect to Work Server (results)
[05:15:44] (171.64.122.70:80)
[05:15:44] - Error: Could not transmit unit 08 (completed June 15) to work server.
[05:15:44] + Attempting to send results [June 16 05:15:44 UTC]
[05:16:05] - Couldn't send HTTP request to server
[05:16:05] + Could not connect to Work Server (results)
[05:16:05] (171.65.103.100:8080)
[05:16:05] + Retrying using alternative port
[05:16:26] - Couldn't send HTTP request to server
[05:16:26] + Could not connect to Work Server (results)
[05:16:26] (171.65.103.100:80)
[05:16:26] Could not transmit unit 08 to Collection server; keeping in queue.
[05:17:21] Completed 77%
[05:19:30] Completed 78%
Re: 171.64.122.70
Posted: Tue Jun 16, 2009 9:06 am
by Teddy
Any update on this?
Teddy
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Tue Jun 16, 2009 10:54 am
by gas1can
How many completed WUs can the client hold and continue to crunch on new units?
I run 2 boxes, each with dual GPUs, and feel as if I am running out of time and my clients go idle.
I seem to be able to get new WUs but am unable to send any of my completed ones.
If our WU deadline is reached, will we still get credit since this is a server issue?
I also noticed a period of more than an hour in which my clients could not get a new WU. Is there a way to increase the WU cache?
FYI, I run only the Windows console client, for the ease of multi-GPU configuration and the lack of CPU load.
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Tue Jun 16, 2009 11:25 am
by Oldhat
gas1can » Tue Jun 16, 2009 10:54 am
How many completed WUs can the client hold and continue to crunch on new units?
I run 2 boxes, each with dual GPUs, and feel as if I am running out of time and my clients go idle.
There are 10 slots for work units. If you are unlucky enough not to get the work unit sent back, the data will get overwritten by the 10th unit, and things will continue on.
I also noticed a period of more than an hour in which my clients could not get a new WU. Is there a way to increase the WU cache?
Sometimes the reason that you fail to get a unit is purely due to the number of computers trying to gain another unit.
One hour downtimes aren't usually due to a lack of available units, unless you're really lucky.
When the servers went down for maintenance, you would have noticed longer periods of downtime due to the large numbers trying to get units.
All going well, the problem with 171.65.103.100 and 171.64.122.70 will be resolved prior to the units getting overwritten or passing their "use by date".
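The 10-slot behaviour described above can be pictured as a small circular buffer: each new unit takes the next slot, and once the index wraps around, whatever was still sitting in that slot is replaced. The Python sketch below is a toy model of that description only; `SlotQueue` and `assign` are hypothetical names, and the real client's queue handling is not reproduced here.

```python
QUEUE_SLOTS = 10  # slot count as stated in this thread


class SlotQueue:
    """Toy circular queue: new units take the next slot, wrapping after the last."""

    def __init__(self, slots: int = QUEUE_SLOTS):
        self.slots = [None] * slots
        self.next_index = 0

    def assign(self, unit: str):
        """Store a new unit; return (slot index, unit that was overwritten or None)."""
        index = self.next_index
        overwritten = self.slots[index]
        self.slots[index] = unit
        self.next_index = (index + 1) % len(self.slots)
        return index, overwritten
```

In this model the first ten assignments fill slots 0 through 9 without loss, and the eleventh lands on slot 0 again, displacing the unit stored there if it was never uploaded.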
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Tue Jun 16, 2009 12:26 pm
by Shadowtester
Each client has a queue with 10 slots, so you could hold 9 completed WUs while folding 1 WU.
Code:
[10:18:41] + Attempting to send results [June 16 10:18:41 UTC]
[10:18:41] - Reading file work/wuresults_00.dat from core
[10:18:41] (Read 70355 bytes from disk)
[10:18:41] Connecting to http://171.64.122.70:8080/
[10:21:50] - Couldn't send HTTP request to server
[10:21:50] + Could not connect to Work Server (results)
[10:21:50] (171.64.122.70:8080)
[10:21:50] + Retrying using alternative port
[10:21:50] Connecting to http://171.64.122.70:80/
[10:24:59] - Couldn't send HTTP request to server
[10:24:59] + Could not connect to Work Server (results)
[10:24:59] (171.64.122.70:80)
[10:24:59] - Error: Could not transmit unit 00 (completed June 15) to work server.
[10:24:59] - 27 failed uploads of this unit.
[10:24:59] - Read packet limit of 540015616... Set to 524286976.
[10:24:59] + Attempting to send results [June 16 10:24:59 UTC]
[10:24:59] - Reading file work/wuresults_00.dat from core
[10:24:59] (Read 70355 bytes from disk)
[10:24:59] Connecting to http://171.65.103.100:8080/
[10:28:08] - Couldn't send HTTP request to server
[10:28:08] + Could not connect to Work Server (results)
[10:28:08] (171.65.103.100:8080)
[10:28:08] + Retrying using alternative port
[10:28:08] Connecting to http://171.65.103.100:80/
[10:31:17] - Couldn't send HTTP request to server
[10:31:17] + Could not connect to Work Server (results)
[10:31:17] (171.65.103.100:80)
[10:31:17] Could not transmit unit 00 to Collection server; keeping in queue.
[10:31:17] + Sent 0 of 2 completed units to the server
[10:31:17] + Closed connections
The server is still down; I am currently at 27 failed attempts to upload my two completed WUs. The only good thing is that I have not received any more new WUs assigned to this server for data collection. It must be getting close to the preferred deadline, if it has not already exceeded it. What will happen if these WUs exceed their deadlines because the server is down?
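The client prints its own counter ("27 failed uploads of this unit"), but when several units are stuck it can help to tally failures per queue slot from a saved log. The Python sketch below assumes log lines worded exactly like the excerpts in this thread; the function name `count_failures` and the shape of its result are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# [10:24:59] - Error: Could not transmit unit 00 (completed June 15) to work server.
WS_FAIL = re.compile(r"Could not transmit unit (\d+) \(completed [^)]*\) to work server")
# Matches lines such as:
# [10:31:17] Could not transmit unit 00 to Collection server; keeping in queue.
CS_FAIL = re.compile(r"Could not transmit unit (\d+) to Collection server")


def count_failures(log_text: str) -> dict:
    """Count failed upload attempts per queue slot, split by server role."""
    work_server = Counter()
    collection_server = Counter()
    for line in log_text.splitlines():
        match = WS_FAIL.search(line)
        if match:
            work_server[match.group(1)] += 1
            continue
        match = CS_FAIL.search(line)
        if match:
            collection_server[match.group(1)] += 1
    return {"work_server": dict(work_server),
            "collection_server": dict(collection_server)}
```

Slot numbers are kept as the two-digit strings that appear in the log (for example "00" or "06"), so they line up with the `wuresults_NN.dat` file names.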
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Tue Jun 16, 2009 12:29 pm
by shdbcamping
For me it seems to be all 5905 WUs: several are not sending, across 3 different computers and several collection servers. Any ideas would be helpful, as I don't know what to do with the work folder after the WUs expire.
Code:
[10:45:36] Completed 97%
[10:46:49] Completed 98%
[10:48:02] Completed 99%
[10:49:15] Completed 100%
[10:49:15] Successful run
[10:49:15] DynamicWrapper: Finished Work Unit: sleep=10000
[10:49:25] Reserved 79132 bytes for xtc file; Cosm status=0
[10:49:25] Allocated 79132 bytes for xtc file
[10:49:25] - Reading up to 79132 from "work/wudata_05.xtc": Read 79132
[10:49:25] Read 79132 bytes from xtc file; available packet space=786351332
[10:49:25] xtc file hash check passed.
[10:49:25] Reserved 23472 23472 786351332 bytes for arc file=<work/wudata_05.trr> Cosm status=0
[10:49:25] Allocated 23472 bytes for arc file
[10:49:25] - Reading up to 23472 from "work/wudata_05.trr": Read 23472
[10:49:25] Read 23472 bytes from arc file; available packet space=786327860
[10:49:25] trr file hash check passed.
[10:49:25] Allocated 560 bytes for edr file
[10:49:25] Read bedfile
[10:49:25] edr file hash check passed.
[10:49:25] Allocated 31105 bytes for logfile
[10:49:25] Read logfile
[10:49:25] GuardedRun: success in DynamicWrapper
[10:49:25] GuardedRun: done
[10:49:25] Run: GuardedRun completed.
[10:49:30] - Writing 134781 bytes of core data to disk...
[10:49:30] Done: 134269 -> 111321 (compressed to 82.9 percent)
[10:49:30] ... Done.
[10:49:30] - Shutting down core
[10:49:30]
[10:49:30] Folding@home Core Shutdown: FINISHED_UNIT
[10:49:34] CoreStatus = 64 (100)
[10:49:34] Sending work to server
[10:49:34] Project: 5762 (Run 6, Clone 277, Gen 5)
[10:49:34] - Read packet limit of 540015616... Set to 524286976.
[10:49:34] + Attempting to send results [June 16 10:49:34 UTC]
[10:49:36] + Results successfully sent
[10:49:36] Thank you for your contribution to Folding@Home.
[10:49:36] + Number of Units Completed: 1789
[10:49:41] Project: 5905 (Run 9, Clone 716, Gen 13)
[10:49:41] - Read packet limit of 540015616... Set to 524286976.
[10:49:41] + Attempting to send results [June 16 10:49:41 UTC]
[10:50:01] - Couldn't send HTTP request to server
[10:50:01] + Could not connect to Work Server (results)
[10:50:01] (171.64.122.70:8080)
[10:50:01] + Retrying using alternative port
[10:50:23] - Couldn't send HTTP request to server
[10:50:23] + Could not connect to Work Server (results)
[10:50:23] (171.64.122.70:80)
[10:50:23] - Error: Could not transmit unit 06 (completed June 15) to work server.
[10:50:23] - Read packet limit of 540015616... Set to 524286976.
[10:50:23] + Attempting to send results [June 16 10:50:23 UTC]
[10:50:44] - Couldn't send HTTP request to server
[10:50:44] + Could not connect to Work Server (results)
[10:50:44] (171.65.103.100:8080)
[10:50:44] + Retrying using alternative port
[10:51:05] - Couldn't send HTTP request to server
[10:51:05] + Could not connect to Work Server (results)
[10:51:05] (171.65.103.100:80)
[10:51:05] Could not transmit unit 06 to Collection server; keeping in queue.
[10:51:05] - Preparing to get new work unit...
[10:51:05] + Attempting to get work packet
[10:51:05] - Connecting to assignment server
[10:51:05] - Successful: assigned to (171.67.108.11).
[10:51:05] + News From Folding@Home: Welcome to Folding@Home
[10:51:05] Loaded queue successfully.
[10:51:08] Project: 5905 (Run 9, Clone 716, Gen 13)
[10:51:08] - Read packet limit of 540015616... Set to 524286976.
[10:51:08] + Attempting to send results [June 16 10:51:08 UTC]
[10:51:29] - Couldn't send HTTP request to server
[10:51:29] + Could not connect to Work Server (results)
[10:51:29] (171.64.122.70:8080)
[10:51:29] + Retrying using alternative port
[10:51:50] - Couldn't send HTTP request to server
[10:51:50] + Could not connect to Work Server (results)
[10:51:50] (171.64.122.70:80)
[10:51:50] - Error: Could not transmit unit 06 (completed June 15) to work server.
[10:51:50] - Read packet limit of 540015616... Set to 524286976.
[10:51:50] + Attempting to send results [June 16 10:51:50 UTC]
[10:52:11] - Couldn't send HTTP request to server
[10:52:11] + Could not connect to Work Server (results)
[10:52:11] (171.65.103.100:8080)
[10:52:11] + Retrying using alternative port
[10:52:32] - Couldn't send HTTP request to server
[10:52:32] + Could not connect to Work Server (results)
[10:52:32] (171.65.103.100:80)
[10:52:32] Could not transmit unit 06 to Collection server; keeping in queue.
[10:52:32] + Closed connections
[10:52:32]
[10:52:32] + Processing work unit
[10:52:32] Core required: FahCore_11.exe
[10:52:32] Core found.
[10:52:32] Working on queue slot 07 [June 16 10:52:32 UTC]
[10:52:32] + Working ...
[10:52:32]
[10:52:32] *------------------------------*
[10:52:32] Folding@Home GPU Core - Beta
[10:52:32] Version 1.19 (Mon Nov 3 09:34:13 PST 2008)
[10:52:32]
[10:52:32] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[10:52:32] Build host: amoeba
[10:52:32] Board Type: Nvidia
[10:52:32] Core :
[10:52:32] Preparing to commence simulation
[10:52:32] - Looking at optimizations...
[10:52:32] - Created dyn
[10:52:32] - Files status OK
[10:52:32] - Expanded 45349 -> 251112 (decompressed 553.7 percent)
[10:52:32] Called DecompressByteArray: compressed_data_size=45349 data_size=251112, decompressed_data_size=251112 diff=0
[10:52:32] - Digital signature verified
[10:52:32]
[10:52:32] Project: 5769 (Run 13, Clone 173, Gen 272)
[10:52:32]
[10:52:32] Assembly optimizations on if available.
[10:52:32] Entering M.D.
[10:52:39] Working on Protein
[10:52:39] Client config found, loading data.
[10:52:39] Starting GUI Server
[10:53:37] Completed 1%
[10:54:36] Completed 2%
[10:55:34] Completed 3%
[10:56:32] Completed 4%
[10:57:30] Completed 5%
[10:58:29] Completed 6%
[10:59:27] Completed 7%
[11:00:25] Completed 8%
[11:01:24] Completed 9%
[11:02:22] Completed 10%
[11:03:20] Completed 11%
[11:04:18] Completed 12%
[11:05:17] Completed 13%
[11:06:15] Completed 14%
[11:07:13] Completed 15%
[11:08:12] Completed 16%
[11:09:10] Completed 17%
[11:10:08] Completed 18%
[11:11:06] Completed 19%
[11:12:05] Completed 20%
[11:13:03] Completed 21%
[11:14:01] Completed 22%
[11:15:00] Completed 23%
[11:15:58] Completed 24%
[11:16:56] Completed 25%
[11:17:54] Completed 26%
[11:18:53] Completed 27%
[11:19:51] Completed 28%
[11:20:49] Completed 29%
[11:21:48] Completed 30%
[11:22:46] Completed 31%
[11:23:44] Completed 32%
[11:24:43] Completed 33%
[11:25:41] Completed 34%
[11:26:39] Completed 35%
[11:27:37] Completed 36%
[11:28:36] Completed 37%
[11:29:34] Completed 38%
[11:30:32] Completed 39%
[11:31:30] Completed 40%
[11:32:29] Completed 41%
[11:33:27] Completed 42%
[11:34:25] Completed 43%
[11:35:24] Completed 44%
[11:36:22] Completed 45%
[11:37:20] Completed 46%
[11:38:18] Completed 47%
[11:39:17] Completed 48%
[11:40:15] Completed 49%
[11:41:13] Completed 50%
[11:42:12] Completed 51%
[11:43:10] Completed 52%
[11:44:08] Completed 53%
[11:45:07] Completed 54%
[11:46:05] Completed 55%
[11:47:03] Completed 56%
[11:48:01] Completed 57%
[11:49:00] Completed 58%
[11:49:58] Completed 59%
[11:50:56] Completed 60%
[11:51:55] Completed 61%
[11:52:53] Completed 62%
[11:53:51] Completed 63%
[11:54:49] Completed 64%
[11:55:48] Completed 65%
[11:56:46] Completed 66%
[11:57:44] Completed 67%
[11:58:43] Completed 68%
[11:59:41] Completed 69%
[12:00:39] Completed 70%
[12:00:42] Project: 5905 (Run 9, Clone 716, Gen 13)
[12:00:42] - Read packet limit of 540015616... Set to 524286976.
[12:00:42] + Attempting to send results [June 16 12:00:42 UTC]
[12:01:03] - Couldn't send HTTP request to server
[12:01:03] + Could not connect to Work Server (results)
[12:01:03] (171.64.122.70:8080)
[12:01:03] + Retrying using alternative port
[12:01:24] - Couldn't send HTTP request to server
[12:01:24] + Could not connect to Work Server (results)
[12:01:24] (171.64.122.70:80)
[12:01:24] - Error: Could not transmit unit 06 (completed June 15) to work server.
[12:01:24] - Read packet limit of 540015616... Set to 524286976.
[12:01:24] + Attempting to send results [June 16 12:01:24 UTC]
[12:01:38] Completed 71%
[12:01:45] - Couldn't send HTTP request to server
[12:01:45] + Could not connect to Work Server (results)
[12:01:45] (171.65.103.100:8080)
[12:01:45] + Retrying using alternative port
[12:02:06] - Couldn't send HTTP request to server
[12:02:06] + Could not connect to Work Server (results)
[12:02:06] (171.65.103.100:80)
[12:02:06] Could not transmit unit 06 to Collection server; keeping in queue.
[12:02:36] Completed 72%
[12:03:34] Completed 73%
[12:04:33] Completed 74%
[12:05:31] Completed 75%
[12:06:29] Completed 76%
[12:07:28] Completed 77%
[12:08:26] Completed 78%
[12:09:24] Completed 79%
[12:10:22] Completed 80%
[12:11:21] Completed 81%
[12:12:19] Completed 82%
[12:13:17] Completed 83%
[12:14:16] Completed 84%
[12:15:14] Completed 85%
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Tue Jun 16, 2009 3:52 pm
by capreppy
I just checked mine and I have until June 22nd before they expire (pass the final deadline). It does seem limited to 5905s picked up yesterday. I don't seem to have this issue with other WUs. My 10 GPUs are still picking up GPU WUs, completing them, and sending them back, just not to this server. I checked all of my clients and it looks as if I have only 4 WUs in total that haven't been sent in. They were all picked up very early yesterday morning, and upwards of 100 attempts have been made to send them. As I indicated, I have until June 22nd for all of them, so I have time.
Re: 171.65.103.100 ports 80 and 8080 down
Posted: Tue Jun 16, 2009 4:07 pm
by Nathan_P
Hi
Noob here, so I can't post a link, but Vijay's blog says that they have had major problems over the last day or so with the GPU servers. 3 went down, and only 2 are back up and under heavy load. They are also rolling out the new server code, so they are stretched thin anyway. Hope this helps shed some light.