and Down?

Posted: Fri Mar 27, 2009 1:32 am
by road-runner
I've got a few rigs that can't send in results. Could someone kick the servers, please...

Code:

[01:10:49] Finished Work Unit:
[01:10:49] - Reading up to 21182544 from "work/wudata_09.trr": Read 21182544
[01:10:49] trr file hash check passed.
[01:10:49] - Reading up to 27672948 from "work/wudata_09.xtc": Read 27672948
[01:10:49] xtc file hash check passed.
[01:10:49] edr file hash check passed.
[01:10:49] logfile size: 178024
[01:10:49] Leaving Run
[01:10:51] - Writing 49249980 bytes of core data to disk...
[01:10:52]   ... Done.
[01:11:01] - Shutting down core
[01:11:01] Folding@home Core Shutdown: FINISHED_UNIT
[01:14:14] CoreStatus = 64 (100)
[01:14:14] Sending work to server
[01:14:14] Project: 2671 (Run 10, Clone 71, Gen 0)

[01:14:14] + Attempting to send results [March 27 01:14:14 UTC]
[01:14:14] - Couldn't send HTTP request to server
[01:14:14]   (Got status 503)
[01:14:14] + Could not connect to Work Server (results)
[01:14:14]     (
[01:14:14] + Retrying using alternative port
[01:14:14] - Couldn't send HTTP request to server
[01:14:14] + Could not connect to Work Server (results)
[01:14:14]     (
[01:14:14] - Error: Could not transmit unit 09 (completed March 27) to work server.
[01:14:14]   Keeping unit 09 in queue.
[01:14:14] Project: 2671 (Run 10, Clone 71, Gen 0)

[01:14:14] + Attempting to send results [March 27 01:14:14 UTC]
[01:14:14] - Couldn't send HTTP request to server
[01:14:14]   (Got status 503)
[01:14:14] + Could not connect to Work Server (results)
[01:14:14]     (
[01:14:14] + Retrying using alternative port
[01:14:14] - Couldn't send HTTP request to server
[01:14:14] + Could not connect to Work Server (results)
[01:14:14]     (
[01:14:14] - Error: Could not transmit unit 09 (completed March 27) to work server.

[01:14:14] + Attempting to send results [March 27 01:14:14 UTC]
[01:14:15] - Couldn't send HTTP request to server
[01:14:15]   (Got status 503)
[01:14:15] + Could not connect to Work Server (results)
[01:14:15]     (
[01:14:15] + Retrying using alternative port
[01:14:15] - Couldn't send HTTP request to server
[01:14:15]   (Got status 503)
[01:14:15] + Could not connect to Work Server (results)
[01:14:15]     (
[01:14:15]   Could not transmit unit 09 to Collection server; keeping in queue.
[01:14:15] - Preparing to get new work unit...
[01:14:15] + Attempting to get work packet
[01:14:15] - Connecting to assignment server
[01:14:15] - Successful: assigned to (
[01:14:15] + News From Folding@Home: Welcome to Folding@Home
[01:14:15] Loaded queue successfully.
[01:14:20] Project: 2671 (Run 10, Clone 71, Gen 0)

[01:14:20] + Attempting to send results [March 27 01:14:20 UTC]
[01:14:21] - Couldn't send HTTP request to server
[01:14:21]   (Got status 503)
[01:14:21] + Could not connect to Work Server (results)
[01:14:21]     (
[01:14:21] + Retrying using alternative port
[01:14:21] - Couldn't send HTTP request to server
[01:14:21] + Could not connect to Work Server (results)
[01:14:21]     (
[01:14:21] - Error: Could not transmit unit 09 (completed March 27) to work server.

[01:14:21] + Attempting to send results [March 27 01:14:21 UTC]
[01:14:21] - Couldn't send HTTP request to server
[01:14:21]   (Got status 503)
[01:14:21] + Could not connect to Work Server (results)
[01:14:21]     (
[01:14:21] + Retrying using alternative port
[01:14:21] - Couldn't send HTTP request to server
[01:14:21]   (Got status 503)
[01:14:21] + Could not connect to Work Server (results)
[01:14:21]     (
[01:14:21]   Could not transmit unit 09 to Collection server; keeping in queue.
[01:14:26] Project: 2671 (Run 10, Clone 71, Gen 0)

[01:14:26] + Attempting to send results [March 27 01:14:26 UTC]
[01:14:27] - Couldn't send HTTP request to server
[01:14:27]   (Got status 503)
[01:14:27] + Could not connect to Work Server (results)
[01:14:27]     (
[01:14:27] + Retrying using alternative port
[01:14:27] - Couldn't send HTTP request to server
[01:14:27] + Could not connect to Work Server (results)
[01:14:27]     (
[01:14:27] - Error: Could not transmit unit 09 (completed March 27) to work server.

[01:14:27] + Attempting to send results [March 27 01:14:27 UTC]
[01:14:27] - Couldn't send HTTP request to server
[01:14:27]   (Got status 503)
[01:14:27] + Could not connect to Work Server (results)
[01:14:27]     (
[01:14:27] + Retrying using alternative port
[01:14:27] - Couldn't send HTTP request to server
[01:14:27]   (Got status 503)
[01:14:27] + Could not connect to Work Server (results)
[01:14:27]     (
[01:14:27]   Could not transmit unit 09 to Collection server; keeping in queue.
[01:14:27] + Closed connections
[01:14:27] + Processing work unit
[01:14:27] At least 4 processors must be requested.Core required: FahCore_a2.exe
[01:14:27] Core found.
[01:14:27] Working on queue slot 00 [March 27 01:14:27 UTC]
[01:14:27] + Working ...
[01:14:27] *------------------------------*
[01:14:27] Folding@Home Gromacs SMP Core
[01:14:27] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[01:14:27] Preparing to commence simulation
[01:14:27] - Ensuring status. Please wait.
[01:14:28] Called DecompressByteArray: compressed_data_size=4834326 data_size=24039181, decompressed_data_size=24039181 diff=0
[01:14:28] - Digital signature verified
[01:14:28] Project: 2677 (Run 7, Clone 1, Gen 1)
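To see how often the client is being refused, you can count the 503 replies and the per-unit "Could not transmit" messages in the log. Here is a minimal Python sketch. It assumes the log lines look like the excerpt above. The helper name `summarize_upload_failures` is hypothetical and is not part of the Folding@home client.

```python
import re
from collections import Counter

# Matches both failure messages seen in the log above, e.g.
#   "Error: Could not transmit unit 09 (completed March 27) to work server."
#   "Could not transmit unit 09 to Collection server; keeping in queue."
UNIT_FAIL = re.compile(r"Could not transmit unit (\d+)")


def summarize_upload_failures(log_text: str) -> dict:
    """Count HTTP 503 replies and failed transmits per queue slot (hypothetical helper).

    Work Server and Collection Server failures are both counted per unit.
    """
    status_503 = 0
    per_unit = Counter()
    for line in log_text.splitlines():
        if "(Got status 503)" in line:
            status_503 += 1
        match = UNIT_FAIL.search(line)
        if match:
            per_unit[match.group(1)] += 1
    return {"status_503": status_503, "failed_transmits": dict(per_unit)}


if __name__ == "__main__":
    sample = """\
[01:14:14] + Attempting to send results [March 27 01:14:14 UTC]
[01:14:14] - Couldn't send HTTP request to server
[01:14:14]   (Got status 503)
[01:14:14] + Retrying using alternative port
[01:14:14] - Couldn't send HTTP request to server
[01:14:14] - Error: Could not transmit unit 09 (completed March 27) to work server.
[01:14:15]   Could not transmit unit 09 to Collection server; keeping in queue.
"""
    print(summarize_upload_failures(sample))
    # → {'status_503': 1, 'failed_transmits': {'09': 2}}
```

The 503 count only covers attempts that logged a status line. In the log above, the alternative-port retries often fail without printing one.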

Re: and Down?

Posted: Fri Mar 27, 2009 1:37 am
by anandhanju
Net load for .24 appears to be stuck at 300.

Re: and Down?

Posted: Fri Mar 27, 2009 12:30 pm
by Pick2
I've also got two WUs waiting for the server to get fixed. Net load is still at 300.

Re: and Down?

Posted: Fri Mar 27, 2009 4:02 pm
by susato
It's back up (or else the collection server is) - it just accepted two work units from my machine when I restarted Folding to force the upload.

Re: and Down?

Posted: Sat Mar 28, 2009 2:38 am
by road-runner
Seems to be worse today; I have a lot of units trying to send in. Should we go ahead and shut down until they get this fixed? It doesn't seem like it does much good to fold the WUs if they're going to expire before they get back...

Re: and Down?

Posted: Sat Mar 28, 2009 1:43 pm
by Pick2
Net load for .24 appears to be stuck at 300 again.
Won't take a WU at present.

Re: and Down?

Posted: Sat Mar 28, 2009 1:50 pm
by road-runner
Yeah, still having problems here too; this is 2 days now...

Re: and Down?

Posted: Sat Mar 28, 2009 1:56 pm
by goodyca
I also have a WU (project 2676, run 3, clone 196, generation 2) that cannot connect to this server. It was completed Mar 27 01:44:34 2009 UTC and is due Mar 29 07:10:40 2009 UTC (3 days).

Here is its last attempt to connect:

Code:

[11:05:32] - Autosending finished units... [March 28 11:05:32 UTC]
[11:05:32] Trying to send all finished work units
[11:05:32] Project: 2676 (Run 3, Clone 196, Gen 2)

[11:05:32] + Attempting to send results [March 28 11:05:32 UTC]
[11:05:32] - Reading file work/wuresults_03.dat from core
[11:05:32]   (Read 26036354 bytes from disk)
[11:05:32] Connecting to
[11:05:32] - Couldn't send HTTP request to server
[11:05:32]   (Got status 503)
[11:05:32] + Could not connect to Work Server (results)
[11:05:32]     (
[11:05:32] + Retrying using alternative port
[11:05:32] Connecting to
[11:05:32] - Couldn't send HTTP request to server
[11:05:32] + Could not connect to Work Server (results)
[11:05:32]     (
[11:05:32] - Error: Could not transmit unit 03 (completed March 27) to work server.
[11:05:32] - 12 failed uploads of this unit.

[11:05:32] + Attempting to send results [March 28 11:05:32 UTC]
[11:05:32] - Reading file work/wuresults_03.dat from core
[11:05:32]   (Read 26036354 bytes from disk)
[11:05:32] Connecting to
[11:05:32] - Couldn't send HTTP request to server
[11:05:32]   (Got status 503)
[11:05:32] + Could not connect to Work Server (results)
[11:05:32]     (
[11:05:32] + Retrying using alternative port
[11:05:32] Connecting to
[11:05:33] - Couldn't send HTTP request to server
[11:05:33]   (Got status 503)
[11:05:33] + Could not connect to Work Server (results)
[11:05:33]     (
[11:05:33]   Could not transmit unit 03 to Collection server; keeping in queue.
[11:05:33] + Sent 0 of 1 completed units to the server
[11:05:33] - Autosend completed
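Using the times goodyca gives for this unit, you can work out how much slack is left before the deadline with simple date arithmetic. This small Python sketch copies its timestamps from the post and the log above. The only assumption is that all of them are UTC.

```python
from datetime import datetime, timezone

# Timestamps quoted for project 2676 (Run 3, Clone 196, Gen 2), all UTC.
completed = datetime(2009, 3, 27, 1, 44, 34, tzinfo=timezone.utc)
deadline = datetime(2009, 3, 29, 7, 10, 40, tzinfo=timezone.utc)
# Start of the last logged upload attempt: "[March 28 11:05:32 UTC]".
last_attempt = datetime(2009, 3, 28, 11, 5, 32, tzinfo=timezone.utc)

# Total time the unit could sit in the queue after it finished.
margin = deadline - completed
# Time still left when the 12th failed upload was logged.
remaining_at_last_attempt = deadline - last_attempt

print(margin)                     # → 2 days, 5:26:06
print(remaining_at_last_attempt)  # → 20:05:08
```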

Re: and Down?

Posted: Sat Mar 28, 2009 2:06 pm
by Flathead74
[12:52:26] - Autosending finished units... [March 28 12:52:26 UTC]
[12:52:26] Trying to send all finished work units
[12:52:26] Project: 2677 (Run 34, Clone 50, Gen 2)

[12:52:26] + Attempting to send results [March 28 12:52:26 UTC]
[12:52:26] - Reading file work/wuresults_05.dat from core
[12:52:26] (Read 49222561 bytes from disk)
[12:52:26] Connecting to
[12:52:27] - Couldn't send HTTP request to server
[12:52:27] (Got status 503)
[12:52:27] + Could not connect to Work Server (results)
[12:52:27] (
[12:52:27] + Retrying using alternative port
[12:52:27] Connecting to
[12:52:27] - Couldn't send HTTP request to server
[12:52:27] (Got status 503)
[12:52:27] + Could not connect to Work Server (results)
[12:52:27] (
[12:52:27] - Error: Could not transmit unit 05 (completed March 28) to work server.

Re: and Down?

Posted: Sat Mar 28, 2009 11:29 pm
by toTOW
I've just notified the Pande Group about this issue.

Re: and Down?

Posted: Sun Mar 29, 2009 12:58 am
by VijayPande
Thanks for the heads-up. Looking at the situation, we see that both of these servers are up right now, and we've made some tweaks to help them out. This may be some weird Stanford network issue, especially since the servers seem fine even though people are clearly having problems. We're keeping an eye on it and will post more info when we have it.

Re: and Down?

Posted: Sun Mar 29, 2009 2:37 pm
by Pick2
They seem to have been working fine since some point after your post, VP :)
Both and have sent up all "Ready For Upload" WUs and have downloaded new WUs when assigned there.

Re: and Down?

Posted: Sun Mar 29, 2009 8:18 pm
by WickedPixie
Seems like I'm still having issues uploading...

3 in the queue on this one, and a couple on another.

Code:

[10:01:20] Folding@home Core Shutdown: FINISHED_UNIT
Error encountered before initializing MPICH
[10:04:14] CoreStatus = 64 (100)
[10:04:14] Unit 6 finished with 79 percent of time to deadline remaining.
[10:04:14] Updated performance fraction: 0.784875
[10:04:14] Sending work to server
[10:04:14] - Already sending work
[10:04:14] Trying to send all finished work units
[10:04:14] - Already sending work
[10:04:14] - Already sending work
[10:04:14] - Already sending work
[10:04:14] + Sent 0 of 3 completed units to the server
[10:04:14] - Preparing to get new work unit...
[10:04:14] + Attempting to get work packet
[10:04:14] - Will indicate memory of 1004 MB
[10:04:14] - Connecting to assignment server
[10:04:14] Connecting to
[10:04:14] Posted data.
[10:04:14] Initial: 40AB; - Successful: assigned to (
[10:04:14] + News From Folding@Home: Welcome to Folding@Home
[10:04:15] Loaded queue successfully.
[10:04:15] Connecting to
[10:04:20] Posted data.
[10:04:20] Initial: 0000; - Receiving payload (expected size: 4838194)
[10:04:29] - Downloaded at ~524 kB/s
[10:04:29] - Averaged speed for that direction ~511 kB/s
[10:04:29] + Received work.
[10:04:29] Trying to send all finished work units
[10:04:29] - Already sending work
[10:04:29] - Already sending work
[10:04:29] - Already sending work
[10:04:29] + Sent 0 of 3 completed units to the server
[10:04:29] + Closed connections
[10:04:29] + Processing work unit
[10:04:29] At least 4 processors must be requested.Core required: FahCore_a2.exe
[10:04:29] Core found.
[10:04:29] Working on queue slot 07 [March 29 10:04:29 UTC]
[10:04:29] + Working ...
[10:04:29] - Calling './mpiexec -np 4 -host ./FahCore_a2.exe -dir work/ -suffix 07 -checkpoint 15 -verbose -lifeline 17582 -version 624'

[10:04:29] *------------------------------*
[10:04:29] Folding@Home Gromacs SMP Core
[10:04:29] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[10:04:29] Preparing to commence simulation
[10:04:29] - Ensuring status. Please wait.
[10:04:30] Called DecompressByteArray: compressed_data_size=4837682 data_size=24019629, decompressed_data_size=24019629 diff=0
[10:04:30] - Digital signature verified
[10:04:30] Project: 2677 (Run 3, Clone 25, Gen 2)
[10:04:30] Assembly optimizations on if available.
[10:04:30] Entering M.D.
[10:04:42]  (Run 3, Clone 25, Gen 2)

Re: and Down?

Posted: Sun Apr 05, 2009 12:35 pm
by bollix47
Problems with again.

Net load around 300.

Code:

Sun Apr 5 05:00:20 PDT 2009 	classic 	vsp21v 	kasson 	full 	Accepting 	0.10 	300 	0 	3 	26815 	2664 	0 	1.29 	4369 	4369 	2540 	- 	- 	- 	- 	22 	0 	3 	3 	 0 	- 	0 	 0 	LX; LX; 	9000000, 1 	6, 6 	5, 5 	10000, 10000 	64, 64 	- 	F, A, B, F, A, B 	8080, 8080 	6.27.01 	- 	- 	- 	11 	kasson 	1 	vsp21v 	
Sun Apr 5 05:10:20 PDT 2009 	classic 	vsp21v 	kasson 	full 	Accepting 	0.04 	301 	1 	3 	26815 	2664 	0 	1.29 	4369 	4369 	2540 	- 	- 	- 	- 	22 	0 	3 	3 	0 	- 	0 	0 	LX; LX; 	9000000, 1 	6, 6 	5, 5 	10000, 10000 	64, 64 	- 	F, A, B, F, A, B 	8080, 8080 	6.27.01 	- 	- 	- 	11 	kasson 	1 	vsp21v 	
Sun Apr 5 05:20:20 PDT 2009 	classic 	vsp21v 	kasson 	full 	Accepting 	0.00 	300 	0 	3 	26815 	2664 	0 	1.29 	4369 	4369 	2540 	- 	- 	- 	- 	22 	0 	3 	3 	0 	- 	0 	0 	LX; LX; 	9000000, 1 	6, 6 	5, 5 	10000, 10000 
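The status rows above are tab-separated. This minimal Python sketch pulls out the timestamp, the server status, and the value bollix47 reads as net load. Treating the eighth field (index 7) as net load is an assumption: it rests on the 300/301/300 values matching the post, not on a documented serverstat column layout. The `parse_row` and `StatRow` names are hypothetical. The sample rows are the leading fields of the rows above.

```python
from typing import NamedTuple


class StatRow(NamedTuple):
    timestamp: str
    status: str
    net_load: int  # assumption: 8th tab-separated field (index 7)


def parse_row(row: str) -> StatRow:
    """Split one serverstat row on tabs; the field positions used here are assumptions."""
    fields = [f.strip() for f in row.split("\t")]
    return StatRow(timestamp=fields[0], status=fields[5], net_load=int(fields[7]))


rows = [
    "Sun Apr 5 05:00:20 PDT 2009 \tclassic \tvsp21v \tkasson \tfull \tAccepting \t0.10 \t300 \t0",
    "Sun Apr 5 05:10:20 PDT 2009 \tclassic \tvsp21v \tkasson \tfull \tAccepting \t0.04 \t301 \t1",
    "Sun Apr 5 05:20:20 PDT 2009 \tclassic \tvsp21v \tkasson \tfull \tAccepting \t0.00 \t300 \t0",
]

for r in rows:
    s = parse_row(r)
    print(s.timestamp, s.status, s.net_load)
```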

Re: and Down?

Posted: Thu Apr 09, 2009 9:08 am
by kiore
Getting 503s on this one. in the queue.