RESOLVED: 171.64.65.103 & 171.64.65.64 both in Reject
Posted: Mon Jun 06, 2011 5:52 pm
by noorman
Both of these GPU servers are in Reject, as stated in the title ...
Is anyone checking the server stats or getting reports on failing servers?
.
Re: RESOLVED: 171.64.65.103 & 171.64.65.64 both in Reject
Posted: Mon Jun 06, 2011 6:13 pm
by noorman
Does anyone know why some servers go into 'Reject' mode?
.
Re: RESOLVED: 171.64.65.103 & 171.64.65.64 both in Reject
Posted: Mon Jun 06, 2011 6:45 pm
by 7im
noorman wrote: Both of these GPU servers are in Reject, as stated in the title ...
Is anyone checking the server stats or getting reports on failing servers?
.
You are asking very broad questions, which have simple and broad answers.
Yes, people do monitor the servers and get reports about them. Those people include members of both the Stanford IT staff and the Pande Group.
Asking a more specific question might help get a more direct answer.
Yes, someone does know. Are you asking who that someone is, or asking for a list of reasons, or both?
The common reasons are hardware failures (mostly RAID problems), networking issues, and the loading of new work units for a new project.
Re: RESOLVED: 171.64.65.103 & 171.64.65.64 both in Reject
Posted: Mon Jun 06, 2011 6:56 pm
by noorman
I guess the latter will be the most common answer ...
You seem able enough to read between the lines, so to speak.
The answer that tells me why they do it is much more important than who knows why. I think I made it clear that I didn't know, just by asking the question.
The 'automatism' that was brought in to 're-boot' (or whatever) servers if they misbehave seems to have worked.
I don't think that 're-loading' two servers would finish at almost exactly the same time. Both servers were back to normal when I checked just after I typed out this thread.
So I changed the title to reflect that, then and there.
.
Re: RESOLVED: 171.64.65.103 & 171.64.65.64 both in Reject
Posted: Mon Jun 06, 2011 7:03 pm
by 7im
Please note that Stanford is virtualizing servers, so multiple "servers" can go down if one hardware box goes offline. Probably not the case here, but it does happen.
Also, as these are both GPU servers, they were likely taken offline at the same time for a specific reason.
Re: RESOLVED: 171.64.65.103 & 171.64.65.64 both in Reject
Posted: Mon Jun 06, 2011 7:30 pm
by noorman
I didn't know that a server running in a virtual environment alongside another can no longer be taken offline on its own ...
Anyway, the problem seems to be solved.
Re: RESOLVED: 171.64.65.103 & 171.64.65.64 both in Reject
Posted: Wed Jun 08, 2011 1:37 am
by bruce
Of course virtual servers can be taken off-line, either intentionally or due to certain specific hardware or software failures. I think the point is that multiple virtual servers can all be taken off-line simultaneously by general hardware failures, too.
No matter how reliable server hardware has become, it still goes off-line from time to time. The automatic notification systems sometimes work, the automatic reboot systems sometimes work, and posting a note here in this forum is productive in still other cases.
Re: RESOLVED: 171.64.65.103 & 171.64.65.64 both in Reject
Posted: Mon Jun 13, 2011 3:51 pm
by baz657
Problems uploading results again today.
Code:
[13:07:05] + Attempting to send results [June 13 13:07:05 UTC]
[13:07:05] Gpu type=3 species=21.
[13:07:06] - Couldn't send HTTP request to server
[13:07:06] + Could not connect to Work Server (results)
[13:07:06] (171.64.65.64:8080)
[13:07:06] + Retrying using alternative port
[13:07:08] - Couldn't send HTTP request to server
[13:07:08] + Could not connect to Work Server (results)
[13:07:08] (171.64.65.64:80)
[13:07:08] - Error: Could not transmit unit 09 (completed June 13) to work server.
[13:07:08] - Read packet limit of 540015616... Set to 524286976.
[13:07:08] + Attempting to send results [June 13 13:07:08 UTC]
[13:07:08] Gpu type=3 species=21.
[13:12:23] - Couldn't send HTTP request to server
[13:12:23] + Could not connect to Work Server (results)
[13:12:23] (171.67.108.26:8080)
[13:12:23] + Retrying using alternative port
[13:12:25] - Couldn't send HTTP request to server
[13:12:25] + Could not connect to Work Server (results)
[13:12:25] (171.67.108.26:80)
[13:12:25] Could not transmit unit 09 to Collection server; keeping in queue.
[13:12:25] + Closed connections
and
Code:
[15:47:48] Sending work to server
[15:47:48] Project: 11245 (Run 2, Clone 65, Gen 10)
[15:47:48] - Read packet limit of 540015616... Set to 524286976.
[15:47:48] + Attempting to send results [June 13 15:47:48 UTC]
[15:47:48] Gpu type=3 species=21.
[15:48:14] + Results successfully sent
[15:48:14] Thank you for your contribution to Folding@Home.
[15:48:14] + Number of Units Completed: 388
[15:48:20] Project: 6801 (Run 3210, Clone 4, Gen 19)
[15:48:20] - Read packet limit of 540015616... Set to 524286976.
[15:48:20] + Attempting to send results [June 13 15:48:20 UTC]
[15:48:20] Gpu type=3 species=21.
[15:48:22] - Couldn't send HTTP request to server
[15:48:22] + Could not connect to Work Server (results)
[15:48:22] (171.64.65.64:8080)
[15:48:22] + Retrying using alternative port
[15:48:24] - Couldn't send HTTP request to server
[15:48:24] + Could not connect to Work Server (results)
[15:48:24] (171.64.65.64:80)
[15:48:24] - Error: Could not transmit unit 09 (completed June 13) to work server.
[15:48:24] - Read packet limit of 540015616... Set to 524286976.
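For anyone who wants to summarize which servers a client log complains about, here is a minimal sketch in Python. It assumes the log format shown in the excerpts above, where a "Could not connect to Work Server" line is followed by a line holding the address in parentheses. The function names `failed_endpoints` and `failures_per_server` are made up for this illustration and are not part of the Folding@Home client.

```python
import re
from collections import Counter

# Matches the address line the client prints after a failed connection,
# e.g. "[13:07:06] (171.64.65.64:8080)". The format is assumed from the
# log excerpts in this thread.
ENDPOINT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\]\s+\((\d{1,3}(?:\.\d{1,3}){3}):(\d+)\)\s*$"
)
FAILURE_MARKER = "Could not connect to Work Server"


def failed_endpoints(log_text):
    """Return a list of (time, ip, port) for each failed connection."""
    failures = []
    previous_failed = False
    for line in log_text.splitlines():
        match = ENDPOINT_RE.match(line.strip())
        # Only count an address line that directly follows a failure line.
        if match and previous_failed:
            stamp, ip, port = match.groups()
            failures.append((stamp, ip, int(port)))
        previous_failed = FAILURE_MARKER in line
    return failures


def failures_per_server(log_text):
    """Count failed connection attempts per server IP (hypothetical helper)."""
    return Counter(ip for _, ip, _ in failed_endpoints(log_text))
```

Reading the client log file into a string and passing it to `failures_per_server` gives a per-IP count of failed attempts, which is easier to compare against the server status page than scrolling through the raw log.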
Re: RESOLVED: 171.64.65.103 & 171.64.65.64 both in Reject
Posted: Mon Jun 13, 2011 4:53 pm
by noorman
171.64.65.64 and 171.67.108.26 are both back up ...