Is 128.252.203.10 configured correctly?

Posted: Wed May 06, 2020 7:16 pm
by BobHehmann
Obviously there are some unresolved ongoing technical problems with Work Server 128.252.203.10. Understood. However, while waiting for hours to upload a completed GPU Work Unit back to this server [WU 11759(0, 5680, 59)], I noticed that my FAH control panel indicates this WU has no Collection Server: IP 0.0.0.0. (Other WUs show explicit alternate IP addresses for their Collection Servers, as I expect.) I take this to mean that this completed WU can only be returned to its Work Server, and that there is no fail-over collection point. But the Server Stats page does show this Work Server as having a Collection Server configured. Is something misconfigured with the server, or perhaps with the WU family?

Second thing I noticed - the Server Stats page column for uptime is rather misleading, or at least is not calculated in a way that correlates with outside user perception. This troubled server has obviously been restarted multiple times today - and after each implied restart, uptime starts incrementing in step with the wall clock. However, the column for time of last contact just sits - implying that the server "hung" quickly, and is going nowhere. Presumably, with the developing backlog, contacts would come nearly continuously, so last contact time should closely track clock time as long as work is flowing. On many of my critical (monitored) datacenter servers, this is one of the first levels of health monitoring - does a monitored server respond to the monitor's regular polling? If not, raise an alert. Lacking such monitoring infrastructure, lack of progress in the time of last contact would seem to be one effective proxy for a failed server (especially with so many users trolling for work!)
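
To make the last-contact idea concrete, here is a minimal sketch of the check I have in mind. The function name, the 15-minute threshold, and the timestamps are all made up for illustration; I have no knowledge of how the Server Stats page is actually implemented.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: how long a busy server may go without contact
# before we treat it as suspect. The right value is an assumption.
MAX_SILENCE = timedelta(minutes=15)


def looks_stalled(last_contact, now, max_silence=MAX_SILENCE):
    """Return True when the last contact is older than max_silence."""
    return (now - last_contact) > max_silence


# Illustrative timestamps only, not real server data.
now = datetime(2020, 5, 6, 19, 16)
print(looks_stalled(datetime(2020, 5, 6, 15, 0), now))   # silent for hours
print(looks_stalled(datetime(2020, 5, 6, 19, 10), now))  # contacted recently
```

A check like this ignores uptime entirely, which is the point: a server can be "up" per the restart clock and still not be doing any work.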

Anyway, best wishes to the folks on the front lines, and thanks for providing a way for us to help with the science!

Cheers, Bob

Re: Is 128.252.203.10 configured correctly?

Posted: Wed May 06, 2020 7:35 pm
by Joe_H
A CS may not have been set up for the WS, Project, or WU at the time it was downloaded to your system. If a CS was added later, the change is not retroactive.
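
As a rough illustration of that point, the sketch below models a WU that records its CS address once, at download time, and treats 0.0.0.0 as "no CS assigned". This is not the client's actual code; the class, the function, and the 192.0.2.1 placeholder address are hypothetical.

```python
from dataclasses import dataclass

# Assumption: 0.0.0.0 is shown when no Collection Server was assigned.
NO_CS = "0.0.0.0"


@dataclass(frozen=True)  # frozen: the addresses are fixed once downloaded
class Assignment:
    work_server: str
    collection_server: str = NO_CS


def upload_targets(assignment):
    """List the servers a finished WU could be returned to, WS first."""
    targets = [assignment.work_server]
    if assignment.collection_server != NO_CS:
        targets.append(assignment.collection_server)
    return targets


# WU downloaded before any CS was configured: only the WS is a target.
print(upload_targets(Assignment("128.252.203.10")))
# WU downloaded after a CS was added (192.0.2.1 is a placeholder address).
print(upload_targets(Assignment("128.252.203.10", "192.0.2.1")))
```

Because the record is frozen, configuring a CS on the server afterward does nothing for WUs that were already handed out.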

Projects on a single WS may have different CS addresses assigned, though that is less common currently.

Last contact time would be to the server managing the Server Status page; a freshly rebooted WS would "check in". But later contacts might get lost or delayed if the WS is swamped with network requests for downloads and uploads.

Additional information has been posted by the person managing this server in other topics.