Is 128.252.203.10 configured correctly?
Posted: Wed May 06, 2020 7:16 pm
Obviously there are some unresolved, ongoing technical problems with Work Server 128.252.203.10. Understood. However, while waiting for hours to upload a completed GPU Work Unit to this server [WU 11759(0, 5680, 59)], I noticed that my FAH control panel shows no Collection Server for this WU: the IP is 0.0.0.0. (Other WUs show explicit alternate IP addresses for their Collection Servers, as I would expect.) I take this to mean that this completed WU can only be returned to its Work Server, with no fail-over collection point. But the Server Stats page does show this Work Server as having a Collection Server configured. Is something misconfigured on the server, or perhaps in the WU family?
The second thing I noticed: the uptime column on the Server Stats page is rather misleading, or at least is not calculated in a way that matches what outside users see. This troubled server has obviously been restarted multiple times today, and after each implied restart, uptime starts incrementing in step with the wall clock. However, the time-of-last-contact column just sits there, implying that the server "hung" quickly and is going nowhere. Presumably, with the developing backlog, contacts would arrive nearly continuously, so the last contact time should closely track clock time as long as work is flowing. On many of my critical (monitored) datacenter servers, that is one of the first levels of health monitoring: does a monitored server respond to the monitor's regular polling? If not, raise an alert. Lacking such monitoring infrastructure, a last contact time that stops advancing would seem an effective proxy for a failed server (especially with so many users trolling for work!).
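To make the idea concrete, here is a rough sketch of the kind of check I have in mind. It is only an illustration: the function name, its parameters, and the 10-minute threshold are my own assumptions, not anything taken from the actual Server Stats page or FAH server code.

```python
from datetime import datetime, timedelta

# Hypothetical helper: flags a server whose uptime keeps growing while its
# last-contact time has stopped advancing. The threshold is an assumed value.
def looks_stalled(last_contact, now, uptime, max_silence=timedelta(minutes=10)):
    """Return True if the server has been up longer than max_silence
    but has had no contact within the last max_silence."""
    silence = now - last_contact
    # A freshly restarted server gets a grace period equal to max_silence.
    return uptime > max_silence and silence > max_silence


now = datetime(2020, 5, 6, 19, 16)
# Up for 3 hours, last contact 2 hours ago: looks hung.
print(looks_stalled(now - timedelta(hours=2), now, timedelta(hours=3)))
# Up for 3 hours, last contact 1 minute ago: looks healthy.
print(looks_stalled(now - timedelta(minutes=1), now, timedelta(hours=3)))
```

The grace period after a restart is there so that a server which has only just come back up is not flagged before clients have had a chance to reach it.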
Anyway, best wishes to the folks on the front lines, and thanks for providing a way for us to help with the science!
Cheers, Bob