Hi all, thanks for the alerts on 128.252.203.11. As some have noted, 128.252.203.11 (highland1) was under high load while we were trying to generate many more CPU WUs to push out to folders. This ran from the end of last week into the weekend and likely caused the slow download/upload speeds people saw during that period. At some point over the weekend, we also unexpectedly ran out of storage on the server partition that houses our log files, which led to errors in assigning points. At that time I turned off highland1's ability to assign new jobs (only allowing returns) so that we could resolve any remaining server issues. As of Tuesday we believed we had solved all errors, so we reopened jobs for assignment. To try to limit future errors on highland1, I set a fairly low assignment rate (2-4 jobs/s). I think the problem that we're currently seeing (as highlighted by mgetz):
mgetz wrote:
As of posting this serverstats is showing no GPU work units at this time. They do seem to be bursting occasional failed units back out to be retried but that's all I'm seeing right now.
is that we're actually facing a GPU WU shortage due to all of the generosity in donating GPUs to the COVID Moonshot project. We're trying to bring more GPU and CPU projects online now, but in a way that doesn't overwhelm our work servers at the same time. We're currently setting up some additional work servers on our end so that we can open these jobs up for folders.
Please continue to let us know about errors. I'm doing my best to chase down each of the issues with the servers involved with my project series (182XX) and will continue to let other folks know about other issues as they arise!