Hello,
I have folded in the past and my interest was rekindled after seeing the requests for help with COVID-19. I downloaded the FaH client and it happily downloaded some work and started on completing it.
It subsequently failed to download any work, so I checked these forums and discovered the servers were overwhelmed by the response. I left the client alone and assumed it would download more work when the servers were able to deliver it.
I checked the client some hours later and found that the wait time before the next download attempt was over 5 hours. I was curious about it, so I manually paused the client and then resumed it, whereupon it downloaded two work units and started to work on them. I believe this is a missed opportunity: the client waits an excessive amount of time between attempts to download work units, because the wait appears to grow exponentially with each failed attempt. This means that, once the servers are back online, CPUs and GPUs sit idle without work because of an arbitrary wait time set in the client.
I believe a better approach would be to increase the wait times as attempts fail, as is current behaviour, but to set a more reasonable maximum wait time between attempts so that hours are not wasted, for example 30 minutes. (I realise I do not have visibility of the load on the servers, but perhaps FaH staff can use that information and my suggestion to find a realistic 'maximum' wait time between attempts that would be more productive.)
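To make the suggestion concrete, here is a minimal sketch of a capped backoff in Python. It is not the FaH client's actual code, which I have not seen. The function name, the 1-minute starting wait, the doubling factor and the 30-minute cap are all assumptions chosen for illustration.

```python
def next_wait_seconds(failed_attempts, base=60.0, factor=2.0, cap=30 * 60.0):
    """Return the wait before the next download attempt, in seconds.

    Hypothetical helper: the wait grows by `factor` after every failed
    attempt (as the client appears to do today) but never exceeds `cap`.
    """
    if failed_attempts < 1:
        return 0.0  # nothing has failed yet, so try immediately
    # Clamp the exponent so a very long outage cannot overflow the float.
    exponent = min(failed_attempts - 1, 64)
    return min(cap, base * factor ** exponent)


# Show how the wait develops over ten consecutive failures.
for attempt in range(1, 11):
    print(attempt, next_wait_seconds(attempt) / 60.0, "min")
```

With these assumed values the wait doubles from 1 minute up to 16 minutes over the first five failures and then stays at 30 minutes, rather than growing to several hours. The right cap depends on server load, which only FaH staff can judge.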
Thanks for creating this client and for giving us the chance to help in this crisis (and beyond).
Next attempt... X Hours (suggestion)
Re: Next attempt... X Hours (suggestion)
It is always difficult to design a strategy for retries that is right in all circumstances. The F@H algorithm for retries is designed to prevent overloads when servers come back into service after an outage. It may not be ideal in the current circumstances, but we must work with what we have. I'm sure the team are learning from the experience. Expect a more robust F@H environment in the future, but big changes won't happen while the Coronavirus workload is being processed unless they offer a safe path to a large benefit in getting work done.