Re: Give more powerful clients the opportunity to race for W
Posted: Tue Apr 07, 2020 6:49 pm
by Neil-B
Once the dust settles there may be a focus on developing ways to rapidly expand/reduce capacity to deal with surges, which would mean future surges wouldn't have a similar impact. The team will be painfully aware of the cost of the current infrastructure provisioning approach (both in terms of the heroics they have had to perform and the WUs they missed that could have been processed had the infrastructure been able to expand to meet demand), and I can't believe they won't look seriously at this (once they have recovered) … Better to spend resources on ensuring future capacity can meet demand than on developing a whole new prioritisation system that shouldn't then be necessary?
Re: Give more powerful clients the opportunity to race for W
Posted: Tue Apr 07, 2020 6:51 pm
by bruce
alxbelu wrote:(I'm not affiliated with the F@H Team, just promoting these channels for official updates)
Bugs which affect scientific progress will be queued by priority. Anything that can be classified as an "enhancement" will be queued behind other reports, and I can guarantee they won't be worked on during the COVID crisis.
Re: Give more powerful clients the opportunity to race for W
Posted: Tue Apr 07, 2020 6:52 pm
by NoMoreQuarantine
alxbelu wrote:On the topic of allocating WUs, I guess there are already WU benchmarks that could act as a first estimate of expected performance and could easily be tied to a passkey; i.e. when requesting work, the server could know what kind of performance to expect from the client. After this, the expected performance could be set to the average of completed WUs. I guess there may be other unique IDs one could use to tie performance metrics to as well, but those might be easier to spoof/harder to control as a user, etc. This would also account for how each specific machine behaves IRL, rather than the theoretical performance of individual components (including how the donor uses the machine, e.g. letting it run 24/7 or only a few hours per day, loading it with other heavy compute tasks, etc).
Another option would be to have the clients perform a quick representative calculation (time to complete various floating point operations) and use the total time as the metric, or even keep each operation time as a separate metric; that information could be used to optimize tasks even further if one were so inclined.
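As a purely illustrative sketch, this is roughly what such a client-side micro-benchmark could look like (the operation mix, loop sizes, and function name are my own assumptions, not anything that exists in the FAH client):

Code:
import math
import time

# Hypothetical micro-benchmark: time a few representative floating point
# workloads and report both the per-operation times and the total.
def run_benchmark(n=1_000_000):
    timings = {}

    start = time.perf_counter()
    acc = 0.0
    for i in range(n):
        acc += i * 1.000001              # multiply-accumulate
    timings["mul_add"] = time.perf_counter() - start

    start = time.perf_counter()
    acc = 1.0
    for _ in range(n):
        acc = acc / 1.000001             # division
    timings["div"] = time.perf_counter() - start

    start = time.perf_counter()
    for i in range(1, n):
        math.sqrt(float(i))              # square roots
    timings["sqrt"] = time.perf_counter() - start

    timings["total"] = sum(timings.values())
    return timings

# The client could report just timings["total"] as a single score, or the
# whole dict so the server can match WUs to the machine's operation mix.
print(run_benchmark())

A toy loop like this mostly measures loop overhead rather than real FPU throughput, which is exactly the "keeping it representative" problem toTOW mentions later in the thread.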
alxbelu wrote:On the topic of sending out WUs to multiple clients; why not simply send a command to stop the slower client and have it upload the current WU state to the faster client? If the transfer fails/the faster client cannot start, the previous client continues, else it clears the folding slot and requests a new WU, where WUs would then be assigned/reassigned in terms of priority.
I really like this idea. I would suggest that if the transfer fails, it might be time to make sure the "slow" client is still online haha. Also, in some cases the slow client may not be far enough along for there to be any value in transferring, in which case it would just be told to drop the WU and start on the new one immediately.
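Very roughly, the server-side handoff decision being described here might look like the sketch below; the function, the enum, and the 10% progress threshold are all hypothetical, nothing like this exists in the current client or server.

Code:
from enum import Enum

class Action(Enum):
    KEEP_ON_SLOW = "slow client keeps folding"
    RESTART_ON_FAST = "fast client starts the WU from scratch"
    TRANSFER_TO_FAST = "slow client uploads its state to the fast client"

# Below this completion fraction, restarting is assumed cheaper than transferring.
MIN_PROGRESS_WORTH_TRANSFER = 0.10   # arbitrary illustrative threshold

def handoff_decision(slow_progress, transfer_possible):
    """slow_progress: last known completion fraction, or None if the slow
    client is unresponsive; transfer_possible: whether the state upload to
    the faster client is expected to succeed."""
    if slow_progress is None:
        return Action.RESTART_ON_FAST    # slow client may be offline: duplicate the WU
    if slow_progress < MIN_PROGRESS_WORTH_TRANSFER:
        return Action.RESTART_ON_FAST    # not far enough along to be worth transferring
    if transfer_possible:
        return Action.TRANSFER_TO_FAST   # move the checkpoint, free the slow client's slot
    return Action.KEEP_ON_SLOW           # transfer failed, carry on as before

print(handoff_decision(0.42, transfer_possible=True))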
alxbelu wrote:I don't think it's unreasonable to expect future similar situations; i.e. we may well have new pandemics, at a time when interest for F@H has diminished to "normal" levels but suddenly gains a similar explosion of new donors as these weeks.
Agreed. Also, I'm beginning to think this idea might be useful to prevent bottlenecks in high priority projects during normal operations. I may be biased though
Re: Give more powerful clients the opportunity to race for W
Posted: Tue Apr 07, 2020 6:53 pm
by alxbelu
bruce wrote:The command to stop the slower client does not exist. FAH does not monitor your progress. (Maybe the slow machine was turned off for the night or something else.)
I know, this would definitely require some additional coding on both client and servers (which I know is not something likely to happen soon).
But in the case of an unresponsive machine, you could e.g. have logic such that if it is X levels below the now-available machine in performance, chances are the faster machine would finish the WU even if starting from scratch, and it could thus be delegated a duplicate.
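A back-of-the-envelope version of that check, assuming the server keeps a rough progress-per-hour estimate for each machine (the function and numbers below are hypothetical):

Code:
# Hypothetical check: would the faster machine likely finish sooner even
# starting from scratch? perf_* are completion-fraction-per-hour estimates,
# and progress_done is the last progress the server knows about.
def duplicate_beats_waiting(progress_done, perf_slow, perf_fast):
    time_if_slow_continues = (1.0 - progress_done) / perf_slow
    time_if_fast_restarts = 1.0 / perf_fast
    return time_if_fast_restarts < time_if_slow_continues

# e.g. slow client 30% done at 1%/hour vs. a machine ten times faster:
# 70 hours remaining vs. 10 hours from scratch -> send the duplicate.
print(duplicate_beats_waiting(0.30, 0.01, 0.10))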
For me, this is mostly stuff I would hope to see in place for future urgent situations.
Neil-B wrote:Better to spend resources on ensuring future capacity can meet demand than on developing a whole new prioritisation system that shouldn't then be necessary?
I mean, part of it is also the scientists not managing to produce new WUs quickly enough? Sure, I guess that process can probably be streamlined quite a bit, but it likely can't be scaled the way the physical infrastructure/hardware can, even in the future. The point of this system would be to optimize the return of certain projects in the shortest turnaround time possible.
Re: Give more powerful clients the opportunity to race for W
Posted: Tue Apr 07, 2020 7:07 pm
by NoMoreQuarantine
alxbelu wrote:I know, this would definitely require some additional coding on both client and servers (which I know is not something likely to happen soon).
Yeah, that's the reason I'm only proposing the idea and not offering to write it: it sounds hard
Seriously though, if they asked me to try to develop a fork I would give it my best effort even though I am only a lowly EE with rudimentary programming skills.
Re: Give more powerful clients the opportunity to race for W
Posted: Tue Apr 07, 2020 7:28 pm
by alxbelu
NoMoreQuarantine wrote:
Seriously though, if they asked me to try to develop a fork I would give it my best effort even though I am only a lowly EE with rudimentary programming skills.
They're having a "fireside developer chat" on Thursday; you could consider joining (from twitter):
Re: Give more powerful clients the opportunity to race for W
Posted: Tue Apr 07, 2020 7:34 pm
by NoMoreQuarantine
Thanks, I submitted a sign up request, so we'll see what happens.
Re: Give more powerful clients the opportunity to race for W
Posted: Wed Apr 08, 2020 12:55 pm
by toTOW
NoMoreQuarantine wrote:alxbelu wrote:On the topic of allocating WUs, I guess there are already WU benchmarks that could act as a first estimate of expected performance and could easily be tied to a passkey; i.e. when requesting work, the server could know what kind of performance to expect from the client. After this, the expected performance could be set to the average of completed WUs. I guess there may be other unique IDs one could use to tie performance metrics to as well, but those might be easier to spoof/harder to control as a user, etc. This would also account for how each specific machine behaves IRL, rather than the theoretical performance of individual components (including how the donor uses the machine, e.g. letting it run 24/7 or only a few hours per day, loading it with other heavy compute tasks, etc).
Another option would be to have the clients perform a quick representative calculation (time to complete various floating point operations) and use the total time as the metric, or even keep each operation time as a separate metric; that information could be used to optimize tasks even further if one were so inclined.
Such a thing existed in a previous version of the client (in v5 ... not sure about v6) ... but it was abandoned because it was too hard to keep it representative of the real computations, so it became useless.
Re: Give more powerful clients the opportunity to race for W
Posted: Wed Apr 08, 2020 5:28 pm
by NoMoreQuarantine
toTOW wrote:Such a thing existed in a previous version of the client (in v5 ... not sure about v6) ... but it was abandoned because it was too hard to keep it representative of the real computations, so it became useless.
Interesting. I would expect that on an unloaded system the calculations would be rather consistent, but thinking about it, several machines on the network vary in processing capacity over time (i.e. the donors are actually using their computers, the selfish puffins), plus the floating point operation speed of a machine will be very different when near 100% load vs 0% "load", and performing a single floating point calculation per processor would not be representative as most systems are multi-threaded. The metric would have to account for the processing time of each thread as well as how many of those threads are available on average.
How I imagine a metric could be implemented: for each operation, on each thread, record operations per second. Over time, keep track of thread usage; if CPU thread 0 is in use 79% of the time, assume thread 0's contribution equals ops/sec * (1 - 0.79). Add all the per-thread contributions together to get a nice clean FLOPS metric to send to the server. Treat downtime as equivalent to 100% usage on all threads. I am not sure what timescale would be reasonable for the usage calculation; maybe one could be derived from existing usage statistics.
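A minimal sketch of that arithmetic (the function and example numbers are mine, purely to illustrate the weighting described above):

Code:
# Hypothetical usage-weighted FLOPS estimate.
# ops_per_sec[i] : benchmarked floating point throughput of thread i
# usage_frac[i]  : fraction of time thread i is busy with other work
#                  (downtime counted as usage 1.0, so it contributes nothing)
def estimated_available_flops(ops_per_sec, usage_frac):
    total = 0.0
    for ops, used in zip(ops_per_sec, usage_frac):
        total += ops * (1.0 - used)   # e.g. 79% busy -> 21% of that thread's throughput
    return total

# Example: four threads at 2 GFLOPS each with varying background usage.
print(estimated_available_flops(
    [2e9, 2e9, 2e9, 2e9],
    [0.79, 0.10, 0.50, 1.00],         # the last thread is effectively unavailable
))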
If server load ever becomes less of an issue, this could be optimized even further by sending the time for each type of operation, allowing tasks and machines to be matched more precisely (maybe even automatically tailoring Generations to be optimized for the available resources). One way to reduce server load might be to push some of that load onto volunteer clients that keep track of system status in a public log (like in blockchains). I'm just spitballing at this point; this is not a proposal.