Folding Forum

Posted: **Tue Apr 07, 2020 3:06 pm**

Right now my CPU & GPU are idling and unable to get any WUs. Looking at the server list, it looks like there are several servers with no jobs. I assume this means that all available WUs have been handed out and the servers are waiting for new jobs to distribute to the clients. My machine is relatively powerful and I would not be surprised if it could process WUs faster than some of the machines currently working on them. In the interest of time, why not give machines like mine the opportunity to "race" for the WUs that are already being processed by slower machines? Ideally, the system could calculate an estimated-time-to-complete for each task by each machine and when a machine needs work, and there isn't any readily available, work could be reassigned (or raced for if the estimate is close) based on that calculation. Hypothetically, this would be faster, but less energy efficient.

I am making a lot of assumptions here as I am not familiar with the technical workings of the system. Please forgive me if this is nonsense and thank you for your time.
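For anyone who wants to see the decision rule spelled out, here is a minimal sketch of the estimate-and-compare idea. Everything in it is hypothetical: the `Assignment` record, the `decide` function, and the 20% `race_margin` are made up for illustration. It also assumes the server knows each client's speed and current progress, which is an assumption of the sketch rather than a description of how FAH works.

```python
from dataclasses import dataclass


@dataclass
class Assignment:
    """A WU already running on some client (hypothetical record)."""
    wu_id: str
    total_work: float      # abstract amount of work in the whole WU
    done_fraction: float   # progress so far, 0.0 to 1.0 (assumed known)
    client_speed: float    # work per hour of the client currently folding it


def hours_remaining(a: Assignment) -> float:
    """Estimated hours until the current client finishes the WU."""
    return a.total_work * (1.0 - a.done_fraction) / a.client_speed


def decide(a: Assignment, idle_speed: float, race_margin: float = 0.2) -> str:
    """Return 'reassign', 'race', or 'leave' for an idle client.

    The idle client starts from scratch, so its estimate covers the whole WU.
    race_margin is an arbitrary tolerance chosen for illustration.
    """
    current_eta = hours_remaining(a)
    idle_eta = a.total_work / idle_speed
    if idle_eta < current_eta * (1.0 - race_margin):
        return "reassign"   # idle client is clearly faster
    if idle_eta < current_eta * (1.0 + race_margin):
        return "race"       # too close to call, let both run
    return "leave"          # current client will finish first anyway
```

Note that a slow client that is already 90% done usually wins against a fast client starting from zero, so the rule only fires for WUs that are early in their run.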

Posted: **Tue Apr 07, 2020 5:23 pm**

That would require a major overhaul of the server software, and keeping track of what hardware a unit was issued to would be a huge pain in the butt, I think. The servers are all loaded to the max as it is and that would slow them down even more. Also, the project tries not to repeat work; lots of work would be getting done twice for no real reason other than to keep more computers busy. Seems like a huge waste of resources to me.

Posted: **Tue Apr 07, 2020 5:27 pm**

I was under the impression that a single WU is usually given to more than 1 client, as an accuracy-checking mechanism so they can compare that the results match. Can't remember where I read that from though.

Posted: **Tue Apr 07, 2020 5:38 pm**

Normally a WU is only done once (if completed before Timeout) … If it fails then iirc it is released to 3 more folders at once … If I look back at the last 200 WUs my kit has folded, only 2 have been folded more than once - one my machine Faulted on (too many cores), the other someone else faulted and I got it.

Edit - Update relevant to above:

Whilst I had seen a post indicating a faulty WU was released to 3 more folders - that may not have been "at once" - actually my logs (and a nudge from someone who probably knows better) imply it might only be to one other folder - it may be that if the WU fails again it is then released again, and again, possibly therefore to a total of 4 failures maybe?

Posted: **Tue Apr 07, 2020 5:41 pm**

Rel25917 I believe the servers already keep track of machine capabilities by having the client self-report them as necessary. Repeating work is fine if doing so completes the task faster; that's literally what my proposal is, definitely not "keep more computers busy".

Thanks for the clarification Neil-B. That is a different feature than the one I am describing, but it could complicate its implementation a bit.

Posted: **Tue Apr 07, 2020 5:52 pm**

If it didn't waste a bunch of resources, I could actually to some extent go along with such a thing, even though I have a much less capable system than a lot of people that do folding.

But a downside might be that fewer and fewer people would actually even bother folding when they have less powerful machines. Only the people on top of things with current hardware (which is ever changing) would fold. Which might be fine, even though it would bias things towards people with more financial resources to put towards folding. BUT then you have COVID-19, and no huge user base to do extra work that now exists. Essentially you cut off your available base for when all resources are really needed.

Being that anyone folding is essentially contributing some financial resources, you could compare it to saying that only contributions from those with the largest amount of resources are accepted.

I also can't even imagine the rage quits of people that took quite a bit of time, effort, and financial resources to find out that the others have upped the ante so much that they no longer get WUs.

Posted: **Tue Apr 07, 2020 5:59 pm**

Not getting WUs is simply a symptom of the issues caused by such a large influx of folders in such a short time … The team are working on improving the infrastructure to serve WUs to everyone who wants them … Once that is sorted there will be no need for such an approach?

Posted: **Tue Apr 07, 2020 6:04 pm**

BobWilliams757 wrote:If it didn't waste a bunch of resources, I could actually to some extent go along with such a thing, even though I have a much less capable system than a lot of people that do folding.

But a downside might be that fewer and fewer people would actually even bother folding when they have less powerful machines. Only the people on top of things with current hardware (which is ever changing) would fold. Which might be fine, even though it would bias things towards people with more financial resources to put towards folding. BUT then you have COVID-19, and no huge user base to do extra work that now exists. Essentially you cut off your available base for when all resources are really needed.

Fundamentally, if the system solves the same problems faster with the same amount of available resources, as it should under my proposal, the resources are not wasted. This proposal is for edge cases where a problem can be solved faster by reassigning it to a faster resource. That would only happen when the overall system load is sub-optimal. When the system is fully taxed, which it should be if those lazy scientists are doing their jobs, everyone will have work to go round.

(the previous is a joke in case you can't see my emoji. I love and admire the scientists who are working so hard to support this effort, you guys are amazing.)

BobWilliams757 wrote:I also can't even imagine the rage quits of people that took quite a bit of time, effort, and financial resources to find out that the others have upped the ante so much that they no longer get WUs.

In the case that a process gets reassigned, I don't see any reason they wouldn't give them credit for it. Heck, it would be a bonus, because they could get full credit for a partial job.

Posted: **Tue Apr 07, 2020 6:13 pm**

iceman1992 wrote:I was under the impression that a single WU is usually given to more than 1 client, as an accuracy-checking mechanism so they can compare that the results match. Can't remember where I read that from though.

You read that at BOINC/Seti@home, not at Folding@home. FAH was originally designed to be "lean and mean", doing its best to avoid the duplication of work. Error checking is done by other means. FAH does require that each WU be completed, so if a WU is, in fact, lost, a duplicate is sent out. It assumes that if a WU passes the Timeout, it must be lost ... which isn't really a guarantee that it's not sitting on a sleeping computer that will be awakened later, so duplicate WUs may, in fact, be completed, but as rarely as possible given the constraints.
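As a rough illustration of the "reissue only after the Timeout" rule described here, the check boils down to something like the sketch below. The `IssuedWU` record and the function names are hypothetical, and times are plain hours for simplicity; this is not the actual server code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IssuedWU:
    """Bookkeeping for one WU that was handed out (hypothetical record)."""
    wu_id: str
    issued_at: float                     # hours since some reference point
    timeout_hours: float                 # the WU's Timeout
    returned_at: Optional[float] = None  # None until a result comes back


def needs_duplicate(wu: IssuedWU, now: float) -> bool:
    """True if the WU passed its Timeout without a result being returned."""
    if wu.returned_at is not None:
        return False
    return now - wu.issued_at > wu.timeout_hours


def wus_to_reissue(issued: List[IssuedWU], now: float) -> List[str]:
    """IDs of WUs that are presumed lost and should be sent out again."""
    return [wu.wu_id for wu in issued if needs_duplicate(wu, now)]
```

Nothing in this check looks at how fast the original client is or how far along it is. The only inputs are "was it returned" and "has the Timeout passed", which is why a sleeping computer can still produce a late duplicate.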

Posted: **Tue Apr 07, 2020 6:13 pm**

Neil-B wrote:Not getting WUs is simply a symptom of the issues caused by such a large influx of folders in such a short time … The team are working on improving the infrastructure to serve WUs to everyone who wants them … Once that is sorted there will be no need for such an approach?

True, I am only describing a solution for an edge case wherein there is not enough work to go round and time is of the essence. Although, now that I think about it, this approach could also be implemented when there are high priority jobs on top of the stack and a faster machine becomes available. Reassign high priority job to fast machine > move slow machine down stack to get head start on future lower priority jobs.

Posted: **Tue Apr 07, 2020 6:17 pm**

I hope those "lazy scientists" spot your emoji !! … I might suggest that before creating a solution to what may be an ephemeral issue (at least in the big scheme) it might first be worth waiting until the dust settles to see whether a solution is actually needed … Personally I have no desire to have my kit warming the house simply to see if it can beat some other kit for little gain to science - happy for it to remain idling if servers are too busy to allocate WUs.

Of course the team could redeploy their singleton resource to redesign the WU allocation system, putting on hold work to expand and optimise the current infrastructure and speed up WU allocation - but I am not sure there would be as much gain to the science - even if people got warm fuzzy feelings cause they beat someone else to it and those lazy scientists could have a rest from generating new projects for a bit

Posted: **Tue Apr 07, 2020 6:33 pm**

NoMoreQuarantine wrote:Rel25917 I believe the servers already keep track of machine capabilities by having the client self-report them as necessary. Repeating work is fine if doing so completes the task faster; that's literally what my proposal is, definitely not "keep more computers busy".

FAHClient knows the hardware configuration and it includes that information when it's needed ... specifically to determine which of the available WUs should be assigned/not_assigned. It's a concise report. Maintaining a database of every client's hardware has been suggested before, and it's a lot of processing with almost zero improvement in scientific research results.

No preference is given to "fast" clients except that the QRB gives them more points than they would get if points were linear. If 100 WUs are assigned to 100 clients, some will be faster than others, but the result of even the slowest machine is a benefit to FAH. The other 99 can complete their assignments and move on to 99 more WUs while the slow machine is still working. The trajectory that the slow machine is working on moves ahead more slowly than it might have, but it's still faster than if that machine was not working.

That's not to say that FAH isn't appreciative of fast machines (please consider upgrading your hardware so the other 99 are faster), but adding one more machine (provided it can meet the deadlines) is always a benefit -- or at least it was until the surge of Donors who have responded to the COVID-19 research. FAH is continuing to upgrade the servers and build additional research projects to be distributed. It's my hope that we'll catch up with the recent surge of Donors.

Posted: **Tue Apr 07, 2020 6:37 pm**

I added additional comments to make it clear it was a joke (the joke being that I am just a basic lazy user calling the brilliant scientists, who actually do all the hard work, lazy).

As for gains to science, an improvement to the overall processing speed of the system would be worth the energy cost considering the relative value of the work achieved.

Posted: **Tue Apr 07, 2020 6:38 pm**

On the topic of allocating WUs, I guess there are already WU benchmarks that could act as a first result/expected performance and could easily be tied to a passkey; i.e. when requesting work the server could know what kind of performance to expect from the client. After this the expected performance could be set to the average of completed WUs. I guess there may be other unique IDs one could use to tie performance metrics to as well, but that might be easier to spoof/harder to control as a user etc. This would also account for how each specific machine behaves IRL, rather than theoretical performance of individual components (including how the donor uses the machine, e.g. letting it run 24/7 or only a few hours per day, loading it with other heavy compute tasks, etc).
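To make the "benchmark first, then average of completed WUs" idea concrete, here is a minimal sketch. The `ExpectedPerformance` class, its method names, and the use of "work per wall-clock hour" as the metric are all assumptions for illustration; nothing here reflects how the FAH servers actually store data.

```python
from typing import Dict, List


class ExpectedPerformance:
    """Expected rate per key (e.g. a passkey); purely illustrative."""

    def __init__(self) -> None:
        self._benchmark: Dict[str, float] = {}
        self._observed: Dict[str, List[float]] = {}

    def register(self, key: str, benchmark_rate: float) -> None:
        """Store the benchmark as the initial expectation for this key."""
        self._benchmark[key] = benchmark_rate
        self._observed.setdefault(key, [])

    def record_completed(self, key: str, work: float, wall_hours: float) -> None:
        """Record one finished WU using wall-clock time from issue to return."""
        self._observed.setdefault(key, []).append(work / wall_hours)

    def expected_rate(self, key: str) -> float:
        """Average of observed rates, or the benchmark if none exist yet."""
        rates = self._observed.get(key, [])
        if not rates:
            return self._benchmark[key]
        return sum(rates) / len(rates)
```

Because the rate uses wall-clock time, a machine that only folds a few hours per day ends up with a lower expected rate than its hardware alone would suggest, which is the "how it behaves IRL" point.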

On the topic of sending out WUs to multiple clients; why not simply send a command to stop the slower client and have it upload the current WU state to the faster client? If the transfer fails/the faster client cannot start, the previous client continues, else it clears the folding slot and requests a new WU, where WUs would then be assigned/reassigned in terms of priority.
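The handoff with fallback could be sketched roughly like this. It is a hypothetical protocol, not something the FAH client is known to support: `Slot`, `hand_off`, and the `transfer` callback (standing in for "upload the state and start on the faster client") are invented names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Slot:
    """One folding slot on a client (hypothetical)."""
    name: str
    wu_state: Optional[bytes] = None  # checkpoint of the WU in progress, if any


def hand_off(slow: Slot, fast: Slot, transfer: Callable[[bytes], bool]) -> bool:
    """Try to move the slow slot's WU to the fast slot.

    Returns True if the handoff happened. On any failure the slow slot keeps
    its WU and simply continues folding.
    """
    if slow.wu_state is None or fast.wu_state is not None:
        return False              # nothing to move, or fast slot is busy
    state = slow.wu_state
    if not transfer(state):
        return False              # transfer/start failed: slow client continues
    fast.wu_state = state
    slow.wu_state = None          # slot cleared; it would now request a new WU
    return True
```

The key property is that the slow slot is only cleared after the transfer has succeeded, so a failed transfer never loses the WU.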

I don't think it's unreasonable to expect future similar situations; i.e. we may well have new pandemics, at a time when interest for F@H has diminished to "normal" levels but suddenly gains a similar explosion of new donors as these weeks.

(edit: realizing there's already some form of UUID in place/tracked on the servers: CPUId = UserID+MachineID)

Posted: **Tue Apr 07, 2020 6:45 pm**

The command to stop the slower client does not exist, and for it to exist, the client would have to accept 2-way communications, which it does not. FAH makes no effort to monitor your progress. (Maybe the slow machine was turned off for the night or something else. In fact, that might explain why the machine is slow.)

FAH does keep track of the WUs that have been distributed and the only time it matters is if the WU is completed or if it passes the deadline without being returned, at which point FAH issues a duplicate WU.

Give more powerful clients the opportunity to race for WUs