Default using all CPUs seems flawed

Post by **Alex_Atkin** » Mon Jan 30, 2023 8:57 pm

How v8 handles resource allocation seems rather flawed. Leaving it at the default, I had a 3-core CPU job reach 66%, then it got a GPU job (probably Alzheimer's) which took over all resources, so the CPU-only job might never finish.

I realise I can reduce the CPU count to 1 and add a CPU-only virtual peer, but then I lose the ability to use all CPU cores for a GPU job that CAN make use of them.

I'm not really sure what the solution to this could be, though. Presumably the long-term goal is to have all GPU jobs utilise all CPU cores rather than differentiating between the two. But can this even be done effectively, given that a GPU's performance is so many times greater?

Is there an explanation somewhere for why this is better than the old system?

Post by **calxalot** » Mon Jan 30, 2023 10:28 pm

I believe there are currently no hybrid CPU + GPU cores.
The resource system was designed to use them in the future.

Yes, if you create separate resource groups for each GPU, you will have to reduce the available CPUs in the default group.
It may be best to just have the default group with all resources.

If you wish to change allocation among groups, it can be helpful to set everything to finishing and reconfigure after groups are paused.

To do only GPU folding, you might be able to set CPUs to 1 in each GPU resource group.
Set CPUs to zero in the default group.
I have not tried this because I'm just running macOS right now.

Post by **Alex_Atkin** » Mon Jan 30, 2023 11:09 pm

I understand that. What seemed to go wrong is that, for some reason, it picked a CPU job, then some time later a GPU one, and it paused the CPU job indefinitely, claiming there were insufficient resources.

Thinking back, maybe this was because, like past versions, it disables the GPU by default, and when I enabled it the GPU job took priority, leaving the CPU job no way to finish? Perhaps there needs to be some failsafe for this first-configuration scenario so the CPU job finishes before it pulls down a GPU job? Or ask if you want to enable the GPU before even looking for jobs at all?

Post by **calxalot** » Mon Jan 30, 2023 11:21 pm

The 8.1.11 client is supposed to allow the cpus for a WU to be changed without claiming insufficient resources.

It's possible the assignment had a minimum number of CPUs that was no longer met.

This sounds like a bug you might want to report.

https://github.com/FoldingAtHome/fah-cl ... tet/issues

Post by **calxalot** » Mon Jan 30, 2023 11:47 pm

Meanwhile, you might need to set everything to finishing for the situation to clear up. If so, that would be a separate bug.

It would be good to know if another GPU WU is started without running the stalled CPU WU.

Post by **Alex_Atkin** » Tue Jan 31, 2023 12:17 am

Since both were effectively in the same slot, I couldn't figure out how to pause the GPU unit to let the CPU unit finish, so I ended up clearing it. It never occurred to me that Finishing might do it. Worth noting for the future, as I hate dropping WUs.

Post by **Alex_Atkin** » Tue Jan 31, 2023 2:26 am

It's happened again, just letting F@H manage its own usage.

Post by **calxalot** » Tue Jan 31, 2023 3:35 am

I think a GPU WU assignment taking more than 1 CPU is a bug. It could be an assignment server bug or misconfiguration.

Please report it as a client issue.

Post by **Alex_Atkin** » Tue Jan 31, 2023 5:08 am

https://github.com/FoldingAtHome/fah-cl ... issues/106

Post by **Hou5e** » Tue Jan 31, 2023 11:40 am

Until it gets fixed, you could set up Resource Groups to get separate 'Slot'-like behavior for your CPU and GPU. See GitHub Issue #52 for more info about enabling it.

Post by **Alex_Atkin** » Tue Jan 31, 2023 11:05 pm

I tried that, and yes, it fixes that problem. But then the / peers are not reported in the initial JSON from the websocket connection, which breaks my ability to monitor my folding boxes, as I have no idea how to request the data for that virtual peer.

Obviously, long-term I need to learn how to use the websockets properly, but of course there will be no documentation until after the beta. Right now, with a dirty hack (constantly closing and reopening the websocket), I was able to include v8 alongside my v7 monitoring. Without that, I can't see if something is wrong.

Post by **calxalot** » Tue Jan 31, 2023 11:48 pm

You access the resource group peer with a separate websocket. Append the group name to the ws: URL. Example:

ws://127.0.0.1:7396/api/websocket/myrg
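For a monitoring script, the URL rule above could be sketched roughly like this. It is only a sketch: `group_ws_url` is a made-up helper name, the host and port are taken from the example URL, and two things are assumptions rather than documented behaviour: that the default group lives at the base path with nothing appended, and that a leading `/` on the group name should be stripped. Opening the connection itself is left out, since that needs a websocket client library.

```python
from urllib.parse import quote


def group_ws_url(group: str = "", host: str = "127.0.0.1", port: int = 7396) -> str:
    """Build the websocket URL for one resource group (hypothetical helper).

    An empty group name is ASSUMED to mean the default group, reached at the
    base path with nothing appended.
    """
    base = f"ws://{host}:{port}/api/websocket"
    # ASSUMPTION: tolerate names written with a leading slash, e.g. "/myrg".
    name = group.strip("/")
    if not name:
        return base
    # Percent-encode the name so spaces or other odd characters stay valid in a URL.
    return f"{base}/{quote(name)}"


if __name__ == "__main__":
    # One URL per group to monitor; each would get its own websocket connection.
    for g in ("", "myrg"):
        print(group_ws_url(g))
```

A monitor would then open one connection per URL, instead of expecting every group to show up in the data from the first connection.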

Post by **Alex_Atkin** » Tue Jan 31, 2023 11:53 pm

Thanks. It seems the problem will be fixed in the next release, so I will probably just leave it alone. It seems somewhat random, and as long as the CPU WU doesn't reach the timeout, it's just a slight delay (relative to the CPU WU timeouts) in getting the WU finished.

Folding Forum
