wu 14911 exceeding low utilization.

Posted: Wed Jan 13, 2021 6:49 pm
by cine.chris
Anyone else seeing issues with 14911?
I have a 2060s sitting at est 150932 PPD & dropping.
GPU is at 60 watts, 63% util.
Other 2060s is at 2.3M PPD on wu 17319.
This is a new setup, but it looked good... until this wu.
Not seeing any log errors.
In the viewer, it's a membrane structure, but it has chunks flying in & out, not normal.

Re: wu 14911 exceeding low utilization.

Posted: Wed Jan 13, 2021 8:37 pm
by cine.chris
I finished the other two slots, then deleted all the GPU slots, returning that wu.
Replaced the 2nd 2060s with a 2060KO.
Box is folding at 5.9M PPD.
Guess I'm superstitious, I can't put those 2060s together... shuffled cards.
Just hit my all-time high, 17.2M PPD.
17M is top100 territory, I'm liking that.

Re: wu 14911 exceeding low utilization.

Posted: Thu Jan 14, 2021 2:03 am
by BobWilliams757
Strange that it was running that low. It's not a really small atom count WU, so hard to say what would cause it.

I've only run one instance of that WU, and on a little 2400G APU it returned over half of what the 2060 returned. Strange.

Re: wu 14911 exceeding low utilization.

Posted: Thu Jan 14, 2021 12:13 pm
by PantherX
If you had the viewer open while it was folding, it would slow down the GPU folding, since resources are being taken away from the GPU to render the simulation. However, if you could provide the PRCG, we can ask the researcher for additional details to see what's happening.

BTW, just for general information, when you deleted that GPU Slot, the WU isn't "returned" but rather it is dumped which means that the Server will have to wait for it to time-out before being reassigned which would slow down science. I personally won't dump any WUs even if they are "slow" since it allows me to report it to the researcher to discover any issues with the simulation and either stop the trajectory or fix it up :)

Re: wu 14911 exceeding low utilization.

Posted: Thu Jan 14, 2021 12:36 pm
by cine.chris
PantherX wrote:BTW, just for general information, when you deleted that GPU Slot, the WU isn't "returned" but rather it is dumped which means that the Server will have to wait for it to time-out before being reassigned which would slow down science. I personally won't dump any WUs even if they are "slow" since it allows me to report it to the researcher to discover any issues with the simulation and either stop the trajectory or fix it up :)
Considering the MANY WU that I've lost due to power sags & outages, it would be a good & useful client 'feature' to track & report lost WU, whatever the reason. HFM tracks them. 20 yrs on, the client should track & report events that commonly could interfere with the completion of a WU.

Re: wu 14911 exceeding low utilization.

Posted: Thu Jan 14, 2021 12:45 pm
by PantherX
Yep, the primary developer is aware of this but not sure if/when it gets implemented.

When it comes to WU losses due to power outages, I do believe that FahCore should be able to handle them gracefully as long as there's a single valid checkpoint. Thus, it would prevent WU loss and you wouldn't have "wasted" your system resources due to an unexpected event that you can't control*

*One could use a fancy UPS but that's additional cost and resources that might not be considered "home" environment.

Re: wu 14911 exceeding low utilization.

Posted: Sun Jan 17, 2021 10:44 pm
by bruce
cine.chris wrote:Considering the MANY WU that I've lost due to power sags & outages, it would be a good & useful client 'feature' to track & report lost WU, whatever the reason. HFM tracks them. 20 yrs on, the client should track & report events that commonly could interfere with the completion of a WU.
The FAHClient does have a feature that attempts to report lost WUs but there are quite a number of different reasons a WU can be lost. Each of those reasons would probably need a different detection method to be coded. That makes the solution to the problem very complex.

Re: wu 14911 exceeding low utilization.

Posted: Mon Jan 18, 2021 12:44 am
by cine.chris
The procedure described above appeared to have sent it. It's very quick. Should the situation occur again, I'll check the logs.
As for the lost-power situation, I'll check the logs there too. All I get to see is a freshly loaded WU. The Windows clients seem to do better than the Linux ones, but that's an impression. When it occurs again, I'll take some notes. They might be on different circuits too.

Re: wu 14911 exceeding low utilization.

Posted: Wed Jan 27, 2021 4:15 pm
by cine.chris
A returned WU:
===============
15:23:04:FS01:Shutting core down
15:23:04:WARNING:WU00:Slot ID 0 no longer exists and there are no other matching slots, dumping
15:23:04:WU00:Sending unit results: id:00 state:SEND error:DUMPED project:13444 run:1247 clone:54 gen:1 core:0x22 unit:0x000000360000000100003484000004df
15:23:04:WU01:FS01:0xa8:WARNING:Console control signal 1 on PID 7020
15:23:04:WU01:FS01:0xa8:Exiting, please wait. . .
15:23:04:WU00:Connecting to 18.188.125.154:8080
15:23:04:WU00:Server responded WORK_ACK (400)
15:23:04:WU00:Cleaning up
========================
I'm consolidating platforms. A GTX1050 in my monitor system, which was a planned upgrade target, had started a WU, and it would hold up ~11M PPD in other gear that was moving and all set to finish.
As shuffling gear could lose work & WUs, I always finish slots that could be affected. This transition was across three systems.
An 8Hr WU was a problem. Halted, as described above, it was returned & ACK rc'vd.
And yes, as I recalled, it showed a SEND status in the GUI.
No WU were harmed in this transition.
cine.chris
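
For anyone who wants to pull the PRCG and the error state out of a log like the one above, here is a minimal Python sketch. It assumes the "Sending unit results" line looks exactly like the excerpt in this post; other client versions may format it differently. The names `parse_send_line` and `dumped_units` are made up for illustration and are not part of FAHClient or HFM.

```python
import re

# Matches the "Sending unit results" line format seen in the log excerpt above.
# Assumption: fields appear in this order; other client versions may differ.
SEND_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WU(?P<wu>\d+):Sending unit results: "
    r"id:(?P<id>\d+) state:(?P<state>\w+) error:(?P<error>\w+) "
    r"project:(?P<project>\d+) run:(?P<run>\d+) "
    r"clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def parse_send_line(line):
    """Return time, state, error and PRCG for a send line, or None if no match."""
    m = SEND_RE.match(line.strip())
    if m is None:
        return None
    prcg = tuple(int(m.group(k)) for k in ("project", "run", "clone", "gen"))
    return {
        "time": m.group("time"),
        "state": m.group("state"),
        "error": m.group("error"),
        "prcg": prcg,
    }


def dumped_units(lines):
    """Collect every send line whose error field is DUMPED."""
    found = []
    for line in lines:
        info = parse_send_line(line)
        if info is not None and info["error"] == "DUMPED":
            found.append(info)
    return found


if __name__ == "__main__":
    sample = [
        "15:23:04:FS01:Shutting core down",
        "15:23:04:WU00:Sending unit results: id:00 state:SEND error:DUMPED "
        "project:13444 run:1247 clone:54 gen:1 core:0x22 "
        "unit:0x000000360000000100003484000004df",
        "15:23:04:WU00:Server responded WORK_ACK (400)",
    ]
    for unit in dumped_units(sample):
        print(unit["time"], unit["error"], unit["prcg"])
```

Run against a saved log, this lists each dumped unit with its project, run, clone and gen, which is the information needed when reporting a WU to the researchers.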

Re: wu 14911 exceeding low utilization.

Posted: Wed Jan 27, 2021 4:20 pm
by bruce
cine.chris wrote:Halted, as described above, it was returned & ACK rc'vd.
And yes, as I recalled, it showed a SEND status in the GUI.
No WU were harmed in this transition.
I don't understand.

If the ACK was received, the server should have a record of the completed WU and FAHClient should have done the cleanup processing on the WU, in which case there would be no WU to try to move to a different slot.

Re: wu 14911 exceeding low utilization.

Posted: Wed Jan 27, 2021 4:34 pm
by cine.chris
bruce wrote:
cine.chris wrote:Halted, as described above, it was returned & ACK rc'vd.
And yes, as I recalled, it showed a SEND status in the GUI.
No WU were harmed in this transition.
I don't understand.

If the ACK was received, the server should have a record of the completed WU and FAHClient should have done the cleanup processing on the WU, in which case there would be no WU to try to move to a different slot.
Sorry if I'm not stating the situation clearly.
The WU wasn't completed. But, even interrupted, the WU was returned to the work server.
That was the question brought up earlier in the post... is it returned?
Pausing and removing the slot for the GTX1050 with the unfinished WU did return the WU, with a SEND status & an ACK in the log.

Re: wu 14911 exceeding low utilization.

Posted: Wed Jan 27, 2021 5:38 pm
by Knish
Yes, it was Returned. When a WU is sent before it is complete, such as when the client fails to see the slot, it is recorded as Dumped.
https://apps.foldingathome.org/wu#proje ... e=54&gen=1