Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
Posted: Thu Mar 12, 2020 10:26 pm
by jpalpant
Sounds like a reasonable step, I'll give it a try! That said, I think the fault is likely in my Kubernetes environment, not my image. I just tried running the same image used for my Pod directly with docker run, and found it was able to start and run quite well. Interestingly, I then stopped the manually-run container and re-launched the Kubernetes-managed container - it was able to run successfully for a few thousand steps on the same WU, before being INTERRUPTED. Once interrupted, the work unit was discarded and the next one was not able to start without being immediately interrupted. Very weird, but not entirely surprising.
Posted: Thu Mar 12, 2020 10:49 pm
by toTOW
I looked for Project: 11741 (Run 0, Clone 2360, Gen 1) in the WU DB and found no entry... could it be a bad WU?
Messing around with VMs and Docker is going to complicate debugging... don't count on me.
Posted: Thu Mar 12, 2020 11:09 pm
by jpalpant
You know, I feel very dumb, but I do have my system running now with Core22. I had a memory limit on the Pod, and though the container itself wasn't getting OOMKilled, I think the core might have been getting interrupted for that reason. I've removed the limit for now; the container is pretty memory-hungry, so it was a bad idea in the first place. I was also using the --cpu-usage argument, which I've removed for now. I'll poke around and see if I can reproduce with those settings. AND of course, I have new args for the WU: 23:02:16:WU01:FS01:0x22:Project: 11747 (Run 0, Clone 159, Gen 1) (this has changed several times before without resolving my issue, though).
@toTOW, is this work unit database accessible to users? I'd be curious to see it; then I could at least do that step myself the next time I mess something up.
Update: TIL apparently Kubernetes is smart enough to try to kill individual processes within a container that are using too much memory first, rather than killing the whole container. Since FAH is well-behaved, the core process terminates when it gets interrupted, but FAHClient itself does not, and neither does my container's main process. You can only detect the OOM via 1) kernel logs and 2) node exporter, if you use it. And now I know that INTERRUPTED message means exactly what it says on the tin.
https://github.com/kubernetes/kubernetes/issues/50632 (closed as working-as-designed)
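For anyone who wants to check for the same thing, here is a minimal Python sketch of detection route 1), scanning kernel log text for OOM-killer messages. The regex and the sample lines are assumptions for illustration: the exact message wording differs between kernel versions, and the sample was not captured from a real node.

```python
import re

# Assumed message shapes. Older kernels log "Kill process", newer ones
# "Killed process"; adjust the pattern to whatever your kernel prints.
OOM_RE = re.compile(
    r"(?:Out of memory|Memory cgroup out of memory): "
    r"Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) for each OOM-kill line found."""
    return [
        (int(m.group("pid")), m.group("name"))
        for m in OOM_RE.finditer(log_text)
    ]


# Hypothetical sample log, including the process name.
SAMPLE_LOG = """\
[12345.678] Memory cgroup out of memory: Kill process 4242 (FahCore_22) score 1500 or sacrifice child
[12400.001] eth0: link up
"""

if __name__ == "__main__":
    for pid, name in find_oom_kills(SAMPLE_LOG):
        print(f"OOM kill: pid={pid} name={name}")
```

In practice you would feed it the kernel log from the node itself rather than a sample string, since the container's own logs only show the INTERRUPTED message.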
Posted: Thu Mar 12, 2020 11:19 pm
by toTOW
Yes, it's available among other tools on this page:
https://apps.foldingathome.org/
Posted: Thu Mar 12, 2020 11:39 pm
by jpalpant
Awesome, thanks toTOW! Excited to get back to folding.
I'm currently working on 11747 (0, 159, 1) which also doesn't show up in the WU DB -
https://apps.foldingathome.org/wu#proje ... =159&gen=1 - weird.
Posted: Thu Mar 12, 2020 11:42 pm
by Joe_H
jpalpant wrote:Awesome, thanks toTOW! Excited to get back to folding.
I'm currently working on 11747 (0, 159, 1) which also doesn't show up in the WU DB -
https://apps.foldingathome.org/wu#proje ... =159&gen=1 - weird.
By "working on", do you mean currently processing? If so, the WU will not show up in the database. That search only covers WUs that have been completed and turned in. It can take an hour or two from the time a WU is turned in before it shows up there.
Posted: Fri Mar 13, 2020 1:58 am
by bruce
Also, there is a setting for the number of retries for a failed WU (set by the project owner, so I don't know what it is).
FAH does its best to run every PRCG once and only once ... except when it fails. Once a WU is successfully returned, it isn't sent out again and the trajectory moves on from PRCG to PRC(G+1).
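To make the PRCG bookkeeping concrete, here is a small Python sketch that pulls the Project/Run/Clone/Gen values out of a client log line like the one quoted earlier in the thread and computes the PRC(G+1) that would follow a successful return. The names `PRCG`, `parse_prcg`, and `next_gen` are made up for this illustration; this is not how the FAH servers actually implement it.

```python
import re
from typing import NamedTuple, Optional


class PRCG(NamedTuple):
    project: int
    run: int
    clone: int
    gen: int


# Matches the "Project: P (Run R, Clone C, Gen G)" part of a client log line.
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def parse_prcg(line: str) -> Optional[PRCG]:
    """Extract the PRCG tuple from a log line, or None if it has none."""
    m = PRCG_RE.search(line)
    if m is None:
        return None
    return PRCG(*(int(x) for x in m.groups()))


def next_gen(wu: PRCG) -> PRCG:
    """Same project, run, and clone; generation advanced by one."""
    return wu._replace(gen=wu.gen + 1)


if __name__ == "__main__":
    line = "23:02:16:WU01:FS01:0x22:Project: 11747 (Run 0, Clone 159, Gen 1)"
    wu = parse_prcg(line)
    print(wu)
    print(next_gen(wu))
```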
Posted: Fri Mar 13, 2020 2:29 am
by jpalpant
Joe_H wrote:By working on, do you mean currently processing? If so, the WU will not show up in the database.
bruce wrote:FAH does its best to run every PRCG once and only once ... except when it fails. Once a WU is successfully returned, it isn't sent out again and the trajectory moves on from PRCG to PRC(G+1)
Got it, I didn't realize that. Yes, I was looking at work units that were in-progress and not returned yet; I can see the WUs I've completed on that app. Very cool!
Posted: Sun Mar 15, 2020 12:16 am
by vnicolici
What exactly is the purpose of this project? I wanted to contribute to the COVID projects, and after an initial COVID work unit I got a unit from 11737 instead of COVID.
Posted: Sun Mar 15, 2020 4:16 pm
by jima13
foldy wrote:0x22 is more demanding on HW than 0x21. So maybe downclocking or power-limiting the failing GTX 1080 Ti could help.
About a day after this post I uninstalled and reinstalled FAH, and the system crashed twice while I was entering my passkey.
Then Win 10 started telling me to update. OK... done. The system wouldn't reboot. I left it shut down until a few minutes ago, and now it boots... so I managed to get my passkey in, and the GPU is back to 1.3k on 0x22... I'll check it again later to see if it's stable.