GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback thread
Moderators: Site Moderators, FAHC Science Team
Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
Sounds like a reasonable step, I'll give it a try! That said, I think the fault is likely in my Kubernetes environment, not my image. I just tried running the same image used for my Pod directly with docker run, and found it was able to start and run quite well. Interestingly, I then stopped the manually-run container and re-launched the Kubernetes-managed container - it was able to run successfully for a few thousand steps on the same WU, before being INTERRUPTED. Once interrupted, the work unit was discarded and the next one was not able to start without being immediately interrupted. Very weird, but not entirely surprising.
Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
I looked for Project: 11741 (Run 0, Clone 2360, Gen 1) in the WU DB, and I found no entry... could it be a bad WU?
Messing around with VMs and Docker is going to complicate the debugging process... don't count on me for that part.
Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
You know, I feel very dumb, but I do have my system running now with Core22. I had a memory limit on the Pod, and though the container itself wasn't getting OOMKilled, I think the core might have been getting interrupted for that reason. I've removed the limit for now; the container is pretty memory-hungry, so the limit was a bad idea in the first place. I was also using the --cpu-usage argument, which I've removed for now. I'll poke around and see if I can reproduce with those settings. AND of course, I have new args for the WU: 23:02:16:WU01:FS01:0x22:Project: 11747 (Run 0, Clone 159, Gen 1) (the WU has changed several times before without resolving my issue, though).
@toTOW, is this work unit database accessible to users? I'd be curious, then I could at least do that step myself next time I mess something up
Update: TIL apparently Kubernetes is smart enough to try to kill individual processes within a container that are using too much memory first, rather than killing the whole container. Since FAH is well-behaved, the core process terminates when it gets interrupted, but FAHClient itself does not, and neither does my container's main process. You can only detect the OOM via 1) kernel logs and 2) node exporter, if you use it. And now I know that INTERRUPTED message means exactly what it says on the tin. https://github.com/kubernetes/kubernetes/issues/50632 (closed as working-as-designed)
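For anyone else chasing this, here is a minimal sketch of the "kernel logs" check, assuming the kernel reports OOM kills with a line containing "Killed process <pid> (<name>)". The sample lines and the process name in them are illustrative, not copied from my node, and the exact wording can vary between kernel versions.

```python
import re

# Matches the OOM-killer report fragment "Killed process <pid> (<name>)".
# The surrounding text differs between kernel versions, so only this
# fragment is matched (an assumption, not a complete parser).
OOM_RE = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def find_oom_kills(log_lines):
    """Return a (pid, process_name) tuple for every OOM kill in the lines."""
    kills = []
    for line in log_lines:
        match = OOM_RE.search(line)
        if match:
            kills.append((int(match.group("pid")), match.group("name")))
    return kills


# Illustrative sample input; hypothetical values, not real log output.
SAMPLE = [
    "[1234.5] Memory cgroup out of memory: Killed process 4242 (FahCore_22) "
    "total-vm:1024kB, anon-rss:512kB",
    "[1235.0] eth0: link up",
]

if __name__ == "__main__":
    for pid, name in find_oom_kills(SAMPLE):
        print(f"OOM kill: pid={pid} process={name}")
```

To run it against a real node, the lines could come from the kernel log on that node (for example the output of dmesg) instead of SAMPLE. A hit on the core process while the container keeps running would match the INTERRUPTED behaviour described above.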
Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
Yes, it's available among other tools on this page: https://apps.foldingathome.org/
Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
Awesome, thanks toTOW! Excited to get back to folding.
I'm currently working on 11747 (0, 159, 1) which also doesn't show up in the WU DB - https://apps.foldingathome.org/wu#proje ... =159&gen=1 - weird.
Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
jpalpant wrote:Awesome, thanks toTOW! Excited to get back to folding. I'm currently working on 11747 (0, 159, 1) which also doesn't show up in the WU DB - https://apps.foldingathome.org/wu#proje ... =159&gen=1 - weird.
By "working on", do you mean currently processing? If so, the WU will not show up in the database. That search is only for WUs that have been completed and turned in. It can take an hour or two from the time a WU is turned in before it shows up there.
Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
Also, there is a setting for the number of retries for a failed WU (set by the project owner, so I don't know what it is).
FAH does its best to run every PRCG (Project, Run, Clone, Gen) once and only once... except when it fails. Once a WU is successfully returned, it isn't sent out again and the trajectory moves on from PRCG to PRC(G+1).
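As a rough illustration of the PRCG bookkeeping, the sketch below pulls the four numbers out of a client log line like the one quoted earlier in this thread and computes the PRC(G+1) that would follow a successful return. The PRCG class, parse_prcg, and next_gen are hypothetical names made up for this example; they are not part of FAHClient or any FAH tool.

```python
import re
from typing import NamedTuple, Optional


class PRCG(NamedTuple):
    """Project / Run / Clone / Gen identifiers of one work unit."""
    project: int
    run: int
    clone: int
    gen: int

    def next_gen(self) -> "PRCG":
        # After a successful return the trajectory continues at Gen + 1.
        return self._replace(gen=self.gen + 1)


# Matches "Project: 11747 (Run 0, Clone 159, Gen 1)" anywhere in a line.
PRCG_RE = re.compile(
    r"Project:\s*(\d+)\s*\(Run (\d+), Clone (\d+), Gen (\d+)\)"
)


def parse_prcg(line: str) -> Optional[PRCG]:
    """Return the PRCG found in a log line, or None if there is none."""
    match = PRCG_RE.search(line)
    if match is None:
        return None
    return PRCG(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    line = "23:02:16:WU01:FS01:0x22:Project: 11747 (Run 0, Clone 159, Gen 1)"
    wu = parse_prcg(line)
    print(wu)
    print(wu.next_gen())
```

The extracted numbers are the same ones the WU search on https://apps.foldingathome.org/ asks for.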
Posting FAH's log:
How to provide enough info to get helpful support.
Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
Joe_H wrote:By working on, do you mean currently processing? If so, the WU will not show up in the database.
Got it, I didn't realize that. Yes, I was looking at work units that were in-progress and not returned yet; I can see the WUs I've completed on that app. Very cool!
bruce wrote:FAH does its best to run every PRCG once and only once ... except when it fails. Once a WU is successfully returned, it isn't sent out again and the trajectory moves on from PRCG to PRC(G+1)
Re: GPU CORE22 0.0.2 coming to FAH - p11737-9 feedback threa
foldy wrote:0x22 is more demanding on HW than 0x21. So maybe downclocking or power-limiting the failing GTX 1080 Ti could help.
About a day after this post I uninstalled FAH and reinstalled it, and the system crashed twice while I was entering my passkey. Then Win 10 started telling me to update, ok... done. The system wouldn't reboot. I shut it down until a few minutes ago and now it boots... so I managed to get my passkey in and the GPU is back to 1.3k on 0x22... I'll check it again later to see if it's stable.