Corrupted / bad job 18237/1069/0/71 (failing for all users)

Moderators: Site Moderators, FAHC Science Team

jjmiller
Scientist
Posts: 139
Joined: Fri Apr 09, 2021 4:43 pm

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by jjmiller »

Hi Andre_Ti,

I do actually care quite a bit about the project, hence the resets of the error counts to try to get more data. As I mentioned above, we don't currently have a good way to selectively filter which RUN x CLONEs get their errors reset- I can either leave the error counts alone and get no data back, or reset all of the error counts and let the majority of the WUs run while a few bad WUs get reissued and fail out. While I am flooded with errors I don't have a good way of separating legitimate failures in a WU from failures that are tied up with how the 0x24 core and the fah-client are interacting. We do have a software developer working to fix the interaction between fah-client and 0x24- until that patch comes through, I am stuck manually removing simulations that have reached a bad state. I will happily do so whenever I am made aware of the failing WUs.
Andre_Ti
Posts: 35
Joined: Sat Mar 21, 2020 7:51 am

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Andre_Ti »

jjmiller wrote: Tue Nov 12, 2024 8:19 pm Hi Andre_Ti,

I do actually care quite a bit about the project, hence the resets of the error counts to try to get more data. As I mentioned above, we don't currently have a good way to selectively filter which RUN x CLONEs get their errors reset- I can either leave the error counts alone and get no data back, or reset all of the error counts and let the majority of the WUs run while a few bad WUs get reissued and fail out. While I am flooded with errors I don't have a good way of separating legitimate failures in a WU from failures that are tied up with how the 0x24 core and the fah-client are interacting. We do have a software developer working to fix the interaction between fah-client and 0x24- until that patch comes through, I am stuck manually removing simulations that have reached a bad state. I will happily do so whenever I am made aware of the failing WUs.
Thank you for your reply. There have been no errors in the last few days, but if I notice any WUs with failures, I will report them.
Please tell me, would it be possible to create a progress bar for projects?
I remember that project sizes, i.e. the number of RUNs x CLONEs x GENs, used to be published when projects were announced on this forum. For some reason that data is no longer published.
Last edited by Andre_Ti on Wed Nov 13, 2024 7:26 am, edited 1 time in total.
Nicolas_orleans
Posts: 114
Joined: Wed Aug 08, 2012 3:08 am

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Nicolas_orleans »

Hi Justin,
Thanks a lot for sharing what is happening behind the scenes. Would this manual reset you perform explain why, when I use one of the Folding@home apps to look up the PRCG of a failing WU (to see whether it needs to be reported) and check who processed it and with what outcome, I sometimes do not see myself in the list of donors the WU was assigned to?
Best regards
Nicolas
MSI Z77A-GD55 - Core i5-3550 - PNY RTX 4080 Super @ 2715 MHz - Ubuntu 24.04 - 6.8 kernel
MSI MPG B550 - Ryzen 5 5600X - EVGA GTX 980 Ti Hybrid @ 1366 MHz - Ubuntu 24.04 - 6.8 kernel
jjmiller
Scientist
Posts: 139
Joined: Fri Apr 09, 2021 4:43 pm

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by jjmiller »

Hi Andre_Ti-

Glad the errors have slowed down- I'm hopeful we will have core patches available soon. Regarding a progress bar, I'm not sure- it should be feasible! Perhaps the reason this hasn't been implemented yet is that projects are occasionally adjusted on the fly and either taken down before 100% completion (the researcher deemed the project sufficiently sampled to analyze) or have additional WUs added (the researcher deemed more simulation necessary)? That wouldn't stop a progress bar from being possible, though- I'll put in a request to the stats page developer.

18235-18238 are all the same configuration, 1203 RUNs x 5 CLONEs x 100 GENs. Progress is a bit asymmetric across each RUNxCLONE, but as of writing each project has the following number of completed WUs:
18235- 88,608
18236- 97,076
18237- 183,259
18238- 61,686
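
For a rough sense of how far along each project is, the arithmetic from those numbers is straightforward. A quick sketch, assuming one WU per RUN x CLONE x GEN (1203 x 5 x 100 = 601,500 WUs per project) and taking the completed counts above at face value:

Code:

# Rough completion estimate for projects 18235-18238, assuming one WU
# per RUN x CLONE x GEN (1203 x 5 x 100 = 601,500 WUs per project).
TOTAL_WUS = 1203 * 5 * 100

completed = {
    18235: 88_608,
    18236: 97_076,
    18237: 183_259,
    18238: 61_686,
}

for project, done in completed.items():
    print(f"{project}: {done / TOTAL_WUS:.1%} complete")
# 18235: 14.7%, 18236: 16.1%, 18237: 30.5%, 18238: 10.3%
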
jjmiller
Scientist
Posts: 139
Joined: Fri Apr 09, 2021 4:43 pm

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by jjmiller »

Hi Nicolas-

Hmm, I wouldn't expect the reset in error counts to affect the record of who has tried a WU. We can see this in the above WU reports, where a given WU has been issued 10+ times (the default max is 5 attempts). How soon after a failure are you checking? It could be that it takes a little while to register in the database.

I believe whether the WU is registered also depends on how the WU failed. I think if there are connectivity issues (e.g. the WU never makes it to you), then it may not be counted in the online database? I'm not 100% sure how that accounting works though.
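
For donors who want to check their own side before the database catches up, something like the sketch below can pull the failed PRCGs out of a local client log. This is only a rough illustration assuming the v8 fah-client log format shown in the excerpt further down this thread; the log path and the exact failure strings matched are assumptions, not an official tool.

Code:

import re
from collections import defaultdict

# Rough sketch: scan a fah-client log for work units that ended in a core
# restart or a dump report, so you know which PRCGs to look up in the WU
# database. Assumes the v8 log format shown in the excerpt below; the log
# path is just an example.
LOG_PATH = "log.txt"

prcg_by_slot = {}             # WU slot (e.g. "WU20") -> (project, run, clone, gen)
failures = defaultdict(list)  # PRCG -> list of failure lines

proj_re = re.compile(r":(WU\d+):Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
fail_re = re.compile(r":(WU\d+):(Core returned \w+ \(\d+\)|Sending dump report)")

with open(LOG_PATH) as fh:
    for line in fh:
        if m := proj_re.search(line):
            prcg_by_slot[m.group(1)] = m.group(2, 3, 4, 5)
        elif (m := fail_re.search(line)) and m.group(1) in prcg_by_slot:
            failures[prcg_by_slot[m.group(1)]].append(line.strip())

for (p, r, c, g), lines in failures.items():
    print(f"Project {p} (Run {r}, Clone {c}, Gen {g}): {len(lines)} failure line(s)")
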
Nicolas_orleans
Posts: 114
Joined: Wed Aug 08, 2012 3:08 am

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Nicolas_orleans »

Hi Justin,

I am checking a few hours afterwards. For example, here is a failure from yesterday on P18238.

This is the log

Code:

16:30:05:I3:WU20:Started FahCore on PID 145117
16:30:06:I1:WU20:*********************** Log Started 2024-11-12T16:30:05Z ***********************
[...]
16:30:06:I1:WU20:************************************ OpenMM ************************************
16:30:06:I1:WU20: Version: 8.1.1
16:30:06:I1:WU20:********************************************************************************
16:30:06:I1:WU20:Project: 18238 (Run 104, Clone 0, Gen 18)
16:30:06:I1:WU20:Reading tar file core.xml
16:30:06:I1:WU20:Reading tar file integrator.xml
16:30:06:I1:WU20:Reading tar file state.xml.bz2
16:30:06:I1:WU20:Reading tar file system.xml.bz2
16:30:06:I1:WU20:Digital signatures verified
16:30:06:I1:WU20:Folding@home GPU Core24 Folding@home Core
16:30:06:I1:WU20:Version 8.1.4
16:30:06:I1:WU20: Checkpoint write interval: 50000 steps (2%) [50 total]
16:30:06:I1:WU20: JSON viewer frame write interval: 25000 steps (1%) [100 total]
16:30:06:I1:WU20: XTC frame write interval: 10000 steps (0.4%) [250 total]
16:30:06:I1:WU20: TRR frame write interval: disabled
16:30:06:I1:WU20: Global context and integrator variables write interval: disabled
16:30:06:I1:WU20:There are 4 platforms available.
16:30:06:I1:WU20:Platform 0: Reference
16:30:06:I1:WU20:Platform 1: CPU
16:30:06:I1:WU20:Platform 2: OpenCL
16:30:06:I1:WU20:Platform 3: CUDA
16:30:06:I1:WU20: cuda-device 0 specified
16:30:09:I1:WU20:Attempting to create CUDA context:
16:30:09:I1:WU20: Configuring platform CUDA
16:30:14:I1:WU20: Using CUDA on CUDA Platform and gpu 0
16:30:14:I1:WU20: GPU info: Platform: CUDA
16:30:14:I1:WU20: GPU info: PlatformIndex: 0
16:30:14:I1:WU20: GPU info: Device: NVIDIA GeForce GTX 980 Ti
16:30:14:I1:WU20: GPU info: DeviceIndex: 0
16:30:14:I1:WU20: GPU info: Vendor: 0x10de
16:30:14:I1:WU20: GPU info: PCI: 43:00:00
16:30:14:I1:WU20: GPU info: Compute: 5.2
16:30:14:I1:WU20: GPU info: Driver: 12.4
16:30:14:I1:WU20: GPU info: GPU: true
16:30:14:I1:WU20:Completed 0 out of 2500000 steps (0%)
16:30:15:I1:WU20:Checkpoint completed at step 0
16:35:26:I1:WU20:Completed 25000 out of 2500000 steps (1%)
16:40:40:I1:WU20:Completed 50000 out of 2500000 steps (2%)
16:40:41:I1:WU20:Checkpoint completed at step 50000
[...]
18:36:42:I1:WU20:Checkpoint completed at step 600000
18:41:59:I1:WU20:Completed 625000 out of 2500000 steps (25%)
18:47:14:I1:WU20:Completed 650000 out of 2500000 steps (26%)
18:47:15:I1:WU20:An exception occurred at step 650000: Potential energy error of 177.581, threshold of 20
18:47:15:I1:WU20:Reference Potential Energy: -1.65635e+06 | Given Potential Energy: -1.65653e+06
18:47:15:I1:WU20:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
18:47:15:I1:WU20:Folding@home Core Shutdown: CORE_RESTART
18:47:15:W :WU20:Core returned CORE_RESTART (98)
18:47:15:I1:Default:Added new work unit: cpus:1 gpus:gpu:43:00:00
18:47:15:I1:WU20:Sending dump report
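
The numbers in that exception line fit a simple tolerance check. The following is just a minimal sketch of that kind of check, inferred from the log output rather than taken from the actual 0x24 source: the potential energy recomputed at a checkpoint must stay within a fixed threshold of the reference value, otherwise the core raises the exception and restarts from the last good checkpoint.

Code:

# Minimal sketch of the sanity check implied by the log above (assumed
# behaviour, not the actual core code): compare the given potential energy
# against the reference value and flag an error above a fixed threshold.
def energy_ok(reference: float, given: float, threshold: float = 20.0) -> bool:
    return abs(reference - given) <= threshold

# Values from the log excerpt (rounded there to six significant figures,
# which is why the recomputed error ~180 only approximates the reported 177.581)
reference = -1.65635e6
given = -1.65653e6
print(f"Potential energy error of {abs(reference - given):.3f}, threshold of 20")
print("OK" if energy_ok(reference, given) else "exception -> CORE_RESTART")
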
This is the WU database

https://apps.foldingathome.org/wu#proje ... e=0&gen=18

I can see 3 other donors failing in the WU database, but not myself. Sometimes I see myself, sometimes not.

Best regards

Nicolas
MSI Z77A-GD55 - Core i5-3550 - PNY RTX 4080 Super @ 2715 MHz - Ubuntu 24.04 - 6.8 kernel
MSI MPG B550 - Ryzen 5 5600X - EVGA GTX 980 Ti Hybrid @ 1366 MHz - Ubuntu 24.04 - 6.8 kernel
Andre_Ti
Posts: 35
Joined: Sat Mar 21, 2020 7:51 am

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Andre_Ti »

jjmiller wrote: Wed Nov 13, 2024 2:07 pm
Hi Justin,
Thank you for understanding. I thought it would be interesting if we could see each project's degree of completion. In my opinion, many people would be more motivated and more aware of their contribution to science if they could see the status and the light at the end of the tunnel. Sometimes projects run for so long that it is not clear how much is left until completion. Perhaps I am just one of those people who likes to watch the completion status, how much is left and how far we have managed to advance, but perhaps there are like-minded people out there. I sincerely hope so.