12474 crashes repeatedly

Posted: Tue May 20, 2025 8:25 am
by bikeaddict
On two machines, project 12474 crashes repeatedly with FahCore returned: INTERRUPTED (102 = 0x66). Short excerpts of logs below.

03:42:42:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:12474 run:24 clone:5 gen:104 core:0xa8 unit:0x680000000500000018000000ba300000
03:42:42:WU01:FS00:Starting
03:42:42:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 7133 -checkpoint 15 -np 31
03:42:42:WU01:FS00:Started FahCore on PID 997539
03:42:42:WU01:FS00:Core PID:997543
03:42:42:WU01:FS00:FahCore 0xa8 started
03:42:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
03:42:43:WU01:FS00:Starting
03:42:43:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 7133 -checkpoint 15 -np 31
03:42:43:WU01:FS00:Started FahCore on PID 997561
03:42:43:WU01:FS00:Core PID:997565
03:42:43:WU01:FS00:FahCore 0xa8 started
03:42:44:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
03:42:52:WU00:FS02:0x24:Completed 750000 out of 5000000 steps (15%)
03:42:52:WU00:FS02:0x24:Checkpoint completed at step 750000
03:43:43:WU01:FS00:Starting
03:43:43:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 7133 -checkpoint 15 -np 31
03:43:43:WU01:FS00:Started FahCore on PID 997582
03:43:43:WU01:FS00:Core PID:997586
03:43:43:WU01:FS00:FahCore 0xa8 started
03:43:44:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
03:44:43:WU01:FS00:Starting
03:44:43:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 7133 -checkpoint 15 -np 31
03:44:43:WU01:FS00:Started FahCore on PID 997604
03:44:43:WU01:FS00:Core PID:997608
03:44:43:WU01:FS00:FahCore 0xa8 started
03:44:44:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
03:44:59:WU00:FS02:0x24:Completed 800000 out of 5000000 steps (16%)

07:06:26:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:12474 run:12 clone:6 gen:80 core:0xa8 unit:0x50000000060000000c000000ba300000
07:06:26:WU02:FS01:Starting
07:06:26:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 02 -suffix 01 -version 706 -lifeline 1811 -checkpoint 15 -np 31
07:06:26:WU02:FS01:Started FahCore on PID 1366389
07:06:26:WU02:FS01:Core PID:1366393
07:06:26:WU02:FS01:FahCore 0xa8 started
07:06:26:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
07:06:27:WU02:FS01:Starting
07:06:27:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 02 -suffix 01 -version 706 -lifeline 1811 -checkpoint 15 -np 31
07:06:27:WU02:FS01:Started FahCore on PID 1366410
07:06:27:WU02:FS01:Core PID:1366414
07:06:27:WU02:FS01:FahCore 0xa8 started
07:06:27:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
07:06:52:WU01:FS00:0x23:Completed 87500 out of 1250000 steps (7%)
07:07:27:WU02:FS01:Starting
07:07:27:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 02 -suffix 01 -version 706 -lifeline 1811 -checkpoint 15 -np 31
07:07:27:WU02:FS01:Started FahCore on PID 1366432
07:07:27:WU02:FS01:Core PID:1366436
07:07:27:WU02:FS01:FahCore 0xa8 started
07:07:27:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
07:07:34:WU01:FS00:0x23:Completed 100000 out of 1250000 steps (8%)
07:07:35:WU01:FS00:0x23:Checkpoint completed at step 100000
07:08:17:WU01:FS00:0x23:Completed 112500 out of 1250000 steps (9%)
07:08:27:WU02:FS01:Starting
07:08:27:WU02:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 02 -suffix 01 -version 706 -lifeline 1811 -checkpoint 15 -np 31
07:08:27:WU02:FS01:Started FahCore on PID 1366454
07:08:27:WU02:FS01:Core PID:1366458
07:08:27:WU02:FS01:FahCore 0xa8 started
07:08:27:WU02:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
07:08:59:WU01:FS00:0x23:Completed 125000 out of 1250000 steps (10%)
07:08:59:WU01:FS00:0x23:Checkpoint completed at step 125000
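The restart pattern in the excerpts above can be counted mechanically. Below is a minimal sketch, assuming the v7-style log line format shown in this thread; the function names and the threshold of 3 are illustrative choices, not anything defined by the client.

```python
import re
from collections import Counter

# Matches v7-style client log lines such as
# "03:42:43:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)"
RETURN_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:(WU\d+):(FS\d+):FahCore returned: (\w+) "
    r"\((\d+) = 0x[0-9a-fA-F]+\)"
)


def count_core_returns(lines):
    """Count FahCore return statuses per (work unit, slot, status)."""
    counts = Counter()
    for line in lines:
        m = RETURN_RE.match(line.strip())
        if m:
            wu, slot, status, _code = m.groups()
            counts[(wu, slot, status)] += 1
    return counts


def looks_like_crash_loop(counts, threshold=3):
    """Return (wu, slot) pairs with at least `threshold` INTERRUPTED returns.

    The threshold is an arbitrary assumption for this sketch.
    """
    return sorted(
        (wu, slot)
        for (wu, slot, status), n in counts.items()
        if status == "INTERRUPTED" and n >= threshold
    )
```

Feeding it the lines of a log file (for example `count_core_returns(open(path))`, where `path` is wherever your client writes its log) gives a per-slot tally that makes a loop like the one above easy to spot.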

Re: 12474 crashes repeatedly

Posted: Tue May 20, 2025 8:27 am
by muziqaz
Permission issue?
Unstable machine?
Try deleting the /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 directory, then restart fah-client and let it download the core again.

Re: 12474 crashes repeatedly

Posted: Tue May 20, 2025 1:32 pm
by bikeaddict
Every other project on these machines has been fine for over a year.

Also curious that one WU was assigned to another user at almost the same time who completed it in 6.5 minutes for 1.89M points.

https://apps.foldingathome.org/wu#proje ... =5&gen=104

Re: 12474 crashes repeatedly

Posted: Tue May 20, 2025 1:36 pm
by arisu
bikeaddict wrote: Tue May 20, 2025 1:32 pm
Every other project on these machines has been fine for over a year.

Also curious that one WU was assigned to another user at almost the same time who completed it in 6.5 minutes for 1.89M points.

https://apps.foldingathome.org/wu#proje ... =5&gen=104

That has to be a server bug. That equates to over 421M PPD on a CPU, which is plainly impossible. When that user returned their WU, it was re-sent to you, so that means they must have failed it and the server improperly credited their failed/dumped WU as a success.
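
For anyone who wants to check the arithmetic, here is a minimal sketch of the extrapolation, using the rounded figures quoted above (1.89M points, 6.5 minutes). With those rounded inputs it comes out a little under the 421M figure, which presumably was computed from the unrounded values on the WU page; the function name is just illustrative.

```python
def extrapolated_ppd(points, minutes):
    """Scale the credit for one WU to a full day at the same rate."""
    minutes_per_day = 24 * 60
    return points * minutes_per_day / minutes


# Rounded figures from the post: 1.89M points in 6.5 minutes.
ppd = extrapolated_ppd(1_890_000, 6.5)
print(f"{ppd / 1e6:.0f}M PPD")  # prints "419M PPD"
```

Either way the result is in the hundreds of millions of PPD, which supports the point that the credit cannot reflect real CPU work.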

Re: 12474 crashes repeatedly

Posted: Sat May 24, 2025 6:47 pm
by bikeaddict
Another WU showed similarly odd behavior. The first client finished the task in two seconds for 65K points. The second client finished in eight minutes for 1.6M points. Both results were supposedly successful, but the WU was still sent to me and went into the same crash loop.

https://apps.foldingathome.org/wu#proje ... e=2&gen=63

Re: 12474 crashes repeatedly

Posted: Sat May 24, 2025 11:26 pm
by arisu
So the crash loop that you're experiencing is only being caused by WUs that also have this odd credit behavior?

Re: 12474 crashes repeatedly

Posted: Sun May 25, 2025 9:48 am
by Nicolas_orleans
The vast.ai instance I am running also shows odd behavior, and only for this one project: all 3 WUs it received failed at startup. It is running Ubuntu 24.04.

09:19:05:I1:WU323:Requesting WU assignment for user Nicolas_orleans team 33
09:19:06:I1:WU323:Received WU assignment gsN15JmmfSCaMjbRK9TxKKp1ni_rsyptPIJendXpJsU
09:19:06:I1:WU323:Downloading WU
09:19:08:I1:WU323:DOWNLOAD 3% 167.02KiB of 5.68MiB
09:19:09:I1:WU323:DOWNLOAD 38% 2.18MiB of 5.68MiB
09:19:10:I1:WU323:Received WU P12474 R8 C5 G79
09:19:10:I3:WU323:Started FahCore on PID 199222
09:19:11:E :WU323:Core was killed
09:19:11:E :WU323:Core returned FAILED_1 (0)
09:19:11:E :WU323:The folding core did not produce any log output. This indicates that the core is not functional on your system. Check for missing libraries or GPU drivers. Make a post about your issue on https://foldingforum.org/ to get more help.
09:19:11:E :WU323:Run did not produce any results. Dumping WU
09:19:11:I1:WU323:Sending dump report
09:19:12:I1:WU323:Dumped

Re: 12474 crashes repeatedly

Posted: Sun May 25, 2025 10:02 am
by muziqaz
Reported to the researcher. This seems to be Linux-only for now. Something might have broken in WU generation on their server.

Re: 12474 crashes repeatedly

Posted: Sun May 25, 2025 10:46 am
by arisu
Can someone send me a copy of a wudata_01.dat for this so I can test it on my Linux system and try to diagnose the issue (if it's a Linux-only issue)?

Re: 12474 crashes repeatedly

Posted: Mon May 26, 2025 7:22 pm
by toTOW
I'm getting some bad WUs from this project too...
19:29:23:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:12474 run:1 clone:1 gen:58 core:0xa8 unit:0x3a0000000100000001000000ba300000
It is stuck in an endless loop:
19:48:27:WU01:FS00:Starting
19:48:27:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit-avx2-256/a8-0.0.12/Core_a8.fah/FahCore_a8 -dir 01 -suffix 01 -version 706 -lifeline 1126 -checkpoint 15 -np 12
19:48:27:WU01:FS00:Started FahCore on PID 20871
19:48:27:WU01:FS00:Core PID:20875
19:48:27:WU01:FS00:FahCore 0xa8 started
19:48:28:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
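
Side note for cross-referencing WUs: in the three "Received Unit" lines posted in this thread, the `unit:` value appears to be four little-endian 32-bit integers in the order gen, clone, run, project. That layout is inferred only from these log lines, not from client documentation, so treat the sketch below as an assumption; `decode_unit` is a hypothetical helper name.

```python
import struct


def decode_unit(unit_hex):
    """Split a 128-bit unit id into four little-endian 32-bit integers.

    Field order (gen, clone, run, project) is an assumption inferred
    from the log lines in this thread.
    """
    digits = unit_hex[2:] if unit_hex.lower().startswith("0x") else unit_hex
    raw = bytes.fromhex(digits)
    if len(raw) != 16:
        raise ValueError("expected 16 bytes (32 hex digits)")
    gen, clone, run, project = struct.unpack("<4I", raw)
    return {"project": project, "run": run, "clone": clone, "gen": gen}


print(decode_unit("0x3a0000000100000001000000ba300000"))
# prints {'project': 12474, 'run': 1, 'clone': 1, 'gen': 58}
```

The decoded values match the project/run/clone/gen fields printed on the same log line, which is handy when only the unit id is available.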

Re: 12474 crashes repeatedly

Posted: Tue Jun 03, 2025 10:14 pm
by vvoelz
As @muziqaz suspected, we somehow managed to break our WU continuation scripts for p12474. We've turned this project off for now and restarted the server. Please let us know if you see any more of this behavior on other projects!