Error handling in FAH


Sparkly
Posts: 73
Joined: Sun Apr 19, 2020 11:01 am

Error handling in FAH

Post by Sparkly »

It would seem that the FAH client needs better error handling and recovery. Since upgrading from a 4 core 4 thread CPU to a 4 core 8 thread CPU, I have started to see the following rather frequently: the WU finishes at 100%, gets an error during the core shutdown, and then restarts the same WU from 0% in the same directory, even though that directory is already full of the 100 finished JSON files and a logfile saying it is finished:


09:49:31:WU00:FS03:0x22:Completed 980000 out of 1000000 steps (98%)
09:55:10:WU00:FS03:0x22:Completed 990000 out of 1000000 steps (99%)
10:00:48:WU00:FS03:0x22:Completed 1000000 out of 1000000 steps (100%)
10:00:48:WU00:FS03:0x22:Average performance: 51.1545 ns/day
10:00:53:WU00:FS03:0x22:Saving result file ..\logfile_01.txt
10:00:53:WU00:FS03:0x22:Saving result file checkpointState.xml.bz2
10:00:53:WU00:FS03:0x22:Saving result file globals.csv
10:00:53:WU00:FS03:0x22:Saving result file positions.xtc
10:00:53:WU00:FS03:0x22:Saving result file science.log
10:00:53:WU00:FS03:0x22:Folding@home Core Shutdown: FINISHED_UNIT
10:00:55:WARNING:WU00:FS03:FahCore returned an unknown error code which probably indicates that it crashed
10:00:55:WARNING:WU00:FS03:FahCore returned: UNKNOWN_ENUM (-1073740940 = 0xc0000374)
10:00:55:WU00:FS03:Starting
10:00:55:WU00:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Admin\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.11/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 7224 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
10:00:55:WU00:FS03:Started FahCore on PID 7204
10:00:55:WU00:FS03:Core PID:4560
10:00:55:WU00:FS03:FahCore 0x22 started
10:00:56:WU00:FS03:0x22:*********************** Log Started 2020-07-16T10:00:55Z ***********************
10:00:56:WU00:FS03:0x22:*************************** Core22 Folding@home Core ***************************
10:00:56:WU00:FS03:0x22:       Core: Core22
10:00:56:WU00:FS03:0x22:       Type: 0x22
10:00:56:WU00:FS03:0x22:    Version: 0.0.11
10:00:56:WU00:FS03:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:00:56:WU00:FS03:0x22:  Copyright: 2020 foldingathome.org
10:00:56:WU00:FS03:0x22:   Homepage: https://foldingathome.org/
10:00:56:WU00:FS03:0x22:       Date: Jun 26 2020
10:00:56:WU00:FS03:0x22:       Time: 19:49:16
10:00:56:WU00:FS03:0x22:   Revision: 22010df8a4db48db1b35d33e666b64d8ce48689d
10:00:56:WU00:FS03:0x22:     Branch: core22-0.0.11
10:00:56:WU00:FS03:0x22:   Compiler: Visual C++ 2015
10:00:56:WU00:FS03:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:00:56:WU00:FS03:0x22:   Platform: win32 10
10:00:56:WU00:FS03:0x22:       Bits: 64
10:00:56:WU00:FS03:0x22:       Mode: Release
10:00:56:WU00:FS03:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
10:00:56:WU00:FS03:0x22:             <peastman@stanford.edu>
10:00:56:WU00:FS03:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 7204 -checkpoint 15
10:00:56:WU00:FS03:0x22:             -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
10:00:56:WU00:FS03:0x22:************************************ libFAH ************************************
10:00:56:WU00:FS03:0x22:       Date: Jun 26 2020
10:00:56:WU00:FS03:0x22:       Time: 19:47:12
10:00:56:WU00:FS03:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
10:00:56:WU00:FS03:0x22:     Branch: HEAD
10:00:56:WU00:FS03:0x22:   Compiler: Visual C++ 2015
10:00:56:WU00:FS03:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:00:56:WU00:FS03:0x22:   Platform: win32 10
10:00:56:WU00:FS03:0x22:       Bits: 64
10:00:56:WU00:FS03:0x22:       Mode: Release
10:00:56:WU00:FS03:0x22:************************************ CBang *************************************
10:00:56:WU00:FS03:0x22:       Date: Jun 26 2020
10:00:56:WU00:FS03:0x22:       Time: 19:46:11
10:00:56:WU00:FS03:0x22:   Revision: f8529962055b0e7bde23e429f5072ff758089dee
10:00:56:WU00:FS03:0x22:     Branch: master
10:00:56:WU00:FS03:0x22:   Compiler: Visual C++ 2015
10:00:56:WU00:FS03:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
10:00:56:WU00:FS03:0x22:   Platform: win32 10
10:00:56:WU00:FS03:0x22:       Bits: 64
10:00:56:WU00:FS03:0x22:       Mode: Release
10:00:56:WU00:FS03:0x22:************************************ System ************************************
10:00:56:WU00:FS03:0x22:        CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
10:00:56:WU00:FS03:0x22:     CPU ID: GenuineIntel Family 6 Model 94 Stepping 3
10:00:56:WU00:FS03:0x22:       CPUs: 8
10:00:56:WU00:FS03:0x22:     Memory: 7.69GiB
10:00:56:WU00:FS03:0x22:Free Memory: 3.40GiB
10:00:56:WU00:FS03:0x22:    Threads: WINDOWS_THREADS
10:00:56:WU00:FS03:0x22: OS Version: 6.2
10:00:56:WU00:FS03:0x22:Has Battery: false
10:00:56:WU00:FS03:0x22: On Battery: false
10:00:56:WU00:FS03:0x22: UTC Offset: 2
10:00:56:WU00:FS03:0x22:        PID: 4560
10:00:56:WU00:FS03:0x22:        CWD: C:\Users\Admin\AppData\Roaming\FAHClient\work
10:00:56:WU00:FS03:0x22:********************************************************************************
10:00:56:WU00:FS03:0x22:Project: 13416 (Run 1040, Clone 205, Gen 0)
10:00:56:WU00:FS03:0x22:Unit: 0x0000000012bc7d9a5f0f8f4a3fff7242
10:00:56:WU00:FS03:0x22:Reading tar file core.xml
10:00:56:WU00:FS03:0x22:Reading tar file integrator.xml
10:00:56:WU00:FS03:0x22:Reading tar file state.xml.bz2
10:00:56:WU00:FS03:0x22:Reading tar file system.xml.bz2
10:00:56:WU00:FS03:0x22:Digital signatures verified
10:00:56:WU00:FS03:0x22:Folding@home GPU Core22 Folding@home Core
10:00:56:WU00:FS03:0x22:Version 0.0.11
10:00:59:WU00:FS03:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
10:00:59:WU00:FS03:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
10:00:59:WU00:FS03:0x22:  XTC frame write interval: 250000 steps (25%) [4 total]
10:00:59:WU00:FS03:0x22:  Global context and integrator variables write interval: 2500 steps (0.25%) [400 total]
10:01:17:WU00:FS03:0x22:Completed 0 out of 1000000 steps (0%)
10:06:56:WU00:FS03:0x22:Completed 10000 out of 1000000 steps (1%)
10:12:33:WU00:FS03:0x22:Completed 20000 out of 1000000 steps (2%)
10:18:12:WU00:FS03:0x22:Completed 30000 out of 1000000 steps (3%)
10:23:50:WU00:FS03:0x22:Completed 40000 out of 1000000 steps (4%)
10:29:28:WU00:FS03:0x22:Completed 50000 out of 1000000 steps (5%)
10:35:06:WU00:FS03:0x22:Completed 60000 out of 1000000 steps (6%)
10:40:42:WU00:FS03:0x22:Completed 70000 out of 1000000 steps (7%)
10:46:20:WU00:FS03:0x22:Completed 80000 out of 1000000 steps (8%)
10:51:59:WU00:FS03:0x22:Completed 90000 out of 1000000 steps (9%)
10:57:36:WU00:FS03:0x22:Completed 100000 out of 1000000 steps (10%)
I tried switching to new RAM modules, since the error code looks like heap corruption, and also checked for any OC issues, but none of it has made any difference. The same thing still happens here and there.
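
For reference, the signed value in the log and the hex value are the same 32-bit number, and 0xC0000374 is the Windows NTSTATUS code STATUS_HEAP_CORRUPTION. A minimal Python sketch of the conversion follows; the function names and the small lookup table are made up for illustration and are not part of the FAH client:

```python
# Documented Windows NTSTATUS constants; this table is deliberately tiny.
KNOWN_NTSTATUS = {
    0xC0000374: "STATUS_HEAP_CORRUPTION",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_ntstatus(exit_code: int) -> int:
    """Reinterpret a signed 32-bit process exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code: int) -> str:
    """Format an exit code the way the FAH log shows it, plus a name if known."""
    status = to_ntstatus(exit_code)
    name = KNOWN_NTSTATUS.get(status, "unknown")
    return f"0x{status:08x} ({name})"


# The value from "FahCore returned: UNKNOWN_ENUM (-1073740940 = 0xc0000374)"
print(describe(-1073740940))
```

This only confirms what the code means. It says nothing about which component corrupted the heap.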

As an added issue, the FAH client downloads a new WU to a new directory, but that WU never gets started, since the client just restarts the old one from 0% again. You end up with a downloaded WU that is doing nothing, waiting in the hope that the previous one will finish at some point.

I tried setting “next-unit-percentage” to “100”, in the hope that a new download would not happen until the previous WU was fully completed, but the client just downloads at 100% instead of 99%. So I would suggest adding another value to this configuration option, “-1” or something, where a new WU is not downloaded until the FahCore returns a correct finishing code, “FahCore returned: FINISHED_UNIT (100 = 0x64)”. That would avoid any new WU download until the previous one is actually handled.
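
To make the suggestion concrete, here is a rough Python sketch of the proposed policy. It is hypothetical: the function name, its parameters, and the “-1” value are assumptions for illustration, and the percentage branch only mirrors the behaviour described above, not the actual client code.

```python
from typing import Optional

# Return code shown in "FahCore returned: FINISHED_UNIT (100 = 0x64)".
FINISHED_UNIT = 100


def should_download_next(next_unit_percentage: int,
                         progress_percent: float,
                         core_exit_code: Optional[int]) -> bool:
    """Decide whether a new WU may be fetched (hypothetical policy).

    core_exit_code is None while the core is still running.
    """
    if next_unit_percentage == -1:
        # Proposed new mode: wait until the core has exited cleanly.
        return core_exit_code == FINISHED_UNIT
    # Behaviour as observed: fetch once progress reaches the threshold.
    return progress_percent >= next_unit_percentage


# At 100% but crashed with 0xc0000374: no new download in the "-1" mode.
print(should_download_next(-1, 100.0, -1073740940))
```

With “-1”, a core that crashes during shutdown would never trigger a download, so no extra WU would sit idle in the queue.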
Jan
Posts: 79
Joined: Tue Mar 31, 2020 6:46 pm

Re: Error handling in FAH

Post by Jan »

This is very interesting, looks like the same issue in this thread. Do you have windows event logs available?

*edit* Also, I suspect some WU involvement here. It's WU 13416 again.
mwroggenbuck
Posts: 127
Joined: Tue Mar 24, 2020 12:47 pm

Re: Error handling in FAH

Post by mwroggenbuck »

With the advanced control interface, you can select finish. This will not download any new WU, and will pause after the queue is empty. Unfortunately, some jobs seem to fail over and over and never complete. Note that in pause mode, nothing will be downloaded until fold is selected.
Sparkly
Posts: 73
Joined: Sun Apr 19, 2020 11:01 am

Re: Error handling in FAH

Post by Sparkly »

Jan wrote:Do you have windows event logs available?
Well, the Event Viewer won’t really tell us much more than we would already know, since the error is just passed through to FahCore.exe, which doesn’t handle it very gracefully.


Faulting application name: FahCore_22.exe, version: 0.0.0.0, time stamp: 0x5ef65146
Faulting module name: ntdll.dll, version: 10.0.19041.207, time stamp: 0xcad89ab4
Exception code: 0xc0000374
Fault offset: 0x00000000000fdec9
Faulting process id: 0x1a2c
Faulting application start time: 0x01d65b443cf7cdb4
Faulting application path: C:\Users\Admin\AppData\Roaming\FAHClient\cores\cores.foldingathome.org\win\64bit\22-0.0.11\Core_22.fah\FahCore_22.exe
Faulting module path: C:\WINDOWS\SYSTEM32\ntdll.dll
Report Id: 5003a947-a5c7-43e4-8d1a-43e5c571d681
Faulting package full name: 
Faulting package-relative application ID:
Generally this indicates heap corruption, but who knows what FahCore_22.exe does to make it happen where it does. It could be an issue with mismatched timing of user/kernel thread handling during some memory cleanup, but that is for the programmers to figure out, or just work around.

I also checked whether there were any issues with the OS files and versions by running

DISM.exe /Online /Cleanup-image /Restorehealth

and

sfc /scannow

which reported that everything was fine.

But as mentioned before, this issue didn’t exist until I went from 4 core 4 thread to 4 core 8 thread, so I am guessing that thread handling across different cores might play a part here, since each FAH process keeps at least 3 software threads constantly active after WU initialisation.
Sparkly
Posts: 73
Joined: Sun Apr 19, 2020 11:01 am

Re: Error handling in FAH

Post by Sparkly »

mwroggenbuck wrote:With the advanced control interface, you can select finish. This will not download any new WU, and pause after the queue is empty.
Yeah, I am aware of this, but it doesn’t really solve anything, since the main issue is that the WU being handled fails after 100% and never recovers. Even if I don’t download another WU, the fail is still a fail. And since this is a 24/7 setup, I am not babysitting it every 20 minutes to see whether any of the WUs currently being handled has actually failed and the current run is an overwrite.

So far, for me, this only seems to happen with the 13416 WUs, but since the majority over the last few days has been from this project, who knows.
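
Since nobody wants to babysit the log, the pattern can be spotted automatically. Below is a minimal Python sketch that scans log lines for a core returning anything other than FINISHED_UNIT after its WU reached 100%. The regular expressions are based only on the log lines posted above, so treat the exact format as an assumption; the function name is made up.

```python
import re

# Patterns derived from the log excerpt in this thread (assumed format).
COMPLETED = re.compile(
    r"^(\d\d:\d\d:\d\d):(WU\d+):(FS\d+):0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps")
RETURNED = re.compile(
    r"^(\d\d:\d\d:\d\d):(?:WARNING:)?(WU\d+):(FS\d+):"
    r"FahCore returned: (\w+) \((-?\d+) = 0x[0-9a-fA-F]+\)")


def find_failed_finishes(lines):
    """Yield (time, wu, slot, code) for cores that failed after 100%."""
    finished = set()  # (wu, slot) pairs whose last progress line was 100%
    for line in lines:
        m = COMPLETED.match(line)
        if m:
            key = (m.group(2), m.group(3))
            if int(m.group(4)) == int(m.group(5)):
                finished.add(key)
            else:
                finished.discard(key)
            continue
        m = RETURNED.match(line)
        if m:
            key = (m.group(2), m.group(3))
            if key in finished and m.group(4) != "FINISHED_UNIT":
                yield (m.group(1), m.group(2), m.group(3), int(m.group(5)))
            finished.discard(key)


SAMPLE = [
    "10:00:48:WU00:FS03:0x22:Completed 1000000 out of 1000000 steps (100%)",
    "10:00:53:WU00:FS03:0x22:Folding@home Core Shutdown: FINISHED_UNIT",
    "10:00:55:WARNING:WU00:FS03:FahCore returned: UNKNOWN_ENUM (-1073740940 = 0xc0000374)",
    "10:01:17:WU00:FS03:0x22:Completed 0 out of 1000000 steps (0%)",
]

for hit in find_failed_finishes(SAMPLE):
    print(hit)
```

Feeding it the lines of the client log on a schedule would at least flag the overwrite runs, even though it does nothing to fix them.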
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Error handling in FAH

Post by bruce »

There are a couple dozen error reports mentioning 0xc0000374 and apparently they're against FAHCore_21. My hunch is that the problem isn't really in either _21 or _22 but in some support function that's used by both of them. That just makes it that much more difficult to debug.

Do you have a maximum configured for Windows virtual memory or can it grow as big as it wants to within the limits of your hardware?
Jan
Posts: 79
Joined: Tue Mar 31, 2020 6:46 pm

Re: Error handling in FAH

Post by Jan »

For me, it is configured to use between 1-2 GB. Do you recommend trying automatic sizing?
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Error handling in FAH

Post by bruce »

Windows does not handle running out of virtual memory nicely. There are all kinds of recommendations on the internet about keeping your paging file small, which simply causes Windows to crash if more is needed, so I leave it unlimited. Periodically I look to see how big it has gotten when I left many apps running. Periodically I reorg my disk and create a contiguous file that starts about as big as the maximum I expect it to use. You don't have to do all of that, but I do recommend that you do not limit it, simply because that forces Windows to crash.
Sparkly
Posts: 73
Joined: Sun Apr 19, 2020 11:01 am

Re: Error handling in FAH

Post by Sparkly »

bruce wrote:Do you have a maximum configured for Windows virtual memory or can it grow as big as it wants to within the limits of your hardware?
My Virtual Memory is handled as needed by the OS, so no other limits on it than what the OS would have, but since I am only using around 60% of the regular RAM, when all FAH processes are running on all cards, I am not really running out of memory.

If Virtual/Memory was an issue, that would also have been a problem when running on 4 core 4 threads, but I never saw this issue until I upgraded to 4 core 8 threads.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Error handling in FAH

Post by bruce »

Sparkly wrote:
bruce wrote:Do you have a maximum configured for Windows virtual memory or can it grow as big as it wants to within the limits of your hardware?
If Virtual/Memory was an issue, that would also have been a problem when running on 4 core 4 threads, but I never saw this issue until I upgraded to 4 core 8 threads.
Not necessarily.

FAHClient allocates more RAM for 8 threads than it does for 4 threads. Without more information, neither of us has any idea whether Windows needs more (virtual) RAM for whatever it's doing when it encounters the error.
Jan
Posts: 79
Joined: Tue Mar 31, 2020 6:46 pm

Re: Error handling in FAH

Post by Jan »

I changed virtual memory management to automatic for now. On the last, failing WU, FAH reported 10GB of free memory (16GB RAM installed). Will report further development.
Sparkly
Posts: 73
Joined: Sun Apr 19, 2020 11:01 am

Re: Error handling in FAH

Post by Sparkly »

bruce wrote:FAHClient allocates more RAM for 8 threads than it does for 4 threads. Without more information, neither of us have any idea whether Windows needs more (virtual) RAM doing whatever it's doing when Windows encounters the error.
This is a GPU-only system, so there is no folding happening on the CPU cores/threads, unless the 13416 does something special we don’t know about. The CPU just handles the feeding of the active GPUs. As can be seen from the memory utilization, the amount of RAM being used is the same now as it was before the upgrade and hasn’t changed from one CPU to the next. There isn’t really any paging activity to disk either, so the virtual memory isn’t being used for much, if at all, since there is plenty of physical memory available first.

What is a big difference, though, is the CPU load from these 13416 WUs. The 4c4t struggles to handle 4 running GPUs with the 13416 and constantly sits at 95-100% CPU load, but generates no errors, while the 4c8t handles the same 4 WUs with a CPU load around 40% and gets significantly better TPF, but also gets the errors at 100% here and there.

The main issue here is that the regular WU error handling doesn’t kick in when the WU has reached 100% and then fails. On any failure before 100%, the client would roll back and restart the WU from a slightly earlier point and try again, instead of starting from scratch like it does now. So the error handling after reaching 100% should be adjusted in the next client upgrade.
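
As a sketch of what I mean, the decision after a core exits could look roughly like the Python below. This is hypothetical: the names are invented, and it is not how the client is actually implemented. The point is only that a crash after 100% should take the same checkpoint branch as a crash at any other point.

```python
from enum import Enum

# Return code shown in "FahCore returned: FINISHED_UNIT (100 = 0x64)".
FINISHED_UNIT = 100


class Action(Enum):
    UPLOAD = "upload results"
    RESUME_FROM_CHECKPOINT = "resume from last checkpoint"
    RESTART_FROM_ZERO = "restart from 0%"


def next_action(exit_code: int, has_checkpoint: bool) -> Action:
    """Pick the recovery step after a core exits (hypothetical policy)."""
    if exit_code == FINISHED_UNIT:
        return Action.UPLOAD
    if has_checkpoint:
        # Applies to every crash, including one during shutdown at 100%.
        return Action.RESUME_FROM_CHECKPOINT
    # Only start over when there is nothing to resume from.
    return Action.RESTART_FROM_ZERO


# Crash with 0xc0000374 after 100%, checkpoint file present in the WU dir.
print(next_action(-1073740940, has_checkpoint=True).value)
```

In the log above, checkpointState.xml.bz2 was saved just before the crash, so under this policy only the last few steps and the shutdown would be redone.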
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Error handling in FAH

Post by bruce »

Do not assume that the CPU only manages the data flow between RAM and VRAM. That is its primary function, but not its only function. The CPU is used periodically to validate the active WU (known as a "sanity check").
Sparkly
Posts: 73
Joined: Sun Apr 19, 2020 11:01 am

Re: Error handling in FAH

Post by Sparkly »

bruce wrote:Do not assume that the CPU only manages the data flow between RAM and VRAM. That is its primary function, but not its only function. The CPU is used periodically to validate the active WU (known as a "sanity check").
Well, at least in my case so far, the error seems to be consistent over several days, since it always happens after 100% and always with a 13416 WU, while all the other WUs from other projects running on the same system complete fine.
Sparkly
Posts: 73
Joined: Sun Apr 19, 2020 11:01 am

Re: Error handling in FAH

Post by Sparkly »

Same stuff here:

i7-7700 - 4 core 8 thread and 134xx
viewtopic.php?f=81&t=35482

i7-3770K - 4 core 8 thread and 134xx
viewtopic.php?f=19&t=35658&start=45


So this smells like multithread issues in the software.


Edit:
- Missing link