Frame time increases from 75 seconds to 90+ minutes for 16722 WU, after restart due to bad allocation.

Posted: Sun Oct 06, 2024 3:04 am
by ETA_2025
My GPU is running Project: 16722 (Run 262, Clone 2, Gen 458), which at the current rate will take more than six days to complete. The Timeout is three days.

It had completed 83% when the bad allocation occurred. Since the restart, each frame takes more than 90 minutes, up from about 75 seconds before!

Code:

15:01:47:WU01:FS02:0x23:Checkpoint completed at step 2000000
15:03:05:WU01:FS02:0x23:Completed 2025000 out of 2500000 steps (81%)
15:04:23:WU01:FS02:0x23:Completed 2050000 out of 2500000 steps (82%)
15:05:38:WU01:FS02:0x23:Completed 2075000 out of 2500000 steps (83%)
15:06:38:WU01:FS02:0x23:An exception occurred at step 2094343: bad allocation
15:06:38:WU01:FS02:0x23:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
15:06:38:WU01:FS02:0x23:Folding@home Core Shutdown: CORE_RESTART
15:06:38:WARNING:WU01:FS02:FahCore returned: CORE_RESTART (98 = 0x62)
15:06:38:WU01:FS02:Starting
15:06:38:WU01:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" A:\FAHClient\cores/cores.foldingathome.org/openmm-core-23/windows-10-64bit/release/0x23-8.0.3/Core_23.fah/FahCore_23.exe -dir 01 -suffix 01 -version 706 -lifeline 13876 -checkpoint 5 -opencl-platform 1 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
15:06:38:WU01:FS02:Started FahCore on PID 491712
15:06:39:WU01:FS02:Core PID:775060
15:06:39:WU01:FS02:FahCore 0x23 started
15:06:39:WU01:FS02:0x23:*********************** Log Started 2024-10-05T15:06:39Z ***********************
15:06:39:WU01:FS02:0x23:*************************** Core23 Folding@home Core ***************************
15:06:39:WU01:FS02:0x23:       Core: Core23
15:06:39:WU01:FS02:0x23:       Type: 0x23
15:06:39:WU01:FS02:0x23:    Version: 8.0.3
15:06:39:WU01:FS02:0x23:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:06:39:WU01:FS02:0x23:  Copyright: 2022 foldingathome.org
15:06:39:WU01:FS02:0x23:   Homepage: https://foldingathome.org/
15:06:39:WU01:FS02:0x23:       Date: Aug 3 2023
15:06:39:WU01:FS02:0x23:       Time: 08:39:06
15:06:39:WU01:FS02:0x23:   Compiler: Visual C++
15:06:39:WU01:FS02:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
15:06:39:WU01:FS02:0x23:             -DOPENMM_VERSION="\"8.0.0\""
15:06:39:WU01:FS02:0x23:   Platform: win32 10
15:06:39:WU01:FS02:0x23:       Bits: 64
15:06:39:WU01:FS02:0x23:       Mode: Release
15:06:39:WU01:FS02:0x23:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
15:06:39:WU01:FS02:0x23:             <peastman@stanford.edu>
15:06:39:WU01:FS02:0x23:       Args: -dir 01 -suffix 01 -version 706 -lifeline 491712 -checkpoint 5
15:06:39:WU01:FS02:0x23:             -opencl-platform 1 -opencl-device 0 -cuda-device 0 -gpu-vendor
15:06:39:WU01:FS02:0x23:             nvidia -gpu 0 -gpu-usage 100
15:06:39:WU01:FS02:0x23:************************************ libFAH ************************************
15:06:39:WU01:FS02:0x23:       Date: Aug 3 2023
15:06:39:WU01:FS02:0x23:       Time: 08:37:55
15:06:39:WU01:FS02:0x23:   Compiler: Visual C++
15:06:39:WU01:FS02:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
15:06:39:WU01:FS02:0x23:   Platform: win32 10
15:06:39:WU01:FS02:0x23:       Bits: 64
15:06:39:WU01:FS02:0x23:       Mode: Release
15:06:39:WU01:FS02:0x23:************************************ CBang *************************************
15:06:39:WU01:FS02:0x23:    Version: 1.7.2
15:06:39:WU01:FS02:0x23:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:06:39:WU01:FS02:0x23:        Org: Cauldron Development LLC
15:06:39:WU01:FS02:0x23:  Copyright: Cauldron Development LLC, 2003-2023
15:06:39:WU01:FS02:0x23:   Homepage: https://cauldrondevelopment.com/
15:06:39:WU01:FS02:0x23:    License: GPL 2+
15:06:39:WU01:FS02:0x23:       Date: Aug 3 2023
15:06:39:WU01:FS02:0x23:       Time: 08:37:14
15:06:39:WU01:FS02:0x23:   Compiler: Visual C++
15:06:39:WU01:FS02:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
15:06:39:WU01:FS02:0x23:   Platform: win32 10
15:06:39:WU01:FS02:0x23:       Bits: 64
15:06:39:WU01:FS02:0x23:       Mode: Release
15:06:39:WU01:FS02:0x23:************************************ System ************************************
15:06:39:WU01:FS02:0x23:        CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
15:06:39:WU01:FS02:0x23:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 9
15:06:39:WU01:FS02:0x23:       CPUs: 8
15:06:39:WU01:FS02:0x23:     Memory: 31.70GiB
15:06:39:WU01:FS02:0x23:Free Memory: 8.64GiB
15:06:39:WU01:FS02:0x23:    Threads: WINDOWS_THREADS
15:06:39:WU01:FS02:0x23: OS Version: 6.2
15:06:39:WU01:FS02:0x23:Has Battery: false
15:06:39:WU01:FS02:0x23: On Battery: false
15:06:39:WU01:FS02:0x23: UTC Offset: 10
15:06:39:WU01:FS02:0x23:        PID: 775060
15:06:39:WU01:FS02:0x23:        CWD: A:\FAHClient\work
15:06:39:WU01:FS02:0x23:       Exec: A:\FAHClient\cores\cores.foldingathome.org\openmm-core-23\windows-10-64bit\release\0x23-8.0.3\Core_23.fah\FahCore_23.exe
15:06:39:WU01:FS02:0x23:************************************ OpenMM ************************************
15:06:39:WU01:FS02:0x23:    Version: 8.0.0
15:06:39:WU01:FS02:0x23:********************************************************************************
15:06:39:WU01:FS02:0x23:Project: 16722 (Run 262, Clone 2, Gen 458)
15:06:39:WU01:FS02:0x23:Digital signatures verified
15:06:39:WU01:FS02:0x23:Folding@home GPU Core23 Folding@home Core
15:06:39:WU01:FS02:0x23:Version 8.0.3
15:06:39:WU01:FS02:0x23:  Checkpoint write interval: 100000 steps (4%) [25 total]
15:06:39:WU01:FS02:0x23:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
15:06:39:WU01:FS02:0x23:  XTC frame write interval: 10000 steps (0.4%) [250 total]
15:06:39:WU01:FS02:0x23:  Global context and integrator variables write interval: disabled
15:06:40:WU01:FS02:0x23:There are 4 platforms available.
15:06:40:WU01:FS02:0x23:Platform 0: Reference
15:06:40:WU01:FS02:0x23:Platform 1: CPU
15:06:40:WU01:FS02:0x23:Platform 2: OpenCL
15:06:40:WU01:FS02:0x23:  opencl-device 0 specified
15:06:40:WU01:FS02:0x23:Platform 3: CUDA
15:06:40:WU01:FS02:0x23:  cuda-device 0 specified
15:07:00:WU01:FS02:0x23:Attempting to create CUDA context:
15:07:00:WU01:FS02:0x23:  Configuring platform CUDA
15:07:01:WU01:FS02:0x23:Failed to create CUDA context:
15:07:01:WU01:FS02:0x23:Error initializing FFT: 5
15:07:01:WU01:FS02:0x23:Attempting to create OpenCL context:
15:07:01:WU01:FS02:0x23:  Configuring platform OpenCL
15:07:29:WU01:FS02:0x23:  Using OpenCL on OpenCL platformId 1 and gpu 0
15:07:29:WU01:FS02:0x23:  GPU info: Platform: OpenCL: NVIDIA CUDA
15:07:29:WU01:FS02:0x23:  GPU info: PlatformIndex: 0
15:07:29:WU01:FS02:0x23:  GPU info: Device: NVIDIA GeForce RTX 4070
15:07:29:WU01:FS02:0x23:  GPU info: DeviceIndex: 0
15:07:29:WU01:FS02:0x23:  GPU info: Vendor: 0x10de
15:07:29:WU01:FS02:0x23:  GPU info: PCI: 01:00:00
15:07:29:WU01:FS02:0x23:  GPU info: Compute: 3.0
15:07:29:WU01:FS02:0x23:  GPU info: Driver: 561.9
15:07:29:WU01:FS02:0x23:  GPU info: GPU: true
15:07:29:WU01:FS02:0x23:Completed 0 out of 2500000 steps (0%)
16:42:33:WU01:FS02:0x23:Completed 25000 out of 2500000 steps (1%)
******************************* Date: 2024-10-05 *******************************
18:15:30:WU01:FS02:0x23:Completed 50000 out of 2500000 steps (2%)
19:51:07:WU01:FS02:0x23:Completed 75000 out of 2500000 steps (3%)
21:24:02:WU01:FS02:0x23:Completed 100000 out of 2500000 steps (4%)
21:24:07:WU01:FS02:0x23:Checkpoint completed at step 100000
22:55:10:WU01:FS02:0x23:Completed 125000 out of 2500000 steps (5%)
******************************* Date: 2024-10-06 *******************************
00:29:38:WU01:FS02:0x23:Completed 150000 out of 2500000 steps (6%)
02:04:24:WU01:FS02:0x23:Completed 175000 out of 2500000 steps (7%)
After I rebooted the computer, the WU restarted again, this time at the original 75-second frame time.
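For anyone who wants to check frame times in their own log, here is a minimal Python sketch that measures the gap between consecutive "Completed ... steps" lines. The function name `frame_times` and the regular expression are my own assumptions based on the log format shown above; they are not part of FAHClient.

```python
import re

# Matches core progress lines such as:
# 15:03:05:WU01:FS02:0x23:Completed 2025000 out of 2500000 steps (81%)
FRAME = re.compile(
    r"^(\d\d):(\d\d):(\d\d):WU\d+:FS\d+:0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps"
)

def frame_times(lines):
    """Return the seconds elapsed between consecutive progress lines.

    A pair is skipped when the step counter does not increase,
    which is what happens when the WU restarts from step 0.
    """
    deltas = []
    prev = None  # (seconds since midnight, step) of the previous progress line
    for line in lines:
        m = FRAME.match(line)
        if not m:
            continue
        h, mi, s, step, _total = map(int, m.groups())
        t = h * 3600 + mi * 60 + s
        if prev is not None and step > prev[1]:
            # Modulo 86400 handles timestamps that wrap past midnight.
            deltas.append((t - prev[0]) % 86400)
        prev = (t, step)
    return deltas
```

On the lines quoted above this gives 78 s and 75 s before the restart and 5704 s (about 95 minutes) for the first frame afterwards. At that rate, 100 frames would need roughly 6.6 days, which matches the estimate of more than six days.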

Can someone explain what happened?

Re: Frame time of more than 90 minutes for 16722 WU, after restart due to bad allocation.

Posted: Sun Oct 06, 2024 4:02 am
by Joe_H
I suspect a video driver crash caused the bad allocation error, after which the video system reinitialized. It apparently also corrupted the checkpoint, so the WU started over from the beginning. However the driver reinitialization went, the folding core could not create a CUDA context afterwards and started up in OpenCL mode instead:

Code:

15:07:00:WU01:FS02:0x23:  Configuring platform CUDA
15:07:01:WU01:FS02:0x23:Failed to create CUDA context:
15:07:01:WU01:FS02:0x23:Error initializing FFT: 5
15:07:01:WU01:FS02:0x23:Attempting to create OpenCL context:
15:07:01:WU01:FS02:0x23:  Configuring platform OpenCL
15:07:29:WU01:FS02:0x23:  Using OpenCL on OpenCL platformId 1 and gpu 0
15:07:29:WU01:FS02:0x23:  GPU info: Platform: OpenCL: NVIDIA CUDA
15:07:29:WU01:FS02:0x23:  GPU info: PlatformIndex: 0
15:07:29:WU01:FS02:0x23:  GPU info: Device: NVIDIA GeForce RTX 4070
Others have reported that after a crash and driver reinitialization, the GPU sometimes stays stuck in a slow clock mode. I suspect that also happened here and caused the very slow processing until you rebooted.
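If you want to spot this fallback without reading the whole log, a small Python sketch along these lines could scan for it. It relies only on the message texts visible in the log excerpt above ("Failed to create CUDA context" followed later by "Using OpenCL on"). The function name `find_cuda_fallback` is hypothetical, and other core versions may word these messages differently.

```python
import re

# Strips the "HH:MM:SS:WUnn:FSnn:0xNN:" prefix from core log lines.
CORE_LINE = re.compile(r"^\d\d:\d\d:\d\d:WU\d+:FS\d+:0x[0-9a-fA-F]+:(.*)$")

def find_cuda_fallback(lines):
    """Return the CUDA error text if the core fell back to OpenCL, else None."""
    failed = False
    reason = None
    for line in lines:
        m = CORE_LINE.match(line.rstrip("\n"))
        if not m:
            continue
        msg = m.group(1).strip()
        if msg.startswith("Failed to create CUDA context"):
            # Start (or restart) tracking a failed CUDA attempt.
            failed, reason = True, None
        elif failed and msg.startswith("Using OpenCL on"):
            # CUDA failed earlier and OpenCL is now in use.
            return reason or "unknown"
        elif failed and reason is None:
            # The first message after the failure line carries the error.
            reason = msg
    return None
```

For the excerpt above it would return "Error initializing FFT: 5". It returns None when the log shows no failed CUDA attempt followed by an OpenCL start.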

Re: Frame time increases from 75 seconds to 90+ minutes for 16722 WU, after restart due to bad allocation.

Posted: Sun Oct 06, 2024 3:02 pm
by toTOW
After a GPU/driver crash, the GPU often stays in a low-power mode until you reboot the system.
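One way to check for that state on an NVIDIA card is to compare the current SM clock with its maximum while the WU is running. The Python sketch below assumes `nvidia-smi` is on the PATH and uses its `pstate`, `clocks.sm` and `clocks.max.sm` query fields. The helper names and the 50% threshold are arbitrary choices of mine, not anything FAHClient provides.

```python
import subprocess

def parse_gpu_state(csv_line):
    """Parse one 'pstate, clocks.sm, clocks.max.sm' line (noheader, nounits)."""
    pstate, current, maximum = (p.strip() for p in csv_line.strip().split(","))
    return pstate, int(current), int(maximum)

def looks_throttled(current_mhz, max_mhz, threshold=0.5):
    """Heuristic: clock is under `threshold` of maximum (only meaningful under load)."""
    return max_mhz > 0 and current_mhz < threshold * max_mhz

def query_gpu_state():
    """Ask nvidia-smi for the state of the first GPU it lists."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=pstate,clocks.sm,clocks.max.sm",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_gpu_state(out.splitlines()[0])
```

Calling `query_gpu_state()` while folding and passing the two clock values to `looks_throttled` gives a rough indication. A low clock on an idle GPU is normal, so the check only means something while a WU is actively running.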

Re: Frame time of more than 90 minutes for 16722 WU, after restart due to bad allocation.

Posted: Mon Oct 07, 2024 5:27 am
by ETA_2025
Joe_H wrote: Sun Oct 06, 2024 4:02 am I suspect you had a video driver crash that caused the bad allocation error, and the video system reinitialized. Apparently it also corrupted the checkpoint so the WU started over from the beginning. However the reinitialization of the drivers went, the client and folding core could not detect CUDA being available and started up in OpenCL mode:
...
Others in the past have reported sometimes having their GPU stuck in a slow clock mode as well after a driver reinitialization due to a crash, I suspect that also happened here and caused the very slow processing until you rebooted.
toTOW wrote: Sun Oct 06, 2024 3:02 pm After a GPU/driver crash, the GPU often stays in low power mode until you reboot the system ...
Thanks Joe_H and toTOW. It's good to know what to do if it happens again, though I don't know why the GPU/video driver crashed.