Frame time increases from 75 seconds to 90+ minutes for 16722 WU, after restart due to bad allocation.

Moderators: Site Moderators, FAHC Science Team

ETA_2025
Posts: 72
Joined: Mon Jan 30, 2023 10:43 am
Hardware configuration: NVIDIA RTX 4070
10 x Raspberry Pi 5 Model B 2GB RAM
10 x Raspberry Pi 4 Model B 2GB RAM
Location: VIC, Australia

Frame time increases from 75 seconds to 90+ minutes for 16722 WU, after restart due to bad allocation.

Post by ETA_2025 »

My GPU is working on Project 16722 (Run 262, Clone 2, Gen 458), which at the current rate will take more than six days to complete. The timeout is three days.

It had completed 83% when the bad allocation occurred. Since the restart, it has been taking more than 90 minutes per frame, up from about 75 seconds before the restart!

Code:

15:01:47:WU01:FS02:0x23:Checkpoint completed at step 2000000
15:03:05:WU01:FS02:0x23:Completed 2025000 out of 2500000 steps (81%)
15:04:23:WU01:FS02:0x23:Completed 2050000 out of 2500000 steps (82%)
15:05:38:WU01:FS02:0x23:Completed 2075000 out of 2500000 steps (83%)
15:06:38:WU01:FS02:0x23:An exception occurred at step 2094343: bad allocation
15:06:38:WU01:FS02:0x23:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
15:06:38:WU01:FS02:0x23:Folding@home Core Shutdown: CORE_RESTART
15:06:38:WARNING:WU01:FS02:FahCore returned: CORE_RESTART (98 = 0x62)
15:06:38:WU01:FS02:Starting
15:06:38:WU01:FS02:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" A:\FAHClient\cores/cores.foldingathome.org/openmm-core-23/windows-10-64bit/release/0x23-8.0.3/Core_23.fah/FahCore_23.exe -dir 01 -suffix 01 -version 706 -lifeline 13876 -checkpoint 5 -opencl-platform 1 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
15:06:38:WU01:FS02:Started FahCore on PID 491712
15:06:39:WU01:FS02:Core PID:775060
15:06:39:WU01:FS02:FahCore 0x23 started
15:06:39:WU01:FS02:0x23:*********************** Log Started 2024-10-05T15:06:39Z ***********************
15:06:39:WU01:FS02:0x23:*************************** Core23 Folding@home Core ***************************
15:06:39:WU01:FS02:0x23:       Core: Core23
15:06:39:WU01:FS02:0x23:       Type: 0x23
15:06:39:WU01:FS02:0x23:    Version: 8.0.3
15:06:39:WU01:FS02:0x23:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:06:39:WU01:FS02:0x23:  Copyright: 2022 foldingathome.org
15:06:39:WU01:FS02:0x23:   Homepage: https://foldingathome.org/
15:06:39:WU01:FS02:0x23:       Date: Aug 3 2023
15:06:39:WU01:FS02:0x23:       Time: 08:39:06
15:06:39:WU01:FS02:0x23:   Compiler: Visual C++
15:06:39:WU01:FS02:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
15:06:39:WU01:FS02:0x23:             -DOPENMM_VERSION="\"8.0.0\""
15:06:39:WU01:FS02:0x23:   Platform: win32 10
15:06:39:WU01:FS02:0x23:       Bits: 64
15:06:39:WU01:FS02:0x23:       Mode: Release
15:06:39:WU01:FS02:0x23:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
15:06:39:WU01:FS02:0x23:             <peastman@stanford.edu>
15:06:39:WU01:FS02:0x23:       Args: -dir 01 -suffix 01 -version 706 -lifeline 491712 -checkpoint 5
15:06:39:WU01:FS02:0x23:             -opencl-platform 1 -opencl-device 0 -cuda-device 0 -gpu-vendor
15:06:39:WU01:FS02:0x23:             nvidia -gpu 0 -gpu-usage 100
15:06:39:WU01:FS02:0x23:************************************ libFAH ************************************
15:06:39:WU01:FS02:0x23:       Date: Aug 3 2023
15:06:39:WU01:FS02:0x23:       Time: 08:37:55
15:06:39:WU01:FS02:0x23:   Compiler: Visual C++
15:06:39:WU01:FS02:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
15:06:39:WU01:FS02:0x23:   Platform: win32 10
15:06:39:WU01:FS02:0x23:       Bits: 64
15:06:39:WU01:FS02:0x23:       Mode: Release
15:06:39:WU01:FS02:0x23:************************************ CBang *************************************
15:06:39:WU01:FS02:0x23:    Version: 1.7.2
15:06:39:WU01:FS02:0x23:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:06:39:WU01:FS02:0x23:        Org: Cauldron Development LLC
15:06:39:WU01:FS02:0x23:  Copyright: Cauldron Development LLC, 2003-2023
15:06:39:WU01:FS02:0x23:   Homepage: https://cauldrondevelopment.com/
15:06:39:WU01:FS02:0x23:    License: GPL 2+
15:06:39:WU01:FS02:0x23:       Date: Aug 3 2023
15:06:39:WU01:FS02:0x23:       Time: 08:37:14
15:06:39:WU01:FS02:0x23:   Compiler: Visual C++
15:06:39:WU01:FS02:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
15:06:39:WU01:FS02:0x23:   Platform: win32 10
15:06:39:WU01:FS02:0x23:       Bits: 64
15:06:39:WU01:FS02:0x23:       Mode: Release
15:06:39:WU01:FS02:0x23:************************************ System ************************************
15:06:39:WU01:FS02:0x23:        CPU: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
15:06:39:WU01:FS02:0x23:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 9
15:06:39:WU01:FS02:0x23:       CPUs: 8
15:06:39:WU01:FS02:0x23:     Memory: 31.70GiB
15:06:39:WU01:FS02:0x23:Free Memory: 8.64GiB
15:06:39:WU01:FS02:0x23:    Threads: WINDOWS_THREADS
15:06:39:WU01:FS02:0x23: OS Version: 6.2
15:06:39:WU01:FS02:0x23:Has Battery: false
15:06:39:WU01:FS02:0x23: On Battery: false
15:06:39:WU01:FS02:0x23: UTC Offset: 10
15:06:39:WU01:FS02:0x23:        PID: 775060
15:06:39:WU01:FS02:0x23:        CWD: A:\FAHClient\work
15:06:39:WU01:FS02:0x23:       Exec: A:\FAHClient\cores\cores.foldingathome.org\openmm-core-23\windows-10-64bit\release\0x23-8.0.3\Core_23.fah\FahCore_23.exe
15:06:39:WU01:FS02:0x23:************************************ OpenMM ************************************
15:06:39:WU01:FS02:0x23:    Version: 8.0.0
15:06:39:WU01:FS02:0x23:********************************************************************************
15:06:39:WU01:FS02:0x23:Project: 16722 (Run 262, Clone 2, Gen 458)
15:06:39:WU01:FS02:0x23:Digital signatures verified
15:06:39:WU01:FS02:0x23:Folding@home GPU Core23 Folding@home Core
15:06:39:WU01:FS02:0x23:Version 8.0.3
15:06:39:WU01:FS02:0x23:  Checkpoint write interval: 100000 steps (4%) [25 total]
15:06:39:WU01:FS02:0x23:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
15:06:39:WU01:FS02:0x23:  XTC frame write interval: 10000 steps (0.4%) [250 total]
15:06:39:WU01:FS02:0x23:  Global context and integrator variables write interval: disabled
15:06:40:WU01:FS02:0x23:There are 4 platforms available.
15:06:40:WU01:FS02:0x23:Platform 0: Reference
15:06:40:WU01:FS02:0x23:Platform 1: CPU
15:06:40:WU01:FS02:0x23:Platform 2: OpenCL
15:06:40:WU01:FS02:0x23:  opencl-device 0 specified
15:06:40:WU01:FS02:0x23:Platform 3: CUDA
15:06:40:WU01:FS02:0x23:  cuda-device 0 specified
15:07:00:WU01:FS02:0x23:Attempting to create CUDA context:
15:07:00:WU01:FS02:0x23:  Configuring platform CUDA
15:07:01:WU01:FS02:0x23:Failed to create CUDA context:
15:07:01:WU01:FS02:0x23:Error initializing FFT: 5
15:07:01:WU01:FS02:0x23:Attempting to create OpenCL context:
15:07:01:WU01:FS02:0x23:  Configuring platform OpenCL
15:07:29:WU01:FS02:0x23:  Using OpenCL on OpenCL platformId 1 and gpu 0
15:07:29:WU01:FS02:0x23:  GPU info: Platform: OpenCL: NVIDIA CUDA
15:07:29:WU01:FS02:0x23:  GPU info: PlatformIndex: 0
15:07:29:WU01:FS02:0x23:  GPU info: Device: NVIDIA GeForce RTX 4070
15:07:29:WU01:FS02:0x23:  GPU info: DeviceIndex: 0
15:07:29:WU01:FS02:0x23:  GPU info: Vendor: 0x10de
15:07:29:WU01:FS02:0x23:  GPU info: PCI: 01:00:00
15:07:29:WU01:FS02:0x23:  GPU info: Compute: 3.0
15:07:29:WU01:FS02:0x23:  GPU info: Driver: 561.9
15:07:29:WU01:FS02:0x23:  GPU info: GPU: true
15:07:29:WU01:FS02:0x23:Completed 0 out of 2500000 steps (0%)
16:42:33:WU01:FS02:0x23:Completed 25000 out of 2500000 steps (1%)
******************************* Date: 2024-10-05 *******************************
18:15:30:WU01:FS02:0x23:Completed 50000 out of 2500000 steps (2%)
19:51:07:WU01:FS02:0x23:Completed 75000 out of 2500000 steps (3%)
21:24:02:WU01:FS02:0x23:Completed 100000 out of 2500000 steps (4%)
21:24:07:WU01:FS02:0x23:Checkpoint completed at step 100000
22:55:10:WU01:FS02:0x23:Completed 125000 out of 2500000 steps (5%)
******************************* Date: 2024-10-06 *******************************
00:29:38:WU01:FS02:0x23:Completed 150000 out of 2500000 steps (6%)
02:04:24:WU01:FS02:0x23:Completed 175000 out of 2500000 steps (7%)

Rebooting the computer caused the WU to restart again, this time with the original 75-second frame time.
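
For anyone who wants to check frame times from their own log, here is a minimal Python sketch that reads the "Completed N out of M steps" lines and reports the seconds between consecutive progress lines. The function names (`frame_times`, `estimate_days`) are made up for this example and are not part of FAHClient. It assumes the line format shown in the log above, one progress line per 1% frame (100 frames per WU, as the core reports), and gaps shorter than 24 hours between progress lines.

```python
import re
from datetime import timedelta

# Matches progress lines such as
# "15:03:05:WU01:FS02:0x23:Completed 2025000 out of 2500000 steps (81%)"
PROGRESS = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):WU\d+:FS\d+:0x[0-9a-fA-F]+:"
    r"Completed (\d+) out of (\d+) steps"
)


def frame_times(log_lines):
    """Return (steps_done, seconds_since_previous_progress_line) pairs."""
    results = []
    prev = None  # (time of day, steps done) of the previous progress line
    for line in log_lines:
        m = PROGRESS.match(line.strip())
        if not m:
            continue  # date banners, checkpoints, errors, etc.
        h, mi, s, done, _total = map(int, m.groups())
        now = timedelta(hours=h, minutes=mi, seconds=s)
        # Only measure when progress moved forward; a drop in the step
        # count means the WU restarted, so that interval is skipped.
        if prev is not None and done > prev[1]:
            delta = (now - prev[0]).total_seconds()
            if delta < 0:
                delta += 24 * 3600  # log timestamps wrap at midnight
            results.append((done, delta))
        prev = (now, done)
    return results


def estimate_days(seconds_per_frame, frames=100):
    """Rough duration of a whole WU in days at a constant frame time."""
    return frames * seconds_per_frame / 86400
```

Feeding it the lines of the client log (for example `frame_times(open(path))`, with `path` pointing at your own log file) gives one entry per frame. With the numbers in this thread, the slow frames work out to well over six days for a full WU, against roughly two hours at the normal 75 seconds per frame.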

Can someone explain what happened?
Joe_H
Site Admin
Posts: 7920
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Frame time of more than 90 minutes for 16722 WU, after restart due to bad allocation.

Post by Joe_H »

I suspect you had a video driver crash that caused the bad allocation error, after which the video system reinitialized. Apparently it also corrupted the checkpoint, so the WU started over from the beginning. However the reinitialization of the drivers went, the client and folding core could not use CUDA afterwards and started up in OpenCL mode:

Code:

15:07:00:WU01:FS02:0x23:  Configuring platform CUDA
15:07:01:WU01:FS02:0x23:Failed to create CUDA context:
15:07:01:WU01:FS02:0x23:Error initializing FFT: 5
15:07:01:WU01:FS02:0x23:Attempting to create OpenCL context:
15:07:01:WU01:FS02:0x23:  Configuring platform OpenCL
15:07:29:WU01:FS02:0x23:  Using OpenCL on OpenCL platformId 1 and gpu 0
15:07:29:WU01:FS02:0x23:  GPU info: Platform: OpenCL: NVIDIA CUDA
15:07:29:WU01:FS02:0x23:  GPU info: PlatformIndex: 0
15:07:29:WU01:FS02:0x23:  GPU info: Device: NVIDIA GeForce RTX 4070

Others have reported in the past that their GPU sometimes gets stuck in a slow clock mode after a driver reinitialization following a crash. I suspect that also happened here and caused the very slow processing until you rebooted.
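
To spot this kind of fallback without reading the whole log, something like the following Python sketch could be used. It is only an illustration: `detect_platform_fallback` is a made-up helper, not a FAHClient feature. The "Failed to create CUDA context:" and "Using OpenCL ..." wording is taken from the log excerpt above; the matching "Using CUDA ..." wording for a successful CUDA start is an assumption, since it does not appear in this thread.

```python
import re

# Client log prefix, e.g. "15:07:01:WU01:FS02:0x23:"
PREFIX = re.compile(r"^\d{2}:\d{2}:\d{2}:WU\d+:FS\d+:0x[0-9a-fA-F]+:")
# "Using OpenCL ..." is from the log above; "Using CUDA ..." is assumed.
USING = re.compile(r"\bUsing (CUDA|OpenCL)\b")


def detect_platform_fallback(log_lines):
    """Report whether CUDA context creation failed and which platform ran."""
    result = {"cuda_failed": False, "reason": None, "platform": None}
    expect_reason = False
    for raw in log_lines:
        msg = PREFIX.sub("", raw.rstrip("\n"))
        if expect_reason:
            # The line right after the failure message carries the error text.
            result["reason"] = msg.strip()
            expect_reason = False
            continue
        if msg.strip().startswith("Failed to create CUDA context"):
            result["cuda_failed"] = True
            expect_reason = True
            continue
        m = USING.search(msg)
        if m and result["platform"] is None:
            result["platform"] = m.group(1)
    return result
```

A result with `cuda_failed` set and `platform` equal to "OpenCL" on an NVIDIA card matches the situation in this thread, where the core could not create a CUDA context and ran on OpenCL instead.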

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
toTOW
Site Moderator
Posts: 6347
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Frame time increases from 75 seconds to 90+ minutes for 16722 WU, after restart due to bad allocation.

Post by toTOW »

After a GPU/driver crash, the GPU often stays in low power mode until you reboot the system ...
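
On NVIDIA cards, one way to check for this before rebooting is to compare the current SM clock with the maximum while a WU is running. The Python sketch below parses the output of `nvidia-smi --query-gpu=pstate,clocks.sm,clocks.max.sm --format=csv,noheader,nounits`. The helper names and the 50% threshold are arbitrary choices for this example, and it assumes every field is reported as a number (a field shown as "[N/A]" would make the parsing fail). An idle GPU also sits at low clocks, so the result only means something while folding is active.

```python
import subprocess

# Query performance state plus current and maximum SM clocks (MHz).
QUERY = [
    "nvidia-smi",
    "--query-gpu=pstate,clocks.sm,clocks.max.sm",
    "--format=csv,noheader,nounits",
]


def parse_clock_row(row):
    """Parse one CSV row such as 'P8, 210, 3105' into a dict."""
    pstate, sm, sm_max = [field.strip() for field in row.split(",")]
    return {"pstate": pstate, "sm_mhz": int(sm), "sm_max_mhz": int(sm_max)}


def looks_throttled(info, ratio=0.5):
    """Heuristic only: flag a GPU running below `ratio` of its max SM clock."""
    return info["sm_mhz"] < ratio * info["sm_max_mhz"]


def read_gpu_clocks():
    """Run nvidia-smi and return one parsed dict per GPU."""
    out = subprocess.run(
        QUERY, capture_output=True, text=True, check=True
    ).stdout
    return [parse_clock_row(line) for line in out.splitlines() if line.strip()]
```

Calling `read_gpu_clocks()` and passing each entry to `looks_throttled` while the GPU is folding gives a quick yes/no. If it reports a throttled card after a driver crash, a reboot is the fix suggested in this thread.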

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
ETA_2025
Posts: 72
Joined: Mon Jan 30, 2023 10:43 am
Hardware configuration: NVIDIA RTX 4070
10 x Raspberry Pi 5 Model B 2GB RAM
10 x Raspberry Pi 4 Model B 2GB RAM
Location: VIC, Australia

Re: Frame time of more than 90 minutes for 16722 WU, after restart due to bad allocation.

Post by ETA_2025 »

Joe_H wrote: Sun Oct 06, 2024 4:02 am I suspect you had a video driver crash that caused the bad allocation error, and the video system reinitialized. Apparently it also corrupted the checkpoint so the WU started over from the beginning. However the reinitialization of the drivers went, the client and folding core could not detect CUDA being available and started up in OpenCL mode:
...
Others in the past have reported sometimes having their GPU stuck in a slow clock mode as well after a driver reinitialization due to a crash, I suspect that also happened here and caused the very slow processing until you rebooted.
toTOW wrote: Sun Oct 06, 2024 3:02 pm After a GPU/driver crash, the GPU often stays in low power mode until you reboot the system ...
Thanks Joe_H and toTOW. It's good to know what to do if it happens again, though I don't know why the GPU/video driver crashed.