GPU WUs taking much longer than expected
Posted: Sat Feb 20, 2021 11:49 pm
Don't want to bump the thread so I'll edit it up here.
It's been a while since I posted this, but it ended up having a boring diagnosis: a bad GPU! I RMA'd the card and plugged in the replacement the company sent me. Happily, everything now works fine and I got through a GPU task with no problem.
Hello, I've attached my logs at the bottom of the post.
I turned off CPU WUs to try to figure out what's going on with GPU performance - the GPU works fine in other programs (including video streaming with NVENC and GPU-accelerated DaVinci Resolve exports), but while running F@H it seems to be performing very poorly.
The program is giving me an estimated PPD of 17402, and the ETA for this job (worth 34k credits) is 1.74 days - the previous job it worked on had a similar ETA and took about as long. The project is 17335, though I don't think there's anything wrong with the project or its WUs.
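As a rough cross-check of those figures, dividing the job's credit by its ETA gives a points-per-day rate that can be compared with the client's estimate. This is only a sketch: `points_per_day` is a hypothetical helper, "34k" is a rounded value from the post, and the client may compute its own PPD figure differently, so only the order of magnitude is meaningful.

```python
def points_per_day(credit: float, eta_days: float) -> float:
    """Naive rate: total credit divided by the time needed to earn it."""
    return credit / eta_days

credit = 34_000        # "34k credits" from the post (rounded)
eta_days = 1.74        # ETA reported by the client
reported_ppd = 17_402  # PPD estimate reported by the client

estimate = points_per_day(credit, eta_days)
print(f"naive estimate: {estimate:.0f} PPD, client reports {reported_ppd} PPD")
```

The two numbers land in the same ballpark, which suggests the ETA and PPD shown by the client are at least consistent with each other; the problem is the low rate itself, not the display.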
The account/passkey pair is working; it shows 36 WUs successfully completed at a 100% success rate.
The GPU's clock seems to be running high, but the temperature is just 37 °C, which is basically idle temperature.
Task Manager, for what it's worth, shows high usage only in the GPU "Copy" graph, and 0% for Compute_0, Compute_1, and CUDA.
I left the task running from 23:15 to 23:45 and the situation is the same (the GPU is apparently still only being used in the Copy graph). Or are some jobs just expected to be in this phase, and not in the CUDA/compute phase, for that long?
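The Task Manager readings above can be cross-checked from the command line with `nvidia-smi`, which ships with the NVIDIA driver. The sketch below is a minimal illustration, not part of F@H: the helper names, the 10% threshold, and the sample line in the usage note are all assumptions made for the example.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they appear in its CSV output.
QUERY_FIELDS = ["utilization.gpu", "temperature.gpu", "clocks.sm"]

def parse_gpu_sample(csv_line: str) -> dict:
    """Parse one line of `--format=csv,noheader,nounits` output into floats."""
    row = next(csv.reader(io.StringIO(csv_line), skipinitialspace=True))
    return {field: float(value) for field, value in zip(QUERY_FIELDS, row)}

def looks_idle(sample: dict, util_threshold: float = 10.0) -> bool:
    """Flag a GPU whose utilization is below an (arbitrary) threshold."""
    return sample["utilization.gpu"] < util_threshold

def read_gpu_sample() -> dict:
    """Query the first GPU via nvidia-smi; requires the NVIDIA driver tools."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_sample(result.stdout.splitlines()[0])
```

For instance, `looks_idle(parse_gpu_sample("3, 37, 1905"))` returns `True` for a made-up reading of 3% utilization at 37 °C. On a machine with the driver installed, `read_gpu_sample()` can be called every few seconds while the WU is running to see whether compute utilization ever rises.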
Thanks for your time - please let me know if I'm missing anything.
*********************** Log Started 2021-02-20T23:14:21Z ***********************
23:14:21:******************************* libFAH ********************************
23:14:21: Date: Oct 20 2020
23:14:21: Time: 13:36:55
23:14:21: Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
23:14:21: Branch: master
23:14:21: Compiler: Visual C++ 2015
23:14:21: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
23:14:21: Platform: win32 10
23:14:21: Bits: 32
23:14:21: Mode: Release
23:14:21:****************************** FAHClient ******************************
23:14:21: Version: 7.6.21
23:14:21: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
23:14:21: Copyright: 2020 foldingathome.org
23:14:21: Homepage: https://foldingathome.org/
23:14:21: Date: Oct 20 2020
23:14:21: Time: 13:41:04
23:14:21: Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
23:14:21: Branch: master
23:14:21: Compiler: Visual C++ 2015
23:14:21: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
23:14:21: Platform: win32 10
23:14:21: Bits: 32
23:14:21: Mode: Release
23:14:21: Args: --open-web-control
23:14:21: Config: C:\ProgramData\FAHClient\config.xml
23:14:21:******************************** CBang ********************************
23:14:21: Date: Oct 20 2020
23:14:21: Time: 11:36:18
23:14:21: Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
23:14:21: Branch: master
23:14:21: Compiler: Visual C++ 2015
23:14:21: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
23:14:21: Platform: win32 10
23:14:21: Bits: 32
23:14:21: Mode: Release
23:14:21:******************************* System ********************************
23:14:21: CPU: AMD Ryzen 9 5900X 12-Core Processor
23:14:21: CPU ID: AuthenticAMD Family 25 Model 33 Stepping 0
23:14:21: CPUs: 24
23:14:21: Memory: 31.93GiB
23:14:21: Free Memory: 24.88GiB
23:14:21: Threads: WINDOWS_THREADS
23:14:21: OS Version: 6.2
23:14:21: Has Battery: false
23:14:21: On Battery: false
23:14:21: UTC Offset: -8
23:14:21: PID: 4716
23:14:21: CWD: C:\ProgramData\FAHClient
23:14:21: Win32 Service: false
23:14:21: OS: Windows 10 Enterprise
23:14:21: OS Arch: AMD64
23:14:21: GPUs: 1
23:14:21: GPU 0: Bus:7 Slot:0 Func:0 NVIDIA:8 TU104 [GeForce RTX 2070 SUPER]
23:14:21: 8218
23:14:21: CUDA Device 0: Platform:0 Device:0 Bus:7 Slot:0 Compute:7.5 Driver:11.2
23:14:21:OpenCL Device 0: Platform:0 Device:0 Bus:7 Slot:0 Compute:1.2 Driver:461.40
23:14:21:***********************************************************************
23:14:21:<config>
23:14:21: <!-- Network -->
23:14:21: <proxy v=':8080'/>
23:14:21:
23:14:21: <!-- Slot Control -->
23:14:21: <power v='full'/>
23:14:21:
23:14:21: <!-- User Information -->
23:14:21: <passkey v='*****'/>
23:14:21: <team v='------'/>
23:14:21: <user v='------------'/>
23:14:21:
23:14:21: <!-- Folding Slots -->
23:14:21: <slot id='0' type='CPU'>
23:14:21: <paused v='true'/>
23:14:21: </slot>
23:14:21: <slot id='1' type='GPU'>
23:14:21: <paused v='true'/>
23:14:21: <pci-bus v='7'/>
23:14:21: <pci-slot v='0'/>
23:14:21: </slot>
23:14:21:</config>
23:14:21:Trying to access database...
23:14:21:Successfully acquired database lock
23:14:21:FS00:Initialized folding slot 00: cpu:23
23:14:21:FS01:Initialized folding slot 01: gpu:7:0 TU104 [GeForce RTX 2070 SUPER] 8218
23:14:22:3:127.0.0.1:New Web session
23:14:57:FS01:Unpaused
23:14:57:WU00:FS01:Starting
23:14:57:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.13/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 4716 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
23:14:57:WU00:FS01:Started FahCore on PID 25640
23:14:57:WU00:FS01:Core PID:5212
23:14:57:WU00:FS01:FahCore 0x22 started
23:14:58:WU00:FS01:0x22:*********************** Log Started 2021-02-20T23:14:57Z ***********************
23:14:58:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
23:14:58:WU00:FS01:0x22: Core: Core22
23:14:58:WU00:FS01:0x22: Type: 0x22
23:14:58:WU00:FS01:0x22: Version: 0.0.13
23:14:58:WU00:FS01:0x22: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
23:14:58:WU00:FS01:0x22: Copyright: 2020 foldingathome.org
23:14:58:WU00:FS01:0x22: Homepage: https://foldingathome.org/
23:14:58:WU00:FS01:0x22: Date: Sep 19 2020
23:14:58:WU00:FS01:0x22: Time: 02:35:58
23:14:58:WU00:FS01:0x22: Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
23:14:58:WU00:FS01:0x22: Branch: core22-0.0.13
23:14:58:WU00:FS01:0x22: Compiler: Visual C++ 2015
23:14:58:WU00:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
23:14:58:WU00:FS01:0x22: -DOPENMM_GIT_HASH="\"189320d0\""
23:14:58:WU00:FS01:0x22: Platform: win32 10
23:14:58:WU00:FS01:0x22: Bits: 64
23:14:58:WU00:FS01:0x22: Mode: Release
23:14:58:WU00:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
23:14:58:WU00:FS01:0x22: <peastman@stanford.edu>
23:14:58:WU00:FS01:0x22: Args: -dir 00 -suffix 01 -version 706 -lifeline 25640 -checkpoint 15
23:14:58:WU00:FS01:0x22: -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
23:14:58:WU00:FS01:0x22: nvidia -gpu 0 -gpu-usage 100
23:14:58:WU00:FS01:0x22:************************************ libFAH ************************************
23:14:58:WU00:FS01:0x22: Date: Sep 7 2020
23:14:58:WU00:FS01:0x22: Time: 19:09:56
23:14:58:WU00:FS01:0x22: Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
23:14:58:WU00:FS01:0x22: Branch: HEAD
23:14:58:WU00:FS01:0x22: Compiler: Visual C++ 2015
23:14:58:WU00:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
23:14:58:WU00:FS01:0x22: Platform: win32 10
23:14:58:WU00:FS01:0x22: Bits: 64
23:14:58:WU00:FS01:0x22: Mode: Release
23:14:58:WU00:FS01:0x22:************************************ CBang *************************************
23:14:58:WU00:FS01:0x22: Date: Sep 7 2020
23:14:58:WU00:FS01:0x22: Time: 19:08:30
23:14:58:WU00:FS01:0x22: Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
23:14:58:WU00:FS01:0x22: Branch: HEAD
23:14:58:WU00:FS01:0x22: Compiler: Visual C++ 2015
23:14:58:WU00:FS01:0x22: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
23:14:58:WU00:FS01:0x22: Platform: win32 10
23:14:58:WU00:FS01:0x22: Bits: 64
23:14:58:WU00:FS01:0x22: Mode: Release
23:14:58:WU00:FS01:0x22:************************************ System ************************************
23:14:58:WU00:FS01:0x22: CPU: AMD Ryzen 9 5900X 12-Core Processor
23:14:58:WU00:FS01:0x22: CPU ID: AuthenticAMD Family 25 Model 33 Stepping 0
23:14:58:WU00:FS01:0x22: CPUs: 24
23:14:58:WU00:FS01:0x22: Memory: 31.93GiB
23:14:58:WU00:FS01:0x22:Free Memory: 24.79GiB
23:14:58:WU00:FS01:0x22: Threads: WINDOWS_THREADS
23:14:58:WU00:FS01:0x22: OS Version: 6.2
23:14:58:WU00:FS01:0x22:Has Battery: false
23:14:58:WU00:FS01:0x22: On Battery: false
23:14:58:WU00:FS01:0x22: UTC Offset: -8
23:14:58:WU00:FS01:0x22: PID: 5212
23:14:58:WU00:FS01:0x22: CWD: C:\ProgramData\FAHClient\work
23:14:58:WU00:FS01:0x22:************************************ OpenMM ************************************
23:14:58:WU00:FS01:0x22: Revision: 189320d0
23:14:58:WU00:FS01:0x22:********************************************************************************
23:14:58:WU00:FS01:0x22:Project: 17335 (Run 17, Clone 616, Gen 10)
23:14:58:WU00:FS01:0x22:Unit: 0x00000000000000000000000000000000
23:14:58:WU00:FS01:0x22:Digital signatures verified
23:14:58:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
23:14:58:WU00:FS01:0x22:Version 0.0.13
23:14:58:WU00:FS01:0x22: Checkpoint write interval: 15000 steps (2%) [50 total]
23:14:58:WU00:FS01:0x22: JSON viewer frame write interval: 7500 steps (1%) [100 total]
23:14:58:WU00:FS01:0x22: XTC frame write interval: 250000 steps (33%) [3 total]
23:14:58:WU00:FS01:0x22: Global context and integrator variables write interval: disabled
23:14:58:WU00:FS01:0x22:There are 4 platforms available.
23:14:58:WU00:FS01:0x22:Platform 0: Reference
23:14:58:WU00:FS01:0x22:Platform 1: CPU
23:14:58:WU00:FS01:0x22:Platform 2: OpenCL
23:14:58:WU00:FS01:0x22: opencl-device 0 specified
23:14:58:WU00:FS01:0x22:Platform 3: CUDA
23:14:58:WU00:FS01:0x22: cuda-device 0 specified
23:15:07:WU00:FS01:0x22:Attempting to create CUDA context:
23:15:07:WU00:FS01:0x22: Configuring platform CUDA
23:15:22:Removing old file 'configs/config-20210217-045908.xml'
23:15:22:Saving configuration to config.xml
23:15:22:<config>
23:15:22: <!-- Network -->
23:15:22: <proxy v=':8080'/>
23:15:22:
23:15:22: <!-- Slot Control -->
23:15:22: <power v='full'/>
23:15:22:
23:15:22: <!-- User Information -->
23:15:22: <passkey v='*****'/>
23:15:22: <team v='------'/>
23:15:22: <user v='------------'/>
23:15:22:
23:15:22: <!-- Folding Slots -->
23:15:22: <slot id='0' type='CPU'>
23:15:22: <paused v='true'/>
23:15:22: </slot>
23:15:22: <slot id='1' type='GPU'>
23:15:22: <pci-bus v='7'/>
23:15:22: <pci-slot v='0'/>
23:15:22: </slot>
23:15:22:</config>