client v8.4.9 GPU WUs not completing properly
Posted: Tue May 06, 2025 1:29 pm
I'm not sure whether this is specific to one core or project, but I've been seeing an issue where the GPU runs a WU to completion and the core shuts down, but then the final steps never happen: the WU isn't finished off and unloaded, and no new one is fetched.
In other words, it runs at full speed to completion and logs the following messages:
04:21:10:I1:WU57:Saving result file ../logfile_01.txt
04:21:10:I1:WU57:Saving result file checkpointIntegrator.xml
04:21:10:I1:WU57:Saving result file checkpointState.xml.bz2
04:21:10:I1:WU57:Saving result file positions.xtc
04:21:10:I1:WU57:Saving result file science.log
04:21:10:I1:WU57:Saving result file xtcAtoms.csv.bz2
04:21:10:I1:WU57:Folding@home Core Shutdown: FINISHED_UNIT
And then it just... sits there. nvidia-smi shows that the core is still "running" on the GPU, but at 0% usage. That timestamp indicates that this WU finished 9 hours ago. Restarting the client causes it to load a new WU, but the old WU now shows as failed.
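The "9 hours" figure can be checked with a little time arithmetic. The sketch below is illustrative only: `stalled_since` and `FINISHED_RE` are made-up names, and it assumes the client log timestamps are UTC time-of-day with no date. That assumption fits the numbers in this post, since nvidia-smi was run at 09:23:37 local time on a machine whose core log reports a UTC offset of -4.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches client log lines such as
# "04:21:10:I1:WU57:Folding@home Core Shutdown: FINISHED_UNIT"
FINISHED_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}):\w+:(WU\d+):.*Core Shutdown: FINISHED_UNIT\s*$"
)

def stalled_since(log_lines, now_utc):
    """Return {wu: time elapsed since its last FINISHED_UNIT line}.

    Assumption: log timestamps are UTC time-of-day without a date, so a
    timestamp later than now_utc is taken to belong to the previous day.
    """
    result = {}
    for line in log_lines:
        m = FINISHED_RE.match(line.strip())
        if not m:
            continue
        h, mi, s, wu = int(m[1]), int(m[2]), int(m[3]), m[4]
        finished = now_utc.replace(hour=h, minute=mi, second=s, microsecond=0)
        if finished > now_utc:
            finished -= timedelta(days=1)
        result[wu] = now_utc - finished
    return result

log = [
    "04:21:10:I1:WU57:Saving result file xtcAtoms.csv.bz2",
    "04:21:10:I1:WU57:Folding@home Core Shutdown: FINISHED_UNIT",
]
# 09:23:37 local time at UTC offset -4 is 13:23:37 UTC.
now = datetime(2025, 5, 6, 13, 23, 37, tzinfo=timezone.utc)
for wu, idle in stalled_since(log, now).items():
    print(f"{wu} finished {idle} ago")  # WU57 finished 9:02:27 ago
```

On a live system, `now` would be `datetime.now(timezone.utc)` and `log` would be the lines of the client's log file.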
*I snipped the intermediate "Completed ... steps" lines out of the core log below*
wes@deathstar:~$ nvidia-smi
Tue May 6 09:23:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P2200                   On  |   00000000:01:00.0 Off |                  N/A |
| 44%   31C    P8              4W /   75W |     175MiB /   5120MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           69626      C   ...4bit-release-8.2.0/FahCore_26       170MiB |
+-----------------------------------------------------------------------------------------+
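The same check (a FahCore process still attached to the GPU while utilization reads 0%) can be scripted against nvidia-smi's CSV query mode instead of eyeballing the table. This is only a sketch: `run_nvidia_smi` and `idle_fahcores` are hypothetical helper names, the sample strings are an assumed rendering of the query output for the machine in this post, and it only looks at the first GPU. A single 0% reading doesn't prove a hang, so it would need to be sampled a few times.

```python
import subprocess

def run_nvidia_smi(*args):
    """Run nvidia-smi with the given arguments and return its stdout."""
    return subprocess.run(
        ["nvidia-smi", *args], check=True, capture_output=True, text=True
    ).stdout

def idle_fahcores(apps_csv, util_csv):
    """Return PIDs of FahCore processes on the GPU if utilization is 0%.

    apps_csv: output of
        nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader
    util_csv: output of
        nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits
    Single-GPU assumption: only the first utilization line is used.
    """
    util_lines = util_csv.strip().splitlines()
    if not util_lines or int(util_lines[0].strip()) != 0:
        return []
    pids = []
    for line in apps_csv.strip().splitlines():
        pid, name = (field.strip() for field in line.split(",", 1))
        if "FahCore" in name:
            pids.append(int(pid))
    return pids

# Assumed sample output, built from the PID and Exec path in this post.
sample_apps = (
    "69626, /var/lib/fah-client/cores/openmm-core-26/centos-7.9.2009-64bit/"
    "release/fahcore-26-centos-7.9.2009-64bit-release-8.2.0/FahCore_26\n"
)
sample_util = "0\n"
print(idle_fahcores(sample_apps, sample_util))  # [69626]
```

Against a real GPU, the two arguments would come from `run_nvidia_smi("--query-compute-apps=pid,process_name", "--format=csv,noheader")` and `run_nvidia_smi("--query-gpu=utilization.gpu", "--format=csv,noheader,nounits")`.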
*********************** Log Started 2025-05-05T22:48:20Z ***********************
*************************** Core26 Folding@home Core ***************************
Core: Core26
Type: 0x26
Version: 8.2.0
Author: Joseph Coffland <joseph@cauldrondevelopment.com>
Copyright: 2022 foldingathome.org
Homepage: https://foldingathome.org/
Date: Jan 7 2025
Time: 00:35:47
Revision: 4f149b599caa4725076ef2de3b47c8d7ce725787
Branch: HEAD
Compiler: GNU 7.5.0
Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
-fdata-sections -O3 -funroll-loops -fno-pie
-DOPENMM_VERSION="\"8.2.0\""
Platform: linux 6.8.0-1017-azure
Bits: 64
Mode: Release
Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
<peastman@stanford.edu>
Args: -dir uqwzvdZ49x1rE-bkUJEYtIJ7mlXg7TVdLQDTIeJ-HbA -suffix 01
-version 8.4.9 -lifeline 3588 -gpu-uuid
4980b18d-392b-58c5-5ee6-07f03d1988f1 -gpu-platform cuda -gpu-vendor
nvidia -cuda-platform 0 -cuda-device 0
************************************ libFAH ************************************
Date: Jan 7 2025
Time: 00:29:24
Revision: c7d2824a47eb025fa8cda8968c7a5e971585d90c
Branch: HEAD
Compiler: GNU 7.5.0
Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
-fdata-sections -O3 -funroll-loops -fno-pie
Platform: linux 6.8.0-1017-azure
Bits: 64
Mode: Release
************************************ CBang *************************************
Version: 1.7.2
Author: Joseph Coffland <joseph@cauldrondevelopment.com>
Org: Cauldron Development LLC
Copyright: Cauldron Development LLC, 2003-2024
Homepage: https://cauldrondevelopment.com/
License: LGPL-2.1-or-later
Date: Jan 7 2025
Time: 00:28:59
Revision: f1cd4c791e8c40a35dcfeab3ab85d910949cc0cb
Branch: HEAD
Compiler: GNU 7.5.0
Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
-fdata-sections -O3 -funroll-loops -fno-pie -fPIC
Platform: linux 6.8.0-1017-azure
Bits: 64
Mode: Release
************************************ System ************************************
CPU: Intel(R) Xeon(R) E-2244G CPU @ 3.80GHz
CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
CPUs: 8
Memory: 31.07GiB
Free Memory: 28.63GiB
OS Version: 6.1
Has Battery: false
On Battery: false
Hostname: deathstar
UTC Offset: -4
PID: 69626
CWD: /var/lib/fah-client/work
Exec: /var/lib/fah-client/cores/openmm-core-26/centos-7.9.2009-64bit/release/fahcore-26-centos-7.9.2009-64bit-release-8.2.0/FahCore_26
************************************ OpenMM ************************************
Version: 8.2.0
********************************************************************************
Project: 18243 (Run 367, Clone 2, Gen 4)
Reading tar file core.xml
Reading tar file integrator.xml
Reading tar file state.xml.bz2
Reading tar file system.xml.bz2
Digital signatures verified
Folding@home GPU Core26 Folding@home Core
Version 8.2
Checkpoint write interval: 50000 steps (2%) [50 total]
JSON viewer frame write interval: 25000 steps (1%) [100 total]
XTC frame write interval: 10000 steps (0.4%) [250 total]
TRR frame write interval: disabled
Global context and integrator variables write interval: disabled
There are 4 platforms available.
Platform 0: Reference
Platform 1: CPU
Platform 2: OpenCL
Platform 3: CUDA
cuda-device 0 specified
Attempting to create CUDA context:
Configuring platform CUDA
Using CUDA on CUDA Platform and gpu 0
GPU info: Platform: CUDA
GPU info: PlatformIndex: 0
GPU info: Device: Quadro P2200
GPU info: DeviceIndex: 0
GPU info: Vendor: 0x10de
GPU info: PCI: 01:00:00
GPU info: Compute: 6.1
GPU info: Driver: 12.9
GPU info: GPU: true
Completed 0 out of 2500000 steps (0%)
Checkpoint completed at step 0
Completed 25000 out of 2500000 steps (1%)
Completed 50000 out of 2500000 steps (2%)
[...]
Completed 2500000 out of 2500000 steps (100%)
Average performance: 43.0923 ns/day
Checkpoint completed at step 2500000
Saving result file ../logfile_01.txt
Saving result file checkpointIntegrator.xml
Saving result file checkpointState.xml.bz2
Saving result file positions.xtc
Saving result file science.log
Saving result file xtcAtoms.csv.bz2
Folding@home Core Shutdown: FINISHED_UNIT
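The core log above ends cleanly: the last progress line is at 100% and is followed by the FINISHED_UNIT shutdown line. A minimal sketch of that check in Python, for anyone comparing several logs, is below. `core_finished_cleanly` and `STEP_RE` are illustrative names, not part of any Folding@home tool, and the check says nothing about what the client does after the core shuts down.

```python
import re

STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps")

def core_finished_cleanly(log_text):
    """True if the last progress line is at 100% of the steps and a
    FINISHED_UNIT shutdown line appears after it."""
    done = total = last_end = None
    for m in STEP_RE.finditer(log_text):
        done, total = int(m[1]), int(m[2])
        last_end = m.end()
    if last_end is None or done != total:
        return False
    return "Core Shutdown: FINISHED_UNIT" in log_text[last_end:]

# Tail of the core log from this post.
tail = """\
Completed 2500000 out of 2500000 steps (100%)
Average performance: 43.0923 ns/day
Checkpoint completed at step 2500000
Saving result file ../logfile_01.txt
Saving result file checkpointIntegrator.xml
Saving result file checkpointState.xml.bz2
Saving result file positions.xtc
Saving result file science.log
Saving result file xtcAtoms.csv.bz2
Folding@home Core Shutdown: FINISHED_UNIT
"""
print(core_finished_cleanly(tail))  # True
```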