client v8.4.9 GPU WUs not completing properly

Posted: Tue May 06, 2025 1:29 pm
by wesgeorge
I'm not sure whether it's specific to one core/project, but I've been seeing an issue where the GPU runs the WU to completion and the core shuts down, but the client never does the final steps: it doesn't unload that WU or fetch a new one.
In other words, it runs at full speed to completion, and we get the following messages:
04:21:10:I1:WU57:Saving result file ../logfile_01.txt
04:21:10:I1:WU57:Saving result file checkpointIntegrator.xml
04:21:10:I1:WU57:Saving result file checkpointState.xml.bz2
04:21:10:I1:WU57:Saving result file positions.xtc
04:21:10:I1:WU57:Saving result file science.log
04:21:10:I1:WU57:Saving result file xtcAtoms.csv.bz2
04:21:10:I1:WU57:Folding@home Core Shutdown: FINISHED_UNIT

And then it just... sits there. nvidia-smi shows that the core is still "running" on the GPU, but at 0% usage. That timestamp indicates that this WU finished 9 hours ago. Restarting the client causes it to load a new WU, but the old WU now shows as failed.
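
A minimal sketch of how the "finished 9 hours ago" figure can be derived from the client log, assuming the client log timestamps are time-of-day values in UTC and that the WU finished within the last 24 hours. The function names `finished_units` and `hours_since` are hypothetical helpers, not part of the FAH client.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches client log lines of the form "HH:MM:SS:I1:WU57:message".
LOG_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):\w+:WU(\d+):(.*)$")


def finished_units(log_lines):
    """Return {wu_number: time_of_day} for every FINISHED_UNIT shutdown line."""
    finished = {}
    for line in log_lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group(5).endswith("Core Shutdown: FINISHED_UNIT"):
            hours, minutes, seconds = (int(m.group(i)) for i in (1, 2, 3))
            finished[int(m.group(4))] = timedelta(
                hours=hours, minutes=minutes, seconds=seconds
            )
    return finished


def hours_since(finish_time_of_day, now_utc):
    """Hours from a log time-of-day (assumed UTC) to now_utc, wrapped to 24 h."""
    now_time_of_day = timedelta(
        hours=now_utc.hour, minutes=now_utc.minute, seconds=now_utc.second
    )
    # The log carries no date, so wrap around midnight instead of going negative.
    elapsed = (now_time_of_day - finish_time_of_day) % timedelta(days=1)
    return elapsed.total_seconds() / 3600
```

Calling `finished_units` on the log lines above and passing the result for WU 57 to `hours_since` together with `datetime.now(timezone.utc)` gives the time the finished core has been sitting idle.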

Code:

wes@deathstar:~$ nvidia-smi
Tue May  6 09:23:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P2200                   On  |   00000000:01:00.0 Off |                  N/A |
| 44%   31C    P8              4W /   75W |     175MiB /   5120MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           69626      C   ...4bit-release-8.2.0/FahCore_26        170MiB |
+-----------------------------------------------------------------------------------------+
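
The same check can be scripted rather than read off the table. This is only a sketch: it assumes a single GPU, and it relies on the CSV query modes of nvidia-smi (`--query-compute-apps` and `--query-gpu`). The helper names are hypothetical.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into dicts."""
    apps = []
    for row in csv_text.strip().splitlines():
        if row.count(",") < 2:
            continue  # skip blank or malformed rows
        # Split from both ends so a comma inside the path does not break parsing.
        pid, rest = row.split(",", 1)
        name, memory = rest.rsplit(",", 1)
        apps.append(
            {"pid": int(pid), "name": name.strip(), "used_memory": memory.strip()}
        )
    return apps


def idle_fahcores(apps_csv, gpu_util_percent):
    """Return FahCore processes still attached to a GPU that reports 0% load."""
    if gpu_util_percent > 0:
        return []
    return [a for a in parse_compute_apps(apps_csv) if "FahCore" in a["name"]]


def live_check():
    """Query nvidia-smi once and return idle FahCore processes (single GPU assumed)."""
    apps = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    util = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return idle_fahcores(apps, int(util.strip().splitlines()[0]))
```

A single 0% reading is not proof of a hang, since utilization can dip briefly between steps, so `live_check()` is best sampled a few times over several minutes.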

Code:

*********************** Log Started 2025-05-05T22:48:20Z ***********************
*************************** Core26 Folding@home Core ***************************
       Core: Core26
       Type: 0x26
    Version: 8.2.0
     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
  Copyright: 2022 foldingathome.org
   Homepage: https://foldingathome.org/
       Date: Jan 7 2025
       Time: 00:35:47
   Revision: 4f149b599caa4725076ef2de3b47c8d7ce725787
     Branch: HEAD
   Compiler: GNU 7.5.0
    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
             -fdata-sections -O3 -funroll-loops -fno-pie
             -DOPENMM_VERSION="\"8.2.0\""
   Platform: linux 6.8.0-1017-azure
       Bits: 64
       Mode: Release
Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
             <peastman@stanford.edu>
       Args: -dir uqwzvdZ49x1rE-bkUJEYtIJ7mlXg7TVdLQDTIeJ-HbA -suffix 01
             -version 8.4.9 -lifeline 3588 -gpu-uuid
             4980b18d-392b-58c5-5ee6-07f03d1988f1 -gpu-platform cuda -gpu-vendor
             nvidia -cuda-platform 0 -cuda-device 0
************************************ libFAH ************************************
       Date: Jan 7 2025
       Time: 00:29:24
   Revision: c7d2824a47eb025fa8cda8968c7a5e971585d90c
     Branch: HEAD
   Compiler: GNU 7.5.0
    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
             -fdata-sections -O3 -funroll-loops -fno-pie
   Platform: linux 6.8.0-1017-azure
       Bits: 64
       Mode: Release
************************************ CBang *************************************
    Version: 1.7.2
     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
        Org: Cauldron Development LLC
  Copyright: Cauldron Development LLC, 2003-2024
   Homepage: https://cauldrondevelopment.com/
    License: LGPL-2.1-or-later
       Date: Jan 7 2025
       Time: 00:28:59
   Revision: f1cd4c791e8c40a35dcfeab3ab85d910949cc0cb
     Branch: HEAD
   Compiler: GNU 7.5.0
    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
   Platform: linux 6.8.0-1017-azure
       Bits: 64
       Mode: Release
************************************ System ************************************
        CPU: Intel(R) Xeon(R) E-2244G CPU @ 3.80GHz
     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
       CPUs: 8
     Memory: 31.07GiB
Free Memory: 28.63GiB
 OS Version: 6.1
Has Battery: false
 On Battery: false
   Hostname: deathstar
 UTC Offset: -4
        PID: 69626
        CWD: /var/lib/fah-client/work
       Exec: /var/lib/fah-client/cores/openmm-core-26/centos-7.9.2009-64bit/release/fahcore-26-centos-7.9.2009-64bit-release-8.2.0/FahCore_26
************************************ OpenMM ************************************
    Version: 8.2.0
********************************************************************************
Project: 18243 (Run 367, Clone 2, Gen 4)
Reading tar file core.xml
Reading tar file integrator.xml
Reading tar file state.xml.bz2
Reading tar file system.xml.bz2
Digital signatures verified
Folding@home GPU Core26 Folding@home Core
Version 8.2
  Checkpoint write interval: 50000 steps (2%) [50 total]
  JSON viewer frame write interval: 25000 steps (1%) [100 total]
  XTC frame write interval: 10000 steps (0.4%) [250 total]
  TRR frame write interval: disabled
  Global context and integrator variables write interval: disabled
There are 4 platforms available.
Platform 0: Reference
Platform 1: CPU
Platform 2: OpenCL
Platform 3: CUDA
  cuda-device 0 specified
Attempting to create CUDA context:
  Configuring platform CUDA
  Using CUDA on CUDA Platform and gpu 0
  GPU info: Platform: CUDA
  GPU info: PlatformIndex: 0
  GPU info: Device: Quadro P2200
  GPU info: DeviceIndex: 0
  GPU info: Vendor: 0x10de
  GPU info: PCI: 01:00:00
  GPU info: Compute: 6.1
  GPU info: Driver: 12.9
  GPU info: GPU: true
  Completed 0 out of 2500000 steps (0%)
Checkpoint completed at step 0
Completed 25000 out of 2500000 steps (1%)
Completed 50000 out of 2500000 steps (2%)
*snip out the "completed... steps"*

Code:

Completed 2500000 out of 2500000 steps (100%)
Average performance: 43.0923 ns/day
Checkpoint completed at step 2500000
Saving result file ../logfile_01.txt
Saving result file checkpointIntegrator.xml
Saving result file checkpointState.xml.bz2
Saving result file positions.xtc
Saving result file science.log
Saving result file xtcAtoms.csv.bz2
Folding@home Core Shutdown: FINISHED_UNIT

Re: client v8.4.9 GPU WUs not completing properly

Posted: Tue May 06, 2025 1:37 pm
by muziqaz
Is this happening every GPU WU?
Are you behind some firewall, or network infrastructure which would block uploads?
Is CPU doing anything during that time?
Is there any network activity during the "waiting" period?
That looks like some sort of compute instance running some docker, no?
Those always bring their own can of worms to the troubleshooting process

Re: client v8.4.9 GPU WUs not completing properly

Posted: Tue May 06, 2025 11:52 pm
by arisu
Can you post more of the client logs, including when you restart it and it downloads a new WU? The big log you posted looks like the science.log, which is only the output of the core, not the client, and the core does not look like the problem.

Also please post the output of this command while the WU is stuck (but before restarting the client):

Code:

sudo lsof -E -i -a -u fah-client
It will output information about the network connections for any process running as the "fah-client" user.

Re: client v8.4.9 GPU WUs not completing properly

Posted: Wed May 07, 2025 12:48 pm
by wesgeorge
muziqaz wrote: Tue May 06, 2025 1:37 pm
Is this happening every GPU WU?
Are you behind some firewall, or network infrastructure which would block uploads?
Is CPU doing anything during that time?
Is there any network activity during the "waiting" period?
That looks like some sort of compute instance running some docker, no?
Those always bring their own can of worms to the troubleshooting process

It's not Docker; I gave up on that when it became clear that it was not really supported by anyone anymore.
It's a direct install on Debian 12. The CPU is running its own WU, and networking is not blocked, as CPU WUs are completing normally.

That said, this appears not to be a FAH problem. This is a relatively new install on a new (to me) machine. While troubleshooting this and a few other things, I discovered a bunch of kernel errors and stack traces in the system logs. My initial research indicates they may come from an issue between the 6.x kernel and the virtualization extensions for PCI passthrough (VT-d) that interrupts communication with the GPU (and my RAID card!) over PCIe. I'm not using that feature, so I disabled it in the BIOS. Since then the stack traces have disappeared, and the client has completed a few GPU WUs successfully over the last 18 hours or so.

Code:

May 06 10:17:36 deathstar kernel: DMAR: ERROR: DMA PTE for vPFN 0x617b3 already set (to 617b3003 not 14b082001)
May 06 10:17:36 deathstar kernel:  nv_dma_map_alloc+0x543/0x590 [nvidia]
May 06 10:17:36 deathstar kernel: ------------[ cut here ]------------
May 06 10:17:36 deathstar kernel: WARNING: CPU: 5 PID: 164 at drivers/iommu/intel/iommu.c:2305 __domain_mapping.cold+0x3a/0x41
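
For anyone checking their own machine for the same symptom, here is a rough sketch that scans kernel log text for DMAR errors and kernel warnings of the kind shown above. The patterns are written against these sample lines only and are an assumption about the general format; `scan_kernel_log` is a hypothetical helper.

```python
import re

# Patterns modelled on the journal lines quoted in this post.
DMAR_ERROR = re.compile(r"kernel: DMAR: ERROR: (?P<detail>.*)$")
KERNEL_WARNING = re.compile(
    r"kernel: WARNING: .* at (?P<source>\S+):(?P<line>\d+)"
)


def scan_kernel_log(lines):
    """Collect DMAR error details and (source_file, line_number) of warnings."""
    findings = {"dmar_errors": [], "warnings": []}
    for line in lines:
        m = DMAR_ERROR.search(line)
        if m:
            findings["dmar_errors"].append(m.group("detail").strip())
            continue
        m = KERNEL_WARNING.search(line)
        if m:
            findings["warnings"].append((m.group("source"), int(m.group("line"))))
    return findings
```

The input can be the lines of `journalctl -k -b --no-pager`, which prints kernel messages from the current boot. An empty result after disabling VT-d would match what is described above.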

Re: client v8.4.9 GPU WUs not completing properly

Posted: Wed May 07, 2025 3:22 pm
by muziqaz
Result!

Re: client v8.4.9 GPU WUs not completing properly

Posted: Thu May 08, 2025 1:55 am
by arisu
Check occasionally whether the kernel issue has been fixed, because VT-d (Intel's name for their IOMMU) improves performance and security, so you don't want to keep it off forever.

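There is a way to disable the IOMMU for graphics only. If your CPU is an Intel, add this to the kernel boot parameters (like in GRUB). Note that "igfx" refers to Intel integrated graphics, so it may not change anything for a discrete NVIDIA card: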

Code:

intel_iommu=igfx_off
That sometimes helps when the GPU is causing DMA problems.
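
After editing the boot parameters and rebooting, it is worth confirming that the kernel actually received them. A small sketch, assuming a Linux system where the running kernel's command line is readable at /proc/cmdline; the helper names are hypothetical.

```python
def iommu_options(cmdline):
    """Return the comma-separated values of every intel_iommu= parameter."""
    options = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "intel_iommu" and sep:
            options.extend(v for v in value.split(",") if v)
    return options


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

If `iommu_options(read_cmdline())` does not include `igfx_off` after a reboot, the bootloader configuration was probably not regenerated.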