Please use only one of the GPUs

gw666
Posts: 14
Joined: Thu Apr 09, 2020 8:53 am

Please use only one of the GPUs

Post by gw666 »

Hi everyone,

I'm trying to do some backfilling on a farm machine, just like the friends at CERN are doing. My setup is Scientific Linux 7.7 on x86_64; the machines all have two Xeon CPUs and 6 or 8 NVIDIA GPUs of several generations, in this example six NVIDIA Tesla P4s. I'm using the latest CUDA, 10.2.

Folding@Home isn't installed directly on the OS; instead I'm using CERN's Docker container lukasheinrich/folding:latest and running it with Singularity 3.5.3, using the --nv option to bind in the NVIDIA devices and libraries.

The CERN container is Ubuntu 18.04.1 with fahclient 7.5.1 installed.

I'm running everything in a batch system requesting 1 CPU core and 1 GPU. The command line options I've used are:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=$CUDA_VISIBLE_DEVICES --smp=false

CUDA_VISIBLE_DEVICES is set by the batch system, in this case to 0 for the first of 6 GPUs.
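
For reference, the whole invocation inside the batch job looks roughly like this (simplified; I'm pulling the image straight from Docker Hub in this sketch, and the actual job script also takes care of the working directory):

Code: Select all

# CUDA_VISIBLE_DEVICES is exported by the batch system, e.g. 0
singularity exec --nv docker://lukasheinrich/folding:latest \
  /usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 \
    --gpu=true --cuda-index=$CUDA_VISIBLE_DEVICES --smp=false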

It seems Folding@home still tries to fetch jobs for GPUs it can see but knows it cannot use:

Code: Select all

08:43:22:        Version: 7.5.1
08:43:22:           Date: May 11 2018
08:43:22:           Time: 19:59:04
08:43:22:     Repository: Git
08:43:22:       Revision: 4705bf53c635f88b8fe85af7675557e15d491ff0
08:43:22:         Branch: master
08:43:22:       Compiler: GNU 6.3.0 20170516
08:43:22:        Options: -std=gnu++98 -O3 -funroll-loops
08:43:22:       Platform: linux2 4.14.0-3-amd64
08:43:22:           Bits: 64
08:43:22:           Mode: Release
08:43:22:******************************* System ********************************
08:43:22:            CPU: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
08:43:22:         CPU ID: GenuineIntel Family 6 Model 85 Stepping 4
08:43:22:           CPUs: 64
08:43:22:         Memory: 376.20GiB
08:43:22:    Free Memory: 181.49GiB
08:43:22:        Threads: POSIX_THREADS
08:43:22:     OS Version: 3.10
08:43:22:    Has Battery: false
08:43:22:     On Battery: false
08:43:22:     UTC Offset: 2
08:43:22:            PID: 245167
08:43:22:            CWD: /batch/57329060.1.gpu.q
08:43:22:             OS: Linux 3.10.0-1062.7.1.el7.x86_64 x86_64
08:43:22:        OS Arch: AMD64
08:43:22:           GPUs: 0
08:43:22:  CUDA Device 0: Platform:0 Device:0 Bus:59 Slot:0 Compute:6.1 Driver:10.2
08:43:22:OpenCL Device 0: Platform:0 Device:0 Bus:59 Slot:0 Compute:1.2 Driver:440.33
08:43:22:***********************************************************************
08:43:22:<config>
08:43:22:  <!-- Folding Slots -->
08:43:22:</config>
08:43:22:Connecting to assign1.foldingathome.org:8080
08:43:22:Updated GPUs.txt
08:43:22:Read GPUs.txt
08:43:22:Trying to access database...
08:43:22:Successfully acquired database lock
08:43:22:FS00:Set client configured
08:43:22:Enabled folding slot 00: READY cpu:1
08:43:22:Enabled folding slot 01: READY gpu:0:GP104GL [Tesla P4]
08:43:22:Enabled folding slot 02: READY gpu:1:GP104GL [Tesla P4]
08:43:22:Enabled folding slot 03: READY gpu:2:GP104GL [Tesla P4]
08:43:22:Enabled folding slot 04: READY gpu:3:GP104GL [Tesla P4]
08:43:22:Enabled folding slot 05: READY gpu:4:GP104GL [Tesla P4]
08:43:22:Enabled folding slot 06: READY gpu:5:GP104GL [Tesla P4]
08:43:22:ERROR:No compute devices matched GPU #1 NVIDIA:5 GP104GL [Tesla P4].  You may need to update your graphics drivers.
08:43:22:ERROR:No compute devices matched GPU #2 NVIDIA:5 GP104GL [Tesla P4].  You may need to update your graphics drivers.
08:43:22:ERROR:No compute devices matched GPU #3 NVIDIA:5 GP104GL [Tesla P4].  You may need to update your graphics drivers.
08:43:22:ERROR:No compute devices matched GPU #4 NVIDIA:5 GP104GL [Tesla P4].  You may need to update your graphics drivers.
08:43:22:ERROR:No compute devices matched GPU #5 NVIDIA:5 GP104GL [Tesla P4].  You may need to update your graphics drivers.
and later:

Code: Select all

08:43:25:WU03:FS03:Requesting new work unit for slot 03: READY gpu:2:GP104GL [Tesla P4] from 40.114.52.201
08:43:25:WU03:FS03:Connecting to 40.114.52.201:8080
Sometimes GPU 0 is doing work as requested; sometimes only CPU tasks are running.

How can I tell Folding@home not to try to use the GPUs it cannot use?
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Please use only one of the GPUs

Post by foldy »

This is a bug in FAHClient: it sees 6 GPUs in hardware but only one OpenCL device (device 0), so it should create only one GPU slot. As a workaround, you need to edit the config.xml file manually and delete the other GPU slots.
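
Something like this, keeping only the CPU slot and one GPU slot (the slot ids here are just an example, match them to your generated file):

Code: Select all

<config>
  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'/>
</config>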
gw666
Posts: 14
Joined: Thu Apr 09, 2020 8:53 am

Re: Please use only one of the GPUs

Post by gw666 »

foldy wrote:This is a bug in FAHClient: it sees 6 GPUs in hardware but only one OpenCL device (device 0), so it should create only one GPU slot. As a workaround, you need to edit the config.xml file manually and delete the other GPU slots.
How do these slots correspond to the GPUs? In this case, the automatically generated config.xml looks like this:

Code: Select all

<config>
  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'/>
  <slot id='2' type='GPU'/>
  <slot id='3' type='GPU'/>
  <slot id='4' type='GPU'/>
  <slot id='5' type='GPU'/>
  <slot id='6' type='GPU'/>
</config>
Should I always remove slots 2 to 6? And what if the batch system sets CUDA_VISIBLE_DEVICES to a number other than 0?
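
If the slot layout is always the same, I could also write the trimmed config from the job script before starting the client, roughly like this (untested sketch; it assumes FAHClient picks up a config.xml in its working directory):

Code: Select all

# write a minimal config.xml with one CPU slot and one GPU slot
# before launching FAHClient (assumes the client's working directory)
cat > config.xml <<'EOF'
<config>
  <!-- Folding Slots -->
  <slot id='0' type='CPU'/>
  <slot id='1' type='GPU'/>
</config>
EOF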
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Please use only one of the GPUs

Post by foldy »

I can only compare with nvidia-ubuntu-docker-opencl: there, if the server has 6 GPUs and is configured for 6 Docker images with one GPU each, the OpenCL index is always 0 inside each Docker image. But internally each container is configured to use GPU 0 or 1 or 2 ...

So yes, in your configuration remove GPU slots 2 to 6. You can use the nvidia-smi command to check whether the different GPUs really get usage when a separate Docker image runs on each GPU.
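
For example, something like this on the host shows which physical GPUs actually have load (just one quick way to check):

Code: Select all

# per-GPU utilization and memory use, as seen on the host
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv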