Linux: There is no registered Platform called "OpenCL"

Posted: Wed Apr 22, 2020 7:03 pm
by AndreasH
Hi!

I currently have about 130 CPU cores folding, most of them in my servers. I have one AMD GPU running fine, but for several days now I have been trying to get fahclient to run on my workstation's NVIDIA GPU, so far without success.
All my computers run Linux. On my workstation I have OpenSUSE LEAP 15.1 with current NVIDIA drivers.
I installed fahclient 7.6.9 from the 64-bit RPM.
The client does recognize the GPU, downloads a work unit and starts folding, but almost immediately stops with an error.

Here are some logs:

Code: Select all

16:52:33:WU00:FS01:Downloading 11.98MiB
16:52:39:WU00:FS01:Download 43.31%
16:52:45:WU00:FS01:Download 66.80%
16:52:49:WU00:FS01:Download complete
16:52:49:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:11747 run:0 clone:9683 gen:10 core:0x22 unit:0x0000001b8ca304e75e6baf22f72b2965
16:52:49:WU00:FS01:Starting
16:52:49:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 1475 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
16:52:49:WU00:FS01:Started FahCore on PID 28973
16:52:49:WU00:FS01:Core PID:28977
16:52:49:WU00:FS01:FahCore 0x22 started
16:52:50:WU00:FS01:0x22:*********************** Log Started 2020-04-22T16:52:49Z ***********************
16:52:50:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
16:52:50:WU00:FS01:0x22:       Type: 0x22
16:52:50:WU00:FS01:0x22:       Core: Core22
16:52:50:WU00:FS01:0x22:    Website: https://foldingathome.org/
16:52:50:WU00:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
16:52:50:WU00:FS01:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
16:52:50:WU00:FS01:0x22:             <rafal.wiewiora@choderalab.org>
16:52:50:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 28973 -checkpoint 15
16:52:50:WU00:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
16:52:50:WU00:FS01:0x22:             0 -gpu 0
16:52:50:WU00:FS01:0x22:     Config: <none>
16:52:50:WU00:FS01:0x22:************************************ Build *************************************
16:52:50:WU00:FS01:0x22:    Version: 0.0.2
16:52:50:WU00:FS01:0x22:       Date: Dec 6 2019
16:52:50:WU00:FS01:0x22:       Time: 21:20:17
16:52:50:WU00:FS01:0x22: Repository: Git
16:52:50:WU00:FS01:0x22:   Revision: f87d92b58abdf7e6bf2e173cfbc4dc3e837c7042
16:52:50:WU00:FS01:0x22:     Branch: core22
16:52:50:WU00:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
16:52:50:WU00:FS01:0x22:    Options: -std=gnu++98 -O3 -funroll-loops
16:52:50:WU00:FS01:0x22:   Platform: linux2 4.9.87-linuxkit-aufs
16:52:50:WU00:FS01:0x22:       Bits: 64
16:52:50:WU00:FS01:0x22:       Mode: Release
16:52:50:WU00:FS01:0x22:************************************ System ************************************
16:52:50:WU00:FS01:0x22:        CPU: AMD Ryzen 9 3900X 12-Core Processor
16:52:50:WU00:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
16:52:50:WU00:FS01:0x22:       CPUs: 24
16:52:50:WU00:FS01:0x22:     Memory: 62.85GiB
16:52:50:WU00:FS01:0x22:Free Memory: 48.94GiB
16:52:50:WU00:FS01:0x22:    Threads: POSIX_THREADS
16:52:50:WU00:FS01:0x22: OS Version: 4.12
16:52:50:WU00:FS01:0x22:Has Battery: false
16:52:50:WU00:FS01:0x22: On Battery: false
16:52:50:WU00:FS01:0x22: UTC Offset: 2
16:52:50:WU00:FS01:0x22:        PID: 28977
16:52:50:WU00:FS01:0x22:        CWD: /var/lib/fahclient/work
16:52:50:WU00:FS01:0x22:         OS: Linux 4.12.14-lp151.28.44-default x86_64
16:52:50:WU00:FS01:0x22:    OS Arch: AMD64
16:52:50:WU00:FS01:0x22:********************************************************************************
16:52:50:WU00:FS01:0x22:Project: 11747 (Run 0, Clone 9683, Gen 10)
16:52:50:WU00:FS01:0x22:Unit: 0x0000001b8ca304e75e6baf22f72b2965
16:52:50:WU00:FS01:0x22:Reading tar file core.xml
16:52:50:WU00:FS01:0x22:Reading tar file integrator.xml
16:52:50:WU00:FS01:0x22:Reading tar file state.xml
16:52:51:WU00:FS01:0x22:Reading tar file system.xml
16:52:52:WU00:FS01:0x22:Digital signatures verified
16:52:52:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
16:52:52:WU00:FS01:0x22:Version 0.0.2
16:52:52:WU00:FS01:0x22:ERROR:exception: There is no registered Platform called "OpenCL"
16:52:52:WU00:FS01:0x22:Saving result file ../logfile_01.txt
16:52:52:WU00:FS01:0x22:Saving result file science.log
16:52:52:WU00:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
16:52:52:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:52:52:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11747 run:0 clone:9683 gen:10 core:0x22 unit:0x0000001b8ca304e75e6baf22f72b2965
Please note the line:

Code: Select all

16:52:52:WU00:FS01:0x22:ERROR:exception: There is no registered Platform called "OpenCL"
I found this error mentioned in several posts, but none of them offered a working solution.

Another log from today:

Code: Select all

18:40:23:WU00:FS01:Assigned to work server 128.252.203.10
18:40:23:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:TU116 [GeForce GTX 1660] from 128.252.203.10
18:40:23:WU00:FS01:Connecting to 128.252.203.10:8080
18:42:34:WARNING:WU00:FS01:WorkServer connection failed on port 8080 trying 80
18:42:34:WU00:FS01:Connecting to 128.252.203.10:80
18:43:37:WU00:FS01:Downloading 29.59MiB
18:43:43:WU00:FS01:Download 13.31%
18:43:49:WU00:FS01:Download 27.25%
18:43:55:WU00:FS01:Download 43.93%
18:44:01:WU00:FS01:Download 56.82%
18:44:07:WU00:FS01:Download 69.70%
18:44:13:WU00:FS01:Download 83.64%
18:44:19:WU00:FS01:Download 95.26%
18:44:21:WU00:FS01:Download complete
18:44:21:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:11761 run:0 clone:1500 gen:45 core:0x22 unit:0x0000005480fccb0a5e6d7d2c494df012
18:44:21:WU00:FS01:Starting
18:44:21:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 1577 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
18:44:21:WU00:FS01:Started FahCore on PID 2322
18:44:21:WU00:FS01:Core PID:2326
18:44:21:WU00:FS01:FahCore 0x22 started
18:44:21:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:44:21:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:11761 run:0 clone:1500 gen:45 core:0x22 unit:0x0000005480fccb0a5e6d7d2c494df012
18:44:21:WU00:FS01:Uploading 7.00KiB to 128.252.203.10
18:44:21:WU00:FS01:Connecting to 128.252.203.10:8080
In this case I got a BAD_WORK_UNIT error without the 'no registered Platform called "OpenCL"' message.

And another one:

Code: Select all

18:44:22:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:TU116 [GeForce GTX 1660] from 52.224.109.74
18:44:22:WU01:FS01:Connecting to 52.224.109.74:8080
18:45:39:WU01:FS01:Downloading 161.51MiB
18:45:45:WU01:FS01:Download 2.32%
18:45:51:WU01:FS01:Download 4.60%
[...]
18:49:57:WU01:FS01:Download 97.40%
18:50:03:WU01:FS01:Download 99.99%
18:50:03:WU01:FS01:Download complete
18:50:03:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:13877 run:0 clone:1458 gen:37 core:0x22 unit:0x0000003234e06d4a5e80cfeac2e5a163
18:50:03:WU01:FS01:Starting
18:50:03:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 01 -suffix 01 -version 706 -lifeline 1577 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
18:50:03:WU01:FS01:Started FahCore on PID 2421
18:50:03:WU01:FS01:Core PID:2425
18:50:03:WU01:FS01:FahCore 0x22 started
18:50:03:WU01:FS01:0x22:*********************** Log Started 2020-04-22T18:50:03Z ***********************
18:50:03:WU01:FS01:0x22:*************************** Core22 Folding@home Core ***************************
18:50:03:WU01:FS01:0x22:       Type: 0x22
18:50:03:WU01:FS01:0x22:       Core: Core22
18:50:03:WU01:FS01:0x22:    Website: https://foldingathome.org/
18:50:03:WU01:FS01:0x22:  Copyright: (c) 2009-2018 foldingathome.org
18:50:03:WU01:FS01:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
18:50:03:WU01:FS01:0x22:             <rafal.wiewiora@choderalab.org>
18:50:03:WU01:FS01:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 2421 -checkpoint 15
18:50:03:WU01:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
18:50:03:WU01:FS01:0x22:             0 -gpu 0
18:50:03:WU01:FS01:0x22:     Config: <none>
18:50:03:WU01:FS01:0x22:************************************ Build *************************************
18:50:03:WU01:FS01:0x22:    Version: 0.0.2
18:50:03:WU01:FS01:0x22:       Date: Dec 6 2019
18:50:03:WU01:FS01:0x22:       Time: 21:20:17
18:50:03:WU01:FS01:0x22: Repository: Git
18:50:03:WU01:FS01:0x22:   Revision: f87d92b58abdf7e6bf2e173cfbc4dc3e837c7042
18:50:03:WU01:FS01:0x22:     Branch: core22
18:50:03:WU01:FS01:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
18:50:03:WU01:FS01:0x22:    Options: -std=gnu++98 -O3 -funroll-loops
18:50:03:WU01:FS01:0x22:   Platform: linux2 4.9.87-linuxkit-aufs
18:50:03:WU01:FS01:0x22:       Bits: 64
18:50:03:WU01:FS01:0x22:       Mode: Release
18:50:03:WU01:FS01:0x22:************************************ System ************************************
18:50:03:WU01:FS01:0x22:        CPU: AMD Ryzen 9 3900X 12-Core Processor
18:50:03:WU01:FS01:0x22:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
18:50:03:WU01:FS01:0x22:       CPUs: 24
18:50:03:WU01:FS01:0x22:     Memory: 62.85GiB
18:50:03:WU01:FS01:0x22:Free Memory: 49.31GiB
18:50:03:WU01:FS01:0x22:    Threads: POSIX_THREADS
18:50:03:WU01:FS01:0x22: OS Version: 4.12
18:50:03:WU01:FS01:0x22:Has Battery: false
18:50:03:WU01:FS01:0x22: On Battery: false
18:50:03:WU01:FS01:0x22: UTC Offset: 2
18:50:03:WU01:FS01:0x22:        PID: 2425
18:50:03:WU01:FS01:0x22:        CWD: /var/lib/fahclient/work
18:50:03:WU01:FS01:0x22:         OS: Linux 4.12.14-lp151.28.44-default x86_64
18:50:03:WU01:FS01:0x22:    OS Arch: AMD64
18:50:03:WU01:FS01:0x22:********************************************************************************
18:50:03:WU01:FS01:0x22:Project: 13877 (Run 0, Clone 1458, Gen 37)
18:50:03:WU01:FS01:0x22:Unit: 0x0000003234e06d4a5e80cfeac2e5a163
18:50:03:WU01:FS01:0x22:Reading tar file core.xml
18:50:03:WU01:FS01:0x22:Reading tar file integrator.xml
18:50:03:WU01:FS01:0x22:Reading tar file state.xml
18:50:03:WU01:FS01:0x22:Reading tar file system.xml
18:50:04:WU01:FS01:0x22:Digital signatures verified
18:50:04:WU01:FS01:0x22:Folding@home GPU Core22 Folding@home Core
18:50:04:WU01:FS01:0x22:Version 0.0.2
18:50:04:WU01:FS01:0x22:ERROR:exception: There is no registered Platform called "OpenCL"
18:50:04:WU01:FS01:0x22:Saving result file ../logfile_01.txt
18:50:04:WU01:FS01:0x22:Saving result file science.log
18:50:04:WU01:FS01:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
18:50:04:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:50:04:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:13877 run:0 clone:1458 gen:37 core:0x22 unit:0x0000003234e06d4a5e80cfeac2e5a163
18:50:04:WU01:FS01:Uploading 7.00KiB to 52.224.109.74
This time it logged the 'There is no registered Platform called "OpenCL"' error again.

I have the impression my NVIDIA CUDA/OpenCL installation is broken somehow.
I tried to check the installation as thoroughly as possible, but I just can't find the error.

Here's what I have found so far:

nvidia-smi tells me the installed driver and CUDA version:

Code: Select all

andreas@ws1:~> nvidia-smi        
Wed Apr 22 20:17:41 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1660    Off  | 00000000:2D:00.0  On |                  N/A |
| 47%   36C    P8     9W / 130W |    870MiB /  5941MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3875      G   /usr/bin/X                                   615MiB |
|    0      7417      G   kwin_x11                                      98MiB |
|    0      7422      G   /usr/bin/krunner                               2MiB |
|    0      7468      G   /usr/bin/nextcloud                             3MiB |
|    0      9004      G   /usr/lib64/firefox/firefox                     2MiB |
|    0     21618      G   /usr/bin/plasmashell                         138MiB |
+-----------------------------------------------------------------------------+
This looks ok IMHO.

Programs like hashcat are actually able to use the GPU:

Code: Select all

andreas@ws1:~> hashcat -b
hashcat (v3.00) starting in benchmark-mode...

OpenCL Platform #1: NVIDIA Corporation
======================================
- Device #1: GeForce GTX 1660, 1485/5941 MB allocatable, 22MCU

Hashtype: MD4

Speed.Dev.#1.: 32367.8 MH/s (95.09ms)

Hashtype: MD5

Speed.Dev.#1.: 17449.0 MH/s (97.95ms)

Hashtype: Half MD5

Speed.Dev.#1.: 11695.3 MH/s (96.60ms)

Hashtype: SHA1

Speed.Dev.#1.:  6489.4 MH/s (97.61ms)
...
fahclient also seems to find and identify the GPU, CUDA and OpenCL correctly:

Code: Select all

*********************** Log Started 2020-04-22T18:23:14Z ***********************
18:23:14:****************************** FAHClient ******************************
18:23:14:        Version: 7.6.9
18:23:14:         Author: Joseph Coffland <joseph@cauldrondevelopment.com>
18:23:14:      Copyright: 2020 foldingathome.org
18:23:14:       Homepage: https://foldingathome.org/
18:23:14:           Date: Apr 17 2020
18:23:14:           Time: 18:11:30
18:23:14:       Revision: 398c2b17fa535e0cc6c9d10856b2154c32771646
18:23:14:         Branch: master
18:23:14:       Compiler: GNU 4.9.4
18:23:14:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
18:23:14:                 -funroll-loops
18:23:14:       Platform: linux2 4.19.0-5-amd64
18:23:14:           Bits: 64
18:23:14:           Mode: Release
18:23:14:           Args: --child /etc/fahclient/config.xml --run-as fahclient
18:23:14:                 --pid-file=/var/run/fahclient.pid --daemon
18:23:14:         Config: /etc/fahclient/config.xml
18:23:14:******************************** CBang ********************************
18:23:14:           Date: Apr 17 2020
18:23:14:           Time: 18:10:08
18:23:14:       Revision: 2fb0be7809c5e45287a122ca5fbc15b5ae859a3b
18:23:14:         Branch: master
18:23:14:       Compiler: GNU 4.9.4
18:23:14:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
18:23:14:                 -funroll-loops -fPIC
18:23:14:       Platform: linux2 4.19.0-5-amd64
18:23:14:           Bits: 64
18:23:14:           Mode: Release
18:23:14:******************************* System ********************************
18:23:14:            CPU: AMD Ryzen 9 3900X 12-Core Processor
18:23:14:         CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
18:23:14:           CPUs: 24
18:23:14:         Memory: 62.85GiB
18:23:14:    Free Memory: 49.88GiB
18:23:14:        Threads: POSIX_THREADS
18:23:14:     OS Version: 4.12
18:23:14:    Has Battery: false
18:23:14:     On Battery: false
18:23:14:     UTC Offset: 2
18:23:14:            PID: 1577
18:23:14:            CWD: /var/lib/fahclient
18:23:14:             OS: Linux 4.12.14-lp151.28.44-default x86_64
18:23:14:        OS Arch: AMD64
18:23:14:           GPUs: 1
18:23:14:          GPU 0: Bus:45 Slot:0 Func:0 NVIDIA:7 TU116 [GeForce GTX 1660]
18:23:14:  CUDA Device 0: Platform:0 Device:0 Bus:45 Slot:0 Compute:7.5 Driver:10.2
18:23:14:OpenCL Device 0: Platform:0 Device:0 Bus:45 Slot:0 Compute:1.2 Driver:440.82
18:23:14:******************************* libFAH ********************************
18:23:14:           Date: Apr 15 2020
18:23:14:           Time: 21:43:27
18:23:14:       Revision: 216968bc7025029c841ed6e36e81a03a316890d3
18:23:14:         Branch: master
18:23:14:       Compiler: GNU 4.9.4
18:23:14:        Options: -std=c++11 -ffunction-sections -fdata-sections -O3
18:23:14:                 -funroll-loops
18:23:14:       Platform: linux2 4.19.0-5-amd64
18:23:14:           Bits: 64
18:23:14:           Mode: Release
18:23:14:***********************************************************************
fahclient runs as user fahclient, which belongs to the group video and should therefore be able to access the NVIDIA device files:

Code: Select all

andreas@ws1:/var/lib/fahclient> id fahclient
uid=454(fahclient) gid=100(users) Gruppen=100(users),484(video)
andreas@ws1:/var/lib/fahclient> ll /dev/nvidia*
crw-rw----+ 1 root video 195,   0 21. Apr 17:10 /dev/nvidia0
crw-rw----+ 1 root video 195, 255 21. Apr 17:10 /dev/nvidiactl
crw-rw----+ 1 root video 195, 254 21. Apr 17:10 /dev/nvidia-modeset
crw-rw-rw-+ 1 root root  241,   0 21. Apr 17:10 /dev/nvidia-uvm
crw-rw-rw-  1 root root  241,   1 21. Apr 17:17 /dev/nvidia-uvm-tools
I have the following packages installed, which might be relevant:

Code: Select all

andreas@ws1:/var/lib/fahclient> rpm -qa | grep -i "nvidia\|mesa\|icd\|clinfo" | sort
clinfo-2.2.18.04.06-lp151.2.3.x86_64
libOSMesa8-18.3.2-lp151.23.9.1.x86_64
libOSMesa8-32bit-18.3.2-lp151.23.9.1.x86_64
Mesa-18.3.2-lp151.23.9.1.x86_64
Mesa-32bit-18.3.2-lp151.23.9.1.x86_64
Mesa-demo-x-8.3.0-lp151.2.3.x86_64
Mesa-dri-18.3.2-lp151.23.9.1.x86_64
Mesa-dri-32bit-18.3.2-lp151.23.9.1.x86_64
Mesa-gallium-18.3.2-lp151.23.9.1.x86_64
Mesa-gallium-32bit-18.3.2-lp151.23.9.1.x86_64
Mesa-KHR-devel-18.3.2-lp151.23.9.1.x86_64
Mesa-libEGL1-18.3.2-lp151.23.9.1.x86_64
Mesa-libEGL-devel-18.3.2-lp151.23.9.1.x86_64
Mesa-libGL1-18.3.2-lp151.23.9.1.x86_64
Mesa-libGL1-32bit-18.3.2-lp151.23.9.1.x86_64
Mesa-libglapi0-18.3.2-lp151.23.9.1.x86_64
Mesa-libglapi0-32bit-18.3.2-lp151.23.9.1.x86_64
Mesa-libGL-devel-18.3.2-lp151.23.9.1.x86_64
Mesa-libGLESv1_CM1-18.3.2-lp151.23.9.1.x86_64
Mesa-libGLESv2-2-18.3.2-lp151.23.9.1.x86_64
Mesa-libva-18.3.2-lp151.23.9.1.x86_64
nvidia-computeG05-440.82-lp151.25.1.x86_64
nvidia-gfxG05-kmp-default-440.82_k4.12.14_lp151.27-lp151.25.1.x86_64
nvidia-glG05-440.82-lp151.25.1.x86_64
ocl-icd-devel-2.2.11-lp151.3.1.x86_64
x11-video-nvidiaG05-440.82-lp151.25.1.x86_64
Where else should I look?
What else can I do to get f@h to run on this GPU?

It seems I'm running out of ideas...

Any help to get that thing up and running is much appreciated!

- andreas

Re: Linux: There is no registered Platform called "OpenCL"

Posted: Wed Apr 22, 2020 8:25 pm
by Joe_H
According to reports here on the forum, some recent Linux distributions also require the fahclient user to be added to the 'render' group in addition to the 'video' group. I don't know whether that applies to your version of OpenSUSE.

Re: Linux: There is no registered Platform called "OpenCL"

Posted: Wed Apr 22, 2020 10:03 pm
by excelblue
The tipoff here is that it's trying to use OpenCL. Don't folding cores use CUDA when it's on nVIDIA?

Re: Linux: There is no registered Platform called "OpenCL"

Posted: Wed Apr 22, 2020 10:07 pm
by Joe_H
excelblue wrote:The tipoff here is that it's trying to use OpenCL. Don't folding cores use CUDA when it's on nVIDIA?
No, they use OpenCL on both AMD and nVidia currently. The F@h developers may release a CUDA version of the GPU folding core at some point in the future, but no timeframe has been given.

Re: Linux: There is no registered Platform called "OpenCL"

Posted: Wed Apr 22, 2020 10:11 pm
by AndreasH
Hi!

There is no group "render" in OpenSUSE LEAP 15.1.
But you made me think again!

I did a quick strace on the fahclient process, caught a failing GPU run, and look what I found in the trace:

The core is executed:

Code: Select all

set_robust_list(0x7efbf7fc6e60, 24)     = 0
openat(AT_FDCWD, "/dev/null", O_WRONLY) = 3
dup2(3, 1)                              = 1
dup2(3, 2)                              = 2
execve("/var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22", ["/var/lib/fahclient/cores/cores.f"..., "-dir", "01", "-suffix", "01", "-version", "706", "-lifeline", "8280", "-checkpoint", "15", "-gpu-vendor", "nvidia", "-opencl-platform", "0", "-opencl-device", "0", "-cuda-device", "0", "-gpu", "0"], 0x7ffbffffb1b0 /* 63 vars */) = 0
...
Then it looks up and loads all the needed libraries:

Code: Select all

stat("/var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah", {st_mode=S_IFDIR|0777, st_size=20, ...}) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=212383, ...}) = 0
mmap(NULL, 212383, PROT_READ, MAP_PRIVATE, 4, 0) = 0x7efbf7fca000
close(4)                                = 0
openat(AT_FDCWD, "/usr/lib64/libOpenCL.so.1", O_RDONLY|O_CLOEXEC) = 4
read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360$\0\0\0\0\0\0"..., 832) = 832
fstat(4, {st_mode=S_IFREG|0755, st_size=27504, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7efbf7fc8000
mmap(NULL, 2122704, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x7efbf7bd2000
mprotect(0x7efbf7bd9000, 2093056, PROT_NONE) = 0
mmap(0x7efbf7dd8000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x6000) = 0x7efbf7dd8000
close(4)                                = 0
...
It then does a lot of work, starts a child process, etc.
Eventually it opens the NVIDIA OpenCL library:

Code: Select all

openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 13
fstat(13, {st_mode=S_IFREG|0644, st_size=212383, ...}) = 0
mmap(NULL, 212383, PROT_READ, MAP_PRIVATE, 13, 0) = 0x7efbf7fca000
close(13)                               = 0
openat(AT_FDCWD, "/usr/lib64/libnvidia-opencl.so.1", O_RDONLY|O_CLOEXEC) = 13
read(13, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\223\f\0\0\0\0\0"..., 832) = 832
fstat(13, {st_mode=S_IFREG|0755, st_size=29201480, ...}) = 0
mmap(NULL, 31386152, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 13, 0) = 0x7efbf4322000
mprotect(0x7efbf5ddf000, 2097152, PROT_NONE) = 0
mmap(0x7efbf5fdf000, 1163264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 13, 0x1abd000) = 0x7efbf5fdf000
mmap(0x7efbf60fb000, 88616, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7efbf60fb000
close(13)                               = 0
...
Finally it tries to open /dev/nvidiactl in O_RDWR mode:

Code: Select all

openat(AT_FDCWD, "/proc/driver/nvidia/params", O_RDONLY) = 13
fstat(13, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(13, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 693
close(13)                               = 0
stat("/dev/nvidiactl", {st_mode=S_IFCHR|0660, st_rdev=makedev(195, 255), ...}) = 0
openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = -1 EACCES (Permission denied)
And this fails with EACCES!
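
The failing call can be reproduced outside the folding core with a few lines of Python. This is only a diagnostic sketch: `try_open_rdwr` is a hypothetical helper name, and the device paths are the ones from the listing above.

```python
import errno
import os


def try_open_rdwr(path):
    """Try to open `path` read/write; return "OK" or the errno name."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        # Map the numeric errno to its symbolic name, e.g. EACCES or ENOENT.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "OK"


if __name__ == "__main__":
    for device in ("/dev/nvidiactl", "/dev/nvidia0"):
        print(device, try_open_rdwr(device))
```

The result depends on the credentials of the process that runs it, so it should be started in the same way as the failing process. A shell opened with su or sudo may carry a different group list than a daemon that switched users on its own.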

And now the process writes the well-known error message, cleans up and terminates:

Code: Select all

...
write(6, "ERROR:exception: There is no reg"..., 64) = 64
write(1, "ERROR:exception: There is no reg"..., 64) = 64
write(6, "\n", 1)                       = 1
write(1, "\n", 1)                       = 1
kill(8280, SIG_0)                       = 0
...
So the process (running as user fahclient) is not allowed to open the device file /dev/nvidiactl, even though the user is a member of group video and /dev/nvidiactl should be readable and writable by this group:

Code: Select all

andreas@ws1:/var/lib/fahclient> id fahclient
uid=454(fahclient) gid=100(users) Gruppen=100(users),484(video)
andreas@ws1:/var/lib/fahclient> ll /dev/nvidiactl
crw-rw----+ 1 root video 195, 255 21. Apr 17:10 /dev/nvidiactl
This is strange, I have to admit.

I then quickly removed the option "--run-as" from the FAHClient start script (just for testing) and started the client as root.
And guess what: the GPU immediately got a work unit and started folding...!
It's been running fine for several minutes now, has already folded about 25% of its work unit, and I can feel the heat coming from the box under my desk... ;-)

So it's actually a simple permission problem!

It's a pity that the error message is misleading.
I wouldn't have thought of a permission problem; your reply pointed me in the right direction and gave me the idea of where to look next. Thank you so much!

I still have to find out why it can't open the file as user fahclient, given the group and file permission configuration (I don't want to let it run as root). But I'm confident I can manage that, now that I know where to look and that the overall setup of the NVIDIA driver and OpenCL subsystem is fine.

I hope this will help others facing the same problem.

- andreas

Re: Linux: There is no registered Platform called "OpenCL"

Posted: Wed Apr 22, 2020 10:30 pm
by Kebast
Glad you figured it out, I would certainly never have guessed that!

Re: Linux: There is no registered Platform called "OpenCL"

Posted: Thu Apr 23, 2020 1:31 pm
by AndreasH
Hi again!

I analyzed the problem further and to me it looks like a bug in the "run-as" implementation of the Linux FAHClient program.

Let me explain...

As I said before, on OpenSUSE LEAP 15.1 access to the /dev/nvidiactl device file is restricted to the user "root" and group "video":

Code: Select all

dyn254:~ # ll /dev/nvidiactl
crw-rw----+ 1 root video 195, 255 23. Apr 13:43 /dev/nvidiactl
(There are also file access control lists on this file; I'll come to that later.)

When installing the fahclient package, a user "fahclient" is automatically created, too.

In order to let the FAHClient process drop privileges, the FAHClient program provides a command-line switch "--run-as" which accepts a username.
When using this switch, the process tries to switch to the given user id before it starts working on any work unit.
The provided start script /etc/init.d/FAHClient uses this feature to let the FAHClient be started at boot time by the root user, but later run safely without root permissions as user "fahclient".

This is of course good practice. But IMHO it is not implemented correctly.

Why do I think so?

When I first noticed that the user "fahclient" needs access to the /dev/nvidiactl device file, I added the user to the "video" group.
This is IMHO the standard way on Unix and Unix-like systems to give unprivileged users access to selected resources, and OpenSUSE certainly uses this concept.

To prove it works, I first just switched to the fahclient user and then started the FAHClient program like this:

Code: Select all

dyn254:~ # su -s /bin/bash -l fahclient

fahclient@dyn254:~> id
uid=456(fahclient) gid=100(users) Gruppen=100(users),484(video)

fahclient@dyn254:~> /usr/bin/FAHClient /etc/fahclient/config.xml
In this case, the FAHClient process correctly gets access to the /dev/nvidiactl device file and starts folding on the GPU.

But when started as user "root" with the command-line option "--run-as", like this, it doesn't work:

Code: Select all

dyn254:~ # id
uid=0(root) gid=0(root) Gruppen=0(root)

dyn254:~ # /usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient
This is basically the start procedure used in the boot script.
But why does it fail?

I haven't found the source code for the program, so I couldn't check it, but with strace I found this:

Right after starting, the process loads the required libraries:

Code: Select all

execve("/usr/bin/FAHClient", ["/usr/bin/FAHClient", "/etc/fahclient/config.xml", "--run-as", "fahclient"], 0x7fff33bbef28 /* 59 vars */) = 0
brk(NULL)                               = 0x176e000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=168172, ...}) = 0
mmap(NULL, 168172, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0a8566c000
close(3)                                = 0
...
It then initializes itself and basically makes itself comfortable on the system.
Then comes the "run-as" part: the process first reads the necessary information for the given user ("fahclient"):

Code: Select all

openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 9
lseek(9, 0, SEEK_CUR)                   = 0
fstat(9, {st_mode=S_IFREG|0644, st_size=2377, ...}) = 0
mmap(NULL, 2377, PROT_READ, MAP_SHARED, 9, 0) = 0x7f0a85675000
lseek(9, 2377, SEEK_SET)                = 2377
munmap(0x7f0a85675000, 2377)            = 0
close(9)                                = 0
...
and then it executes the setuid(2) system call:

Code: Select all

setuid(456)                             = 0
456 is the userid of the "fahclient" user on this machine.

But that is all. FAHClient now just proceeds with its business logic: it writes into files, tries to acquire work units and starts child processes to do the actual work.

It does not perform the system calls needed to take on all the privileges of the user it just switched to!
In more detail, it uses neither the setgid(2) nor the setgroups(2) system call to switch to the primary group and supplementary groups of the "fahclient" user.
So the process continues with the group identifiers inherited from the starting process. In this case that is the group of the "root" user, which on a typical system is just "root" (gid 0) without any supplementary groups.

To prove this I added the user root to the "video" group and started FAHClient with the normal boot script with the --run-as switch.
And with this configuration it works: the process switches to the "fahclient" user, but inherits its supplementary groups from the "root" user (including the "video" group). Now it can open the /dev/nvidiactl device file successfully and can run a folding job on the GPU.
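
The effect of the inherited group list can be illustrated with a simplified version of the classic owner/group/other permission check. This is a sketch under stated assumptions, not the kernel's actual logic: `mode_allows_rw` is a hypothetical helper, it ignores ACLs, capabilities and the uid 0 override, and the numeric ids are the ones shown in this thread.

```python
import stat


def mode_allows_rw(st_mode, st_uid, st_gid, uid, gids):
    """Approximate the owner/group/other check for read+write access.

    Deliberately ignores ACLs, capabilities and the root (uid 0) override.
    """
    perms = stat.S_IMODE(st_mode)
    if uid == st_uid:
        need = stat.S_IRUSR | stat.S_IWUSR
    elif st_gid in gids:
        need = stat.S_IRGRP | stat.S_IWGRP
    else:
        need = stat.S_IROTH | stat.S_IWOTH
    return (perms & need) == need


# /dev/nvidiactl as listed above: crw-rw---- owner root (0), group video (484)
NVIDIACTL = (stat.S_IFCHR | 0o660, 0, 484)

# Logged in as fahclient: groups 100 (users) and 484 (video).
print(mode_allows_rw(*NVIDIACTL, uid=456, gids={100, 484}))  # True

# Started by root with --run-as: uid 456, but only root's group 0.
print(mode_allows_rw(*NVIDIACTL, uid=456, gids={0}))  # False
```

With root added to the video group, the second call would receive `gids={0, 484}` and return True as well, which matches the behaviour described above.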

You can see this here: "root" is now a member of the "video" group, and the "FAHClient" process started by root with the "--run-as fahclient" option runs with uid 456 (fahclient), group 0 (root) and supplementary group 484 (video):

Code: Select all

dyn254:~ # id
uid=0(root) gid=0(root) Gruppen=0(root),484(video)

dyn254:~ # cat /proc/8008/status 
Name:   FAHClient
Umask:  0000
State:  S (sleeping)
Tgid:   8008
Ngid:   0
Pid:    8008
PPid:   8006
TracerPid:      0
Uid:    456     456     456     456
Gid:    0       0       0       0
FDSize: 64
Groups: 0 484 
...
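
The same check can be scripted for any running process. The following is a minimal sketch that reads the Uid, Gid and Groups lines of /proc/<pid>/status on Linux; the function names are hypothetical, and the sample text is taken from the output above.

```python
def parse_credentials(status_text):
    """Extract the Uid, Gid and Groups lines from /proc/<pid>/status text."""
    creds = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("Uid", "Gid", "Groups"):
            creds[key] = [int(v) for v in value.split()]
    return creds


def read_credentials(pid="self"):
    """Read the credentials of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_credentials(fh.read())


# Lines taken from the /proc/8008/status output shown above.
sample = (
    "Name:\tFAHClient\n"
    "Uid:\t456\t456\t456\t456\n"
    "Gid:\t0\t0\t0\t0\n"
    "Groups:\t0 484 \n"
)
print(parse_credentials(sample))
# {'Uid': [456, 456, 456, 456], 'Gid': [0, 0, 0, 0], 'Groups': [0, 484]}
```

Calling `read_credentials(8008)` on the machine above would return the live values. If the Gid and Groups of the FAHClient process do not match the output of `id fahclient`, the process has not taken on the account's groups.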
So, to finally solve this problem I think we have several options:

1) Fix the "run-as" logic in FAHClient.
FAHClient should not only call setuid(2) with the uid of the user given on the command line, but also read the full group configuration of this user and call setgroups(2) and setgid(2) with all the groups of this user. These calls have to happen before setuid(2), because the process loses the privilege to change its groups once it has dropped root.
This is simple Unix application programming and should work on Linux and all Unix and Unix-like systems. See credentials(7) or any Unix programming book (like "Advanced Programming in the UNIX Environment" by the late W. Richard Stevens, which is still a classic) for more information.

2) One could work around this by adding the root user to the appropriate group (like "video" on OpenSUSE systems).
This is a little bit counter-intuitive, but as shown above it actually works.

3) One could set a special access control list on the /dev/nvidiactl device file to directly grant the fahclient user read/write access to this file.
See setfacl(1) for more information.

I would prefer (1)... ;-)
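
Option (1) could look roughly like the following. This is a Python sketch of the usual privilege-dropping sequence, not FAHClient's actual code (which I have not seen); the function names are hypothetical, and `drop_privileges` only succeeds when called by a root process.

```python
import os
import pwd


def target_credentials(username):
    """Look up uid, primary gid and all group ids of an account."""
    pw = pwd.getpwnam(username)
    # getgrouplist returns the primary gid plus every group that lists
    # the user as a member in the group database.
    groups = os.getgrouplist(username, pw.pw_gid)
    return pw.pw_uid, pw.pw_gid, groups


def drop_privileges(username):
    """Switch a root process to `username`: groups first, uid last."""
    uid, gid, groups = target_credentials(username)
    os.setgroups(groups)  # supplementary groups, e.g. video
    os.setgid(gid)        # primary group
    os.setuid(uid)        # last step: afterwards the groups can no longer be changed
```

The order matters: once setuid has switched a root process to an unprivileged uid, the calls that change the group ids would fail, so the uid has to be dropped last.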

I hope this analysis will help the developers of the FAHClient code to fix the "run-as" option.

- andreas (Unix developer since 1990 who had much fun analyzing this problem :-)

Re: Linux: There is no registered Platform called "OpenCL"

Posted: Thu Apr 23, 2020 4:42 pm
by Joe_H
Could you post up your findings on GitHub site for the F@h software - https://github.com/FoldingAtHome/fah-issues/issues? That could be useful for the developer and some others who have volunteered to work on the client software.

Re: Linux: There is no registered Platform called "OpenCL"

Posted: Thu Apr 23, 2020 8:17 pm
by AndreasH
Joe_H wrote:Could you post up your findings on GitHub site for the F@h software - https://github.com/FoldingAtHome/fah-issues/issues? That could be useful for the developer and some others who have volunteered to work on the client software.
Done.

See https://github.com/FoldingAtHome/fah-issues/issues/1418