FAHClient does not support more than 10 GPUs

Post by **caseymdk** » Mon Mar 23, 2020 4:59 am

Hi all,

I set up an AWS p2.16xlarge EC2 instance, which has 16 Tesla K80 GPUs in it. I wanted to bang out some workunits while also learning a bit about how AWS GPU instances work. I was assigned workunits for all slots, but once they started running, I saw reasonable ETAs of a few hours for 10 GPU slots and an ETA of 3 days for the other 6. This didn't make sense, as the base credit was roughly the same for each workunit.

I used the "nvidia-smi" command to check the resource utilization of each GPU, and this is what I found. Notice how 7 FahCore processes are all on GPU 1, while GPUs 10-15 are idle.

ubuntu@ip-172-31-33-143:/var/lib/fahclient/work/11/01$ nvidia-smi
Mon Mar 23 02:40:39 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:0F.0 Off |                    0 |
| N/A   79C    P0   147W / 149W |    159MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:10.0 Off |                    0 |
| N/A   60C    P0   148W / 149W |   1001MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:00:11.0 Off |                    0 |
| N/A   80C    P0   147W / 149W |    216MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:00:12.0 Off |                    0 |
| N/A   63C    P0   145W / 149W |    121MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:00:13.0 Off |                    0 |
| N/A   78C    P0   148W / 149W |    158MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:00:14.0 Off |                    0 |
| N/A   62C    P0   148W / 149W |    158MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:00:15.0 Off |                    0 |
| N/A   80C    P0   145W / 149W |    159MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:00:16.0 Off |                    0 |
| N/A   63C    P0   149W / 149W |    121MiB / 11441MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   8  Tesla K80           On   | 00000000:00:17.0 Off |                    0 |
| N/A   81C    P0   150W / 149W |    159MiB / 11441MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   9  Tesla K80           On   | 00000000:00:18.0 Off |                    0 |
| N/A   63C    P0   151W / 149W |    205MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|  10  Tesla K80           On   | 00000000:00:19.0 Off |                    0 |
| N/A   40C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  Tesla K80           On   | 00000000:00:1A.0 Off |                    0 |
| N/A   36C    P8    31W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  Tesla K80           On   | 00000000:00:1B.0 Off |                    0 |
| N/A   36C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  13  Tesla K80           On   | 00000000:00:1C.0 Off |                    0 |
| N/A   30C    P8    30W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  14  Tesla K80           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   39C    P8    26W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  15  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    31W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     15920      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   148MiB |
|    1     15934      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    1     15948      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    1     15955      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    1     15962      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    1     15969      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   194MiB |
|    1     15976      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   205MiB |
|    1     19333      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   147MiB |
|    2     16190      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   205MiB |
|    3     16843      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    4     15927      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   147MiB |
|    5     16985      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   147MiB |
|    6     17325      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   148MiB |
|    7     17462      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    8     17469      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   148MiB |
|    9     19100      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   194MiB |
+-----------------------------------------------------------------------------+
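For anyone who wants to check this without reading the table by eye, here is a minimal Python sketch that counts FahCore processes per GPU from the "Processes" section of the nvidia-smi text output. It assumes the table layout of the 440.64 driver shown above (other driver versions lay this table out differently), and the helper names are made up for illustration.

```python
import re
from collections import Counter

# Matches one row of the "Processes" table in the layout shown above, e.g.
# |    1     15934      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
PROCESS_ROW = re.compile(r"^\|\s+(\d+)\s+(\d+)\s+C\s+(\S+)\s+(\d+)MiB\s+\|$")


def fahcore_processes_per_gpu(smi_text):
    """Count FahCore compute processes per GPU index (hypothetical helper)."""
    counts = Counter()
    for line in smi_text.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match and "FahCore" in match.group(3):
            counts[int(match.group(1))] += 1
    return counts


def idle_gpus(counts, gpu_total):
    """Return the GPU indexes that have no FahCore process."""
    return [gpu for gpu in range(gpu_total) if counts[gpu] == 0]


# Three rows copied from the output above, used as a tiny sample.
SAMPLE = """\
|    0     15920      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   148MiB |
|    1     15934      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
|    1     15948      C   ...org/v7/lin/64bit/Core_22.fah/FahCore_22   109MiB |
"""

counts = fahcore_processes_per_gpu(SAMPLE)
print(dict(counts))
print(idle_gpus(counts, 4))
```

On the instance itself, the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the pasted sample, with `gpu_total` set to 16.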

I tried explicitly setting the GPU index on each slot (0 through 15), but after reloading Folding@home with the new config, I still saw the same behaviour. Maybe the GPU indexes from Folding@home are being passed to CUDA or the driver in decimal when they need to be in hex? No idea.

Let me know if anyone has any thoughts on this, or if you've seen similar behaviour.

Thanks!

Post by **Jesse_V** » Mon Mar 23, 2020 6:39 am

Pretty cool setup, thanks and welcome to the forum.

The GPU indices should be in standard decimal notation. Did you change the "opencl-index" or "cuda-index" option?

It might be helpful to post some of the log. Any clues in there?

Post by **caseymdk** » Mon Mar 23, 2020 6:49 am

Looks like the indexes are being passed to FahCore correctly. The fact that multiple slots all ended up on GPU 1 makes me think that it's only taking the first character of the argument, though I can't see why it would be doing that. This is the log from when the OpenCL, GPU, and CUDA indexes were left at -1. Everything looks correct, but this instance ended up on GPU 1 instead of GPU 14.

02:08:06:WU15:FS15:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/Core_22.fah/FahCore_22 -dir 15 -suffix 01 -version 705 -lifeline 15900 -checkpoint 3 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 14 -cuda-device 14 -gpu 14
02:08:06:WU15:FS15:Started FahCore on PID 15944
02:08:06:Started thread 16 on PID 15900
02:08:06:WU15:FS15:Core PID:15948
02:08:06:WU15:FS15:FahCore 0x22 started
02:08:06:WU14:FS14:0x22:*********************** Log Started 2020-03-23T02:08:06Z ***********************
02:08:06:WU14:FS14:0x22:*************************** Core22 Folding@home Core ***************************
02:08:06:WU14:FS14:0x22:       Type: 0x22
02:08:06:WU14:FS14:0x22:       Core: Core22
02:08:06:WU14:FS14:0x22:    Website: https://foldingathome.org/
02:08:06:WU14:FS14:0x22:  Copyright: (c) 2009-2018 foldingathome.org
02:08:06:WU14:FS14:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
02:08:06:WU14:FS14:0x22:             <rafal.wiewiora@choderalab.org>
02:08:06:WU14:FS14:0x22:       Args: -dir 14 -suffix 01 -version 705 -lifeline 15937 -checkpoint 3
02:08:06:WU14:FS14:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 13
02:08:06:WU14:FS14:0x22:             -cuda-device 13 -gpu 13
02:08:06:WU14:FS14:0x22:     Config: <none>
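To check the device indexes across many slots without reading each line by hand, a small Python sketch like the following could pull the numeric device options out of a "Running FahCore" log line. The option names are the ones visible in the log above; the function name and regular expression are illustrative assumptions, not part of FAHClient.

```python
import re

# Only the three numeric device options are captured. "-gpu-vendor nvidia" and
# "-opencl-platform 0" are skipped because they do not fit the pattern.
DEVICE_ARG = re.compile(r"(?<!\S)-(gpu|cuda-device|opencl-device)\s+(\d+)")


def device_args(log_line):
    """Extract numeric device options from a log line (hypothetical helper)."""
    return {name: int(value) for name, value in DEVICE_ARG.findall(log_line)}


# Tail of the "Running FahCore" line for FS15, copied from the log above.
LINE = (
    "-dir 15 -suffix 01 -version 705 -lifeline 15900 -checkpoint 3 "
    "-gpu-vendor nvidia -opencl-platform 0 -opencl-device 14 "
    "-cuda-device 14 -gpu 14"
)

print(device_args(LINE))
```

If every slot reports the index you expect here, the client side is doing its job and the mix-up happens after the arguments reach the core.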

Post by **_r2w_ben** » Mon Mar 23, 2020 10:48 am

Someone else came across this with 13 GPUs. As you've noticed, the core seems to only parse the first character of the argument into a digit.
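As a sanity check on that theory, here is a minimal Python sketch of what "only the first character is parsed" would do on a 16-GPU machine. It is a hypothetical model of the suspected behaviour, not the actual FahCore parsing code, and the names are made up for illustration.

```python
from collections import Counter


def first_char_only(arg):
    """Hypothetical model of the suspected bug: read only the first character."""
    return int(arg[0])


# GPU index passed on the command line -> index the core would end up using.
effective = {index: first_char_only(str(index)) for index in range(16)}

# How many cores would land on each physical GPU under this model.
load = Counter(effective.values())

for index, used in effective.items():
    print(f"-gpu {index:>2} -> GPU {used}")
print("processes on GPU 1:", load[1])
```

Under this model, indexes 10 through 15 all collapse to GPU 1, which together with the slot that really belongs there gives the 7 processes on GPU 1 and the 6 idle GPUs seen in the nvidia-smi output from the first post. A hex mix-up would not produce that pattern, since "10" read as hex is 16, which is out of range on this machine.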

Post by **foldy** » Mon Mar 23, 2020 11:07 am

https://github.com/FoldingAtHome/fah-issues/issues/1245

Post by **cfhdev** » Thu Apr 23, 2020 4:01 pm

Now that a new beta version is out, has anyone tried it with more than 10 GPUs? I just want to see if this is still an issue with 7.6.10.

Post by **cfhdev** » Mon Apr 27, 2020 11:02 pm

Well, since I received no answer, I tried it on a 16-GPU system and got the same results: 0-9 are fine, but 10 and up are treated as index 1.

Post by **MeeLee** » Tue Apr 28, 2020 1:38 am

With modern hardware it's less practical to hit 10 GPUs.
That's especially true for cards that are good for crunching: 1.5 kW at the outlet can drive about 6 or 7 GPUs, 8 if you tune them.
It makes more sense to run 2 powerful GPUs (RTX 2080 Ti) than 10 slower ones (e.g. GT 1030, GTX 1050, or older Kepler GPUs).
I believe the trend will continue, meaning that in 5 years it will make more sense to run one or two GPUs that are modern at that time than to run relatively 'older' RTX 2080 Ti GPUs.
