I set up an AWS p2.16xlarge EC2 instance, which has 16 Tesla K80 GPUs. I wanted to bang out some workunits while also learning a bit about how AWS GPU instances work. I was assigned workunits for all slots, but once they started running, I saw reasonable ETAs of a few hours for 10 GPU slots and an ETA of 3 days for the other 6. This didn't make sense, as the base credit was roughly the same for each workunit.
I used the "nvidia-smi" command to check the resource utilization of each GPU, and this is what I found. Notice that 7 FahCore processes are all running on one GPU (GPU 1), while 6 GPUs (10 through 15) sit idle.
ubuntu@ip-172-31-33-143:/var/lib/fahclient/work/11/01$ nvidia-smi
Mon Mar 23 02:40:39 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:0F.0 Off | 0 |
| N/A 79C P0 147W / 149W | 159MiB / 11441MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:00:10.0 Off | 0 |
| N/A 60C P0 148W / 149W | 1001MiB / 11441MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 00000000:00:11.0 Off | 0 |
| N/A 80C P0 147W / 149W | 216MiB / 11441MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 00000000:00:12.0 Off | 0 |
| N/A 63C P0 145W / 149W | 121MiB / 11441MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 On | 00000000:00:13.0 Off | 0 |
| N/A 78C P0 148W / 149W | 158MiB / 11441MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 On | 00000000:00:14.0 Off | 0 |
| N/A 62C P0 148W / 149W | 158MiB / 11441MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 On | 00000000:00:15.0 Off | 0 |
| N/A 80C P0 145W / 149W | 159MiB / 11441MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 On | 00000000:00:16.0 Off | 0 |
| N/A 63C P0 149W / 149W | 121MiB / 11441MiB | 97% Default |
+-------------------------------+----------------------+----------------------+
| 8 Tesla K80 On | 00000000:00:17.0 Off | 0 |
| N/A 81C P0 150W / 149W | 159MiB / 11441MiB | 98% Default |
+-------------------------------+----------------------+----------------------+
| 9 Tesla K80 On | 00000000:00:18.0 Off | 0 |
| N/A 63C P0 151W / 149W | 205MiB / 11441MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 10 Tesla K80 On | 00000000:00:19.0 Off | 0 |
| N/A 40C P8 26W / 149W | 11MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 11 Tesla K80 On | 00000000:00:1A.0 Off | 0 |
| N/A 36C P8 31W / 149W | 11MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 12 Tesla K80 On | 00000000:00:1B.0 Off | 0 |
| N/A 36C P8 26W / 149W | 11MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 13 Tesla K80 On | 00000000:00:1C.0 Off | 0 |
| N/A 30C P8 30W / 149W | 11MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 14 Tesla K80 On | 00000000:00:1D.0 Off | 0 |
| N/A 39C P8 26W / 149W | 11MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 15 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 35C P8 31W / 149W | 11MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15920 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 148MiB |
| 1 15934 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 109MiB |
| 1 15948 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 109MiB |
| 1 15955 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 109MiB |
| 1 15962 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 109MiB |
| 1 15969 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 194MiB |
| 1 15976 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 205MiB |
| 1 19333 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 147MiB |
| 2 16190 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 205MiB |
| 3 16843 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 109MiB |
| 4 15927 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 147MiB |
| 5 16985 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 147MiB |
| 6 17325 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 148MiB |
| 7 17462 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 109MiB |
| 8 17469 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 148MiB |
| 9 19100 C ...org/v7/lin/64bit/Core_22.fah/FahCore_22 194MiB |
+-----------------------------------------------------------------------------+
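For anyone who wants to check for the same imbalance without eyeballing the table, here is a minimal Python sketch that counts compute processes per GPU from the text output above. It assumes the "Processes" table layout printed by this driver version (440.64.00) and process names without spaces; other driver versions may print different columns. The names `PROCESS_ROW`, `processes_per_gpu` and `find_imbalance` are my own and are not part of any NVIDIA or Folding@home tool.

```python
import re
from collections import Counter

# Matches one row of the "Processes" table in the layout shown above:
# | <gpu index> <pid> <type> <process name> <memory>MiB |
# Assumption: the process name contains no spaces.
PROCESS_ROW = re.compile(
    r"^\|\s+(\d+)\s+(\d+)\s+([CG+]+)\s+(\S+)\s+(\d+)MiB\s+\|$"
)


def processes_per_gpu(smi_output):
    """Return a Counter mapping GPU index -> number of listed processes."""
    counts = Counter()
    for line in smi_output.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts


def find_imbalance(smi_output, gpu_count):
    """Return (crowded, idle): GPUs with more than one process, and GPUs with none."""
    counts = processes_per_gpu(smi_output)
    crowded = sorted(gpu for gpu, n in counts.items() if n > 1)
    idle = [gpu for gpu in range(gpu_count) if counts[gpu] == 0]
    return crowded, idle
```

To try it on a live machine, the text can be captured with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and passed to `find_imbalance(text, 16)`. Applied to the output above, I would expect it to flag GPU 1 as crowded and GPUs 10 through 15 as idle.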
Let me know if you have any thoughts on this, or if you've seen similar behaviour.
Thanks!