please add nvidia "PH402 SKU 200"

Posted: Sat Aug 17, 2024 9:33 am
by 84036980
This is a very rare Tesla GPU with two GP100 chips.

0x10de:0x15fa:3:0:0:NVIDIA Corporation:
0x10de:0x15fa:4:0:0:NVIDIA Corporation:
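
For anyone scripting against lines like these, here is a minimal Python sketch that splits one into typed fields. The field meanings are an assumption, inferred only from the third field (3 and 4) matching the bus numbers in the Bus-Id column of the nvidia-smi output below (03:00.0 and 04:00.0). `GpuLine` and `parse_gpu_line` are hypothetical names, not part of any Folding@home tool.

```python
from typing import NamedTuple


class GpuLine(NamedTuple):
    vendor_id: int    # PCI vendor ID (0x10de is NVIDIA)
    device_id: int    # PCI device ID
    bus: int          # assumed: PCI bus number
    slot: int         # assumed: PCI slot (device) number
    function: int     # assumed: PCI function number
    vendor_name: str
    description: str  # empty for an unrecognized GPU


def parse_gpu_line(line: str) -> GpuLine:
    """Split a colon-separated GPU line into fields (assumed 7-field layout)."""
    parts = line.strip().split(":")
    if len(parts) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(parts)}")
    vendor, device, bus, slot, function, name, desc = parts
    return GpuLine(
        int(vendor, 16), int(device, 16),
        int(bus), int(slot), int(function),
        name, desc,
    )


gpu = parse_gpu_line("0x10de:0x15fa:3:0:0:NVIDIA Corporation:")
print(hex(gpu.device_id), gpu.bus)
```

The empty trailing field is kept as an empty description rather than dropped, so a line that later gains a description parses the same way.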


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  PH402 SKU 200       On   | 00000000:03:00.0 Off |                    0 |
| N/A   44C    P0    36W / 140W |      0MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  PH402 SKU 200       On   | 00000000:04:00.0 Off |                    0 |
| N/A   40C    P0    36W / 140W |      0MiB / 32630MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Re: please add nvidia "PH402 SKU 200"

Posted: Sat Aug 17, 2024 12:20 pm
by toTOW
Your drivers are too old: current GPU cores require CUDA 11.2 (and a CPU with SSE4.2).
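
One way to check this from nvidia-smi output is to compare the reported CUDA version against 11.2. This is a minimal Python sketch, assuming the header contains a "CUDA Version: X.Y" field as in the output posted above. `cuda_version` and `meets_minimum` are hypothetical helper names, and the SSE4.2 requirement is not checked here.

```python
import re
from typing import Optional, Tuple

MIN_CUDA = (11, 2)  # minimum CUDA version stated in this thread


def cuda_version(smi_text: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from a 'CUDA Version: X.Y' field, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(smi_text: str, minimum: Tuple[int, int] = MIN_CUDA) -> bool:
    """True if the reported CUDA version is at least `minimum`."""
    version = cuda_version(smi_text)
    return version is not None and version >= minimum


# Header line from the first post (driver 430.64).
header = "| NVIDIA-SMI 430.64 Driver Version: 430.64 CUDA Version: 10.1 |"
print(cuda_version(header), meets_minimum(header))  # (10, 1) False
```

Versions are compared as integer tuples rather than strings or floats, so that a version such as 11.10 would sort after 11.2.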

Added the following GPUs that were missing:
0x15fa / GP100GL [DGX Station / PH402 SKU 200]
0x15fb / GP100GL [GP100 SKU 200]
0x15fc / GP100GL [Tesla P100-DGXS-16GB]
0x15ff / GP100GL [GP100 SKU 15ff]

Re: please add nvidia "PH402 SKU 200"

Posted: Sat Aug 17, 2024 5:59 pm
by 84036980
Great! Thanks for the information.
I'm trying to find a newer driver to work with this GPU.
Some of the most recent driver versions crash the OS with this GPU.

Re: please add nvidia "PH402 SKU 200"

Posted: Sat Aug 17, 2024 9:29 pm
by 84036980
After some debugging, it turns out that the crash was caused by a bad NVLink between the two chips.

Now I'm able to get it to work with the latest driver by disabling NVLink when loading the driver.

Code: Select all

modprobe nvidia NVreg_RegistryDwords="RMNvLinkControl=1"

Code: Select all

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA PH402 SKU 200           On  |   00000000:03:00.0 Off |                    0 |
| N/A   47C    P0             89W /  140W |     313MiB /  32768MiB |     98%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA PH402 SKU 200           On  |   00000000:04:00.0 Off |                    0 |
| N/A   41C    P0             88W /  140W |     289MiB /  32768MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4104      C   ...e/0x23-8.0.3/Core_23.fah/FahCore_23        310MiB |
|    1   N/A  N/A      4112      C   ...e/0x23-8.0.3/Core_23.fah/FahCore_23        286MiB |
+-----------------------------------------------------------------------------------------+