I'm running FAH v8.5.5 on an HPE DL380a Gen12 server with 4x NVIDIA H200 NVL GPUs. The client only detects 2 of the 4 GPUs.
System Info:
OS: Ubuntu (kernel 6.17)
CPU: Intel Xeon 6760P (256 threads)
RAM: 2TB
GPU: 4x NVIDIA H200 NVL
Driver: 580.95
FAH Client: v8.5.5
GPU topology (nvidia-smi topo -m):
      GPU0  GPU1  GPU2  GPU3
GPU0  X     NV18  SYS   SYS
GPU1  NV18  X     SYS   SYS
GPU2  SYS   SYS   X     NV18
GPU3  SYS   SYS   NV18  X

GPU index, name, and PCI bus ID (domain:bus:device.function):
0, NVIDIA H200 NVL, 00000000:AC:00.0
1, NVIDIA H200 NVL, 00000000:D4:00.0
2, NVIDIA H200 NVL, 00000001:AC:00.0
3, NVIDIA H200 NVL, 00000001:D4:00.0

FAH log shows only 2 GPUs detected:
"gpu:172:00:00": {
"description": "NVIDIA H200 NVL",
"uuid": "a03316c8-9bba-37a6-5a7e-af0e1a66797b", ← GPU2
...
},
"gpu:212:00:00": {
"description": "NVIDIA H200 NVL",
"uuid": "4b963b13-f43a-2b27-c8c6-de3c9a2446af", ← GPU3
...
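}

The identifiers appear to be built from bus:device:function only (172 = 0xAC, 212 = 0xD4). The GPUs in domain 0000 and domain 0001 sit on the same bus numbers, so GPU0/GPU2 and GPU1/GPU3 end up with the same identifier and only one GPU of each pair is listed.

Would it be possible to include the PCIe domain in the GPU identifier (e.g., gpu:DOMAIN:BUS:DEV:FN) so that all GPUs are properly enumerated on multi-domain systems?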
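To illustrate what I think is happening, here is a small Python sketch. It is not the FAH client's code: the helper names (parse_bus_id, key_without_domain, key_with_domain) are hypothetical, and the exact key formatting and the "last entry wins" behaviour are my assumptions based on the log above. It only shows that the four bus IDs reported by nvidia-smi collapse to two keys when the domain is dropped and stay distinct when it is included.

```python
import re

# Hypothetical helpers for illustration only; not taken from the FAH client.
# PCI bus ID as reported by nvidia-smi: DOMAIN:BUS:DEVICE.FUNCTION (hex).
BUS_ID_RE = re.compile(
    r"^([0-9A-Fa-f]+):([0-9A-Fa-f]{2}):([0-9A-Fa-f]{2})\.([0-7])$"
)


def parse_bus_id(bus_id):
    """Return (domain, bus, device, function) as integers."""
    m = BUS_ID_RE.match(bus_id)
    if m is None:
        raise ValueError(f"unrecognised PCI bus ID: {bus_id!r}")
    return tuple(int(part, 16) for part in m.groups())


def key_without_domain(bus_id):
    """Assumed current scheme: decimal bus:device:function, domain ignored."""
    _, bus, dev, fn = parse_bus_id(bus_id)
    return f"gpu:{bus:02d}:{dev:02d}:{fn:02d}"


def key_with_domain(bus_id):
    """Proposed scheme: gpu:DOMAIN:BUS:DEV:FN (field widths are a guess)."""
    domain, bus, dev, fn = parse_bus_id(bus_id)
    return f"gpu:{domain:02d}:{bus:02d}:{dev:02d}:{fn:02d}"


# Bus IDs from the nvidia-smi output above, in GPU index order.
bus_ids = [
    "00000000:AC:00.0",
    "00000000:D4:00.0",
    "00000001:AC:00.0",
    "00000001:D4:00.0",
]

# Later GPUs overwrite earlier ones that share a key (assumed behaviour).
old = {key_without_domain(b): i for i, b in enumerate(bus_ids)}
new = {key_with_domain(b): i for i, b in enumerate(bus_ids)}

print(len(old), old)
print(len(new), new)
```

With the domain dropped, only gpu:172:00:00 and gpu:212:00:00 remain, and they point at GPU2 and GPU3, which matches what the log shows. With the domain included, all four GPUs get their own key.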
Thanks for your time!