Mixed GPU Types with Thunderbolt 3 eGPU

Moderators: Site Moderators, FAHC Science Team

prowave
Posts: 2
Joined: Sun Apr 26, 2020 6:23 pm

Mixed GPU Types with Thunderbolt 3 eGPU

Post by prowave »

I have two internal Nvidia RTX 2080 GPUs and an eGPU enclosure with an AMD Vega 64. The AMD card runs fine for a while, then the eGPU disconnects, the GPU fan spikes to full blast, and I see the following in the logs. This only happens when running the FAH client. Is this the common issue I have read about, where the client tries to assign an Nvidia WU to an AMD GPU? If so, is there a solution? Thank you.

18:00:20:WU01:FS03:Connecting to 65.254.110.245:80
18:00:20:WU01:FS03:Assigned to work server 140.163.4.241
18:00:20:WU01:FS03:Requesting new work unit for slot 03: READY gpu:2:Vega 10 XL/XT [Radeon RX Vega 56/64] from 140.163.4.241
18:00:20:WU01:FS03:Connecting to 140.163.4.241:8080
18:01:12:WU01:FS03:Downloading 4.53MiB
18:01:14:WU01:FS03:Download complete
18:01:14:WU01:FS03:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:11742 run:0 clone:5312 gen:69 core:0x22 unit:0x000000688ca304f15e6bc531c5f3c075
18:01:14:WU01:FS03:Starting
18:01:14:WU01:FS03:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" "C:\Users\David Webb\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe" -dir 01 -suffix 01 -version 706 -lifeline 21148 -checkpoint 5 -gpu-vendor amd -opencl-platform 1 -opencl-device 0 -gpu 0
18:01:14:WU01:FS03:Started FahCore on PID 2976
18:01:14:WU01:FS03:Core PID:19692
18:01:14:WU01:FS03:FahCore 0x22 started
18:01:14:WU01:FS03:0x22:*********************** Log Started 2020-04-26T18:01:14Z ***********************
18:01:14:WU01:FS03:0x22:*************************** Core22 Folding@home Core ***************************
18:01:14:WU01:FS03:0x22:       Type: 0x22
18:01:14:WU01:FS03:0x22:       Core: Core22
18:01:14:WU01:FS03:0x22:    Website: https://foldingathome.org/
18:01:14:WU01:FS03:0x22:  Copyright: (c) 2009-2018 foldingathome.org
18:01:14:WU01:FS03:0x22:     Author: John Chodera <john.chodera@choderalab.org> and Rafal Wiewiora
18:01:14:WU01:FS03:0x22:             <rafal.wiewiora@choderalab.org>
18:01:14:WU01:FS03:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 2976 -checkpoint 5
18:01:14:WU01:FS03:0x22:             -gpu-vendor amd -opencl-platform 1 -opencl-device 0 -gpu 0
18:01:14:WU01:FS03:0x22:     Config: <none>
18:01:14:WU01:FS03:0x22:************************************ Build *************************************
18:01:14:WU01:FS03:0x22:    Version: 0.0.2
18:01:14:WU01:FS03:0x22:       Date: Dec 6 2019
18:01:14:WU01:FS03:0x22:       Time: 21:30:31
18:01:14:WU01:FS03:0x22: Repository: Git
18:01:14:WU01:FS03:0x22:   Revision: abeb39247cc72df5af0f63723edafadb23d5dfbe
18:01:14:WU01:FS03:0x22:     Branch: HEAD
18:01:14:WU01:FS03:0x22:   Compiler: Visual C++ 2008
18:01:14:WU01:FS03:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
18:01:14:WU01:FS03:0x22:   Platform: win32 10
18:01:14:WU01:FS03:0x22:       Bits: 64
18:01:14:WU01:FS03:0x22:       Mode: Release
18:01:14:WU01:FS03:0x22:************************************ System ************************************
18:01:14:WU01:FS03:0x22:        CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
18:01:14:WU01:FS03:0x22:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 12
18:01:14:WU01:FS03:0x22:       CPUs: 16
18:01:14:WU01:FS03:0x22:     Memory: 31.93GiB
18:01:14:WU01:FS03:0x22:Free Memory: 24.07GiB
18:01:14:WU01:FS03:0x22:    Threads: WINDOWS_THREADS
18:01:14:WU01:FS03:0x22: OS Version: 6.2
18:01:14:WU01:FS03:0x22:Has Battery: false
18:01:14:WU01:FS03:0x22: On Battery: false
18:01:14:WU01:FS03:0x22: UTC Offset: -4
18:01:14:WU01:FS03:0x22:        PID: 19692
18:01:14:WU01:FS03:0x22:        CWD: C:\Users\David Webb\AppData\Roaming\FAHClient\work
18:01:14:WU01:FS03:0x22:         OS: Windows 10 Pro
18:01:14:WU01:FS03:0x22:    OS Arch: AMD64
18:01:14:WU01:FS03:0x22:********************************************************************************
18:01:14:WU01:FS03:0x22:Project: 11742 (Run 0, Clone 5312, Gen 69)
18:01:14:WU01:FS03:0x22:Unit: 0x000000688ca304f15e6bc531c5f3c075
18:01:14:WU01:FS03:0x22:Reading tar file core.xml
18:01:14:WU01:FS03:0x22:Reading tar file integrator.xml
18:01:14:WU01:FS03:0x22:Reading tar file state.xml
18:01:15:WU01:FS03:0x22:Reading tar file system.xml
18:01:16:WU01:FS03:0x22:Digital signatures verified
18:01:16:WU01:FS03:0x22:Folding@home GPU Core22 Folding@home Core
18:01:16:WU01:FS03:0x22:Version 0.0.2
18:01:16:WU01:FS03:0x22:ERROR:126: Bad platformId size.
18:01:16:WU01:FS03:0x22:Saving result file ..\logfile_01.txt
18:01:16:WU01:FS03:0x22:Saving result file science.log
18:01:16:WU01:FS03:0x22:Folding@home Core Shutdown: BAD_WORK_UNIT
18:01:16:WARNING:WU01:FS03:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:01:16:WU01:FS03:Sending unit results: id:01 state:SEND error:FAULTY project:11742 run:0 clone:5312 gen:69 core:0x22 unit:0x000000688ca304f15e6bc531c5f3c075
18:01:16:WU01:FS03:Uploading 2.25KiB to 140.163.4.241
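
For anyone comparing logs like the one above, here is a minimal Python sketch that pulls out just the failure lines per slot. It assumes the `HH:MM:SS:WUnn:FSnn:message` line layout seen in this log, and it treats any `FahCore returned:` line other than `FINISHED_UNIT` as a failure, which is an assumption rather than documented client behavior. The function name `find_core_failures` is made up for illustration.

```python
import re

# Matches client log lines such as
# "18:01:16:WARNING:WU01:FS03:FahCore returned: BAD_WORK_UNIT (114 = 0x72)"
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?:WARNING:|ERROR:)?"
    r"WU(?P<wu>\d+):FS(?P<slot>\d+):(?P<msg>.*)$"
)

# Core-side errors look like "0x22:ERROR:126: Bad platformId size."
CORE_ERROR_RE = re.compile(r"^0x[0-9a-fA-F]+:ERROR:")


def find_core_failures(lines):
    """Return (time, slot, message) tuples for core errors and failed exits."""
    failures = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a per-work-unit line
        msg = m.group("msg")
        is_core_error = CORE_ERROR_RE.match(msg) is not None
        # Assumption: FINISHED_UNIT is the only successful exit status.
        is_bad_exit = (
            msg.startswith("FahCore returned:") and "FINISHED_UNIT" not in msg
        )
        if is_core_error or is_bad_exit:
            failures.append((m.group("time"), int(m.group("slot")), msg))
    return failures


# Three lines copied from the log in this post.
sample = [
    "18:01:16:WU01:FS03:0x22:Version 0.0.2",
    "18:01:16:WU01:FS03:0x22:ERROR:126: Bad platformId size.",
    "18:01:16:WARNING:WU01:FS03:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
]

for when, slot, msg in find_core_failures(sample):
    print(when, f"slot {slot:02d}", msg)
```

Run against a full log file, this would show whether the failures are confined to the eGPU's slot (FS03 here) or also hit the Nvidia slots.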
Joe_H
Site Admin
Posts: 8224
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: Mixed GPU Types with Thunderbolt 3 eGPU

Post by Joe_H »

Hard to say. This could be related, but it also looks a bit like the error messages we see in logs from people whose laptops have both an iGPU and a separate, higher-power GPU active. If the WU starts while the laptop is using the iGPU, the folding core does not "see" the full-power GPU and the WU errors out.

Any chance your eGPU is going offline in between working and not working?
foldy
Posts: 2040
Joined: Sat Dec 01, 2012 3:43 pm
Hardware configuration: Folding@Home Client 7.6.13 (1 GPU slots)
Windows 7 64bit
Intel Core i5 2500k@4Ghz
Nvidia gtx 1080ti driver 441

Re: Mixed GPU Types with Thunderbolt 3 eGPU

Post by foldy »

Maybe you can run FAHViewer on the AMD eGPU so that it never goes idle between work units.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Mixed GPU Types with Thunderbolt 3 eGPU

Post by bruce »

There are several reports of problems with certain driver versions for various Vega devices. This might be the same issue and you might fix it by installing an earlier driver version ... but you'll have to read the reports and the solutions and figure out if they apply to you.
prowave
Posts: 2
Joined: Sun Apr 26, 2020 6:23 pm

Re: Mixed GPU Types with Thunderbolt 3 eGPU

Post by prowave »

@Joe_H - It's a desktop with the onboard graphics disabled, but I'll double-check all that again.
@foldy and @bruce - Thanks - I hope I don't have to track down an old driver.

The last change I made was to set the GPU type explicitly on each slot: I set the gpu-index on every slot and then, depending on the GPU, the cuda-index for the Nvidia cards or the opencl-index for the AMD. It has worked for the last 24 hours, and I am hoping it continues to. If the problem was the -cuda-device flag being passed to the AMD GPU, then it should now be resolved. If not, I'll move on to the next suggestion. Thanks, all!
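
To make the per-slot setup concrete, here is a rough Python sketch that builds the kind of slot entries described above. It assumes the v7 client's config.xml layout, where each `<slot>` holds option elements with a `v` attribute. The slot IDs and index values are placeholders, not the real ones from this machine, and `gpu_slot` is a made-up helper name. The correct indices depend on how the client enumerates the GPUs on a given system.

```python
import xml.etree.ElementTree as ET


def gpu_slot(slot_id, gpu_index, *, cuda_index=None, opencl_index=None):
    """Build one <slot> element; only the indices that are given are written."""
    slot = ET.Element("slot", {"id": str(slot_id), "type": "GPU"})
    ET.SubElement(slot, "gpu-index", {"v": str(gpu_index)})
    if cuda_index is not None:
        ET.SubElement(slot, "cuda-index", {"v": str(cuda_index)})
    if opencl_index is not None:
        ET.SubElement(slot, "opencl-index", {"v": str(opencl_index)})
    return slot


config = ET.Element("config")
# Placeholder values: two Nvidia cards pinned by cuda-index,
# one AMD card pinned by opencl-index.
config.append(gpu_slot(1, 0, cuda_index=0))
config.append(gpu_slot(2, 1, cuda_index=1))
config.append(gpu_slot(3, 2, opencl_index=0))

if hasattr(ET, "indent"):  # pretty-printing helper, Python 3.9+
    ET.indent(config)
print(ET.tostring(config, encoding="unicode"))
```

The point of the sketch is only that the AMD slot carries an opencl-index and no cuda-index, so nothing CUDA-related is attached to the Vega.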
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: Mixed GPU Types with Thunderbolt 3 eGPU

Post by MeeLee »

Sometimes I see similar behavior on Nvidia GPUs that are plugged into PCIe risers, with another GPU sharing the SATA power cable.
SATA power cables should carry 70W, and each PCIe riser should feed its GPU 35W, but when there are power spikes a GPU gets kicked offline. I'd check the power cable to the riser and make sure the PSU isn't overloaded (or running close to its max wattage).
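
The arithmetic behind that, as a tiny Python sketch using the figures from this post (70W per cable, 35W per riser). The 40W spike value is a made-up number for illustration only, and `riser_headroom_watts` is a hypothetical helper name.

```python
def riser_headroom_watts(cable_limit_w, riser_draw_w, risers_on_cable):
    """Watts left on one power cable after all risers sharing it draw power."""
    return cable_limit_w - riser_draw_w * risers_on_cable


# Two risers at the nominal 35W each on one 70W cable: nothing to spare.
print(riser_headroom_watts(70, 35, 2))

# Hypothetical spike to 40W per riser: the cable is over its limit.
print(riser_headroom_watts(70, 40, 2))
```

With two risers on one cable there is no margin at nominal draw, so any spike pushes the cable past the limit.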