RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.0?)
Moderators: Site Moderators, FAHC Science Team
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
I just did a DDU/install of 441.87 and the first GPU WU completed successfully while I was afraid to touch the computer for fear of jinxing it. I'm now running a second WU and monitoring temps to see if it was a fluke. This was with the CPU slot turned off, but that same config failed on 445.78 so it would appear that the nVidia driver version was the main issue here.
I'll report back when I've done a few WUs and as I start to restore the old system states.
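For anyone else retracing these steps, a quick way to confirm which driver actually ended up active after a DDU/reinstall is nvidia-smi (it ships with the driver on both Windows and Linux). This is just an illustrative check, not something the client requires:
Code: Select all
# Show the GPU name and the driver version the runtime is really using
nvidia-smi --query-gpu=name,driver_version --format=csv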
-
- Posts: 390
- Joined: Sun Dec 02, 2007 4:53 am
- Hardware configuration: FX8320e (6 cores enabled) @ stock,
- 16GB DDR3,
- Zotac GTX 1050Ti @ Stock.
- Gigabyte GTX 970 @ Stock
Debian 9.
Running GPU since it came out, CPU since client version 3.
Folding since Folding began (~2000) and ran Genome@Home for a while too.
Ran Seti@Home prior to that. - Location: UK
- Contact:
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
I just wrote a post saying to try an older version of the driver.
Glad you seem to have fixed it.
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
v00d00 wrote:I just wrote a post saying to try an older version of the driver.
Glad you seem to have fixed it.
Thanks!
Yeah, I've mentioned a few times throughout the discussion that I suspected the driver, but I was hoping I'd get some empirical guidance about which versions to try. The recommendations are always to try the latest version, but that didn't seem to help in my case.
In this case, I simply chose a driver that I had used a few months back (and still had in my cache of downloads on my NAS) that also happened to be a couple of major versions behind, and I lucked out. It would be nice for the FaH servers to track the driver releases that contributed good results, for diagnostic purposes.
(And it would be nice if there were sample WUs that could be used for vetting an install without having to wait for a new one to become available, but I think that's only been a problem in the last few weeks. For instance, I had to wait nearly an hour to get a WU after I downgraded to 441.87. At that point I couldn't care less whether I was going to get credit for it or whether it contributed to the cause, I just wanted any valid WU to test the setup. Such sample WUs could also help determine whether issues were network related or compute related...)
-
- Posts: 390
- Joined: Sun Dec 02, 2007 4:53 am
- Hardware configuration: FX8320e (6 cores enabled) @ stock,
- 16GB DDR3,
- Zotac GTX 1050Ti @ Stock.
- Gigabyte GTX 970 @ Stock
Debian 9.
Running GPU since it came out, CPU since client version 3.
Folding since Folding began (~2000) and ran Genome@Home for a while too.
Ran Seti@Home prior to that. - Location: UK
- Contact:
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
If you look in the Nvidia area there's usually some discussion on viable driver versions.
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
I have a couple of things that may be completely off-base ... or not.
1) Is this happening on a system with a hybrid GPU that switches between an iGPU and a "real" GPU based on the system-perceived Graphics load?
2) Is this happening when Intel's OpenCL is also present on the system? (Even though Intel claims their OpenCL doesn't interfere with other installed versions of OpenCL, there are exceptions) Uninstalling the Intel version is a reasonable test.
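If it's useful as a quick check, the clinfo utility (a third-party tool installed separately, not part of FAH) lists every OpenCL platform the ICD loader can see, which makes a stray Intel or AMD runtime easy to spot. On Linux something like this works; on Windows, pipe to findstr instead of grep:
Code: Select all
# Compact list of OpenCL platforms and the devices under each
clinfo -l
# Fuller dump if you want vendor and version strings
clinfo | grep -i -E "platform name|platform version|device name"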
Posting FAH's log:
How to provide enough info to get helpful support.
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
It looks like both AMD and nVidia are having driver issues ... more people are paying attention ... and with all this activity, more events are being seen.
Also, it may turn out that some of the issues are inside OpenCL or in the way the API stack interfaces with the next level of code.
Posting FAH's log:
How to provide enough info to get helpful support.
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
1) Single GPU, no switching, with a Ryzen 3900x that has no integrated graphics.
2) Both tests under 445.78 and 441.87 were after a safe mode DDU + network-disabled install of the nVidia drivers. I also did a complete clean of AMD and Intel drivers just in case even though I've never had an AMD or Intel GPU or iGPU in the system.
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
I am using the latest "Studio Driver" version 442.19 without a hitch. Maybe give that driver a shot. I stayed away from the latest "Game Ready Driver" when I heard about the problems with it... (so I haven't even tried 445.75)
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
I doubt the following is going to be helpful, but I did see one person post that they were having issues on a Linux platform, so I thought I'd post this just in case it provides a clue to someone who is frustrated. I have not seen any errors while processing with this GPU. That doesn't mean they haven't happened; I just haven't seen any, and I've been at the computer for a great many hours since the GPU was installed.
Code: Select all
System:
Host: OMITTED Kernel: 5.3.0-45-generic x86_64 bits: 64
compiler: gcc v: 7.5.0 Desktop: Cinnamon 4.4.8
Distro: Linux Mint 19.3 Tricia base: Ubuntu 18.04 bionic
Machine:
Type: Desktop Mobo: Micro-Star model: B450 TOMAHAWK MAX (MS-7C02) v: 1.0
serial: <filter> UEFI: American Megatrends v: 3.50 date: 11/07/2019
CPU:
Topology: 8-Core model: AMD Ryzen 7 3700X bits: 64 type: MT MCP arch: Zen
L2 cache: 4096 KiB
flags: lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
bogomips: 115202
Speed: 2199 MHz min/max: 2200/3600 MHz Core speeds (MHz): 1: 2200 2: 2199
3: 2200 4: 2199 5: 2141 6: 2159 7: 2155 8: 4308 9: 2199 10: 2201 11: 2160
12: 2159 13: 2161 14: 4304 15: 2200 16: 2200
Graphics:
Device-1: NVIDIA vendor: eVga.com. driver: nvidia v: 435.21
bus ID: 26:00.0
Display: x11 server: X.Org 1.19.6 driver: nvidia
unloaded: fbdev,modesetting,nouveau,vesa resolution: 2560x1080~60Hz
OpenGL: renderer: GeForce RTX 2070 SUPER/PCIe/SSE2 v: 4.6.0 NVIDIA 435.21
direct render: Yes
Audio:
Device-1: NVIDIA vendor: eVga.com. driver: snd_hda_intel v: kernel
bus ID: 26:00.1
Device-2: AMD vendor: Micro-Star MSI driver: snd_hda_intel v: kernel
bus ID: 28:00.4
Sound Server: ALSA v: k5.3.0-45-generic
Network:
Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
vendor: Micro-Star MSI driver: r8169 v: kernel port: f000 bus ID: 22:00.0
IF: enp34s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:
Local Storage: total: 2.77 TiB used: 316.89 GiB (11.2%)
ID-1: /dev/nvme0n1 model: Sabrent size: 953.87 GiB
ID-2: /dev/sda vendor: A-Data model: SU800 size: 953.87 GiB
ID-3: /dev/sdb vendor: Western Digital model: WD10EZEX-00BN5A0
size: 931.51 GiB
Partition:
ID-1: / size: 937.40 GiB used: 158.44 GiB (16.9%) fs: ext4
dev: /dev/nvme0n1p2
ID-2: swap-1 size: 2.00 GiB used: 0 KiB (0.0%) fs: swap dev: /dev/dm-0
Sensors:
System Temperatures: cpu: 59.8 C mobo: N/A gpu: nvidia temp: 54 C
Fan Speeds (RPM): N/A gpu: nvidia fan: 49%
Info:
Processes: 347 Uptime: 22h 32m Memory: 15.65 GiB used: 4.44 GiB (28.3%)
Init: systemd runlevel: 5 Compilers: gcc: 7.5.0 Shell: bash v: 4.4.20
inxi: 3.0.32
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
I'd definitely look into getting that GPU replaced. Especially if your CPU is completing units successfully.
You could try reducing GPU memory and core clock speeds, but I've found that only delays the symptoms rather than eliminating them.
-
- Site Moderator
- Posts: 6359
- Joined: Sun Dec 02, 2007 10:38 am
- Location: Bordeaux, France
- Contact:
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error
Some of the first-series 2080 Ti (and 2080) Founders Edition cards had design flaws that caused them to die prematurely. Are you sure it's not one of these?
Re: RTX 2080Ti on 445.75/78 clWaitForEvents error (441.87 wo
Thank you scOOp for the driver suggestion. I'll try that in a few days after I give the current drivers a chance to get some work done.
The GPU is quite fine; the drivers I was using appear to be the problem (update - but then again, see next post). I did not have to underclock the GPU in any way (update - or do I? see below about Boost 4.0); the drivers I'm now using (441.87) make it rock solid for folding.
As I stated early on, my GPU has Samsung memory chips, not the Micron chips that were a problem with early models.
Last edited by flarbear on Thu Apr 09, 2020 1:00 am, edited 1 time in total.
Re: RTX 2080Ti clWaitForEvents error on 445.75/78 (441.87 wo
[Perhaps this is the real culprit here?]
Mysteriouser and Mysteriouser...
OK, so that stability lasted about 4-5 days performing somewhere between 15-20 WUs flawlessly and then I started getting the dreaded clWaitForEvents errors again. I did the full reinstall a couple of times and the same thing was happening and it was back to its "error before it even gets 1% of the WU done" tricks. I could have tried yet another driver version, but I decided to go another route instead...
And here is where things get strange...
Since the system was never very hot - the max GPU temp was 53c monitored over several days and the thermal limit is around 85c - I was skeptical that it could be a hardware problem, but I thought I would look at the clocks anyway.
I downloaded MSI Afterburner, which I've never used before, and looked through its settings. I found the place to adjust the processor and memory clocks and bumped them down by about -200MHz. The first GPU WU came through and made it halfway before it errored out.
So I downclocked to -300MHz (both clocks) and now the GPU WUs are running successfully again. But, the thermals were well below any thermal limit...? And isn't that a pretty huge underclock? (Or so I thought, but here is the key...)
So, what were my final speeds that stabilized it this time? 1650MHz on the processor and 6500MHz on the memory. Wait, what? The FE edition of the card is already "factory overclocked" to have a BOOST clock of 1635. I was still 15MHz over that with a clock adjustment of -300MHz? I looked at the dial on MSI AB and it showed that my base clock was indeed 1050 which is the stock base clock less 300, and my boost was 1300ish, which is around 300 less than the FE boost. And it was currently running at 1650MHz - 300MHz over that...?
Investigating, I discovered that the RTX cards have the latest Boost (4.0) technology and the card will automatically overclock until it reaches thermal, power, or GPU limits. And since this thing is so over-cooled, it wasn't being held back by much on the thermal front. The stock design was overclocking for me and likely going up to around 1950MHz on its own without me even checking a checkbox anywhere. I've owned this card since last June (just looked up the receipt - I thought I bought it in the fall, but it was last summer). That whole time, though, it was on the stock fan cooler, probably running hot, and probably didn't overclock itself by much. I only added the GPU water block when I built my most recent system last month. I'll be monitoring it now to see if the -300MHz is enough to keep it stable - and it is still above the FE factory overclock at that speed anyway.
And one thing that may have made it worse is that I bumped the fan speeds after a couple of days of stability because I wanted to keep the GPU from heat-soaking the CPU. My CPU never reached thermal limits, but it was right around 90c (with a thermal limit of 95c), and when the GPU wasn't running it was much lower - in the mid to low 80s. So I figured I should take some more heat out of the system to keep the CPU from running so hot all the time. And that may have made the GPU Boost 4.0 issue worse, because now the GPU was even less limited by its thermals - and eventually it ran into some other limit. I didn't check the running MHz before and after I modified the fan curves, so I can't say for sure whether that caused the GPU to run even faster, but it is clearly a believable smoking gun.
Long story short, RTX cards overclock themselves by default and aggressive cooling may double down on that and have it run up against other limits - if you don't adjust them down...
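If anyone wants to watch what Boost 4.0 is actually doing while folding, nvidia-smi can poll the live clocks, temperature and power, and on recent drivers it can also clamp the boost range directly as an alternative to offsetting in Afterburner. A rough sketch (the -lgc clock lock needs a fairly new driver and admin/root rights; -rgc undoes it):
Code: Select all
# Poll the live graphics/memory clocks, temperature and board power every 5 seconds
nvidia-smi --query-gpu=clocks.gr,clocks.mem,temperature.gpu,power.draw --format=csv -l 5

# On recent drivers you can clamp the boost range instead of applying an offset,
# e.g. keep the graphics clock at or below 1650 MHz (run as admin/root)
nvidia-smi -lgc 300,1650
# nvidia-smi -rgc    (resets the lock when you are done)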
Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.
A weak PSU can cause vexing and confusing issues. I would look there.
Re: RTX 2080Ti clWaitForEvents error (driver issue? Boost 4.
Roadpower wrote:A weak PSU can cause vexing and confusing issues. I would look there.
It's a Corsair SF750, 80 PLUS Platinum rated, and the UPS indicates that the maximum draw under load with F@H is under 500W, including the 2 monitors and the NAS that are also plugged into the same UPS. With the computer asleep the draw is 70W before the monitors enter sleep state, so the computer would be drawing under 400W on a 750W PSU.
I have custom cables, but they look at least as beefy as the stock cables that came with the PSU - 18 awg and shorter than the stock cables by almost half.
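For completeness, the board's own reported draw versus the limit the driver is enforcing can be read back with nvidia-smi as well; it's a rough cross-check against the UPS numbers, not a substitute for a wall meter:
Code: Select all
# Current board power draw and the enforced power limit
nvidia-smi --query-gpu=power.draw,power.limit --format=csv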