18601 multiple bad work units
Posted: Wed Oct 05, 2022 2:24 am
by BobWilliams757
Just wondering if anyone else has had issues with this particular WU.
I've had six bad work units in a fairly short time. It is the only project to fail out of the nearly 90 WUs I've run on the new GPU, and I've tried driver changes and such, with the same trend. Though it's not enough to kill bonus status at this time, it does seem to be that one particular project.
Of the first three that failed, two of them failed with another folder after reassignment. The other one folded fine for the second person to pick it up.
Has anyone else had more problems with this project?
Re: 18601 multiple bad work units
Posted: Wed Oct 05, 2022 8:21 am
by PaulTV
My trusty space heaters have been folding for this project a lot in the last couple of weeks, no issues at all. One is Windows 11, RTX 3070 Ti (recent drivers), the other runs Ubuntu 20.04 and has a 3060 Ti (slightly older drivers, because I was too lazy to upgrade them). What specs does your machine have?
Re: 18601 multiple bad work units
Posted: Wed Oct 05, 2022 2:33 pm
by Joe_H
This project has been running for about 6 months, and I don't see any earlier reports of problems. It is possible that some run(s) have become less stable or are having other issues. Did you see any consistent pattern to the WUs that failed?
Re: 18601 multiple bad work units
Posted: Thu Oct 06, 2022 12:10 am
by BobWilliams757
PaulTV,
Specs are as indicated in my profile info under my name. A new PSU went in with the GPU, but the system overall has been solid for years now. I did testing of just about any bench I could find (often multiple at once) and couldn't even find a hint of instability anywhere.
Joe_H,
The only common pattern I can find is that all that have crashed are Clone 1 and Gen 23. One of those also crashed for another user, the other did not. I just checked the remainder: the original two are the only ones I can find that have crashed for a second user, and the other four completed with the next user.
One that failed made it 44 frames; I think it was the one with multiple NaN errors. All others failed with 0 to 5 frames completed. I grabbed some of the error codes but there seems to be no real consistency or trend.
Re: 18601 multiple bad work units
Posted: Thu Oct 06, 2022 2:33 pm
by BobWilliams757
Well crap. This one is a head scratcher.
I've been trying to watch this project more closely, but last night forgot to set the client to "Finish", a precaution I've been taking to monitor this project and attempt to figure out what's happening. Four more failed. Two have returned after reassignment: one was completed and the other failed for a second folder. I can only assume the other two are still being run or awaiting reassignment.
Most seemed to be NaN errors, with I think one CUDA error. None of them made it more than 12 frames, and most had issues from the start. All Clone 1, Gen 25 this time.
The only trend that I see at this point is that the WUs that fail for me and finish for others are finished by people using reasonably quick GPUs, with return times that indicate their hardware is at least 3x quicker than my 1660. With the 1660 being at the bottom of the Turing line of GPUs, I'd assume this is just the assignment/species cutoff causing this.
Now at 90+ work units on this GPU, I've only had 1 error show up in logs on any other project. It shut down, restarted and completed just fine. All other errors are 18601. Full stock settings, reduced power settings, and reduced memory settings. The problem is, those same settings also completed projects.....
Any ideas for the most exhaustive GPU benchmarks that will log any errors? I've often run several at a time, but don't have issues, so I don't know which ones might load the GPU the way folding does. I've also loaded and benched the memory, CPU side, etc using the same methods, usually multiple load and benchmark utilities run at the same time to max stress. Other than confirming my fan curves are ok, I've found nothing.
Re: 18601 multiple bad work units
Posted: Thu Oct 06, 2022 3:24 pm
by Joe_H
I agree a head scratcher. I have sent a message to the researcher, possibly something is going on with the WUs after 6 months of progress, or with specific runs of it.
Re: 18601 multiple bad work units
Posted: Fri Oct 07, 2022 12:31 pm
by artoar_11
In some projects I had stability issues (NaNs, other errors and restarts). Project 18601 is one of them. My 6 year old GTX 1070 runs with a small overclock. I had to reduce the overclock by 30 MHz on the video processor and 200 MHz on the VRAM. It's been working for a month+ now with no visible errors. I have several WUs from p18601 (Rxxxx, C1, G20+) completed normally (w/o errors).
Re: 18601 multiple bad work units
Posted: Sat Oct 08, 2022 4:41 am
by BobWilliams757
Joe_H wrote: ↑Thu Oct 06, 2022 3:24 pm
I agree a head scratcher. I have sent a message to the researcher, possibly something is going on with the WUs after 6 months of progress, or with specific runs of it.
Thanks Joe. I had another fail last night. I still have to catch up on which ones were completed by other users, but I really don't even know what the "average" is for failed/dumped/otherwise not completed work units, so doing so will only give me a guess. I just picked up another one right now, we'll see what happens.
On any other project and with all benchmarks, stress testers, etc, I can't even make the card flinch. I've tested multiple settings over a single project WU and it's rock solid from underclocked and undervolted, to full stock power settings, to overpowering it by 5% in MSI Afterburner, and it runs like a clock.
One thing I have noticed that I didn't mention before: as mentioned, I've been testing quite an array of settings using MSI Afterburner, mostly power limited settings. So if I'm here and a WU finishes I start a new one, see what it picks up, and then choose a setting I haven't used for that particular project to see where the efficiency curve is for each project. Usually when most projects pick up and start folding, the GPU and Memory scales shoot up to the appropriate level and stay there (or fluctuate with PL limiting), and the GPU temp naturally rises from the cooler temp it was at while paused.
With project 18601, the GPU and Memory scales seem to bounce up and down a few times within maybe just a couple of seconds, then settle at the appropriate level for the settings. It's almost as if it doesn't load the GPU as quickly or something. And yes, I've been diving that deep into the rabbit hole.
Re: 18601 multiple bad work units
Posted: Mon Oct 10, 2022 4:12 pm
by BobWilliams757
Another piece of info I had forgotten about but found in my notes:
18601 is the only project I've had that hits the power limit and caps when set to a 100% power limit. This happens at stock settings, without excessive GPU or memory clocks. I've run a couple more successfully at 100%, and at some point in the WU all of them seem to do this. I've been running them all at 100%/stock settings on everything to see if I can find a trend. The only other thing I've noticed is that the most recent PRCG also gave me the highest "Bus Interface Load" that GPU-Z has recorded. Though I haven't logged it from the start, it usually seems to stay at a max of 2%. The PRCG I ran most recently showed a max of 6%.
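For anyone who wants to log this kind of behaviour rather than watch the Afterburner or GPU-Z graphs, here is a minimal Python sketch, assuming an NVIDIA card with `nvidia-smi` on the path. It only parses a CSV log produced by the command shown in the comment; the helper names (`parse_samples`, `power_capped`) and the 1 W margin are made up for illustration and are not part of any FAH or NVIDIA tool.

```python
import csv
import io

# Query fields as listed by `nvidia-smi --help-query-gpu`.
FIELDS = ["timestamp", "power.draw", "power.limit",
          "clocks.sm", "clocks.mem", "temperature.gpu"]

# A command along these lines would produce the log (not executed here):
#   nvidia-smi --query-gpu=timestamp,power.draw,power.limit,clocks.sm,clocks.mem,temperature.gpu
#              --format=csv,noheader,nounits -l 1 > gpu_log.csv


def parse_samples(text):
    """Parse noheader/nounits CSV text into a list of dicts with float readings."""
    samples = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        sample = {"timestamp": row[0]}
        try:
            for name, value in zip(FIELDS[1:], row[1:]):
                sample[name] = float(value)
        except ValueError:
            continue  # skip rows with non-numeric readings such as "[N/A]"
        samples.append(sample)
    return samples


def power_capped(samples, margin_w=1.0):
    """Return samples whose power draw is within margin_w watts of the limit.

    margin_w is an arbitrary illustrative threshold, not a documented value.
    """
    return [s for s in samples if s["power.draw"] >= s["power.limit"] - margin_w]
```

Feeding it the contents of `gpu_log.csv` and comparing how many samples come back from `power_capped` per project would put a number on the "caps even at 100%" observation.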
Re: 18601 multiple bad work units
Posted: Sat Oct 15, 2022 1:49 pm
by toTOW
Don't forget to clean your heatsinks and fans ...
I also noticed some projects (but I don't remember which ones) that were slightly unstable on the old 980 which was overclocked ... everything was back to normal after removing overclocking.
Re: 18601 multiple bad work units
Posted: Sun Oct 16, 2022 3:10 pm
by BobWilliams757
toTOW,
All brand new, clean, stock GPU. Since I had to install the GPU and PS I blew out the case and other stuff while digging in there.
I am now seeing what I HOPE was the issue, but I'm not ready to declare victory yet. I haven't been picking up as many of this project so I want to be sure first, but it seems it just doesn't like running at the lowest power limit settings. I've witnessed two of them toss an error within a few percent when running at a 53% power limit, which is the lowest Afterburner allows. On both of those work units, bumping power up higher let them finish the work unit without any issues.
Maybe it's related to the power hungry nature of the project vs others. So far every one I've watched hits power limits even at 100%. It also seems to log the highest peak power and slightly higher average power when power limited by a percentage.
I'm going to keep an eye on it for a while, but hopefully I've figured out what causes it. I'm not sure I'll ever know the why part for sure.
Re: 18601 multiple bad work units
Posted: Mon Oct 17, 2022 3:55 pm
by Dayle Diamond
I've gotten a few work units in a row that are roughly ten generations behind the average.
Perhaps the broken ones have been fixed and released.
Edit: Either there were a lot more bad units than I anticipated, or the whole project has suffered a rollback.
My computer has been only doing 18601 for weeks, and the average generation was clone 1, gen 35 until a few days ago.
Now I'm seeing late 20's and early 30's, when I'd expect to see the units reaching 37.
Re: 18601 multiple bad work units
Posted: Mon Oct 31, 2022 11:42 am
by BobWilliams757
**Update**
After watching these whenever I could to see errors and possible causes, it seems to be related to power limiting alone. Though some work units will finish fine at max low power limit (53%), others will throw errors. I found a few that would error below about 56-57% and bumping the power allowed them to finish just fine.
Over time I experimented some, and selected a new low power base of 60%. I've done about 25-28 more of the 18601's at that power level with no errors that I've noticed. Power alone seems to be the thing that impacts it. Core and memory clock changes, temps, background use, etc, etc... nothing. If nothing else I've found that I can stress my system to crazy levels and everything works as long as I don't set the power limit too low.
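The way the 60% floor was arrived at can be sketched in a few lines of Python: record each WU as a (power limit %, succeeded) pair and pick the lowest limit that sits above every recorded failure. The function name and the sample trials below are illustrative only, loosely modelled on the numbers in this post, not an actual record of the runs.

```python
def lowest_clean_limit(trials):
    """Given (power_limit_percent, succeeded) pairs, return the lowest limit
    that succeeded and is above every recorded failure, or None if none is."""
    failures = [pl for pl, ok in trials if not ok]
    floor = max(failures) if failures else None
    clean = sorted({pl for pl, ok in trials
                    if ok and (floor is None or pl > floor)})
    return clean[0] if clean else None


# Illustrative trials only: some WUs pass at 53%, others fail at 53% and 56%.
trials = [(53, False), (53, True), (56, False), (60, True), (100, True)]
print(lowest_clean_limit(trials))  # prints 60
```

This is deliberately conservative: a limit that has both passed and failed (53% here) is treated as unsafe.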
I'm hoping this is a case closed for me on this project. Good thing too, because there seem to be plenty of them being issued.
Re: 18601 multiple bad work units
Posted: Mon Dec 12, 2022 10:07 pm
by Lazvon
18601 still causes me problems every so often. I'm going to limit based on GPU Temp I think and see if I can get it to work more reliably.
[quote]*********************** Log Started 2022-12-12T22:00:41Z ***********************
22:00:41:******************************* libFAH ********************************
22:00:41: Date: Oct 20 2020
22:00:41: Time: 13:36:55
22:00:41: Revision: 5ca109d295a6245e2a2f590b3d0085ad5e567aeb
22:00:41: Branch: master
22:00:41: Compiler: Visual C++ 2015
22:00:41: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
22:00:41: Platform: win32 10
22:00:41: Bits: 32
22:00:41: Mode: Release
22:00:41:****************************** FAHClient ******************************
22:00:41: Version: 7.6.21
22:00:41: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
22:00:41: Copyright: 2020 foldingathome.org
22:00:41: Homepage: https://foldingathome.org/
22:00:41: Date: Oct 20 2020
22:00:41: Time: 13:41:04
22:00:41: Revision: 6efbf0e138e22d3963e6a291f78dcb9c6422a278
22:00:41: Branch: master
22:00:41: Compiler: Visual C++ 2015
22:00:41: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
22:00:41: Platform: win32 10
22:00:41: Bits: 32
22:00:41: Mode: Release
22:00:41: Config: C:\ProgramData\FAHClient\config.xml
22:00:41:******************************** CBang ********************************
22:00:41: Date: Oct 20 2020
22:00:41: Time: 11:36:18
22:00:41: Revision: 7e4ce85225d7eaeb775e87c31740181ca603de60
22:00:41: Branch: master
22:00:41: Compiler: Visual C++ 2015
22:00:41: Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
22:00:41: Platform: win32 10
22:00:41: Bits: 32
22:00:41: Mode: Release
22:00:41:******************************* System ********************************
22:00:41: CPU: 12th Gen Intel(R) Core(TM) i9-12900K
22:00:41: CPU ID: GenuineIntel Family 6 Model 151 Stepping 2
22:00:41: CPUs: 24
22:00:41: Memory: 31.75GiB
22:00:41: Free Memory: 27.89GiB
22:00:41: Threads: WINDOWS_THREADS
22:00:41: OS Version: 6.2
22:00:41: Has Battery: false
22:00:41: On Battery: false
22:00:41: UTC Offset: -5
22:00:41: PID: 16068
22:00:41: CWD: C:\ProgramData\FAHClient
22:00:41: Win32 Service: false
22:00:41: OS: Windows 10 Home
22:00:41: OS Arch: AMD64
22:00:41: GPUs: 1
22:00:41: GPU 0: Bus:1 Slot:0 Func:0 NVIDIA:8 GA102 [GeForce RTX 3090]
22:00:41: CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:8.6 Driver:11.7
22:00:41:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:3.0 Driver:516.94
22:00:41:***********************************************************************
22:00:41:<config>
22:00:41: <!-- HTTP Server -->
22:00:41: <allow v='10.10.10.0/24 127.0.0.1'/>
22:00:41:
22:00:41: <!-- Network -->
22:00:41: <proxy v=':8080'/>
22:00:41:
22:00:41: <!-- Remote Command Server -->
22:00:41: <password v='*****'/>
22:00:41:
22:00:41: <!-- Slot Control -->
22:00:41: <power v='FULL'/>
22:00:41:
22:00:41: <!-- User Information -->
22:00:41: <passkey v='*****'/>
22:00:41: <team v='11108'/>
22:00:41: <user v='Lazvon'/>
22:00:41:
22:00:41: <!-- Folding Slots -->
22:00:41: <slot id='1' type='GPU'>
22:00:41: <pci-bus v='1'/>
22:00:41: <pci-slot v='0'/>
22:00:41: </slot>
22:00:41:</config>
22:00:41:Trying to access database...
22:00:41:Successfully acquired database lock
22:00:41:FS01:Initialized folding slot 01: gpu:1:0 GA102 [GeForce RTX 3090]
22:00:41:WU00:FS01:Starting
22:00:41:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.20/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 16068 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
22:00:41:WU00:FS01:Started FahCore on PID 15680
22:00:41:WU00:FS01:Core PID:15684
22:00:41:WU00:FS01:FahCore 0x22 started
22:00:42:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
22:00:42:WU00:FS01:Starting
22:00:42:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.20/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 16068 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
22:00:42:WU00:FS01:Started FahCore on PID 17412
22:00:42:WU00:FS01:Core PID:17444
22:00:42:WU00:FS01:FahCore 0x22 started
22:00:43:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
22:01:42:WU00:FS01:Starting
22:01:42:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.20/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 16068 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
22:01:42:WU00:FS01:Started FahCore on PID 24572
22:01:42:WU00:FS01:Core PID:23772
22:01:42:WU00:FS01:FahCore 0x22 started
22:01:43:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
[/quote]
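To tally how often the core bails out in a log like the one above, a small Python sketch along these lines may help. The regex is written against the `FahCore returned: BAD_WORK_UNIT (114 = 0x72)` lines shown in this log; that other return statuses follow the same `NAME (n = 0xhh)` shape is an assumption, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   22:00:42:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
RETURN_RE = re.compile(
    r"^(\d\d:\d\d:\d\d):(?:WARNING:)?(WU\d+):(FS\d+):"
    r"FahCore returned: (\w+) \((\d+) = 0x[0-9a-fA-F]+\)"
)


def core_returns(log_text):
    """Return (time, wu, slot, status, code) tuples for each core exit line."""
    results = []
    for line in log_text.splitlines():
        m = RETURN_RE.match(line.strip())
        if m:
            time, wu, slot, status, code = m.groups()
            results.append((time, wu, slot, status, int(code)))
    return results


def count_by_status(log_text):
    """Count core exits per status name, e.g. BAD_WORK_UNIT."""
    return Counter(status for _, _, _, status, _ in core_returns(log_text))
```

Run against the excerpt above, `count_by_status` would report three `BAD_WORK_UNIT` exits, each about a second after the core started.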
Re: 18601 multiple bad work units
Posted: Tue Dec 13, 2022 10:22 pm
by BobWilliams757
Lazvon wrote: ↑Mon Dec 12, 2022 10:07 pm
18601 still causes me problems every so often. I'm going to limit based on GPU Temp I think and see if I can get it to work more reliably.
Interesting that someone else is seeing something here. Are you thinking in your case it's a matter of pushing temps too high?
In my case, the lower power limit seemed to be the issue. Having run at various power levels from 60-100% since my initial reported issues, I lost only 1 instance of this project. It just happened to be the night I posted my last update of thinking "problem solved".
Since then I have done just shy of 70 more of this project and no issues, not even a reported error noticed on any of them.
I did eventually change my fan curves slightly, but during the issues I was running stock fan profiles, which on this card kept it almost as cool, and at the lower power levels it was always chilly.