Re: Bad work unit??

Posted: Sat Oct 03, 2020 2:51 pm
by gunnarre
In case you want to test "--disable-cuda", you can add it in the "extra-core-args" option on the GPU slot. Reducing the clock would be the best idea because CUDA gives so much faster folding than OpenCL right now.

Re: Particle coordinate is nan???

Posted: Sat Oct 03, 2020 2:59 pm
by bruce
Depending on how long it has been used, you may have accumulated dust in the heatsink, which may be causing overheating ... or the thermal paste between the CPU and the heatsink may have degraded ... or maybe the factory overclock was never adequate to handle the full load imposed by serious processing (perhaps just games were expected).

FAH has recently distributed CUDA code for NVidia GPUs which may load it more fully than it has ever been loaded before. If you manually increased the overclocking yourself, back off.

I'd start by reducing the overclocking to original factory settings and see if that eliminates the problem ... but check the heatsink for dust first.

Re: Particle coordinate is nan???

Posted: Sat Oct 03, 2020 8:46 pm
by PantherX
In addition to what bruce mentioned, please note that GPU benchmarking tools such as AIDA64 Extreme, hashing applications, etc. do not replicate the load of folding. Generally speaking, folding is more intensive than those applications. Thus, see if stock settings or even underclocking your GPU a bit helps out.

Re: Particle coordinate is nan???

Posted: Sat Oct 03, 2020 10:52 pm
by stratocastor
Thank you all for the suggestions! I have tried downclocking with similar results, still odd errors. Last week, I actually tore it apart, cleaned it and applied new thermal paste as well. What other program could I try that would emulate the sort of CUDA load that folding does? Literally everything else I can throw at it runs fine.

Re: Bad work unit??

Posted: Sat Oct 03, 2020 11:05 pm
by stratocastor
JohnChodera wrote:> When did the cuda update in folding drop again?? T

core22 0.0.13 with CUDA support rolled out on Mon 28 Sep for most folks (though BETA users had it more than a week earlier).

Is just this project giving you trouble, or all of them?

~ John Chodera // MSKCC

This seems to align to when I started experiencing the problems. Will search through logs to confirm.

When I went back to air cooling, I used new thermal pads, as the old ones were toast. Using Thermal Grizzly Minus Pad 8s. I have tried stressing the GPU with every program imaginable. Can get the temps up to 75C with the most demanding benchmarks, mining apps, AIDA64 etc. I tried to downclock with no change in the error frequency. Perhaps my GPU just doesn't like being that warm. When I was on water, it was maxing at 45C. Currently waiting on a few parts from EKWB to rebuild my loop. Currently have CUDA disabled, and will run overnight to see how it goes. So far, it seems to be progressing past the point where I would have experienced errors. Will post back tomorrow with updates. The points difference!!!! :(

Re: Bad work unit??

Posted: Sat Oct 03, 2020 11:31 pm
by stratocastor
Well... that was short lived....

Code:

23:14:20:WU00:FS01:0x22:*********************** Log Started 2020-10-03T23:14:19Z ***********************
23:14:20:WU00:FS01:0x22:*************************** Core22 Folding@home Core ***************************
23:14:20:WU00:FS01:0x22:       Core: Core22
23:14:20:WU00:FS01:0x22:       Type: 0x22
23:14:20:WU00:FS01:0x22:    Version: 0.0.13
23:14:20:WU00:FS01:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
23:14:20:WU00:FS01:0x22:  Copyright: 2020 foldingathome.org
23:14:20:WU00:FS01:0x22:   Homepage: https://foldingathome.org/
23:14:20:WU00:FS01:0x22:       Date: Sep 19 2020
23:14:20:WU00:FS01:0x22:       Time: 02:35:58
23:14:20:WU00:FS01:0x22:   Revision: 571cf95de6de2c592c7c3ed48fcfb2e33e9ea7d3
23:14:20:WU00:FS01:0x22:     Branch: core22-0.0.13
23:14:20:WU00:FS01:0x22:   Compiler: Visual C++ 2015
23:14:20:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
23:14:20:WU00:FS01:0x22:             -DOPENMM_GIT_HASH="\"189320d0\""
23:14:20:WU00:FS01:0x22:   Platform: win32 10
23:14:20:WU00:FS01:0x22:       Bits: 64
23:14:20:WU00:FS01:0x22:       Mode: Release
23:14:20:WU00:FS01:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
23:14:20:WU00:FS01:0x22:             <peastman@stanford.edu>
23:14:20:WU00:FS01:0x22:       Args: -dir 00 -suffix 01 -version 706 -lifeline 16336 -checkpoint 15
23:14:20:WU00:FS01:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
23:14:20:WU00:FS01:0x22:             0 -gpu 0 --disable-cuda
23:14:20:WU00:FS01:0x22:************************************ libFAH ************************************
23:14:20:WU00:FS01:0x22:       Date: Sep 7 2020
23:14:20:WU00:FS01:0x22:       Time: 19:09:56
23:14:20:WU00:FS01:0x22:   Revision: 44301ed97b996b63fe736bb8073f22209cb2b603
23:14:20:WU00:FS01:0x22:     Branch: HEAD
23:14:20:WU00:FS01:0x22:   Compiler: Visual C++ 2015
23:14:20:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
23:14:20:WU00:FS01:0x22:   Platform: win32 10
23:14:20:WU00:FS01:0x22:       Bits: 64
23:14:20:WU00:FS01:0x22:       Mode: Release
23:14:20:WU00:FS01:0x22:************************************ CBang *************************************
23:14:20:WU00:FS01:0x22:       Date: Sep 7 2020
23:14:20:WU00:FS01:0x22:       Time: 19:08:30
23:14:20:WU00:FS01:0x22:   Revision: 33fcfc2b3ed2195a423606a264718e31e6b3903f
23:14:20:WU00:FS01:0x22:     Branch: HEAD
23:14:20:WU00:FS01:0x22:   Compiler: Visual C++ 2015
23:14:20:WU00:FS01:0x22:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Ob3 /Zc:throwingNew /MT
23:14:20:WU00:FS01:0x22:   Platform: win32 10
23:14:20:WU00:FS01:0x22:       Bits: 64
23:14:20:WU00:FS01:0x22:       Mode: Release
23:14:20:WU00:FS01:0x22:************************************ System ************************************
23:14:20:WU00:FS01:0x22:        CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
23:14:20:WU00:FS01:0x22:     CPU ID: GenuineIntel Family 6 Model 94 Stepping 3
23:14:20:WU00:FS01:0x22:       CPUs: 8
23:14:20:WU00:FS01:0x22:     Memory: 15.94GiB
23:14:20:WU00:FS01:0x22:Free Memory: 11.92GiB
23:14:20:WU00:FS01:0x22:    Threads: WINDOWS_THREADS
23:14:20:WU00:FS01:0x22: OS Version: 6.2
23:14:20:WU00:FS01:0x22:Has Battery: false
23:14:20:WU00:FS01:0x22: On Battery: false
23:14:20:WU00:FS01:0x22: UTC Offset: -6
23:14:20:WU00:FS01:0x22:        PID: 15640
23:14:20:WU00:FS01:0x22:        CWD: C:\Users\Beast\AppData\Roaming\FAHClient\work
23:14:20:WU00:FS01:0x22:************************************ OpenMM ************************************
23:14:20:WU00:FS01:0x22:   Revision: 189320d0
23:14:20:WU00:FS01:0x22:********************************************************************************
23:14:20:WU00:FS01:0x22:Project: 11751 (Run 0, Clone 15165, Gen 12)
23:14:20:WU00:FS01:0x22:Unit: 0x0000001e8ca304e75e6d6f7f5be4be71
23:14:20:WU00:FS01:0x22:Digital signatures verified
23:14:20:WU00:FS01:0x22:Folding@home GPU Core22 Folding@home Core
23:14:20:WU00:FS01:0x22:Version 0.0.13
23:14:20:WU00:FS01:0x22:  Checkpoint write interval: 50000 steps (5%) [20 total]
23:14:20:WU00:FS01:0x22:  JSON viewer frame write interval: 10000 steps (1%) [100 total]
23:14:20:WU00:FS01:0x22:  XTC frame write interval: 50000 steps (5%) [20 total]
23:14:20:WU00:FS01:0x22:  Global context and integrator variables write interval: disabled
23:14:20:WU00:FS01:0x22:There are 4 platforms available.
23:14:20:WU00:FS01:0x22:Platform 0: Reference
23:14:20:WU00:FS01:0x22:Platform 1: CPU
23:14:20:WU00:FS01:0x22:Platform 2: OpenCL
23:14:20:WU00:FS01:0x22:  opencl-device 0 specified
23:14:20:WU00:FS01:0x22:Platform 3: CUDA
23:14:20:WU00:FS01:0x22:  cuda-device 0 specified
23:14:20:WU00:FS01:0x22:Disabling CUDA platform because 'disable-cuda' argument was specified.
23:14:32:WU00:FS01:0x22:Attempting to create OpenCL context:
23:14:32:WU00:FS01:0x22:  Configuring platform OpenCL
23:14:45:WU00:FS01:0x22:  Using OpenCL on platformId 0 and gpu 0
23:14:45:WU00:FS01:0x22:Completed 50000 out of 1000000 steps (5%)
23:15:53:WU00:FS01:0x22:Completed 60000 out of 1000000 steps (6%)
23:16:59:WU00:FS01:0x22:Completed 70000 out of 1000000 steps (7%)
23:18:05:WU00:FS01:0x22:Completed 80000 out of 1000000 steps (8%)
23:19:10:WU00:FS01:0x22:Completed 90000 out of 1000000 steps (9%)
23:20:15:WU00:FS01:0x22:Completed 100000 out of 1000000 steps (10%)
23:20:16:WU00:FS01:0x22:Checkpoint completed at step 100000
23:21:21:WU00:FS01:0x22:Completed 110000 out of 1000000 steps (11%)
23:22:26:WU00:FS01:0x22:Completed 120000 out of 1000000 steps (12%)
23:23:32:WU00:FS01:0x22:Completed 130000 out of 1000000 steps (13%)
23:24:44:WU00:FS01:0x22:An exception occurred at step 139555: Particle coordinate is nan
23:24:44:WU00:FS01:0x22:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
23:24:44:WU00:FS01:0x22:Folding@home Core Shutdown: CORE_RESTART
23:24:45:WARNING:WU00:FS01:FahCore returned: CORE_RESTART (98 = 0x62)
23:24:45:WU00:FS01:Starting
23:24:45:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\Beast\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/win/64bit/22-0.0.13/Core_22.fah/FahCore_22.exe -dir 00 -suffix 01 -version 706 -lifeline 12136 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0 --disable-cuda
23:24:45:WU00:FS01:Started FahCore on PID 17684
23:24:45:WU00:FS01:Core PID:14788
23:24:45:WU00:FS01:FahCore 0x22 started
Mod Edit: Added Code Tags - PantherX
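
For anyone who wants to search their own logs for this failure, here is a minimal Python sketch. It assumes the log lines look like the ones posted above; the two message patterns are copied from this log, and other core versions may word them differently. The function name find_failures is made up for illustration and is not part of any FAH tool.

```python
import re

# Patterns taken from the log excerpt in this thread (assumption: other
# core versions may phrase these messages differently).
CHECKPOINT_RE = re.compile(r"Checkpoint completed at step (\d+)")
EXCEPTION_RE = re.compile(r"An exception occurred at step (\d+): (.+)")


def find_failures(lines):
    """Return (timestamp, failed_step, last_checkpoint, message) tuples."""
    failures = []
    last_checkpoint = None  # most recent checkpoint step seen so far
    for line in lines:
        match = CHECKPOINT_RE.search(line)
        if match:
            last_checkpoint = int(match.group(1))
            continue
        match = EXCEPTION_RE.search(line)
        if match:
            timestamp = line[:8]  # leading HH:MM:SS of the log line
            failures.append(
                (timestamp, int(match.group(1)), last_checkpoint,
                 match.group(2).strip())
            )
    return failures


# Three lines copied from the log above, used as sample input.
sample = [
    "23:20:16:WU00:FS01:0x22:Checkpoint completed at step 100000",
    "23:23:32:WU00:FS01:0x22:Completed 130000 out of 1000000 steps (13%)",
    "23:24:44:WU00:FS01:0x22:An exception occurred at step 139555: Particle coordinate is nan",
]

for ts, step, ckpt, msg in find_failures(sample):
    redo = step - ckpt if ckpt is not None else None
    print(f"{ts} step {step}: {msg} (last checkpoint {ckpt}, steps to redo {redo})")
```

To run it against a real log, pass the open file instead of the sample list, e.g. find_failures(open(path)) with the path to your own log file. Note that the sketch does not reset the checkpoint when a new work unit starts, so treat the "steps to redo" figure as a rough guide only.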

Re: Particle coordinate is nan???

Posted: Sun Oct 04, 2020 7:04 am
by PantherX
stratocastor wrote:... What other program could I try that would emulate the sort of cuda load that folding does? Literally every thing else runs fine that I can throw at it.
Unfortunately, FAHBench hasn't been updated to FahCore_22 specifications. There are plans to do it but there's no ETA.

Re: Particle coordinate is nan???

Posted: Sun Oct 04, 2020 7:47 am
by foldy
@stratocastor: Maybe your GPU works better with OpenCL instead of CUDA for FAH? You can disable CUDA in extra core options, but it's slower.

Re: Bad work unit??

Posted: Sun Oct 04, 2020 7:55 am
by NormalDiffusion
So even underclocked she won't fold anymore? Or were you trying with cuda disabled and delivery clock?

Re: Particle coordinate is nan???

Posted: Sun Oct 04, 2020 8:34 am
by gunnarre
I think you need two dashes, so "--disable-cuda".

Re: Particle coordinate is nan???

Posted: Sun Oct 04, 2020 8:58 am
by PantherX
gunnarre wrote:I think you need two dashes, so "--disable-cuda".
FYI, a single dash or a double dash would work fine.

Re: Particle coordinate is nan???

Posted: Sun Oct 04, 2020 9:27 pm
by gunnarre
Thanks. Good to know. By convention, many *nix tools use a single dash for single-letter options, but it's more user friendly to allow both.
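
As an illustration of accepting both spellings, here is a small sketch using Python's standard argparse module. This is not how FahCore_22 parses its arguments (that code isn't shown in this thread); the program name and help text are made up for the example.

```python
import argparse

# Hypothetical parser for illustration only; not FahCore_22's real parser.
parser = argparse.ArgumentParser(prog="example-core")
parser.add_argument(
    "-disable-cuda", "--disable-cuda",  # register both spellings
    dest="disable_cuda",
    action="store_true",
    help="skip the CUDA platform (illustrative option)",
)

# Either spelling sets the same flag; omitting it leaves the flag False.
for argv in (["-disable-cuda"], ["--disable-cuda"], []):
    args = parser.parse_args(argv)
    print(argv, "->", args.disable_cuda)
```

Registering both option strings on one argument means the rest of the program only ever checks a single flag, whichever form the user typed.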

Re: Bad work unit??

Posted: Sun Oct 04, 2020 9:54 pm
by stratocastor
For the last 22 hours, it took reducing the power limit to 80%, core clock running at 1329 currently. Seems to be stable on cuda folding for the time being. The factory overclock on this 980ti is 1404mhz.

Re: Bad work unit??

Posted: Sun Oct 04, 2020 10:15 pm
by bruce
For some people, overclocking is a way to increase throughput. For others, running CUDA accomplishes the same thing. Apparently you can't have both. Officially, FAH does not support overclocking. If you choose to take responsibility for your own overclock settings and your own cooling methodology, you're welcome to disable CUDA or not ... or figure out what is optimum for your kit. OpenCL is still a choice you can make, but we can't really help you with those decisions.

NaNs are a common result of unstable calculations, and runs that hit such errors do not produce results that are useful to science. There are lots and lots of people with non-overclocked systems that are very happy with CUDA's increase in productivity. The FAHCores are reportedly a more strenuous benchmark than other programs, but please don't waste production assignments.

Re: Bad work unit??

Posted: Sun Oct 04, 2020 10:22 pm
by Joe_H
stratocastor wrote:For the last 22 hours, it took reducing the power limit to 80%, core clock running at 1329 currently. Seems to be stable on cuda folding for the time being. The factory overclock on this 980ti is 1404mhz.
You are still at a higher clock than the reference design for the 980 Ti. The reference design has a base clock of 1000 MHz and a boost clock of 1075 MHz.
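
To put numbers on that, here is a quick Python sketch of the arithmetic using only the clocks quoted in this thread. It compares against the advertised reference figures; the clock a card actually sustains under load can differ, so read the percentages as rough.

```python
# Reference clocks for the 980 Ti as quoted above, in MHz.
REFERENCE_BASE_MHZ = 1000
REFERENCE_BOOST_MHZ = 1075


def percent_over(clock_mhz, reference_mhz):
    """Percentage by which clock_mhz exceeds reference_mhz."""
    return (clock_mhz / reference_mhz - 1) * 100


# Clocks reported by stratocastor earlier in the thread.
for label, clock in (("factory overclock", 1404), ("80% power limit", 1329)):
    print(
        f"{label}: {clock} MHz, "
        f"{percent_over(clock, REFERENCE_BASE_MHZ):.1f}% over base, "
        f"{percent_over(clock, REFERENCE_BOOST_MHZ):.1f}% over boost"
    )
```

Even at the reduced setting, 1329 MHz is still roughly 24% above the reference boost clock.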