
Project 9208 (0, 13, 50) Bad state errors, over power limit

Posted: Sun Nov 08, 2015 6:56 pm
by CygnusXI
9208 (0, 13, 50) on a stock, non-overclocked GTX 780. When the work unit fired up, I noticed it was getting a massive PPD of around 294k (almost the same as my stock 980). I thought, "Excellent, core 21 is zooming on my old 780." Not the case when I checked on it this morning. There were some bad state errors that pushed the progress back a bit. I looked at MSI Afterburner and noticed that my GPU power % was peaking at up to 110%. Since the card is likely out of warranty, I stopped the GPU. I was running only 1 of 2 GPUs, so there was plenty of PSU wiggle room. I even tried underclocking the GPU core by over 100 MHz and was still seeing the same over-100% power usage. At stock, this card was running at 1100 MHz.

Driver version is 355.82 on Windows 7 Pro.

Now this looks a lot like the problems described in this post, which supposedly only affect Maxwell cards: viewtopic.php?f=19&t=28226. But this is not a Maxwell card. The WU is also different from the one in that post, though the issue seems identical, minus any mention of high power utilization. 110% *should* be safe to run at 24/7, but combined with the errors I felt it was time to report and drop the WU.

I guess my biggest issue with this is the power utilization being so high, or I would have just let the WU attempt to finish.

Code:

*********************** Log Started 2015-11-08T02:45:40Z ***********************
02:45:40:************************* Folding@home Client *************************
02:45:40:      Website: http://folding.stanford.edu/
02:45:40:    Copyright: (c) 2009-2014 Stanford University
02:45:40:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
02:45:40:         Args: --client-type=advanced
02:45:40:       Config: C:/Users/twin780folder/AppData/Roaming/FAHClient/config.xml
02:45:40:******************************** Build ********************************
02:45:40:      Version: 7.4.4
02:45:40:         Date: Mar 4 2014
02:45:40:         Time: 20:26:54
02:45:40:      SVN Rev: 4130
02:45:40:       Branch: fah/trunk/client
02:45:40:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
02:45:40:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
02:45:40:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
02:45:40:     Platform: win32 XP
02:45:40:         Bits: 32
02:45:40:         Mode: Release
02:45:40:******************************* System ********************************
02:45:40:          CPU: AMD FX(tm)-9370 Eight-Core Processor
02:45:40:       CPU ID: AuthenticAMD Family 21 Model 2 Stepping 0
02:45:40:         CPUs: 8
02:45:40:       Memory: 15.90GiB
02:45:40:  Free Memory: 12.19GiB
02:45:40:      Threads: WINDOWS_THREADS
02:45:40:   OS Version: 6.1
02:45:40:  Has Battery: false
02:45:40:   On Battery: false
02:45:40:   UTC Offset: -5
02:45:40:          PID: 1564
02:45:40:          CWD: C:/Users/twin780folder/AppData/Roaming/FAHClient
02:45:40:           OS: Windows 7 Professional
02:45:40:      OS Arch: AMD64
02:45:40:         GPUs: 2
02:45:40:        GPU 0: NVIDIA:3 GK110 [GeForce GTX 780]
02:45:40:        GPU 1: NVIDIA:3 GK110 [GeForce GTX 780]
02:45:40:         CUDA: 3.5
02:45:40:  CUDA Driver: 7050
02:45:40:Win32 Service: false
02:45:40:***********************************************************************
02:45:40:<config>
02:45:40:  <!-- Network -->
02:45:40:  <proxy v=':8080'/>
02:45:40:
02:45:40:  <!-- Slot Control -->
02:45:40:  <power v='full'/>
02:45:40:
02:45:40:  <!-- User Information -->
02:45:40:  <passkey v='********************************'/>
02:45:40:  <team v='224497'/>
02:45:40:  <user v='Cygnus-XI'/>
02:45:40:
02:45:40:  <!-- Folding Slots -->
02:45:40:  <slot id='1' type='GPU'>
02:45:40:    <paused v='true'/>
02:45:40:  </slot>
02:45:40:  <slot id='0' type='GPU'>
02:45:40:    <paused v='true'/>
02:45:40:  </slot>
02:45:40:</config>
02:45:40:Trying to access database...
02:45:40:Successfully acquired database lock
02:45:40:Enabled folding slot 01: PAUSED gpu:0:GK110 [GeForce GTX 780] (by user)
02:45:40:Enabled folding slot 00: PAUSED gpu:1:GK110 [GeForce GTX 780] (by user)
02:51:23:FS00:Unpaused
02:51:24:WU00:FS00:Connecting to 171.67.108.45:80
02:51:24:WU00:FS00:Assigned to work server 171.64.65.104
02:51:24:WU00:FS00:Requesting new work unit for slot 00: READY gpu:1:GK110 [GeForce GTX 780] from 171.64.65.104
02:51:24:WU00:FS00:Connecting to 171.64.65.104:8080
02:51:25:WU00:FS00:Downloading 10.04MiB
02:51:31:WU00:FS00:Download 57.28%
02:51:34:WU00:FS00:Download complete
02:51:34:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:9208 run:0 clone:13 gen:50 core:0x21 unit:0x00000040664f2dd055edd356ac29db8b
02:51:34:WU00:FS00:Starting
02:51:34:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/twin780folder/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 00 -suffix 01 -version 704 -lifeline 1564 -checkpoint 15 -gpu 1 -gpu-vendor nvidia
02:51:35:WU00:FS00:Started FahCore on PID 7544
02:51:35:WU00:FS00:Core PID:8156
02:51:35:WU00:FS00:FahCore 0x21 started
02:51:36:WU00:FS00:0x21:*********************** Log Started 2015-11-08T02:51:35Z ***********************
02:51:36:WU00:FS00:0x21:Project: 9208 (Run 0, Clone 13, Gen 50)
02:51:36:WU00:FS00:0x21:Unit: 0x00000040664f2dd055edd356ac29db8b
02:51:36:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
02:51:36:WU00:FS00:0x21:Machine: 0
02:51:36:WU00:FS00:0x21:Reading tar file core.xml
02:51:36:WU00:FS00:0x21:Reading tar file system.xml
02:51:37:WU00:FS00:0x21:Reading tar file integrator.xml
02:51:37:WU00:FS00:0x21:Reading tar file state.xml
02:51:38:WU00:FS00:0x21:Digital signatures verified
02:51:38:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
02:51:38:WU00:FS00:0x21:Version 0.0.11
02:51:46:Removing old file 'configs/config-20150822-155516.xml'
02:51:46:Saving configuration to config.xml
02:51:46:<config>
02:51:46:  <!-- Network -->
02:51:46:  <proxy v=':8080'/>
02:51:46:
02:51:46:  <!-- Slot Control -->
02:51:46:  <power v='full'/>
02:51:46:
02:51:46:  <!-- User Information -->
02:51:46:  <passkey v='********************************'/>
02:51:46:  <team v='224497'/>
02:51:46:  <user v='Cygnus-XI'/>
02:51:46:
02:51:46:  <!-- Folding Slots -->
02:51:46:  <slot id='1' type='GPU'>
02:51:46:    <paused v='true'/>
02:51:46:  </slot>
02:51:46:  <slot id='0' type='GPU'/>
02:51:46:</config>
02:52:45:WU00:FS00:0x21:Completed 0 out of 2500000 steps (0%)
02:52:45:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
02:59:51:WU00:FS00:0x21:Completed 25000 out of 2500000 steps (1%)
03:06:40:WU00:FS00:0x21:Completed 50000 out of 2500000 steps (2%)
03:13:29:WU00:FS00:0x21:Completed 75000 out of 2500000 steps (3%)
03:20:17:WU00:FS00:0x21:Completed 100000 out of 2500000 steps (4%)
03:27:26:WU00:FS00:0x21:Completed 125000 out of 2500000 steps (5%)
03:34:14:WU00:FS00:0x21:Completed 150000 out of 2500000 steps (6%)
03:41:02:WU00:FS00:0x21:Completed 175000 out of 2500000 steps (7%)
03:47:51:WU00:FS00:0x21:Completed 200000 out of 2500000 steps (8%)
03:55:00:WU00:FS00:0x21:Completed 225000 out of 2500000 steps (9%)
04:01:48:WU00:FS00:0x21:Completed 250000 out of 2500000 steps (10%)
04:08:36:WU00:FS00:0x21:Completed 275000 out of 2500000 steps (11%)
04:15:24:WU00:FS00:0x21:Completed 300000 out of 2500000 steps (12%)
04:22:34:WU00:FS00:0x21:Completed 325000 out of 2500000 steps (13%)
04:29:20:WU00:FS00:0x21:Completed 350000 out of 2500000 steps (14%)
04:36:08:WU00:FS00:0x21:Completed 375000 out of 2500000 steps (15%)
04:42:57:WU00:FS00:0x21:Completed 400000 out of 2500000 steps (16%)
04:50:07:WU00:FS00:0x21:Completed 425000 out of 2500000 steps (17%)
04:56:56:WU00:FS00:0x21:Completed 450000 out of 2500000 steps (18%)
05:03:45:WU00:FS00:0x21:Completed 475000 out of 2500000 steps (19%)
05:10:34:WU00:FS00:0x21:Completed 500000 out of 2500000 steps (20%)
05:17:44:WU00:FS00:0x21:Completed 525000 out of 2500000 steps (21%)
05:24:33:WU00:FS00:0x21:Completed 550000 out of 2500000 steps (22%)
05:31:23:WU00:FS00:0x21:Completed 575000 out of 2500000 steps (23%)
05:38:12:WU00:FS00:0x21:Completed 600000 out of 2500000 steps (24%)
05:45:21:WU00:FS00:0x21:Completed 625000 out of 2500000 steps (25%)
05:52:09:WU00:FS00:0x21:Completed 650000 out of 2500000 steps (26%)
05:58:58:WU00:FS00:0x21:Completed 675000 out of 2500000 steps (27%)
06:05:47:WU00:FS00:0x21:Completed 700000 out of 2500000 steps (28%)
06:12:57:WU00:FS00:0x21:Completed 725000 out of 2500000 steps (29%)
06:19:45:WU00:FS00:0x21:Completed 750000 out of 2500000 steps (30%)
06:26:35:WU00:FS00:0x21:Completed 775000 out of 2500000 steps (31%)
06:33:25:WU00:FS00:0x21:Completed 800000 out of 2500000 steps (32%)
06:40:37:WU00:FS00:0x21:Completed 825000 out of 2500000 steps (33%)
06:47:28:WU00:FS00:0x21:Completed 850000 out of 2500000 steps (34%)
06:54:19:WU00:FS00:0x21:Completed 875000 out of 2500000 steps (35%)
07:01:10:WU00:FS00:0x21:Completed 900000 out of 2500000 steps (36%)
07:08:21:WU00:FS00:0x21:Completed 925000 out of 2500000 steps (37%)
07:15:10:WU00:FS00:0x21:Completed 950000 out of 2500000 steps (38%)
07:21:59:WU00:FS00:0x21:Completed 975000 out of 2500000 steps (39%)
07:47:22:WU00:FS00:0x21:Completed 1000000 out of 2500000 steps (40%)
07:47:23:WU00:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint
07:54:14:WU00:FS00:0x21:Completed 925000 out of 2500000 steps (37%)
08:01:04:WU00:FS00:0x21:Completed 950000 out of 2500000 steps (38%)
08:07:54:WU00:FS00:0x21:Completed 975000 out of 2500000 steps (39%)
08:14:43:WU00:FS00:0x21:Completed 1000000 out of 2500000 steps (40%)
08:21:53:WU00:FS00:0x21:Completed 1025000 out of 2500000 steps (41%)
08:28:43:WU00:FS00:0x21:Completed 1050000 out of 2500000 steps (42%)
08:35:31:WU00:FS00:0x21:Completed 1075000 out of 2500000 steps (43%)
08:42:20:WU00:FS00:0x21:Completed 1100000 out of 2500000 steps (44%)
******************************* Date: 2015-11-08 *******************************
08:49:29:WU00:FS00:0x21:Completed 1125000 out of 2500000 steps (45%)
08:56:18:WU00:FS00:0x21:Completed 1150000 out of 2500000 steps (46%)
09:03:07:WU00:FS00:0x21:Completed 1175000 out of 2500000 steps (47%)
09:09:57:WU00:FS00:0x21:Completed 1200000 out of 2500000 steps (48%)
09:17:08:WU00:FS00:0x21:Completed 1225000 out of 2500000 steps (49%)
09:23:58:WU00:FS00:0x21:Completed 1250000 out of 2500000 steps (50%)
09:30:48:WU00:FS00:0x21:Completed 1275000 out of 2500000 steps (51%)
09:37:38:WU00:FS00:0x21:Completed 1300000 out of 2500000 steps (52%)
09:44:49:WU00:FS00:0x21:Completed 1325000 out of 2500000 steps (53%)
09:51:39:WU00:FS00:0x21:Completed 1350000 out of 2500000 steps (54%)
09:58:29:WU00:FS00:0x21:Completed 1375000 out of 2500000 steps (55%)
10:05:19:WU00:FS00:0x21:Completed 1400000 out of 2500000 steps (56%)
10:12:28:WU00:FS00:0x21:Completed 1425000 out of 2500000 steps (57%)
10:19:17:WU00:FS00:0x21:Completed 1450000 out of 2500000 steps (58%)
11:09:48:WU00:FS00:0x21:Completed 1475000 out of 2500000 steps (59%)
******************************* Date: 2015-11-08 *******************************
15:33:35:WU00:FS00:0x21:Completed 1500000 out of 2500000 steps (60%)
15:33:36:WU00:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint
15:40:25:WU00:FS00:0x21:Completed 1425000 out of 2500000 steps (57%)
15:47:13:WU00:FS00:0x21:Completed 1450000 out of 2500000 steps (58%)
15:54:03:WU00:FS00:0x21:Completed 1475000 out of 2500000 steps (59%)
17:55:24:FS00:Paused
17:55:24:FS00:Shutting core down
17:55:24:WU00:FS00:0x21:WARNING:Console control signal 1 on PID 8156
17:55:24:WU00:FS00:0x21:Exiting, please wait. . .
17:55:26:WU00:FS00:0x21:Folding@home Core Shutdown: INTERRUPTED
17:55:27:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
17:55:35:Removing old file 'configs/config-20150822-210622.xml'
17:55:35:Saving configuration to config.xml
17:55:35:<config>
17:55:35:  <!-- Network -->
17:55:35:  <proxy v=':8080'/>
17:55:35:
17:55:35:  <!-- Slot Control -->
17:55:35:  <power v='full'/>
17:55:35:
17:55:35:  <!-- User Information -->
17:55:35:  <passkey v='********************************'/>
17:55:35:  <team v='224497'/>
17:55:35:  <user v='Cygnus-XI'/>
17:55:35:
17:55:35:  <!-- Folding Slots -->
17:55:35:  <slot id='1' type='GPU'>
17:55:35:    <paused v='true'/>
17:55:35:  </slot>
17:55:35:  <slot id='0' type='GPU'>
17:55:35:    <paused v='true'/>
17:55:35:  </slot>
17:55:35:</config>
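
For anyone scanning a log like the one above for rollbacks, here is a minimal Python sketch. It assumes the line formats seen in this log ("Completed N out of M steps" and "Bad State detected"); the function name `find_rollbacks` is made up for illustration and is not part of the FAH client. It reports the last step count logged before each Bad State and the first step count logged after the resume.

```python
import re

# Matches progress lines such as
# "07:47:22:WU00:FS00:0x21:Completed 1000000 out of 2500000 steps (40%)"
PROGRESS = re.compile(r"^(\d\d:\d\d:\d\d):.*Completed (\d+) out of (\d+) steps")
BAD_STATE = "Bad State detected"


def find_rollbacks(lines):
    """Return (time, steps_before, steps_after) for each Bad State rollback."""
    rollbacks = []
    last_steps = None   # most recent step count seen in a progress line
    pending = None      # (time, steps) of a Bad State awaiting the next progress line
    for line in lines:
        if BAD_STATE in line:
            pending = (line[:8], last_steps)
            continue
        m = PROGRESS.match(line)
        if not m:
            continue
        steps = int(m.group(2))
        if pending is not None:
            rollbacks.append((pending[0], pending[1], steps))
            pending = None
        last_steps = steps
    return rollbacks


# Four lines copied from the log above.
sample = [
    "07:21:59:WU00:FS00:0x21:Completed 975000 out of 2500000 steps (39%)",
    "07:47:22:WU00:FS00:0x21:Completed 1000000 out of 2500000 steps (40%)",
    "07:47:23:WU00:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint",
    "07:54:14:WU00:FS00:0x21:Completed 925000 out of 2500000 steps (37%)",
]

for when, before, after in find_rollbacks(sample):
    print(f"{when}: progress went from step {before} back to step {after}")
```

Run against the full log, the same function should flag both Bad State events (07:47:23 and 15:33:36).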

Re: Project 9208 (0, 13, 50) Bad state errors, over power li

Posted: Sun Nov 08, 2015 7:19 pm
by CygnusXI
Just fired up both GPUs, with a core 18 and a core 21 running, and noticed this in the log. I'm going to go out on a limb here and say the problem might have had something to do with it: it seems my FahCore_21.exe was outdated. Currently running 9631 (1, 4, 33) at only 80% power draw, much better than 110%.

Code:

19:11:11:WU01:FS01:0x21:ERROR:110: Need version 12
19:11:11:WU01:FS01:0x21:Folding@home Core Shutdown: CORE_OUTDATED
19:11:11:WARNING:WU01:FS01:FahCore returned: CORE_OUTDATED (110 = 0x6e)
19:11:11:WU01:FS01:Downloading core from http://web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah
19:11:11:WU01:FS01:Connecting to web.stanford.edu:80
19:11:12:WU01:FS01:FahCore 21: Downloading 3.34MiB
19:11:17:WU01:FS01:FahCore 21: Download complete
19:11:17:WU01:FS01:Valid core signature
19:11:17:WU01:FS01:Unpacked 11.43MiB to cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe
19:11:17:WU01:FS01:Starting
19:11:17:WU01:FS01:Running FahCore

Re: Project 9208 (0, 13, 50) Bad state errors, over power li

Posted: Sun Nov 08, 2015 9:37 pm
by toTOW
Some projects require the new core version; that's why you got an update.

If you get p9208 again, let us know how it folds with the updated core.

Re: Project 9208 (0, 13, 50) Bad state errors, over power li

Posted: Mon Nov 09, 2015 3:19 am
by bruce
CygnusXI wrote: I noticed when the work unit fired up it was getting a massive PPD of around 294k PPD ... Not the case when I checked on it this morning.
The estimated PPD is based on recent history of that project. It was NOT getting a massive PPD, it was just overestimated. Initial estimates often have very little data on which to base that estimate. After a reasonable amount of progress using that GPU on that Project, the estimate will improve.
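
To illustrate why an estimate built on very little data is shaky, here is a rough Python sketch. It is not the client's PPD formula (the base points and bonus parameters are not in this thread); it only computes the time per 1% frame from the first few timestamps in the log above and shows how the running average shifts as frames accumulate. The helper names `frame_seconds` and `running_mean` are made up for this example.

```python
from datetime import datetime, timedelta


def frame_seconds(timestamps):
    """Seconds between consecutive HH:MM:SS timestamps (handles a midnight wrap)."""
    times = [datetime.strptime(t, "%H:%M:%S") for t in timestamps]
    deltas = []
    for prev, cur in zip(times, times[1:]):
        if cur < prev:
            cur += timedelta(days=1)  # the clock rolled over past midnight
        deltas.append((cur - prev).total_seconds())
    return deltas


def running_mean(values):
    """Mean of the first 1, 2, 3, ... values."""
    out, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        out.append(total / count)
    return out


# Timestamps of the 0% to 4% progress lines in the log above.
stamps = ["02:52:45", "02:59:51", "03:06:40", "03:13:29", "03:20:17"]

frames = frame_seconds(stamps)
for n, avg in enumerate(running_mean(frames), start=1):
    print(f"after {n} frame(s): {avg:.1f} s per frame")
```

The first frame here is slower than the ones that follow, so an estimate taken from it alone would differ from one taken a few frames later.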
CygnusXI wrote: I looked at MSI afterburner and noticed that my gpu power % was peaking up to 110%. ... I guess my biggest issue with this is the power utilization being so high
NVIDIA is responsible for providing drivers that manage the GPU power. If they're not doing a good job (assuming you didn't modify their settings), complain to NVIDIA.
CygnusXI wrote: There were some bad state errors ...
Bad State issues are much more frequent with non-overclocked Maxwell GPUs than with older GPUs, but they do happen from time to time with older GPUs. When their cause is determined for Maxwell, hopefully that will also reduce or eliminate them for pre-Maxwell GPUs. It should be noted that the cause has not been determined, but I'll do some (unfounded?) speculation: excess power draw MIGHT be a contributing factor. Also, this could be a simple memory error. (Memory parity is only checked in the more expensive commercial-grade GPUs, not the consumer-grade GPUs.) FAH is designed to work within the limitations of consumer-grade hardware. Again, this is all speculation; the facts are not yet known.

It should also be noted that the error message using the term "bad state" is new, but FAH has always had occasional errors that were reported in different ways. Since many of these errors are recoverable by restarting from the previous checkpoint, code was developed to recover whenever possible. If that happens repeatedly in the same WU, the WU may be aborted. (Did yours abort?)
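
The recover-from-checkpoint behavior described above can be sketched roughly as follows. This is a generic illustration in Python, not the FahCore implementation: the `BadState` exception, the `run_with_checkpoints` function, and the `max_retries` limit are all assumptions made for the example.

```python
class BadState(Exception):
    """Raised by the (hypothetical) simulation step when its result is invalid."""


def run_with_checkpoints(step_fn, total_steps, checkpoint_every, max_retries):
    """Call step_fn(step) for each step, rolling back to the last checkpoint on BadState.

    Returns the number of rollbacks; raises RuntimeError once max_retries is exceeded.
    """
    checkpoint = 0
    retries = 0
    step = 0
    while step < total_steps:
        try:
            step_fn(step)
        except BadState:
            retries += 1
            if retries > max_retries:
                raise RuntimeError("too many bad states, aborting work unit")
            step = checkpoint  # resume from the last good checkpoint
            continue
        step += 1
        if step % checkpoint_every == 0:
            checkpoint = step  # everything up to here is known good
    return retries


failed_once = set()


def flaky(step):
    # Fails the first time step 7 is reached, then succeeds on the retry.
    if step == 7 and step not in failed_once:
        failed_once.add(step)
        raise BadState()


print(run_with_checkpoints(flaky, total_steps=10, checkpoint_every=5, max_retries=3))
```

In this toy run the failure at step 7 sends the loop back to the checkpoint at step 5, so steps 5 and 6 are repeated, much as the log earlier in the thread repeats the 37% to 39% progress lines after a Bad State.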