Project: 9408 (Run 628, Clone 0, Gen 0) [RESOLVED]
Posted: Fri Apr 18, 2014 10:40 am
				
				Got to the machine this morning and found this:
The "Bad State detected" messages seem to be preceded by a very long frame time. At one point (just after 08:00) I thought it was stuck and was about to dump it, but it picked up again just in time, so I let it run. Even the horrible PPD it was returning (only ~70% of typical) was better than zero from a dumped WU!
But then it did it once too often and errored out.
The GPU temperature varied quite widely during the folding, from ~63°C up to its normal 82-83°C; not sure if that's useful.
After that I'm a bit confused, as various parts of the log are interleaved, but it got a P9406 which was immediately rejected:
(It had earlier completed a P9406 (Run 177, Clone 0, Gen 6) without a hitch.)
It then got another P9408 (Run 496, Clone 0, Gen 3), which started like the first one (very low PPD), and I'm afraid I'd had enough at that point: I dumped it and removed the advanced flag. It's now crunching a P13000, happily so far.
My main question: the 780 Ti is from Gigabyte and slightly overclocked by the manufacturer. Is this likely to be the problem?
If it is, can I reduce or remove the overclock (it's in a Linux box), and if so, how?
Otherwise I'll have to try to get it swapped for a stock-speed one… which would be a pity.
 
Code:
23:00:05:WU01:FS01:0x17:Project: 9408 (Run 628, Clone 0, Gen 0)
23:00:05:WU01:FS01:0x17:Unit: 0x000000000a3b1e5c5342df4fb1e621a8
23:00:05:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
23:00:05:WU01:FS01:0x17:Machine: 1
23:00:05:WU01:FS01:0x17:Reading tar file system.xml
23:00:05:WU01:FS01:0x17:Reading tar file integrator.xml
23:00:05:WU01:FS01:0x17:Reading tar file state.xml
23:00:05:WU01:FS01:0x17:Reading tar file core.xml
23:00:05:WU01:FS01:0x17:Digital signatures verified
23:01:36:WU01:FS01:0x17:Completed 0 out of 5000000 steps (0%)
23:07:01:WU01:FS01:0x17:Completed 50000 out of 5000000 steps (1%)
23:12:23:WU01:FS01:0x17:Completed 100000 out of 5000000 steps (2%)
23:17:46:WU01:FS01:0x17:Completed 150000 out of 5000000 steps (3%)
23:23:08:WU01:FS01:0x17:Completed 200000 out of 5000000 steps (4%)
23:28:31:WU01:FS01:0x17:Completed 250000 out of 5000000 steps (5%)
23:33:53:WU01:FS01:0x17:Completed 300000 out of 5000000 steps (6%)
23:39:15:WU01:FS01:0x17:Completed 350000 out of 5000000 steps (7%)
23:44:37:WU01:FS01:0x17:Completed 400000 out of 5000000 steps (8%)
23:50:00:WU01:FS01:0x17:Completed 450000 out of 5000000 steps (9%)
23:55:22:WU01:FS01:0x17:Completed 500000 out of 5000000 steps (10%)
******************************* Date: 2014-04-18 *******************************
00:00:45:WU01:FS01:0x17:Completed 550000 out of 5000000 steps (11%)
00:06:07:WU01:FS01:0x17:Completed 600000 out of 5000000 steps (12%)
00:11:29:WU01:FS01:0x17:Completed 650000 out of 5000000 steps (13%)
00:16:51:WU01:FS01:0x17:Completed 700000 out of 5000000 steps (14%)
00:22:14:WU01:FS01:0x17:Completed 750000 out of 5000000 steps (15%)
00:27:36:WU01:FS01:0x17:Completed 800000 out of 5000000 steps (16%)
00:32:58:WU01:FS01:0x17:Completed 850000 out of 5000000 steps (17%)
00:38:21:WU01:FS01:0x17:Completed 900000 out of 5000000 steps (18%)
00:43:43:WU01:FS01:0x17:Completed 950000 out of 5000000 steps (19%)
00:49:05:WU01:FS01:0x17:Completed 1000000 out of 5000000 steps (20%)
00:54:28:WU01:FS01:0x17:Completed 1050000 out of 5000000 steps (21%)
00:59:50:WU01:FS01:0x17:Completed 1100000 out of 5000000 steps (22%)
01:05:13:WU01:FS01:0x17:Completed 1150000 out of 5000000 steps (23%)
01:10:35:WU01:FS01:0x17:Completed 1200000 out of 5000000 steps (24%)
01:15:57:WU01:FS01:0x17:Completed 1250000 out of 5000000 steps (25%)
01:21:20:WU01:FS01:0x17:Completed 1300000 out of 5000000 steps (26%)
01:26:42:WU01:FS01:0x17:Completed 1350000 out of 5000000 steps (27%)
01:32:05:WU01:FS01:0x17:Completed 1400000 out of 5000000 steps (28%)
01:37:28:WU01:FS01:0x17:Completed 1450000 out of 5000000 steps (29%)
01:42:50:WU01:FS01:0x17:Completed 1500000 out of 5000000 steps (30%)
01:50:12:WU01:FS01:0x17:Completed 1550000 out of 5000000 steps (31%)
01:50:18:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
02:00:58:WU01:FS01:0x17:Completed 1600000 out of 5000000 steps (32%)
02:06:20:WU01:FS01:0x17:Completed 1650000 out of 5000000 steps (33%)
02:11:42:WU01:FS01:0x17:Completed 1700000 out of 5000000 steps (34%)
02:17:05:WU01:FS01:0x17:Completed 1750000 out of 5000000 steps (35%)
02:22:27:WU01:FS01:0x17:Completed 1800000 out of 5000000 steps (36%)
02:27:49:WU01:FS01:0x17:Completed 1850000 out of 5000000 steps (37%)
02:33:11:WU01:FS01:0x17:Completed 1900000 out of 5000000 steps (38%)
02:38:33:WU01:FS01:0x17:Completed 1950000 out of 5000000 steps (39%)
02:43:56:WU01:FS01:0x17:Completed 2000000 out of 5000000 steps (40%)
02:49:18:WU01:FS01:0x17:Completed 2050000 out of 5000000 steps (41%)
02:54:40:WU01:FS01:0x17:Completed 2100000 out of 5000000 steps (42%)
03:00:02:WU01:FS01:0x17:Completed 2150000 out of 5000000 steps (43%)
03:05:24:WU01:FS01:0x17:Completed 2200000 out of 5000000 steps (44%)
03:24:13:WU01:FS01:0x17:Completed 2250000 out of 5000000 steps (45%)
03:24:19:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
03:34:58:WU01:FS01:0x17:Completed 2300000 out of 5000000 steps (46%)
03:40:21:WU01:FS01:0x17:Completed 2350000 out of 5000000 steps (47%)
03:45:43:WU01:FS01:0x17:Completed 2400000 out of 5000000 steps (48%)
03:51:05:WU01:FS01:0x17:Completed 2450000 out of 5000000 steps (49%)
03:56:28:WU01:FS01:0x17:Completed 2500000 out of 5000000 steps (50%)
04:01:50:WU01:FS01:0x17:Completed 2550000 out of 5000000 steps (51%)
04:07:12:WU01:FS01:0x17:Completed 2600000 out of 5000000 steps (52%)
04:12:34:WU01:FS01:0x17:Completed 2650000 out of 5000000 steps (53%)
04:17:56:WU01:FS01:0x17:Completed 2700000 out of 5000000 steps (54%)
04:23:18:WU01:FS01:0x17:Completed 2750000 out of 5000000 steps (55%)
04:28:40:WU01:FS01:0x17:Completed 2800000 out of 5000000 steps (56%)
04:34:03:WU01:FS01:0x17:Completed 2850000 out of 5000000 steps (57%)
04:39:25:WU01:FS01:0x17:Completed 2900000 out of 5000000 steps (58%)
04:44:47:WU01:FS01:0x17:Completed 2950000 out of 5000000 steps (59%)
04:50:09:WU01:FS01:0x17:Completed 3000000 out of 5000000 steps (60%)
04:55:32:WU01:FS01:0x17:Completed 3050000 out of 5000000 steps (61%)
05:00:54:WU01:FS01:0x17:Completed 3100000 out of 5000000 steps (62%)
05:06:17:WU01:FS01:0x17:Completed 3150000 out of 5000000 steps (63%)
05:11:39:WU01:FS01:0x17:Completed 3200000 out of 5000000 steps (64%)
05:17:01:WU01:FS01:0x17:Completed 3250000 out of 5000000 steps (65%)
05:22:24:WU01:FS01:0x17:Completed 3300000 out of 5000000 steps (66%)
05:27:46:WU01:FS01:0x17:Completed 3350000 out of 5000000 steps (67%)
05:33:08:WU01:FS01:0x17:Completed 3400000 out of 5000000 steps (68%)
05:38:30:WU01:FS01:0x17:Completed 3450000 out of 5000000 steps (69%)
05:43:52:WU01:FS01:0x17:Completed 3500000 out of 5000000 steps (70%)
05:49:14:WU01:FS01:0x17:Completed 3550000 out of 5000000 steps (71%)
05:54:37:WU01:FS01:0x17:Completed 3600000 out of 5000000 steps (72%)
05:59:59:WU01:FS01:0x17:Completed 3650000 out of 5000000 steps (73%)
******************************* Date: 2014-04-18 *******************************
06:05:21:WU01:FS01:0x17:Completed 3700000 out of 5000000 steps (74%)
06:10:43:WU01:FS01:0x17:Completed 3750000 out of 5000000 steps (75%)
06:16:05:WU01:FS01:0x17:Completed 3800000 out of 5000000 steps (76%)
06:21:27:WU01:FS01:0x17:Completed 3850000 out of 5000000 steps (77%)
06:26:50:WU01:FS01:0x17:Completed 3900000 out of 5000000 steps (78%)
06:32:12:WU01:FS01:0x17:Completed 3950000 out of 5000000 steps (79%)
06:37:34:WU01:FS01:0x17:Completed 4000000 out of 5000000 steps (80%)
06:42:56:WU01:FS01:0x17:Completed 4050000 out of 5000000 steps (81%)
06:46:07:WU01:FS01:FahCore returned: INTERRUPTED (102 = 0x66)
06:46:08:WU01:FS01:Starting
06:46:08:WU01:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/www.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17 -dir 01 -suffix 01 -version 703 -lifeline 1300 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
06:46:08:WU01:FS01:Started FahCore on PID 8095
06:46:08:WU01:FS01:Core PID:8099
06:46:08:WU01:FS01:FahCore 0x17 started
06:46:08:WU01:FS01:0x17:*********************** Log Started 2014-04-18T06:46:08Z ***********************
06:46:08:WU01:FS01:0x17:Project: 9408 (Run 628, Clone 0, Gen 0)
06:46:08:WU01:FS01:0x17:Unit: 0x000000000a3b1e5c5342df4fb1e621a8
06:46:08:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
06:46:08:WU01:FS01:0x17:Machine: 1
06:46:08:WU01:FS01:0x17:Digital signatures verified
06:46:08:WU01:FS01:0x17:  Found a checkpoint file
06:47:34:WU01:FS01:0x17:Completed 4050000 out of 5000000 steps (81%)
06:53:14:WU01:FS01:0x17:Completed 4100000 out of 5000000 steps (82%)
06:58:50:WU01:FS01:0x17:Completed 4150000 out of 5000000 steps (83%)
07:31:26:WU01:FS01:0x17:Completed 4200000 out of 5000000 steps (84%)
07:31:31:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
08:09:38:WU01:FS01:0x17:Completed 4250000 out of 5000000 steps (85%)
08:09:44:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
08:20:58:WU01:FS01:0x17:Completed 4300000 out of 5000000 steps (86%)
08:26:33:WU01:FS01:0x17:Completed 4350000 out of 5000000 steps (87%)
08:32:09:WU01:FS01:0x17:Completed 4400000 out of 5000000 steps (88%)
08:37:45:WU01:FS01:0x17:Completed 4450000 out of 5000000 steps (89%)
08:43:21:WU01:FS01:0x17:Completed 4500000 out of 5000000 steps (90%)
08:48:57:WU01:FS01:0x17:Completed 4550000 out of 5000000 steps (91%)
08:54:33:WU01:FS01:0x17:Completed 4600000 out of 5000000 steps (92%)
09:00:08:WU01:FS01:0x17:Completed 4650000 out of 5000000 steps (93%)
09:05:44:WU01:FS01:0x17:Completed 4700000 out of 5000000 steps (94%)
09:38:15:WU01:FS01:0x17:Completed 4750000 out of 5000000 steps (95%)
09:38:21:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint
09:38:21:WU01:FS01:0x17:Max number of retries reached. Aborting.
09:38:21:WU01:FS01:0x17:ERROR:exception: Max Retries Reached
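The "long frame before each Bad State" pattern can be checked mechanically rather than by eye. Below is a minimal sketch, assuming the log keeps the `HH:MM:SS:WUxx:FSxx:0x17:Completed N out of M steps` line format shown above; the function name `frame_times` and the sample list are made up for illustration and are not part of the client. It measures the time between consecutive progress lines and marks any frame that is immediately followed by a "Bad State detected" line.

```python
import re

# Matches progress lines such as
# "01:42:50:WU01:FS01:0x17:Completed 1500000 out of 5000000 steps (30%)"
PROGRESS = re.compile(
    r"^(\d\d):(\d\d):(\d\d):WU\d+:FS\d+:0x[0-9a-fA-F]+:Completed (\d+) out of \d+ steps"
)

def frame_times(lines):
    """Return (steps, seconds_since_previous_progress_line, followed_by_bad_state)."""
    frames = []
    prev_time = prev_steps = None
    for line in lines:
        m = PROGRESS.match(line)
        if m:
            h, mi, s, steps = (int(g) for g in m.groups())
            now = h * 3600 + mi * 60 + s
            # Skip the repeated progress line printed when the core restarts
            # from a checkpoint (same step count as the previous line).
            if prev_time is not None and steps != prev_steps:
                elapsed = (now - prev_time) % 86400  # tolerate midnight rollover
                frames.append([steps, elapsed, False])
            prev_time, prev_steps = now, steps
        elif "Bad State detected" in line and frames:
            frames[-1][2] = True  # flag the frame that just finished
    return [tuple(f) for f in frames]

# A few lines copied from the log above.
SAMPLE = [
    "01:37:28:WU01:FS01:0x17:Completed 1450000 out of 5000000 steps (29%)",
    "01:42:50:WU01:FS01:0x17:Completed 1500000 out of 5000000 steps (30%)",
    "01:50:12:WU01:FS01:0x17:Completed 1550000 out of 5000000 steps (31%)",
    "01:50:18:WU01:FS01:0x17:Bad State detected... attempting to resume from last good checkpoint",
    "02:00:58:WU01:FS01:0x17:Completed 1600000 out of 5000000 steps (32%)",
    "02:06:20:WU01:FS01:0x17:Completed 1650000 out of 5000000 steps (33%)",
]

for steps, secs, bad in frame_times(SAMPLE):
    flag = "  <- Bad State" if bad else ""
    print(f"{steps:>8}  {secs // 60}:{secs % 60:02d}{flag}")
```

On these sample lines the normal frame is 5:22, the frame before the Bad State is 7:22, and the frame after it is longer still, presumably because it includes work redone from the checkpoint.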
Code:
09:39:40:WU00:FS01:0x17:Project: 9406 (Run 184, Clone 0, Gen 4)
09:39:40:WU00:FS01:0x17:Unit: 0x000000040a3b1e5c533deab006a95130
09:39:40:WU00:FS01:0x17:CPU: 0x00000000000000000000000000000000
09:39:40:WU00:FS01:0x17:Machine: 1
09:39:40:WU00:FS01:0x17:Reading tar file state.xml
09:39:41:WU00:FS01:0x17:Reading tar file system.xml
09:39:41:WU00:FS01:0x17:Reading tar file integrator.xml
09:39:41:WU00:FS01:0x17:Reading tar file core.xml
09:39:41:WU00:FS01:0x17:Digital signatures verified
09:43:01:WU00:FS01:0x17:ERROR:exception: Potential energy error of 10.2812, threshold of 10
09:43:01:WU00:FS01:0x17:ERROR:Reference Potential Energy: -1.08118e+06 | Given Potential Energy: -1.08119e+06
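To put the rejection in perspective, here is a small sketch that pulls the numbers out of those two error lines. It assumes the reported error is an absolute difference in the same units as the two energies, which the log does not confirm; `check_energy` is a hypothetical helper name. The energies are printed to only six significant figures, so the printed values can only reproduce the error to within rounding.

```python
import re

ERROR_RE = re.compile(r"Potential energy error of ([\d.]+), threshold of ([\d.]+)")
ENERGY_RE = re.compile(
    r"Reference Potential Energy: (\S+) \| Given Potential Energy: (\S+)"
)

def check_energy(lines):
    """Extract error, threshold and both energies from the two ERROR lines."""
    out = {}
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            out["error"], out["threshold"] = float(m.group(1)), float(m.group(2))
        m = ENERGY_RE.search(line)
        if m:
            out["reference"], out["given"] = float(m.group(1)), float(m.group(2))
    # Difference of the values as printed (each rounded to 6 significant figures).
    out["printed_diff"] = abs(out["reference"] - out["given"])
    # Assumption: treat the reported error as an absolute energy difference.
    out["relative"] = out["error"] / abs(out["reference"])
    return out

# The two lines copied from the log above.
LINES = [
    "09:43:01:WU00:FS01:0x17:ERROR:exception: Potential energy error of 10.2812, threshold of 10",
    "09:43:01:WU00:FS01:0x17:ERROR:Reference Potential Energy: -1.08118e+06 | Given Potential Energy: -1.08119e+06",
]

info = check_energy(LINES)
print(f"error {info['error']} vs threshold {info['threshold']}")
print(f"difference of printed energies: {info['printed_diff']}")
print(f"relative size: {info['relative']:.2e}")
```

Under that assumption the mismatch is on the order of one part in 100,000 of the total energy, only just over the threshold.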
 
 System info:
Code:
11:54:09:************************* Folding@home Client *************************
11:54:09:    Website: http://folding.stanford.edu/
11:54:09:  Copyright: (c) 2009-2013 Stanford University
11:54:09:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
11:54:09:       Args: --child --lifeline 1087 /etc/fahclient/config.xml --run-as
11:54:09:             fahclient --pid-file=/var/run/fahclient.pid --daemon
11:54:09:     Config: /etc/fahclient/config.xml
11:54:09:******************************** Build ********************************
11:54:09:    Version: 7.3.6
11:54:09:       Date: Feb 18 2013
11:54:09:       Time: 07:24:08
11:54:09:    SVN Rev: 3923
11:54:09:     Branch: fah/trunk/client
11:54:09:   Compiler: GNU 4.4.7
11:54:09:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
11:54:09:             -fno-unsafe-math-optimizations -msse2
11:54:09:   Platform: linux2 3.2.0-1-amd64
11:54:09:       Bits: 64
11:54:09:       Mode: Release
11:54:09:******************************* System ********************************
11:54:09:        CPU: Intel(R) Core(TM) i5-4440 CPU @ 3.10GHz
11:54:09:     CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
11:54:09:       CPUs: 4
11:54:09:     Memory: 3.82GiB
11:54:09:Free Memory: 3.59GiB
11:54:09:    Threads: POSIX_THREADS
11:54:09:Has Battery: false
11:54:09: On Battery: false
11:54:09: UTC offset: 1
11:54:09:        PID: 1300
11:54:09:        CWD: /var/lib/fahclient
11:54:09:         OS: Linux 3.11.0-12-generic x86_64
11:54:09:    OS Arch: AMD64
11:54:09:       GPUs: 1
11:54:09:      GPU 0: NVIDIA:3 GK110 [GeForce GTX 780 Ti]
11:54:09:       CUDA: 3.5
11:54:09:CUDA Driver: 5050
11:54:09:***********************************************************************
Config:
Code:
12:30:03:<config>
12:30:03:  <!-- Client Control -->
12:30:03:  <fold-anon v='true'/>
12:30:03:
12:30:03:  <!-- Folding Slot Configuration -->
12:30:03:  <power v='full'/>
12:30:03:
12:30:03:  <!-- HTTP Server -->
12:30:03:  <allow v='127.0.0.1 192.168.1.0/24'/>
12:30:03:
12:30:03:  <!-- Network -->
12:30:03:  <proxy v=':8080'/>
12:30:03:
12:30:03:  <!-- Remote Command Server -->
12:30:03:  <command-allow-no-pass v='127.0.0.1 192.168.1.0/24'/>
12:30:03:
12:30:03:  <!-- User Information -->
12:30:03:  <passkey v='********************************'/>
12:30:03:  <user v='[removed]'/>
12:30:03:
12:30:03:  <!-- Folding Slots -->
12:30:03:  <slot id='0' type='CPU'>
12:30:03:    <client-type v='advanced'/>
12:30:03:    <cpus v='3'/>
12:30:03:    <next-unit-percentage v='100'/>
12:30:03:    <pause-on-start v='yes'/>
12:30:03:  </slot>
12:30:03:  <slot id='1' type='GPU'>
12:30:03:    <client-type v='advanced'/>
12:30:03:    <next-unit-percentage v='100'/>
12:30:03:    <pause-on-start v='yes'/>
12:30:03:  </slot>
12:30:03:</config>
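For anyone wanting to see at a glance which slots carry the advanced flag, here is a minimal sketch that reads a config dump in the timestamped form shown above. It assumes every line carries an `HH:MM:SS:` prefix that can simply be stripped; `slots_with_client_type` is a hypothetical helper, not a client feature, and the sample is a trimmed copy of the config above.

```python
import re
import xml.etree.ElementTree as ET

def slots_with_client_type(logged_config):
    """Map slot id -> (slot type, client-type value or None)."""
    # Drop the "HH:MM:SS:" log prefix from each line to recover plain XML.
    xml_text = "\n".join(
        re.sub(r"^\d\d:\d\d:\d\d:", "", line) for line in logged_config.splitlines()
    ).strip()
    root = ET.fromstring(xml_text)
    result = {}
    for slot in root.iter("slot"):
        client_type = slot.find("client-type")
        value = client_type.get("v") if client_type is not None else None
        result[slot.get("id")] = (slot.get("type"), value)
    return result

# Trimmed copy of the logged config above (slots only).
SAMPLE_CONFIG = """\
12:30:03:<config>
12:30:03:  <!-- Folding Slots -->
12:30:03:  <slot id='0' type='CPU'>
12:30:03:    <client-type v='advanced'/>
12:30:03:    <cpus v='3'/>
12:30:03:  </slot>
12:30:03:  <slot id='1' type='GPU'>
12:30:03:    <client-type v='advanced'/>
12:30:03:  </slot>
12:30:03:</config>
"""

for slot_id, (slot_type, client_type) in slots_with_client_type(SAMPLE_CONFIG).items():
    print(f"slot {slot_id} ({slot_type}): client-type = {client_type}")
```

In the dump above both the CPU and the GPU slot still show `client-type` set to `advanced`.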