
Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 4:42 pm
by billford
toTOW wrote:... but the boost algorithm is still there :(
I've just had a look around some of the options- can't find anything relating to the boost but I may not have been looking in the right places.

However, on the PowerMizer page of the app I referred to earlier it says

Adaptive Clocking: Enabled

If adaptive clocking means the same as boost then it implies that it can be disabled or there would be no point in the message... but the how is beyond my pay grade :ewink:

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 5:06 pm
by bruce
Grandpa_01 wrote:
bruce wrote:One of the posts in that topic suggests that NV should fix it but that it's probably not high on their list. That seems to still be the case.
I agree and they probably never will, so who will that leave the responsibility with?

Remember in many cases these are factory OC cards that work with all of the other WUs including Core21 except the new 9xxx on Linux. Windows does not appear to have the same problem.
Does "adaptive clocking" make those WUs unstable? I don't know, but I'd like to gather enough information to figure that out.

My responsibility is to find reports of problems where overclocking is not a factor. As I said, it's very difficult to know which reports to include since many people do not specify whether they overclock or not. It's FAH's responsibility to figure out what to do with those reports.

There seems to be a strong indication that Adaptive clocking + Overclocking contributes to stability problems. What's not clear is whether the reports can be classified into (A) Adaptive overclocking/not otherwise overclocked, (B) Overclocked without Adaptive overclocking, or (C) Both, so that we might learn something useful. Actually, we'd probably have to subdivide overclocked into Factory Overclocked vs. Personally overclocked.
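
To make the grouping concrete, here is a rough sketch of how reports could be sorted into those classes once people state their settings. The `Report` fields and the sample entries are made up purely for illustration; they are not real reports from this topic.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Report:
    # Hypothetical fields: what a poster would need to tell us.
    adaptive_clocking: bool
    overclock: Optional[str]  # None, "factory", or "personal"


def classify(report: Report) -> str:
    """Map a report to group A, B, or C, subdividing B and C by overclock type."""
    if report.overclock is None:
        # Not otherwise overclocked: group A if adaptive clocking is on.
        return "A" if report.adaptive_clocking else "neither"
    group = "C" if report.adaptive_clocking else "B"
    return f"{group}/{report.overclock}"


# Placeholder reports, invented for the example only.
reports = [
    Report(adaptive_clocking=True, overclock=None),
    Report(adaptive_clocking=False, overclock="factory"),
    Report(adaptive_clocking=True, overclock="personal"),
    Report(adaptive_clocking=True, overclock="factory"),
]
tally = Counter(classify(r) for r in reports)
print(tally)
```

A tally like this only becomes useful if every report says whether the card is factory or personally overclocked, which is exactly the information that is usually missing.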

There's no doubt that Core_21 pushes the 9xx series harder. NV is responsible for making group A work correctly, but somebody would need to give them some way to reproduce the problem.

Please help rather than trying to find somebody to blame.

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 5:26 pm
by Grandpa_01
Prefer Maximum Performance on the Xserver PowerMizer page is supposed to disable the voltage adjustment and hold the voltage at whatever you have it set to. It does work with the 980 Classifieds when they are switched to the 3rd BIOS setting, which is the LN2 BIOS. For testing, I have flashed that BIOS on one of the cards with one of kingpin's BIOSes, which removes all limitations. I can supposedly take it to 1.4v, which I am not going to test out; no LN2 here. :D Right now I am up to 1.212275 + .145000 = 1.227000 according to Nvidia settings.

Anyway, Adaptive is supposed to enable driver boosted voltage. I do not know if these settings work on all cards; they work on the GTX 980 Classifieds but they do not work on the 970 SC.

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 5:45 pm
by billford
Grandpa_01 wrote:Adaptive is supposed to enable driver boosted voltage.
Ah, right, thank you :)

A question, based on your comment:
Grandpa_01 wrote:Prefer Maximum Performance on the Xserver PowerMizer page is supposed to disable the voltage adjustment and hold the voltage at whatever you have it set to.
I've always left mine at Adaptive (or Auto on the later drivers) so as not to push the card too hard, but if the instabilities may be caused by (possibly) "out of step" changes to the voltage, might it be better to set them at Prefer Maximum Performance?

Or am I completely off-beam?

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 6:00 pm
by Grandpa_01
bruce wrote:
Grandpa_01 wrote:
bruce wrote:One of the posts in that topic suggests that NV should fix it but that it's probably not high on their list. That seems to still be the case.
I agree and they probably never will, so who will that leave the responsibility with?

Remember in many cases these are factory OC cards that work with all of the other WUs including Core21 except the new 9xxx on Linux. Windows does not appear to have the same problem.
Does "adaptive clocking" make those WUs unstable? I don't know, but I'd like to gather enough information to figure that out.

My responsibility is to find reports of problems where overclocking is not a factor. As I said, it's very difficult to know which reports to include since many people do not specify whether they overclock or not. It's FAH's responsibility to figure out what to do with those reports.

There seems to be a strong indication that Adaptive clocking + Overclocking contributes to stability problems. What's not clear is whether the reports can be classified into (A) Adaptive overclocking/not otherwise overclocked, (B) Overclocked without Adaptive overclocking, or (C) Both, so that we might learn something useful. Actually, we'd probably have to subdivide overclocked into Factory Overclocked vs. Personally overclocked.

There's no doubt that Core_21 pushes the 9xx series harder. NV is responsible for making group A work correctly, but somebody would need to give them some way to reproduce the problem.

Please help rather than trying to find somebody to blame.
bruce,

I think you misunderstand my intent; I agree with most of what you say. I believe it is our responsibility (yours, mine, and anybody's who cares about the work and the project) to try and figure it out. Whether it is an Nvidia Linux driver problem, a PG software problem, or whatever it may be, we need to figure it out if we can and then see what can be done from there, which is what you, I, and others are trying to do.

I do not think the ones that are having the problem should be in general circulation. I believe they should be in advanced or beta, or removed from being assigned to Linux; they are just creating problems in the general population and give the impression that PG does not care, which may or may not be true, but I would hope that is not the case.

Anyway, right now I am doing what I can to try and help figure out what is causing the problem. I do tend to believe it is an Nvidia Linux driver problem, not a PG problem, because the WU series in question work perfectly with a high OC in Windows but not in Linux, and that is not a PG software or code problem. But PG can help alleviate the problem by moving them out of general circulation or by removing them from being assigned to Linux until a fix is found. Removing them from being assigned to Linux would probably have the least effect because there are not that many donors running GPUs on Linux.

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 6:03 pm
by Grandpa_01
billford wrote:
Grandpa_01 wrote:Adaptive is supposed to enable driver boosted voltage.
Ah, right, thank you :)

A question, based on your comment:
Grandpa_01 wrote:Prefer Maximum Performance on the Xserver PowerMizer page is supposed to disable the voltage adjustment and hold the voltage at whatever you have it set to.
I've always left mine at Adaptive (or Auto on the later drivers) so as not to push the card too hard, but if the instabilities may be caused by (possibly) "out of step" changes to the voltage, might it be better to set them at Prefer Maximum Performance?

Or am I completely off-beam?
Yes, I believe so, if it works on your card. I believe it is supposed to hold the voltage at what you specify or at the default boost state, which on a GTX Classified is 1.212v, and that is where it holds it on mine.

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 6:09 pm
by billford
Grandpa_01 wrote: Yes, I believe so, if it works on your card. I believe it is supposed to hold the voltage at what you specify or at the default boost state, which on a GTX Classified is 1.212v, and that is where it holds it on mine.
OK, I'll try that setting and see how it goes, thanks.
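
For anyone who would rather script the same change than click through the GUI, a minimal sketch follows. It assumes the `nvidia-settings` tool is installed, an X session is running, and that the driver exposes the `GPUPowerMizerMode` attribute with 0 meaning Adaptive and 1 meaning Prefer Maximum Performance; check `nvidia-settings -q GPUPowerMizerMode` on your own card first, since (as noted above) these settings do not take effect on every card. The function names are my own.

```python
import shutil
import subprocess

# Assumed attribute values; verify against your driver version.
POWERMIZER_MODES = {"adaptive": 0, "prefer_maximum_performance": 1}


def powermizer_command(gpu_index, mode):
    """Build the nvidia-settings argument list for the requested mode."""
    value = POWERMIZER_MODES[mode]
    return ["nvidia-settings", "-a", f"[gpu:{gpu_index}]/GPUPowerMizerMode={value}"]


def apply_powermizer(gpu_index=0, mode="prefer_maximum_performance", dry_run=True):
    """Return the command; run it only if dry_run is False and the tool exists."""
    cmd = powermizer_command(gpu_index, mode)
    if dry_run or shutil.which("nvidia-settings") is None:
        return cmd
    subprocess.run(cmd, check=True)
    return cmd


print(" ".join(apply_powermizer()))
```

The dry run is the default so that nothing changes on the card until you have looked at the command it would issue.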

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 6:17 pm
by billford
Grandpa_01 wrote:But PG can help alleviate the problem by moving them out of general circulation or by removing them from being assigned to Linux until a fix is found. Removing them from being assigned to Linux would probably have the least effect because there are not that many donors running GPUs on Linux.
It's a pity that the bigadv flag is otherwise occupied- it could be useful (and appropriate) here to give it a meaning on Linux of, in effect, "adv plus Core_21". Then those who didn't want to take the risk could just use "adv".

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 6:23 pm
by Grandpa_01
billford wrote:
Grandpa_01 wrote:But PG can help alleviate the problem by moving them out of general circulation or by removing them from being assigned to Linux until a fix is found. Removing them from being assigned to Linux would probably have the least effect because there are not that many donors running GPUs on Linux.
It's a pity that the bigadv flag is otherwise occupied- it could be useful (and appropriate) here to give it a meaning on Linux of, in effect, "adv plus Core_21". Then those who didn't want to take the risk could just use "adv".
That is some creative thinking. I like that; I will give you an attaboy on that. :wink: I do believe there is a bigbeta flag available.

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 6:39 pm
by billford
Grandpa_01 wrote: That is some creative thinking. I like that; I will give you an attaboy on that. :wink: I do believe there is a bigbeta flag available.
Thanks :)

From a purely personal viewpoint I'd be quite happy to run with that flag to help out where I can, provided it didn't also mean getting full-blown beta WUs- I've done beta testing in the past but prefer a quieter life at my age :ewink:

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 10:37 pm
by toTOW
Don't be mistaken, I saw many bad states and some stuck WUs on Windows too. This is not specific to Linux.

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sat Oct 17, 2015 10:52 pm
by billford
I'll take your word for that, I don't run Windows.

But using the bigbeta flag as outlined earlier could still be a useful stop-gap until the problem is sorted out (or a workaround found).

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sun Oct 18, 2015 3:22 pm
by artoar_11
One observation from yesterday. I do not know whether it is of interest to someone, but will share it.

02:20:55:WU00:FS00:Starting
02:20:55:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/user/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 00 -suffix 01 -version 704 -lifeline 3820 -checkpoint 12 -gpu 0 -gpu-vendor nvidia
02:20:55:WU00:FS00:Started FahCore on PID 7084
02:20:55:WU00:FS00:Core PID:8752
02:20:55:WU00:FS00:FahCore 0x21 started
02:20:56:WU00:FS00:0x21:*********************** Log Started 2015-10-17T02:20:55Z ***********************
02:20:56:WU00:FS00:0x21:Project: 9712 (Run 8, Clone 10, Gen 75)
02:20:56:WU00:FS00:0x21:Unit: 0x0000012dab40416255b9a770b7e7800e
02:20:56:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
02:20:56:WU00:FS00:0x21:Machine: 0
02:20:56:WU00:FS00:0x21:Reading tar file core.xml
02:20:56:WU00:FS00:0x21:Reading tar file integrator.xml
02:20:56:WU00:FS00:0x21:Reading tar file system.xml
02:20:57:WU00:FS00:0x21:Reading tar file state.xml
02:20:58:WU00:FS00:0x21:Digital signatures verified
02:20:58:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
02:20:58:WU00:FS00:0x21:Version 0.0.12
02:21:47:WU00:FS00:0x21:Completed 0 out of 1280000 steps (0%)
02:21:47:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
02:23:46:WU00:FS00:0x21:Completed 12800 out of 1280000 steps (1%)
02:25:31:WU00:FS00:0x21:Completed 25600 out of 1280000 steps (2%)
02:27:16:WU00:FS00:0x21:Completed 38400 out of 1280000 steps (3%)
...............................
04:49:29:WU00:FS00:0x21:Completed 1049600 out of 1280000 steps (82%)
04:51:15:WU00:FS00:0x21:Completed 1062400 out of 1280000 steps (83%)
04:53:01:WU00:FS00:0x21:Completed 1075200 out of 1280000 steps (84%)
05:45:27:WU00:FS00:0x21:Completed 1088000 out of 1280000 steps (85%)
06:50:18:WU00:FS00:0x21:Completed 1100800 out of 1280000 steps (86%)
I want to comment on the last two percent (85 and 86). Over the weekend I use TeamViewer for remote access.
At 86% the computer was totally overloaded. The PC reacted to the mouse with a delay of 10-15 seconds, and it was very difficult to take a screenshot of the MSI AB window.
http://storage4.album.bg/a3e/whit_throt ... 893346.jpg
The last two TPFs were about 1h each. There was throttling with short peaks (↑↓). On the "Power" graph the maximum peak is 114%.
After "Pause" I waited for about 1 min, then chose Exit -> Quit, but there is no record of this in the log.
The question in my mind is: does the software for some reason enter a loop ("cycle condition") that increases the load on the GPU? Or does an overload (throttling) appear for some reason, after which the software fails and the TPF increases (chicken or the egg)?
I rebooted the PC.

*********************** Log Started 2015-10-17T08:03:03Z ***********************
08:03:03:************************* Folding@home Client *************************
08:03:03:      Website: http://folding.stanford.edu/
08:03:03:    Copyright: (c) 2009-2014 Stanford University
08:03:03:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:03:03:         Args: 
08:03:03:       Config: C:/Users/user/AppData/Roaming/FAHClient/config.xml
08:03:03:******************************** Build ********************************
08:03:03:      Version: 7.4.4
08:03:03:         Date: Mar 4 2014
08:03:03:         Time: 20:26:54
08:03:03:      SVN Rev: 4130
08:03:03:       Branch: fah/trunk/client
08:03:03:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
08:03:03:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
08:03:03:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
08:03:03:     Platform: win32 XP
08:03:03:         Bits: 32
08:03:03:         Mode: Release
08:03:03:******************************* System ********************************
08:03:03:          CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
08:03:03:       CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
08:03:03:         CPUs: 4
08:03:03:       Memory: 7.98GiB
08:03:03:  Free Memory: 6.66GiB
08:03:03:      Threads: WINDOWS_THREADS
08:03:03:   OS Version: 6.1
08:03:03:  Has Battery: false
08:03:03:   On Battery: false
08:03:03:   UTC Offset: 3
08:03:03:          PID: 3068
08:03:03:          CWD: C:/Users/user/AppData/Roaming/FAHClient
08:03:03:           OS: Windows 7 Professional
08:03:03:      OS Arch: AMD64
08:03:03:         GPUs: 1
08:03:03:        GPU 0: NVIDIA:5 GM204 [GeForce GTX 970]
08:03:03:         CUDA: 5.2
08:03:03:  CUDA Driver: 7050
08:03:03:Win32 Service: false
08:03:03:***********************************************************************
08:03:03:<config>
08:03:03:  <!-- Folding Core -->
08:03:03:  <checkpoint v='12'/>
08:03:03:
08:03:03:  <!-- Network -->
08:03:03:  <proxy v=':8080'/>
08:03:03:
08:03:03:  <!-- Slot Control -->
08:03:03:  <power v='full'/>
08:03:03:
08:03:03:  <!-- User Information -->
08:03:03:  <passkey v='********************************'/>
08:03:03:  <team v='224497'/>
08:03:03:  <user v='artoar_home'/>
08:03:03:
08:03:03:  <!-- Folding Slots -->
08:03:03:  <slot id='0' type='GPU'>
08:03:03:    <client-type v='advanced'/>
08:03:03:    <next-unit-percentage v='100'/>
08:03:03:  </slot>
08:03:03:  <slot id='1' type='CPU'>
08:03:03:    <client-type v='beta'/>
08:03:03:    <next-unit-percentage v='100'/>
08:03:03:    <paused v='true'/>
08:03:03:  </slot>
08:03:03:</config>
08:03:03:Trying to access database...
08:03:03:Successfully acquired database lock
08:03:03:Enabled folding slot 00: READY gpu:0:GM204 [GeForce GTX 970]
08:03:03:Enabled folding slot 01: PAUSED cpu:3 (by user)
08:03:16:WU00:FS00:Starting
08:03:16:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/user/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 00 -suffix 01 -version 704 -lifeline 3068 -checkpoint 12 -gpu 0 -gpu-vendor nvidia
08:03:16:WU00:FS00:Started FahCore on PID 3776
08:03:16:WU00:FS00:Core PID:3808
08:03:16:WU00:FS00:FahCore 0x21 started
08:03:17:WU00:FS00:0x21:*********************** Log Started 2015-10-17T08:03:17Z ***********************
08:03:17:WU00:FS00:0x21:Project: 9712 (Run 8, Clone 10, Gen 75)
08:03:17:WU00:FS00:0x21:Unit: 0x0000012dab40416255b9a770b7e7800e
08:03:17:WU00:FS00:0x21:CPU: 0x00000000000000000000000000000000
08:03:17:WU00:FS00:0x21:Machine: 0
08:03:17:WU00:FS00:0x21:Digital signatures verified
08:03:17:WU00:FS00:0x21:Folding@home GPU Core21 Folding@home Core
08:03:17:WU00:FS00:0x21:Version 0.0.12
08:03:17:WU00:FS00:0x21:  Found a checkpoint file
08:04:30:WU00:FS00:0x21:Completed 1040000 out of 1280000 steps (81%)
08:04:30:WU00:FS00:0x21:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
08:06:10:WU00:FS00:0x21:Completed 1049600 out of 1280000 steps (82%)
08:08:01:WU00:FS00:0x21:Completed 1062400 out of 1280000 steps (83%)
08:09:50:WU00:FS00:0x21:Completed 1075200 out of 1280000 steps (84%)
08:11:35:WU00:FS00:0x21:Completed 1088000 out of 1280000 steps (85%)
08:13:21:WU00:FS00:0x21:Completed 1100800 out of 1280000 steps (86%)
08:15:07:WU00:FS00:0x21:Completed 1113600 out of 1280000 steps (87%)
08:17:10:WU00:FS00:0x21:Completed 1126400 out of 1280000 steps (88%)
08:18:55:WU00:FS00:0x21:Completed 1139200 out of 1280000 steps (89%)
08:20:41:WU00:FS00:0x21:Completed 1152000 out of 1280000 steps (90%)
08:22:26:WU00:FS00:0x21:Completed 1164800 out of 1280000 steps (91%)
08:24:12:WU00:FS00:0x21:Completed 1177600 out of 1280000 steps (92%)
08:25:58:WU00:FS00:0x21:Completed 1190400 out of 1280000 steps (93%)
08:28:00:WU00:FS00:0x21:Completed 1203200 out of 1280000 steps (94%)
08:29:45:WU00:FS00:0x21:Completed 1216000 out of 1280000 steps (95%)
08:31:31:WU00:FS00:0x21:Completed 1228800 out of 1280000 steps (96%)
08:33:16:WU00:FS00:0x21:Completed 1241600 out of 1280000 steps (97%)
08:35:01:WU00:FS00:0x21:Completed 1254400 out of 1280000 steps (98%)
08:36:47:WU00:FS00:0x21:Completed 1267200 out of 1280000 steps (99%)
08:38:32:WU00:FS00:0x21:Completed 1280000 out of 1280000 steps (100%)
08:38:34:WU01:FS00:Connecting to 171.67.108.45:80
08:38:35:WU01:FS00:Assigned to work server 171.64.65.84
08:38:35:WU01:FS00:Requesting new work unit for slot 00: RUNNING gpu:0:GM204 [GeForce GTX 970] from 171.64.65.84
08:38:35:WU01:FS00:Connecting to 171.64.65.84:8080
08:38:36:WU01:FS00:Downloading 3.15MiB
08:38:40:WU01:FS00:Download complete
08:38:40:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9120 run:28 clone:1 gen:78 core:0x18 unit:0x000000650a3b1e78553ea16b17757df3
08:38:49:WU00:FS00:0x21:Saving result file logfile_01.txt
08:38:49:WU00:FS00:0x21:Saving result file checkpointState.xml
08:38:50:WU00:FS00:0x21:Saving result file checkpt.crc
08:38:50:WU00:FS00:0x21:Saving result file log.txt
08:38:50:WU00:FS00:0x21:Saving result file positions.xtc
08:38:51:WU00:FS00:0x21:Folding@home Core Shutdown: FINISHED_UNIT
08:38:52:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
08:38:52:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:9712 run:8 clone:10 gen:75 core:0x21 unit:0x0000012dab40416255b9a770b7e7800e
08:38:52:WU00:FS00:Uploading 9.59MiB to 171.64.65.98
08:38:52:WU00:FS00:Connecting to 171.64.65.98:8080

08:41:23:WU00:FS00:Upload 96.48%
08:41:29:WU00:FS00:Upload 100.00%
08:41:44:WU00:FS00:Upload complete
08:41:44:WU00:FS00:Server responded WORK_ACK (400)
08:41:44:WU00:FS00:Final credit estimate, 23701.00 points
08:41:44:WU00:FS00:Cleaning up
MSI AB after rebooting (the same WU).
http://storage4.album.bg/f76/w-o_thrott ... 893347.jpg
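
For anyone who wants to spot this kind of slowdown without watching the log, here is a rough sketch that computes the time per frame (TPF) from the "Completed ... steps" lines and flags frames that took far longer than usual. The function names and the 5x-median threshold are my own assumptions, not anything the client does. The sample lines are copied from the first log above; a frame is only counted when the percentage goes up by exactly one, so the part of the log that was cut out is skipped.

```python
import re
import statistics

# Matches progress lines such as:
# 04:49:29:WU00:FS00:0x21:Completed 1049600 out of 1280000 steps (82%)
PROGRESS = re.compile(
    r"^(\d\d):(\d\d):(\d\d):WU\d+:FS\d+:0x21:Completed \d+ out of \d+ steps \((\d+)%\)"
)


def frame_times(lines):
    """Return [(percent, seconds taken)] for each consecutive 1% frame."""
    frames = []
    prev_time = prev_pct = None
    for line in lines:
        m = PROGRESS.match(line)
        if m is None:
            continue
        h, mi, s, pct = (int(g) for g in m.groups())
        now = h * 3600 + mi * 60 + s
        if prev_pct is not None and pct == prev_pct + 1:
            # Modulo handles a frame that crosses midnight.
            frames.append((pct, (now - prev_time) % 86400))
        prev_time, prev_pct = now, pct
    return frames


def stalled(frames, factor=5.0):
    """Return frames slower than factor times the median frame time."""
    typical = statistics.median(d for _, d in frames)
    return [(p, d) for p, d in frames if d > factor * typical]


log = """\
02:21:47:WU00:FS00:0x21:Completed 0 out of 1280000 steps (0%)
02:23:46:WU00:FS00:0x21:Completed 12800 out of 1280000 steps (1%)
02:25:31:WU00:FS00:0x21:Completed 25600 out of 1280000 steps (2%)
02:27:16:WU00:FS00:0x21:Completed 38400 out of 1280000 steps (3%)
04:49:29:WU00:FS00:0x21:Completed 1049600 out of 1280000 steps (82%)
04:51:15:WU00:FS00:0x21:Completed 1062400 out of 1280000 steps (83%)
04:53:01:WU00:FS00:0x21:Completed 1075200 out of 1280000 steps (84%)
05:45:27:WU00:FS00:0x21:Completed 1088000 out of 1280000 steps (85%)
06:50:18:WU00:FS00:0x21:Completed 1100800 out of 1280000 steps (86%)
""".splitlines()

frames = frame_times(log)
print(stalled(frames))  # -> [(85, 3146), (86, 3891)]
```

On these lines the normal frames take about 105 to 119 seconds, while 85% and 86% took 3146 and 3891 seconds, which matches the "about 1h each" seen before the reboot.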

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Sun Oct 18, 2015 4:54 pm
by toTOW
This is the "stuck" bug ... if you let it keep processing, it will generate a bad state on the next sanity check.

I heard that the cause of this behaviour should be fixed in core 21 v0.0.13.

Re: 9634 (Run 0, Clone 9, Gen 5)

Posted: Tue Oct 20, 2015 1:03 am
by bruce
A script has been posted in the 3rd party forum that allows overclocking based on a constant clock rate:
Script to disable NV Boost