P9625-9643 and bad states- some observations
Posted: Sun Oct 25, 2015 1:46 pm
by billford
I haven't had enough of these for this to be conclusive, but it has happened several times and I haven't seen it with other projects. I put it forward as perhaps suggesting something that a developer may recognise. Or not, as the case may be.
It relates to my GTX 980s with Nvidia 355.11 drivers (Linux), factory overclocked by 90MHz with a further 125MHz overclock added by me, and set to "Prefer Maximum Performance". (This last seemed to improve matters a little.)
First, only WUs that run the GPU close to maximum power (~190W against the 196W limit) seem to cause issues. Those (most) that run at a more usual 160-180W seem OK.
If a WU is going to have issues, the first bad state usually occurs at around 35-40 frames; it backs up to the last checkpoint and continues, throws another bad state after another 35-40 frames, and backs up again. At that point it will either complete the WU (occasionally with a third bad state during the sanity check after 100%) or throw the third, fatal, bad state at around 95%.
In other words, it doesn't seem to be random. It's as though there's a small but cumulative error somewhere which eventually forces the run "out of spec", and which is reset when a checkpoint file is loaded. (That could imply it's not an inherent part of the WU data but related to the fahcore or drivers.) Whether it reaches three bad states or not depends on how fast the error accumulates. (Pure speculation on my part.)
This could be supported by the observation that, if one of these WUs does fail due to too many bad states, the next WU (whatever project it's from) is quite likely to throw a bad state within seconds of starting, back up to the previous, non-existent, checkpoint (i.e. start again) and then complete without further issue.
Note- I have no intention of resetting the clock to factory values; with the current spread of projects I'm getting, the loss of overall PPD (hence science done) that this would entail is about three times the loss due to binning the occasional WU.
Re: P9625-9643 and bad states- some observations
Posted: Mon Oct 26, 2015 4:16 am
by Grandpa_01
I can tell you how to fix this but you probably are not going to be real happy with the solution.
This is not a Core21 problem; it is an Nvidia problem that affects the upper-end Maxwells, mainly GTX 980s and some of the 970s. It may also affect the 980 Ti, but I do not know that, since I do not have a 980 Ti to test. I have a feeling Nvidia knows about it, since they took steps to reduce the problem: when you run compute software on the Maxwells the card defaults to the P2 state, which has the same core clock but a reduced memory speed, lowered from 7000MHz to 6000MHz. No other generation of Nvidia GPUs does this that I know of.
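(You can confirm the P-state and the real clocks on your own card while a WU is running; these are standard nvidia-smi query fields:)
Code: Select all
nvidia-smi --query-gpu=pstate,clocks.sm,clocks.mem --format=csv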
Anyway, I started having a lot of the bad states on my GPUs, both in Windows and Linux, with this series of WUs and some others, and I started testing things out with the help of a few others from the forum. One of the things I did was lower the memory speed in the P2 state a little using Nvidia Inspector. The first adjustment was -300MHz and the errors stopped completely, and I was able to achieve a higher stable OC; I ended up between 5750MHz and 5755MHz on all three cards.
The part you are not going to like is that there is currently no way to lower the P2 state memory speed in Linux. I have put in a request for help over at the Nvidia Developers forum, so hopefully they will enable that ability in the X server soon, but as of yet nobody has replied to the post (it is Sunday). Until then I would recommend putting the 980 in a Windows box, which I know is slower, but in the long run it will pay off; I have moved three of mine from Linux to Windows. Or you can yell at Nvidia for selling us a GPU with what I think is either faulty memory or a faulty memory controller, most likely the latter.
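(For anyone who would rather script the Windows-side change than click through the Inspector GUI: Nvidia Inspector also has a command-line mode with a -setMemoryClockOffset switch. The exact argument format, in particular the P-state index, varies between versions, so treat the line below as a sketch and verify it against the tool's own help output first.)
Code: Select all
rem sketch only - the arguments are gpu,pstate,offset in the versions I have seen; verify first
nvidiaInspector.exe -setMemoryClockOffset:0,2,-300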
https://devtalk.nvidia.com/default/topi ... -software/
Re: P9625-9643 and bad states- some observations
Posted: Mon Oct 26, 2015 7:30 am
by billford
Thanks for that, very interesting.
Must admit I was under the impression that the memory clock could be adjusted just like the graphics clock, but closer examination (plus experiment) shows you are perfectly correct.
I'm reluctant to run the cards in a Windows box- apart from any other reason I haven't got any spare Windows licences!
(edit- I assume that expecting Nvidia Inspector to work under WINE would be a touch optimistic? Even if I plucked up the courage to try it...)
For the moment I'll leave it as it is; as I indicated, I'm only losing an occasional WU. If that changes then I'll have to re-think.
FWIW I've added a reply to your post on the NVidia forum, but from their response to the Maxwell bug some time ago (ie there wasn't any) I'm not going to hold my breath.
Further edit- if NVidia show no signs of fixing it, a workaround might be the suggestion elsewhere to increase the tolerance of the Core_21 code to bad state errors. Maybe in v0.0.13.
Re: P9625-9643 and bad states- some observations
Posted: Thu Oct 29, 2015 7:29 pm
by bigblock990
Grandpa_01 wrote:I can tell you how to fix this but you probably are not going to be real happy with the solution. [...] Or you can yell at Nvidia for selling us a GPU with what I think is either faulty memory or a faulty memory controller, most likely the latter.
You really might be on to something here. I have a Titan X, a 980 Ti, and two 970s. I have modded the 970 BIOS to run at a default 7010MHz for P2; the Titan X and 980 Ti both run at 6608MHz in P2. When it comes to bad state errors on Core21 units the Titan X is the worst offender, with the 980 Ti next, then the 970s. Does more memory on the Titan X mean the memory controller works harder? I will edit my BIOS down to 5750MHz for the 970s, and if that eliminates the errors I will make a modded BIOS for the other two.
Any idea why core21 causes problems, but core18 runs perfectly fine?
Re: P9625-9643 and bad states- some observations
Posted: Thu Oct 29, 2015 8:01 pm
by Grandpa_01
Just a wild guess, but I would say Core21 is more efficient and pushes the memory/memory controller a little harder than previous cores. That really should not be a problem if Nvidia's hardware were up to industry standards; no other generation of Nvidia graphics cards has a problem with Core21 WUs. In addition to my 980s and 970 I am running one GTX 770, two GTX 680s and a GTX 580, none of which have any problem with Core21 projects on Windows or Linux.
Re: P9625-9643 and bad states- some observations
Posted: Thu Oct 29, 2015 11:42 pm
by mattifolder
There seems to be a way to dynamically adjust the GPU memory clock under Linux for folding. It works for me with Nvidia driver 346.96. I think it's more a bug than a feature, but it is reproducible.
Here is a short description for my setup with an Nvidia GTX 970; all commands have to be executed in a terminal under the graphical desktop as root (or with sudo).
This is the initial situation:
Code: Select all
nvidia-settings -q [gpu:0]/GPUCurrentPerfLevel -q [gpu:0]/GPUCurrentClockFreqs
Attribute 'GPUCurrentPerfLevel' (mslinuxmint:0[gpu:0]): 2.
'GPUCurrentPerfLevel' is an integer attribute.
'GPUCurrentPerfLevel' is a read-only attribute.
'GPUCurrentPerfLevel' can use the following target types: X Screen, GPU.
Attribute 'GPUCurrentClockFreqs' (mslinuxmint:0[gpu:0]): 1514,3004.
'GPUCurrentClockFreqs' is a packed integer attribute.
'GPUCurrentClockFreqs' is a read-only attribute.
'GPUCurrentClockFreqs' can use the following target types: X Screen, GPU.
nvidia-smi --query-gpu=pstate --format=csv
pstate
P2
The next steps to prepare for changing the memory clock for folding are:
- get the current application clocks with nvidia-smi
Code: Select all
nvidia-smi --query-gpu=clocks.applications.graphics --format=csv
clocks.applications.graphics [MHz]
1113 MHz
nvidia-smi --query-gpu=clocks.applications.memory --format=csv
clocks.applications.memory [MHz]
3505 MHz
- get supported application clocks with nvidia-smi
Code: Select all
nvidia-smi -q -d SUPPORTED_CLOCKS
==============NVSMI LOG==============
Timestamp : Thu Oct 29 23:12:38 2015
Driver Version : 346.96
Attached GPUs : 1
GPU 0000:01:00.0
Supported Clocks
Memory : 3505 MHz
Graphics : 1641 MHz
Graphics : 1628 MHz
Graphics : 1616 MHz
Graphics : 1603 MHz
Graphics : 1590 MHz
Graphics : 1578 MHz
Graphics : 1565 MHz
Graphics : 1552 MHz
Graphics : 1540 MHz
Graphics : 1527 MHz
Graphics : 1514 MHz
Graphics : 1502 MHz
Graphics : 1489 MHz
Graphics : 1476 MHz
Graphics : 1464 MHz
Graphics : 1451 MHz
Graphics : 1438 MHz
Graphics : 1426 MHz
Graphics : 1413 MHz
Graphics : 1400 MHz
Graphics : 1388 MHz
Graphics : 1375 MHz
Graphics : 1362 MHz
Graphics : 1350 MHz
Graphics : 1337 MHz
Graphics : 1324 MHz
Graphics : 1312 MHz
Graphics : 1299 MHz
Graphics : 1286 MHz
Graphics : 1274 MHz
Graphics : 1261 MHz
Graphics : 1249 MHz
Graphics : 1236 MHz
Graphics : 1223 MHz
Graphics : 1211 MHz
Graphics : 1198 MHz
Graphics : 1185 MHz
Graphics : 1173 MHz
Graphics : 1160 MHz
Graphics : 1147 MHz
Graphics : 1135 MHz
Graphics : 1122 MHz
Graphics : 1109 MHz
Graphics : 1097 MHz
Graphics : 1085 MHz
Graphics : 1072 MHz
Graphics : 1071 MHz
Graphics : 1059 MHz
Graphics : 1046 MHz
Graphics : 1033 MHz
Graphics : 1021 MHz
Graphics : 1008 MHz
Graphics : 995 MHz
Graphics : 983 MHz
Graphics : 970 MHz
Graphics : 957 MHz
Graphics : 945 MHz
Graphics : 932 MHz
Graphics : 919 MHz
Graphics : 907 MHz
Graphics : 894 MHz
Graphics : 881 MHz
Graphics : 869 MHz
Graphics : 856 MHz
Graphics : 844 MHz
Graphics : 831 MHz
Graphics : 818 MHz
Graphics : 806 MHz
Graphics : 793 MHz
Graphics : 780 MHz
Graphics : 768 MHz
Graphics : 755 MHz
Graphics : 742 MHz
Graphics : 730 MHz
Graphics : 717 MHz
Graphics : 704 MHz
Graphics : 692 MHz
Graphics : 680 MHz
Graphics : 667 MHz
Graphics : 655 MHz
Graphics : 642 MHz
Graphics : 630 MHz
Graphics : 617 MHz
Graphics : 605 MHz
Graphics : 592 MHz
Graphics : 590 MHz
Graphics : 509 MHz
Graphics : 455 MHz
Graphics : 388 MHz
Graphics : 349 MHz
Graphics : 320 MHz
Memory : 3004 MHz
Graphics : 1641 MHz
Graphics : 1628 MHz
Graphics : 1616 MHz
Graphics : 1603 MHz
Graphics : 1590 MHz
Graphics : 1578 MHz
Graphics : 1565 MHz
Graphics : 1552 MHz
Graphics : 1540 MHz
Graphics : 1527 MHz
Graphics : 1514 MHz
Graphics : 1502 MHz
Graphics : 1489 MHz
Graphics : 1476 MHz
Graphics : 1464 MHz
Graphics : 1451 MHz
Graphics : 1438 MHz
Graphics : 1426 MHz
Graphics : 1413 MHz
Graphics : 1400 MHz
Graphics : 1388 MHz
Graphics : 1375 MHz
Graphics : 1362 MHz
Graphics : 1350 MHz
Graphics : 1337 MHz
Graphics : 1324 MHz
Graphics : 1312 MHz
Graphics : 1299 MHz
Graphics : 1286 MHz
Graphics : 1274 MHz
Graphics : 1261 MHz
Graphics : 1249 MHz
Graphics : 1236 MHz
Graphics : 1223 MHz
Graphics : 1211 MHz
Graphics : 1198 MHz
Graphics : 1185 MHz
Graphics : 1173 MHz
Graphics : 1160 MHz
Graphics : 1147 MHz
Graphics : 1135 MHz
Graphics : 1122 MHz
Graphics : 1109 MHz
Graphics : 1097 MHz
Graphics : 1085 MHz
Graphics : 1072 MHz
Graphics : 1071 MHz
Graphics : 1059 MHz
Graphics : 1046 MHz
Graphics : 1033 MHz
Graphics : 1021 MHz
Graphics : 1008 MHz
Graphics : 995 MHz
Graphics : 983 MHz
Graphics : 970 MHz
Graphics : 957 MHz
Graphics : 945 MHz
Graphics : 932 MHz
Graphics : 919 MHz
Graphics : 907 MHz
Graphics : 894 MHz
Graphics : 881 MHz
Graphics : 869 MHz
Graphics : 856 MHz
Graphics : 844 MHz
Graphics : 831 MHz
Graphics : 818 MHz
Graphics : 806 MHz
Graphics : 793 MHz
Graphics : 780 MHz
Graphics : 768 MHz
Graphics : 755 MHz
Graphics : 742 MHz
Graphics : 730 MHz
Graphics : 717 MHz
Graphics : 704 MHz
Graphics : 692 MHz
Graphics : 680 MHz
Graphics : 667 MHz
Graphics : 655 MHz
Graphics : 642 MHz
Graphics : 630 MHz
Graphics : 617 MHz
Graphics : 605 MHz
Graphics : 592 MHz
Graphics : 590 MHz
Graphics : 509 MHz
Graphics : 455 MHz
Graphics : 388 MHz
Graphics : 349 MHz
Graphics : 320 MHz
Memory : 810 MHz
Graphics : 1455 MHz
Graphics : 1442 MHz
Graphics : 1430 MHz
Graphics : 1417 MHz
Graphics : 1404 MHz
Graphics : 1392 MHz
Graphics : 1379 MHz
Graphics : 1366 MHz
Graphics : 1354 MHz
Graphics : 1341 MHz
Graphics : 1328 MHz
Graphics : 1316 MHz
Graphics : 1303 MHz
Graphics : 1290 MHz
Graphics : 1278 MHz
Graphics : 1265 MHz
Graphics : 1252 MHz
Graphics : 1240 MHz
Graphics : 1227 MHz
Graphics : 1215 MHz
Graphics : 1202 MHz
Graphics : 1189 MHz
Graphics : 1177 MHz
Graphics : 1164 MHz
Graphics : 1151 MHz
Graphics : 1139 MHz
Graphics : 1126 MHz
Graphics : 1113 MHz
Graphics : 1101 MHz
Graphics : 1088 MHz
Graphics : 1075 MHz
Graphics : 1063 MHz
Graphics : 1050 MHz
Graphics : 1037 MHz
Graphics : 1025 MHz
Graphics : 1012 MHz
Graphics : 999 MHz
Graphics : 987 MHz
Graphics : 974 MHz
Graphics : 961 MHz
Graphics : 949 MHz
Graphics : 936 MHz
Graphics : 923 MHz
Graphics : 911 MHz
Graphics : 899 MHz
Graphics : 886 MHz
Graphics : 885 MHz
Graphics : 873 MHz
Graphics : 860 MHz
Graphics : 847 MHz
Graphics : 835 MHz
Graphics : 822 MHz
Graphics : 810 MHz
Graphics : 797 MHz
Graphics : 784 MHz
Graphics : 772 MHz
Graphics : 759 MHz
Graphics : 746 MHz
Graphics : 734 MHz
Graphics : 721 MHz
Graphics : 708 MHz
Graphics : 696 MHz
Graphics : 683 MHz
Graphics : 670 MHz
Graphics : 658 MHz
Graphics : 645 MHz
Graphics : 632 MHz
Graphics : 620 MHz
Graphics : 607 MHz
Graphics : 594 MHz
Graphics : 582 MHz
Graphics : 569 MHz
Graphics : 556 MHz
Graphics : 544 MHz
Graphics : 531 MHz
Graphics : 519 MHz
Graphics : 506 MHz
Graphics : 494 MHz
Graphics : 481 MHz
Graphics : 469 MHz
Graphics : 456 MHz
Graphics : 444 MHz
Graphics : 431 MHz
Graphics : 419 MHz
Graphics : 406 MHz
Graphics : 405 MHz
Graphics : 324 MHz
Graphics : 270 MHz
Graphics : 202 MHz
Graphics : 162 MHz
Graphics : 135 MHz
Memory : 324 MHz
Graphics : 405 MHz
Graphics : 324 MHz
Graphics : 270 MHz
Graphics : 202 MHz
Graphics : 162 MHz
Graphics : 135 MHz
- select one combination from the supported application clocks near the current clocks (without GPU Boost)
(I selected Memory 3004 (from the P2 state) and Graphics 1109)
- change the application clocks (the exact values are secondary for our purposes); as a side effect this switches the card to the full graphics performance state
(nvidia-smi calls it P0, nvidia-settings calls it performance level 3)
Code: Select all
nvidia-smi -ac 3004,1109
Applications clocks set to "(MEM 3004, SM 1109)" for GPU 0000:01:00.0
Warning: persistence mode is disabled on this device. This settings will go back to default as soon as driver unloads (e.g. last application like nvidia-smi or cuda application terminates). Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.
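(The warning can be dealt with by enabling persistence mode first, so that the driver stays loaded and the applied clocks are not lost when the last CUDA application exits; this is a standard nvidia-smi option, run as root:)
Code: Select all
nvidia-smi -pm 1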
These are the current settings:
Code: Select all
nvidia-settings -q [gpu:0]/GPUCurrentPerfLevel -q [gpu:0]/GPUCurrentClockFreqs
Attribute 'GPUCurrentPerfLevel' (mslinuxmint:0[gpu:0]): 3.
'GPUCurrentPerfLevel' is an integer attribute.
'GPUCurrentPerfLevel' is a read-only attribute.
'GPUCurrentPerfLevel' can use the following target types: X Screen, GPU.
Attribute 'GPUCurrentClockFreqs' (mslinuxmint:0[gpu:0]): 1514,3505.
'GPUCurrentClockFreqs' is a packed integer attribute.
'GPUCurrentClockFreqs' is a read-only attribute.
'GPUCurrentClockFreqs' can use the following target types: X Screen, GPU.
nvidia-smi --query-gpu=pstate --format=csv
pstate
P0
The graphics clock is the same as before; the memory clock and performance state have changed to the full-performance values.
Because the performance state has now changed to full graphics (3 / P0), after these steps the memory clock can be changed in the well-known way with nvidia-settings.
Code: Select all
nvidia-settings --assign [gpu:0]/GPUMemoryTransferRateOffset[3]=...
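(As a concrete example, with a hypothetical offset of -500: the attribute is expressed in MHz of transfer rate, so this lowers the effective memory clock by 500MHz at performance level 3. Note that the offset attributes are only writable once overclocking has been unlocked with the "Coolbits" option, which nvidia-xconfig can set; X has to be restarted afterwards.)
Code: Select all
# one-time setup: unlock manual clock offsets (Coolbits value 8 enables overclocking), then restart X
sudo nvidia-xconfig --cool-bits=8
# hypothetical value: lower the effective memory rate by 500MHz at performance level 3
nvidia-settings --assign [gpu:0]/GPUMemoryTransferRateOffset[3]=-500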
Resetting the application clocks and performance level is done with "nvidia-smi -rac", after which all values are as before.
Code: Select all
nvidia-smi -rac
All done.
nvidia-settings -q [gpu:0]/GPUCurrentPerfLevel -q [gpu:0]/GPUCurrentClockFreqs
Attribute 'GPUCurrentPerfLevel' (mslinuxmint:0[gpu:0]): 2.
'GPUCurrentPerfLevel' is an integer attribute.
'GPUCurrentPerfLevel' is a read-only attribute.
'GPUCurrentPerfLevel' can use the following target types: X Screen, GPU.
Attribute 'GPUCurrentClockFreqs' (mslinuxmint:0[gpu:0]): 1514,3004.
'GPUCurrentClockFreqs' is a packed integer attribute.
'GPUCurrentClockFreqs' is a read-only attribute.
'GPUCurrentClockFreqs' can use the following target types: X Screen, GPU.
nvidia-smi --query-gpu=clocks.applications.graphics --format=csv
clocks.applications.graphics [MHz]
1113 MHz
nvidia-smi --query-gpu=clocks.applications.memory --format=csv
clocks.applications.memory [MHz]
3505 MHz
nvidia-smi --query-gpu=pstate --format=csv
pstate
P2
EDIT:
There is one marginal problem: because of the higher base VRAM clock of P3, the minimum VRAM clock offset at P3 (P0) for "nvidia-settings --assign [gpu:0]/GPUMemoryTransferRateOffset[3]=..." results in the same effective clock as the P2 base clock. So nothing is gained. I'll have a look at the other performance options of nvidia-settings and nvidia-smi; maybe there is another solution. The Windows tools can increase/decrease the limits, so why not under Linux?
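(The floor described above can be seen directly: querying the attribute instead of assigning it makes nvidia-settings print the valid range of offsets the driver will accept.)
Code: Select all
nvidia-settings -q [gpu:0]/GPUMemoryTransferRateOffset[3]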
Re: P9625-9643 and bad states- some observations
Posted: Sat Oct 31, 2015 4:12 am
by Grandpa_01
Yeah, we need Nvidia to add some more memory clock speed options to their drivers; it would be a simple fix if they would.
Re: P9625-9643 and bad states- some observations
Posted: Mon Nov 02, 2015 2:36 pm
by bigblock990
Changed my custom BIOS from 7010MHz to 5500MHz memory clocks for my 970s. The frequency of bad state errors and failed WUs remained the same for openmm_21 projects, and it also dropped PPD by ~50k on those projects. Core18 still works great with the same PPD.
So dropping memory clocks may help in Windows, but it doesn't in Linux. I'll be going back to 7010MHz.
Re: P9625-9643 and bad states- some observations
Posted: Mon Nov 02, 2015 5:01 pm
by toTOW
Did you change the clock for the right P-state? 7010 MHz (1752 MHz real) is the clock for P0, but most cards fold in the P2 state, whose default clock is 6000 MHz (1500 MHz real).
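(For reference, the "effective" figures quoted for GDDR5 are four times the real command clock, which is where these pairs of numbers come from:)
Code: Select all
# GDDR5 is quad-pumped: advertised effective rate = 4 x real clock
# P0: 7010 MHz / 4 ≈ 1752 MHz real
# P2: 6000 MHz / 4 = 1500 MHz real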
Re: P9625-9643 and bad states- some observations
Posted: Mon Nov 02, 2015 5:22 pm
by bigblock990
toTOW wrote:Did you change the clock for the right P-state? 7010 MHz (1752 MHz real) is the clock for P0, but most cards fold in the P2 state, whose default clock is 6000 MHz (1500 MHz real).
Yes, I created a modified BIOS so that P2 runs at 7010MHz, the same as P0. I then modified that BIOS so that P2 runs at 5500MHz to test Grandpa_01's theory in Linux.
Re: P9625-9643 and bad states- some observations
Posted: Mon Nov 02, 2015 7:17 pm
by toTOW
So maybe the Linux core is able to push something harder than the Windows one?
Or we're completely mistaken, and the root cause still has to be found ...
Re: P9625-9643 and bad states- some observations
Posted: Mon Nov 02, 2015 8:30 pm
by bigblock990
I'm not sure what the problem is. I don't have any issues with Core18, and Core21 9704/9712 work just fine.
I only have problems with Core21 9205/9206 and 9625-9643.
I have tried returning the cards to stock (factory OC), and also to reference stock clocks, and still have bad state errors and failed units. Now I have tried low memory clocks, which also didn't help. I would say the problem lies within the fahcore, except that Grandpa_01 reports his Kepler cards work fine and it's only Maxwell that has issues.
Re: P9625-9643 and bad states- some observations
Posted: Tue Nov 03, 2015 12:02 am
by jimerickson
9205 & 9206 are problematic on the Titan X; very few run properly. Between slowdowns, bad states, and hanging at 99.99%, few if any run to completion without intervention. I have to babysit my Titan Xs and reboot several times a day, and it's getting old. This is on Linux with the 355.11 driver.
Re: P9625-9643 and bad states- some observations
Posted: Tue Nov 03, 2015 8:50 pm
by bigblock990
Completed a 9205 today with ZERO bad state errors, the first time that's ever happened for me, on an EVGA 980 Classified KPE which I just got folding yesterday evening. The KPE GPUs come with Samsung memory, whereas pretty much all the others come with either Hynix or Elpida. I have two of them folding; I will keep an eye on them over the next couple of days to see whether other Core21 projects complete without issues.
Re: P9625-9643 and bad states- some observations
Posted: Wed Nov 04, 2015 7:57 am
by mattifolder
jimerickson wrote:This is on Linux with the 355.11 driver.
In my experience (GTX 970), Linux driver version 346.96 is the fastest and most stable version for folding (Linux Mint MATE 17.2). With reduced OC the 96[234]x projects also produce bad states, but not all of them are faulty.