P9625-9643 and bad states- some observations
- 
				billford
- Posts: 1003
- Joined: Thu May 02, 2013 8:46 pm
- Hardware configuration: Full Time:
 2x NVidia GTX 980
 1x NVidia GTX 780 Ti
 2x 3GHz Core i5 PC (Linux)
 Retired:
 3.2GHz Core i5 PC (Linux)
 3.2GHz Core i5 iMac
 2.8GHz Core i5 iMac
 2.16GHz Core 2 Duo iMac
 2GHz Core 2 Duo MacBook
 1.6GHz Core 2 Duo Acer laptop
- Location: Near Oxford, United Kingdom
P9625-9643 and bad states- some observations
I haven't had enough of these for this to be conclusive, but it has happened several times and I haven't seen it with other projects. I put it forward as perhaps suggesting something that a developer may recognise. Or not, as the case may be.
It relates to my GTX 980s: NVidia 355.11 drivers (Linux), factory overclocked by 90MHz with another 125MHz overclock added by me, set to "Prefer Maximum Performance". (This last seemed to improve matters a little.)
First, only WUs that run the GPU close to maximum power (~190W against a maximum of 196W) seem to cause issues. Those (most) that run at a more usual 160-180W seem OK.
If the WU is going to have issues, the first bad state usually occurs at around 35-40 frames. It backs up to the last checkpoint and continues, throws another bad state after another 35-40 frames, and backs up again. At this point it will either complete the WU (occasionally with a third bad state during the sanity check after 100%) or throw the third, fatal, bad state at around 95%.
In other words, it doesn't seem to be random. It's as though there's a small but cumulative error somewhere which eventually forces it "out of spec", and which is reset when a checkpoint file is loaded. (That could imply that it's not an inherent part of the WU data but related to the fahcore or drivers.) Whether it reaches three bad states or not depends on how fast it accumulates. (Pure speculation on my part.)
This could be supported by the observation that, if one of these WUs does fail due to too many bad states, the next WU (whatever project it's from) is quite likely to throw a bad state within seconds of starting, back up to the previous non-existent checkpoint (i.e. start again) and then complete without further issue.
Note: I have no intention of resetting the clock to factory values; with the current spread of projects I'm getting, the loss of overall PPD (hence science done) that this would entail is about three times the loss due to binning the occasional WU.
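As a rough illustration of the "cumulative error that is reset by a checkpoint reload" idea in the post above, here is a toy Python model. It is only a sketch of the speculation, not of how the fahcore actually works: the function name `simulate_wu`, the checkpoint interval of 5 frames, and the assumption that the error builds up at a constant rate per frame are all made up for the example. The only figures taken from the post are the 100 frames per WU and the limit of three bad states.

```python
def simulate_wu(frames_to_bad_state, checkpoint_interval=5,
                total_frames=100, max_bad_states=3):
    """Toy model of the hypothesis: an error builds up frame by frame and
    is cleared whenever a checkpoint is reloaded.

    Returns (completed, bad_state_frames).
    checkpoint_interval is an assumed value, not taken from the core.
    """
    frame = 0                # current progress in frames
    frames_since_reload = 0  # frames computed since the last clean start
    bad_states = []          # frames at which a bad state was thrown
    while frame < total_frames:
        frame += 1
        frames_since_reload += 1
        if frames_since_reload >= frames_to_bad_state:
            bad_states.append(frame)
            if len(bad_states) >= max_bad_states:
                return False, bad_states  # third bad state is fatal
            # Back up to the last checkpoint; the built-up error is reset.
            frame = (frame // checkpoint_interval) * checkpoint_interval
            frames_since_reload = 0
    return True, bad_states


if __name__ == "__main__":
    print(simulate_wu(38))  # (True, [38, 73])
    print(simulate_wu(32))  # (False, [32, 62, 92])
```

With these made-up parameters, an error that takes 38 frames to trip a bad state lets the WU finish after two bad states, while one that takes 32 frames produces the third, fatal, bad state in the low 90s. That is at least consistent with the pattern described, though it proves nothing about the cause.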
- 
				Grandpa_01
- Posts: 1122
- Joined: Wed Mar 04, 2009 7:36 am
- Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5GHz, 96GB G.Skill DDR3 1333MHz Ubuntu 10.10
 2 - Asus P6X58D-E i7 980X 4.4GHz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
 1 - Asus Rampage Gene III i7 970 4.3GHz DDR3 2000 2-500GB Seagate 7200.11 0-Raid Ubuntu 10.10
 1 - Asus G73JH Laptop i7 740QM 1.86GHz ATI 5870M
Re: P9625-9643 and bad states- some observations
I can tell you how to fix this, but you are probably not going to be real happy with the solution.
This is not a core21 problem; it is an Nvidia problem that affects the upper-end Maxwells, mainly GTX 980s and some of the 970s. It may also affect 980 Tis, but I do not know that since I do not have a 980 Ti to test. I have a feeling Nvidia knows about it, since they took steps to reduce the problem: when you run compute software on the Maxwells, the card defaults to the P2 state, which has the same core clock but a reduced memory speed (lowered from 7000MHz to 6000MHz). No other generation of Nvidia GPUs does this that I know of.
Anyway, I started having a lot of bad states on my GPUs, both in Windows and Linux, with this series of WUs and some others, and I started testing things out with the help of a few others from here in the forum. One of the things I did was lower the memory speed in the P2 state a little using Nvidia Inspector. The first adjustment was 300MHz; the errors stopped completely and I was able to achieve a higher stable OC. I ended up between 5750MHz and 5755MHz on all 3 of the cards.
The part you are not going to like is that there is no way at this time to lower the P2 state memory speed in Linux. I have put in a request for help over at the Nvidia Developers forum, so hopefully they will enable that ability in X server soon. As of yet nobody has replied to the post, but it is Sunday. Until then I would recommend putting the 980 in a Windows box, which I know is slower, but in the long run it will pay off; I have moved 3 of mine from Linux to Windows. Or you can yell at Nvidia for selling us a GPU with what I am thinking is either faulty memory or a faulty memory controller, most likely the latter.
https://devtalk.nvidia.com/default/topi ... -software/
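To see whether a card really is sitting in P2 with the reduced memory clock while folding, something like the following Python sketch could be used. It only reads state through `nvidia-smi --query-gpu`; it does not change any clocks. The helper names `parse_pstate_line` and `read_gpu_states` are invented for this example, and the sample line in the comment shows the assumed shape of the CSV output rather than a captured log.

```python
import subprocess

# Read-only query: performance state and current memory clock per GPU.
QUERY = ["nvidia-smi", "--query-gpu=pstate,clocks.mem",
         "--format=csv,noheader"]


def parse_pstate_line(line):
    """Turn a CSV line such as 'P2, 3004 MHz' into ('P2', 3004)."""
    pstate, mem = [field.strip() for field in line.split(",")]
    return pstate, int(mem.split()[0])


def read_gpu_states():
    """Run nvidia-smi and return one (pstate, memory_mhz) tuple per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_pstate_line(line)
            for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    for index, (pstate, mem_mhz) in enumerate(read_gpu_states()):
        print(f"GPU {index}: {pstate}, memory clock {mem_mhz} MHz")
```

Running it once with the folding slot paused and once while a WU is being processed should show whether the memory clock drops when the compute load starts.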
2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
- 
				billford
Re: P9625-9643 and bad states- some observations
Thanks for that, very interesting.
Must admit I was under the impression that the memory clock could be adjusted just like the graphics clock, but closer examination (plus experiment) shows you are perfectly correct.
I'm reluctant to run the cards in a Windows box; apart from any other reason, I haven't got any spare Windows licences!
(edit: I assume that expecting Nvidia Inspector to work under WINE would be a touch optimistic? Even if I plucked up the courage to try it...)
For the moment I'll leave it as it is; as I indicated, I'm only losing an occasional WU. If that changes then I'll have to re-think.
FWIW I've added a reply to your post on the NVidia forum, but from their response to the Maxwell bug some time ago (i.e. there wasn't any) I'm not going to hold my breath.
Further edit: if NVidia show no signs of fixing it, a workaround might be the suggestion elsewhere to increase the tolerance of the Core_21 code to bad state errors. Maybe in v0.0.13.
- 
				bigblock990
- Posts: 20
- Joined: Wed Sep 09, 2015 12:42 pm
Re: P9625-9643 and bad states- some observations
Grandpa_01 wrote:I can tell you how to fix this, but you are probably not going to be real happy with the solution.
This is not a core21 problem; it is an Nvidia problem that affects the upper-end Maxwells, mainly GTX 980s and some of the 970s. It may also affect 980 Tis, but I do not know that since I do not have a 980 Ti to test. I have a feeling Nvidia knows about it, since they took steps to reduce the problem: when you run compute software on the Maxwells, the card defaults to the P2 state, which has the same core clock but a reduced memory speed (lowered from 7000MHz to 6000MHz). No other generation of Nvidia GPUs does this that I know of.
Anyway, I started having a lot of bad states on my GPUs, both in Windows and Linux, with this series of WUs and some others, and I started testing things out with the help of a few others from here in the forum. One of the things I did was lower the memory speed in the P2 state a little using Nvidia Inspector. The first adjustment was 300MHz; the errors stopped completely and I was able to achieve a higher stable OC. I ended up between 5750MHz and 5755MHz on all 3 of the cards.
The part you are not going to like is that there is no way at this time to lower the P2 state memory speed in Linux. I have put in a request for help over at the Nvidia Developers forum, so hopefully they will enable that ability in X server soon. As of yet nobody has replied to the post, but it is Sunday. Until then I would recommend putting the 980 in a Windows box, which I know is slower, but in the long run it will pay off; I have moved 3 of mine from Linux to Windows. Or you can yell at Nvidia for selling us a GPU with what I am thinking is either faulty memory or a faulty memory controller, most likely the latter.
https://devtalk.nvidia.com/default/topi ... -software/
You really might be on to something here. I have a Titan X, a 980 Ti, and two 970s. I have modded the 970 BIOS to run at the default 7010MHz in P2. The Titan X and 980 Ti both run at 6608MHz in P2. When it comes to bad state errors on core21 units, the Titan X is the worst offender, with the 980 Ti next, then the 970s. Does more memory on the Titan X mean the memory controller works harder? I will edit my BIOS down to 5750MHz for the 970s, and if that eliminates the errors I will make a modded BIOS for the other two.
Any idea why core21 causes problems, but core18 runs perfectly fine?
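A note on units for anyone comparing figures across posts: the memory speeds quoted so far (7000, 6000, 7010, 5750 MHz) appear to be effective transfer rates, while the nvidia-smi and nvidia-settings output later in this thread reports 3505 MHz and 3004 MHz for a GTX 970. The sketch below assumes the two scales differ by a factor of two, which is inferred from 3505 matching 7010 and 3004 roughly matching 6000; the function names are made up for the example.

```python
def effective_rate_mhz(reported_clock_mhz):
    """Effective transfer rate for a clock as reported by nvidia-smi.

    Assumes a factor of two between the two scales used in this thread.
    """
    return 2 * reported_clock_mhz


def reported_clock_mhz(effective_mhz):
    """Inverse conversion: the clock nvidia-smi would be expected to show."""
    return effective_mhz / 2


if __name__ == "__main__":
    print(effective_rate_mhz(3505))   # 7010
    print(effective_rate_mhz(3004))   # 6008
    print(reported_clock_mhz(5750))   # 2875.0
```

On that assumption, the 5750MHz target mentioned above would correspond to roughly 2875 MHz on the scale that the Linux tools print.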
- 
				Grandpa_01
Re: P9625-9643 and bad states- some observations
Just a wild guess, but I would say core 21 is more efficient and pushes the memory / memory controller a little harder than previous cores. But that really should not be a problem if Nvidia's hardware were up to industry standards. No other generation of Nvidia graphics cards has a problem with core 21 WUs. In addition to my 980s and 970 I am running 1 - GTX 770, 2 - GTX 680s and a GTX 580, all of which have no problem with core 21 projects on Windows or Linux.
- 
				mattifolder
Re: P9625-9643 and bad states- some observations
There seems to be a way to dynamically adjust the GPU memory clock under Linux for folding. It works for me with Nvidia driver 346.96. I think it's more a bug than a feature, but it is reproducible.
Short description for my setup with an Nvidia GTX 970; all commands have to be executed in a terminal on the graphical desktop as root (or with sudo).
This is the initial situation:
Code: Select all
nvidia-settings -q [gpu:0]/GPUCurrentPerfLevel -q [gpu:0]/GPUCurrentClockFreqs
  Attribute 'GPUCurrentPerfLevel' (mslinuxmint:0[gpu:0]): 2.
    'GPUCurrentPerfLevel' is an integer attribute.
    'GPUCurrentPerfLevel' is a read-only attribute.
    'GPUCurrentPerfLevel' can use the following target types: X Screen, GPU.
  Attribute 'GPUCurrentClockFreqs' (mslinuxmint:0[gpu:0]): 1514,3004.
    'GPUCurrentClockFreqs' is a packed integer attribute.
    'GPUCurrentClockFreqs' is a read-only attribute.
    'GPUCurrentClockFreqs' can use the following target types: X Screen, GPU.
nvidia-smi --query-gpu=pstate --format=csv
pstate
P2
The next steps to prepare for changing the memory clock for folding are:
- get the actual application clocks with nvidia-smi
Code: Select all
nvidia-smi --query-gpu=clocks.applications.graphics --format=csv
clocks.applications.graphics [MHz]
1113 MHz
nvidia-smi --query-gpu=clocks.applications.memory --format=csv
clocks.applications.memory [MHz]
3505 MHz
- get the supported application clocks with nvidia-smi
Code: Select all
nvidia-smi -q -d SUPPORTED_CLOCKS
==============NVSMI LOG==============
Timestamp                           : Thu Oct 29 23:12:38 2015
Driver Version                      : 346.96
Attached GPUs                       : 1
GPU 0000:01:00.0
    Supported Clocks
        Memory                      : 3505 MHz
            Graphics                : 1641 MHz
            Graphics                : 1628 MHz
            Graphics                : 1616 MHz
            Graphics                : 1603 MHz
            Graphics                : 1590 MHz
            Graphics                : 1578 MHz
            Graphics                : 1565 MHz
            Graphics                : 1552 MHz
            Graphics                : 1540 MHz
            Graphics                : 1527 MHz
            Graphics                : 1514 MHz
            Graphics                : 1502 MHz
            Graphics                : 1489 MHz
            Graphics                : 1476 MHz
            Graphics                : 1464 MHz
            Graphics                : 1451 MHz
            Graphics                : 1438 MHz
            Graphics                : 1426 MHz
            Graphics                : 1413 MHz
            Graphics                : 1400 MHz
            Graphics                : 1388 MHz
            Graphics                : 1375 MHz
            Graphics                : 1362 MHz
            Graphics                : 1350 MHz
            Graphics                : 1337 MHz
            Graphics                : 1324 MHz
            Graphics                : 1312 MHz
            Graphics                : 1299 MHz
            Graphics                : 1286 MHz
            Graphics                : 1274 MHz
            Graphics                : 1261 MHz
            Graphics                : 1249 MHz
            Graphics                : 1236 MHz
            Graphics                : 1223 MHz
            Graphics                : 1211 MHz
            Graphics                : 1198 MHz
            Graphics                : 1185 MHz
            Graphics                : 1173 MHz
            Graphics                : 1160 MHz
            Graphics                : 1147 MHz
            Graphics                : 1135 MHz
            Graphics                : 1122 MHz
            Graphics                : 1109 MHz
            Graphics                : 1097 MHz
            Graphics                : 1085 MHz
            Graphics                : 1072 MHz
            Graphics                : 1071 MHz
            Graphics                : 1059 MHz
            Graphics                : 1046 MHz
            Graphics                : 1033 MHz
            Graphics                : 1021 MHz
            Graphics                : 1008 MHz
            Graphics                : 995 MHz
            Graphics                : 983 MHz
            Graphics                : 970 MHz
            Graphics                : 957 MHz
            Graphics                : 945 MHz
            Graphics                : 932 MHz
            Graphics                : 919 MHz
            Graphics                : 907 MHz
            Graphics                : 894 MHz
            Graphics                : 881 MHz
            Graphics                : 869 MHz
            Graphics                : 856 MHz
            Graphics                : 844 MHz
            Graphics                : 831 MHz
            Graphics                : 818 MHz
            Graphics                : 806 MHz
            Graphics                : 793 MHz
            Graphics                : 780 MHz
            Graphics                : 768 MHz
            Graphics                : 755 MHz
            Graphics                : 742 MHz
            Graphics                : 730 MHz
            Graphics                : 717 MHz
            Graphics                : 704 MHz
            Graphics                : 692 MHz
            Graphics                : 680 MHz
            Graphics                : 667 MHz
            Graphics                : 655 MHz
            Graphics                : 642 MHz
            Graphics                : 630 MHz
            Graphics                : 617 MHz
            Graphics                : 605 MHz
            Graphics                : 592 MHz
            Graphics                : 590 MHz
            Graphics                : 509 MHz
            Graphics                : 455 MHz
            Graphics                : 388 MHz
            Graphics                : 349 MHz
            Graphics                : 320 MHz
        Memory                      : 3004 MHz
            Graphics                : 1641 MHz
            Graphics                : 1628 MHz
            Graphics                : 1616 MHz
            Graphics                : 1603 MHz
            Graphics                : 1590 MHz
            Graphics                : 1578 MHz
            Graphics                : 1565 MHz
            Graphics                : 1552 MHz
            Graphics                : 1540 MHz
            Graphics                : 1527 MHz
            Graphics                : 1514 MHz
            Graphics                : 1502 MHz
            Graphics                : 1489 MHz
            Graphics                : 1476 MHz
            Graphics                : 1464 MHz
            Graphics                : 1451 MHz
            Graphics                : 1438 MHz
            Graphics                : 1426 MHz
            Graphics                : 1413 MHz
            Graphics                : 1400 MHz
            Graphics                : 1388 MHz
            Graphics                : 1375 MHz
            Graphics                : 1362 MHz
            Graphics                : 1350 MHz
            Graphics                : 1337 MHz
            Graphics                : 1324 MHz
            Graphics                : 1312 MHz
            Graphics                : 1299 MHz
            Graphics                : 1286 MHz
            Graphics                : 1274 MHz
            Graphics                : 1261 MHz
            Graphics                : 1249 MHz
            Graphics                : 1236 MHz
            Graphics                : 1223 MHz
            Graphics                : 1211 MHz
            Graphics                : 1198 MHz
            Graphics                : 1185 MHz
            Graphics                : 1173 MHz
            Graphics                : 1160 MHz
            Graphics                : 1147 MHz
            Graphics                : 1135 MHz
            Graphics                : 1122 MHz
            Graphics                : 1109 MHz
            Graphics                : 1097 MHz
            Graphics                : 1085 MHz
            Graphics                : 1072 MHz
            Graphics                : 1071 MHz
            Graphics                : 1059 MHz
            Graphics                : 1046 MHz
            Graphics                : 1033 MHz
            Graphics                : 1021 MHz
            Graphics                : 1008 MHz
            Graphics                : 995 MHz
            Graphics                : 983 MHz
            Graphics                : 970 MHz
            Graphics                : 957 MHz
            Graphics                : 945 MHz
            Graphics                : 932 MHz
            Graphics                : 919 MHz
            Graphics                : 907 MHz
            Graphics                : 894 MHz
            Graphics                : 881 MHz
            Graphics                : 869 MHz
            Graphics                : 856 MHz
            Graphics                : 844 MHz
            Graphics                : 831 MHz
            Graphics                : 818 MHz
            Graphics                : 806 MHz
            Graphics                : 793 MHz
            Graphics                : 780 MHz
            Graphics                : 768 MHz
            Graphics                : 755 MHz
            Graphics                : 742 MHz
            Graphics                : 730 MHz
            Graphics                : 717 MHz
            Graphics                : 704 MHz
            Graphics                : 692 MHz
            Graphics                : 680 MHz
            Graphics                : 667 MHz
            Graphics                : 655 MHz
            Graphics                : 642 MHz
            Graphics                : 630 MHz
            Graphics                : 617 MHz
            Graphics                : 605 MHz
            Graphics                : 592 MHz
            Graphics                : 590 MHz
            Graphics                : 509 MHz
            Graphics                : 455 MHz
            Graphics                : 388 MHz
            Graphics                : 349 MHz
            Graphics                : 320 MHz
        Memory                      : 810 MHz
            Graphics                : 1455 MHz
            Graphics                : 1442 MHz
            Graphics                : 1430 MHz
            Graphics                : 1417 MHz
            Graphics                : 1404 MHz
            Graphics                : 1392 MHz
            Graphics                : 1379 MHz
            Graphics                : 1366 MHz
            Graphics                : 1354 MHz
            Graphics                : 1341 MHz
            Graphics                : 1328 MHz
            Graphics                : 1316 MHz
            Graphics                : 1303 MHz
            Graphics                : 1290 MHz
            Graphics                : 1278 MHz
            Graphics                : 1265 MHz
            Graphics                : 1252 MHz
            Graphics                : 1240 MHz
            Graphics                : 1227 MHz
            Graphics                : 1215 MHz
            Graphics                : 1202 MHz
            Graphics                : 1189 MHz
            Graphics                : 1177 MHz
            Graphics                : 1164 MHz
            Graphics                : 1151 MHz
            Graphics                : 1139 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
            Graphics                : 1088 MHz
            Graphics                : 1075 MHz
            Graphics                : 1063 MHz
            Graphics                : 1050 MHz
            Graphics                : 1037 MHz
            Graphics                : 1025 MHz
            Graphics                : 1012 MHz
            Graphics                : 999 MHz
            Graphics                : 987 MHz
            Graphics                : 974 MHz
            Graphics                : 961 MHz
            Graphics                : 949 MHz
            Graphics                : 936 MHz
            Graphics                : 923 MHz
            Graphics                : 911 MHz
            Graphics                : 899 MHz
            Graphics                : 886 MHz
            Graphics                : 885 MHz
            Graphics                : 873 MHz
            Graphics                : 860 MHz
            Graphics                : 847 MHz
            Graphics                : 835 MHz
            Graphics                : 822 MHz
            Graphics                : 810 MHz
            Graphics                : 797 MHz
            Graphics                : 784 MHz
            Graphics                : 772 MHz
            Graphics                : 759 MHz
            Graphics                : 746 MHz
            Graphics                : 734 MHz
            Graphics                : 721 MHz
            Graphics                : 708 MHz
            Graphics                : 696 MHz
            Graphics                : 683 MHz
            Graphics                : 670 MHz
            Graphics                : 658 MHz
            Graphics                : 645 MHz
            Graphics                : 632 MHz
            Graphics                : 620 MHz
            Graphics                : 607 MHz
            Graphics                : 594 MHz
            Graphics                : 582 MHz
            Graphics                : 569 MHz
            Graphics                : 556 MHz
            Graphics                : 544 MHz
            Graphics                : 531 MHz
            Graphics                : 519 MHz
            Graphics                : 506 MHz
            Graphics                : 494 MHz
            Graphics                : 481 MHz
            Graphics                : 469 MHz
            Graphics                : 456 MHz
            Graphics                : 444 MHz
            Graphics                : 431 MHz
            Graphics                : 419 MHz
            Graphics                : 406 MHz
            Graphics                : 405 MHz
            Graphics                : 324 MHz
            Graphics                : 270 MHz
            Graphics                : 202 MHz
            Graphics                : 162 MHz
            Graphics                : 135 MHz
        Memory                      : 324 MHz
            Graphics                : 405 MHz
            Graphics                : 324 MHz
            Graphics                : 270 MHz
            Graphics                : 202 MHz
            Graphics                : 162 MHz
            Graphics                : 135 MHz
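- select one combination from the supported application clocks near the actual clocks (without GPU boost)
(I selected Memory 3004 (from the P2 state) and Graphics 1109)
- change the actual application clocks (this is secondary to our purpose), which incidentally switches to the full graphics performance state
(nvidia-smi names it P0, nvidia-settings names it P3)
Code: Select all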
nvidia-smi -ac 3004,1109
Applications clocks set to "(MEM 3004, SM 1109)" for GPU 0000:01:00.0
Warning: persistence mode is disabled on this device. This settings will go back to default as soon as driver unloads (e.g. last application like nvidia-smi or cuda application terminates). Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.
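The choice of 3004,1109 above can be reproduced mechanically. The Python sketch below parses the "Supported Clocks" section of `nvidia-smi -q -d SUPPORTED_CLOCKS` and picks the supported graphics clock closest to the current application clock (1113 MHz) for a given memory clock. The "closest value" rule is my reading of "near actual clocks", not something the tool prescribes, and the function names and the shortened `SAMPLE` excerpt (values copied from the log above) are only for illustration.

```python
import re

# Shortened excerpt of the SUPPORTED_CLOCKS log above, for demonstration.
SAMPLE = """\
        Memory                      : 3004 MHz
            Graphics                : 1135 MHz
            Graphics                : 1122 MHz
            Graphics                : 1109 MHz
            Graphics                : 1097 MHz
        Memory                      : 810 MHz
            Graphics                : 1126 MHz
            Graphics                : 1113 MHz
            Graphics                : 1101 MHz
"""

LINE = re.compile(r"\s*(Memory|Graphics)\s*:\s*(\d+)\s*MHz")


def parse_supported_clocks(text):
    """Map each memory clock to the graphics clocks listed beneath it."""
    table = {}
    current_memory = None
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip headers, timestamps and other log lines
        kind, mhz = match.group(1), int(match.group(2))
        if kind == "Memory":
            current_memory = mhz
            table[current_memory] = []
        elif current_memory is not None:
            table[current_memory].append(mhz)
    return table


def nearest_graphics_clock(table, memory_mhz, target_mhz):
    """Supported graphics clock closest to target_mhz for memory_mhz."""
    return min(table[memory_mhz], key=lambda clock: abs(clock - target_mhz))


if __name__ == "__main__":
    clocks = parse_supported_clocks(SAMPLE)
    print(nearest_graphics_clock(clocks, 3004, 1113))  # 1109
```

The resulting pair is what would then be passed to `nvidia-smi -ac` as `<memory>,<graphics>`.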
These are the actual settings:
Code: Select all
nvidia-settings -q [gpu:0]/GPUCurrentPerfLevel -q [gpu:0]/GPUCurrentClockFreqs
  Attribute 'GPUCurrentPerfLevel' (mslinuxmint:0[gpu:0]): 3.
    'GPUCurrentPerfLevel' is an integer attribute.
    'GPUCurrentPerfLevel' is a read-only attribute.
    'GPUCurrentPerfLevel' can use the following target types: X Screen, GPU.
  Attribute 'GPUCurrentClockFreqs' (mslinuxmint:0[gpu:0]): 1514,3505.
    'GPUCurrentClockFreqs' is a packed integer attribute.
    'GPUCurrentClockFreqs' is a read-only attribute.
    'GPUCurrentClockFreqs' can use the following target types: X Screen, GPU.
nvidia-smi --query-gpu=pstate --format=csv
pstate
P0
Because the performance state has now changed to full graphics (3 / P0) after these steps, the memory clock can be changed in the well-known way with nvidia-settings.
Code: Select all
nvidia-settings --assign [gpu:0]/GPUMemoryTransferRateOffset[3]=...
Code: Select all
nvidia-smi -rac
All done.
nvidia-settings -q [gpu:0]/GPUCurrentPerfLevel -q [gpu:0]/GPUCurrentClockFreqs
  Attribute 'GPUCurrentPerfLevel' (mslinuxmint:0[gpu:0]): 2.
    'GPUCurrentPerfLevel' is an integer attribute.
    'GPUCurrentPerfLevel' is a read-only attribute.
    'GPUCurrentPerfLevel' can use the following target types: X Screen, GPU.
  Attribute 'GPUCurrentClockFreqs' (mslinuxmint:0[gpu:0]): 1514,3004.
    'GPUCurrentClockFreqs' is a packed integer attribute.
    'GPUCurrentClockFreqs' is a read-only attribute.
    'GPUCurrentClockFreqs' can use the following target types: X Screen, GPU.
nvidia-smi --query-gpu=clocks.applications.graphics --format=csv
clocks.applications.graphics [MHz]
1113 MHz
nvidia-smi --query-gpu=clocks.applications.memory --format=csv
clocks.applications.memory [MHz]
3505 MHz
nvidia-smi --query-gpu=pstate --format=csv
pstate
P2
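To compare the two nvidia-settings queries above without reading the full attribute dump, the packed GPUCurrentClockFreqs value can be split into its two parts. A minimal sketch, assuming the verbose output format shown above and that the pair is ordered graphics clock, then memory clock (which matches the 1514,3505 and 1514,3004 readings); parse_clock_freqs is a hypothetical helper:

```shell
#!/bin/sh
# parse_clock_freqs
# Reads verbose nvidia-settings output on stdin and prints the packed
# GPUCurrentClockFreqs value as "graphics=<MHz> memory=<MHz>".
# Hypothetical helper; the field order is an assumption based on the
# readings in this thread.
parse_clock_freqs() {
    awk '/Attribute .GPUCurrentClockFreqs./ {
        value = $NF                 # last field, e.g. "1514,3004."
        sub(/\.$/, "", value)       # drop the trailing full stop
        split(value, f, ",")
        print "graphics=" f[1] " memory=" f[2]
    }'
}

# Sample line copied from the output after "nvidia-smi -rac".
printf '%s\n' "  Attribute 'GPUCurrentClockFreqs' (mslinuxmint:0[gpu:0]): 1514,3004." \
    | parse_clock_freqs
# prints: graphics=1514 memory=3004
```

On a live system the input would be the output of `nvidia-settings -q '[gpu:0]/GPUCurrentClockFreqs'`.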
There is one marginal problem: the minimum VRAM clock offset at P3 (P0) for "nvidia-settings --assign [gpu:0]/GPUMemoryTransferRateOffset[3]=...", because of the higher base VRAM clock of P3, results in the same base clock as P2.
 So nothing is gained. I'll have a look at the other performance options of nvidia-settings and nvidia-smi; maybe there is another solution. The Windows tools can increase / decrease the limits, so why not under Linux?
- 
				Grandpa_01
- Posts: 1122
- Joined: Wed Mar 04, 2009 7:36 am
- Hardware configuration: 3 - Supermicro H8QGi-F AMD MC 6174=144 cores 2.5Ghz, 96GB G.Skill DDR3 1333Mhz Ubuntu 10.10
 2 - Asus P6X58D-E i7 980X 4.4Ghz 6GB DDR3 2000 A-Data 64GB SSD Ubuntu 10.10
 1 - Asus Rampage Gene III i7 970 4.3Ghz DDR3 2000 2-500GB Seagate 7200.11 0-Raid Ubuntu 10.10
 1 - Asus G73JH Laptop i7 740QM 1.86Ghz ATI 5870M
Re: P9625-9643 and bad states- some observations
Yeah, we need NVidia to add some more memory clock speed options to their drivers. It would be a simple fix if they did.
			
			
									
						
							2 - SM H8QGi-F AMD 6xxx=112 cores @ 3.2 & 3.9Ghz
5 - SM X9QRI-f+ Intel 4650 = 320 cores @ 3.15Ghz
2 - I7 980X 4.4Ghz 2-GTX680
1 - 2700k 4.4Ghz GTX680
Total = 464 cores folding
- 
				bigblock990
- Posts: 20
- Joined: Wed Sep 09, 2015 12:42 pm
Re: P9625-9643 and bad states- some observations
Changed my custom BIOS from 7010 MHz to 5500 MHz memory clocks for my 970s. The frequency of bad state errors and failed WUs remained the same for openmm_21 projects. It also dropped PPD by ~50k on those projects. Core18 still works great with the same PPD.
So dropping memory clocks may help in Windows, but it doesn't in Linux. I'll be going back to 7010 MHz.
			
			
									
						
										
- 
				toTOW
- Site Moderator
- Posts: 6497
- Joined: Sun Dec 02, 2007 10:38 am
- Location: Bordeaux, France
- Contact:
Re: P9625-9643 and bad states- some observations
Did you change the clock for the right P state? 7010 MHz (1752 MHz real) is the clock for P0, but most cards fold in the P2 state, whose default clock is 6000 MHz (1500 MHz real).
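The different numbers in this thread describe the same memory clocks in different conventions. Going by the figures quoted here (7010 vs 1752 above, and 3505 in the nvidia-smi output earlier), the effective rate is four times the real clock and twice the value nvidia-smi reports. A small sketch of that arithmetic, assuming those ratios hold for these GDDR5 cards; clock_views is a hypothetical helper:

```shell
#!/bin/sh
# clock_views EFFECTIVE_MHZ
# Prints the effective transfer rate alongside the value nvidia-smi would
# show (effective / 2) and the real clock (effective / 4). The ratios are
# an assumption taken from the numbers quoted in this thread.
clock_views() {
    awk -v eff="$1" 'BEGIN {
        printf "effective=%d smi=%.1f real=%.1f\n", eff, eff / 2, eff / 4
    }'
}

clock_views 7010   # P0 default: effective=7010 smi=3505.0 real=1752.5
clock_views 6000   # P2 default: effective=6000 smi=3000.0 real=1500.0
```

nvidia-smi listed 3004 MHz for P2 earlier in the thread, so the 6000 MHz figure is presumably rounded.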
			
			
									
						
										
						- 
				bigblock990
- Posts: 20
- Joined: Wed Sep 09, 2015 12:42 pm
Re: P9625-9643 and bad states- some observations
toTOW wrote:Did you change the clock for the right P state? 7010 MHz (1752 MHz real) is the clock for P0, but most cards fold in the P2 state, whose default clock is 6000 MHz (1500 MHz real).
Yes, I created a modified BIOS so that P2 runs at 7010 MHz, the same as P0. I then modified that BIOS so that P2 runs at 5500 MHz to test Grandpa_01's theory in Linux.
- 
				toTOW
- Site Moderator
- Posts: 6497
- Joined: Sun Dec 02, 2007 10:38 am
- Location: Bordeaux, France
- Contact:
Re: P9625-9643 and bad states- some observations
So maybe the Linux core is able to push something harder than the Windows one?
Or we're completely mistaken, and the root cause still has to be found ...
			
			
									
						
										

- 
				bigblock990
- Posts: 20
- Joined: Wed Sep 09, 2015 12:42 pm
Re: P9625-9643 and bad states- some observations
I'm not sure what the problem is. I don't have any issues with core18, and core21 9704/9712 work just fine.
I only have problems with core21 9205/06 and 9625-9643.
I have tried returning the cards to stock (factory OC), and also to reference stock clocks, and still have bad state errors and failed units. Now I have tried low memory clocks, which also didn't help. I would say the problem lies within the fahcore, except that Grandpa_01 claims his Kepler cards work fine and it's only Maxwell that has issues.
			
			
									
						
										
- 
				jimerickson
- Posts: 533
- Joined: Tue May 27, 2008 11:56 pm
- Hardware configuration: Parts:
 Asus H370 Mining Master motherboard (X2)
 Patriot Viper DDR4 memory 16gb stick (X4)
 Nvidia GeForce GTX 1080 gpu (X16)
 Intel Core i7 8700 cpu (X2)
 Silverstone 1000 watt psu (X4)
 Veddha 8 gpu miner case (X2)
 Thermaltake hsf (X2)
 Ubit riser card (X16)
- Location: ames, iowa
Re: P9625-9643 and bad states- some observations
9205 & 9206 are problematic on Titan X. Very few run properly. Between slowdowns, bad states and hanging at 99.99%, few if any run to completion without intervention. I have to babysit my Titan Xs and reboot several times a day. It's getting old. This is on Linux with the 355.11 driver.
			
			
									
						
										
						- 
				bigblock990
- Posts: 20
- Joined: Wed Sep 09, 2015 12:42 pm
Re: P9625-9643 and bad states- some observations
Completed 9205 today with ZERO bad state errors, the first time that's ever happened for me. This was on an EVGA 980 classy KPE, which I just got folding yesterday evening. The KPE GPUs come with Samsung memory, whereas pretty much all others come with either Hynix or Elpida. I have two of them folding, and I will keep an eye on them over the next couple of days to see if other core21 projects complete without issues.
			
			
									
						
										
						- 
				mattifolder
Re: P9625-9643 and bad states- some observations
jimerickson wrote:This is on Linux with the 355.11 driver.
In my experience (GTX 970), Linux driver version 346.96 is the fastest and most stable version for folding (Linux Mint MATE 17.2). With reduced OC the 96[234]x projects also produce bad states, but not all are faulty.