Pausing mitigates TDR bug?

Posted: Wed Jul 16, 2014 12:44 pm
by csvanefalk
I fold on a GTX770, using the 319.76 Linux drivers on a Fedora 20 box.

While I was previously rebooting every 36 hours to avoid the TDR bug, I have noticed that pausing the folding seems to have the same effect. Letting the card rest for 5-10 minutes between each WU, I am now approaching 72 hours of folding without rebooting.

Can anyone confirm if this is expected behavior?

Re: Pausing mitigates TDR bug?

Posted: Wed Jul 16, 2014 1:14 pm
by 7im
The TDR bug was purely time-related. It didn't matter whether you were folding, gaming, or doing neither. So pausing should have no effect.

Re: Pausing mitigates TDR bug?

Posted: Wed Jul 16, 2014 1:41 pm
by csvanefalk
Understood. Could it have something to do with my OS then? I am far past the 36-hour cutoff, and there have been no broken WUs, no crashes, or any other symptoms of the bug at all.
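
A quick way to confirm how far past the 36-hour mark a Linux box actually is would be to read `/proc/uptime`, whose first field is the number of seconds since boot. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the bug's timer starts at boot, which the thread implies but does not confirm.

```python
def hours_since_boot(uptime_text):
    """Convert the contents of /proc/uptime to hours since boot.

    The first whitespace-separated field is uptime in seconds.
    """
    seconds = float(uptime_text.split()[0])
    return seconds / 3600.0


def past_cutoff(uptime_text, cutoff_hours=36.0):
    """True if the machine has been up longer than cutoff_hours.

    The 36-hour default is the cutoff mentioned in this thread.
    """
    return hours_since_boot(uptime_text) > cutoff_hours
```

On a live system the text would come from `open("/proc/uptime").read()`. Note that this measures system uptime only; it says nothing about whether the driver module was reloaded in the meantime.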

Re: Pausing mitigates TDR bug?

Posted: Wed Jul 16, 2014 2:37 pm
by ChristianVirtual
I experienced the TDR bug mainly on a GTX 780 at the time, which uses the GK110 chip (as do the Titan and 780 Ti). The 770 uses GK104.

Newer drivers fixed the TDR bug, but GK104-based cards got slower (like my 660 Ti). I split my GPUs across different systems and gave each a matching driver. I have not seen the TDR bug for nine months or so.

Re: Pausing mitigates TDR bug?

Posted: Wed Jul 16, 2014 5:16 pm
by 7im
csvanefalk wrote:Understood, could it have something to do with my OS then? I am far past the 36-hour cutoff, and there have been no broken WUs, no crashes, or any other symptoms of the bug at all.
Two options: either you are on a driver version that predates the TDR bug, or the GPU did a reset. Check the FAH logs for any folding interruptions in the last two days other than your pausing the client.

Optionally, there is a v55 fahcore that has no folding slowdown, so you could upgrade past the TDR bug driver version and just use the latest NV driver.
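
The log check suggested above can be partly automated by looking for unusually long silences between consecutive log lines. The sketch below is an illustration only: it assumes each relevant log line begins with an `HH:MM:SS:` timestamp and carries no date, and the `find_gaps` name and the 30-minute threshold are arbitrary choices, not anything defined by the FAH client.

```python
import re
from datetime import timedelta

# Assumption: relevant log lines start with an HH:MM:SS timestamp and a colon.
TIMESTAMP = re.compile(r"^(\d{2}):(\d{2}):(\d{2}):")


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (previous_line, next_line, gap) for each silence longer than threshold."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip headers and lines without a leading timestamp
        h, m, s = (int(g) for g in match.groups())
        current = timedelta(hours=h, minutes=m, seconds=s)
        if prev_time is not None:
            gap = current - prev_time
            if gap < timedelta(0):
                # Timestamps carry no date, so assume a single midnight rollover.
                gap += timedelta(days=1)
            if gap > threshold:
                gaps.append((prev_line.rstrip(), line.rstrip(), gap))
        prev_time, prev_line = current, line
    return gaps
```

Passing an open log file to `find_gaps` yields the lines on either side of each gap, which can then be compared against the times the client was paused by hand. Silences longer than 24 hours would be under-reported, since the rollover handling only adds one day.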

Re: Pausing mitigates TDR bug?

Posted: Wed Jul 16, 2014 5:26 pm
by bollix47
7im wrote:Optionally, there is a v55 fahcore that has no folding slowdown, so you could upgrade past the TDR bug driver version and just use the latest NV driver.
AFAIK that version of the core is Windows only at this time.

Re: Pausing mitigates TDR bug?

Posted: Wed Jul 16, 2014 7:13 pm
by 7im
bollix47 wrote:
7im wrote:Optionally, there is a v55 fahcore that has no folding slowdown, so you could upgrade past the TDR bug driver version and just use the latest NV driver.
AFAIK that version of the core is Windows only at this time.
Yep. Time to poke Prot again.

Re: Pausing mitigates TDR bug?

Posted: Wed Jul 16, 2014 8:48 pm
by heikosch
7im wrote:
bollix47 wrote:
7im wrote:Optionally, there is a v55 fahcore that has no folding slowdown, so you could upgrade past the TDR bug driver version and just use the latest NV driver.
AFAIK that version of the core is Windows only at this time.
Yep. Time to poke Prot again.
In my opinion v55 is still beta.

Heiko

Re: Pausing mitigates TDR bug?

Posted: Wed Jul 16, 2014 9:47 pm
by 7im
Operationally, yes (simply because no one has moved it to public yet).

Is there some functional reason you think they should not release it as public?

Re: Pausing mitigates TDR bug?

Posted: Thu Jul 17, 2014 7:54 pm
by heikosch
7im wrote:Operationally, yes (simply because no one has moved it to public yet).

Is there some functional reason you think they should not release it as public?
No, but I've no idea who decides on the public release of a fahcore, or why it takes so long to release an obviously working fahcore version.

Heiko

Re: Pausing mitigates TDR bug?

Posted: Fri Jul 18, 2014 5:46 am
by csvanefalk
7im - Neither of the cases you mentioned seems to apply to me. The driver version is 319.76, and I have had the TDR issue with it before:

[christopher@chrisdesktop ~]$ nvidia-smi 
Fri Jul 18 07:44:01 2014       
+------------------------------------------------------+                       
| NVIDIA-SMI 5.319.76   Driver Version: 319.76         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 770     Off  | 0000:03:00.0     N/A |                  N/A |
| 50%   66C  N/A     N/A /  N/A |      688MB /  2047MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+
I also cannot find any evidence in the log of the GPU resetting, apart from me pausing it (too large to post here):

http://hastebin.com/zomevafedu.coffee
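
For anyone comparing driver versions across several machines, the version can be pulled out of the `nvidia-smi` banner with a small regular expression. This is a rough sketch that relies only on the `Driver Version:` label visible in the output above; the `driver_version` helper is a name chosen for illustration, and the banner layout may differ in other driver releases.

```python
import re


def driver_version(smi_output):
    """Return the version string after 'Driver Version:', or None if absent."""
    match = re.search(r"Driver Version:\s*([0-9]+(?:\.[0-9]+)*)", smi_output)
    return match.group(1) if match else None


# Banner line copied from the nvidia-smi output posted above.
banner = "| NVIDIA-SMI 5.319.76   Driver Version: 319.76         |"
print(driver_version(banner))  # prints 319.76
```

To run it against the live tool, the text could be captured with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and passed to the same function.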

Re: Pausing mitigates TDR bug?

Posted: Fri Jul 18, 2014 2:07 pm
by 7im
The bug, as reported in the NV forum, was time based. You are welcome to look it up.

Also, keep trying your pause trick. Does it work consistently, or just once in a while? Let us know.

Re: Pausing mitigates TDR bug?

Posted: Sun Jul 20, 2014 9:17 am
by csvanefalk
I have not used the pause trick for at least 48 hours, and the folding process continues without error. There appear to be no traces of the bug at all. I wish I could determine exactly how I got to this stage for the benefit of other Linux GPU folders, but the only major change I can recall making was recompiling the driver after updating to kernel 3.15.

Complete log is here: http://hastebin.com/bahegomewu.coffee