It seems that a lot of GPU problems revolve around specific versions of drivers. Though NVidia has their own support structure, you can often learn from information reported by others who fold.
snapshot wrote:If someone can point me to instructions on how to hack the relevant files then I might give it a try.
I did not try this myself but if you want to experiment: http://null-bin.blogspot.de/2015/08/how ... river.html
But don't blame me if it does not work or Windows refuses to boot anymore.
It's a shame that NV has not fixed their drivers yet. Apparently the 1050 and the 1050 Ti are the only two GPUs which cannot revert to older drivers without some kind of a hack. If you do try foldy's hack, let us know how it went.
So this is why the 750 Ti I have throws all sorts of crazy Event Log messages?
\Device\Video4
Graphics SM Global Exception on (GPC 0, TPC 4): Physical Multiple Warp Errors
\Device\Video4
Graphics SM Warp Exception on (GPC 0, TPC 4): Out Of Range Address
\Device\Video4
Graphics Exception: ESR 0x505e48=0x11000e 0x505e50=0x4 0x505e44=0xd3eff2 0x505e4c=0x7f
I recognize what warps are and how NVIDIA uses them to move data through the pipeline. It's pretty insane if NVIDIA still has not fixed this, and it's been going on for over six weeks...
Huh, odd. Yes, this first started occurring after I installed the 376.19 drivers. My Titan Black blew itself up, so I moved the 750 Ti into this system. In the original system the 750 Ti was using the 372 drivers.
Dunno if you want the reply in its own thread or not. Here's a partial log; I found the system had stopped GPU folding because of the number of failed units, so for now I just removed the GPU folding slot entirely. I believe the event log has an error for every single failed WU, always one of the three causes listed above.
Yes, the message "exception: Error downloading array interactionCount: clEnqueueReadBuffer (-5)" has only been found with the 375.xx/376.xx drivers. You can restart GPU folding if you reinstall an older driver version.
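For anyone wondering what that -5 actually is: it's OpenCL's CL_OUT_OF_RESOURCES error code, and on NVIDIA's OpenCL stack that's a common way for a kernel-side fault (the same class of fault as the "Out Of Range Address" warp exception in the event log above) to show up on the next host API call. Here's a minimal sketch of the pattern, written purely for illustration and not core_21's actual code; the kernel, buffer size, and the deliberate bug are all made up:

// Minimal OpenCL sketch (illustration only, not core_21 code): a kernel
// with an out-of-range write, followed by a blocking read whose error
// code is decoded. The fault and all names here are hypothetical.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kSrc =
    "__kernel void bad_write(__global float* out, int n) {\n"
    "    int i = get_global_id(0);\n"
    "    out[i + n] = 1.0f;   /* deliberately past the end of the buffer */\n"
    "}\n";

int main() {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

    const int n = 1024;
    std::vector<float> host(n, 0.0f);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), nullptr, nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "bad_write", nullptr);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(k, 1, sizeof(int), &n);

    size_t global = n;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);

    // After a faulting kernel, this blocking read typically comes back
    // with -5 (CL_OUT_OF_RESOURCES) on NVIDIA's OpenCL implementation.
    cl_int err = clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                                     host.data(), 0, nullptr, nullptr);
    if (err == CL_OUT_OF_RESOURCES)          // CL_OUT_OF_RESOURCES == -5
        std::printf("clEnqueueReadBuffer failed: CL_OUT_OF_RESOURCES (-5)\n");
    else if (err != CL_SUCCESS)
        std::printf("clEnqueueReadBuffer failed: %d\n", err);
    return 0;
}

The point isn't the toy kernel itself, just that the error FAHClient logs is the host-side read reporting that something already went wrong earlier on the device.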
Thanks for the help Bruce. I made a regression post in the 376.19 thread.
Quick aside: do any flags give preference for the AVX WUs, or are those luck of the draw? Trying to keep my PPD above the EVGA monthly minimum with the Titan out of commission.
The explanation makes perfect sense and if it IS a bug that's been surfaced by the driver code, it's in ALL of our best interest for OpenMM to dig into it to fix it, because that's a possible vulnerability that can be leveraged by not so nice coders to make our hardware do not so nice things . . .
Kougar wrote:... do any flags give preference for the AVX WUs or are those luck of the draw? Trying to keep my PPD above the EVGA monthly minimum with the Titan out of commission
Off Topic, but see viewtopic.php?f=105&t=29273&p=291129#p291129
The explanation makes perfect sense and if it IS a bug that's been surfaced by the driver code, it's in ALL of our best interest for OpenMM to dig into it to fix it, because that's a possible vulnerability that can be leveraged by not so nice coders to make our hardware do not so nice things . . .
True, but it's strange that it appeared only in the 375-6 series of drivers.
Yes, a bug has been found in OpenMM, but fixing it doesn't resolve the issue. In other words (contrary to external appearances) SOMETHING is happening.
At this point they're continuing to coordinate, and when there is any concrete information about a fix, we'll hear about it, whether it's in OpenMM, in the 375-6 drivers, or both.
bruce wrote:
True, but it's strange that it appeared only in the 375-6 series of drivers.
Nah, not strange at all. The way I parse the message from NVidia is "We did some optimization of our code based on expected results and pushed it out in that version that started breaking things. Our 'optimization' brought to light unexpected behavior in the underlying support code of the OpenMM infrastructure."
Basically, NVidia is owning up to making a change in how the CUDA code interfaces with the OpenMM software, but it is causing entirely unexpected results to be returned, and they are under the belief that there's a synchronization error in the underlying code.
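To make "synchronization error" a bit more concrete in OpenCL terms (which is what core_21 drives on NVIDIA): the classic bug of that class is reading results back without expressing a dependency on the kernel that produced them. This is purely my own sketch of the bug class being described, not OpenMM's actual code, and the context/queue setup is omitted:

// Illustration only (not OpenMM code): with an out-of-order command queue,
// commands may overlap unless the host states the dependency explicitly.
#include <CL/cl.h>

void read_results(cl_command_queue q, cl_kernel k, cl_mem buf,
                  float* host, size_t bytes, size_t global)
{
    cl_event kernel_done;
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr,
                           0, nullptr, &kernel_done);

    // Buggy pattern: a non-blocking read with no wait list. On an
    // out-of-order queue the copy can start before the kernel finishes,
    // so the host sees stale or partially written data.
    // clEnqueueReadBuffer(q, buf, CL_FALSE, 0, bytes, host, 0, nullptr, nullptr);

    // Correct pattern: make the read wait on the kernel's completion event
    // (or call clFinish(q) before touching the results on the host).
    clEnqueueReadBuffer(q, buf, CL_FALSE, 0, bytes, host,
                        1, &kernel_done, nullptr);
    clFinish(q);                 // make sure the copy has landed in host memory
    clReleaseEvent(kernel_done);
}

A blocking read on an in-order queue hides this sort of thing, which is why an "optimization" that changes how commands get scheduled could be exactly what exposed it.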
I'm glad to know that there is cooperation between the two teams, but now my curious nerd side has been poked with a sharp stick and wants to know the nitty gritty details, once it all gets hammered out.
I hope the OpenMM team will see it Nvidia's way. I would hate for Nvidia to have to branch or suppress some of their future Game Ready optimizations just to keep a bunch of non-gaming do-gooders like us happy! On the other hand, if they can control it at the app profile level at compile time, perhaps that isn't even an issue. Of course, this ultimately will mean that core_21 will need to be rebuilt if OpenMM implements a fix.
snapshot wrote:If someone can point me to instructions on how to hack the relevant files then I might give it a try.
I did not try this myself but if you want to experiment: http://null-bin.blogspot.de/2015/08/how ... river.html
But don't blame me if it does not work or Windows refuses to boot anymore.
Thanks for that. It's a bit laptop-specific but I think there are enough clues in there to at least have a look. I'll be using a test PC and taking a fresh Acronis TIH image first....