Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:00 pm
by MtM
Lol @ car analogy ;)
Technically, if the same code works on certain cards but not on others, we can look at the driver or hardware level. However, the core is partly responsible for this as well, so it takes work on both sides to find out what is wrong (NVIDIA with the CUDA code and PG with the core). This is what makes debugging this issue very hard.
In short, you don't know either?

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:07 pm
by Xilikon
Nope. I'm a programmer by trade myself and I'm usually good at narrowing down the possible causes, but the current issue is very hard to narrow down. I tried to find trends, but the results are so varied that I'm really stumped. The only thing I'm sure of is that the 8600 series seems to have more problems than others and the GTX 2xx series has zero problems. Besides this, it's hit or miss between these two series.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:15 pm
by MtM
Xilikon wrote:Nope. I'm a programmer by trade myself and I'm usually good at narrowing down the possible causes, but the current issue is very hard to narrow down. I tried to find trends, but the results are so varied that I'm really stumped. The only thing I'm sure of is that the 8600 series seems to have more problems than others and the GTX 2xx series has zero problems. Besides this, it's hit or miss between these two series.
What struck me was that with the 1.15 core, the efficiency was greatly enhanced but the EUE rate was also over the top. I speculated about it being a local cache problem, seeing how cards with slower RAM seem to be affected more (thinking: the cache on the SIMD units isn't filled fast enough, so the next instruction tries to fetch data which isn't there yet).

I'm a hobbyist programmer, certainly not into low-level languages, so there is a lot of wet-finger work in that assumption.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:23 pm
by Xilikon
Your guess is as good as mine, but I doubt that's really the case, since the 9800GTX+ also has problems (almost at the same rate as the 8600GT) and it should be using faster memory. Right now, there are not enough debug messages to pinpoint the exact cause.

Re: Project 5801 issues.

Posted: Wed Oct 29, 2008 1:27 pm
by shatteredsilicon
VijayPande wrote:
shatteredsilicon wrote: Two words: regression testing
This makes it all the more shocking just how broken the nVidia core 1.15 is.
1.15 passed all of the regression testing on machines at Stanford and NVIDIA and then passed FAH beta testing. There's not much more we can do than that before releasing it. Keep in mind that we now know that for many people (some boards), 1.15 is perfectly fine and stable, whereas for others, it doesn't work at all. If that's the case, my guess is that this is a CUDA or hardware issue. If the code in 1.15 were really broken, it would not work on any hardware, which is definitely not the case. We're working with NVIDIA on this one. The first step is to get the problem reproducible in their labs.

The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.
Which hardware does it work on perfectly? I don't remember any one particular model not listed as experiencing problems. Pretty much the entire G8x/G9x line-up appears to have been listed by users as affected at stock clock speeds, which indicates a systematic rather than a random failure. I'm quite curious to know which boards you were originally doing the testing with, if you are saying the error is not reproducible on them.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:30 pm
by shatteredsilicon
Xilikon wrote:Nope. I'm a programmer by trade myself and I'm usually good at narrowing down the possible causes, but the current issue is very hard to narrow down. I tried to find trends, but the results are so varied that I'm really stumped. The only thing I'm sure of is that the 8600 series seems to have more problems than others and the GTX 2xx series has zero problems. Besides this, it's hit or miss between these two series.
This is the part where one has to ask: how big is the diff between 1.09 and 1.15? Has the compiler been changed or updated between the two? There is only going to be so much code in there that could be causing the problem to manifest itself.
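
If intermediate builds between 1.09 and 1.15 were available (an assumption; nothing in this thread says they are), the narrowing-down could be sketched as a bisection: test the middle build, then halve the range depending on whether it fails. The function name, the build labels and the is_bad check below are all hypothetical, and the sketch assumes that once a build is bad, every later build is bad too.

```python
from typing import Callable, Sequence


def first_bad_build(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest build for which is_bad() reports a failure.

    Assumptions (hypothetical, for illustration only):
    - builds is ordered oldest to newest,
    - builds[0] is known good and builds[-1] is known bad,
    - every build after the first bad one is also bad.
    """
    lo, hi = 0, len(builds) - 1  # lo: known-good index, hi: known-bad index
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return builds[hi]
```

With seven builds this needs at most three test runs instead of six. In practice is_bad would have to mean something like "produces EUEs on a known-problematic card", which is exactly the part that is hard to reproduce.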

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:36 pm
by Xilikon
shatteredsilicon wrote:
Xilikon wrote:Nope. I'm a programmer by trade myself and I'm usually good at narrowing down the possible causes, but the current issue is very hard to narrow down. I tried to find trends, but the results are so varied that I'm really stumped. The only thing I'm sure of is that the 8600 series seems to have more problems than others and the GTX 2xx series has zero problems. Besides this, it's hit or miss between these two series.
This is the part where one has to ask: how big is the diff between 1.09 and 1.15? Has the compiler been changed or updated between the two? There is only going to be so much code in there that could be causing the problem to manifest itself.
That's a very very good question...

The reason the PG is so eager to kick 1.09 to the curb is that they are putting lots of time into the new Lambda units, which is where real science is done. Those units unfortunately fail a lot with 1.09, so they are stuck working on smaller and less useful units. However, the newer core breaks a lot of small units, so I'm sure there is something caused by the compiler or some code change.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 2:14 pm
by MtM
Xilikon wrote:Your guess is as good as mine, but I doubt that's really the case, since the 9800GTX+ also has problems (almost at the same rate as the 8600GT) and it should be using faster memory. Right now, there are not enough debug messages to pinpoint the exact cause.
Could still be memory related; speed is only one aspect, latency is another. I've been waiting to hear something about that. I had a discussion about the 9600GSOs: there is a 512MB version with GDDR2 which only gets <4K PPD, while the other variants all get over 5K easily. Same number of stream processors, and when clocked the same there is still that much difference, while everyone always said memory does not matter for folding.

Wet-finger work, I know; don't read too much into it.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 3:12 pm
by theo343
VijayPande wrote:5801 was just a copy of another project, which did go all the way through QA. Nevertheless, I will have a talk with the responsible parties about this.
Thanks again :)

Re: Project 5801 issues.

Posted: Wed Oct 29, 2008 3:55 pm
by VijayPande
shatteredsilicon wrote: Which hardware does it work on perfectly? I don't remember any one particular model not listed as experiencing problems. Pretty much the entire G8x/G9x line-up appears to have been listed by users as affected at stock clock speeds, which indicates a systematic rather than a random failure. I'm quite curious to know which boards you were originally doing the testing with, if you are saying the error is not reproducible on them.
GTX260 and GTX280 seem to be running fine. Also, it looks like certain boards in the previous generations do work for certain people. I can't tell what's the difference between these boards and whether this is hardware or drivers. What is clear is that the same CUDA code works perfectly fine (no EUE's, running reliably) on some boards and not at all on others. This is very different than what we'd be dealing with on the CPU side.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 4:18 pm
by powerarmour
Xilikon wrote:Your guess is as good as mine, but I doubt that's really the case, since the 9800GTX+ also has problems (almost at the same rate as the 8600GT) and it should be using faster memory. Right now, there are not enough debug messages to pinpoint the exact cause.
To throw my penny in the ring: out of all the cards I fold with (2x 8800GT, 9600GT, 9500GT and a 9800GTX+), it's only the 9800GTX+ that has given any errors. Since many other 9800GTX+ users have had the same problems, I think it's a good candidate to be a test card.

I have the feeling that if the core can be stable on that and the 8600GT/S, then it'll be stable! Something is a bit odd, for sure.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 4:20 pm
by Xilikon
powerarmour wrote:
Xilikon wrote:Your guess is as good as mine, but I doubt that's really the case, since the 9800GTX+ also has problems (almost at the same rate as the 8600GT) and it should be using faster memory. Right now, there are not enough debug messages to pinpoint the exact cause.
To throw my penny in the ring: out of all the cards I fold with (2x 8800GT, 9600GT, 9500GT and a 9800GTX+), it's only the 9800GTX+ that has given any errors. Since many other 9800GTX+ users have had the same problems, I think it's a good candidate to be a test card.

I have the feeling that if the core can be stable on that and the 8600GT/S, then it'll be stable! Something is a bit odd, for sure.
Yes, I agree that the PG should get the 8600GT and the 9800GTX+ cards in the labs and work to get the core stable with those cards. I bet that when they succeed at this, everything else should work fine.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 4:34 pm
by gaster
MtM wrote:
Xilikon wrote:Your guess is as good as mine, but I doubt that's really the case, since the 9800GTX+ also has problems (almost at the same rate as the 8600GT) and it should be using faster memory. Right now, there are not enough debug messages to pinpoint the exact cause.
Could still be memory related; speed is only one aspect, latency is another. I've been waiting to hear something about that. I had a discussion about the 9600GSOs: there is a 512MB version with GDDR2 which only gets <4K PPD, while the other variants all get over 5K easily. Same number of stream processors, and when clocked the same there is still that much difference, while everyone always said memory does not matter for folding.

Wet-finger work, I know; don't read too much into it.
The 512MB 9600GSO from Asus has 128-bit memory. All other 9600GSO cards that I have seen, 384MB or 768MB, have 192-bit memory.
That is a pretty significant difference and probably accounts for the PPD difference. But memory speed variances on cards with 192-bit memory do not make a big difference. They are all at least in the same league.
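
To put a rough number on the bus-width point: theoretical peak memory bandwidth is the bus width in bytes multiplied by the effective transfer rate. The sketch below is only an illustration; the function name and the 1000 MT/s rate are placeholders, not the real specs of any 9600GSO.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, effective_rate_mt_s: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: memory interface width, e.g. 128 or 192.
    effective_rate_mt_s: effective data rate in mega-transfers per second
        (a placeholder value in the example below, not a measured spec).
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_rate_mt_s * 1e6 / 1e9


# Same placeholder data rate for both cards, so only the bus width differs.
narrow = peak_bandwidth_gb_s(128, 1000.0)
wide = peak_bandwidth_gb_s(192, 1000.0)
ratio = wide / narrow
```

At an equal data rate the 192-bit cards have 1.5 times the peak bandwidth of the 128-bit card, whatever the actual clocks are. Any difference in memory type or clock would come on top of that.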

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 4:36 pm
by MtM
gaster wrote:
MtM wrote:
Xilikon wrote:Your guess is as good as mine, but I doubt that's really the case, since the 9800GTX+ also has problems (almost at the same rate as the 8600GT) and it should be using faster memory. Right now, there are not enough debug messages to pinpoint the exact cause.
Could still be memory related; speed is only one aspect, latency is another. I've been waiting to hear something about that. I had a discussion about the 9600GSOs: there is a 512MB version with GDDR2 which only gets <4K PPD, while the other variants all get over 5K easily. Same number of stream processors, and when clocked the same there is still that much difference, while everyone always said memory does not matter for folding.

Wet-finger work, I know; don't read too much into it.
The 512MB 9600GSO from Asus has 128-bit memory. All other 9600GSO cards that I have seen, 384MB or 768MB, have 192-bit memory.
That is a pretty significant difference and probably accounts for the PPD difference. But memory speed variances on cards with 192-bit memory do not make a big difference. They are all at least in the same league.
But it's still very weird to see such a difference when the memory isn't supposed to have an influence :?:

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 5:00 pm
by sdack
Xilikon wrote:Yes, I agree that the PG should get the 8600GT and the 9800GTX+ cards in the labs and work to get the core stable with those cards. I bet that when they succeed at this, everything else should work fine.
It is not about the cards, mind you.

What exact changes went into 1.15 is not known to us. However, I think Prof. Pande pointed out that it is all about getting larger proteins to fold.

If it turns out that some older cards cannot reliably fold larger proteins over several hours, then the consequence has to be to stop folding on these cards.

I think this needs to be discussed, because should it happen, many people will be disappointed, and the Pande Group cannot be expected to keep folding only small proteins.