
Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:05 am
by theo343
And this is understandable Mr. Pande. Thanks for your feedback :)

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:10 am
by jebo_4jc
Let's hope the QA process sees some vast improvements.
Having issues with 5 GPUs here.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:11 am
by theo343
And make it a rule to always re-QA a project on the latest forced core before you distribute it. Record which core the project was QA'd on so you know whether it has to be re-QA'd before release.
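
The rule above can be sketched as a small release gate. This is only an illustration, not how the project's QA tooling actually works: the `QARecord` type, the `needs_reqa` helper, and the "1.14" version string are all hypothetical, and it assumes core versions are plain dotted numbers like the 1.15 mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QARecord:
    """Hypothetical record of which core a project was last QA'd on."""
    project: int
    core_version: str


def parse_version(version: str) -> Tuple[int, ...]:
    # Assumes dotted numeric versions such as "1.15".
    return tuple(int(part) for part in version.split("."))


def needs_reqa(record: QARecord, current_forced_core: str) -> bool:
    # A project must be re-QA'd if it was last tested on an older core
    # than the one clients are currently forced to use.
    return parse_version(record.core_version) < parse_version(current_forced_core)


# Hypothetical example: project last QA'd on core 1.14, forced core is now 1.15.
record = QARecord(project=5801, core_version="1.14")
if needs_reqa(record, "1.15"):
    print(f"Project {record.project}: re-QA required before release")
```

Comparing parsed tuples rather than raw strings keeps a version like "1.9" from sorting after "1.15".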

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:36 am
by harlam357
theo343 wrote:And make it a rule to always re-QA a project on the latest forced core before you distribute it. Record which core the project was QA'd on so you know whether it has to be re-QA'd before release.
Precisely... nothing revolutionary... even if just a couple of WUs had been run, this problem would have been evident and halted before it ever became a problem.

Re: Project 5801 issues.

Posted: Wed Oct 29, 2008 1:39 am
by elrado1
VijayPande wrote:We keep an eye on the forum, but the first post was just a few hours ago. Due to staff having other responsibilities, our response will typically be on a timescale of hours, not minutes, for issues like this. I wish it could be faster, but that's what we're staffed to do at the moment.
Mr. Pande has the patience of a saint.

I do have this issue with two GPU (Nvidia) machines. After about 6 hours a project 5506 unit was finally sent out successfully. The UNSTABLE_MACHINE issue with the project 5801 unit persists. Any recommendations?

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 1:51 am
by mklvotep
I've got 15 Nvidia GPUs that I have to restart periodically to dump this WU. I'm ready for things to get back to normal (whatever that is).

Re: Project 5801 issues.

Posted: Wed Oct 29, 2008 2:15 am
by MoneyGuyBK
To V.P. aka Dr. Pande aka Vijay Pande :) ... much obliged sir, and God Bless.




Peace
VijayPande wrote:PS In case you're curious:
MoneyGuyBK wrote: I am surprised that:
1) F@H released this WU in such a bad state :!:
This was beta tested before (this was a project # change due to a move onto a new server -- which was done to try to keep work around while the CS servers were down).
However, more stumped that:
2) F@H has not chimed in here officially after 7 Pages of comments :(
We keep an eye on the forum, but the first post was just a few hours ago. Due to staff having other responsibilities, our response will typically be on a timescale of hours, not minutes, for issues like this. I wish it could be faster, but that's what we're staffed to do at the moment.

Re: Project 5801 issues.

Posted: Wed Oct 29, 2008 2:28 am
by shatteredsilicon
VijayPande wrote:PS In case you're curious:
MoneyGuyBK wrote: I am surprised that:
1) F@H released this WU in such a bad state :!:
This was beta tested before (this was a project # change due to a move onto a new server -- which was done to try to keep work around while the CS servers were down).
Two words: regression testing
This makes it all the more shocking just how broken the nVidia core 1.15 is.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 5:26 am
by Teddy
Well I come back home from work & see the p5801's have been pulled, switch on machines, flush all bad work units & we are up & running again.

I have NO server issues at the moment, all units have been returned safely to their servers & I see nothing but green ink in Fahspy.
Congratulations Vijay & co... I can rest easy for now that all my Linux SMP, Windows SMP, my standard clients, my ATI clients & especially my Nvidia clients are happy for now.

Cheers Teddy

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 9:31 am
by toTOW
VijayPande wrote:Sorry about the really nasty problem on this one. It was definitely strange since these WU's were QA'd before. I think this may be an issue where they were QA'd on an earlier core and 1.15 is causing issues.
Well ... I think you missed at least one of the QA steps ... :roll:

p5800 was fully tested through the whole QA process ... but not the p5801 :(

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 11:56 am
by theo343
The sad thing about this is that half of my GPU folders (3 of 7 cards in total) will be dead in the water for 24 hours or more, as I cannot reach them until tomorrow (too much chaos on the roads today, so I'm working from the home office).

Those 3 cards are also the most powerful. This P5801 thing was extremely bad timing for me, as I've been working my arse off with the clients the last couple of weeks to be competitive with a couple of guys on my team. I was just knifing and was ready to pass. I can now say goodbye to that, as my PPD statistic will plummet with only half my PPD for more than 24 hours, while the other guys have access to all their folding machines and have lost minimal PPD during these problems.

EDIT:
I also wonder how many Nvidia GPUs will lie dead in the water for 24 hours or more, in total, because of the P5801 distribution.

I truly hope the QA procedure will get some improvements after this blunder.

Re: Project 5801 issues. [Should be Offline]

Posted: Wed Oct 29, 2008 12:25 pm
by VijayPande
toTOW wrote:
VijayPande wrote:Sorry about the really nasty problem on this one. It was definitely strange since these WU's were QA'd before. I think this may be an issue where they were QA'd on an earlier core and 1.15 is causing issues.
Well ... I think you missed at least one of the QA steps ... :roll:

p5800 was fully tested through the whole QA process ... but not the p5801 :(
5801 was just a copy of another project, which did go all the way through QA. Nevertheless, I will have a talk with the responsible parties about this.

Re: Project 5801 issues.

Posted: Wed Oct 29, 2008 12:28 pm
by VijayPande
shatteredsilicon wrote: Two words: regression testing
This makes it all the more shocking just how broken the nVidia core 1.15 is.
1.15 passed all of the regression testing on machines at Stanford and NVIDIA and then passed FAH beta testing. There's not much more we can do than that before releasing it. Keep in mind that we now know that for many people (some boards), 1.15 is perfectly fine and stable, whereas for others, it doesn't work at all. If that's the case, my guess is that this is a CUDA or hardware issue. If the code in 1.15 were really broken, it would not work on any hardware, which is definitely not the case. We're working with NVIDIA on this one. The first step is to get the problem reproducible in their labs.

The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.

Re: Project 5801 issues.

Posted: Wed Oct 29, 2008 12:35 pm
by MtM
VijayPande wrote:The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.
So does this mean CUDA isn't compatible with all the hardware it is supposed to be compatible with, or does it mean that the clients' implementation of CUDA isn't compatible with all hardware? Or is it too soon to tell? I would hope it's the last option, as in the first case I'm afraid you won't be able to get it sorted as quickly :(

Re: Project 5801 issues.

Posted: Wed Oct 29, 2008 12:54 pm
by Xilikon
MtM wrote:
VijayPande wrote:The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.
So does this mean CUDA isn't compatible with all the hardware it is supposed to be compatible with, or does it mean that the clients' implementation of CUDA isn't compatible with all hardware? Or is it too soon to tell? I would hope it's the last option, as in the first case I'm afraid you won't be able to get it sorted as quickly :(
Technically, if the same code works on certain cards but not on others, we can look at the driver or hardware level. However, the core is partly responsible for this as well, so it takes work on both sides to find out what's wrong (NVIDIA with the CUDA code and PG with the core). This is what makes debugging this issue very hard.

Think of a car engine choking under load. There can be multiple causes, such as fuel quality, air quality, timing adjustment, ECU programming, or a mechanical problem, so it takes a lot of diagnostics to find out what went wrong.