Project 5801 issues. [Should be Offline]

Moderators: Site Moderators, FAHC Science Team

theo343
Posts: 74
Joined: Thu Jul 03, 2008 12:43 pm
Hardware configuration: Home Network:
ADSL 12Mbps - 807 / 14.439
USRobotics 8port 10/100/1000
All computers connected with TP CAT5

HC1:
E8400@4GHz(8*500)
4GB PC2-8000
9800GX2@stock
removed - (8800GTS-512MB@724/1810/972)
Mist 600W rev2
WinXP Pro 32bit SP3
FW180.43
1xGPU2 v6.20 R1 Core 1.18/1.19
1xSMP v6.23 Beta R1

HC2:
AM2+ x4 Phenom 9950@3GHz
2GB Crucial Ballistix PC2-5300
3x8800GS-384MB
Corsair TX 750W
WinXP Pro 32bit SP3
FW178.24
3xGPU2 v6.20 R1 Core 1.18
1xSMP v6.23 Beta R1

HC3: (not folding atm and outdated)
X2 6000+@Stock
HD3850OC-512MB@783MHz
2GB PC6400
PSU 420W
WinXP Pro 32bit SP3
CCC8.6
1*GPU2 v6.12 Beta8 Core 1.04
1*CPU 5.04


Office Network:
SDSL 20Mbps
100/1000
All computers connected with TP CAT5

OC1:
E6750@stock
8800GT-256MB@702/1755/900
4GB PC6400
Tagan 480W
Vista Ultimate 32bit SP1
FW 178.24
1xGPU2 v6.20 R1 Core 1.15
1xSMP v6.23 Beta R1


OC2:
E8400@stock
2x8800GT-512MB@stock
8GB PC3-8500
Corsair HX520
Vista Business 64bit
FW178.24
2xGPU2 v6.20 R1 Core 1.15
1xSMP v6.23 Beta R1
Location: Norway

Re: Project 5801 issues. [Should be Offline]

Post by theo343 »

And this is understandable Mr. Pande. Thanks for your feedback :)
jebo_4jc
Posts: 17
Joined: Wed Jun 18, 2008 7:09 pm

Re: Project 5801 issues. [Should be Offline]

Post by jebo_4jc »

Let's hope the QA process sees some vast improvements
Having issues with 5 GPUs here.
theo343
Posts: 74
Joined: Thu Jul 03, 2008 12:43 pm
Location: Norway

Re: Project 5801 issues. [Should be Offline]

Post by theo343 »

And make it a rule to always re-QA a Project on the latest forced core before you distribute it. Record which core the Project was QA'd on so you know whether you have to re-QA it before release.
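
The rule above could be sketched roughly as follows. This is only an illustration, not how the Pande Group's QA tooling actually works: the `QARecord` type, the `needs_requalification` helper, and the "1.10" core version are all hypothetical names and values made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QARecord:
    """Hypothetical record of which core a project passed QA on."""
    project: int
    qa_core: str  # core version string, e.g. "1.10" (made-up value)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn "1.15" into (1, 15) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def needs_requalification(record: QARecord, forced_core: str) -> bool:
    """Return True when the forced core is newer than the QA'd core."""
    return parse_version(forced_core) > parse_version(record.qa_core)


# Assumed scenario: the project was QA'd on an earlier core than the
# one currently forced on clients, so it should be re-QA'd first.
record = QARecord(project=5801, qa_core="1.10")
if needs_requalification(record, forced_core="1.15"):
    print(f"Project {record.project}: re-QA on the forced core before release")
```

Comparing parsed tuples rather than raw strings avoids "1.9" sorting after "1.15".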
harlam357
Posts: 222
Joined: Fri Jun 27, 2008 11:03 pm
Location: Alabama - USA
Contact:

Re: Project 5801 issues. [Should be Offline]

Post by harlam357 »

theo343 wrote:And make it a rule to always re-QA a Project on the latest forced core before you distribute it. Record which core the Project was QA'd on so you know whether you have to re-QA it before release.
Precisely... nothing revolutionary... even if just a couple WUs were run, this problem would have been evident and halted before it ever became a problem.
elrado1
Posts: 3
Joined: Tue Sep 09, 2008 2:11 am

Re: Project 5801 issues.

Post by elrado1 »

VijayPande wrote:We keep an eye on the forum, but the first post was just a few hours ago. Due to staff having other responsibilities, our response will typically be on the hours time scale not minutes time scales for issues like this. I wish it could be faster, but that's what we're staffed to do at the moment.
Mr. Pande has the patience of a saint.

I do have this issue with two GPU (Nvidia) machines. After about 6 hours a project 5506 unit was finally sent out successfully. The UNSTABLE_MACHINE issue with the project 5801 unit persists. Any recommendations?
mklvotep
Posts: 8
Joined: Sun Oct 19, 2008 8:35 pm

Re: Project 5801 issues. [Should be Offline]

Post by mklvotep »

I've got 15 Nvidia GPUs that I have to restart periodically to dump this WU. I'm ready for things to get back to normal (whatever that is).
MoneyGuyBK
Posts: 179
Joined: Sun Dec 02, 2007 6:40 am
Location: Team_XPS ..... OC, S. Calif

Re: Project 5801 issues.

Post by MoneyGuyBK »

To V.P. aka Dr. Pande aka Vijay Pande :) ... much obliged sir, and God Bless.




Peace
VijayPande wrote:PS In case you're curious:
MoneyGuyBK wrote: I am surprised that:
1) F@H released this WU in such a bad state :!:
This was beta tested before (this was a project # change due to a move onto a new server -- which was done to try to keep work around while the CS servers were down).
However, more stumped that:
2) F@H has not chimed in here officially after 7 Pages of comments :(
We keep an eye on the forum, but the first post was just a few hours ago. Due to staff having other responsibilities, our response will typically be on the hours time scale not minutes time scales for issues like this. I wish it could be faster, but that's what we're staffed to do at the moment.
T.E.A.M. “Together Everyone Accomplishes Miracles!”
OC, S. California ... God Bless All
shatteredsilicon
Posts: 87
Joined: Tue Jul 08, 2008 2:27 pm
Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers

Re: Project 5801 issues.

Post by shatteredsilicon »

VijayPande wrote:PS In case you're curious:
MoneyGuyBK wrote: I am surprised that:
1) F@H released this WU in such a bad state :!:
This was beta tested before (this was a project # change due to a move onto a new server -- which was done to try to keep work around while the CS servers were down).
Two words: regression testing
This makes it all the more shocking just how broken the nVidia core 1.15 is.
Teddy
Posts: 134
Joined: Tue Feb 12, 2008 3:05 am
Location: Canberra, Australia
Contact:

Re: Project 5801 issues. [Should be Offline]

Post by Teddy »

Well I come back home from work & see the p5801's have been pulled, switch on machines, flush all bad work units & we are up & running again.

I have NO server issues at the moment, all units have been returned safely to their servers & I see nothing but green ink in Fahspy.
Congratulations Vijay & co... I can rest easy for now that all my Linux SMP, Windows SMP, my standard clients, my ATI clients & especially my Nvidia clients are happy for now.

Cheers Teddy
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: Project 5801 issues. [Should be Offline]

Post by toTOW »

VijayPande wrote:Sorry about the really nasty problem on this one. It was definitely strange since these WU's were QA'd before. I think this may be an issue where they were QA'd on an earlier core and 1.15 is causing issues.
Well ... I think you missed at least one of the QA steps ... :roll:

p5800 was fully tested through the whole QA process ... but not the p5801 :(

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
theo343
Posts: 74
Joined: Thu Jul 03, 2008 12:43 pm
Location: Norway

Re: Project 5801 issues. [Should be Offline]

Post by theo343 »

The sad thing about this is that half of my GPU folders (3 of 7 cards in total) will be dead in the water for 24 hours or more, as I cannot reach them until tomorrow (too much chaos on the roads today, so I'm working from the home office).

Those 3 cards are also the most powerful. This P5801 thing was extremely bad timing for me, as I've been working my arse off with the clients the last couple of weeks to be competitive with a couple of guys on my team. I was just knifing and was ready to pass. I can now say goodbye to that aspect, as my PPD statistic will plummet with only half my PPD for more than 24 hours, while the other guys have access to all their folding machines and have lost minimal PPD during these problems.

EDIT:
I also wonder how many Nvidia GPUs will lie dead in the water for 24 hours or more, in total, because of the P5801 distribution.

I truly hope the QA procedure will get some improvements after this blunder.
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Project 5801 issues. [Should be Offline]

Post by VijayPande »

toTOW wrote:
VijayPande wrote:Sorry about the really nasty problem on this one. It was definitely strange since these WU's were QA'd before. I think this may be an issue where they were QA'd on an earlier core and 1.15 is causing issues.
Well ... I think you missed at least one of the QA steps ... :roll:

p5800 was fully tested through the whole QA process ... but not the p5801 :(
5801 was just a copy of another project, which did go all the way through QA. Nevertheless, I will have a talk with the responsible parties about this.
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Project 5801 issues.

Post by VijayPande »

shatteredsilicon wrote: Two words: regression testing
This makes it all the more shocking just how broken the nVidia core 1.15 is.
1.15 passed all of the regression testing on machines at Stanford and NVIDIA and then passed FAH beta testing. There's not much more we can do than that before releasing it. Keep in mind that we now know that for many people (some boards), 1.15 is perfectly fine and stable, whereas for others, it doesn't work at all. If that's the case, my guess is that this is a CUDA or hardware issue. If the code in 1.15 were really broken, it would not work on any hardware, which is definitely not the case. We're working with NVIDIA on this one. The first step is to get the problem reproducible in their labs.

The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands
Contact:

Re: Project 5801 issues.

Post by MtM »

VijayPande wrote:The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.
So does this mean CUDA isn't compatible with all the hardware that is supposed to support it, or does it point to the clients' implementation of CUDA not being compatible with all hardware? Or is it too soon to tell? I would hope it's the last option, as in the first case I'm afraid you don't have the same expedience in getting it sorted :(
Xilikon
Posts: 155
Joined: Sun Dec 02, 2007 1:34 pm

Re: Project 5801 issues.

Post by Xilikon »

MtM wrote:
VijayPande wrote:The bottom line here is that it is becoming clear that what works on some CUDA hardware platforms does not universally work on all. We have since gotten a few of the boards that cause problems and have included them in our recent testing.
So does this mean CUDA isn't compatible with all the hardware that is supposed to support it, or does it point to the clients' implementation of CUDA not being compatible with all hardware? Or is it too soon to tell? I would hope it's the last option, as in the first case I'm afraid you don't have the same expedience in getting it sorted :(
Technically, if the same code works on certain cards but not on others, we can look at the driver or hardware level. However, the core is partly responsible as well, so it takes work on both sides to find out what's wrong (NVIDIA with the CUDA code and PG with the core). This is what makes debugging this issue very hard.

Think of a car engine choking under load. There can be many causes: fuel quality, air quality, timing adjustment, ECU programming, a mechanical problem, or something else, so it takes a lot of diagnostics to find out what went wrong.
Locked