Every 353 Point WU fails
Posted: Mon May 04, 2009 2:40 am
by RickC_
I have been folding on my GTX 285's for almost 2 months. I have never had a NAN. Now, suddenly, every single 353 point WU on either GPU fails immediately. It does not even start to fold the WU. It just quits out with a NAN within a second of downloading it. 384, 472, 511, 1888, etc., WU's all fold just fine. I don't know if I have received a 768 point WU since this started, so I cannot comment on them. I am sure I have folded dozens, if not hundreds, of 353 point WU's between February and a few days ago.
Underclocking the GPUs makes no difference. Rebooting makes no difference. Driver cleaning and reinstalling drivers makes no difference. Driver cleaning and installing different drivers makes no difference. Uninstalling Folding@Home and reinstalling it makes no difference. Temps are in the 50's and 60's which seems fine, and other WU's that run hotter than the 353 point ones are fine. Also, temps are even lower than that, probably in the 40's when this happens, because it happens before it starts to fold.
I guess both of my cards could have gone bad at the exact same time, but it seems unlikely. Especially since it is only for the 353 point WU's. Anyone ever run into something similar? Any ideas?
Thanks,
Rick (User = RickC, Team = 111065)
Re: Every 353 Point WU fails
Posted: Mon May 04, 2009 2:44 am
by 7im
Hello RickC, welcome to the forum.
We don't typically refer to work units by their point values. What are the Project numbers for these work units?
Have you tried the MemtestG80 GPU memory tester program to see if your GTX is running well?
http://folding.typepad.com/news/2009/04 ... ecker.html
And to help diagnose the problem, please post more info about your setup. OS, client version, driver version, fahcore version, etc.
Re: Every 353 Point WU fails
Posted: Mon May 04, 2009 3:10 am
by RickC_
Hi 7im,
I am downloading the memory checker now. Seems odd that my memory would be so bad it cannot pass a fraction of a second of the self-test, but can fold to 100% anything it actually starts. We'll see, though.
My system is:
CPU = Core i7 920
Motherboard = EVGA X58
RAM = Patriot 6GB DDR3
GPUs = 2x EVGA GTX 285
OS = Windows Vista Business 64-bit
ForceWare Drivers = 182.08 & 185.81
182.08 were working fine for months. I tried reinstalling them without luck. I then tried the 185.81 beta drivers, because they had some kind of CUDA update that I thought might be helpful. Still no good.
I would refer to Project Numbers, but there seems to be an endless number of them. Here are a bunch. This is not a complete list.
Project: 5765 (Run 1, Clone 56, Gen 289)
Project: 5765 (Run 7, Clone 363, Gen 62)
Project: 5765 (Run 12, Clone 4, Gen 22)
Project: 5766 (Run 7, Clone 130, Gen 22)
Project: 5767 (Run 1, Clone 247, Gen 19)
Project: 5767 (Run 4, Clone 242, Gen 25)
Project: 5767 (Run 13, Clone 75, Gen 16)
Project: 5768 (Run 9, Clone 6, Gen 36)
Project: 5769 (Run 7, Clone 265, Gen 23)
Project: 5770 (Run 9, Clone 223, Gen 22)
Project: 5770 (Run 12, Clone 254, Gen 35)
Project: 5771 (Run 4, Clone 101, Gen 36)
Project: 5771 (Run 9, Clone 157, Gen 353)
Project: 5772 (Run 2, Clone 294, Gen 20)
Project: 5772 (Run 4, Clone 370, Gen 39)
Project: 5772 (Run 5, Clone 344, Gen 52)
Project: 5772 (Run 14, Clone 247, Gen 48)
Thanks,
Rick
Re: Every 353 Point WU fails
Posted: Mon May 04, 2009 3:51 am
by bruce
Starting from the top of your list, the following were successfully completed by someone else:
Project: 5765 (Run 1, Clone 56, Gen 289)
Project: 5765 (Run 12, Clone 4, Gen 22)
Project: 5767 (Run 1, Clone 247, Gen 19)
Project: 5767 (Run 4, Clone 242, Gen 25)
Project: 5768 (Run 9, Clone 6, Gen 36)
Project: 5770 (Run 12, Clone 254, Gen 35)
Project: 5771 (Run 4, Clone 101, Gen 36)
Project: 5771 (Run 9, Clone 157, Gen 353)
Project: 5772 (Run 2, Clone 294, Gen 20)
Project: 5772 (Run 4, Clone 370, Gen 39)
Project: 5772 (Run 5, Clone 344, Gen 52)
Project: 5772 (Run 14, Clone 247, Gen 48)
The following have not been returned by anyone yet:
Project: 5765 (Run 7, Clone 363, Gen 62)
Project: 5767 (Run 13, Clone 75, Gen 16)
Project: 5769 (Run 7, Clone 265, Gen 23)
Project: 5770 (Run 9, Clone 223, Gen 22)
The following was returned once with an error but not yet completed:
Project: 5766 (Run 7, Clone 130, Gen 22)
We will probably have reports on the last 5 within a day or so, but the 12 that were completed without error by somebody else make a pretty clear case that your hardware is not functioning correctly. The usual suspects are overheating, overclocking, poor 12V power, failing GPU hardware, etc. (Or it could be software if you've changed drivers or something else like that, but you probably would have mentioned that.)
Overheating, overclocking, and poor power are also not likely to have changed without you knowing about it, though it might be worth blowing the dust out of your filters/fans/heatsinks to see if that helps. If you've already eliminated all of the other options, that may leave the one possibility that your GPU or VRAM has died.
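For anyone who wants to repeat this kind of cross-check, here is a minimal Python sketch that tallies the three lists exactly as posted above. The status labels and the `tally` helper are made up for illustration; they are not part of any Folding@Home tool.

```python
from collections import Counter

# Status of each work unit, transcribed from the three lists in this post.
# Keys are (project, run, clone, gen). The status strings are illustrative
# labels, not official server terminology.
statuses = {
    (5765, 1, 56, 289): "completed",
    (5765, 12, 4, 22): "completed",
    (5767, 1, 247, 19): "completed",
    (5767, 4, 242, 25): "completed",
    (5768, 9, 6, 36): "completed",
    (5770, 12, 254, 35): "completed",
    (5771, 4, 101, 36): "completed",
    (5771, 9, 157, 353): "completed",
    (5772, 2, 294, 20): "completed",
    (5772, 4, 370, 39): "completed",
    (5772, 5, 344, 52): "completed",
    (5772, 14, 247, 48): "completed",
    (5765, 7, 363, 62): "not_returned",
    (5767, 13, 75, 16): "not_returned",
    (5769, 7, 265, 23): "not_returned",
    (5770, 9, 223, 22): "not_returned",
    (5766, 7, 130, 22): "error",
}


def tally(statuses):
    """Count how many work units fall under each status label."""
    return Counter(statuses.values())


counts = tally(statuses)
for status, n in sorted(counts.items()):
    print(f"{status}: {n}")
```

Counting the lists this way gives 12 completed by someone else, 4 not yet returned, and 1 returned with an error, for 17 work units in total.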
Re: Every 353 Point WU fails
Posted: Mon May 04, 2009 12:00 pm
by toTOW
I don't know if there's a link here, but there's a similar report in the NV forum: viewtopic.php?f=52&t=9822
Re: Every 353 Point WU fails
Posted: Mon May 04, 2009 12:11 pm
by kiore
Interesting, my cards are loving them. The 5865 PPD (current one is project 5770) on my 9800GTs more than makes up for the super-slow project 575X 511-pointers, which pull me down to 3560 PPD (currently running project 5749).
The only run of EUEs I have had on them was yesterday, when I was fiddling with the settings of an inactive card and accidentally crashed my active ones. That was definitely an overheating/overclocking issue.
Cards running at 1800 shaders and temps of <60C is no problem on my system. They don't seem to like higher clocks and/or temperatures.
kiore.
Addit: Project 5770 (Run 11, Clone 361, Gen 60) just finished successfully in an hour and 27 minutes on the above config.
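As a rough sanity check on the PPD figures, here is a small Python sketch of the arithmetic. It assumes Project 5770 is one of the 353-point WUs (implied by this thread, but not stated outright) and that WUs run back to back with no gap between them; `points_per_day` is just an illustrative helper name.

```python
def points_per_day(points, minutes_per_wu):
    """Points per day if WUs of this size complete back to back."""
    minutes_per_day = 24 * 60
    return points * minutes_per_day / minutes_per_wu


# Assumption: Project 5770 is worth 353 points (inferred from the thread).
# 1 hour 27 minutes = 87 minutes per WU.
ppd = points_per_day(353, 87)
print(f"{ppd:.0f} PPD")
```

That works out to a little over 5840 PPD, which is in line with the 5865 PPD quoted above.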
Re: Every 353 Point WU fails
Posted: Mon May 04, 2009 4:31 pm
by ChrissyT88
I posted the other thread toTOW linked to. I'm interested in the power supply comments made above by bruce. The PC in question is running a Q6600, a couple of HDDs and optical drives, 4 or 5 fans, and a GTX280 from a BeQuiet! 650 watt power supply. Using OCCT, the 12V line never drops below 11.88V (often switching between 11.93V and 11.88V), which is a pretty small deviation from 12V and within spec. Last time I checked using a wall plug power monitor, the whole PC was pulling about 340 watts (I think, although how far you can trust a £10 monitor I don't know!). As far as I can see, this is well within spec, but obviously I would like to know if a more powerful PSU is needed to make F@H more stable, particularly as I am contemplating the move to i7.
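To put numbers on "within spec", here is a minimal Python sketch. It assumes the commonly cited ATX tolerance of plus or minus 5% on the +12V rail (11.40V to 12.60V). The function names are illustrative. Software voltage readings can be imprecise, and a wall meter reads AC input rather than DC output, so treat the results as a rough guide only.

```python
def rail_deviation_pct(measured, nominal=12.0):
    """Signed deviation of a rail reading from nominal, in percent."""
    return (measured - nominal) / nominal * 100.0


def within_tolerance(measured, nominal=12.0, tolerance_pct=5.0):
    """True if the reading is within +/- tolerance_pct of nominal.

    The 5% default is the assumed ATX tolerance for the +12V rail.
    """
    return abs(rail_deviation_pct(measured, nominal)) <= tolerance_pct


# Readings reported via OCCT in the post above.
for volts in (11.88, 11.93):
    print(f"{volts:.2f} V: {rail_deviation_pct(volts):+.2f}% "
          f"(within tolerance: {within_tolerance(volts)})")

# Wall draw as a fraction of the PSU's rated output. This overstates the
# real DC load somewhat, because wall draw includes PSU conversion losses.
load_fraction = 340 / 650
print(f"Approximate load: {load_fraction:.0%} of rated capacity")
```

Both readings sit about 1% or less below nominal, and the wall draw is roughly half the rated capacity, so these particular numbers do not point at an overloaded PSU.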
Re: Every 353 Point WU fails
Posted: Mon May 04, 2009 4:35 pm
by toTOW
An easy PSU test is to unplug anything you don't actually need (extra HDDs and optical drives), leaving only the system HDD and a couple of fans (or no fans if you test with the case open, which is even better), and to remove all overclocks. Then, without starting anything CPU intensive, try to fold on your GPU.
If it doesn't fail anymore, the PSU might be the cause of the issue (try to add load progressively to see if the issue comes back).
Re: Every 353 Point WU fails
Posted: Mon May 04, 2009 5:44 pm
by ChrissyT88
Thanks. I'll try to catch it on a failing WU and save the files elsewhere for future testing.
Re: Every 353 Point WU fails
Posted: Tue May 05, 2009 12:18 am
by RickC_
I ran 800MB VRAM size & 500 iterations of MemtestG80 on both GPU's simultaneously. This took several hours and there were zero errors.
If all I run is a single instance of the GPU client and it catches a 353 point WU, it fails the self-test immediately despite temps in the mid-40's (it will push 60C if it actually starts folding, but it never starts for those WU's). However, as soon as I download any other point WU, I can run two GPU clients + two VMWare CPU clients (4 threads total) simultaneously with no problems. Therefore, I do not think it is a temperature or overloading problem.
I have a Corsair 1000W PSU that is certified for dual GTX 285. Software monitoring shows a +12V of 12.04V at idle and 11.96V under load. Those are the same numbers that have always shown for the +12V rail. I really do not think this is a PSU issue. Flaky PSUs tend to result in sporadic, unpredictable behavior. This is so consistent and so repeatable that it feels like a software or configuration issue. I just have no idea what else to try in an attempt to figure out what is going on.
Is there any way to get the self-test as a stand alone item? I would like to do some more controlled testing without having to wait around for a 353 point WU or unnecessarily throwing away WU's.
-Rick
Re: Every 353 Point WU fails
Posted: Tue May 05, 2009 2:02 am
by RickC_
Also, what are the feelings about restarting the client after these errors occur? I typically restart the client and will often get a 511 point, 384 point, 472 point, 1888 point, 768 point, etc., that folds to 100% just fine. Is that OK? Or should we really stop folding for 24 hours? I can typically fold dozens of WU's successfully a day, even with the occasional 353 point WU that immediately fails. I hate to just sit idle and not fold, but if restarting after these problems will somehow mess up the WU assignments on Stanford's side, I obviously do not want to do that.
Thanks,
Rick
Re: Every 353 Point WU fails
Posted: Tue May 05, 2009 5:05 am
by kiore
RickC_, the 24 hr stop is there to keep a problem from endlessly repeating. If you have fixed the issue, it should be fine to restart.
kiore.
Re: Every 353 Point WU fails
Posted: Thu May 21, 2009 5:33 pm
by PeddlerOfFlesh
I've been having the same problem for a couple of months too. I have an 8800GT and a 9800GT in the same computer that just fail instantly. All memory tests come out OK, and uninstalling, reinstalling, trying different drivers, etc. doesn't fix it.
Re: Every 353 Point WU fails
Posted: Sat May 23, 2009 6:53 pm
by Escher
A lot of people have been having trouble with the 576*/577* projects and have been for quite some time. I used to post every time I'd get an UNSTABLE_MACHINE error code immediately upon running one of these WUs on my 8800GT. Yes, some people do seem to be able to run them successfully, and unfortunately some people on this board point to that as an indicator that the rest of us have faulty hardware, even though every other test/application in the world says otherwise. Maybe those people are able to run them because they're using a different client or aren't using nVidia hardware. I seem to recall a writeup someone did that showed that an earlier version of the client didn't have this issue and it's only been the more recent F@H clients that have been failing. Maybe someone can look into that.
I really wish the buck would stop getting passed around and we could actually work toward finding a real solution. I'm so very tired of constantly restarting this supposedly beneficial software.
Re: Every 353 Point WU fails
Posted: Sun May 24, 2009 4:51 pm
by PeddlerOfFlesh
And after about 2 months of not being able to fold a single 353 point WU, they started working again. Right after I complained, too. Wish I knew what it was that fixed it.