Every 353 Point WU fails

RickC_
Posts: 5
Joined: Mon May 04, 2009 2:19 am

Every 353 Point WU fails

Post by RickC_ »

I have been folding on my GTX 285's for almost 2 months. I have never had a NAN. Now, suddenly, every single 353 point WU on either GPU fails immediately. It does not even start to fold the WU. It just quits out with a NAN within a second of downloading it. 384, 472, 511, 1888, etc., WU's all fold just fine. I don't know if I have received a 768 point WU since this started, so I cannot comment on them. I am sure I have folded dozens, if not hundreds, of 353 point WU's between February and a few days ago.

Underclocking the GPUs makes no difference. Rebooting makes no difference. Driver cleaning and reinstalling drivers makes no difference. Driver cleaning and installing different drivers makes no difference. Uninstalling Folding@Home and reinstalling it makes no difference. Temps are in the 50's and 60's which seems fine, and other WU's that run hotter than the 353 point ones are fine. Also, temps are even lower than that, probably in the 40's when this happens, because it happens before it starts to fold.

I guess both of my cards could have gone bad at the exact same time, but it seems unlikely. Especially since it is only for the 353 point WU's. Anyone ever run into something similar? Any ideas?

Thanks,
Rick (User = RickC, Team = 111065)
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Every 353 Point WU fails

Post by 7im »

Hello RickC, welcome to the forum.

We don't typically refer to work units by their point values. What are the Project numbers for these work units?

Have you tried the MemtestG80 GPU memory tester program to see if your GTX is running well? http://folding.typepad.com/news/2009/04 ... ecker.html

And to help diagnose the problem, please post more info about your setup. OS, client version, driver version, fahcore version, etc.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
RickC_
Posts: 5
Joined: Mon May 04, 2009 2:19 am

Re: Every 353 Point WU fails

Post by RickC_ »

Hi 7im,

I am downloading the memory checker now. Seems odd that my memory would be so bad it cannot pass a fraction of a second of the self-test, but can fold to 100% anything it actually starts. We'll see, though.

My system is:
CPU = Core i7 920
Motherboard = EVGA X58
RAM = Patriot 6GB DDR3
GPUs = 2x EVGA GTX 285
OS = Windows Vista Business 64-bit
ForceWare Drivers = 182.08 & 185.81

182.08 were working fine for months. I tried reinstalling them without luck. I then tried the 185.81 beta drivers, because they had some kind of CUDA update that I thought might be helpful. Still no good.

I would refer to Project Numbers, but there is a seemingly infinite number of them. Here are a bunch. This is not a complete list.

Project: 5765 (Run 1, Clone 56, Gen 289)
Project: 5765 (Run 7, Clone 363, Gen 62)
Project: 5765 (Run 12, Clone 4, Gen 22)
Project: 5766 (Run 7, Clone 130, Gen 22)
Project: 5767 (Run 1, Clone 247, Gen 19)
Project: 5767 (Run 4, Clone 242, Gen 25)
Project: 5767 (Run 13, Clone 75, Gen 16)
Project: 5768 (Run 9, Clone 6, Gen 36)
Project: 5769 (Run 7, Clone 265, Gen 23)
Project: 5770 (Run 9, Clone 223, Gen 22)
Project: 5770 (Run 12, Clone 254, Gen 35)
Project: 5771 (Run 4, Clone 101, Gen 36)
Project: 5771 (Run 9, Clone 157, Gen 353)
Project: 5772 (Run 2, Clone 294, Gen 20)
Project: 5772 (Run 4, Clone 370, Gen 39)
Project: 5772 (Run 5, Clone 344, Gen 52)
Project: 5772 (Run 14, Clone 247, Gen 48)

Thanks,
Rick
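
For anyone collecting failed WUs in the same way, here is a minimal sketch of grouping them by project number. It assumes the lines have been pasted into a string in the exact "Project: N (Run R, Clone C, Gen G)" form used above; the function names are hypothetical and are not part of any Folding@Home tool.

```python
import re
from collections import Counter

# Matches lines such as "Project: 5765 (Run 1, Clone 56, Gen 289)".
WU_PATTERN = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)


def parse_work_units(text):
    """Return a list of (project, run, clone, gen) integer tuples found in text."""
    return [tuple(int(g) for g in m.groups()) for m in WU_PATTERN.finditer(text)]


def failures_per_project(text):
    """Count how many of the listed work units belong to each project number."""
    return Counter(project for project, _run, _clone, _gen in parse_work_units(text))


# A short excerpt of the list above, used only as sample input.
sample = """
Project: 5765 (Run 1, Clone 56, Gen 289)
Project: 5765 (Run 7, Clone 363, Gen 62)
Project: 5772 (Run 14, Clone 247, Gen 48)
"""

for project, count in sorted(failures_per_project(sample).items()):
    print(f"Project {project}: {count} failed WU(s)")
```

Grouping by project makes it easier to see that the failures cluster in a narrow range of project numbers rather than being spread across everything the client downloads.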
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Every 353 Point WU fails

Post by bruce »

Starting from the top of your list, the following were successfully completed by someone else:
Project: 5765 (Run 1, Clone 56, Gen 289)
Project: 5765 (Run 12, Clone 4, Gen 22)
Project: 5767 (Run 1, Clone 247, Gen 19)
Project: 5767 (Run 4, Clone 242, Gen 25)
Project: 5768 (Run 9, Clone 6, Gen 36)
Project: 5770 (Run 12, Clone 254, Gen 35)
Project: 5771 (Run 4, Clone 101, Gen 36)
Project: 5771 (Run 9, Clone 157, Gen 353)
Project: 5772 (Run 2, Clone 294, Gen 20)
Project: 5772 (Run 4, Clone 370, Gen 39)
Project: 5772 (Run 5, Clone 344, Gen 52)
Project: 5772 (Run 14, Clone 247, Gen 48)


The following have not been returned by anyone yet:
Project: 5765 (Run 7, Clone 363, Gen 62)
Project: 5767 (Run 13, Clone 75, Gen 16)
Project: 5769 (Run 7, Clone 265, Gen 23)
Project: 5770 (Run 9, Clone 223, Gen 22)

The following was returned once with an error but not yet completed:
Project: 5766 (Run 7, Clone 130, Gen 22)

We will probably have reports from the last 5 within a day or so, but the 12 which were completed without error by somebody else make a pretty clear case that your hardware is not functioning correctly. The usual suspects are overheating/overclocking/poor 12v power/failing GPU hardware/etc. (Or it could be software if you've changed drivers or something else like that, but you probably would have mentioned that.)

Overheating/overclocking/poor power are also not likely to have changed without you knowing about it, though it might be worth blowing the dust out of your filters/fans/heatsinks to see if that helps. If you've already eliminated all of the other options, that may leave the one possibility that your GPU or VRAM has died.
toTOW
Site Moderator
Posts: 6334
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Every 353 Point WU fails

Post by toTOW »

I don't know if there's a link here, but there's a similar report in the NV forum : viewtopic.php?f=52&t=9822 :(

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
kiore
Posts: 920
Joined: Fri Jan 16, 2009 5:45 pm
Location: USA

Re: Every 353 Point WU fails

Post by kiore »

Interesting. My cards are loving them: 5865 PPD (currently on Project 5770) on my 9800GT's more than makes up for the super slow Project 575X 511-pointers, which pull me down to 3560 PPD (currently running Project 5749).
The only run of EUEs I have had on them was yesterday, when I was fiddling with the settings of an inactive card and accidentally crashed my active ones. That was definitely an overheating/overclocking issue.
Cards running at 1800 shaders and temps of <60C are no problem on my system. They don't seem to like higher clocks and/or temperatures.
kiore.

Addendum: Project 5770 (Run 11, Clone 361, Gen 60) just finished successfully in an hour and 27 minutes on the above config.
i7 7800x RTX 3070 OS= win10. AMD 3700x RTX 2080ti OS= win10 .

Team page: https://www.rationalskepticism.org/viewtopic.php?t=616
ChrissyT88
Posts: 9
Joined: Mon Nov 17, 2008 11:15 am

Re: Every 353 Point WU fails

Post by ChrissyT88 »

I posted the other thread toTOW linked to. I'm interested in the power supply comments made above by bruce. The PC in question is running a Q6600, a couple of HDDs and optical drives, 4 or 5 fans, and a GTX280 from a BeQuiet! 650 watt power supply. Using OCCT, the 12V line never drops below 11.88V (often switching between 11.93V and 11.88V), which is a pretty small deviation from 12V and within spec. Last time I checked using a wall plug power monitor, the whole PC was pulling about 340 watts (I think, although how far you can trust a £10 monitor I don't know!). As far as I can see, this is well within spec, but obviously I would like to know if a more powerful PSU is needed to make F@H more stable, particularly as I am contemplating the move to i7.
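
The figures in this post can be sanity-checked with a little arithmetic. The sketch below assumes the usual ATX tolerance of +/-5% on the +12V rail (11.40V to 12.60V) and treats the wall reading as an upper bound on the load the PSU actually delivers, since a wall meter also counts conversion losses. The function names are hypothetical, and software voltage readings can be inaccurate, so this is only a rough check.

```python
NOMINAL_12V = 12.0
TOLERANCE = 0.05  # assumed +/-5% tolerance on the +12 V rail


def rail_deviation_pct(measured, nominal=NOMINAL_12V):
    """Signed deviation of a measured rail voltage from nominal, in percent."""
    return (measured - nominal) / nominal * 100.0


def within_tolerance(measured, nominal=NOMINAL_12V, tolerance=TOLERANCE):
    """True if the measured voltage lies inside nominal +/- tolerance."""
    return abs(measured - nominal) <= nominal * tolerance


def load_fraction(wall_watts, psu_rated_watts):
    """Wall draw as a fraction of the PSU rating (an upper bound on DC load)."""
    return wall_watts / psu_rated_watts


# Values quoted in the post: 11.88 V and 11.93 V under load, 340 W on a 650 W PSU.
for volts in (11.88, 11.93):
    print(f"{volts:.2f} V: {rail_deviation_pct(volts):+.2f}% "
          f"(within tolerance: {within_tolerance(volts)})")
print(f"Load: {load_fraction(340, 650):.0%} of rated capacity")
```

On these numbers the rail sags by about 1% and the draw is roughly half of the rated capacity, so steady-state figures alone do not point at the PSU. Short transient dips would not show up in readings like these.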
toTOW
Site Moderator
Posts: 6334
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Re: Every 353 Point WU fails

Post by toTOW »

An easy PSU test is to unplug anything you don't actually need (HDDs and optical drives), leaving only the system HDD and a couple of fans (or no fans if you test with the case open, which is even better), and to remove all overclocking. Then, do not start anything CPU intensive, and try to fold on your GPU.

If it doesn't fail anymore, the PSU might be the cause of the issue (try to add load progressively to see if the issue comes back).
ChrissyT88
Posts: 9
Joined: Mon Nov 17, 2008 11:15 am

Re: Every 353 Point WU fails

Post by ChrissyT88 »

Thanks. I'll try to catch it on a failing WU and save the files elsewhere for future testing.
RickC_
Posts: 5
Joined: Mon May 04, 2009 2:19 am

Re: Every 353 Point WU fails

Post by RickC_ »

I ran 800MB VRAM size & 500 iterations of MemtestG80 on both GPU's simultaneously. This took several hours and there were zero errors.

If all I run is a single instance of the GPU client and it catches a 353 point WU, it fails the self-test immediately despite temps in the mid-40's (it will push 60C if it actually starts folding, but it never starts for those WU's). However, as soon as I download any other point WU, I can run two GPU clients + two VMWare CPU clients (4 threads total) simultaneously with no problems. Therefore, I do not think it is a temperature or overloading problem.

I have a Corsair 1000W PSU that is certified for dual GTX 285. Software monitoring shows a +12V of 12.04v at idle and 11.96v under load. Those are the same numbers that have always shown for the +12V rail. I really do not think this is a PSU issue. Flaky PSU's tend to result in sporadic, unpredictable behavior. This is so consistent and so repeatable, that it feels like a software or configuration issue. I just have no idea what else to try in an attempt to figure out what is going on.

Is there any way to get the self-test as a stand alone item? I would like to do some more controlled testing without having to wait around for a 353 point WU or unnecessarily throwing away WU's.

-Rick
RickC_
Posts: 5
Joined: Mon May 04, 2009 2:19 am

Re: Every 353 Point WU fails

Post by RickC_ »

Also, what are the feelings about restarting the client after these errors occur? I typically restart the client and will often get a 511 point, 384 point, 472 point, 1888 point, 768 point, etc., that folds to 100% just fine. Is that OK? Or should we really stop folding for 24 hours? I can typically fold dozens of WU's successfully a day, even with the occasional 353 point WU that immediately fails. I hate to just sit idle and not fold, but if restarting after these problems will somehow mess up the WU assignments on Stanford's side, I obviously do not want to do that.

Thanks,
Rick
kiore
Posts: 920
Joined: Fri Jan 16, 2009 5:45 pm
Location: USA

Re: Every 353 Point WU fails

Post by kiore »

RickC_, the 24 hr stop is there to keep a problem from endlessly repeating. If you have fixed the issue, it should be fine to restart.
kiore.
PeddlerOfFlesh
Posts: 8
Joined: Sat Jan 10, 2009 9:54 am

Re: Every 353 Point WU fails

Post by PeddlerOfFlesh »

I've been having the same problem for a couple of months too. I have an 8800GT and a 9800GT in the same computer that just fail instantly. All memory tests come out OK, and uninstalling, reinstalling, trying different drivers, etc. doesn't fix it. :(
Escher
Posts: 26
Joined: Thu Aug 07, 2008 2:22 am

Re: Every 353 Point WU fails

Post by Escher »

A lot of people have been having trouble with the 576*/577* projects and have been for quite some time. I used to post every time I'd get an UNSTABLE_MACHINE error code immediately upon running one of these WUs on my 8800GT. Yes, some people do seem to be able to run them successfully, and unfortunately some people on this board point to that as an indicator that the rest of us have faulty hardware, even though every other test/application in the world says otherwise. Maybe those people are able to run them because they're using a different client or aren't using nVidia hardware. I seem to recall a writeup someone did that showed that an earlier version of the client didn't have this issue and it's only been the more recent F@H clients that have been failing. Maybe someone can look into that.

I really wish the buck would stop getting passed around and we could actually work toward finding a real solution. I'm so very tired of constantly restarting this supposedly beneficial software.
PeddlerOfFlesh
Posts: 8
Joined: Sat Jan 10, 2009 9:54 am

Re: Every 353 Point WU fails

Post by PeddlerOfFlesh »

And after about 2 months of not being able to fold a single 353 point WU, they started working again. Right after I complained, too. Wish I knew what it was that fixed it.