Re: New Assignment Server feedback/problem
Posted: Sat Oct 04, 2014 9:58 pm
by Breach
As the cores themselves didn't change overnight, I am guessing that the new Assignment Server (AS) is now giving out projects/WUs to Maxwell cards that it shouldn't. Your GPUs are not trying to complete anything - the process has hung. You either get a bad work unit failure or a core hang.
Re: New Assignment Server feedback/problem
Posted: Sat Oct 04, 2014 10:01 pm
by kookykrazee
I have had the 2nd pair fail/not complete, and have been assigned 2 more, unfortunately. Is there any way to get these to stop showing up until this is resolved (probably Monday at this point, right)?
Re: New Assignment Server feedback/problem
Posted: Sat Oct 04, 2014 10:43 pm
by widsss
I've had nothing but Bad Work Units for a few hours from projects 9406, 13000 and 13001.
Re: New Assignment Server feedback/problem
Posted: Sat Oct 04, 2014 10:46 pm
by Kjetil
Yes, you are not the only one. 6 Maxwell cards doing nada: 3x 750 Ti and 3x 980.
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 12:00 am
by PS3EdOlkkola
Similar issues here: I got a series of 9406 WUs on a new 980 Maxwell GPU that immediately failed. I've disabled the slot until a fix is apparent. The same 980 completed about two dozen Core 15 work units before the slot started failing on Core 17 WUs. Same failure mode every time:
22:46:33:WU02:FS01:0x17:ERROR:exception: Force RMSE error of 617.919 with threshold of 5
22:46:33:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
980 GPU is under-clocked by 5% from stock speeds.
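For anyone who wants to count these failures across a long log, here is a minimal Python sketch that pulls the fields out of the two line shapes quoted above. It assumes every log line follows exactly the format of those two examples; the function name `parse_line` and the dictionary keys are made up for illustration and are not part of the FAH client.

```python
import re
from typing import Optional

# Matches lines shaped like the "Force RMSE error" line quoted above.
RMSE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WU(?P<wu>\d+):FS(?P<slot>\d+):"
    r"(?P<core>0x[0-9a-fA-F]+):ERROR:exception: "
    r"Force RMSE error of (?P<rmse>[\d.]+) with threshold of (?P<threshold>[\d.]+)$"
)

# Matches lines shaped like the "FahCore returned" line quoted above.
RETURN_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:WU(?P<wu>\d+):FS(?P<slot>\d+):"
    r"FahCore returned: (?P<status>\w+) "
    r"\((?P<code>\d+) = (?P<hex>0x[0-9a-fA-F]+)\)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the parsed fields of a failure line, or None if it is neither shape."""
    line = line.strip()
    m = RMSE_RE.match(line)
    if m:
        return {
            "kind": "rmse",
            "time": m["time"],
            "wu": int(m["wu"]),
            "slot": int(m["slot"]),
            "core": m["core"],
            "rmse": float(m["rmse"]),
            "threshold": float(m["threshold"]),
        }
    m = RETURN_RE.match(line)
    if m:
        return {
            "kind": "return",
            "time": m["time"],
            "wu": int(m["wu"]),
            "slot": int(m["slot"]),
            "status": m["status"],
            "code": int(m["code"]),
            "hex": m["hex"],
        }
    return None
```

Feeding each line of a saved log through `parse_line` and keeping the non-`None` results gives a list you can group by slot or core to see which ones are failing.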
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 12:03 am
by kookykrazee
Curious, why are you underclocked?
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 12:44 am
by PS3EdOlkkola
I under-clock all my GPUs.
On AMD-based GPUs, under-clocking makes them run more stably, so they reliably process WUs and upload them to the collection server. The under-clock also improves their longevity by reducing heat. I pack a lot of AMD GPUs together (6 GPUs in a highly modified 4U server case, with two systems configured this way), so managing heat levels is an important issue at that kind of density. They all run under 80 deg C in those enclosures when slightly under-clocked. Also, even though the power supply is spec'd at 1,500 watts, the under-clock keeps consumed wattage at 1,200 watts, which also helps preserve the life of the PS.
For Nvidia, the primary reason is heat and longevity, since they tend to process and upload more reliably than the AMD GPUs (at least for me).
I have a planned life-cycle of 3 years for each GPU, at which point they are retired. Everyone has different priorities, but for me that balances a decent life-cycle investment with the time and energy it takes to maintain older hardware. After 3 years, it's time to give the GPU away or trash it and move on to newer hardware. Under-clocking virtually guarantees they make the 3 year time horizon for replacement.
That policy is also in effect for motherboards. I just decommissioned 2 AMD FX8350-based motherboards. A PCIe slot failed in one, and neither could support the PCIe 3.0 spec that's needed for optimal performance (on a PPD basis) with the highest-end Nvidia and AMD GPUs (both MBs were replaced with i7-4790K-based MBs). Next weekend, I'm decommissioning a 3rd AMD FX8350 motherboard and replacing it with an i7-5960X-based motherboard, retiring two AMD 7970s and replacing them with two Maxwell GTX 980s. Using the EVGA Classified X99 MB (socket 2011-v3), I will be able to add a third 980 to that configuration at a later date.
I've wasted way too many hours trying to keep older hardware running reliably. Admittedly, it's a challenge to see if you can extract the last bit of life from a piece of ancient hardware, and even more interesting is how much I've learned about troubleshooting hardware issues. But I've only got so much time to spend on my "FAH hobby", so I have to optimize around known-good, contemporary hardware.
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 1:03 am
by kookykrazee
That makes sense, I was curious. I still wish the 9406s would stop coming...until they are fixed or such...
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 4:50 am
by VijayPande
We've been asked by donors to give Maxwells Core17 & Core18 work, and we've done so with the adv setting (please see the previous blog post). You can opt out by removing the adv setting.
Sounds like Maxwells, even with the latest drivers, aren't ready and/or we need to see what we can do on the core side.
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 4:58 am
by kookykrazee
Will it help at all to downgrade the drivers?
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 5:03 am
by VijayPande
We had reports that the newest drivers worked (and they were looking good in our testing as well). We're going to:
1) revert to not giving Maxwells the latest cores
2) have Yutong (aka Proteineer) upgrade the cores to work around the driver issue; he has a plan for that and will get on it Monday, if not sooner.
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 5:04 am
by kookykrazee
Sounds great, thanks for the update.
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 5:33 am
by Mstenholm
I had ten 9406 WUs fail on my brand new 970, all with the Force RMSE error. No OC, 344.16 driver. Last client 7.4.4. I removed FAH and did a fresh install, but the first 9406 failed as well.
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 7:54 am
by Breach
@VijayPande, thanks, waiting for the revert. My experience: at the moment fah, advanced and beta all get 0x17 WUs, which all fail, and from what I saw last week some, but not all, 0x18 WUs insta-fail too.
What I don't understand is how a few months ago Maxwell was apparently folding Core 17 WUs and now it's not:
viewtopic.php?f=80&t=25887&start=120
Re: New Assignment Server feedback/problem
Posted: Sun Oct 05, 2014 8:02 am
by JimF
I don't understand the need for the "advanced" setting. I was getting Core 17 fine on all six of my GTX 750 Ti's (without the advanced tag) until a couple of days ago, when we were asked to use it. So I did, and I have been getting only failures ever since.
My latest log is in this post.
viewtopic.php?f=80&t=25887&start=135