Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 6:41 pm
by Scarlet-Tech
bruce wrote:
Scarlet-Tech wrote: Is there a FAQ page that covers cases where a server slowly feeds out credit for units that were completed but not yet credited? I did not realize the system could function that way, so it would be good to be able to point people to a FAQ in case this occurs again. I apologize if it is right under my nose; searching with this phone is less than fun.
If you're asking for a list of credits that are not yet on-line, then no, it doesn't exist. The data is somewhere in a server log which needs to be parsed and data extracted.

If you're asking for the status of the servers, learn to use http://fah-web.stanford.edu/pybeta/serverstat.html

If someone says, "I am getting higher ppd than normal," is there a FAQ page we can direct them to that explains that the credit may be coming in slowly from a server that wasn't reporting previously?

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 6:50 pm
by wilding2004
No - and the concept of "normal ppd" is wrong. As an example, I was getting a nice steady 80K ppd from an R9 270X. Now, in the last week, all of a sudden I'm getting 105K ppd. But I haven't changed anything. It can't be AMD's fault because I haven't changed drivers. It can't be Microsoft's fault because I've stopped updates.

So it must be Stanford's fault that I'm getting more ppd.

Now in this instance the ppd has gone up, so I'm not making too much fuss - but can you see where I'm going?

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 6:56 pm
by Scarlet-Tech
wilding2004 wrote: No - and the concept of "normal ppd" is wrong. As an example, I was getting a nice steady 80K ppd from an R9 270X. Now, in the last week, all of a sudden I'm getting 105K ppd. But I haven't changed anything. It can't be AMD's fault because I haven't changed drivers. It can't be Microsoft's fault because I've stopped updates.

So it must be Stanford's fault that I'm getting more ppd.

Now in this instance the ppd has gone up, so I'm not making too much fuss - but can you see where I'm going?
Yep, I made my post today saying it is a good thing, not a bad one.

I mean, my ppd has doubled and my completed work units have almost doubled.

Prior to last week, there were six straight months of consistency, so I can see how one week of sudden crashes shouldn't be a worry when nothing has changed. See where I am going with that?

If there had been fluctuation over those six months, then I wouldn't have been alarmed at all. But since it was extremely consistent while I was folding with this system, it caught me off guard. Since nearly everyone on our team experienced the issue, and many still are, it was alarming.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 7:09 pm
by wilding2004
Do most people on your team use similar hardware, i.e. Nvidia GPUs? Because obviously if everyone uses similar hardware, then they will all experience similar results. If there were a team that only used R9 270Xs, they would have just got a massive 30% performance boost.

All I'm getting at is that over time the WUs evolve, and sometimes, for no obvious reason, personal performance goes up or down. Over the past year it appears that AMD GPUs are better at dealing with WUs with large numbers of atoms, and Nvidia GPUs are much better at handling smaller WUs.

If there has been a distribution of new, larger WUs recently, that would explain both your lower performance and the better performance from my R9. (I would also like to add that most of my points come from a pair of 970s, and these seem to have fluctuated from 225k ppd to over 375k ppd recently.)
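For reference, the "30%" figure follows from the R9 270X numbers quoted earlier in the thread (a steady 80K ppd rising to 105K ppd). A minimal sketch of the arithmetic; the function name `percent_change` is only illustrative:

```python
def percent_change(old_ppd, new_ppd):
    """Relative change from old_ppd to new_ppd, expressed in percent."""
    return (new_ppd - old_ppd) / old_ppd * 100.0


# R9 270X figures from the earlier post: 80K ppd before, 105K ppd after.
print(f"{percent_change(80_000, 105_000):.2f}%")  # 31.25%
```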

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 7:16 pm
by Scarlet-Tech
There is a vast mix of hardware, but most of the errors are being experienced on Maxwell GPUs. I don't name a specific model, because some are 980s, some 980 Tis, some Titan Xs and so on, but mostly Maxwell. A few are experiencing some errors on Kepler-based GPUs as well.

Most people are still getting errors, but my production is going up, so I am not sure what the cause is. It may be, just like Bruce mentioned, that credit from previous weeks is being fed out slowly. That is a nearly 30% increase in production on my system alone.

Like I said, I will continue to monitor results and when I get home, I will go through the logs and see what I can find. I won't be posting the constant percentage updates, but I will look at credits, completed units, and errors and post those together so that hopefully it helps if possible.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 8:17 pm
by bruce
wilding2004 wrote: Over the past year it appears that AMD GPUs are better at dealing with WUs with large numbers of atoms, and Nvidia GPUs are much better at handling smaller WUs.
A long time ago {in a galaxy far, far away oops} I can remember when people with ATI griped at their inferior performance. We told them that their GPUs were better suited for large proteins and nobody believed us.
Scarlet-Tech wrote: ... but most of the errors are being experienced on Maxwell GPUs.
If I said that, or even responded to that, would you complain that Stanford always blames NV?

The FAH Developers are constantly doing their best to minimize any inequality, especially when errors are involved. Sometimes it's something they can fix... sometimes it depends on required changes to the Drivers or the particular implementation of OpenCL. Often nobody knows which until it gets fixed.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 8:28 pm
by Scarlet-Tech
bruce wrote: If I said that, or even responded to that, would you complain that Stanford always blames NV?

The FAH Developers are constantly doing their best to minimize any inequality, especially when errors are involved. Sometimes it's something they can fix... sometimes it depends on required changes to the Drivers or the particular implementation of OpenCL. Often nobody knows which until it gets fixed.

I would be more curious about why the errors occurred when the Nvidia drivers had not changed. Our hardware and drivers seem to be reported. I don't update until I know how it affects folding, as that is the primary purpose of my system. I don't jump on the newest driver the day it is released because it changes more than game compatibility.

Nvidia drivers have been lacking in the last few revisions, and I am not one to beta test what is supposed to be a WHQL driver. There are plenty of people who will immediately update and let their systems run badly. If, after a few days, the drivers seem game stable and they aren't destroying everyone else's folding, then I usually give them a chance.

Re: Failing units, low ppd, and returned units.

Posted: Sun Nov 15, 2015 12:10 am
by wilding2004
Scarlet-Tech wrote:
bruce wrote: If I said that, or even responded to that, would you complain that Stanford always blames NV?

The FAH Developers are constantly doing their best to minimize any inequality, especially when errors are involved. Sometimes it's something they can fix... sometimes it depends on required changes to the Drivers or the particular implementation of OpenCL. Often nobody knows which until it gets fixed.

I would be more curious about why the errors occurred when the Nvidia drivers had not changed. Our hardware and drivers seem to be reported. I don't update until I know how it affects folding, as that is the primary purpose of my system. I don't jump on the newest driver the day it is released because it changes more than game compatibility.

Nvidia drivers have been lacking in the last few revisions, and I am not one to beta test what is supposed to be a WHQL driver. There are plenty of people who will immediately update and let their systems run badly. If, after a few days, the drivers seem game stable and they aren't destroying everyone else's folding, then I usually give them a chance.

I think this is a good example of WU evolution. The drivers haven't changed, but the science might have. Maxwell GPUs are still pretty new. Maybe the core17/18 WUs available at launch were better suited to them. Now the latest core21 WUs are just more difficult for Maxwell to fold. I don't know the answer, just that the question is always changing.

Re: Failing units, low ppd, and returned units.

Posted: Sun Nov 15, 2015 12:18 am
by Scarlet-Tech
wilding2004 wrote:

I think this is a good example of WU evolution. The drivers haven't changed, but the science might have. Maxwell GPUs are still pretty new. Maybe the core17/18 WUs available at launch were better suited to them. Now the latest core21 WUs are just more difficult for Maxwell to fold. I don't know the answer, just that the question is always changing.

I know this: yesterday was an awesome day for completed units, and today is as well. Whatever the reason, it is a relief and a good thing. I feel that my electric bill will be balanced out this way.

Re: Failing units, low ppd, and returned units.

Posted: Sun Nov 15, 2015 4:09 am
by 7im
There have been a number of new fahcore_21 versions and NV driver versions in the last few months, plus the release of new Projects. One, some, or all may be involved.

Re: Failing units, low ppd, and returned units.

Posted: Mon Nov 16, 2015 11:38 pm
by bcavnaugh
I have GTX 980 graphics cards. What is the correct driver version we should be using?

I also have some GTX 780 graphics cards and have not seen any issues on them, but what is the correct driver version to use there as well?

Lastly, I have AMD 290X graphics cards and have not seen any issues on them either, but what is the correct driver version to use there as well?

Thank you,
bcavnaugh

Re: Failing units, low ppd, and returned units.

Posted: Tue Nov 17, 2015 7:46 pm
by bcavnaugh
I started a new Thread for Testing: Testing Issues with the GTX 980 Graphics Cards and Core 21 Projects
http://forums.evga.com/Testing-Issues-w ... 15012.aspx

First BAD_WORK_UNIT Core 21 P9637. EVGA GTX 980 Hybrid Card
21:16:01:WU00:FS00:0x21:Completed 500000 out of 2000000 steps (25%)
21:16:10:WU00:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint
21:16:10:WU00:FS00:0x21:Max number of retries reached. Aborting.
21:16:10:WU00:FS00:0x21:ERROR:Max Retries Reached
21:16:10:WU00:FS00:0x21:Saving result file logfile_01.txt
21:16:10:WU00:FS00:0x21:Saving result file log.txt
21:16:10:WU00:FS00:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
21:16:11:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
21:16:11:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:9637 run:1 clone:8 gen:87 core:0x21 unit:0x00000079ab436c9b5609bee3c04ebac8
21:16:11:WU00:FS00:Uploading 10.50KiB to 171.67.108.155
21:16:11:WU00:FS00:Connecting to 171.67.108.155:8080
21:16:11:WU00:FS00:Upload complete
21:16:11:WU00:FS00:Server responded WORK_ACK (400)
21:16:11:WU00:FS00:Cleaning up
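For anyone collecting failures like this across several machines, a minimal sketch of pulling the faulty units out of a client log follows. It assumes the "Sending unit results" line always has the field order seen in the excerpt above; the names `RESULT_RE` and `faulty_units` are illustrative and not part of the FAH client.

```python
import re

# Matches the client's "Sending unit results" line as it appears in the log above.
# The field order (error, project, run, clone, gen) is assumed from that one sample.
RESULT_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WU(?P<wu>\d+):FS(?P<slot>\d+):"
    r"Sending unit results:.*?\berror:(?P<error>\w+) project:(?P<project>\d+) "
    r"run:(?P<run>\d+) clone:(?P<clone>\d+) gen:(?P<gen>\d+)"
)


def faulty_units(lines):
    """Return one dict per unit that was uploaded with error:FAULTY."""
    found = []
    for line in lines:
        match = RESULT_RE.match(line.strip())
        if match and match.group("error") == "FAULTY":
            found.append({
                "time": match.group("time"),
                "slot": match.group("slot"),
                "project": int(match.group("project")),
                "run": int(match.group("run")),
                "clone": int(match.group("clone")),
                "gen": int(match.group("gen")),
            })
    return found
```

Passing the lines of a saved log to `faulty_units` gives a list that can be grouped by project, which makes it easier to see whether failures cluster on particular projects.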

Re: Failing units, low ppd, and returned units.

Posted: Tue Nov 17, 2015 10:24 pm
by bcavnaugh
Second BAD_WORK_UNIT:

Core 21 P9629 EVGA GTX 980 Hydro Copper Card
It could not get past 5%/20%

22:04:36:WU05:FS02:0x21:Completed 100000 out of 2000000 steps (5%)
22:04:45:WU05:FS02:0x21:Bad State detected... attempting to resume from last good checkpoint
22:06:13:WU05:FS02:0x21:Completed 20000 out of 2000000 steps (1%)
22:07:40:WU05:FS02:0x21:Completed 40000 out of 2000000 steps (2%)
22:09:08:WU05:FS02:0x21:Completed 60000 out of 2000000 steps (3%)
22:10:35:WU05:FS02:0x21:Completed 80000 out of 2000000 steps (4%)
22:12:02:WU05:FS02:0x21:Completed 100000 out of 2000000 steps (5%)
22:12:12:WU05:FS02:0x21:Bad State detected... attempting to resume from last good checkpoint
22:12:12:WU05:FS02:0x21:Max number of retries reached. Aborting.
22:12:12:WU05:FS02:0x21:ERROR:Max Retries Reached
22:12:12:WU05:FS02:0x21:Saving result file logfile_01.txt
22:12:12:WU05:FS02:0x21:Saving result file log.txt
22:12:12:WU05:FS02:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
22:12:12:WARNING:WU05:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
22:12:12:WU05:FS02:Sending unit results: id:05 state:SEND error:FAULTY project:9629 run:0 clone:5 gen:62 core:0x21 unit:0x00000056ab436c9b5609bee2fee1603c
22:12:12:WU05:FS02:Uploading 9.00KiB to 171.67.108.155
22:12:12:WU05:FS02:Connecting to 171.67.108.155:8080
22:12:13:WU05:FS02:Upload complete
22:12:13:WU05:FS02:Server responded WORK_ACK (400)
22:12:13:WU05:FS02:Cleaning up
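In the excerpt above, both "Bad State detected" messages come right after the 5% progress line. A rough sketch for checking that pattern in a longer log is below: it records the last completed step count seen before each bad state, per work unit and slot. The line layout is assumed from this excerpt only, and `bad_state_steps` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Core output lines look like "HH:MM:SS:WUnn:FSnn:0x21:<message>" in the excerpt above.
CORE_LINE_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}:WU(?P<wu>\d+):FS(?P<slot>\d+):0x[0-9a-fA-F]+:(?P<msg>.*)$"
)
PROGRESS_RE = re.compile(r"Completed (\d+) out of (\d+) steps")


def bad_state_steps(lines):
    """Map (wu, slot) to the last completed step seen before each bad state."""
    last_step = {}
    detections = defaultdict(list)
    for line in lines:
        match = CORE_LINE_RE.match(line.strip())
        if not match:
            continue
        key = (match.group("wu"), match.group("slot"))
        msg = match.group("msg")
        progress = PROGRESS_RE.match(msg)
        if progress:
            last_step[key] = int(progress.group(1))
        elif msg.startswith("Bad State detected"):
            # None means no progress line was seen before the detection.
            detections[key].append(last_step.get(key))
    return dict(detections)
```

If the recorded step is the same on every retry, as it is here, that is worth including when reporting the failure.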

Re: Failing units, low ppd, and returned units.

Posted: Wed Nov 18, 2015 12:44 am
by Ricky
bcavnaugh,

If you are running Windows 7 or 8, try Nvidia driver 347.88 with a clean install from the custom tab of their driver installer. It stopped this problem for me.

Re: Failing units, low ppd, and returned units.

Posted: Wed Nov 18, 2015 3:15 am
by bcavnaugh
Ricky wrote: bcavnaugh,

If you are running Windows 7 or 8, try Nvidia driver 347.88 with a clean install from the custom tab of their driver installer. It stopped this problem for me.
Thank you. I have already been running this for about 3 days, and it seems to work only on Core 21 P9704 projects and NOT on P96xx projects. P96xx still fail.