Re: Failing units, low ppd, and returned units.

Posted: Fri Nov 13, 2015 4:00 am
by Scarlet-Tech
bruce wrote: You must spend a lot of your time looking for conspiracies under every bush. As you can see from my posts, there's no problem getting the truth from a server that's overburdened.

If it were opened to everyone, it would be swamped with a multitude of requests from a multitude of Donors. Collectively, the Mods submit perhaps 30 transactions a week. If it were open to everybody, it would be getting perhaps 30 hits per hour and it simply isn't designed to handle that kind of load. Even so, I've seen it take several minutes to respond to a fairly simple request -- but I don't choose to gripe about the minor inconveniences in life.
One of the members over on our forums, whom I had never seen before yesterday, posted about using P-State 2 for memory with Nvidia Inspector, because the factory memory overclock from Nvidia may be causing the faults in the work units.

You may think that I am looking for conspiracies, and you could be right, but when a company is required to post their findings, they normally dump them. If this information were placed in a public location where users could search it themselves by work unit, they would be able to see whether their own system was the problem without asking for as much help.

If the server is overloaded and having a hard time, I am sure they could get grants to upgrade their systems. I am not against doing research on my own to figure out what is causing my issue, but the minor inconvenience you have is on the Pande Group, not the users. If a public log of returned work units were posted that users could search, they would be able to see the errors and whether another system was able to complete that work unit.

I would rather be able to see whether my system is having errors and fix it than waste $300-$500 worth of energy receiving errors. It's much cheaper to have a fully productive system that contributes than a system that gets the phrase "Bad Work Unit". I can justify wasted electricity when my system is producing like it should, and that is all I want.

Re: Failing units, low ppd, and returned units.

Posted: Fri Nov 13, 2015 4:05 am
by bruce
The job of the Stanford servers is to maximize the science that gets completed. To do that, when there's an error, it's retried, not analyzed.

One of the requirements for being a beta tester is to monitor your logs and to report and analyze whatever problems you encounter. Development takes whatever actions are appropriate to minimize future errors, and when the error rate has been reduced to an acceptable level, the item being tested graduates from beta testing to advanced testing ... and later from advanced testing to a full FAH release.

The V7 Advanced Control program has an option to parse your active log for errors. This just finds serious errors where something had to be aborted. Recent code changes have introduced a class of correctable errors, where FAH attempts to complete the active WU by backing up to the last checkpoint and restarting from that point. If it then reaches a successful completion, it's not counted as an error unless you happen to see it in the un-parsed log and ask what happened.

(I'm making up arbitrary numbers which may not reflect reality): The project owner can readily see that (say) 97% of the WUs were completed on their first try, half of the remaining ones were completed on the second try, and only 0.5% were not completed on the third or fourth try. If the completion of 99.5% of the trajectories plus some partially completed trajectories provides enough data to support the conclusions in the scientific paper, nobody really worries about those other 0.5% except if they're captured during beta testing.
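
For anyone who wants to check how those made-up percentages add up, here is a minimal Python sketch. The figures are the arbitrary ones from the post above, not real project statistics, and the variable names are only illustrative.

```python
from fractions import Fraction

# Arbitrary example figures from the post above (not real project data).
first_try = Fraction(97, 100)              # 97% finish on the first try
remaining = 1 - first_try                  # 3% need a retry
second_try = remaining / 2                 # half of those (1.5%) finish on try two
never_completed = Fraction(5, 1000)        # 0.5% still fail after try three or four

# Whatever is left must have finished on the third or fourth try.
later_tries = 1 - first_try - second_try - never_completed
completed = first_try + second_try + later_tries

print(f"finished on try 3 or 4: {float(later_tries):.1%}")
print(f"completed overall:      {float(completed):.1%}")
```

With these numbers, 1.0% of the WUs finish on the third or fourth try, which is how the total reaches 99.5%.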

Re: Failing units, low ppd, and returned units.

Posted: Fri Nov 13, 2015 4:15 am
by Scarlet-Tech
bruce wrote:The job of the Stanford servers is to maximize the science that gets completed. To do that, when there's an error, it's retried, not analyzed.

I did use Beta previously... I turned it off because FAH v7.4.4 would completely hardlock the system and completely halt production. After that, I just receive errors, but no hardlocks, so it was a step in the right direction.

Re: Failing units, low ppd, and returned units.

Posted: Fri Nov 13, 2015 4:17 am
by Kebast
I've noticed and used the Show Errors button (whatever the text is exactly), but for any number of reasons, much of the log file isn't loaded in the program. Unless it's just that I can't scroll up but the error search can. Also, additional log files are created and I don't think they are shown in the control program.

Re: Failing units, low ppd, and returned units.

Posted: Fri Nov 13, 2015 4:39 am
by bruce
Only part of the active log is shown in FAHControl. To see earlier portions, press the "Refresh" button before or after checking the "Warnings and Errors" filter. You can also filter by Slot or by WU number. (I always use slot.)

A new log is started whenever you restart FAH. A number of older logs are retained in FAH's data directory but you'll have to parse them visually.
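
If parsing the older logs by eye gets tedious, something like the following Python sketch could do a first pass. It rests on assumptions that are not confirmed in this thread: that the older logs are plain-text files, that problem lines contain a ":WARNING:" or ":ERROR:" token, and that slot tags look like "FS01". The function names and the "*.txt" pattern are hypothetical, so adjust them to whatever your own data directory actually contains.

```python
import re
from pathlib import Path

# Assumption: problem lines carry a level token such as ":WARNING:" or ":ERROR:".
LEVEL_RE = re.compile(r":(WARNING|ERROR):")


def find_problems(lines, slot=None):
    """Return (line_number, text) pairs for warning/error lines.

    If slot is given (for example "FS01", an assumed tag format), keep only
    lines that also mention that slot.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if LEVEL_RE.search(line) and (slot is None or f":{slot}:" in line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_dir(log_dir, pattern="*.txt", slot=None):
    """Scan every file matching pattern in log_dir; map file name to its hits."""
    results = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        with path.open(errors="replace") as handle:
            hits = find_problems(handle, slot)
        if hits:
            results[path.name] = hits
    return results
```

Calling scan_log_dir with the folder that holds your old logs returns only the files that had matching lines, so you know which ones are worth opening.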

Re: Failing units, low ppd, and returned units.

Posted: Fri Nov 13, 2015 11:59 pm
by Ricky
Bruce,

Thanks for pointing this out! I had no idea that the refresh would be so handy. I would get up in the middle of the night to check the logs. Now I don't have to.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 1:47 am
by Scarlet-Tech
bruce wrote:Only part of the active log is shown in FAHControl. To see earlier portions, press the "Refresh" button before or after checking the "Warnings and Errors" filter. You can also filter by Slot or by WU number. (I always use slot.)

A new log is started whenever you restart FAH. A number of older logs are retained in FAH's data directory but you'll have to parse them visually.
I don't know what happened today... My production has been incredible, and I am up to 24 completed units so far.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 3:14 pm
by Scarlet-Tech
So, what happened at Stanford over the last 36 hours?

My system definitely didn't change. It is on the east coast and I am in Arizona, so it just sits running, but over the last 24 hours alone, my system has pumped out 45 work units and 2.1 million points.

Since it is impossible that my untouched system changed, what changed at Stanford so that I am finally seeing these results after 9 days of horrible work units and production?

I don't know what changed, but rest assured, I am happy something finally happened.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 3:48 pm
by Rel25917
It's possible a server didn't report points earned to the stat server for a while; once it finally reports, you get a big spike. Not saying that's what happened, just a possibility.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 3:52 pm
by Scarlet-Tech
Rel25917 wrote:It's possible a server didn't report points earned to the stat server for a while; once it finally reports, you get a big spike. Not saying that's what happened, just a possibility.

Totally possible, but usually when points are not reported, they don't get reported at all.

I haven't seen where I get the low report but not the total amount. And since it is a consistent 4-6 work units completed every 3 hours, it isn't a dump, it is actually being consistent.

Hopefully this link doesn't get blocked.

http://folding.extremeoverclocking.com/ ... =&u=654307

This shows that I am completing work units consistently, where they were very inconsistent last week. That is the only reason it seems like something has been fixed at Stanford.

Point dumps all come at one time where the server is like "oh no, I didn't report for 48 hours straight, here are your completed units and points" lol. They aren't slow and steady like this.
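
To make that distinction concrete, here is a small Python sketch that checks how much of the total landed in a single stats update. The per-update counts are invented for illustration, and the 50% threshold and function names are my own assumptions, not anything the stats sites use.

```python
def largest_share(counts):
    """Fraction of all work units that arrived in the single biggest update."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return max(counts) / total


def looks_like_dump(counts, threshold=0.5):
    """Guess 'dump' when one update holds at least `threshold` of the total.

    The threshold is an arbitrary assumption for this sketch.
    """
    return largest_share(counts) >= threshold


# Invented example data: work units credited per 3-hour update over one day.
steady = [5, 4, 6, 5, 6, 4, 5, 5]   # a few units every update
dump = [0, 0, 0, 0, 0, 0, 0, 40]    # nothing, then everything at once

print(looks_like_dump(steady))
print(looks_like_dump(dump))
```

A steady run spreads the units across many updates, while a dump puts nearly all of them in one. It cannot tell a steady run apart from delayed credits that are released gradually.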

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 5:55 pm
by Rel25917
Bad server got turned off or fixed and now you're just getting a great mix of units your cards like, maybe? Would be interesting to know if Stanford did anything. Not sure what the best units for a 980 are, but my Titan (first version) goes insane on P9704. With you away from the computer and not knowing what it is getting now and what it was getting before, who knows what's up.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 6:01 pm
by wilding2004
Actually I've noticed a few times recently where the stats at Extremeoverclocking seem to be running a bit "behind". I.e. where the Stanford stats show x number of WU's completed in the previous x hours - but those stats are not picked up within the 3 hour update at Extremeoverclocking.

Nothing to worry about, it all evens out in the end.

As an aside - I've kept an eye on this thread since it was posted, and your notion that Stanford like to place "blame" on Nvidia for things not working, is simply not true. Now that you have posted here, it might be a good idea to stick around and get a feel for it. Perhaps report back to your team forum that this is a place of honesty and fact - not fanboyism or blame.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 6:03 pm
by bruce
Scarlet-Tech wrote:Totally possible, but usually when points are not reported, they don't get reported at all.
False.

I have responded to a number of reports of "I didn't get points for WUs X, Y and Z. Here's my log." Often I respond that server XXX.XXX.XXX.XXX is down and we need to recheck when it comes back up --- followed by "here are the reports showing points for X, Y, and Z."

It's also common to find a server that was down for a while but points were collected. When the server comes back on-line, points for NEW WUs start getting credited but the "lost" points have to be recomputed and added manually. If the project owner is busy, that delayed credit can take a week or two.

I'm not saying that points are never lost, but it's really pretty rare compared to them being delayed. I'll bet your system is NOT pumping out "45 work units and 2.1 million points" now; it's pumping out half that, and the other half are delayed credits.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 6:10 pm
by Scarlet-Tech
Rel25917 wrote:Bad server got turned off or fixed and now you're just getting a great mix of units your cards like, maybe? Would be interesting to know if Stanford did anything. Not sure what the best units for a 980 are, but my Titan (first version) goes insane on P9704. With you away from the computer and not knowing what it is getting now and what it was getting before, who knows what's up.
I will have logs for days when I get home next Saturday. I will be looking up the errors and will try to sort through and post it to a clean Google doc so we can see what happened.

bruce wrote:
Scarlet-Tech wrote:Totally possible, but usually when points are not reported, they don't get reported at all.
False.

I have responded to a number of reports of "I didn't get points for WUs X, Y and Z. Here's my log." Often I respond that server XXX.XXX.XXX.XXX is down and we need to recheck when it comes back up --- followed by "here are the reports showing points for X, Y, and Z."

It's also common to find a server that was down for a while but points were collected. When the server comes back on-line, points for NEW WUs start getting credited but the "lost" points have to be recomputed and added manually. If the project owner is busy, that delayed credit can take a week or two.

I'm not saying that points are never lost, but it's really pretty rare, as long as you allow for delayed credits. I'll bet your system is NOT pumping out "45 work units and 2.1 million points" now; it's pumping out half that, and the other half are delayed credits.

I was going off of the fact that when the servers stop reporting over the weekend, which has happened quite a few times, I always received a dump for the points and received credit for everything that was held over the weekend. Usually, these happened Friday night, then Saturday and Sunday show zero points, and then around 9am-12pm on Monday there is a mass influx of points from the weekend and all completed units.

Is there a FAQ page that covers the cases where a server will slowly feed out units that were completed but not credited? I did not realize the system could function that way, so it would be good to be able to point people to a FAQ in case this occurs again. I apologize if it is right under my nose; searching with this phone is less than fun.

Re: Failing units, low ppd, and returned units.

Posted: Sat Nov 14, 2015 6:36 pm
by bruce
Scarlet-Tech wrote:Is there a FAQ page that covers the cases where a server will slowly feed out units that were completed but not credited? I did not realize the system could function that way, so it would be good to be able to point people to a FAQ in case this occurs again. I apologize if it is right under my nose; searching with this phone is less than fun.
If you're asking for a list of credits that are not yet on-line, then no, it doesn't exist. The data is somewhere in a server log which needs to be parsed and data extracted.

If you're asking for the status of the servers, learn to use http://fah-web.stanford.edu/pybeta/serverstat.html