Stats backlog
Posted: Sat Jan 11, 2014 11:40 am
by billford
A few thoughts.
From past experience (and paying a bit more attention to serverstats this morning) I get the following impressions (which I accept may be mistaken):
1) the stats system works on a "last in, first out" basis, ie recently uploaded WUs get processed before any backlog is addressed.
2) it stops processing when some limit is reached (time, space, WU count, I have no idea)
3) in the "contest" between clients adding uploaded WUs to the queue and the stats system removing them, the stats system isn't winning.
If I'm right, the parameter in 2) needs to be increased… comments?
Re: Stats backlog
Posted: Sat Jan 11, 2014 1:31 pm
by PantherX
1) Seems plausible.
2) When the Stats System is out of sync with the WS/CS or manually taken offline. This happens due to network issues, maintenance work, etc. Generally the Stats system is quite robust and any issues happen infrequently.
3) Clients will always upload results to the WS. In some cases, if the WS isn't online (keeping the WS online is very important), the CS is used (not all Projects use this). The Stats system takes the data from the WS/CS and then processes it. This is usually done as a batch process, but it can take more than one attempt to clear the backlog. Not sure what the reasons are for splitting the backlog into more than one batch job. Every few months, a manual "sync" takes place so that any WUs which weren't added to the Stats Server are added.
I do realize that the start of the year has been rocky from the Server side but do keep in mind that there were some significant changes to the infrastructure which may have resulted in unexpected issues.
Re: Stats backlog
Posted: Sat Jan 11, 2014 2:00 pm
by billford
PantherX wrote:2) When the Stats System is out of sync with the WS/CS or manually taken offline. This happens due to network issues, maintenance work, etc. Generally the Stats system is quite robust and any issues happen infrequently.
Poor phrasing on my part, I meant that it stops processing the current run, even when everything is running correctly. Also see next.
PantherX wrote:3) … It is usually done in a batch process but it can take more than one attempt to clear the backlog.
That's more or less what I'm saying- whatever determines the size of the batch isn't making it big enough!
PantherX wrote:Not sure what the reasons are to split up the backlog into more than one batch job.
I don't have any problem with that- for any job that runs at fixed intervals with varying (and unpredictable) amounts of input data, it makes sense to ensure that a specific run won't overlap the start time of the next. It can be coded so that an overlap won't cause problems if it should happen, but it's easier just to make sure it won't.
In this particular case it also means that there are predictable periods when the stats will be "static" (even if not entirely up to date) which is convenient for those who want to build their own stats from them, be they individual donors or third-party "aggregation" sites.
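The model being discussed here (an hourly run that stops at some limit so it cannot overlap the next one, and that takes the newest WUs first) can be illustrated with a toy simulation. This is only a sketch of the guess made in this thread, not a description of how the stats server actually works: the function names, the "budget" limit and all the numbers are hypothetical.

```python
from collections import deque


def run_batch(queue, budget):
    """Process WUs newest-first until the per-run budget is used up.

    'budget' stands in for whatever limit ends a run (time, space,
    WU count); here it is simply a maximum number of WUs per run.
    """
    processed = 0
    while queue and processed < budget:
        queue.pop()  # take the most recently added WU (last in, first out)
        processed += 1
    return processed


def simulate(initial_backlog, arrivals_per_hour, budget, hours):
    """Return the queue length left over after each hourly run."""
    queue = deque(range(initial_backlog))
    leftover = []
    for _ in range(hours):
        queue.extend(range(arrivals_per_hour))  # clients upload new WUs
        run_batch(queue, budget)                # the scheduled stats run
        leftover.append(len(queue))
    return leftover


# Budget smaller than the arrival rate: the backlog grows every hour.
growing = simulate(initial_backlog=100, arrivals_per_hour=50, budget=40, hours=3)
# Budget larger than the arrival rate: the backlog drains.
draining = simulate(initial_backlog=100, arrivals_per_hour=50, budget=80, hours=3)
print(growing, draining)
```

In this toy model the backlog only shrinks when the per-run limit exceeds the number of WUs arriving between runs, which is the point being argued about the batch size.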
PantherX wrote:Every few months, a manual "sync" takes place so if any WUs which weren't added to the Stats Server, are now added.
Perhaps the Operating Procedure should be amended to "Every few months and after a significant outage…" ?
I think it would make sense, after an outage such as this one, for the backlog to be cleared in a single giant batch before the hourly update goes live again. The extra delay would, I think, be more acceptable to donors than "they'll be up to date eventually… give it a few months". Especially if they suspect that one or more uploaded WUs may (or may not) have gone missing.
(Yes, I know the mods can check if requested, but at the moment I'm 4 WUs light out of about 25, but I can't identify which ones. I don't really want to present the mods with a list of all 25 to check any more than they are likely to want to check them, and I'm fairly sure I'm not the only one who could come up with such a list…)
Re: Stats backlog
Posted: Sat Jan 11, 2014 2:31 pm
by 7im
Please post the basis for your conclusions, because the stats server has never been unable to dig out of a backlog within a day or two of all the servers coming back online.
Re: Stats backlog
Posted: Sat Jan 11, 2014 2:46 pm
by billford
Just keep an eye on the WUs Rcv total on the server stats page- if it's going down at all it's doing it very slowly.
The feeling that each run of the stats update is taking less time (ie processing fewer WUs) than before is, I admit, little more than that- I've no hard evidence. But I think I'm right.
Re: Stats backlog
Posted: Sat Jan 11, 2014 3:32 pm
by billford
7im wrote:Please post the basis for your conclusions, because the stats server has never been unable to dig out of a backlog within a day or two of all the servers coming back online.
OK, a few figures (later ones edited in)- some dug out from odd notes I've made, some exact from the stats:
Time     WU Rcv
09:00Z   ~12.6K
09:30Z   >10K
12:00Z   ~14K
14:26Z   11147
15:05Z   13920
15:26Z   11262
16:05Z   14257
16:22Z   11362
16:41Z   12587
17:01Z   14035
17:22Z   11369
You tell me how many days it will take to clear the backlog… to me, looking at pre- and post-update totals, it seems as though it may even be growing.
(Sorry about the code tags, it was the only way I could find to keep the formatting)
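The "pre- and post-update totals" comparison can be made explicit with a short script. This is a minimal sketch that uses only the exact readings from the table above (the approximate morning figures are left out); the helper name `turning_points` is made up for illustration. It says nothing about what the WU Rcv column actually measures, which is taken up later in the thread.

```python
# Exact readings copied from the table above (UTC time, "WU Rcv" total).
READINGS = [
    ("14:26Z", 11147),
    ("15:05Z", 13920),
    ("15:26Z", 11262),
    ("16:05Z", 14257),
    ("16:22Z", 11362),
    ("16:41Z", 12587),
    ("17:01Z", 14035),
    ("17:22Z", 11369),
]


def turning_points(values):
    """Split a series into local lows and local highs.

    A reading counts as a low (high) if it is below (above) every
    neighbour it has; readings in the middle of a rise are ignored.
    """
    lows, highs = [], []
    for i, value in enumerate(values):
        neighbours = values[max(i - 1, 0):i] + values[i + 1:i + 2]
        if all(value < n for n in neighbours):
            lows.append(value)
        elif all(value > n for n in neighbours):
            highs.append(value)
    return lows, highs


values = [count for _, count in READINGS]
lows, highs = turning_points(values)
low_drift = lows[-1] - lows[0]  # change in the post-update total over the afternoon
print("post-update lows:", lows)
print("pre-update highs:", highs)
print("drift of the lows:", low_drift)
```

On these eight readings the post-update lows creep upward by a couple of hundred over roughly three hours, while the pre-update highs show no clear trend.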
Re: Stats backlog
Posted: Sat Jan 11, 2014 7:26 pm
by rickoic
My thoughts:
1. A batch file is scheduled to run at H+59 or possibly H+00 each and every hour.
2. Catchup batch file runs until pre-empted by the batch file in item 1 which takes priority.
3. Catchup batch file begins where it left off after batch file in item 1 finishes.
This continues until catchup batch file has no more data to process.
Files containing data on finished wu's are limited in size by some means; from 0000Z to 0059Z or by some size limitations.
The longer the system is down the longer it will take the catchup batch file to process them.
I'm sure that Pande Group has a printout of files that have not been processed, and compares that list to files as they are completed to ensure that any missed files are resubmitted.
It works, eventually everything will be processed, maybe not as fast as some would like, but eventually.
Rick
Re: Stats backlog
Posted: Sat Jan 11, 2014 8:30 pm
by PantherX
billford wrote:...I meant that it stops processing the current run, even when everything is running correctly...
After the Stats System comes back online, I have seen one of the following happen:
1) There is a massive delay while almost all backlog WUs are processed. This takes a significant amount of time compared to a normal run, and the current run is affected by it.
2) The current run happens normally. However, the backlog of WUs is processed "slowly" i.e. it takes a couple of updates to clear the backlog.
billford wrote:...Perhaps the Operating Procedure should be amended to "Every few months and after a significant outage…" ?
I think it would make sense that after an outage such as this one that the backlog is cleared in a single giant batch before the hourly update goes live again. The extra delay would, I think, be more acceptable to donors than "they'll be up to date eventually… give it a few months". Especially if they suspect that one or more uploaded WUs may (or may not) have gone missing...
Since the manual update requires more resources than an automatic one, if the researcher is very busy, it is unlikely. However, if the researcher has time, it could be done. Moreover, bruce has informed PG about a possible backlog (viewtopic.php?p=256057#p256057) so you just might see your points.
Re: Stats backlog
Posted: Sat Jan 11, 2014 10:20 pm
by billford
PantherX wrote:However, the backlog of WUs is processed "slowly" i.e. it takes a couple of updates to clear the backlog.
I don't dispute the accuracy of anything I'm being told, but this morning at 09:00Z the server stats gave the total of unprocessed WUs as ~12,600, currently (22:00Z) it's 14,622.
Shouldn't it be going down, not up?
edit- I'm assuming that "received" implies "not yet processed into stats". If I'm wrong then I apologise for the noise I've created and I'll shut up.
Re: Stats backlog
Posted: Sat Jan 11, 2014 10:31 pm
by bollix47
billford wrote:Shouldn't it be going down, not up?
Not necessarily so ... each time a WU is returned a new one is created. So that alone will keep the count fairly constant. Also, there are new projects added which would in fact increase the number of WUs.
Re: Stats backlog
Posted: Sat Jan 11, 2014 10:33 pm
by billford
bollix47 wrote:Also, there are new projects added which would in fact increase the number of WUs.
The number of received WUs?
Re: Stats backlog
Posted: Sat Jan 11, 2014 10:41 pm
by bollix47
Sorry, I posted before I saw your edit.
The number of received WUs is only since the last stats update (i.e. normally the number of WUs returned since shortly after the top of the hour). It could be more or less than the last time the stats were run. Not all servers actually report those figures so the validity of the total is a bit questionable. Although it is correct for the figures shown, detail for some servers is missing, so the Totals figure for that column does not represent a true total of all the work units received ... just of what is showing. For example, if you look at 171.64.65.69 you won't see a figure in that column, but we know there are a lot of core_17 work units returned every hour to that server. On a server where this feature is working you'll see the number increase throughout the hour following the last stats update until a new stats update ... then the count starts over. Here is an example from the log of 171.64.65.124:
Sat Jan 11 15:10:11 PST 2014 171.64.65.124 vspg14e sryckbos SMP full Accepting 4.18 42 4 50541 42976 0 5895 5895 5895 63 - - - - 236 1 - - 1 - 0 1 WL; X; 10000, 10000 6.34, 7.00 5, 5 10000, 10000 64, 64 - 1, 1 - F, A, F, A 8080, 8080
Sat Jan 11 15:30:11 PST 2014 171.64.65.124 vspg14e sryckbos SMP full Accepting 4.26 41 4 50541 42976 0 5932 5932 5932 67 - - - - 613 1 - - 1 - 0 1 WL; X; 10000, 10000 6.34, 7.00 5, 5 10000, 10000 64, 64 - 1, 1 - F, A, F, A 8080, 8080
Sat Jan 11 15:50:11 PST 2014 171.64.65.124 vspg14e sryckbos SMP full Accepting 3.36 44 9 50541 42975 1 5968 5968 5968 62 - - - - 978 1 - - 1 - 0 1 WL; X;
At 10 minutes past the hour the count is 236, at 30 minutes past the hour it is 613 and finally at 50 minutes the count is 978. If you look at the log you'll see the pattern repeating and, depending on the timing, the total could be higher or lower than the last time you looked.
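To follow that pattern without reading the log by eye, the count can be pulled out of each line with a few lines of Python. This is a sketch only: the field position (index 26 after splitting on whitespace) is inferred from the three sample lines above, where it matches the quoted values 236, 613 and 978; it is an assumption, not a documented log format. The sample lines are trimmed after that field.

```python
# Sample lines from the log above, trimmed after the field of interest.
LOG_LINES = [
    "Sat Jan 11 15:10:11 PST 2014 171.64.65.124 vspg14e sryckbos SMP full "
    "Accepting 4.18 42 4 50541 42976 0 5895 5895 5895 63 - - - - 236 1",
    "Sat Jan 11 15:30:11 PST 2014 171.64.65.124 vspg14e sryckbos SMP full "
    "Accepting 4.26 41 4 50541 42976 0 5932 5932 5932 67 - - - - 613 1",
    "Sat Jan 11 15:50:11 PST 2014 171.64.65.124 vspg14e sryckbos SMP full "
    "Accepting 3.36 44 9 50541 42975 1 5968 5968 5968 62 - - - - 978 1",
]

# Assumed position of the received-WU count, inferred from these samples only.
WU_RCV_FIELD = 26


def minute_and_count(line):
    """Return (minutes past the hour, received-WU count) for one log line."""
    fields = line.split()
    minute = int(fields[3].split(":")[1])  # fields[3] is the HH:MM:SS timestamp
    return minute, int(fields[WU_RCV_FIELD])


samples = [minute_and_count(line) for line in LOG_LINES]

# Average growth of the counter between the first and last sample.
(first_min, first_count), (last_min, last_count) = samples[0], samples[-1]
rate_per_minute = (last_count - first_count) / (last_min - first_min)
print(samples, rate_per_minute)
```

For these three lines the counter grows by about 18.5 WUs per minute, so a reading taken late in the hour will naturally be much higher than one taken just after an update.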
Re: Stats backlog
Posted: Sat Jan 11, 2014 10:47 pm
by billford
bollix47 wrote:Not all servers actually report those figures so the validity of the total is a bit questionable.
Ah, OK… serves me right for believing what I read on the internet, I should know better by now.
I'll shut up forthwith

Re: Stats backlog
Posted: Sun Jan 12, 2014 4:46 am
by 7im
The validity is NOT in question, only that not all the factors have been considered when drawing conclusions, as I tried to say earlier.
Re: Stats backlog
Posted: Sun Jan 12, 2014 4:36 pm
by bcavnaugh
I don't think our Points or Status page has updated in almost 3 days.
From
http://folding.extremeoverclocking.com/ ... =&u=638326
24 264 1,439,478 1,298,147 246,666 246,666 186,306,868 9,187 04.27.13
From
http://fah-web2.stanford.edu/cgi-bin/ma ... num=111065
This looks correct so maybe it is only the Team page above.
From
http://fah-web.stanford.edu/cgi-bin/mai ... =bcavnaugh
Date of last work unit 2014-01-12 08:07:09
Total score 186342277
Overall rank (if points are combined) 243 of 1718776
Active clients (within 50 days) 85
Active clients (within 7 days) 24
Looks like it is fixed now.