Stats backlog

Posted: Sat Jan 11, 2014 11:40 am
by billford
A few thoughts.

From past experience (and paying a bit more attention to serverstats this morning) I get the following impressions (which I accept may be mistaken):

1) the stats system works on a "last in, first out" basis, ie recently uploaded WUs get processed before any backlog is addressed.

2) it stops processing when some limit is reached (time, space, WU count, I have no idea)

3) in the "contest" between clients adding uploaded WUs to the queue and the stats system removing them, the stats system isn't winning.

If I'm right, the parameter in 2) needs to be increased… comments?
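
To make the three impressions above concrete, here is a minimal sketch of a "last in, first out" queue with a per-run limit. It is only an illustration of the hypothesis, not a description of how the real stats system works; the function name and all the numbers are made up.

```python
def simulate_lifo(initial_backlog, arrivals_per_run, batch_limit, runs):
    """Toy model: newest WUs sit on top of a stack and are processed first.

    Returns (old_wus_left, total_wus_left) after the given number of runs.
    All parameters are hypothetical; nothing here comes from the real system.
    """
    stack = ["old"] * initial_backlog          # backlog sits at the bottom
    for _ in range(runs):
        stack.extend(["new"] * arrivals_per_run)   # uploads since the last run
        # Process at most batch_limit WUs, taking the most recent ones first.
        del stack[max(0, len(stack) - batch_limit):]
    return stack.count("old"), len(stack)


if __name__ == "__main__":
    # Limit above the arrival rate: the backlog shrinks slowly.
    print(simulate_lifo(100, 30, 40, 3))
    # Limit below the arrival rate: the old backlog is never touched
    # and the queue grows.
    print(simulate_lifo(100, 30, 20, 3))
```

In this toy model the old backlog only shrinks when the per-run limit exceeds the number of new arrivals, which is the point behind suggesting that the parameter in 2) be increased.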

Re: Stats backlog

Posted: Sat Jan 11, 2014 1:31 pm
by PantherX
1) Seems plausible.

2) When the Stats System is out of sync with the WS/CS or manually taken offline. This happens due to network issues, maintenance work, etc. Generally the Stats system is quite robust and any issues happen infrequently.

3) Clients will always upload results to the WS. In some cases, if the WS isn't online (keeping the WS online is very important), the CS is used (not all Projects use this). The Stats system takes the data from the WS/CS and then processes it. It is usually done in a batch process but it can take more than one attempt to clear the backlog. Not sure what the reasons are to split up the backlog into more than one batch job. Every few months, a manual "sync" takes place so that any WUs which weren't added to the Stats Server are now added.

I do realize that the start of the year has been rocky from the Server side but do keep in mind that there were some significant changes to the infrastructure which may have resulted in unexpected issues.

Re: Stats backlog

Posted: Sat Jan 11, 2014 2:00 pm
by billford
PantherX wrote:2) When the Stats System is out of sync with the WS/CS or manually taken offline. This happens due to network issues, maintenance work, etc. Generally the Stats system is quite robust and any issues happen infrequently.
Poor phrasing on my part, I meant that it stops processing the current run, even when everything is running correctly. Also see next.
PantherX wrote:3) … It is usually done in a batch process but it can take more than one attempt to clear the backlog.
That's more or less what I'm saying- whatever determines the size of the batch isn't making it big enough!
PantherX wrote:Not sure what the reasons are to split up the backlog into more than one batch job.
I don't have any problem with that- for any job that runs at fixed intervals with varying (and unpredictable) amounts of input data it makes sense to ensure that a specific run won't overlap the start time for the next. It can be coded so that it won't cause problems if it should happen, but it's easier just to make sure it won't :wink:

In this particular case it also means that there are predictable periods when the stats will be "static" (even if not entirely up to date) which is convenient for those who want to build their own stats from them, be they individual donors or third-party "aggregation" sites.
PantherX wrote:Every few months, a manual "sync" takes place so that any WUs which weren't added to the Stats Server are now added.
Perhaps the Operating Procedure should be amended to "Every few months and after a significant outage…" ?

I think it would make sense, after an outage such as this one, for the backlog to be cleared in a single giant batch before the hourly update goes live again. The extra delay would, I think, be more acceptable to donors than "they'll be up to date eventually… give it a few months". Especially if they suspect that one or more uploaded WUs may (or may not) have gone missing.

(Yes, I know the mods can check if requested, but at the moment I'm 4 WUs light out of about 25, but I can't identify which ones. I don't really want to present the mods with a list of all 25 to check any more than they are likely to want to check them, and I'm fairly sure I'm not the only one who could come up with such a list…)

Re: Stats backlog

Posted: Sat Jan 11, 2014 2:31 pm
by 7im
Please post the basis for your conclusions, because the stats server has never been unable to dig out of a backlog within a day or two of all the servers coming back online.

Re: Stats backlog

Posted: Sat Jan 11, 2014 2:46 pm
by billford
Just keep an eye on the WUs Rcv total on the server stats page- if it's going down at all it's doing it very slowly.

The feeling that each run of the stats update is taking less time (ie processing fewer WUs) than before is, I admit, little more than that- I've no hard evidence. But I think I'm right.

Re: Stats backlog

Posted: Sat Jan 11, 2014 3:32 pm
by billford
7im wrote:Please post the basis for your conclusions, because the stats server has never been unable to dig out of a backlog within a day or two of all the servers coming back online.
OK, a few figures (later ones edited in)- some dug out from odd notes I've made, some exact from the stats:

Code:

Time         WU Rcv

09:00Z      ~12.6K
09:30Z      >10K
12:00Z      ~14K
14:26Z      11147
15:05Z      13920
15:26Z      11262
16:05Z      14257
16:22Z      11362
16:41Z      12587
17:01Z      14035
17:22Z      11369
You tell me how many days it will take to clear the backlog… to me, looking at pre- and post-update totals, it seems as though it may even be growing.


(Sorry about the code tags, it was the only way I could find to keep the formatting)
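
For anyone who wants to check the pre- and post-update totals, here is a small sketch using only the exact figures from the table (14:26Z onwards). The rule it uses is an assumption: any reading lower than the one before it is treated as taken just after a stats update, and the reading before it as the pre-update total. The function name is made up for illustration.

```python
# Exact "WU Rcv" readings from the table above (time in Z, count).
READINGS = [
    ("14:26", 11147), ("15:05", 13920), ("15:26", 11262), ("16:05", 14257),
    ("16:22", 11362), ("16:41", 12587), ("17:01", 14035), ("17:22", 11369),
]


def split_pre_post(readings):
    """Return (pre_update, post_update) counts.

    Assumption: a drop between two consecutive readings means a stats
    update ran in between them.
    """
    pre, post = [], []
    for (_, prev), (_, cur) in zip(readings, readings[1:]):
        if cur < prev:
            pre.append(prev)
            post.append(cur)
    return pre, post


if __name__ == "__main__":
    pre, post = split_pre_post(READINGS)
    print("pre-update totals: ", pre)
    print("post-update totals:", post)
    print("change in post-update total:", post[-1] - post[0])
```

With so few samples, and with readings taken at different minutes past the hour, this is only a rough indication of the trend.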

Re: Stats backlog

Posted: Sat Jan 11, 2014 7:26 pm
by rickoic
My thoughts:

1. A batch file is scheduled to run at H+59 or possibly H+00 each and every hour.
2. Catchup batch file runs until pre-empted by the batch file in item 1 which takes priority.
3. Catchup batch file begins where it left off after batch file in item 1 finishes.

This continues until catchup batch file has no more data to process.
Files containing data on finished wu's are limited in size by some means; from 0000Z to 0059Z or by some size limitations.
The longer the system is down the longer it will take the catchup batch file to process them.

I'm sure that Pande Group has a printout of files that have not been processed, and compare that list to files as they are completed to ensure that any missed files are resubmitted.

It works, eventually everything will be processed, maybe not as fast as some would like, but eventually.

Rick
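
A rough sketch of the scheme described in items 1 to 3, assuming the catch-up job only gets the minutes of each hour that the regular hourly job does not use. The rates and durations are hypothetical; the real figures are not known.

```python
import math


def hours_to_clear(backlog_wus, catchup_rate_per_min, hourly_job_minutes):
    """Estimate how many hours the catch-up job needs to empty the backlog.

    Assumes the hourly job pre-empts catch-up for hourly_job_minutes of
    every hour, and catch-up runs at a constant rate for the rest.
    Returns None if catch-up never gets any time to run.
    """
    free_minutes = 60 - hourly_job_minutes
    if free_minutes <= 0 or catchup_rate_per_min <= 0:
        return None
    per_hour = free_minutes * catchup_rate_per_min
    return math.ceil(backlog_wus / per_hour)


if __name__ == "__main__":
    # Hypothetical: 12,000 WUs behind, 50 WUs/min, hourly job takes 10 min.
    print(hours_to_clear(12000, 50, 10))
```

Under these assumptions a longer outage means a bigger backlog and a proportionally longer catch-up, as stated above.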

Re: Stats backlog

Posted: Sat Jan 11, 2014 8:30 pm
by PantherX
billford wrote:...I meant that it stops processing the current run, even when everything is running correctly...
After the Stats System comes back online, I have seen one of the following happen:
1) There is a massive delay while almost all backlog WUs are processed. It takes a significant amount of time compared to a normal run. The current run is affected by this.
2) The current run happens normally. However, the backlog of WUs is processed "slowly" i.e. it takes a couple of updates to clear the backlog.
billford wrote:...Perhaps the Operating Procedure should be amended to "Every few months and after a significant outage…" ?

I think it would make sense, after an outage such as this one, for the backlog to be cleared in a single giant batch before the hourly update goes live again. The extra delay would, I think, be more acceptable to donors than "they'll be up to date eventually… give it a few months". Especially if they suspect that one or more uploaded WUs may (or may not) have gone missing...
The manual update requires more resources than an automatic one, so it is unlikely to happen while the researcher is very busy. However, if the researcher has time, it could be done. Moreover, bruce has informed PG about a possible backlog (viewtopic.php?p=256057#p256057) so you just might see your points.

Re: Stats backlog

Posted: Sat Jan 11, 2014 10:20 pm
by billford
PantherX wrote:However, the backlog of WUs is processed "slowly" i.e. it takes a couple of updates to clear the backlog.
I don't dispute the accuracy of anything I'm being told, but this morning at 09:00Z the server stats gave the total of unprocessed WUs as ~12,600, currently (22:00Z) it's 14,622.

Shouldn't it be going down, not up?


edit- I'm assuming that "received" implies "not yet processed into stats". If I'm wrong then I apologise for the noise I've created and I'll shut up.

Re: Stats backlog

Posted: Sat Jan 11, 2014 10:31 pm
by bollix47
billford wrote:Shouldn't it be going down, not up?
Not necessarily so ... each time a WU is returned a new one is created. So that alone will keep the count fairly constant. Also, there are new projects added which would in fact increase the number of WUs.

Re: Stats backlog

Posted: Sat Jan 11, 2014 10:33 pm
by billford
bollix47 wrote:Also, there are new projects added which would in fact increase the number of WUs.
The number of received WUs?

Re: Stats backlog

Posted: Sat Jan 11, 2014 10:41 pm
by bollix47
Sorry, I posted before I saw your edit.

The number of received WUs is only since the last stats update (i.e. normally the number of WUs returned since shortly after the top of the hour). It could be more or less than the last time the stats were run. Not all servers actually report those figures so the validity of the total is a bit questionable. Although it is correct for the figures shown, detail for some servers is missing so the Totals figure for that column does not represent a true total of all the work units received ... just for what is showing. For example, if you look at 171.64.65.69 you won't see a figure in that column but we know there are a lot of core_17 work units returned every hour to that server. On a server where this feature is working you'll see the number increase throughout the hour following the last stats update until a new stats update ... then the count starts over. Here is an example from the log of 171.64.65.124:

Code:

Sat Jan 11 15:10:11 PST 2014	171.64.65.124	vspg14e	sryckbos	SMP	full	Accepting	4.18	42	4		50541	42976	0	5895	5895	5895	63	-	-	-	-	236	1	-	-	1		-	0	1	WL; X;	10000, 10000	6.34, 7.00	5, 5	10000, 10000	64, 64	-	1, 1	-	F, A, F, A	8080, 8080	
Sat Jan 11 15:30:11 PST 2014	171.64.65.124	vspg14e	sryckbos	SMP	full	Accepting	4.26	41	4		50541	42976	0	5932	5932	5932	67	-	-	-	-	613	1	-	-	1		-	0	1	WL; X;	10000, 10000	6.34, 7.00	5, 5	10000, 10000	64, 64	-	1, 1	-	F, A, F, A	8080, 8080	
Sat Jan 11 15:50:11 PST 2014	171.64.65.124	vspg14e	sryckbos	SMP	full	Accepting	3.36	44	9		50541	42975	1	5968	5968	5968	62	-	-	-	-	978	1	-	-	1		-	0	1	WL; X;
At 10 minutes past the hour the count is 236, at 30 minutes past the hour it is 613 and finally at 50 minutes the count is 978. If you look at the log you'll see the pattern repeating and, depending on the timing, the total could be higher or lower than the last time you looked.
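
As a quick check on those three samples, this sketch works out the average rate at which the counter climbs between them. The (minute, count) pairs are taken from the log lines above; the function name is just for illustration.

```python
# (minutes past the hour, WUs received since the last stats update)
SAMPLES = [(10, 236), (30, 613), (50, 978)]


def rates_per_minute(samples):
    """Average WUs received per minute between consecutive samples."""
    return [
        (c2 - c1) / (m2 - m1)
        for (m1, c1), (m2, c2) in zip(samples, samples[1:])
    ]


if __name__ == "__main__":
    for rate in rates_per_minute(SAMPLES):
        print(f"{rate:.2f} WUs per minute")
```

That works out to somewhere between 18 and 19 WUs per minute on this one server, so the figure you see depends mostly on how long ago the last stats update ran.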

Re: Stats backlog

Posted: Sat Jan 11, 2014 10:47 pm
by billford
bollix47 wrote:Not all servers actually report those figures so the validity of the total is a bit questionable.
Ah, OK… serves me right for believing what I read on the internet, I should know better by now.

I'll shut up forthwith :wink:

Re: Stats backlog

Posted: Sun Jan 12, 2014 4:46 am
by 7im
The validity is NOT in question, only that not all the factors have been considered when drawing conclusions, as I tried to say earlier.

Re: Stats backlog

Posted: Sun Jan 12, 2014 4:36 pm
by bcavnaugh
I don't think our Points or Status page has updated in almost 3 days.
From http://folding.extremeoverclocking.com/ ... =&u=638326

Code:

24 264 1,439,478 1,298,147 246,666 246,666 186,306,868 9,187 04.27.13 
From http://fah-web2.stanford.edu/cgi-bin/ma ... num=111065

Code:

25   bcavnaugh    182730886    9092  
This looks correct so maybe it is only the Team page above.
From http://fah-web.stanford.edu/cgi-bin/mai ... =bcavnaugh

Code:

Date of last work unit  2014-01-12 08:07:09   
Total score  186342277  
Overall rank (if points are combined)  243 of 1718776  
Active clients (within 50 days)  85  
Active clients (within 7 days)  24  
Looks like it is fixed now.