Page 1 of 3

Problem with stats. [Lots of 0 point WUs]

Posted: Sat Oct 06, 2012 6:09 pm
by Tobit
Well, looks like we have another server issue already this weekend. The stats server appears to be spitting out garbage.

Over the last four updates, multiple teams have noted abnormally high numbers of completed WUs. This is not limited to one team. For example:

EVGA usually completes 1000 to 1200 work units per update. The last four updates show numbers of: 2,871 - 8,406 - 11,745 - 11,184

[H]ardOCP usually completes 400 to 600 units per update. The last four updates show numbers of: 2,198 - 4,604 - 5,246 - 5,117

Both teams have already completed more WUs today than they usually do in one day. However, the points are NORMAL and not affected by this abnormality. A quick look at other teams show this anomaly as well.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 6:28 pm
by P5-133XL
When stuff like this has happened before it was from re-credits. I have no idea about this specific hiccup though.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 6:32 pm
by Tobit
P5-133XL wrote:When stuff like this has happened before it was from re-credits. I have no idea about this specific hiccup though.
Yeah, I was thinking that but the points appear normal.. there is no apparent spike in points per update. Very odd.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 6:39 pm
by Joe_H
I am not seeing this in all teams, are they getting a large number of small WU's instead of fewer large ones recently? Perhaps just a hiccup, or more information may come out later.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 6:45 pm
by Tobit
Joe_H wrote:I am not seeing this in all teams, are they getting a large number of small WU's instead of fewer large ones recently?
There would have to be a lot of small units suddenly assigned to see the spikes shown. It is more noticeable on larger teams but I've seen a few smaller teams with this anomaly as well. There are some SMP units that complete very quickly but, in the 5 years I've been with my team, I don't think I've ever seen a spike like this.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 6:50 pm
by bruce
Thanks for the report. The Pande Group is aware of the information and will be looking into it.
Tobit wrote:
Joe_H wrote:I am not seeing this in all teams, are they getting a large number of small WU's instead of fewer large ones recently?
There would have to be a lot of small units suddenly assigned to see the spikes shown. It is more noticeable on larger teams but I've seen a few smaller teams with this anomaly as well. There are some SMP units that complete very quickly but, in the 5 years I've been with my team, I don't think I've ever seen a spike like this.
If you're one of the people who personally experienced this spike, what type of projects were involved (i.e.- what are your client's preference settings?)

Re: Problem with stats server

Posted: Sat Oct 06, 2012 7:07 pm
by Tobit
bruce wrote:If you're one of the people who personally experienced this spike, what type of projects were involved (i.e.- what are your client's preference settings?)
One of our users, Schro, says he's been getting assigned quite a few p8001 SMP units that fold really fast on his hardware. He's completed 21 units over the past four updates.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 7:13 pm
by Tobit
Ahh, I have found a better example..

User kranium- I am trying to look at his logs.

235 WUs for 0 points
239 for 938 points
239 for 7889 points
176 for 26,926 points

Re: Problem with stats server

Posted: Sat Oct 06, 2012 7:20 pm
by P5-133XL
That is definitely not happening to me, but mine are all GPU WU's: p8018's. What Kranium is seeing is quite significant.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 7:25 pm
by bruce
Tobit wrote:Ahh, I have found a better example..
The reason I said if YOU experienced the problem, please report. Please search your logs and explain whatever you can learn about the 235 WUs for 0 points. By itself, that information doesn't add anything that wasn't already known -- except perhaps that 'kranium' might have unstable hardware.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 8:16 pm
by Tobit
Sorry Bruce. Just trying to help users on my team who can't/won't come here. I am trying to get my hands on Kranium's logs, he just posted another 236 units for 0 points. Trying to wake him up to also turn off his box.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 8:50 pm
by Tobit
Looks like the problem is related to the issues with 128.143.199.96. Here is the log file from a fellow user, KMac, whose stats are all messed up as well:

Code: Select all

Arguments: -oneunit -forceasm -smp -bigadv -verbosity 9 

[02:58:57] - Ask before connecting: No
[02:58:57] - User name: KMac (Team 33)
[02:58:57] - User ID: 4XXXXXXXXXXX
[02:58:57] - Machine ID: 2
[02:58:57] 
[02:58:57] Work directory not found. Creating...
[02:58:57] Could not open work queue, generating new queue...
[02:58:57] - Preparing to get new work unit...
[02:58:57] - Autosending finished units... [October 6 02:58:57 UTC]
[02:58:57] Cleaning up work directory
[02:58:57] Trying to send all finished work units
[02:58:57] + No unsent completed units remaining.
[02:58:57] - Autosend completed
[02:58:57] + Attempting to get work packet
[02:58:57] Passkey found
[02:58:57] - Will indicate memory of 8183 MB
[02:58:57] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 12, Stepping: 2
[02:58:57] - Connecting to assignment server
[02:58:57] Connecting to http://assign.stanford.edu:8080/
[02:58:57] Posted data.
[02:58:57] Initial: 8F80; - Successful: assigned to (128.143.199.96).
[02:58:57] + News From Folding@Home: Welcome to Folding@Home
[02:58:58] Loaded queue successfully.
[02:58:58] Sent data
[02:58:58] Connecting to http://128.143.199.96:8080/
[02:58:59] Posted data.
[02:58:59] Initial: 0000; - Receiving payload (expected size: 1763751)
[02:59:03] - Downloaded at ~430 kB/s
[02:59:03] - Averaged speed for that direction ~430 kB/s
[02:59:03] + Received work.
[02:59:03] + Closed connections
[02:59:03] 
[02:59:03] + Processing work unit
[02:59:03] Core required: FahCore_a3.exe
[02:59:03] Core found.
[02:59:03] Working on queue slot 01 [October 6 02:59:03 UTC]
[02:59:03] + Working ...
[02:59:03] - Calling '.\FahCore_a3.exe -dir work/ -nice 19 -suffix 01 -np 12 -nocpulock -checkpoint 3 -forceasm -verbose -lifeline 5016 -version 634'

[02:59:03] 
[02:59:03] *------------------------------*
[02:59:03] Folding@Home Gromacs SMP Core
[02:59:03] Version 2.27 (Dec. 15, 2010)
[02:59:03] 
[02:59:03] Preparing to commence simulation
[02:59:03] - Assembly optimizations manually forced on.
[02:59:03] - Not checking prior termination.
[02:59:04] - Expanded 1763239 -> 1944296 (decompressed 110.2 percent)
[02:59:04] Called DecompressByteArray: compressed_data_size=1763239 data_size=1944296, decompressed_data_size=1944296 diff=0
[02:59:04] - Digital signature verified
[02:59:04] 
[02:59:04] Project: 6947 (Run 0, Clone 143, Gen 189)
[02:59:04] 
[02:59:04] Assembly optimizations on if available.
[02:59:04] Entering M.D.
[02:59:10] Mapping NT from 12 to 12 
[02:59:10] Completed 0 out of 500000 steps  (0%)
[03:01:24] Completed 5000 out of 500000 steps  (1%)
--snip--
[06:42:53] Completed 500000 out of 500000 steps  (100%)
[06:42:54] DynamicWrapper: Finished Work Unit: sleep=10000
[06:43:04] 
[06:43:04] Finished Work Unit:
[06:43:04] - Reading up to 3694992 from "work/wudata_01.trr": Read 3694992
[06:43:04] trr file hash check passed.
[06:43:04] edr file hash check passed.
[06:43:04] logfile size: 61055
[06:43:04] Leaving Run
[06:43:07] - Writing 3790983 bytes of core data to disk...
[06:43:07] Done: 3790471 -> 3516790 (compressed to 92.7 percent)
[06:43:07]   ... Done.
[06:43:07] - Shutting down core
[06:43:07] 
[06:43:07] Folding@home Core Shutdown: FINISHED_UNIT
[06:43:11] CoreStatus = 64 (100)
[06:43:11] Unit 1 finished with 96 percent of time to deadline remaining.
[06:43:11] Updated performance fraction: 0.961088
[06:43:11] Sending work to server
[06:43:11] Project: 6947 (Run 0, Clone 143, Gen 189)


[06:43:11] + Attempting to send results [October 6 06:43:11 UTC]
[06:43:11] - Reading file work/wuresults_01.dat from core
[06:43:11]   (Read 3517302 bytes from disk)
[06:43:11] Connecting to http://128.143.199.96:8080/
[06:43:12] - Couldn't send HTTP request to server
[06:43:12] + Could not connect to Work Server (results)
[06:43:12]     (128.143.199.96:8080)
[06:43:12] + Retrying using alternative port
[06:43:12] Connecting to http://128.143.199.96:80/
[06:43:13] - Couldn't send HTTP request to server
[06:43:13] + Could not connect to Work Server (results)
[06:43:13]     (128.143.199.96:80)
[06:43:13] - Error: Could not transmit unit 01 (completed October 6) to work server.
[06:43:13] - 1 failed uploads of this unit.
[06:43:13]   Keeping unit 01 in queue.

[19:11:38] + Attempting to send results [October 6 19:11:38 UTC]
[19:11:38] - Reading file work/wuresults_01.dat from core
[19:11:38]   (Read 3517302 bytes from disk)
[19:11:38] Connecting to http://130.237.165.141:8080/
[19:12:07] Posted data.
[19:32:07] Initial: 0000; + Could not connect to Work Server (results)
[19:52:07]     (130.237.165.141:8080)
[19:52:07] + Retrying using alternative port
[19:52:07] Connecting to http://130.237.165.141:80/
[19:52:45] Posted data.
[19:52:45] Initial: 0000; - Server does not have record of this unit. Will try again later.
[19:52:45]   Could not transmit unit 01 to Collection server; keeping in queue.
[19:52:45] + Sent 0 of 1 completed units to the server
[19:53:15] Trying to send all finished work units
[19:53:15] Project: 6947 (Run 0, Clone 143, Gen 189)

[19:53:15] + Attempting to send results [October 6 19:53:15 UTC]
[19:53:15] - Reading file work/wuresults_01.dat from core
[19:53:15]   (Read 3517302 bytes from disk)
[19:53:15] Connecting to http://128.143.199.96:8080/
[19:53:16] - Couldn't send HTTP request to server
[19:53:16] + Could not connect to Work Server (results)
[19:53:16]     (128.143.199.96:8080)
[19:53:16] + Retrying using alternative port
[19:53:16] Connecting to http://128.143.199.96:80/
[19:53:18] - Couldn't send HTTP request to server
[19:53:18] + Could not connect to Work Server (results)
[19:53:18]     (128.143.199.96:80)
[19:53:18] - Error: Could not transmit unit 01 (completed October 6) to work server.
[19:53:18] - 485 failed uploads of this unit.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 8:53 pm
by orion
I haven't had a spike in WU's but I did get one WU for no points on my 3am update.

The only WU that I would have finished for that time period is p6958 (0,239,102) It has tried to upload 9 times but hasn't yet.

Here's the last upload attempt from about 1 hour ago as of this post.

Code: Select all

[19:57:05] + Attempting to send results [October 6 19:57:05 UTC]
[19:57:05] - Reading file work/wuresults_05.dat from core
[19:57:05]   (Read 3522087 bytes from disk)
[19:57:05] Connecting to http://128.143.199.96:8080/
[19:57:05] - Couldn't send HTTP request to server
[19:57:05] + Could not connect to Work Server (results)
[19:57:05]     (128.143.199.96:8080)
[19:57:05] + Retrying using alternative port
[19:57:05] Connecting to http://128.143.199.96:80/
[19:57:05] - Couldn't send HTTP request to server
[19:57:05] + Could not connect to Work Server (results)
[19:57:05]     (128.143.199.96:80)
[19:57:05] - Error: Could not transmit unit 05 (completed October 6) to work server.
[19:57:05] - 9 failed uploads of this unit.


[19:57:05] + Attempting to send results [October 6 19:57:05 UTC]
[19:57:05] - Reading file work/wuresults_05.dat from core
[19:57:05]   (Read 3522087 bytes from disk)
[19:57:05] Connecting to http://130.237.165.141:8080/
[19:58:33] Posted data.
[19:58:33] Initial: 0000; - Uploaded at ~39 kB/s
[19:58:33] - Averaged speed for that direction ~38 kB/s
[19:58:33] - Server does not have record of this unit. Will try again later.
[19:58:33]   Could not transmit unit 05 to Collection server; keeping in queue.

Re: Problem with stats server

Posted: Sat Oct 06, 2012 8:58 pm
by Tobit
In the log file I posted, I suspect what is happening is:

Client tries to upload to 128.143.199.96
Server has been in and out of reject state for awhile
Client makes a long enough connection that the WS thinks it receives a WU and records 0 credit or partial credit
Client ultimately timesout transmitting results
Keeps in queue to try again
Rinse - Repeat

Re: Problem with stats server

Posted: Sat Oct 06, 2012 9:00 pm
by bruce
The problem with 128.143.199.96 in REJECT and the follow-on problem of the Collection Server not having a record of the WU is being discussed here. Those units are NOT being credited, so I'm confident that it's an entirely different problem. Getting 0 points and counting one more WU means that data IS getting to the stats server, even if it's zero points.

Project: 6947 (Run 0, Clone 143, Gen 189) has not been credited, which is another reason to believe they're separate problems.

I've changed the title. It looks like the stats SERVER is properly crediting every stats record that's being generated by the Work Servers.