
Re: 20210206 Missing Work?

Posted: Tue Feb 09, 2021 6:21 pm
by Neil-B
Welcome to the forums :)

Some catch-up may be happening. However, with the way things have been lately, it might be a while before it all properly stabilises, so be ready for a bit more lumpiness in the stats graphs. I may be wrong, but it's worth being prepared ;)

Re: 20210206 Missing Work?

Posted: Tue Feb 09, 2021 9:17 pm
by bollix47
Please check your points ... there was a rather large jump in the 3PM update at EOC :D

Re: 20210206 Missing Work?

Posted: Tue Feb 09, 2021 11:20 pm
by cine.chris
17M points arrived.
Currently running at 14M PPD, so current production is 14/8 ≈ 1.75M.
============================================
I guess the only way to monitor this effectively is to finish building my own client logger and catch discrepancies earlier.
The critical test is to validate that the WU count matches.
I've tested all the pieces needed for that and am already processing the FAH daily actives, so my daily point/WU credit is already being captured.
280232475314602 cine.chris 6744169 63 2694858926 17452 257944
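
The WU-count check described above could look roughly like the sketch below. It is only an illustration under assumptions: the function name, the two dictionaries and the ISO date keys are hypothetical, and it presumes you already have a per-day count of uploaded WUs from your own logs plus a cumulative credited-WU total from each daily stats snapshot.

```python
def wu_shortfall_by_day(logged_per_day, credited_total_by_day):
    """Compare locally logged WUs with the day-over-day change in credited WUs.

    logged_per_day:        {"YYYY-MM-DD": WUs uploaded that day, from client logs}
    credited_total_by_day: {"YYYY-MM-DD": cumulative credited WU count that day}
    Returns {"YYYY-MM-DD": logged minus newly credited}; a positive value means
    some uploaded WUs have not shown up in the stats yet.
    """
    days = sorted(credited_total_by_day)  # ISO dates sort chronologically
    shortfall = {}
    for prev, day in zip(days, days[1:]):
        newly_credited = credited_total_by_day[day] - credited_total_by_day[prev]
        shortfall[day] = logged_per_day.get(day, 0) - newly_credited
    return shortfall
```

A negative value on a later day would suggest a catch-up "drop" like the one mentioned earlier in the thread. Snapshot timing and log timing rarely line up exactly, so small day-to-day differences are expected.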

Re: 20210206 Missing Work?

Posted: Wed Feb 10, 2021 5:11 am
by UofM.MartinK
Hope this post helps visualize which WS/CS process their WUs, and which don't/didn't, and since when.

For that, I added api-wu-crawling to my log analyzer and added a new type of output table:

Code:

  WS/CS-IP       |  #acct'd  #newest-accounted-PRCG                  DATE   TIME   |  #pending  #oldest-pending-PRCG                   DATE   TIME
128.174.73.74    |  45       Project:14916,Run:0,Clone:620,Gen:92    02/09  06:23  |  -         -                                      -      -
128.252.203.10   |  6        Project:14521,Run:0,Clone:396,Gen:256   02/07  20:50  |  -         -                                      -      -
128.252.203.11   |  1        Project:17219,Run:493,Clone:1,Gen:0     02/06  04:46  |  -         -                                      -      -
128.252.203.9    |  94       Project:17428,Run:0,Clone:2516,Gen:229  02/09  22:53  |  -         -                                      -      -
129.32.209.201   |  34       Project:16927,Run:22,Clone:1482,Gen:12  01/31  08:38  |  130       Project:16927,Run:17,Clone:591,Gen:45  01/18  15:07
129.32.209.203   |  117      Project:16933,Run:22,Clone:98,Gen:287   02/09  23:40  |  -         -                                      -      -
140.163.4.200    |  28       Project:17328,Run:3,Clone:229,Gen:14    02/09  23:41  |  -         -                                      -      -
140.163.4.210    |  32       Project:17800,Run:68,Clone:42,Gen:44    02/09  21:46  |  -         -                                      -      -
155.247.166.219  |  -        -                                       -      -      |  14        Project:14188,Run:7,Clone:1469,Gen:1   01/19  09:39
18.188.125.154   |  30       Project:13445,Run:6576,Clone:36,Gen:1   02/09  16:03  |  -         -                                      -      -
206.223.170.146  |  44       Project:17425,Run:0,Clone:1266,Gen:132  02/09  22:40  |  -         -                                      -      -
66.170.111.50    |  77       Project:17415,Run:0,Clone:475,Gen:524   02/09  20:25  |  -         -                                      -      -

For the table above, my script scanned all log lines between January 18th and February 9th (UTC) from 8 of my folding systems for successfully submitted WUs. It then queried https://apps.foldingathome.org/wu for each WU's PRCG to check whether my username was listed for that PRCG.

If it was, that WU is counted as "accounted for". The number of accounted-for WUs per WS/CS is in column 2 ("#acct'd"); for reference, the PRCG and date/time of the newest WU accounted for by that WS/CS is also shown.

If it wasn't, that WU is counted as "pending". The number of pending WUs per WS/CS is in column 6 ("#pending"); for reference, the PRCG and date/time of the oldest "pending" WU for that WS/CS is also shown.
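
The accounted-for / pending tally described above can be sketched in a few lines of Python. This is not the actual script: `tally_by_server` and the `is_credited` callable are hypothetical names, and the lookup against the WU page is deliberately left as a function you supply, since the exact query format is not shown here.

```python
from collections import defaultdict


def tally_by_server(submitted, is_credited):
    """Count accounted-for and pending WUs per WS/CS IP.

    submitted:   iterable of (server_ip, prcg) pairs taken from client logs,
                 where prcg is a (project, run, clone, gen) tuple
    is_credited: hypothetical callable(prcg) -> bool; in practice it would
                 look the PRCG up and check whether the username is listed
    """
    counts = defaultdict(lambda: {"accounted": 0, "pending": 0})
    for server_ip, prcg in submitted:
        key = "accounted" if is_credited(prcg) else "pending"
        counts[server_ip][key] += 1
    return dict(counts)


# Usage with three PRCGs from the table and a stand-in lookup:
credited = {(16927, 22, 1482, 12)}
submitted = [
    ("129.32.209.201", (16927, 22, 1482, 12)),
    ("129.32.209.201", (16927, 17, 591, 45)),
    ("155.247.166.219", (14188, 7, 1469, 1)),
]
print(tally_by_server(submitted, lambda prcg: prcg in credited))
```

Keeping the lookup as a parameter makes it easy to test the counting logic offline and to throttle or cache the real queries separately.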

It can be seen that two CS/WS are still backlogged, even after the latest "drop" 8 hours ago:
129.32.209.201 (vav18.fah.temple.edu | voelz)
155.247.166.219 (vav3.ocis.temple.edu | voelz)


155.247.166.219 just didn't stat-account any WU for me (and presumably anybody else) since at least January 19, which is the "normal" scenario for backlogged stats and will most likely rectify itself once that machine's backlog is taken care of.

129.32.209.201 tells a different, more interesting story: it did account for a seemingly random subset of "my" WUs between at least January 18th and January 31st, and then stopped accounting for any of "my" WUs after January 31st, 08:38am. Is there a good hypothesis for what can cause a behavior where just a seemingly random (in time) subset of WUs is stats-processed? (All submitted to the same CS/WS IP!)

Re: 20210206 Missing Work?

Posted: Thu Feb 11, 2021 4:34 am
by cine.chris
UofM.MartinK wrote: Hope this post helps visualize which WS/CS process their WUs, and which don't/didn't, and since when.

It can be seen that two CS/WS are still backlogged, even after the latest "drop" 8 hours ago:
129.32.209.201 (vav18.fah.temple.edu | voelz)
155.247.166.219 (vav3.ocis.temple.edu | voelz)


155.247.166.219 just didn't stat-account any WU for me (and presumably anybody else) since at least January 19, which is the "normal" scenario for backlogged stats and will most likely rectify itself once that machine's backlog is taken care of.

Thanks, Martin, for sharing this.
Interesting data.
I think it's important to see that everyone is fully credited for their work.
The other nemesis, waiting for work, has returned too, but it's not a serious problem. Hopefully, there are enough quality projects to keep the current work capacity busy and help sustain the science.
That said, I see two GPUs waiting for work...

Re: 20210206 Missing Work?

Posted: Thu Feb 11, 2021 9:18 am
by ajm
I just looked at one log (one machine with 2 GPUs), starting about two hours before now and moving backward in time.
The servers 140.163.4.200 and 140.163.4.210 still don't seem to connect to the stats. 140.163.4.200 also gives a lot of errors when assigning work.

Not found:
project:17324 run:0 clone:2277 gen:12 - 140.163.4.200
project:17800 run:4 clone:31 gen:37 - 140.163.4.210
project:17800 run:0 clone:87 gen:42 - 140.163.4.210
project:17800 run:53 clone:57 gen:24 - 140.163.4.210
project:17800 run:60 clone:45 gen:29 - 140.163.4.210
project:17800 run:14 clone:83 gen:11 - 140.163.4.210
project:17800 run:0 clone:19 gen:23 - 140.163.4.210
project:17800 run:61 clone:128 gen:7 - 140.163.4.210
project:17800 run:3 clone:116 gen:7 - 140.163.4.210
project:17800 run:46 clone:114 gen:5 - 140.163.4.210

OK:
project:17432 run:0 clone:1885 gen:90 - 206.223.170.146
project:17432 run:0 clone:171 gen:119 - 206.223.170.146
project:17432 run:0 clone:331 gen:109 - 206.223.170.146
project:17431 run:0 clone:1696 gen:102 - 206.223.170.146

Re: 20210206 Missing Work?

Posted: Thu Feb 11, 2021 2:14 pm
by DOMiNiON79
Same here: I've been folding all day on Project 17800 and no stats are reported. The upload log looks OK, for instance:

Code:

14:00:44:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
14:00:44:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:17800 run:50 clone:34 gen:36 core:0x22 unit:0x00000022000000240000458800000032
14:00:44:WU00:FS01:Uploading 3.30MiB to 140.163.4.210
14:00:44:WU00:FS01:Connecting to 140.163.4.210:8080
14:00:45:WU01:FS01:Connecting to assign1.foldingathome.org:80
14:00:45:WU01:FS01:Assigned to work server 140.163.4.210
14:00:45:WU01:FS01:Requesting new work unit for slot 01: gpu:1:0 GP106 [GeForce GTX 1060 6GB] 4372 from 140.163.4.210
14:00:45:WU01:FS01:Connecting to 140.163.4.210:8080
14:00:46:ERROR:WU01:FS01:Exception: Server did not assign work unit
14:01:05:WU00:FS01:Upload complete
14:01:05:WU00:FS01:Server responded WORK_ACK (400)
14:01:05:WU00:FS01:Final credit estimate, 58509.00 points
14:01:05:WU00:FS01:Cleaning up
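
For anyone who wants to pull the server IP and PRCG of each acknowledged upload out of a log like the one above, here is a minimal Python sketch. The regular expressions are modeled only on the lines in this excerpt, so other client versions may need different patterns, and `submitted_wus` is a made-up name rather than part of any existing tool.

```python
import re

# Patterns modeled on the log excerpt above (an assumption, not a spec).
SEND_RE = re.compile(
    r":(WU\d+):FS\d+:Sending unit results:.*?"
    r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)"
)
UPLOAD_RE = re.compile(r":(WU\d+):FS\d+:Uploading \S+ to (\d+\.\d+\.\d+\.\d+)")
ACK_RE = re.compile(r":(WU\d+):FS\d+:Server responded WORK_ACK")


def submitted_wus(lines):
    """Yield (server_ip, (project, run, clone, gen)) per acknowledged upload."""
    in_flight = {}  # queue id such as "WU00" -> {"prcg": ..., "server": ...}
    for line in lines:
        if m := SEND_RE.search(line):
            prcg = tuple(int(g) for g in m.groups()[1:])
            in_flight[m.group(1)] = {"prcg": prcg, "server": None}
        elif m := UPLOAD_RE.search(line):
            if m.group(1) in in_flight:
                in_flight[m.group(1)]["server"] = m.group(2)
        elif m := ACK_RE.search(line):
            # Only uploads the server acknowledged count as submitted.
            wu = in_flight.pop(m.group(1), None)
            if wu and wu["server"]:
                yield wu["server"], wu["prcg"]
```

Tracking state per queue id (WU00, WU01, ...) matters because lines from different work units interleave, as the WU01 assignment attempt in the middle of the WU00 upload shows.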

Re: 20210206 Missing Work?

Posted: Thu Feb 11, 2021 2:50 pm
by UofM.MartinK
Same here, 140.163.4.200 and 140.163.4.210 seem to have new trouble.

Here are the last WUs credited and the first pending (not-yet-credited) ones:

Code:

WS/CS-IP         |   #newest-accounted-PRCG                  DATE   TIME   |  #pending  #oldest-pending-PRCG                    DATE   TIME
140.163.4.200    |   Project:17315,Run:0,Clone:2448,Gen:11   02/10  22:25  |  4         Project:17313,Run:0,Clone:1786,Gen:226  02/10  00:26
140.163.4.210    |   Project:17800,Run:36,Clone:117,Gen:3    02/10  23:29  |  8         Project:17800,Run:13,Clone:15,Gen:36    02/11  00:31

And while we're at it, here is a different visualization of the backlogged WUs for the 4 WS/CS currently having trouble, starting January 10th:

Code:

WS/CS-IP                    |- January 10th                                                              Now(February 11th)-|
140.163.4.200    #credited  .  .  1  1  .  .  .  .  1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  3  1  9  .  7  6  +  .
140.163.4.200    #pending   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  3
---
140.163.4.210    #credited  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  2  4  7  +  9  3  .
140.163.4.210    #pending   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  8

129.32.209.201   #credited  3  2  .  .  .  .  .  4  2  5  5  7  7  4  1  .  .  .  .  .  .  3  .  .  .  .  .  .  .  .  .  .  .
129.32.209.201   #pending   2  3  1  6  .  4  3  1  1  .  .  .  .  1  3  .  4  8  +  5  6  8  7  3  +  9  8  +  +  7  +  +  4
---
155.247.166.219  #credited  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
155.247.166.219  #pending   .  .  .  .  .  .  .  .  .  3  1  .  1  1  .  .  .  .  .  1  .  .  .  1  .  .  1  .  3  .  2  .  1

Each single-digit number represents the # of WUs credited (top line) or still pending (bottom line) for a single day ("+" represents >10 WUs).
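
The row encoding used in that chart is simple to reproduce. Below is a small Python sketch, with two assumptions of mine: a day with no WUs is drawn as ".", as the rows above suggest, and a count of exactly 10 is drawn as "+" as well, since it cannot be shown as a single digit. `render_row` is just an illustrative name.

```python
def render_row(daily_counts):
    """Render one symbol per day: '.' for zero, the digit for 1-9, '+' above 9."""
    def symbol(n):
        if n == 0:
            return "."
        return str(n) if n <= 9 else "+"

    # Two spaces between symbols, matching the spacing of the chart rows.
    return "  ".join(symbol(n) for n in daily_counts)


print(render_row([0, 0, 1, 12, 9]))
```

Printing one such row for credited WUs and one for pending WUs per server gives the paired-line layout shown in the chart.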

Re: 20210206 Missing Work?

Posted: Thu Feb 11, 2021 7:54 pm
by UofM.MartinK
140.163.4.200 and 140.163.4.210 are OK again, no backlog anymore, nice little spike in the stats.

That makes 129.32.209.201 and 155.247.166.219 the only WS/CS with backlogged stats for now.

Re: 20210206 Missing Work?

Posted: Sat Mar 06, 2021 1:02 am
by comixgoddess
UofM.MartinK wrote: That makes 129.32.209.201 and 155.247.166.219 the only WS/CS with backlogged stats for now.

Same here; those are the only two servers that I have a backlog of stats from.