20210206 Missing Work?

Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: 20210206 Missing Work?

Post by Neil-B »

Welcome to the forums :)

Some catch-up may be happening. However, with the way things have been lately, it might be a while before it all properly stabilises, so be ready for a bit more lumpiness in the stats graphs. I may be wrong, but it's worth being prepared ;)
bollix47
Posts: 2957
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: 20210206 Missing Work?

Post by bollix47 »

Please check your points ... there was a rather large jump in the 3PM update at EOC :D
cine.chris
Posts: 78
Joined: Sun Apr 26, 2020 1:29 pm

Re: 20210206 Missing Work?

Post by cine.chris »

17M points arrived.
Currently running at 14M PPD; 14M/8 ≈ 1.75M is current production per stats update.
============================================
I guess the only way to effectively monitor this is to finish building my own client logger & catch discrepancies earlier.
The critical test is to validate that the WU count matches.
I've tested all the pieces needed to do that and am already processing the FAH daily actives, so getting my daily point/WU credit is already done.
280232475314602 cine.chris 6744169 63 2694858926 17452 257944
UofM.MartinK
Posts: 59
Joined: Tue Apr 07, 2020 8:53 pm

Re: 20210206 Missing Work?

Post by UofM.MartinK »

Hope this post helps visualize which WS/CS process their WUs, and which don't/didn't, and since when.

For that, I added api-wu-crawling to my log analyzer and added a new type of output table:

Code:

  WS/CS-IP       |  #acct'd  #newest-accounted-PRCG                  DATE   TIME   |  #pending  #oldest-pending-PRCG                   DATE   TIME
128.174.73.74    |  45       Project:14916,Run:0,Clone:620,Gen:92    02/09  06:23  |  -         -                                      -      -
128.252.203.10   |  6        Project:14521,Run:0,Clone:396,Gen:256   02/07  20:50  |  -         -                                      -      -
128.252.203.11   |  1        Project:17219,Run:493,Clone:1,Gen:0     02/06  04:46  |  -         -                                      -      -
128.252.203.9    |  94       Project:17428,Run:0,Clone:2516,Gen:229  02/09  22:53  |  -         -                                      -      -
129.32.209.201   |  34       Project:16927,Run:22,Clone:1482,Gen:12  01/31  08:38  |  130       Project:16927,Run:17,Clone:591,Gen:45  01/18  15:07
129.32.209.203   |  117      Project:16933,Run:22,Clone:98,Gen:287   02/09  23:40  |  -         -                                      -      -
140.163.4.200    |  28       Project:17328,Run:3,Clone:229,Gen:14    02/09  23:41  |  -         -                                      -      -
140.163.4.210    |  32       Project:17800,Run:68,Clone:42,Gen:44    02/09  21:46  |  -         -                                      -      -
155.247.166.219  |  -        -                                       -      -      |  14        Project:14188,Run:7,Clone:1469,Gen:1   01/19  09:39
18.188.125.154   |  30       Project:13445,Run:6576,Clone:36,Gen:1   02/09  16:03  |  -         -                                      -      -
206.223.170.146  |  44       Project:17425,Run:0,Clone:1266,Gen:132  02/09  22:40  |  -         -                                      -      -
66.170.111.50    |  77       Project:17415,Run:0,Clone:475,Gen:524   02/09  20:25  |
For the table above, my script analyzed all log lines between January 18th and February 9th (UTC) from 8 of my folding systems for successfully submitted WUs, and queried the PRCG of each WU at https://apps.foldingathome.org/wu to check whether my username was listed for that PRCG.

If it was, that WU is counted as "accounted for", and the number of accounted-for WUs per WS/CS is in column2 ("#acct'd"), and for reference, the PRCG and date/time of the newest WU accounted-for by that WS/CS is also shown.

If it wasn't, that WU is counted as "pending", and the number of pending WUs per WS/CS is in column6 ("#pending"), and for reference, the PRCG and date/time of the oldest "pending" WU by that WS/CS is also shown.
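
A minimal Python sketch of the counting step described above. It is not the actual script: `Upload`, `summarize`, and `is_credited` are hypothetical names, and the lookup against https://apps.foldingathome.org/wu is left as a caller-supplied function because its response format isn't shown in this thread.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class Upload(NamedTuple):
    server: str   # WS/CS IP the WU was uploaded to
    prcg: tuple   # (project, run, clone, gen)
    when: str     # sortable timestamp, e.g. "2021-02-09 06:23"


def summarize(uploads: Iterable[Upload],
              is_credited: Callable[[tuple], bool]) -> Dict[str, dict]:
    """Count accounted-for and pending WUs per server.

    is_credited is an assumed callback that returns True when the
    username is listed for the given PRCG.
    """
    table: Dict[str, dict] = defaultdict(lambda: {
        "accounted": 0, "newest_accounted": None,
        "pending": 0, "oldest_pending": None,
    })
    for up in uploads:
        row = table[up.server]
        if is_credited(up.prcg):
            row["accounted"] += 1
            newest = row["newest_accounted"]
            if newest is None or up.when > newest.when:
                row["newest_accounted"] = up
        else:
            row["pending"] += 1
            oldest = row["oldest_pending"]
            if oldest is None or up.when < oldest.when:
                row["oldest_pending"] = up
    return dict(table)
```

Each resulting entry corresponds to one table row: the two counts plus the newest accounted-for and the oldest pending upload for that server.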

It can be seen that two CS/WS are still backlogged, even after the latest "drop" 8 hours ago:
129.32.209.201 (vav18.fah.temple.edu | voelz)
155.247.166.219 (vav3.ocis.temple.edu | voelz)


155.247.166.219 hasn't stat-accounted any WU for me (and presumably anybody else) since at least January 19, which is the "normal" scenario for backlogged stats and will most likely rectify itself once that machine's backlog is taken care of.

129.32.209.201 tells a different, more interesting story: it accounted for a seemingly random subset of "my" WUs between at least January 18th and January 31st, then stopped accounting for any of "my" WUs after January 31st, 08:38am. Is there a good hypothesis for what could cause behavior where just a seemingly random (in time) subset of WUs is stats-processed? (All were submitted to the same CS/WS IP!)
cine.chris
Posts: 78
Joined: Sun Apr 26, 2020 1:29 pm

Re: 20210206 Missing Work?

Post by cine.chris »

UofM.MartinK wrote:Hope this post helps visualize which WS/CS process their WUs, and which don't/didn't, and since when.

It can be seen that two CS/WS are still backlogged, even after the latest "drop" 8 hours ago:
129.32.209.201 (vav18.fah.temple.edu | voelz)
155.247.166.219 (vav3.ocis.temple.edu | voelz)


155.247.166.219 hasn't stat-accounted any WU for me (and presumably anybody else) since at least January 19, which is the "normal" scenario for backlogged stats and will most likely rectify itself once that machine's backlog is taken care of.
Thanks Martin, for sharing this.
Interesting data.
I think it's important to see that everyone is fully credited for their work.
The other nemesis, waiting for work, has returned too, but it's not a serious problem. Hopefully, there are sufficient quality projects to keep the current work capacity entertained with the opportunity to help sustain the science.
That said, I see two GPUs waiting for work...
ajm
Posts: 750
Joined: Sat Mar 21, 2020 5:22 am
Location: Lucerne, Switzerland

Re: 20210206 Missing Work?

Post by ajm »

I just looked at one log (one machine with 2 GPUs), starting about two hours ago and moving backward in time.
The servers 140.163.4.200 and 140.163.4.210 still don't seem to connect to the stats. 140.163.4.200 also gives a lot of errors when assigning work.

Not found:
project:17324 run:0 clone:2277 gen:12 - 140.163.4.200
project:17800 run:4 clone:31 gen:37 - 140.163.4.210
project:17800 run:0 clone:87 gen:42 - 140.163.4.210
project:17800 run:53 clone:57 gen:24 - 140.163.4.210
project:17800 run:60 clone:45 gen:29 - 140.163.4.210
project:17800 run:14 clone:83 gen:11 - 140.163.4.210
project:17800 run:0 clone:19 gen:23 - 140.163.4.210
project:17800 run:61 clone:128 gen:7 - 140.163.4.210
project:17800 run:3 clone:116 gen:7 - 140.163.4.210
project:17800 run:46 clone:114 gen:5 - 140.163.4.210

OK:
project:17432 run:0 clone:1885 gen:90 - 206.223.170.146
project:17432 run:0 clone:171 gen:119 - 206.223.170.146
project:17432 run:0 clone:331 gen:109 - 206.223.170.146
project:17431 run:0 clone:1696 gen:102 - 206.223.170.146
DOMiNiON79
Posts: 5
Joined: Sat Apr 04, 2020 10:27 am

Re: 20210206 Missing Work?

Post by DOMiNiON79 »

Same here: folding all day on Project 17800 and no stats are reported. The upload log looks OK, for instance:

Code:

14:00:44:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
14:00:44:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:17800 run:50 clone:34 gen:36 core:0x22 unit:0x00000022000000240000458800000032
14:00:44:WU00:FS01:Uploading 3.30MiB to 140.163.4.210
14:00:44:WU00:FS01:Connecting to 140.163.4.210:8080
14:00:45:WU01:FS01:Connecting to assign1.foldingathome.org:80
14:00:45:WU01:FS01:Assigned to work server 140.163.4.210
14:00:45:WU01:FS01:Requesting new work unit for slot 01: gpu:1:0 GP106 [GeForce GTX 1060 6GB] 4372 from 140.163.4.210
14:00:45:WU01:FS01:Connecting to 140.163.4.210:8080
14:00:46:ERROR:WU01:FS01:Exception: Server did not assign work unit
14:01:05:WU00:FS01:Upload complete
14:01:05:WU00:FS01:Server responded WORK_ACK (400)
14:01:05:WU00:FS01:Final credit estimate, 58509.00 points
14:01:05:WU00:FS01:Cleaning up
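
For anyone who wants to cross-check uploads like this one against the stats, here is a rough Python sketch that pulls the PRCG and upload server out of log lines in the format shown above. The regular expressions are inferred only from this excerpt and may not cover every client version or message variant; `acked_uploads` is a hypothetical helper name.

```python
import re

# Patterns inferred from the log excerpt above (an assumption, not a spec).
SEND_RE = re.compile(
    r"^\d\d:\d\d:\d\d:(WU\d+):FS\d+:Sending unit results:.*?"
    r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")
UPLOAD_RE = re.compile(
    r"^\d\d:\d\d:\d\d:(WU\d+):FS\d+:Uploading \S+ to (\S+)")
ACK_RE = re.compile(
    r"^(\d\d:\d\d:\d\d):(WU\d+):FS\d+:Server responded WORK_ACK")


def acked_uploads(lines):
    """Yield (time, server, (project, run, clone, gen)) per acknowledged upload."""
    in_flight = {}  # WU id (e.g. "WU00") -> {"prcg": ..., "server": ...}
    for line in lines:
        m = SEND_RE.match(line)
        if m:
            prcg = tuple(int(g) for g in m.groups()[1:])
            in_flight[m.group(1)] = {"prcg": prcg, "server": None}
            continue
        m = UPLOAD_RE.match(line)
        if m and m.group(1) in in_flight:
            in_flight[m.group(1)]["server"] = m.group(2)
            continue
        m = ACK_RE.match(line)
        if m and m.group(2) in in_flight:
            wu = in_flight.pop(m.group(2))
            yield m.group(1), wu["server"], wu["prcg"]
```

Only uploads that reach WORK_ACK are reported, so a WU that was sent but never acknowledged is not counted as submitted.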
UofM.MartinK
Posts: 59
Joined: Tue Apr 07, 2020 8:53 pm

Re: 20210206 Missing Work?

Post by UofM.MartinK »

Same here, 140.163.4.200 and 140.163.4.210 seem to have new trouble.

Here are the last WUs credited and the first pending (not-yet-credited):

Code:

WS/CS-IP         |   #newest-accounted-PRCG                  DATE   TIME   |  #pending  #oldest-pending-PRCG                    DATE   TIME
140.163.4.200    |   Project:17315,Run:0,Clone:2448,Gen:11   02/10  22:25  |  4         Project:17313,Run:0,Clone:1786,Gen:226  02/10  00:26
140.163.4.210    |   Project:17800,Run:36,Clone:117,Gen:3    02/10  23:29  |  8         Project:17800,Run:13,Clone:15,Gen:36    02/11  00:31
And while we're at it, here is a different visualization of the backlogged WUs for the 4 WS/CS currently having trouble, starting January 10th:

Code:

WS/CS-IP                    |- January 10th                                                              Now(February 11th)-|
140.163.4.200    #credited  .  .  1  1  .  .  .  .  1  .  1  .  .  .  .  .  .  .  .  .  .  .  .  .  .  3  1  9  .  7  6  +  .
140.163.4.200    #pending   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  1  3
---
140.163.4.210    #credited  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  2  4  7  +  9  3  .
140.163.4.210    #pending   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  8

129.32.209.201   #credited  3  2  .  .  .  .  .  4  2  5  5  7  7  4  1  .  .  .  .  .  .  3  .  .  .  .  .  .  .  .  .  .  .
129.32.209.201   #pending   2  3  1  6  .  4  3  1  1  .  .  .  .  1  3  .  4  8  +  5  6  8  7  3  +  9  8  +  +  7  +  +  4
---
155.247.166.219  #credited  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .
155.247.166.219  #pending   .  .  .  .  .  .  .  .  .  3  1  .  1  1  .  .  .  .  .  1  .  .  .  1  .  .  1  .  3  .  2  .  1
Each single-digit number represents the # of WUs credited (top line) or still pending (bottom line) for a single day ("+" represents >10 WUs).
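
A row of that per-day view can be produced with a few lines of Python. This is only an illustrative sketch, not the script that generated the table: `day_row` is a hypothetical name, and it assumes "." for a day with no WUs and "+" for any count that no longer fits in a single digit.

```python
from collections import Counter
from datetime import date, timedelta


def day_row(days, first, last):
    """Render one cell per calendar day from first to last (inclusive).

    days is an iterable of datetime.date values, one per WU.
    """
    counts = Counter(days)
    cells = []
    d = first
    while d <= last:
        n = counts[d]
        # "." = nothing that day, digit = 1..9 WUs, "+" = more than fits
        cells.append("." if n == 0 else str(n) if n <= 9 else "+")
        d += timedelta(days=1)
    return "  ".join(cells)
```

Calling it once with the credited dates and once with the pending dates for the same server gives the two stacked lines shown for each WS/CS.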
UofM.MartinK
Posts: 59
Joined: Tue Apr 07, 2020 8:53 pm

Re: 20210206 Missing Work?

Post by UofM.MartinK »

140.163.4.200 and 140.163.4.210 are OK again, no backlog anymore, nice little spike in the stats.

That makes 129.32.209.201 and 155.247.166.219 the only WS/CS with backlogged stats for now.
comixgoddess
Posts: 85
Joined: Wed Apr 08, 2020 9:57 pm
Location: Pacific Northwest

Re: 20210206 Missing Work?

Post by comixgoddess »

UofM.MartinK wrote:That makes 129.32.209.201 and 155.247.166.219 the only WS/CS with backlogged stats for now.
Same here; those are the only two servers that I have a backlog of stats from.