20210206 Missing Work?
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 78
- Joined: Sun Apr 26, 2020 1:29 pm
Re: 21210206 Missing Work?
It'll be one week tomorrow since the stats server issues started.
I'm still seeing only a fraction of my work logged.
20M PPD translates to 2.5M pts per 3 hours, so I'd reasonably expect to see numbers in the 2-3M range.
In the 3 days before this broke, I had just installed my second 3070 and was averaging ~19M/day.
Part of the frustration is adding a dual-Xeon server to consolidate GPUs, plus an RTX 3070, and then seeing fewer points than before.
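The arithmetic in the post spreads daily PPD evenly over 3-hour stats intervals. A minimal Python sketch of that calculation, assuming points accrue evenly through the day (the 3-hour interval is taken from the post; real stats update timing may differ):

```python
def points_per_interval(ppd: float, interval_hours: float = 3.0) -> float:
    """Expected points per stats interval, assuming points accrue evenly over 24 h."""
    return ppd * interval_hours / 24.0


def fraction_logged(observed_points: float, ppd: float, interval_hours: float = 3.0) -> float:
    """Share of the expected interval points that actually showed up in the stats."""
    return observed_points / points_per_interval(ppd, interval_hours)


if __name__ == "__main__":
    print(points_per_interval(20_000_000))  # 2500000.0
    print(points_per_interval(19_000_000))  # 2375000.0
```

Comparing the observed 3-hour number against this expectation gives the "fraction of my work logged" as a single ratio.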
-
- Posts: 188
- Joined: Fri Jan 04, 2008 11:02 pm
- Hardware configuration: Hewlett-Packard 1494 Win10 Build 1836
GeForce [MSI] GTX 950
Runs F@H Ver7.6.21
[As of Jan 2021] - Location: England
Re: 21210206 Missing Work?
I have had a similar issue; I posted about it on an old thread.
I lost 2 days of work that was accepted as valid but doesn't appear in either the official or the EOC stats.
I corrected it by reinstalling F@H.
Even then, the online control reported differently from Advanced Control until I stopped and restarted it.
I still have the old logs, so I can post some data if it doesn't recover by itself.
Recent cases of data lost at EOC did correct themselves, but I think that was just an issue with their interface.
Re: 21210206 Missing Work?
I don't see updates on the stats page for my account since 2021-02-08 09:57:27. It is now 2021-02-09 14:18:00 JST in Japan.
The 1M+ points my clients have earned since then aren't showing up.
At the time of writing, the page reports:
Date of last Work Unit 2021-02-08 09:57:27
Total score 30,192,984
Total WUs 855
Overall rank (if points are combined) 42,639 of 2,791,739
Active clients (within 50 days) 16
Active clients (within 7 days) 12
stats.foldingathome.org/donor/439946803
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21 - Location: UK
Re: 21210206 Missing Work?
cine.chris wrote: It'll be one week tomorrow that the stats servers issues started.
Just to manage your expectations:
Stats issues are rarely resolved in hours or days ... most take a few weeks ... some have taken months ... but in my experience they have always been sorted out eventually.
Is it right that it should take so long? Probably not, but with the restricted dev effort and the dispersed (physically and organisationally) nature of the FaH infrastructure, this is unfortunately the reality.
Stats issues have been around since the day points were introduced. As far as I can tell, this particular issue may be part of a build-up of problems that possibly started over a month ago, when one set of stats issues was sorted but unfortunately seemed to have a knock-on impact elsewhere. This may make it even more challenging to unpick and resolve. Posting PRCGs helps the team trace and track down the issues ... and a bit of patience helps manage expectations.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
-
- Posts: 64
- Joined: Mon Apr 13, 2020 11:47 am
Re: 21210206 Missing Work?
Started seeing a similar issue yesterday. EOC showed about 4/7 of my usual PPD. Moreover, stats.foldingathome.org hasn't shown an update for me since about 10 AM GMT yesterday.
Re: 21210206 Missing Work?
There seems to be more than one disconnect. If I check some individual work units, they show up. But the main stats page shows that I last turned in a work unit on the 8th, and I've definitely turned in quite a few since then (which I can verify by checking the WUs directly!). So all sorts of things are messed up.
-
- Posts: 30
- Joined: Thu Sep 24, 2020 6:06 pm
- Hardware configuration: iMac 2017 Intel Quad-Core i5 3,4 GHz, 8 GB RAM, Radeon Pro 560 4 GB, typically with the latest macOS update. 5 Raspberry Pi 4B (2 GB).
- Location: Oberhausen, Germany
Re: 21210206 Missing Work?
I'd say the statistics are completely broken, or they were turned off for maintenance / bug fixing. Even Anonymous hasn't uploaded any good work units since yesterday morning: https://stats.foldingathome.org/donor/1437
The good thing is that Anonymous will never, ever complain about it.
My Raspberry Pi folding rack: http://www.anne-emscher.net/fah/
-
- Posts: 78
- Joined: Sun Apr 26, 2020 1:29 pm
Re: 21210206 Missing Work?
Hi Neil-B,
Good to hear from you.
From the point of view of an engineer who was often forced to deal with the vagaries of IT organizations, it appears to be a hard-coded address issue. I heard mention of server transitions.
Perhaps "BIND" would be an appropriate pun for symptoms like this?
It's a fragile architecture, connected like a chain rather than a web.
I've shut down systems until I see this rectified.
I'm currently running at about 40% until this is corrected.
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21 - Location: UK
Re: 21210206 Missing Work?
Yup, fragile, in a bind. Even so, this isn't how it should be done, but it has kind of evolved beyond what it was ever designed for ... but hey, it is what we have. The science will be progressing fine, and the points do always catch up. Shutting down is obviously your choice, but as long as the logs show the work acknowledged and an estimated points value, the science is progressing. What happens is that at some point a points/stats reconstruction is done and a spike, sometimes a really big one, appears, and everything is back to where it should be. Shutting down kit means the science isn't progressed.
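The "work acknowledged plus estimated points" check can be automated with a short Python sketch. The two patterns it looks for, `Server responded WORK_ACK` and `Final credit estimate, N points`, are assumed from FAHClient 7.x logs; check them against your own log before trusting the totals. The sample lines in the usage note are illustrative, not taken from this thread.

```python
import re

# Assumed FAHClient 7.x log patterns; verify them against your own log.
ACK_RE = re.compile(r"(WU\d+):FS\d+:Server responded WORK_ACK")
CREDIT_RE = re.compile(r"(WU\d+):FS\d+:Final credit estimate, ([\d.]+) points")


def acknowledged_credit(lines):
    """Return (acknowledged WU count, summed credit estimate) for a log excerpt.

    A credit estimate only counts when the same WU slot logged WORK_ACK before it.
    """
    acked_slots = set()  # WU slots whose results the server acknowledged
    count, total = 0, 0.0
    for line in lines:
        ack = ACK_RE.search(line)
        if ack:
            acked_slots.add(ack.group(1))
            continue
        credit = CREDIT_RE.search(line)
        if credit and credit.group(1) in acked_slots:
            acked_slots.discard(credit.group(1))
            count += 1
            total += float(credit.group(2))
    return count, total
```

Any iterable of lines works, so an open file can be passed directly, e.g. `with open("log.txt") as f: print(acknowledged_credit(f))` (the `log.txt` path is an assumption about where your client writes its log). Summing the estimates gives a local figure to compare against the stats page while the stats catch up.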
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: 20210206 Missing Work?
When systems that were created 20 years ago by non-programmers reach the point of being fragile and repeatedly failing, there's usually only one alternative: have a professional programmer rewrite them from the ground up. In the meantime, performance continues to degrade until the old system can be replaced.
It looks like we may have reached that point. Of course, none of us sees the big picture. Almost everybody looks at their total points, which is not helpful in identifying a problem that is an aggregate of many small errors plus many small successes, and not particularly useful for identifying a reparable problem or for repairing or replacing the overall system.
If we treat these as problems that may be associated with individual work servers, are there identifiable work servers that ARE working correctly? That may be the first sign that progress is being made.
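Looking per work server rather than at total points could be sketched like this: tally, for each work server, what fraction of its returned WUs actually appear in the stats. The input format and the server addresses (documentation-only 192.0.2.x) are hypothetical; you would build the records yourself from your logs and PRCG lookups.

```python
from collections import defaultdict


def credit_rate_by_server(records):
    """records: iterable of (work_server, found_in_stats) pairs.

    Returns {work_server: fraction of its WUs that appear in the stats}.
    """
    seen = defaultdict(int)
    found = defaultdict(int)
    for server, in_stats in records:
        seen[server] += 1
        if in_stats:
            found[server] += 1
    return {server: found[server] / seen[server] for server in seen}


if __name__ == "__main__":
    # Hypothetical data using documentation-only IP addresses.
    sample = [
        ("192.0.2.10", True),
        ("192.0.2.10", True),
        ("192.0.2.20", False),
        ("192.0.2.20", True),
    ]
    for server, rate in sorted(credit_rate_by_server(sample).items()):
        print(f"{server}: {rate:.0%} credited")
```

A server sitting at 100% while others lag would be exactly the "identifiable work server that IS working correctly".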
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 78
- Joined: Sun Apr 26, 2020 1:29 pm
Re: 20210206 Missing Work?
It appears that the current Linux client, 7.6.21, ignored the specified collection server and returned work to the errant 206.223.170.146.
The PRCG showed 'not found'.
Update:
I watched another WU with the same result: project:17431 run:0 clone:1731 gen:105
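Since posting PRCGs helps the team track these down, a small Python helper can pull the project/run/clone/gen out of a log line in the `project:17431 run:0 clone:1731 gen:105` form shown above. The compact `P… R… C… G…` output format is just one convenient choice, not an official one.

```python
import re
from typing import NamedTuple, Optional

# Matches the "project:N run:N clone:N gen:N" fragment quoted in the post.
PRCG_RE = re.compile(r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")


class PRCG(NamedTuple):
    project: int
    run: int
    clone: int
    gen: int

    def __str__(self) -> str:
        return f"P{self.project} R{self.run} C{self.clone} G{self.gen}"


def parse_prcg(line: str) -> Optional[PRCG]:
    """Return the PRCG found in a log line, or None if the line has none."""
    match = PRCG_RE.search(line)
    if match is None:
        return None
    return PRCG(*(int(group) for group in match.groups()))
```

For example, `str(parse_prcg("project:17431 run:0 clone:1731 gen:105"))` gives `P17431 R0 C1731 G105`.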
Last edited by cine.chris on Tue Feb 09, 2021 5:59 pm, edited 2 times in total.
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21 - Location: UK
Re: 20210206 Missing Work?
If you mean not found in the stats system, that can happen. If your log shows the work acknowledged and an estimated points value, then the WU will be useful to science. Collection servers (CS) are only used if the work server (WS) can't accept the results. My guess is the WS accepted it but the stats connection for that WS is borked (and I think this is only one of a number of overlapping issues), to the point that the stats system doesn't even know the WU exists ... Luckily, the stats system can be totally wrecked and the science can still continue. I am just glad I am not the poor person who has to track down, resolve, and then catch up all the stats ... but in my experience it always gets sorted eventually.
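The return path described above (WS first, CS only if the WS can't accept) can be modelled as a simple try-in-order loop. This is a simplified sketch under that assumption, not the FAHClient's actual implementation; `upload` and the server names are hypothetical.

```python
from typing import Callable, Sequence


def return_results(
    upload: Callable[[str], bool],
    work_server: str,
    collection_servers: Sequence[str],
) -> str:
    """Try the work server first; fall back to collection servers only if it refuses.

    `upload` is a hypothetical callable that returns True when a server
    accepts the WU. Returns the address that accepted the results.
    """
    for server in (work_server, *collection_servers):
        if upload(server):
            return server
    raise RuntimeError("no server accepted the results")
```

In this model, a WU that the WS accepted but that never reaches the stats is a stats-side failure, not an upload failure, which matches the guess above.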
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
-
- Posts: 78
- Joined: Sun Apr 26, 2020 1:29 pm
Re: 20210206 Missing Work?
Bruce,
Yes, they need to pick their battles, and host resolution appears to be a good candidate for a patch.
Creating 'A' records for critical servers and migrating the code to name resolution would be a doable plan. Even 'foreign' servers can have managed 'A' records in the native domain (I just tested that...). It resolved within seconds, and the first ping worked. Of course, cached updates would need to be tested for latency.
Services could easily be redirected to a backup or new service, and even migrated back if the 'new' service failed, etc.
Hope this is resolved soon.
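The name-resolution idea (clients look up a name whose 'A' record can be repointed, instead of a hard-coded IP) can be sketched in Python with the standard `socket.getaddrinfo`. The hostnames in the usage note and the `pick_endpoint` helper are hypothetical illustrations, not part of FAH.

```python
import socket
from typing import Callable, List, Sequence

Resolver = Callable[[str], List[str]]


def system_resolver(hostname: str) -> List[str]:
    """Resolve a hostname to IPv4 addresses via the OS resolver (DNS, hosts file)."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    addresses: List[str] = []
    for *_, sockaddr in infos:
        # For AF_INET, sockaddr is (address, port); keep unique addresses in order.
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def pick_endpoint(names: Sequence[str], resolve: Resolver = system_resolver) -> str:
    """Return the first address that resolves, trying the primary name, then backups."""
    for name in names:
        try:
            addresses = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue  # name not resolvable; try the next one
        if addresses:
            return addresses[0]
    raise LookupError(f"none of {list(names)} resolved")
```

Usage would look like `pick_endpoint(["ws.example.org", "ws-backup.example.org"])`. Because the client stores only names, repointing an 'A' record redirects traffic without a client update; the record's TTL caching is the latency concern the post mentions.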
-
- Posts: 1
- Joined: Mon Feb 08, 2021 7:09 pm
Re: 20210206 Missing Work?
Stats seem to be slowly updating now? I just jumped from 5k to 95k.
-
- Posts: 30
- Joined: Thu Sep 24, 2020 6:06 pm
- Hardware configuration: iMac 2017 Intel Quad-Core i5 3,4 GHz, 8 GB RAM, Radeon Pro 560 4 GB, typically with the latest macOS update. 5 Raspberry Pi 4B (2 GB).
- Location: Oberhausen, Germany
Re: 20210206 Missing Work?
Neil-B wrote: if your log shows work ack and estimated point then the WU will be useful to science
Good enough for me to keep them running.
My Raspberry Pi folding rack: http://www.anne-emscher.net/fah/