Project 13424 (Moonshot) very low PPD

Moderators: Site Moderators, FAHC Science Team

PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: Project 13424 (Moonshot) very low PPD

Post by PantherX »

cine.chris wrote:... now, if the cluster PPD dips out of my tolerance range, I pull slots if caught early.
Often, it self-repairs, if I wait.
...
I like setting goals & with 5M pts by tomorrow I'll have 300M for August, which is good for an old retired guy quarantined in the basement.
...
It's good to hear that you're passionate about folding and data collection! Since you have installed HFM.NET, I would suggest building a baseline: leave your GPUs to fold without any interference for a week. Right now can be tricky since Moonshot WUs are in high demand, but some baseline is better than none. Once you have done that, check the average PPD for each individual GPU using the awesome new graphing feature:
[screenshot: HFM.NET's new PPD graphing feature]
Details: https://github.com/harlam357/hfm-net/issues/308

Now that you have a baseline, the policy is that PPD can vary within 10% of the GPU's average PPD. Since it seems you have dedicated folding systems, there's minimal chance of background applications skewing the data, which is nice. See what the data suggests, but keep in mind that any Projects in the 134XX series prior to 13424/13425 should be ignored, as they were the outliers. It would be great if you shared your results, since most Projects do fall within the 10% PPD margin.
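If it helps, here's a minimal sketch of that check (not HFM.NET's actual code, just an illustration in Python; the PPD samples are made-up numbers standing in for a week of logged WUs on one GPU):

Code: Select all

# Rough sketch: check whether per-WU PPD values stay within 10% of a
# GPU's baseline average. The sample numbers are made up.

ppd_samples = [1_820_000, 1_790_000, 1_845_000, 1_510_000, 1_830_000]

baseline = sum(ppd_samples) / len(ppd_samples)   # average PPD over the baseline week
band = 0.10 * baseline                           # the 10% margin discussed above

for ppd in ppd_samples:
    deviation_pct = (ppd - baseline) / baseline * 100
    status = "outside 10% band" if abs(ppd - baseline) > band else "ok"
    print(f"{ppd:>10,} PPD  {deviation_pct:+6.1f}%  {status}")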

When it comes to Slot deletion, while it's a few clicks for you, it's several steps back for science. The reason is that you effectively dump a WU, and it then has to wait for the next re-assignment before it gets folded. WUs within F@H are unique and sequential, which is special and rare among distributed computing platforms: other distributed computing projects send out multiple copies of the same task for cheat prevention and data verification, while F@H doesn't. Thus, from your end it's a few seconds, but from the researcher's end it can be many hours or even days.

Another point is that some Projects do tend to give higher PPD (within 10% of the average) while others are lower (but still within that 10% margin). A common reason is the complexity of the Project. Suppose cancer-related Projects were "low PPD" while COVID ones were "high PPD": would you really prefer to dump cancer WUs and only fold COVID WUs, knowing that you can help people suffering from both without having to choose between them? I know what I would do: fold any WU that gets assigned to my GPU and report my findings if they deviate from the norm. The researchers are the subject-matter experts and will make an informed decision once they have sufficient reports.

Maybe you can set yourself a goal to fold 100 WUs a day :ewink:

Finally, I do understand that you would like WUs best suited to your GPUs to be assigned to them, so that they are used in the most efficient manner. The good news is that I agree with you; the great news is that the researchers agree with us too. We are all on the same page! Work is underway on an automated system that will match the optimum WU with the best hardware in real time. Of course, this won't happen overnight, but it is happening in the background and will be announced once it is ready. Unfortunately, there's no ETA, since F@H development doesn't provide timelines and is seriously under-staffed and under-resourced. The current system of classifying GPUs is manual, labour-intensive, and very crude... but only by today's standards. Ten years ago it was a great system and just worked! It's the technical debt that is taking time to deal with, but once it is dealt with, it will unlock much more potential across all donor systems for free!

P.S. If you think that OpenCL on Nvidia GPU is good, just wait until you see CUDA on Nvidia GPUs...
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
ajm
Posts: 750
Joined: Sat Mar 21, 2020 5:22 am
Location: Lucerne, Switzerland

Re: Project 13424 (Moonshot) very low PPD

Post by ajm »

Some 13424's still are quite low or irregular (2080S):
13424 (1301, 0, 3) -> 37.98% lower than average (after over 12% completion, based on 191 data samples)
13424 (249, 46, 5) -> 18.30% lower than average (after over 47% completion, based on 191 data samples)
Ichbin3
Posts: 96
Joined: Thu May 28, 2020 8:06 am
Hardware configuration: MSI H81M, G3240, RTX 2080Ti_Rev-A@220W, Ubuntu 18.04
Location: Germany

Re: Project 13424 (Moonshot) very low PPD

Post by Ichbin3 »

cine.chris wrote: My 2060/2070 super combo would consistently match the 2080Ti PPD level. So, your 3.2M PPD for 13424 is very good in comparison!
You may find that it is a little more complicated and depends on the atom count of the WUs. Your 2060/2070 are better suited to lower atom-count WUs than a 2080 Ti;
at higher atom counts your combo would lose.
Image
MSI H81M, G3240, RTX 2080Ti_Rev-A@220W, Ubuntu 18.04
Kjetil
Posts: 175
Joined: Sat Apr 14, 2012 5:56 pm
Location: Stavanger Norway

Re: Project 13424 (Moonshot) very low PPD

Post by Kjetil »

ajm wrote:Some 13424's still are quite low or irregular (2080S):
13424 (1301, 0, 3) -> 37.98% lower than average (after over 12% completion, based on 191 data samples)
13424 (249, 46, 5) -> 18.30% lower than average (after over 47% completion, based on 191 data samples)
Yep, a 13422 is running now: 1,731,832 PPD on the 2060S, while 13424 gives 1,375,725.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Project 13424 (Moonshot) very low PPD

Post by Neil-B »

As mentioned a number of times in this thread, comparisons between p13422 (which had boosted base points to allow for some very low-point WUs) and p13424 (which has more normal base points, as it should have significantly fewer low-point WUs, though not none, as ajm has shown) are really not relevant ... all people are really showing by this is that those lucky enough to get 13422 WUs were getting "bonus points" above and beyond the level that projects are normally benchmarked to ... my guess would be there may still be some variance within the p13424 WUs since, unlike "traditional" projects, this project is looking at a variety of potential scenarios.
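To illustrate the base-points effect, here is a rough sketch using the published Quick Return Bonus formula as I understand it (credit = base points × max(1, sqrt(k × deadline ÷ time taken))); the k value, base points and completion time below are invented purely for illustration:

Code: Select all

# Rough sketch of the Quick Return Bonus (QRB) formula as I understand it:
#   credit = base_points * max(1, sqrt(k * deadline_days / days_taken))
# The k value, base points and timings below are invented for illustration only.
import math

def wu_credit(base_points, k, deadline_days, days_taken):
    bonus = max(1.0, math.sqrt(k * deadline_days / days_taken))
    return base_points * bonus

# Same card, same completion time: a boosted base simply scales the payout,
# which is why p13422 looked so generous next to p13424.
normal  = wu_credit(base_points=30_000, k=2.5, deadline_days=2.0, days_taken=0.1)
boosted = wu_credit(base_points=40_000, k=2.5, deadline_days=2.0, days_taken=0.1)
print(f"normal base: {normal:,.0f} pts   boosted base: {boosted:,.0f} pts")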
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
Kjetil
Posts: 175
Joined: Sat Apr 14, 2012 5:56 pm
Location: Stavanger Norway

Re: Project 13424 (Moonshot) very low PPD

Post by Kjetil »

Yep, I know, but it is not normal to get 1.3M on a 2060S; normal is 1.6 to 2.0M PPD.
I've folded for a long time. Something is wrong with the 13424 WUs,
but I will keep running them.
Sorry for my poor English.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Project 13424 (Moonshot) very low PPD

Post by Neil-B »

Interesting - I have been seeing PPD for the latest sprint right in the middle of where I would expect (pre early-Moonshot projects) the PPD for my slots to be, but I use slower/older kit ... I wonder if the atom count and processing style is still too small to fully utilise your GPUs? ... As a matter of interest, what PPD did you see from any p16600 your kit processed? I ask as that project was one that hit right in the middle of my kit's expected range.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)
NormalDiffusion
Posts: 124
Joined: Sat Apr 18, 2020 1:50 pm

Re: Project 13424 (Moonshot) very low PPD

Post by NormalDiffusion »

ajm wrote:Some 13424's still are quite low or irregular (2080S):
13424 (1301, 0, 3) -> 37.98% lower than average (after over 12% completion, based on 191 data samples)
13424 (249, 46, 5) -> 18.30% lower than average (after over 47% completion, based on 191 data samples)

Seeing the same here on Radeon VII:

13424 (1308, 44, 2) over 40% lower than the 13424 average (PPD < 900k vs. > 1,500k on average for 13424).
cine.chris
Posts: 78
Joined: Sun Apr 26, 2020 1:29 pm

Re: Project 13424 (Moonshot) very low PPD

Post by cine.chris »

Since installing HFM late yesterday... for 13424
(all systems Win10)
7 WU 2070S 1,826-1,827K PPD (newer X570 PC)
9 WU 2060S 1,308-1,312K PPD
5 WU 2060S 1,253-1,259K PPD (older Q87 mobo)
3 WU 2060 KO 1,205K PPD
5 WU 1660 Ti 767-794K PPD (monitoring PC)

proj 13422
2 WU 2070S 2,256K and 2,273K PPD, which is ~24% > 13424 (rough math below)

That's the default Bonus value; not sure what is expected?
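Roughly how I'm getting that ~24% figure, a throwaway Python calculation using the midpoints of the 2070S ranges listed above (the variable names are just mine):

Code: Select all

# Throwaway calculation: 13422 vs 13424 PPD on the 2070S, using the
# midpoints of the ranges listed above (values in K PPD).
ppd_13424 = (1_826 + 1_827) / 2   # project 13424 on the 2070S
ppd_13422 = (2_256 + 2_273) / 2   # project 13422 on the 2070S

gain_pct = (ppd_13422 - ppd_13424) / ppd_13424 * 100
print(f"13422 pays ~{gain_pct:.0f}% more PPD than 13424 on this card")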
Last edited by cine.chris on Tue Sep 01, 2020 1:05 am, edited 2 times in total.
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: Project 13424 (Moonshot) very low PPD

Post by JohnChodera »

Thanks for the reports, everyone!

We had found that we essentially eliminated the RUN-to-RUN variability when benchmarking locally on a GTX 1080 Ti:

Code: Select all

13425/RUNS/RUN0                                                                  :  117.891 ns/day
13425/RUNS/RUN1                                                                  :  118.636 ns/day
13425/RUNS/RUN2                                                                  :  118.290 ns/day
13425/RUNS/RUN3                                                                  :  121.537 ns/day
13425/RUNS/RUN4                                                                  :  116.377 ns/day
13425/RUNS/RUN5                                                                  :  123.469 ns/day
13425/RUNS/RUN6                                                                  :  119.956 ns/day
13425/RUNS/RUN7                                                                  :  122.902 ns/day
13425/RUNS/RUN8                                                                  :  119.734 ns/day
13425/RUNS/RUN9                                                                  :  120.736 ns/day
13425/RUNS/RUN10                                                                 :  119.134 ns/day
13425/RUNS/RUN11                                                                 :  118.896 ns/day
13425/RUNS/RUN12                                                                 :  119.253 ns/day
13425/RUNS/RUN13                                                                 :  118.442 ns/day
13425/RUNS/RUN14                                                                 :  121.231 ns/day
However, it's still possible some variation slipped by, and we can dig in to understand what is going on there and fix it. So please do let us know if you see abnormally slow RUNs.

I did a quick check on our GTX 1080 Tis on the 13424 RUNs reported above (249, 1301):

Code: Select all

13424/RUNS/RUN245                                                                :   18.790 ns/day
13424/RUNS/RUN246                                                                :   18.753 ns/day
13424/RUNS/RUN247                                                                :   18.713 ns/day
13424/RUNS/RUN248                                                                :   18.721 ns/day
13424/RUNS/RUN249                                                                :   18.608 ns/day
13424/RUNS/RUN250                                                                :   18.681 ns/day
What this is telling me is that there is something specific about the GENs that is causing the variation, rather than the RUN. Here's the profile from your science.log over the checkpoint blocks:

Code: Select all

  Performance since last checkpoint: 103.8461538 ns/day
  Performance since last checkpoint: 101.4084507 ns/day
  Performance since last checkpoint: 101.1709602 ns/day
  Performance since last checkpoint: 101.6470588 ns/day
  Performance since last checkpoint: 101.8867925 ns/day
  Performance since last checkpoint: 96.42857143 ns/day
  Performance since last checkpoint: 96.21380846 ns/day
  Performance since last checkpoint: 96.42857143 ns/day
  Performance since last checkpoint: 96.21380846 ns/day
  Performance since last checkpoint: 96.42857143 ns/day
  Performance since last checkpoint: 101.4084507 ns/day
  Performance since last checkpoint: 101.4084507 ns/day
  Performance since last checkpoint: 101.1709602 ns/day
  Performance since last checkpoint: 101.1709602 ns/day
  Performance since last checkpoint: 100.6993007 ns/day
  Performance since last checkpoint: 96 ns/day
  Performance since last checkpoint: 96.21380846 ns/day
  Performance since last checkpoint: 109.3670886 ns/day
  Performance since last checkpoint: 113.9841689 ns/day
  Performance since last checkpoint: 113.9841689 ns/day
and here's the profile over another RUN without issues (248,42,3):

Code: Select all

  Performance since last checkpoint: 115.5080214 ns/day
  Performance since last checkpoint: 116.1290323 ns/day
  Performance since last checkpoint: 115.8176944 ns/day
  Performance since last checkpoint: 115.2 ns/day
  Performance since last checkpoint: 116.1290323 ns/day
  Performance since last checkpoint: 114.2857143 ns/day
  Performance since last checkpoint: 113.9841689 ns/day
  Performance since last checkpoint: 114.2857143 ns/day
  Performance since last checkpoint: 114.2857143 ns/day
  Performance since last checkpoint: 114.893617 ns/day
  Performance since last checkpoint: 116.1290323 ns/day
  Performance since last checkpoint: 116.4420485 ns/day
  Performance since last checkpoint: 116.4420485 ns/day
  Performance since last checkpoint: 116.7567568 ns/day
  Performance since last checkpoint: 116.1290323 ns/day
  Performance since last checkpoint: 113.9841689 ns/day
  Performance since last checkpoint: 114.5888594 ns/day
  Performance since last checkpoint: 114.2857143 ns/day
  Performance since last checkpoint: 114.5888594 ns/day
  Performance since last checkpoint: 114.5888594 ns/day
So the performance did dip a bit, but I think this is just due to normal statistical fluctuations in the system that we won't be able to do anything about.

Looking into RUN 1301, I do see a much bigger dip in performance early in the RUN, but it recovers:

Code: Select all

  Performance since last checkpoint: 80.74766355 ns/day
  Performance since last checkpoint: 80.59701493 ns/day
  Performance since last checkpoint: 79.26605505 ns/day
  Performance since last checkpoint: 75.92267135 ns/day
  Performance since last checkpoint: 75.78947368 ns/day
  Performance since last checkpoint: 72.97297297 ns/day
  Performance since last checkpoint: 73.97260274 ns/day
  Performance since last checkpoint: 76.8683274 ns/day
  Performance since last checkpoint: 94.52954048 ns/day
  Performance since last checkpoint: 109.3670886 ns/day
  Performance since last checkpoint: 114.893617 ns/day
  Performance since last checkpoint: 116.1290323 ns/day
  Performance since last checkpoint: 115.5080214 ns/day
  Performance since last checkpoint: 116.1290323 ns/day
  Performance since last checkpoint: 116.1290323 ns/day
  Performance since last checkpoint: 110.2040816 ns/day
  Performance since last checkpoint: 110.4859335 ns/day
  Performance since last checkpoint: 107.1960298 ns/day
  Performance since last checkpoint: 105.3658537 ns/day
  Performance since last checkpoint: 107.4626866 ns/day
I'm not sure if we can do anything about this, but we can add it to the things to investigate.
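If anyone wants to pull the same profile out of their own logs, here is a rough sketch (it just greps the "Performance since last checkpoint" lines from the log and summarises them; the log path is a placeholder you would point at your slot's science.log):

Code: Select all

# Rough sketch: extract the per-checkpoint ns/day figures from a science.log
# and summarise them so dips stand out. The log path is a placeholder.
import re
import statistics

LOG_PATH = "science.log"  # point this at the science.log of the slot in question
PATTERN = re.compile(r"Performance since last checkpoint:\s+([\d.]+)\s+ns/day")

rates = []
with open(LOG_PATH) as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            rates.append(float(match.group(1)))

if rates:
    mean = statistics.mean(rates)
    dips = [r for r in rates if r < 0.9 * mean]
    print(f"checkpoints: {len(rates)}")
    print(f"min / mean / max ns/day: {min(rates):.1f} / {mean:.1f} / {max(rates):.1f}")
    print(f"checkpoints more than 10% below the mean: {len(dips)}")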

Thanks again for helping out! That 2080 S really screams on these WUs!

~ John Chodera // MSKCC
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: Project 13424 (Moonshot) very low PPD

Post by JohnChodera »

Whoops! Forgot to post our local GTX 1080 Ti benchmarks around RUN 1301:

Code: Select all

13424/RUNS/RUN1329                                                               :   18.629 ns/day
13424/RUNS/RUN1330                                                               :   18.801 ns/day
13424/RUNS/RUN1331                                                               :   18.833 ns/day
13424/RUNS/RUN1332                                                               :   18.859 ns/day
13424/RUNS/RUN1333                                                               :   18.886 ns/day
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: Project 13424 (Moonshot) very low PPD

Post by JohnChodera »

@cine.chris: We hear you, and we will adjust upwards again for the next Sprint as a compromise. Please do stick with us for Sprint 4, starting Sunday!

~ John Chodera // MSKCC
cine.chris
Posts: 78
Joined: Sun Apr 26, 2020 1:29 pm

Re: Project 13424 (Moonshot) very low PPD

Post by cine.chris »

JohnChodera wrote:@cine.chris: We hear you, and we will adjust upwards again for the next Sprint as a compromise. Please do stick with us for Sprint 4, starting Sunday!
~ John Chodera // MSKCC
Thx John, I will.
I know your team has many challenges & people must wear many 'hats'.

For me, in addition to crunching WUs efficiently...
Keeping people on board & interested is now an interesting challenge & concern for me.
I'm on the MicroCenterOfficial team, but now I'm also watching the behavior of other teams like LinusTechTips & Default... these are teams that manifest 'the power of many', which F@H has tapped into & harnessed.
Of course the big clusters are impressive, but the 'power-of-many' is what captures the imagination.

I've also rummaged through some of the EOC site data, looking at Curecoin & the other 'greys' (the abandoned color) for patterns. It could be an interesting student project.

Chris
cine.chris
Posts: 78
Joined: Sun Apr 26, 2020 1:29 pm

Re: Project 13424 (Moonshot) very low PPD

Post by cine.chris »

Primary motivation for this post: checking the img BBCode connection to my web server.
The chart image dramatizes my concern, as our MicroCenter team has plummeted to 65%.
We also lost our #2 donor, MCO45, over that span.
Other teams affected:
LinusTechTip 78%
Tom's Hdwr 82%
PCMastRace 76%
several teams seemed unaffected?

One HUGE exception was HardOCP.
Mexican food must work for folding; a new donor, after 4 days:
DelTacoRay 180M PPD

[chart image]
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Project 13424 (Moonshot) very low PPD

Post by Neil-B »

You may want to look at which teams massively expanded in March/April and which didn't ... one hypothesis might be that teams which have built up membership over years, incorporating long-term folders, have a lower attrition rate than those whose membership rose sharply during early COVID, when many folders joined perhaps without fully considering/understanding the impacts of doing so ... another hypothesis might be that the number of new early-COVID folders using small/medium-sized corporate/public infrastructure to fold differs across teams, and as those machines stop folding when the infrastructure is reclaimed for its original use, the teams with greater proportions of them suffer the greatest attrition ... three more traditional hypotheses are: that only a percentage of those who try folding will continue once the impact on their kit is actually understood; that folding is seasonal, so some of the early-COVID surge may have been put on hold over summer and may return; and that any surge driven by marketing (especially social media) may have a higher attrition rate than organic growth, even though there may well still be a net increase.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)