Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 11:20 am
by Ichbin3
It looks like project 13420 is showing the same variability as 13418, meaning some work units fold fast and some fold noticeably slower.
@JohnChodera - would you consider increasing the base credit as you did for 13418?
For example, 13420 (3082, 36, 0), which I'm folding right now, is a slow one.
Re: Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 11:56 am
by HaloJones
TBH, I've not seen much variability in 13420. I've done 16 of them so far and they've all been around the same ppd.
Re: Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 2:54 pm
by Ichbin3
13420 (3082, 36, 0)
Normal time for a 13420 is indeed 1:02 TPF
This one had 1:21, without me using the computer.
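To put a number on that gap, here is a minimal sketch that converts the two TPF figures above into seconds and computes how much longer the slow unit takes per frame. The function names are hypothetical helpers written for this illustration, not part of any FAH tool, and the result is only the per-frame time difference, not a PPD estimate.

```python
def tpf_to_seconds(tpf: str) -> int:
    """Convert a 'M:SS' time-per-frame string to whole seconds."""
    minutes, seconds = tpf.split(":")
    return int(minutes) * 60 + int(seconds)


def slowdown_percent(normal_tpf: str, observed_tpf: str) -> float:
    """Percentage by which the observed TPF exceeds the normal TPF."""
    normal = tpf_to_seconds(normal_tpf)
    observed = tpf_to_seconds(observed_tpf)
    return (observed - normal) / normal * 100.0


# TPF values quoted in the post: 1:02 normal, 1:21 for the slow unit.
print(f"{slowdown_percent('1:02', '1:21'):.1f}% longer per frame")
```

That is 62 s against 81 s per frame, so the slow unit needs roughly 31% more time per frame.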
Re: Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 6:36 pm
by JohnChodera
Thanks for the heads-up. I suspect some GPUs see much more variability than others.
I've incremented the base credit for 13420-1 by 10% to help compensate for this variability.
We're still investigating how we can further minimize this in our setup or through changes to OpenMM.
Thanks for bearing with us!
~ John Chodera // MSKCC
Re: Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 6:46 pm
by Ichbin3
Thanks for listening ;- )
Re: Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 6:54 pm
by gunnarre
Thank you. I noticed that 13421 was projected to make just 52k PPD on a GPU which usually does between 70k and 95k PPD.
Re: Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 8:15 pm
by Neil-B
hmmm ... I start to wonder ... bumping ppds up because of variability on some cards (and then only variability of some WUs) ... tbh begins to feel less and less point in even keeping track of points ... cpu projects delivering >20% less than was normal across the board ... gpu projects being bumped up by 10% on a single request ... perhaps the "cpu is irrelevant message" has some grounds ... probably close down the team tbh and just fold for anonymous
Re: Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 8:40 pm
by gunnarre
My CPU normally makes twice as many points as that GPU, so CPUs don't feel irrelevant for folding for me. In fact, if the PPD drops much lower, it would be better to shut the GPU down and let the CPU run more threads. As long as the points are roughly equivalent to the science benefit of running the work units, they're doing their job of rewarding the most effective folding configurations.
Re: Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 9:14 pm
by Neil-B
Sorry, but imho boosting base points to "make up for" a few variable WUs on some GPUs makes a mockery of the equivalent science benefit argument - it actually rewards lack of performance !!
I use rolling ppd averages to monitor my kit (and to spot issues in beta testing) ... a 10-15 percent drop overnight (275k per day to 250k per day) helped me identify the performance impact of certain Intel firmware patches ... since then (the last few months) a variety of projects have degraded "normal" for CPU points so that the 250k ppd is now under 200k most days ... so over the last few months my server is obviously delivering over 20% less scientific benefit - feels like the time will come sooner rather than later when it will not be considered to be delivering any scientific value, at which point I'll retire it ... maybe the new ARM/Android folders can take up the slack
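The rolling-average monitoring described above could look something like the sketch below. This is an assumption about the general approach, not the poster's actual tooling: the function names, the 7-day window, the 5% threshold, and the sample data (a step from 275k to 250k PPD, shaped after the figures in the post) are all made up for illustration.

```python
from collections import deque


def rolling_average(values, window):
    """Yield the mean of the most recent `window` values once the window is full."""
    buf = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # oldest value is about to be evicted by append()
        buf.append(v)
        total += v
        if len(buf) == window:
            yield total / window


def flag_drops(daily_ppd, window=7, threshold=0.05):
    """Return (day_index, average) pairs where the rolling average has fallen
    more than `threshold` below the first full-window average (the baseline)."""
    averages = list(rolling_average(daily_ppd, window))
    if not averages:
        return []
    baseline = averages[0]
    return [
        (i + window - 1, avg)  # day index of the last sample in the window
        for i, avg in enumerate(averages)
        if avg < baseline * (1 - threshold)
    ]


# Hypothetical data: one week at 275k PPD, then one week at 250k PPD.
sample = [275_000] * 7 + [250_000] * 7
for day, avg in flag_drops(sample):
    print(f"day {day}: rolling average {avg:,.0f} PPD")
```

Comparing against a fixed baseline rather than the previous day's average means a slow decline over several months still gets flagged, which matches the gradual degradation described in the post.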
Re: Proj 13420 same variability as 13418
Posted: Tue Jul 28, 2020 9:44 pm
by JohnChodera
> hmmm ... I start to wonder ... bumping ppds up because of variability on some cards (and then only variability of some WUs) ... tbh begins to feel less and less point in even keeping track of points ... cpu projects delivering >20% less than was normal across the board ... gpu projects being bumped up by 10% on a single request ... perhaps the "cpu is irrelevant message" has some grounds ... probably close down the team tbh and just fold for anonymous
We had adjusted the previous projects in the series (which are almost identical) upwards after lots of reports, and the internal testers saw less variation on a small number of test projects before we had to go live. I'm comfortable bringing the base credit back up since we had many reports of this before and no good data that things had improved _except_ for no reports of variation during testing.
We have some ideas for how to reduce variability going forward, but we've been focusing on the science until we can get the infrastructure fully automated and can turn our attention back to these issues.
~ John Chodera // MSKCC
Re: Proj 13420 same variability as 13418
Posted: Wed Jul 29, 2020 1:06 am
by aetch
I'm assuming, when a work unit is completed and the results are uploaded back to F@H, somewhere in there is a log of the actual hardware the work unit ran on. Hopefully you'll have a big enough sample to look at individual gpus and separate out the fast and slow units and figure out what makes them different.
Re: Proj 13420 same variability as 13418
Posted: Wed Jul 29, 2020 11:53 am
by gunnarre
These types of GPU work units - (low atom count?) - seem to benefit from being fed by a CPU with high single core clocks. Typical gaming/graphics oriented systems with a fast GPU and a stock cooled CPU might actually make more PPD by stopping CPU folding while these WUs are running on the GPU, so the CPU can clock up to max stock "boost" frequencies on the single core polling the GPU. PPD/watt would also be better.
Production oriented systems with a modest GPU and many CPU cores likely won't benefit from stopping CPU folding, especially if the CPU is well cooled and "boost" is switched off (so it's running all cores at the same frequency). In those systems, the CPU can be faster than the GPU, and in some cases adding more threads to the CPU gives more PPD than configuring the GPU for folding - at least until CUDA support hopefully reduces CPU usage while GPU folding.
Re: Proj 13420 same variability as 13418
Posted: Wed Jul 29, 2020 5:05 pm
by bruce
The variability HAS been reduced. Taken as a group, projects 13420 and 13421 are less variable. The really short WUs are now being assigned to slower GPUs and the fater ones are retained in 13420. That allows the average points for each group to be consistent with the GPU performance of half of the spectrum of FAH GPUs. It does not remove all variability when you consider the overall variability of a spectrum of P134xx assignments.
In this case, the union of projects 13420 and 13421 represents a wide variety of projects, just as Project MoonShot represents a wide variety of suggested protein fragments.
Re: Proj 13420 same variability as 13418
Posted: Thu Jul 30, 2020 1:04 pm
by Ichbin3
Got another slow one right now
13420 (6171, 23, 0)
For all the people who say there aren't any ;- )
Re: Proj 13420 same variability as 13418
Posted: Thu Jul 30, 2020 3:26 pm
by JohnChodera
Thanks, Ichbin3! We're still working on this.
~ John Chodera // MSKCC