PCI-e bandwidth/capacity limitations

Post by **Joe_H** » Fri Jun 17, 2016 1:16 pm

Take a look at the Project Summary pages for info on the different projects. One important metric is the number of atoms in the simulation. So for example using the two different projects mentioned in Bruce's post:

Project 11414 - 77003 atoms (Core_21)

Project 9160 - 46000 atoms (Core_18)

Atom counts listed for current GPU projects range from under 10,000 to over 270,000, based on a quick scan of the nearly 200 listed for cores 17, 18 and 21. My assumption is that project size in atoms correlates with bandwidth requirements, since larger WUs need more data moved to and from the GPU.

Post by **hiigaran** » Fri Jun 17, 2016 2:53 pm

Interesting. Right, so we have another variable in play. Now, when users select whether they want small, normal, or big work units, would this option request projects based on atom count, or would it just alter the length of the simulation to be calculated within the WU?

Would it also be safe to assume that computation time and atom counts are exponentially proportional to each other? I'm guessing it wouldn't be linear, since more atoms = more variables, but I have absolutely no idea how the WU is processed, let alone what is actually within it.

What I'm leading to here is this: If WU size determines project atom count, and the count is exponentially proportional to computation time, could bandwidth saturation be avoided by choosing to download small WUs exclusively? And if so, would the rate of progress from F@H be altered, if we completely forget about points?

Post by **bruce** » Fri Jun 17, 2016 3:52 pm

hiigaran wrote: Right, so it's a riser. This is where my confusion was stemming from.

That's perfectly reasonable confusion.

Your moderators have not enforced "off topic" policies on this wide-ranging topic, and the topic title still says "splitter".
I'll edit my post.

Post by **hiigaran** » Fri Jun 17, 2016 4:41 pm

Actually, good idea. Thread title edited.

Before I was aware of the high bandwidth consumption, though, I was originally referring to hardware that can split one PCI-e slot into several, rather than risers.

Post by **bruce** » Fri Jun 17, 2016 5:52 pm

hiigaran wrote: Interesting. Right, so we have another variable in play. Now, when users select whether they want small, normal, or big work units, would this option request projects based on atom count, or would it just alter the length of the simulation to be calculated within the WU?

The size of the download packet increases as the number of atoms increases, but the data is compressed so it's no longer proportional to anything.
The size of the upload packet increases with the number of checkpoints being reported to the server plus the length of some log files. Packet-size is supposed to consider both, but the scientists are only moderately consistent about setting that number for their project.

You're welcome to monitor messages such as
Downloading 1.36MiB
Uploading 6.05MiB to xx.xx.xx.xx
and report any discrepancies you see.
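If you want to tally those messages rather than eyeball them, here is a minimal sketch. It assumes the log lines look exactly like the two samples above; real client logs may carry timestamps or other prefixes, and the function name `transfer_sizes` is made up for illustration.

```python
import re

# Assumed line shapes, taken from the two sample messages above.
# Real client logs may add timestamps or slot prefixes before these words.
TRANSFER_RE = re.compile(r"(Downloading|Uploading)\s+([\d.]+)\s*(KiB|MiB|GiB)")

UNIT_BYTES = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def transfer_sizes(lines):
    """Return total bytes downloaded and uploaded across the given log lines."""
    totals = {"Downloading": 0.0, "Uploading": 0.0}
    for line in lines:
        match = TRANSFER_RE.search(line)
        if match:
            direction, amount, unit = match.groups()
            totals[direction] += float(amount) * UNIT_BYTES[unit]
    return totals


sample = [
    "Downloading 1.36MiB",
    "Uploading 6.05MiB to xx.xx.xx.xx",
]
totals = transfer_sizes(sample)
for direction, size in totals.items():
    print(f"{direction}: {size / 1024**2:.2f} MiB")
```

Comparing these totals against the atom count on the Project Summary page would show how consistent the packet-size setting is for a given project.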

hiigaran wrote: Would it also be safe to assume that computation time and atom counts are exponentially proportional to each other? I'm guessing it wouldn't be linear, since more atoms = more variables, but I have absolutely no idea how the WU is processed, let alone what is actually within it.

Portions of the calculation are proportional to the number of atoms N. Other portions scale somewhat less than proportionally to N**2, so you can estimate the time as being somewhere between N and N**2. (That's why projects have to be benchmarked.)
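To put rough numbers on that range, here is a small sketch using the atom counts Joe_H listed for Projects 9160 and 11414. It only bounds the relative cost per step under the N to N**2 assumption; the two projects run on different cores and may have different step counts, so this is not a runtime prediction. The function name is hypothetical.

```python
def runtime_ratio_bounds(atoms_small, atoms_large):
    """Lower and upper bound on relative per-step cost for N..N**2 scaling."""
    ratio = atoms_large / atoms_small
    return ratio, ratio**2


# Atom counts quoted earlier in the thread.
low, high = runtime_ratio_bounds(46000, 77003)
print(f"Per-step cost, Project 11414 vs 9160: between {low:.2f}x and {high:.2f}x")
```

So the larger project would cost somewhere between roughly 1.7 and 2.8 times as much per step, which is a wide enough range that benchmarking is the only way to pin it down.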

hiigaran wrote: What I'm leading to here is this: If WU size determines project atom count, and the count is exponentially proportional to computation time, could bandwidth saturation be avoided by choosing to download small WUs exclusively? And if so, would the rate of progress from F@H be altered, if we completely forget about points?

Neither. The other factor that needs to be taken into account is the number of steps, which is not the same as the number of checkpoints. Double the number of steps and any WU will run twice as long.

Stanford's limitations are probably not bandwidth related, but rather based on the frequency of new connection requests, although work has been done on an unreleased streaming core that uploads semi-continuously at a rather slow average rate.

Post by **hiigaran** » Sun Jun 19, 2016 11:49 am

Hmm. Guess we probably can't do anything with saturation and high-end GPUs, so let's change things a bit then. If high end will suffer from x1 saturation, what's the fastest card that could run properly from x1? You mentioned on the last page that a GT740 only used 17% of x1 bandwidth, so that's a start. Perhaps mid-range cards might work?

Also, if I split away from PCI-e for a bit, I always see people running server hardware for folding as well. As far as I'm aware, GPU folding is more cost effective, but with people going out of their way to get their hands on server CPUs and dual or quad socket mobos, are these CPUs actually better?

Post by **Nathan_P** » Sun Jun 19, 2016 12:55 pm

The dual/quad CPU buying phase is now well and truly over for F@H. Back when they ran the bigadv projects, people needed server boards with 2+ CPUs to get the WU done in time; PPD was up to 1 million per machine if you had the right CPUs, and 500k+ was not uncommon. Back then GPUs were only scoring around 100k at most, so as always people went where the points were. Now the focus is on GPUs and the bigadv projects are finished. Anyone using server hardware today either still has leftover kit from the bigadv days, like myself, or uses the server boards to get maximum PCIe lanes: a dual-socket 2011 board gets you 80 PCIe lanes and up to 7 x16 slots, with 4 running at PCIe 3.0 x16 and the other 3 at PCIe 3.0 x8. You are not going to find that on a consumer board.

Post by **foldy** » Fri Aug 19, 2016 4:29 pm

There is a test of a GTX 1080 on PCIe 2.0 x1 that shows only a 7% performance loss compared to PCIe 2.0 x16 on Linux:
viewtopic.php?f=38&t=28897&p=287927#p287927
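For context on what x1 versus x16 means in raw numbers, here is a sketch of the theoretical one-direction link bandwidth, computed from the per-lane signalling rate and line encoding of each PCIe generation. These are peak figures before protocol overhead, so real throughput is lower; the function name is made up for illustration.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency per generation.
PCIE_GENERATIONS = {
    "1.0": (2.5, 8 / 10),     # 8b/10b encoding
    "2.0": (5.0, 8 / 10),     # 8b/10b encoding
    "3.0": (8.0, 128 / 130),  # 128b/130b encoding
}


def theoretical_mb_per_s(generation, lanes):
    """Peak one-direction bandwidth in MB/s, before protocol overhead."""
    gigatransfers, efficiency = PCIE_GENERATIONS[generation]
    # Each transfer moves one bit per lane; divide by 8 to get bytes.
    return gigatransfers * 1000 * efficiency / 8 * lanes


for gen, lanes in [("2.0", 1), ("2.0", 16), ("3.0", 1), ("3.0", 4), ("3.0", 16)]:
    print(f"PCIe {gen} x{lanes}: {theoretical_mb_per_s(gen, lanes):.0f} MB/s")
```

A PCIe 2.0 x1 link has one sixteenth of the x16 bandwidth, so a 7% loss suggests the core spends most of its time computing on the card rather than waiting on the bus, at least for the project tested.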

Post by **b4441b29** » Tue Oct 18, 2016 12:03 am

Here is another data point: Ubuntu 16.06 and Nvidia driver 370.28. I ran FAHBench 2.2.5 tests with a GTX 1070 on an ASRock (Intel) Z170 Extreme3 motherboard, running at PCIe 3.0 speed. In a 16x lane slot it scored 142.223 on Single Precision and 9.31145 on Double Precision. In a 4x lane slot that is shared with Ethernet, storage, etc., it scored 140.841 on Single Precision and 9.28171 on Double Precision. So less than 1% difference between PCIe 16x and 4x. I'm completing actual work units now, and so far it looks like the difference in PPD is similar to the benchmark. I'll keep you updated.
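The "less than 1%" figure can be checked directly from the four scores above. A minimal sketch; the helper name `percent_drop` is made up for illustration:

```python
def percent_drop(baseline, value):
    """Relative loss of `value` against `baseline`, in percent."""
    return (baseline - value) / baseline * 100


# FAHBench scores from the post: (16x slot, 4x slot).
scores = {
    "Single Precision": (142.223, 140.841),
    "Double Precision": (9.31145, 9.28171),
}

for precision, (x16, x4) in scores.items():
    print(f"{precision}: {percent_drop(x16, x4):.2f}% slower in the 4x slot")
```

Both precisions come out under 1%, with double precision losing less than single.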

Post by **Duce H_K_** » Wed Oct 19, 2016 7:49 am

bruce wrote: In this particular case, the 1x riser keeps the GPU at 100%, at least during the routine analysis portions of the WU. It might or might not show different results during startup/finish/checkpointing/etc., but that's a small portion of the run.

Environment tested (as above except):
GPU: GTX 980 (Maxwell)
FAHCore_18 (Project 9160)

Results:
x16 utilization: 1%
GPU utilization: 99-100%
TPF: 2:26
PPD: 138877

Across all 19 of my p9160 WUs, my GTX 970 OC averaged 293840.27 PPD.

Could a riser cause such a performance loss? All the WUs ran at PCI-E x16 2.0.

Post by **Nathan_P** » Wed Oct 19, 2016 3:20 pm

Possible. Did you leave a CPU core free to feed the GPU? My testing never went down as far as x1, but others did see a dip, though not as severe as yours.

Post by **hiigaran** » Sat Oct 22, 2016 7:20 pm

Keep this data coming! I've got less than three days before I start buying the parts!

Also, that data was tested for Linux. While I plan to have one system running a distro, I plan to have another Windows system for DC projects that are more optimised for Windows. Have similar results been verified for Windows?

Post by **yalexey** » Wed Nov 02, 2016 12:52 pm

FAHBench shows a significant performance loss when I connect one GTX 1070 card to a PCIe 3.0 x1 slot: depending on the test pattern, performance is 9%-47% lower, with 75-85% bus controller load on this card.
Something similar occurs when processing real jobs with core 21.

Win 10. ASRock B150 motherboard.

Post by **b4441b29** » Thu Nov 10, 2016 2:31 pm

Here are some real PPD numbers for the GTX 1070 on the Ubuntu 16.06 system I posted FAHBench tests on earlier.

In the 16x lane slot with Nvidia driver version 367.44 it averaged 659240 PPD one week and 629492 PPD the next.

In the 4x lane slot with Nvidia driver version 370.28 it averaged 662106 PPD over a week.

The driver upgrade sped up the folding more than switching to the 4x slot slowed it. That corresponds to what I saw in the benchmarks. I'm running the 370.28 driver in the 16x lane slot now. It hasn't been running a full week yet.
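As a rough sanity check on those weekly averages, here is a sketch that compares the slot-to-slot change with the week-to-week spread in the 16x slot. Note that the driver version and the slot changed together, so the two effects cannot be separated from these numbers alone; the variable names are only illustrative.

```python
from statistics import mean

# Weekly PPD averages quoted above.
x16_weeks = [659240, 629492]  # 16x slot, driver 367.44
x4_weeks = [662106]           # 4x slot, driver 370.28

x16_avg = mean(x16_weeks)
x4_avg = mean(x4_weeks)

# Change from the 16x/367.44 setup to the 4x/370.28 setup.
change = (x4_avg - x16_avg) / x16_avg * 100
# Spread between the two weeks measured in the same slot with the same driver.
week_spread = (max(x16_weeks) - min(x16_weeks)) / max(x16_weeks) * 100

print(f"16x average: {x16_avg:.0f} PPD, 4x average: {x4_avg:.0f} PPD")
print(f"4x setup vs 16x setup: {change:+.2f}%")
print(f"Week-to-week spread in the 16x slot: {week_spread:.2f}%")
```

The two weeks in the same slot differ from each other by more than the 4x setup differs from their average, so a longer run would be needed to attribute a small difference to the slot itself.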

Post by **foldy** » Thu Nov 10, 2016 3:45 pm

x4 slot should be fine but x1 is too slow.
