GPU: Stream Processors - Bits, Number - what is important?
Moderators: Site Moderators, FAHC Science Team
-
- Posts: 82
- Joined: Sat Dec 17, 2011 4:22 pm
- Hardware configuration: none anymore, FAH doesn't want it, it seems.
GPU: Stream Processors - Bits, Number - what is important?
(I wasn't sure which forum is best suited, so I chose GD. Feel free to move.)
I've tried to find an answer through searching the forum, but the bits I found didn't really answer my questions, no doubt because I'm not completely up-to-date with regard to how GPUs actually work.
I understand that AMD and nVidia are using different approaches regarding their GPUs, so anything I say is only directed towards AMD. (And Folding@Home, of course.)
From what I can see, the most commonly advertised features of a GPU are the number of Stream Processors, the Memory (Bus?) Width, the clock Speed (GPU and memory), and the number of Texture Units. Perhaps the shader version as well.
Which of those are important to FAH, and in what relation?
My guess would be the number of SPs and the width of the memory interface (as in, if it's smaller than the numbers being processed, it'll take two steps to get the whole number through), then clock speed.
(Of course, only within limits; no number of processors is going to make up for an endlessly slow bus.)
Is the shader model relevant at all? What about texture units?
However, what is the required bus width? Does it have to be 128-bit, or would 64-bit suffice as well?
Is FAH (currently, or in the near future) going to use up all the processors, or is there a reasonable limit above which it's better to go for more clock speed or a wider bus?
It seems I can't write a signature that both conveys my feelings and doesn't look like a miserable trolling attempt...
-
- Posts: 221
- Joined: Fri Jul 24, 2009 12:30 am
- Hardware configuration: 2 x GTX 460 (825/1600/1650)
AMD Athlon II X2 250 3.0Ghz
Kingston 2Gb DDR2 1066 Mhz
MSI K9A2 Platinum
Western Digital 500Gb Sata II
LiteOn DVD
Coolermaster 900W UCP
Antec 902
Windows XP SP3 - Location: Malvern, UK
Re: GPU: Stream Processors - Bits, Number - what is important?
[WHGT]Cyberman wrote: ...
I understand that AMD and nVidia are using different approaches regarding their GPUs, so anything I say is only directed towards AMD. (And Folding@Home, of course.)
From what I can see, the most commonly advertised features of a GPU are the number of Stream Processors, the Memory (Bus?) Width, the clock Speed (GPU and memory), and the number of Texture Units. Perhaps the shader version as well.
...
Is the shader model relevant at all? What about texture units?
However, what is the required bus width? Does it have to be 128-bit, or would 64-bit suffice as well?
Is FAH (currently, or in the near future) going to use up all the processors, or is there a reasonable limit above which it's better to go for more clock speed or a wider bus?
Most important are:
1) number of Stream Processors
2) GPU clock speed
3) Power consumption
Regarding shader model/version: generally a later model/version may be better, but support may depend on the development of a new core to utilise the GPU effectively (with Nvidia, the Kepler GPUs did not fold until the new Core 15 v2.25 was released).
Least significant are:
a) Memory speed
b) memory size
c) memory bus
d) texture units
History with projects for Nvidia suggests that we will see GPU WUs with larger numbers of atoms, so it is unlikely that FAH is going to use up all the processors.
-
- Posts: 2948
- Joined: Sun Dec 02, 2007 4:36 am
- Hardware configuration: Machine #1:
Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).
Machine #2:
Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.
Machine 3:
Dell Dimension 8400, 3.2GHz P4, 4x512MB RAM, Video card GTX 460, Windows 7 X32
I am currently folding just on the 5x GTX 460's for approx. 70K PPD - Location: Salem, OR USA
Re: GPU: Stream Processors - Bits, Number - what is important?
I would add that the number of single-precision floating point units (FPUs) is also highly important, while the number of double-precision FPUs is not.
-
- Pande Group Member
- Posts: 148
- Joined: Fri Sep 28, 2012 11:03 pm
- Location: Stanford, CA
- Contact:
Re: GPU: Stream Processors - Bits, Number - what is important?
As a general rule of thumb, we look at the # of cores multiplied by the clock speed. Unfortunately we don't support double precision for now, because all GPUs for a given project would need to use the same precision. In most cases, I don't think we are bound by memory. When enough GPUs out there start supporting double precision, we can perhaps set up some double-precision-exclusive projects.
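As a rough illustration of that rule of thumb, here is a minimal sketch. This is not an official FAH formula, just cores x clock as a crude figure of merit; the card names and specs below are hypothetical examples, not real benchmarks:

```python
def figure_of_merit(cores, clock_mhz):
    """Crude relative-throughput estimate: cores x clock, in arbitrary units."""
    return cores * clock_mhz / 1000.0

# Hypothetical example cards (made-up specs for illustration only):
cards = {
    "many_cores_modest_clock": (1536, 1006),
    "few_cores_similar_clock": (384, 900),
}

for name, (cores, clock) in cards.items():
    print(f"{name}: {figure_of_merit(cores, clock):.0f}")
```

On this estimate, the card with four times the cores at a similar clock comes out far ahead, which matches derrick's ranking of stream processor count first and clock speed second.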
-
- Posts: 128
- Joined: Thu Dec 06, 2007 9:48 pm
- Location: Norway
Re: GPU: Stream Processors - Bits, Number - what is important?
proteneer wrote: When enough GPUs out there start supporting double precision we can perhaps set up some double precision exclusive projects.
The hardware isn't a problem, just look at Milkyway@home.
Re: GPU: Stream Processors - Bits, Number - what is important?
The real question is whether DP improves science or not.
First, some history: Gromacs for the CPU was first introduced running pure x86 code. Later, optimized code was introduced to use Single Precision SSE. At that stage of hardware development, some donors had SSE support and others did not, so both codepaths were included in the same FahCore, with switches to manage whether to run optimized or not. Later, Double Precision was seriously considered, and a new FahCore and a new set of projects were introduced for SSE2, along with the server logic to exclude assigning those projects to non-SSE2 machines.
When running the same protein, double precision ran significantly slower (at that time, and with that generation of hardware). After a short time, no new projects for FahCore_79 were introduced. From that I conclude that they apparently decided the scientific gain of double precision was less than the cost of reduced performance on the same protein. (Also: for protein folding, does SP support sound scientific conclusions, or would increased precision show additional results that might not be found from SP results?)
Fast-forward to today's hardware, but ask the same questions. How would performance compare for DP vs. SP across the entire spectrum of DP-capable GPUs, and how would that compare to whatever scientific benefit would derive from having more accurate results? If the answer favors DP, would the servers be able to differentiate between DP-capable GPUs and SP-only GPUs so that the assignment process can use both classes of GPUs effectively?
Just because you have a nice hardware feature doesn't necessarily mean you need it -- and what's needed by project X is not necessarily what's needed by project Y.
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 128
- Joined: Thu Dec 06, 2007 9:48 pm
- Location: Norway
Re: GPU: Stream Processors - Bits, Number - what is important?
bruce wrote: Later, Double Precision was seriously considered and a new FahCore and a new set of projects were introduced for SSE2 along with the server logic to exclude assigning those projects to non-SSE2 machines.
(snip)
Fast Forward to today's hardware, but ask the same questions. How would performance compare for DP vs. SP across the entire spectrum of DP-capable GPUs and how would that compare to whatever scientific benefit would derive from having more accurate results? If the answer favors DP, would the servers be able to differentiate between DP-capable GPUs and SP-only GPUs so that the assignment process can use both classes of GPUs effectively?
Well, I do remember back in the days of "SSE2-only" WUs, I deleted some of them, since they got downloaded to my SSE-only computer. The server logic didn't work very well back then, and as for how it's working now, just look at some of the other forum threads...
Re: GPU: Stream Processors - Bits, Number - what is important?
Yeah, I remember those days when my Dothan laptop at 1.6GHz ran the same WU 50% faster than my Athlons at 2.4GHz because it had SSE2. 50% slower clock for 50% faster WUs.
But yeah, as derrick said: SP, clock, power, then I'd say cooling (to OC more)
-
- Pande Group Member
- Posts: 148
- Joined: Fri Sep 28, 2012 11:03 pm
- Location: Stanford, CA
- Contact:
Re: GPU: Stream Processors - Bits, Number - what is important?
On the double precision point - GPUs are unfortunately terribad at double precision (even the Teslas). Our internal testing shows 1/6-1/8th the performance of Single Precision.
-
- Site Moderator
- Posts: 2850
- Joined: Mon Jul 18, 2011 4:44 am
- Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4 - Location: Western Washington
Re: GPU: Stream Processors - Bits, Number - what is important?
proteneer wrote: GPUs are unfortunately terribad at double precision
Thanks for the new word. "Terribad". Huh. http://www.urbandictionary.com/define.php?term=Terribad
F@h is now the top computing platform on the planet and nothing unites people like a dedicated fight against a common enemy. This virus affects all of us. Lets end it together.
-
- Posts: 1024
- Joined: Sun Dec 02, 2007 12:43 pm
Re: GPU: Stream Processors - Bits, Number - what is important?
proteneer wrote: On the double precision point - GPUs are unfortunately terribad at double precision (even the Teslas). Our internal testing shows 1/6-1/8th the performance of Single Precision.
So I guess that means even if you do create some Double Precision projects, nobody is going to want to run them. In the days that Bruce is talking about, SSE2 was half as fast as SSE, so DP on GPUs is a lot less practical.
Re: GPU: Stream Processors - Bits, Number - what is important?
Oh yeah, SSE2 WUs on hardware with SSE2 smoked much faster hardware (MHz) if it didn't have SSE2. Hopefully they would be able to assign those WUs to just the better GPUs that can do them best. But if they benchmark them like the SSE2 WUs, they will be average PPD on non-DP GPUs but excellent on DP GPUs, as they will be able to complete much faster.
-
- Posts: 128
- Joined: Thu Dec 06, 2007 9:48 pm
- Location: Norway
Re: GPU: Stream Processors - Bits, Number - what is important?
mmonnin wrote: Oh yeah SSE2 WUs on hardware with SSE2 smoked much faster hardware (MHz) if it didn't have SSE2. Hopefully that would be able to assign those WUs to just the better GPUs that can do them the best. But if they benchmark them like the SSE2 WUs, they will be average PPD on non-DP GPUs but excellent on DP GPUs as they will be able to complete much faster.
Well, chances are a double-precision application will just error out if one tries to run it on a non-DP GPU, but even assuming it would work, the performance would be really bad.
As for FAH using DP on GPU, with FAH's dreadful AMD performance and Nvidia's abysmally low double-precision speed, especially on the 6xx series of cards, I wouldn't expect it...