Type of SSE used

Mindmatter · Post by **Mindmatter** » Mon Feb 28, 2011 9:43 pm

I came across an article outlining the floating point processing unit in the upcoming AMD Bulldozer processor. Since 90% of my CPU usage goes to F@H, this paragraph caught my attention.

The floating point unit is also much more robust than it used to be. The Phenom had a single 128 bit unit per core, and Bulldozer now has it as 2 x 128 bit units. It can combine those units when running AVX and act as a single 256 bit unit. There are some performance limitations there as compared to the Intel CPUs which support AVX, and in those cases Intel should be faster. However, AVX is still very new, and very unsupported. AMD will have an advantage here over Intel when running SSE based code. It can perform 2 x 128 bit operations, or up to 4 x 64 bit operations. Intel on the other hand looks to only support 1 x 128 bit operation and 2 x 64 bit operations. The unit officially supports SSE3, SSE 4.1, SSE 4.2, AVX, and AES. It also supports advanced multiply-add/accumulate operations, something that has not been present in previous generations of CPUs.

So what does F@H use for floating point processing, for both single core and SMP? Also, does bigadv use anything different from SMP?

Link:
http://www.pcper.com/article.php?aid=1083

7im · Post by **7im** » Mon Feb 28, 2011 10:10 pm

Trade ya info, as I'd like to read the whole article. Please post a link...

To answer: SSE mostly. One fahcore uses SSE2, but those projects are few, and one of the new CPU fahcores uses all SSE versions available. -bigadv is also SSE, no difference.

Please also take into consideration not only the bit width, but how many CPU cycles it takes to do 4x64 on AMD vs 2x64 on Intel. If the AMD takes 4 CPU cycles and the Intel only takes 2 CPU cycles, then effectively the performance is the same. (That's why I'd like to read more...)
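To put numbers on that, here is a rough sketch in Python. The cycle counts are the hypothetical ones from the sentence above, not measured figures for either chip, and `ops_per_cycle` is just a name made up for the illustration.

```python
def ops_per_cycle(ops_per_instruction, cycles_per_instruction):
    """Sustained throughput in operations per clock cycle."""
    return ops_per_instruction / cycles_per_instruction

# Hypothetical figures from the example above, not vendor data.
amd = ops_per_cycle(4, 4)    # 4 x 64-bit ops that take 4 cycles
intel = ops_per_cycle(2, 2)  # 2 x 64-bit ops that take 2 cycles

print(amd, intel)  # both come out to 1.0 op per cycle
```

So a wider unit only helps if the cycle count does not grow with it.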

codysluder · Post by **codysluder** » Mon Feb 28, 2011 10:45 pm

The other consideration is how many operations can be done simultaneously. Some AMD fanboys have criticized the Intel fanboys because chips like the i7 have virtual cores rather than real cores and imply that Bulldozer will have real cores. Whether you call something a real or a virtual core is asking the wrong question entirely. The i7 has four shared SSE units and each can process a certain number of single precision or SSE operations per second (and SMP can use all of them). Bulldozer has a certain number of shared SSE units and each can process a certain number of single precision or SSE operations per second (and SMP can use all of them).

The hardware pieces may be called by different names and may be structured differently, but the bottom line will be how many SSE operations can be done, not what you call them.

P5-133XL · Post by **P5-133XL** » Mon Feb 28, 2011 10:58 pm

I try to wait till I see how well a particular CPU will fold rather than do the speculation game. I think there are too many variables that can affect the numbers so any speculation tends to be wildly inaccurate.

Mindmatter · Post by **Mindmatter** » Mon Feb 28, 2011 11:13 pm

Added the link.

codysluder wrote:The other consideration is how many operations can be done simultaneously. Some AMD fanboys have criticized the Intel fanboys because chips like the i7 have virtual cores rather than real cores and imply that Bulldozer will have real cores.

Well from what I can tell Bulldozer will be able to do one 128-bit floating point op per core. I thought that was per clock as well but I could be mistaken. It can also split that and do two 64-bit FP ops, and it sounds like if one core in the module is not doing any FP ops then the other core can do four 64-bit, two 128-bit, or one 256-bit FP ops per clock.

I'm having a hard time getting good information on Intel's Sandy Bridge FPU. It looks like it can do one 256-bit FP op per clock, but I have seen some articles say two, although I don't know if that is with Hyper-Threading or how that works out, especially since HT shares core components. I'm also reading that Sandy Bridge doesn't even have a true 256-bit FPU; people are saying it doubles the core frequency of a 128-bit FPU to effectively process a 256-bit instruction. That would also mean it would run at 6GHz for a 3GHz core!

Edit: I kind of missed part of my point above with the Intel FPU. How many F@H FP ops can Sandy Bridge do? I also can't find any info on how Intel splits the FPU for 64-bit instructions, I would imagine it is the same as AMD's method.

P5-133XL wrote:I try to wait till I see how well a particular CPU will fold rather than do the speculation game. I think there are too many variables that can affect the numbers so any speculation tends to be wildly inaccurate.

Well, I'm an AMD guy so it really doesn't matter to me who is faster. I know that even if BD is slower than SB, it will be priced accordingly. But at the same time I like speculating about this stuff; it is half the fun of waiting for a new product.

7im · Post by **7im** » Mon Feb 28, 2011 11:29 pm

http://www.behardware.com/articles/623- ... -test.html

The above article went in depth comparing the then-new Core 2 Duo architecture to the previous P4 NetBurst hardware.

Core uses two floating point calculation units, one dedicated to addition and the other to multiplication and division. Theoretical calculation capacity is 2 x87 instructions per cycle and 2 SSE 128 bit floating point instructions per cycle (that is 8 operations on 32 bit simple precision floating points, or 4 operations for double precision 64 bit floating points). Core is, in theory, two times faster for this type of instruction than Mobile, Netburst and K8.

The sections in red are the important parts. It accurately predicted that C2D chips were 2x faster on FAH than P4 systems. FAH is single precision, so that's what we need to start gathering info about for BD and SB.

EDIT: Found another good blog post discussing 32 bit FPU performance on SB vs BD... http://blogs.amd.com/work/2010/10/25/the-new-flex-fp/
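For anyone who wants to check the arithmetic in the quoted passage, here is a quick sketch in Python. The function name and parameters are made up for illustration; the only inputs are the figures from the quote (128-bit SSE registers, 2 SSE instructions per cycle).

```python
def flops_per_cycle(register_bits, element_bits, instructions_per_cycle):
    """Floating point operations per cycle for packed SIMD instructions."""
    lanes = register_bits // element_bits  # values packed into one register
    return lanes * instructions_per_cycle

single = flops_per_cycle(128, 32, 2)  # single precision, what FAH uses
double = flops_per_cycle(128, 64, 2)  # double precision

print(single, double)  # 8 4
```

That matches the 8 single precision and 4 double precision operations the article gives for Core.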

If I read it correctly, you'll get full speed from BD without a recompile of FAH, but only half speed from SB until FAH is recompiled with AVX.

Mindmatter · Post by **Mindmatter** » Tue Mar 01, 2011 1:04 am

7im wrote:If I read it correctly, you'll get full speed from BD without a recompile of FAH, but only half speed from SB until FAH is recompiled with AVX.

I see what you are saying. AMD's method will allow legacy code to use all of the 256-bit FPU, whereas Intel's method requires AVX in order to split the FPU up in any way the thread likes.

But at the same time it looks like even a Core 2 can do more FP ops per clock than BD, unless we are talking about only a single core in the module using the FPU. This makes me wonder if BD will be more efficient with F@H code if we only use 4 out of the 8 cores with the SMP client. Basically would it be more efficient having 4 threads pumping out 8 FP's each or 8 threads with four FP's each?

This could make for an interesting mix of CPU and GPU clients. Have all of the FPUs fully loaded and running while still having four integer-only cores that could feed data to four or more GPUs without slowing down the SMP client. Might not be a speed king but could put a new spin on efficient usage of cores.

codysluder · Post by **codysluder** » Tue Mar 01, 2011 6:25 am

Mindmatter wrote:Basically would it be more efficient having 4 threads pumping out 8 FP's each or 8 threads with four FP's each?

At this point, it's all speculation. Nobody really knows (especially me) but I do like to speculate.

If four threads can run exactly twice as fast as eight and if there's no interruption from other programs running, it doesn't matter. How much interruption from other tasks you'll see, how frequently, and how those interruptions are balanced between the four or eight threads probably has a bigger influence than anything.
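As a back-of-the-envelope sketch in Python, using the numbers from the question (8 FP ops per thread with 4 threads, 4 FP ops per thread with 8 threads). This assumes perfectly even scaling and no interruptions, which is exactly the part nobody can predict yet; `total_throughput` is just an illustrative name.

```python
def total_throughput(threads, fp_ops_per_thread):
    """Total FP ops per clock if every thread runs uninterrupted."""
    return threads * fp_ops_per_thread

four_wide = total_throughput(4, 8)     # 4 threads, 8 FP ops each
eight_narrow = total_throughput(8, 4)  # 8 threads, 4 FP ops each

print(four_wide, eight_narrow)  # 32 32
```

On paper the two setups tie, so any real difference would have to come from the interruptions and scheduling described above.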

Even when a program stops sequential operation to take a branch, there's a minor interruption when the process takes a different path than the predicted path. That's the whole concept behind Hyper-Threading. Every unpredicted branch causes the pipeline to stall while new instructions are prepared to re-fill the pipeline. During those times, a real processor would be idle while for a virtual processor it matters less because the hardware can process data from another task while one task is stalled. That seems to be worth about 15% to FAH-SMP which is not a huge amount, but it's certainly as much as the uncertainties we're speculating about. I suppose BD will do something similar, so even if 85% of FAH's performance depends on the SSE process, we can't make a dependable prediction.
