Type of SSE used

Moderators: Site Moderators, FAHC Science Team

Mindmatter
Posts: 5
Joined: Tue May 27, 2008 1:53 pm

Type of SSE used

Post by Mindmatter »

I came across an article outlining the floating-point unit in the upcoming AMD Bulldozer processor. Since 90% of my CPU usage goes to F@H, this paragraph caught my attention:
The floating point unit is also much more robust than it used to be. The Phenom had a single 128 bit unit per core, and Bulldozer now has it as 2 x 128 bit units. It can combine those units when running AVX and act as a single 256 bit unit. There are some performance limitations there as compared to the Intel CPUs which support AVX, and in those cases Intel should be faster. However, AVX is still very new, and very unsupported. AMD will have an advantage here over Intel when running SSE based code. It can perform 2 x 128 bit operations, or up to 4 x 64 bit operations. Intel on the other hand looks to only support 1 x 128 bit operation and 2 x 64 bit operations. The unit officially supports SSE3, SSE 4.1, SSE 4.2, AVX, and AES. It also supports advanced multiply-add/accumulate operations, something that has not been present in previous generations of CPUs.
So what does F@H use for floating-point processing, both single-core and SMP? Also, does bigadv use anything different from SMP?

Link:
http://www.pcper.com/article.php?aid=1083
Last edited by Mindmatter on Mon Feb 28, 2011 10:57 pm, edited 1 time in total.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Type of SSE used

Post by 7im »

Trade ya info, as I'd like to read the whole article. Please post a link...

To answer: mostly SSE. One fahcore uses SSE2, but those projects are few, and one of the new CPU fahcores uses all SSE versions available. -bigadv is also SSE, no difference.

Please also take into consideration not only the bit width, but how many CPU cycles it takes to do 4x64 on AMD vs 2x64 on Intel. If the AMD takes 4 CPU cycles and the Intel only takes 2 CPU cycles, then effectively the performance is the same. (That's why I'd like to read more...) ;)
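A minimal sketch of that arithmetic in Python. The cycle counts are the hypothetical ones from this post, not measured figures for either chip, and `ops_per_cycle` is an illustrative helper name.

```python
def ops_per_cycle(ops_per_instruction: int, cycles_per_instruction: float) -> float:
    """Sustained operations per clock for one instruction stream."""
    if cycles_per_instruction <= 0:
        raise ValueError("cycles_per_instruction must be positive")
    return ops_per_instruction / cycles_per_instruction


# Hypothetical numbers from the post:
# 4 x 64-bit ops taking 4 cycles vs 2 x 64-bit ops taking 2 cycles.
wide_but_slow = ops_per_cycle(4, 4)
narrow_but_fast = ops_per_cycle(2, 2)

# Both work out to the same rate, so width alone says nothing about speed.
print(wide_but_slow, narrow_but_fast)
```

The point is only that width and latency have to be read together; real chips also pipeline instructions, which this sketch ignores.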
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: Type of SSE used

Post by codysluder »

The other consideration is how many operations can be done simultaneously. Some AMD fanboys have criticized the Intel fanboys because chips like the i7 have virtual cores rather than real cores and imply that Bulldozer will have real cores. Whether you call something a real or a virtual core is asking the wrong question entirely. The i7 has four shared SSE units and each can process a certain number of single precision or SSE operations per second (and SMP can use all of them). Bulldozer has a certain number of shared SSE units and each can process a certain number of single precision or SSE operations per second (and SMP can use all of them).

The hardware pieces may be called by different names and may be structured differently, but the bottom line will be how many SSE operations can be done, not what you call them.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: Type of SSE used

Post by P5-133XL »

I try to wait till I see how well a particular CPU will fold rather than do the speculation game. I think there are too many variables that can affect the numbers so any speculation tends to be wildly inaccurate.
Mindmatter
Posts: 5
Joined: Tue May 27, 2008 1:53 pm

Re: Type of SSE used

Post by Mindmatter »

Added the link.
codysluder wrote:The other consideration is how many operations can be done simultaneously. Some AMD fanboys have criticized the Intel fanboys because chips like the i7 have virtual cores rather than real cores and imply that Bulldozer will have real cores.
Well from what I can tell Bulldozer will be able to do one 128-bit floating point op per core. I thought that was per clock as well but I could be mistaken. It can also split that and do two 64-bit FP ops, and it sounds like if one core in the module is not doing any FP ops then the other core can do four 64-bit, two 128-bit, or one 256-bit FP ops per clock.

I'm having a hard time getting good information on Intel's Sandy Bridge FPU. It looks like it can do one 256-bit FP op per clock, but I have seen some articles say two, although I don't know if that is with Hyper-Threading or how that works out, especially since HT shares core components. I'm also reading that Sandy Bridge doesn't even have a true 256-bit FPU; people are saying it doubles the core frequency of a 128-bit FPU to effectively process a 256-bit instruction. That would also mean it would run at 6GHz for a 3GHz core!

Edit: I kind of missed part of my point above with the Intel FPU. How many F@H FP ops can Sandy Bridge do? I also can't find any info on how Intel splits the FPU for 64-bit instructions, though I would imagine it is the same as AMD's method.
P5-133XL wrote:I try to wait till I see how well a particular CPU will fold rather than do the speculation game. I think there are too many variables that can affect the numbers so any speculation tends to be wildly inaccurate.
Well, I'm an AMD guy, so it really doesn't matter to me who is faster. I know that even if BD is slower than SB, it will be priced accordingly. But at the same time I like speculating about this stuff; it is half the fun of waiting for a new product :D
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Type of SSE used

Post by 7im »

http://www.behardware.com/articles/623- ... -test.html

The above article went into depth comparing the then-new Core 2 Duo architecture to the previous P4 NetBurst hardware.
Core uses two floating point calculation units, one dedicated to addition and the other to multiplication and division. Theoretical calculation capacity is 2 x87 instructions per cycle and 2 SSE 128 bit floating point instructions per cycle (that is 8 operations on 32 bit simple precision floating points, or 4 operations for double precision 64 bit floating points). Core is, in theory, two times faster for this type of instruction than Mobile, Netburst and K8.
The sections in red are the important parts. It accurately predicted that C2D chips were 2x faster on FAH than P4 systems. FAH is single precision, so that's what we need to start gathering info about for BD and SB. ;)
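The quoted per-cycle figures can be reproduced with a short Python sketch. It assumes peak throughput is simply lanes per register times instructions issued per cycle; `peak_ops_per_cycle` is an illustrative name, and real code rarely sustains this theoretical peak.

```python
def peak_ops_per_cycle(register_bits: int, element_bits: int,
                       instructions_per_cycle: int) -> int:
    """Theoretical peak floating-point operations per clock."""
    if register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of element width")
    lanes = register_bits // element_bits  # values packed into one register
    return lanes * instructions_per_cycle


# Figures from the quoted article: 2 SSE 128-bit instructions per cycle.
single_precision = peak_ops_per_cycle(128, 32, 2)  # 32-bit floats, what FAH uses
double_precision = peak_ops_per_cycle(128, 64, 2)  # 64-bit doubles

print(single_precision, double_precision)
```

Plugging in per-cycle issue rates for BD and SB, once reliable figures are available, would give the same kind of comparison for single precision.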


EDIT: Found another good blog post discussing 32-bit FPU performance on SB vs BD... http://blogs.amd.com/work/2010/10/25/the-new-flex-fp/

If I read it correctly, you'll get full speed from BD without a recompile of FAH, but only half speed from SB until FAH is recompiled with AVX.
Mindmatter
Posts: 5
Joined: Tue May 27, 2008 1:53 pm

Re: Type of SSE used

Post by Mindmatter »

7im wrote:If I read it correctly, you'll get full speed from BD without a recompile of FAH, but only half speed from SB until FAH is recompiled with AVX.
I see what you are saying. AMD's method will allow legacy code to use all of the 256-bit FPU, whereas Intel's method requires AVX in order to split the FPU up in any way the thread likes.

But at the same time it looks like even a Core 2 can do more FP ops per clock than BD, unless we are talking about only a single core in the module using the FPU. This makes me wonder if BD will be more efficient with F@H code if we only use 4 out of the 8 cores with the SMP client. Basically, would it be more efficient having 4 threads pumping out 8 FP ops each or 8 threads with 4 FP ops each?

This could make for an interesting mix of CPU and GPU clients. Have all of the FPUs fully loaded and running while still having four integer-only cores that could feed data to four or more GPUs without slowing down the SMP client. Might not be a speed king but could put a new spin on efficient usage of cores.
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: Type of SSE used

Post by codysluder »

Mindmatter wrote:Basically would it be more efficient having 4 threads pumping out 8 FP's each or 8 threads with four FP's each?
At this point, it's all speculation. Nobody really knows (especially me) but I do like to speculate.

If four threads can run exactly twice as fast as eight and if there's no interruption from other programs running, it doesn't matter. How much interruption from other tasks you'll see, how frequently, and how those interruptions are balanced between the four or eight threads probably has a bigger influence than anything.
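As a rough Python sketch of why the two layouts come out even on paper: the thread counts and per-thread rates below are the ones from Mindmatter's question, not measurements, and `aggregate_rate` is an illustrative helper that deliberately ignores interruptions and scheduling.

```python
def aggregate_rate(threads: int, fp_ops_per_thread: int) -> int:
    """Total FP ops per clock if every thread runs undisturbed."""
    return threads * fp_ops_per_thread


# Numbers from the question above: 4 threads x 8 FP ops vs 8 threads x 4 FP ops.
four_wide_threads = aggregate_rate(4, 8)
eight_narrow_threads = aggregate_rate(8, 4)

# Identical totals in the ideal case; any real difference would have to come
# from effects this model leaves out, such as interruptions from other tasks.
print(four_wide_threads, eight_narrow_threads)
```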

Even when a program stops sequential operation to take a branch, there's a minor interruption when the process takes a different path than the predicted path. That's the whole concept behind Hyper-Threading. Every unpredicted branch causes the pipeline to stall while new instructions are prepared to re-fill the pipeline. During those times, a real processor would be idle while for a virtual processor it matters less because the hardware can process data from another task while one task is stalled. That seems to be worth about 15% to FAH-SMP which is not a huge amount, but it's certainly as much as the uncertainties we're speculating about. I suppose BD will do something similar, so even if 85% of FAH's performance depends on the SSE process, we can't make a dependable prediction.