Project 5113

Moderators: Site Moderators, FAHC Science Team

shatteredsilicon
Posts: 87
Joined: Tue Jul 08, 2008 2:27 pm
Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers

Re: Project 5113

Post by shatteredsilicon »

toTOW wrote:The following units have been added as improvements to the two basic units:
- MMX (MultiMedia eXtensions): this unit was added by Intel to the Pentium core to speed up multimedia applications. It is an extension to the ALU, and it too can only work on integers. This extension is usually useless, since multimedia applications use floating point operations ... :roll:
The important part is that this is for integer vector processing (SIMD). Same idea as SSE, but for integers only. The interesting part is that it is implemented on top of the old i387 FPU registers: the 64-bit MMX registers alias the mantissa bits of the 80-bit x87 registers (the top 16 bits go unused). You can pack 2x 32-bit integers (or 4x 16-bit, or 8x 8-bit) into one register and effectively double the integer throughput, at the expense of not being able to use that register for FPU operations until the MMX work is done and the state is cleared with EMMS.
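To make that concrete, here is a minimal sketch using GCC's MMX intrinsics (mmintrin.h) — purely illustrative, not taken from any FAH core. It packs two 32-bit integers per register, adds both lanes with one instruction, and issues EMMS before handing the registers back to the FPU. Compile with gcc -mmmx.

Code: Select all

/* MMX illustration only -- not code from any FAH core. */
#include <mmintrin.h>
#include <stdio.h>

int main(void)
{
    /* Pack 2x 32-bit integers into each 64-bit MMX register
       (the mantissa field of an 80-bit x87 register). */
    __m64 a = _mm_set_pi32(10, 20);   /* high lane = 10, low lane = 20 */
    __m64 b = _mm_set_pi32(1, 2);     /* high lane = 1,  low lane = 2  */

    /* One PADDD instruction adds both lanes at once. */
    __m64 sum = _mm_add_pi32(a, b);

    int lo = _mm_cvtsi64_si32(sum);                      /* 22 */
    int hi = _mm_cvtsi64_si32(_mm_srli_si64(sum, 32));   /* 11 */

    /* EMMS hands the aliased registers back to the x87 FPU;
       skipping it corrupts any floating-point code that follows. */
    _mm_empty();

    printf("lanes: %d %d\n", hi, lo);
    return 0;
}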

Now we can talk about power consumption and stability issues. The worst case is of course when a lot of processing units are used, with a lot of data to move between CPU and memory. Here are some examples, with FAH cores and WUs:
toTOW wrote:- GroSMP is a bit different: it doesn't stress the CPU as hard as regular Gromacs, because processing power is limited by data transfers between CPU cores, but it's easy to guess that it stresses the caches and memory subsystem a lot. The A2 SMP core is progressively changing the rules as it makes better use of the CPU cores ... So we can say this is one of the "worst" cases, using ALU, FPU, SSE, caches and system memory.
The main price of these migrations has to do with the caches. On the C2D, migration is relatively cheap (although it still has a cost) because the L2 cache is shared between both cores; likewise on the Phenom, where all four cores share the L3. On the C2Q there is an additional cost when migrating between the upper and lower core pair, because it is essentially 2x C2D, with separate L2 caches on the two dual-core dies. That means that when a process migrates between the separate dies, the data in the cache on one side is effectively wasted, and has to be fetched again from main memory. I'm not sure the current OS kernels pay enough attention to this when deciding to migrate processes.
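If you don't trust the scheduler, you can pin the work yourself. A minimal Linux sketch, with the caveat that the {0,1}/{2,3} die pairing is an assumption — verify it in /sys/devices/system/cpu/cpu*/cache/index2/shared_cpu_map on your own box:

Code: Select all

/* Hedged sketch: restrict this process to the two cores that
   share one L2 on a C2Q, so the kernel can no longer migrate
   it across the die boundary. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);    /* first core of one die (assumed)   */
    CPU_SET(1, &mask);    /* second core of same die (assumed) */

    /* pid 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid %d restricted to cores 0-1\n", (int)getpid());
    /* ... run the cache-sensitive work here ... */
    return 0;
}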
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: Project 5113

Post by codysluder »

shatteredsilicon wrote:That means that when a process migrates between the separate dies, the data in the cache on one side is effectively wasted, and has to be fetched again from main memory. I'm not sure the current OS kernels pay enough attention to this when deciding to migrate processes.
Everything you say about migrating processes is true, but there's a lot of data being moved between these processes even when they have locked affinity. Some of the MPI traffic goes through the TCP/IP stack and some is done by direct inter-task memory transfers, but either way the pair of tasks that shares a cache will have faster access than the pairs that do not. With four processes, this is probably half of the data movement.
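For what it's worth, the same affinity trick can be applied per MPI rank, so which pairs share a cache is at least a deliberate choice rather than the scheduler's. A hedged sketch (one rank per core, ranks 0-3, core numbering assumed — not the FAH core's actual logic); build with mpicc:

Code: Select all

/* Hedged sketch: pin each MPI rank to the core with the same index,
   so ranks 0/1 share one die's L2 and ranks 2/3 share the other. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(rank, &mask);    /* assumes one rank per core, ranks 0-3 */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");

    /* Shared-memory transfers between ranks 0 and 1 (or 2 and 3) can
       now stay in a die's shared L2; the cross-die pairings still have
       to go through main memory. */
    printf("rank %d pinned to core %d\n", rank, rank);

    MPI_Finalize();
    return 0;
}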