7im wrote:
shatteredsilicon wrote:
WangFeiHong wrote:
I thought we had established that 1x4GHz processors were at least faster than 2x2GHz processors due to inter-core bottlenecks.
That's what I thought I said - 1x4GHz is better than 2x2GHz.
Did I miss the post where you show SMP folding numbers to back up this statement?
I don't have a directly equivalent setup to demonstrate with, but the fact that running 4 clients on a quad, each with its affinity bound to a single core, yields massively more PPD than running 1 client across all four cores (1 FahCore per CPU core) is pretty strong evidence of it. If the scaling were well balanced under real-world conditions, performance would favour the setup with the fewest total processes: running 4 clients means more process switching (which adds overhead and slows things down), 4x the amount of data being processed at once (which significantly reduces cache effectiveness), and roughly a 4-fold increase in memory bandwidth contention. The fact that, despite the extra process switching overhead and the added cache and memory bandwidth contention of running multiple folding processes each bound to one CPU core, this setup still yields much higher throughput (at the expense of higher per-WU latency) means that the MPI FAH scaling is actually pretty dire.
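For anyone wanting to reproduce the per-core binding, `taskset -c N <command>` does it from the shell on Linux. Below is a minimal sketch of the same thing as a wrapper program - my own illustration, not part of the FAH client - that pins itself to one core via sched_setaffinity() and then exec's whatever client you pass it; the exec'd process keeps the affinity mask:

[code]
/* pinrun.c - hedged sketch of a one-core launcher, not FAH code.
 * Usage: ./pinrun <core> <command> [args...]
 * e.g.:  ./pinrun 0 ./fah6 -local   (command name is illustrative) */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <core> <command> [args...]\n", argv[0]);
        return 1;
    }

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(atoi(argv[1]), &set);   /* restrict to the one requested core */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    execvp(argv[2], &argv[2]);      /* exec'd client inherits the mask */
    perror("execvp");
    return 1;
}
[/code]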
The problem is exactly as Halo describes it - the performance of the whole operation is limited by the slowest core, which means the effect of any other process competing for CPU time is magnified 4-fold relative to the CPU time it actually consumes. Another process using up 10% of one core only takes 2.5% of the machine's total CPU time, but because the folding speed is limited by the slowest thread, it effectively slows all four threads, and therefore the whole run, down by 10%.
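To make that arithmetic concrete, here's a toy model (my own illustration, not FAH code): treat each simulation step as finishing only when the slowest of the four MPI ranks finishes, then steal 10% of one core:

[code]
/* Toy model of slowest-rank-limited throughput; numbers are illustrative.
 * Each step ends only when all four ranks are done, so per-step wall
 * time is the MAX of the per-rank times. */
#include <stdio.h>

int main(void) {
    double rank_ms[4] = {100.0, 100.0, 100.0, 100.0}; /* ideal: equal cores */

    /* Another process steals 10% of core 2's cycles, so that rank's
     * work now takes 100 / 0.9 ~= 111.1 ms instead of 100 ms. */
    rank_ms[2] /= 0.90;

    double step = 0.0;
    for (int i = 0; i < 4; i++)
        if (rank_ms[i] > step)
            step = rank_ms[i];

    /* Throughput scales as 1/step: the whole run loses 10%, even though
     * only 2.5% of the machine's total CPU time was taken away. */
    printf("ideal: 100.0 ms/step, actual: %.1f ms/step -> %.1f%% PPD loss\n",
           step, (1.0 - 100.0 / step) * 100.0);
    return 0;
}
[/code]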
1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers