Quad-core 2Ghz vs Dual-core 4Ghz - Which faster?

Moderators: Site Moderators, FAHC Science Team

WangFeiHong
Posts: 47
Joined: Mon Oct 27, 2008 1:40 pm

Quad-core 2Ghz vs Dual-core 4Ghz - Which faster?

Post by WangFeiHong »

Just a quick question. It's said that SMP is a lot faster than the classic client because it has some program that lets the cores "talk" to each other very fast. But does that communication really do any calculations?

If it's only inter-core communication that makes SMP faster, then how is it different from running 2 classic clients that do 2 WUs, with the results getting joined back up at Stanford (except that it saves them the time of joining results)?

The "talking" between cores doesn't do any calculations, so if I run SMP on a 2GHz dual core vs. a 4GHz single core, shouldn't we see similar FLOP performance?
Last edited by WangFeiHong on Sat Nov 22, 2008 11:31 am, edited 1 time in total.
Lgringo
Posts: 26
Joined: Thu Dec 06, 2007 1:02 am
Location: El Dorado

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by Lgringo »

WangFeiHong wrote: if i run SMP on 2Ghz Dual Core vs. 4Ghz Single Core, shouldn't we see similar FLOP performance?
Good question & welcome to the Forum! I'm also interested to see what the feedback on this will be. Happy Trails!
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by 7im »

The SMP FAQ would be a good read to start with, as I can't remember all the reasons. For instance, the SMP client does much larger and more complex work units than the CPU client, and the quick return of work units is important to the project.

Also consider that if SSE processing is the speed bottleneck, doubling the CPU GHz doesn't necessarily double the folding performance. However, doubling the amount of SSE processing power DOES double the speed.

And the dual and quad core systems of today are typically more advanced (faster SSE speed) than the single core systems of a few years ago.
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by MtM »

WangFeiHong wrote:Just a quick question. It's said SMP alot faster than classic because they have some program that can let cores "talk" to each other very fast? But does it really do any calculations?
Does it do calculations? Well, no. What it does is allow more calculations to be made per clock cycle (as you have more cores/threads).
WangFeiHong wrote:The "talking" between cores doesn't do any calculations, so if i run SMP on 2Ghz Dual Core vs. 4Ghz Single Core, shouldn't we see similar FLOP performance?
WangFeiHong wrote:If it's only inter-core communication that makes SMP faster, then what difference is it from 2 classic client, do 2 WU, then results getting joined back up at Stanford (except it save them time joining results?)
The difference is that with that much greater processing power per clock cycle you can make more complex work units, which can represent a longer piece of the time it takes for a protein to fully fold. It's not the same as joining four uniprocessor WUs, because those are not 1/4 of what an SMP WU is (changed to quad core setup as the smp client is designed for four threads and it runs best on four cores).
shatteredsilicon
Posts: 87
Joined: Tue Jul 08, 2008 2:27 pm
Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by shatteredsilicon »

In general, single fast core scales linearly up to the point where it is I/O (network/disk/memory) bound.
Multiple cores scale linearly only if all the operations are completely independent. This is very rarely the case in practice, so in general usage, 1x 4GHz is going to be faster than 2x 2GHz.

As a rule of thumb, in desktop application usage, performance scales linearly with clock speed, but logarithmically with the number of cores.
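One way to put rough numbers on the comparison is an Amdahl's-law-style model. This is an assumption added for illustration, not something measured on the folding client: the function name and the parallel fraction values are hypothetical, and migration or communication overheads are ignored entirely.

```python
def relative_speed(clock_ghz: float, cores: int, parallel_fraction: float) -> float:
    """Throughput relative to a single 1 GHz core under an Amdahl's-law model.

    parallel_fraction is the (assumed) share of the work that can be
    spread over several cores; the rest runs on one core only.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    # Extra cores shrink only the parallel part; clock speed shrinks everything.
    time_per_unit = (serial_fraction + parallel_fraction / cores) / clock_ghz
    return 1.0 / time_per_unit


if __name__ == "__main__":
    for p in (0.5, 0.9, 1.0):
        single = relative_speed(4.0, 1, p)
        dual = relative_speed(2.0, 2, p)
        print(f"parallel fraction {p:.1f}: 1x4GHz = {single:.2f}, 2x2GHz = {dual:.2f}")
```

In this simplified model the 2x 2GHz machine only matches the 1x 4GHz machine when the work is perfectly parallel, and falls behind otherwise.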

There are also overheads in migrating processes between CPU cores. C2Q tends to suffer from this more than the Phenom X4 due to it actually being a 2x2 design vs 4x1 on the Phenom. But there are STILL migration overheads, nonetheless.

In the context of folding, you can easily test this. Fire up the SMP client on a 4-core CPU. See how many PPD it is producing when the 4 threads run freely on the available CPU cores. Then fire up 4 SMP clients, binding each client to only one core (all 4 of its threads will run on a single core). So you will end up with 4x4=16 folding processes running, 4 bound to each CPU core. What you will find is that the PPD of the latter (4x4) configuration is significantly higher than that of the former (4x1) configuration. This difference is made worse if the machine is experiencing other load (applications), because due to an inefficiency in the folding process distribution design, all threads seem to have fixed weighting, so if one thread bogs down, they will all slow down, which will result in a flurry of process migration (overheads) and thread imbalances (CPU idle time). In the case of 4x4 this won't happen because there is no migration between the cores, and if some process is eating 20% of one core, all it will do is reduce CPU resources for the folding processes from 25% each to 20% each, and they'll still balance out.

So, if you want maximum PPD out of your SMP client, run one instance per core, and bind each instance to one core so it never migrates away from it. The difference can be massive in some cases. For example, since I run 6x GPU clients (no way to distribute them evenly over 4 cores), one SMP client will produce about 2200 PPD. If I start 4 SMP clients and bind each one to its own core, the total production of the 4 clients is about 4700 PPD. This is despite the fact that each client has less memory bandwidth (1/4 of what a single client has available). At that point, however, the client tends to get memory bound, so if you run a 4x4 configuration, the Phenom X4 may well scale better due to the fact that it has over 50% more RAM bandwidth than an equivalent Core2.
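On Linux, one way to do the binding described above is the `taskset` utility from util-linux. The sketch below only builds the command lines and does not launch anything. The client path `./folding-client` is a placeholder, not the real binary name or its flags, and each instance would presumably need its own working directory, which is not handled here.

```python
import shlex


def pinned_commands(client_cmd: str, cores: int) -> list[list[str]]:
    """Build one `taskset -c N <client>` command line per CPU core.

    client_cmd is a placeholder for however the client is actually started.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    base = shlex.split(client_cmd)
    # `taskset -c N cmd` starts cmd with its CPU affinity restricted to core N.
    return [["taskset", "-c", str(core), *base] for core in range(cores)]


if __name__ == "__main__":
    # Hypothetical client path; substitute the real start command.
    for cmd in pinned_commands("./folding-client", 4):
        print(shlex.join(cmd))
```

Comparing the total PPD of the pinned instances against a single unpinned instance, over the same projects, is the test described in the post.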
MtM wrote:changed to quad core setup as the smp client is designed for four threads and it runs best on four cores
I guess we'll just have to disagree on that again. I've presented my evidence, and without evidence, one can only have an opinion. ;)
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by MtM »

shatteredsilicon wrote:In general, single fast core scales linearly up to the point where it is I/O (network/disk/memory) bound.
Multiple cores scale linearly only if all the operations are completely independent. This is very rarely the case in practice, so in general usage, 1x 4GHz is going to be faster than 2x 2GHz.

As a rule of thumb, in desktop application usage, performance scales linearly with clock speed, but logarithmically with the number of cores.

There are also overheads in migrating processes between CPU cores. C2Q tends to suffer from this more than the Phenom X4 due to it actually being a 2x2 design vs 4x1 on the Phenom. But there are STILL migration overheads, nonetheless.

In the context of folding, you can easily test this. Fire up the SMP client on a 4-core CPU. See how many PPD it is producing when the 4 threads run freely on the available CPU cores. Then fire up 4 SMP clients, binding each client to only one core (all 4 of it's threads will run on a single core). So you will end up with 4x4=16 folding processes running, 4 bound to each CPU core. What you will find is that the PPD of the latter (4x4) configuration is significantly higher than that of the former (4x1) configuration. This difference is made worse if the machine is experiencing other load (applications), because due to an inefficiency in the folding process distribution design, all threads seem to have fixed weighting, so if one thread bogs down, they will all slow down, which will result in a flurry of process migration (overheads) and thread imbalances (CPU idle time). In the case of 4x4 this won't be the case because there is no migration between the cores, and if some process is eating 20% of one core, all it will do is reduce CPU resources for the folding processes from 25% each to 20% each, and they'll still balance out.

So, if you want maximum PPD out of your SMP client, run one instance per core, and bind each instance to one core so it never migrates away from it. The difference can be massive in some cases. For example, since I run 6x GPU clients (no way to distribute them evenly over 4 cores), one SMP client will produce about 2200PPD. If I start 4 SMP clients and bind each one to it's own core, the total production of the 4 clients is about 4700 PPD. This is despite the fact that each client has less memory bandwidth (1/4 of what a single client has available). At that point, however, the client tends to get memory bound, so if you run a 4x4 configuration, the Phenom X4 may well scale better due to the fact that it has over 50% more RAM bandwidth than an equivalent Core2.
MtM wrote:changed to quad core setup as the smp client is designed for four threads and it runs best on four cores
I guess we'll just have to disagree on that again. I've presented my evidence, and without evidence, once can only have an opinion. ;)
Jezus christ....
Back in the day, we couldn't get long trajectories. By long I mean, we couldn't even come close to getting simulations that were comparable to the time it typically takes for a protein to fold in experiments. For example, the villin headpiece molecules fold in about 50 microseconds in a test tube. Simulations were originally limited to maybe 50 or 100 nanoseconds, about 1,000 times less. The idea is that the probability of a 50-nanosecond trajectory resulting in a folding event, for a protein that folds in 50 microseconds on average, is

p(folding) = 1 - exp[ -(50 ns)/(50,000 ns) ] ≈ 0.1 %

It's kind of like flipping a biased coin: say 99.9 % of the time, you get heads, but if you're lucky, on one flip you *might* get tails. Of course, if you want to see tails, then you could always try to flip a lot, and occasionally (like 1 in every 1,000 flips) you'll observe tails. Likewise, if there's ~0.1% chance of observing folding in a 50-ns trajectory, then if we run 1,000 simulations then on average one of them should fold, if we run 10,000 simulations then ~10 should fold, etc. This is the reason for F@H in the first place: thousands of client machines mean thousands of "coin flips" or attempts to fold the protein, which increases our chances of actually seeing tails or a protein fold.

(The key here is the randomness in the simulations, best represented by random kinetic energy. When we say some system is at some temperature, we are really making a statement about the distribution of kinetic energies of the particles, but nothing about the specific kinetic energies of specific particles. What we do is to initially supply kinetic energies to all the particles that are consistent with the specified temperature in a random way. When you do this, then the analysis in the preceding paragraph makes sense.)
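The coin-flip estimate can be checked numerically. This is a minimal sketch that assumes the simple single-exponential formula quoted above; the function names are made up for the illustration.

```python
import math


def p_fold(trajectory_ns: float, mean_folding_time_ns: float) -> float:
    """Chance that one trajectory contains a folding event,
    assuming single-exponential folding kinetics."""
    return 1.0 - math.exp(-trajectory_ns / mean_folding_time_ns)


def expected_folds(n_trajectories: int, trajectory_ns: float,
                   mean_folding_time_ns: float) -> float:
    """Average number of folding events seen across independent trajectories."""
    return n_trajectories * p_fold(trajectory_ns, mean_folding_time_ns)


if __name__ == "__main__":
    # 50 ns trajectories of a protein that folds in 50 microseconds on average.
    print(f"p(folding) per trajectory: {p_fold(50.0, 50_000.0):.6f}")
    for n in (1_000, 10_000):
        print(f"{n} trajectories -> about {expected_folds(n, 50.0, 50_000.0):.1f} folding events")
```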

So in the "early days" F@H simulations, completing more work units was better for the project: basically everyone's machines were doing the computational-chemical equivalent of flipping coins. We'd send a work unit to your machine, which would compute the result of the "coin flip" (by moving atoms around according to their kinetic energy, their interactions with one another, and by applying Newton's laws of motion).

The newer way of thinking about things is essentially a test of whether the formula I used to calculate the percent chance of folding is accurate. If everything is simple, that simple exponential formula applies, but experimental evidence, let alone Murphy's law, would seem to indicate that things aren't so simple. For example, there could be two exponentials, not just one, and using the strategy above you'd really only see the first (faster) exponential. The contemporary approach relies on really fast machines (SMP, GPU, PS3, etc.) to make trajectories longer, instead of doing more trajectories: running a new Gen for the same Run/Clone rather than starting a new Gen 0 for a new Run/Clone. This way we can see if there is more complicated behavior. And the way to get this done is to use fast hardware with fast turnaround (deadline) times, because we need the Gen you're working on back before we can start to work on the next Gen.
Take it from someone you will trust more than me, for god's sake, man.

And why are you talking about PPD when the OP asked about RESULTS, indicating it's the science he's interested in?

You can have your opinion, of course, but don't mind me not only disagreeing but also finding you a bit 'off the mark' in this thread :ewink:

And I'm not even talking about the 'mistakes' in your post; your pov in general is so wrong that I don't even need to.
shatteredsilicon
Posts: 87
Joined: Tue Jul 08, 2008 2:27 pm
Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by shatteredsilicon »

MtM wrote:And then I'm not even talking about the 'mistakes' in your post, it's just your pov in general being so wrong that I don't even need to.
Allow me to point you at a snippet from the original post in this thread:
WangFeiHong wrote:The "talking" between cores doesn't do any calculations, so if i run SMP on 2Ghz Dual Core vs. 4Ghz Single Core, shouldn't we see similar FLOP performance?
All things being equal (same project WU), the number of points for the WU will be fixed, as will the number of FLOPs required to process it. Thus, FLOPs will be proportional to points. Therefore, more points mean more FLOPs, and a higher number of processed WUs per (larger) unit of time (e.g. per month). What I have provided is information on how to minimize overheads and increase the long-term WU throughput, i.e. the amount of science being done per processor per month, if you will. That seems pretty relevant to what the original poster asked. The point is that a 2x2GHz will yield lower performance than 1x4GHz every time, which is what was being asked in the very subject line of the thread.
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by MtM »

shatteredsilicon wrote:
MtM wrote:And then I'm not even talking about the 'mistakes' in your post, it's just your pov in general being so wrong that I don't even need to.
Allow me to point you at a snippet from the original post in this thread:
WangFeiHong wrote:The "talking" between cores doesn't do any calculations, so if i run SMP on 2Ghz Dual Core vs. 4Ghz Single Core, shouldn't we see similar FLOP performance?
Since all things being equal (same project WU), the number of points for the WU will be fixed, as will the number of FLOPs required to process it. Thus, FLOPs will be proportional to points. Therefore, more points will yield more FLOPs, and higher number of processed WUs per (larger) unit of time (e.g. per month). What I have provided is information on how to minimize overheads and increase the long-term WU throughput, i.e. the amount of science being done per processor per month, if you will. That seems pretty relevant to what the original poster asked. The point is that a 2x2GHz will yield lower performance than 1x4GHz every time, which is what was being asked in the very subject line of the thread.
The amount of science done does not depend on the throughput of WUs. Can't you read, or are you just trying to perpetuate your wrong arguments with more wrong arguments?

For the sake of the OP I'll rebut your arguments.

FLOPS and PPD don't have that tie when you talk about single core vs multi core (or even when comparing single cores with each other, but see 7im's post for that). Multi core enables more complex work units, which differ from the single core work units in their complexity and/or simulation length. 2x2 = more than 1x4 because you run those complex work units on them, which couldn't be run (within reasonable times or due to hw limitations) on a 1x4. The scaling is very much non-linear, and I'm not addressing utilisation of cores, but scaling in work unit complexity, and that is the tie to scientific results. Not your constant referring to PPD as a good indication while you know it is not.

If I didn't know any better, I would say you were trying to get into an argument with me by taking the opposite stance in this... again.

I think you need to read the quote above again.
Last edited by MtM on Sat Nov 15, 2008 8:21 pm, edited 1 time in total.
uncle_fungus
Site Admin
Posts: 1288
Joined: Fri Nov 30, 2007 9:37 am
Location: Oxfordshire, UK

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by uncle_fungus »

Calm down people, there's no need to get hot under the collar about this.
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by MtM »

I'm not hot, I'm ice cold disappointed in the fellow :lol:
shatteredsilicon
Posts: 87
Joined: Tue Jul 08, 2008 2:27 pm
Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by shatteredsilicon »

MtM wrote:Amount of science done is not dependant on throughput of wu's can't you read or are you just trying to perpetuate your wrong arguments with more wrong arguments?
That sounds like a contradiction in terms to me. For the record, I understand there is some extra usefulness in a WU being completed quicker, but even though this is the only argument you could be using against what I said, you seem not to have gotten to that point (or were you partaking in this discussion with too limited an understanding?).
MtM wrote:FLOPS and PPD don't have that tie when you talk about single core vs multi core ( or even comparing single cores with eachother but see 7im's post for that ), multi core enabels more complex workunits which are diffrent then the single core work units in their complexity and/or simulation length.
Actually, I was only talking about SMP clients with SMP work units. I never made any mention, nor was I intending to imply using single-processor WUs. Everything I said was only referring to SMP WUs (a2, 4-thread ones for quad-core CPUs, to be most specific).
MtM wrote:2x2 = more then 1x4 because you run those complex work units on them which couldn't be ran ( within reasonable times or due to hw limitiations ) on a 1x4.
There are no such hardware limitations (even theoretically). A parallelized task can at best scale linearly, although achieving such scalability is highly unusual in the real world. You can run an SMP client on a single-core processor. As long as that CPU is at least around a 1.6GHz Core2 class processor, and provided it is running 24/7 with no other significant load, it will still make the deadline for most SMP WUs. I'm not sure how that holds for 3 or 8-core WUs; the deadlines may require a faster single core than 1.6GHz, and I haven't tested this configuration.
MtM wrote:The scaling is very much non linear, and I'm not adressing utilisation of cores, but scaling in work unit complexity and that is the tie to scientific results. Not your constant refering to ppd being a good indication while you know it is not.
Actually, any SMP WU will quite happily run on a single-core processor. There is nothing that SMP provides other than that the WU completes faster. There is no extra complexity that using multiple processors magically solves, which is what you seem to be implying. The problem is that overheads eat much of the gain, as I explained. The MPI process controller seems to use a naive round-robin scheduler, which makes the throughput suffer very heavily when the machine has any other load going on. Running 4 threads on an idle 4-core CPU will yield very good results. But the moment you do something else, and you generate, say, 50% load on one core, what'll end up happening is that all 4 workers will only run at 50%, because it only seems to scale at 4x the slowest worker thread - and that is not including the problem of migration of processes between processor cores, which will make the scaling slightly worse still.

Specifically, on the setup that I'm using, I have 6x GPU clients and a quad-core CPU. Each GPU client consumes around 20% of a single core (it's actually a bit less, but 20% is 1/5, a nice round number for the sake of the explanation). Here's an ASCII-art example of what happens when all run at the same time, with a single SMP client.

First line is the CPU core number. G is GPU client's CPU usage, I is idle CPU time, S is SMP client's CPU time.

0 1 2 3
G G G G
G I G I
S S S S
S S S S
S S S S
Essentially, it means that there's 40% of one core not being used, due to the MPI scheduler not being quite up to the task of balancing the workload distributed to the cores. What happens then is that the OS process scheduler notices that the SMP processes want to use more CPU, so it tries to reschedule things to optimize the CPU utilization, and starts throwing processes around from core to core trying to do a better job. This starts introducing CPU migration latencies (typically 100-150ns) all over the place, and the performance drops through the floor. If you had a single core (or bound a process set, such as a single instance of an SMP client, to a single core) this imbalance and process migration wouldn't happen, thus yielding a massive saving in wasted CPU time, which will increase the WU throughput. As I said, I have observed a difference of 2x under real-world conditions.
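The idle time in the diagram can be reproduced with a toy model. This is a sketch under the assumptions stated in the post (each GPU client takes 20% of a core, and a single SMP client advances at the pace of its slowest worker); the function names are invented for the illustration and migration costs are not modelled.

```python
def core_availability(gpu_clients_per_core: list[int], gpu_share_pct: int = 20) -> list[int]:
    """Percent of each core left over after the GPU clients take their share."""
    return [100 - n * gpu_share_pct for n in gpu_clients_per_core]


def lockstep_usage(avail: list[int]) -> int:
    """CPU used (in percent of one core) by one SMP client whose workers
    all run at the pace of the slowest one."""
    return len(avail) * min(avail)


def independent_usage(avail: list[int]) -> int:
    """CPU used when one client is bound to each core and simply takes
    whatever that core has free."""
    return sum(avail)


if __name__ == "__main__":
    # 6 GPU clients over 4 cores, laid out as in the diagram: 2, 1, 2, 1.
    avail = core_availability([2, 1, 2, 1])
    print("free per core:", avail)
    print("single SMP client uses:", lockstep_usage(avail))
    print("bound clients use:", independent_usage(avail))
    print("idle with single SMP client:", independent_usage(avail) - lockstep_usage(avail))
```

The idle figure from this model corresponds to the two "I" cells in the diagram. It does not by itself account for the 2x difference reported in the post, which also involves migration overheads.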

Even theoretically, 1x4GHz will come out at worst equal and on average significantly ahead of a 2x2GHz solution. Trust me - I'm a computer scientist. :)
MtM wrote:If I didn't know any better. I would say you where trying to get into an argument with me by taking the opposite stance in this... again.

I think you need to read the quote above again.
I promise you, I'm not picking on you specifically. I apologize if it looks that way. Perhaps I shouldn't have inserted a snippet from your post, and stuck purely to answering the original question asked. I'll try to not repeat that mistake.
uncle_fungus wrote:Calm down people, there's no need to get hot under the collar about this.
Sorry. I hope there's no offense taken.
WangFeiHong
Posts: 47
Joined: Mon Oct 27, 2008 1:40 pm

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by WangFeiHong »

I did some reading and think I can conclude that SMP is employed to reduce the gap in performance between multi-core and single-core by having some kind of multithreading of the WUs.
We could just run multiple independent clients, but this would be throwing away a lot of power. What makes an SMP machine special is that it is more than just the sum of the individual parts (CPU cores), since those cores can talk to each other very fast. In FAH, machines talk to each other when they return WUs to a central server, say once a day. On SMP this happens once a millisecond or so (or faster). That 86,000,000x speed up in communication can be very useful, even if there isn't 100% utilization in the cores themselves.

there are some calculations which require fast interconnects, and that's where the FAH/SMP client is particularly important.
http://folding.typepad.com/news/2008/07/fahsmp-q-a.html
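The communication speed-up quoted above is simple arithmetic. A quick sketch, assuming exactly one exchange per day versus one exchange per millisecond (the quote says "or so", so the real ratio is only approximate):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def sync_speedup(slow_interval_ms: float, fast_interval_ms: float) -> float:
    """How many times more often the fast channel exchanges data."""
    return slow_interval_ms / fast_interval_ms


if __name__ == "__main__":
    # Once a day (returning a WU to the server) vs once a millisecond (cores on one machine).
    print(f"{sync_speedup(MS_PER_DAY, 1.0):,.0f}x")
```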
So I'm guessing that it lets SMP clients join up 4 trajectories together, so that they don't need WU-1 0ns-100ns, WU-2 101-200ns, WU-3 201-300ns and WU-4 301-400ns, and then have to join the results/trajectories back together later in their lab. The results thus arrive faster than when running 4 classic clients.
In FAH, we've taken a different approach to multi-core CPUs. Instead of just doing more WU's (eg doing 8 WU's simultaneously), we are applying methods to do a single WU faster. http://folding.typepad.com/news/2008/06 ... re-do.html

They give us considerably longer trajectories in the same wall clock time, allowing us to turn what used to take years to simulate even on FAH, to a few weeks to months........whereas the SMP client can lead to a 4x speed up over the complete range of calculations we need to run. http://folding.stanford.edu/English/FAQ-SMP
So by multithreading (reducing inter-core overhead), SMP tries to make it look like you have 4GHz for folding one WU instead of 2GHz for each of two WUs, so the WU can complete faster.
shatteredsilicon
Posts: 87
Joined: Tue Jul 08, 2008 2:27 pm
Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by shatteredsilicon »

Indeed, WangFeiHong, that is the theory. In practice, however, the scalability isn't as good as you'd hope, hence the discrepancies I've mentioned. A single 4GHz core will always beat 2x2GHz cores. The reason why we bother with 2x2GHz setups is because vertical scalability (pushing up the clock speeds) is always going to be limited. Horizontal scalability (more cores) is much more viable, albeit there are overheads involved.
MtM
Posts: 1579
Joined: Fri Jun 27, 2008 2:20 pm
Hardware configuration: Q6600 - 8gb - p5q deluxe - gtx275 - hd4350 ( not folding ) win7 x64 - smp:4 - gpu slot
E6600 - 4gb - p5wdh deluxe - 9600gt - 9600gso - win7 x64 - smp:2 - 2 gpu slots
E2160 - 2gb - ?? - onboard gpu - win7 x32 - 2 uniprocessor slots
T5450 - 4gb - ?? - 8600M GT 512 ( DDR2 ) - win7 x64 - smp:2 - gpu slot
Location: The Netherlands

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by MtM »

shatteredsilicon wrote:
MtM wrote:Amount of science done is not dependant on throughput of wu's.
That sounds like a contradiction in terms to me. For the record, I understand there is some extra usefulness in a WU being completed quicker, but even though this is the only argument you could be using against what I said, you seem not have gotten to that point (or were you partaking in this discussion with too limited an understanding?).
shatteredsilicon wrote:
MtM wrote:FLOPS and PPD don't have that tie when you talk about single core vs multi core ( or even comparing single cores with eachother but see 7im's post for that ), multi core enabels more complex workunits which are diffrent then the single core work units in their complexity and/or simulation length.
Actually, I was only talking about SMP clients with SMP work units. I never made any mention, nor was I intending to imply using single-processor WUs. Everything I said was only referring to SMP WUs (a2, 4-thread ones for quad-core CPUs, to be most specific).
shatteredsilicon wrote:
MtM wrote:2x2 = more then 1x4 because you run those complex work units on them which couldn't be ran (within reasonable times or due to hw limitiations) on a 1x4.
There are no such hardware limitations (even theoretically). Parallelized task can run at best scale linearly, although achieving such scalability this is highly unusual in the real world. You can run an SMP client on a single-core processor. As long as that CPU is at least around a 1.6GHz Core2 class processor, provided it is running 24/7 with no other significant load, it will still make the deadline for most SMP WUs. I'm not sure how that holds for 3 or 8-core WUs, the deadlines may require a faster single core than 1.6GHz, I haven't tested this configuration.
You should read the above quote again.
shatteredsilicon wrote:
MtM wrote:The scaling is very much non linear, and I'm not adressing utilisation of cores, but scaling in work unit complexity and that is the tie to scientific results. Not your constant refering to ppd being a good indication while you know it is not.
Actually, any SMP WU will quite happily run on a single-core processor. There is nothing that SMP provides other than that WU completes faster. There is no extra complexity that using multiple processors magically solves, which is what you seem to be implying. The problem is that overheads eat much of the gain, as I explained. The MPI process controller seems to use a naive round-robin scheduler, which makes the throughput suffer very heavily when the machine has any other load going on. Running 4 threads on an idle 4-core CPU will yield very good results. But the moment you do something else, and you generate, say,50% load on one core, what'll end up happening is that all 4 workers will only run at 50%, because it only seems to scale at 4x the slowest worker thread - and that is not including the problem of migration of processes between processor cores, which will make the scaling slightly worse still.
They would not run on just any CPU, or they would have been run on them; multi core setups are now the first capable hw available to process them while keeping their scientific worth. Even if you could run one on a P4 Northwood, it would take so much longer that it would have no value when it gets back. Again, you should read that quote again.
shatteredsilicon wrote:Specifically, on the setup that I'm using, I have 6x GPU clients and a quad-core CPU. Each GPU client consumes around 20% of a single core (it's actually a bit less, but 20% is 1/5, a nice round number for the sake of the explanation. Here's an ASCII-art example of what happens when all run at the same time, with a single SMP client.

First line is the CPU core number. G is GPU client's CPU usage, I is idle CPU time, S is SMP client's CPU time.

0 1 2 3
G G G G
G I G I
S S S S
S S S S
S S S S
Essentially, it means that there's 40% of one core not being used, due to the fact that the MPI scheduler not being quite up to the task of balancing the workload distributed to the cores. What happens then is that the OS process scheduler notices that the SMP cores want to use more CPU, so it tries to reschedule things to optimize the CPU utilization, and starts throwing processes around from core to core trying to do a better job. This starts introducing CPU migration latencies (typically 100-150ns) all over the place, and the performance drops through the floor. If you had a single core (or bound a process set, such as a single instance of an SMP client to a single core) this imbalance and process migration wouldn't happen, thus yielding a massive saving in wasted CPU time, which will increase the WU throughput. As I said, I have observed a difference of 2x under real-world conditions.

Even theoretically, 1x4GHz will come out at worst equal and on average significantly ahead of a 2x2GHz solution. Trust me - I'm a computer scientist. :)
You are a scientist? Why then do you have problems rebutting the above quote, which you try to do with examples and arguments that aren't specific to the question/situation at hand? Trust me, I am not a scientist but I am eligible for Mensa and I can see you're just not hitting the mark with your 'observations'.
shatteredsilicon wrote:I promise you, I'm not picking on you specifically. I apologize if it looks that way. Perhaps I shouldn't have inserted a snippet from your post, and stuck purely to answering the original question asked. I'll try to not repeat that mistake.
Single core is not the same. First of all, which current CPU has a single core variant?
Last edited by MstrBlstr on Thu Dec 11, 2008 10:25 am, edited 1 time in total.
Reason: Removed personal comments towards another member.
HaloJones
Posts: 906
Joined: Thu Jul 24, 2008 10:16 am

Re: Dual-core 2Ghz vs Single-core 4Ghz - Which faster?

Post by HaloJones »

I could run a single Windows SMP client on my quad and get around 2200ppd. (Let's pretend that the ppd reflects the benefit to Stanford.)
I actually run two Linux VMs each limited to 2-cpus. Each Linux VM runs two Linux SMP clients. This gets 4400ppd.

Now some people get all hot under the collar and complain that I'm delaying the return of the results but if they're making the deadline that can't be a problem for Stanford - after all, they set the deadline, not me. Secondly, it must also be of benefit to Stanford because more work gets done.

And finally, doesn't it prove that running a single SMP client on a quad is a massive waste of potential resource?