Linux SMP v6 compared to Windows SMP client

Post by **Balistyx** » Mon Jul 07, 2008 5:11 am

Do they perform at about the same rate?

Logic would tell me that the Linux client would perform better in console-only compared to having Windows running.

Post by **bruce** » Mon Jul 07, 2008 6:12 am

A) When you have the viewer open in Windows, folding slows down. The new viewer uses a lot more CPU time than the old one, but the slowdown has always been there. When the viewer is closed, the Windows GUI client and the Windows Console client are essentially the same.

B) Each of the versions of MPI for Windows-32 is inferior to MPI for Linux-64. That creates a difference that is not easily overcome. Some of that comes out as stability issues and some as performance issues.

C) The overhead for Windows sitting there but not doing anything isn't really zero, but it's a lot smaller than a lot of the Linux people believe -- unless Windows is actually doing something, which is what I covered in part A. Of course I'm assuming that you turn off some of the unnecessary functions like virus scans and file indexing, etc.

Post by **noorman** » Mon Jul 07, 2008 11:34 pm

.

Performance is markedly higher with the Linux SMP client than with the Windows SMP client.

You could say that Linux has the direct-drive system (or the DOHC motor) whereas Windows hasn't ...

That's why I went with Linux, which I didn't really know before; it has better stability too (see reports).

.

Post by **7im** » Tue Jul 08, 2008 5:51 am

It's not the OS that affects the speed. The MPICH packages are different between the two OSs, and so behave differently. In my experience, the Windows client isn't any less stable, just susceptible to more problems than the Linux client.

The performance difference is notable: 10-15%, sometimes more, sometimes none at all. For some people it's worth running Linux in a VM, for others it isn't. Run what you want to run.

Post by **shatteredsilicon** » Tue Jul 08, 2008 2:42 pm

Balistyx wrote:Logic would tell me that the Linux client would perform better in console-only compared to having Windows running.

It depends. I don't imagine there is a big difference. I run multiple uni-processor clients (v5.04) on Linux because I saw weird things happening with the Linux SMP client. For example, when another process is running and taking as little as 5% CPU, one of the F@H SMP threads gets migrated away to a different core, so I end up with 50% idle time that F@H isn't using. I didn't observe that behaviour on the Windows SMP client, but it is entirely possible that this is merely unobservable on Windows, not that it isn't actually happening.
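One way to test whether the thread migration described above is costing CPU time is to pin a process to a single core and compare utilization. The following is a minimal sketch, assuming Linux and Python 3 (`os.sched_setaffinity` is only available on some platforms, Linux among them). The helper name `pin_to_one_cpu` is made up for illustration, and nothing here is part of the F@H client itself.

```python
import os


def pin_to_one_cpu(pid=0):
    """Restrict a process to one CPU it is already allowed to use.

    pid=0 means the calling process. Returns the chosen CPU number,
    or None if the platform has no affinity support.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. Windows or macOS: nothing to do in this sketch
    allowed = os.sched_getaffinity(pid)  # set of CPUs the process may run on
    cpu = min(allowed)                   # arbitrary but deterministic choice
    os.sched_setaffinity(pid, {cpu})     # the scheduler may no longer migrate it
    return cpu


if __name__ == "__main__":
    print("pinned to CPU:", pin_to_one_cpu())
```

Pinning removes the scheduler's freedom to balance load, so it can also make things worse when another busy process shares the chosen core; it is a diagnostic, not a guaranteed fix.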

I use the SMP client on Windows just because I can live with 1 minimized console window in the tray, but 4 would start to annoy me. On Linux it's all forked into the background in rc.local, so I use 4x single-thread clients because that seems to yield the most CPU utilization, so presumably it's actually doing more work. I haven't done any PPD benchmarking, though.

Post by **John Naylor** » Tue Jul 08, 2008 2:46 pm

@shatteredsilicon

Firstly, welcome to the forums!

If the only reason for you running SMP is because four minimised console windows would annoy you, you could always use something like TrayIt (as I believe it's called) to move the minimised windows into the System tray... or install 4 v5.04 console clients as a service (meaning that you don't get a window) then use something like FahMon to monitor them (which again can run from the system tray).

Post by **shatteredsilicon** » Tue Jul 08, 2008 4:20 pm

John Naylor wrote:Firstly, welcome to the forums!

Thank you!

John Naylor wrote:If the only reason for you running SMP is because four minimised console windows would annoy you,

There is a point in there somewhere, that SMP client is actually quite pointless. The overheads involved, as per what I saw mentioned on the FAQ page, mean that it doesn't scale linearly with multiple cores. This is quite opposite to running multiple single-thread clients. So:

1) Why is anyone bothering running an SMP client? Perhaps the TrayIt suggestion ought to be in the FAQ.

2) Why was the SMP client even written? I cannot see what advantages it could possibly provide compared to running multiple separate clients.

I'm assuming here that I'm missing an important advantage of the SMP client, but I just can't see what.

Post by **bruce** » Tue Jul 08, 2008 6:04 pm

shatteredsilicon wrote:
Firstly, welcome to the forums!
There is a point in there somewhere, that SMP client is actually quite pointless. The overheads involved, as per what I saw mentioned on the FAQ page, mean that it doesn't scale linearly with multiple cores. This is quite opposite to running multiple single-thread clients. So:

1) Why is anyone bothering running an SMP client? Perhaps the TrayIt suggestion ought to be in the FAQ.

2) Why was the SMP client even written? I cannot see what advantages it could possibly provide compared to running multiple separate clients.

I'm assuming here that I'm missing an important advantage of the SMP client, but I just can't see what.

You should read up on Moore's Law.

Completing one WU in 25-30% of the time is MUCH more valuable to science than taking 100% of the time to complete four similar WUs. All of the recent development work has been aimed at creating high-performance clients. The SMP and PS3 and GPU clients do the same science much faster and that's a lot more important than you are assuming.
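The "25-30% of the time" figure can be read through Amdahl's law. The sketch below assumes the SMP client's run time on n cores behaves like `s + (1 - s) / n` for some serial fraction `s`. That is a simplifying assumption (real MPI overhead can also grow with core count), and the 0.28 and 4-core inputs are illustrative values, not measurements.

```python
def amdahl_time_fraction(serial_fraction, cores):
    """Run time on `cores` cores as a fraction of the single-core run time."""
    return serial_fraction + (1.0 - serial_fraction) / cores


def serial_fraction_from(time_fraction, cores):
    """Invert Amdahl's law: f = s + (1 - s)/n  =>  s = (f*n - 1) / (n - 1)."""
    return (time_fraction * cores - 1.0) / (cores - 1.0)


if __name__ == "__main__":
    # Illustrative input: a WU finishes in 28% of the single-core time on 4 cores.
    s = serial_fraction_from(0.28, 4)
    print("implied serial fraction:", round(s, 3))
    for n in (1, 2, 4, 8):
        print(n, "cores ->", round(amdahl_time_fraction(s, n), 3), "of single-core time")
```

With those inputs the implied serial fraction is about 0.04, and the model predicts roughly 16% of the single-core time on 8 cores: better than 4 cores, but short of perfectly linear scaling.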

Post by **shatteredsilicon** » Tue Jul 08, 2008 6:17 pm

I'm not sure I follow that logic. In this particular case, why is it more important you get the next one set of results in 4 hours than 4 sets of results in 12 hours? What is it that this faster turnaround gains you that isn't outweighed by getting 25-30% more done in the long term? The point is that with better scalability, more science gets done overall. We're not talking about long time intervals here, either.

Post by **bruce** » Tue Jul 08, 2008 6:52 pm

You might also read this thread which is going on concurrently: viewtopic.php?p=37556#p37556

FAH assignments consist of both WUs that can be done concurrently and WUs that must be done serially. It's the total of all the serial steps that turns out to have the most important scientific value, not the number of parallel tasks that can be started.

A single trajectory that takes 10 years to compute is a lot more valuable than 10 trajectories that take one year to compute -- especially if the event of interest doesn't happen in the first year's worth of work. Reducing that cure that takes 10 years to compute to 2.8 years is really important, and if it can be reduced to 1.4 years by using 8-core machines, then it's even better.
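The trade-off being argued here, total WUs returned versus how far a single trajectory advances, can be put into numbers. This is a minimal sketch using the figures from the question earlier in the thread (one SMP result in 4 hours versus four uniprocessor results in 12 hours). The function name and the 120-hour window are made up for illustration.

```python
def compare_strategies(wall_hours, uni_wu_hours=12.0, smp_wu_hours=4.0, cores=4):
    """Compare four uniprocessor clients against one SMP client.

    Each uniprocessor client advances its own trajectory; the SMP client
    puts all cores on a single trajectory. WUs in a trajectory are serial.
    """
    uni_depth = wall_hours / uni_wu_hours  # gens completed per trajectory
    smp_depth = wall_hours / smp_wu_hours  # gens completed on the one trajectory
    return {
        "uni": {"total_wus": cores * uni_depth, "gens_per_trajectory": uni_depth},
        "smp": {"total_wus": smp_depth, "gens_per_trajectory": smp_depth},
    }


if __name__ == "__main__":
    for name, stats in compare_strategies(120.0).items():
        print(name, stats)
```

Over the same wall-clock window the uniprocessor setup returns more WUs in total, but the SMP setup pushes its one trajectory three times as deep, which is the quantity this post argues matters most.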

Post by **P5-133XL** » Tue Jul 08, 2008 7:00 pm

Look at the deadlines for the uniprocessor client -- they are on the order of months, to deal with the lowest common denominator. The deadlines on the SMP client are measured in days because they can assume a certain minimum speed that one can't assume with the uniprocessor machines. With each project, the majority of WUs need to be returned before the next generation can be released. So you can go through many generations with the SMP client before the uniprocessor client can get through one. Therefore the scientific value of the higher-performance clients is far greater than that of the lower-performance clients.

Now the value of the GPU clients is in the fact that they can process far more flops than even the SMP clients. What that gives is the ability to calculate a far bigger time-slice, which again gives more scientific value, even though the deadlines are not very different from those of the SMP clients.

This is my interpretation of the reasons given. Please correct me if I'm wrong.

Post by **noorman** » Tue Jul 08, 2008 7:16 pm

shatteredsilicon wrote:I'm not sure I follow that logic. In this particular case, why is it more important you get the next one set of results in 4 hours than 4 sets of results in 12 hours? What is it that this faster turnaround gains you that isn't outweighed by getting 25-30% more done in the long term? The point is that with better scalability, more science gets done overall. We're not talking about long time intervals here, either.

.

The WUs are each a tiny piece of 1 timeline (per project); the results of a finished WU give Stanford new parameters to inject into a new WU, and so on, and so on ...

That's the simple (maybe too simple) explanation.

.

Post by **bruce** » Tue Jul 08, 2008 7:36 pm

noorman wrote:The WUs are each a tiny piece of 1 timeline (per project); the results of a finished WU give Stanford new parameters to inject into a new WU, and so on, and so on ...

That's the simple (maybe too simple) explanation.

Essentially correct . . . but somewhat too simple. Each WU is a tiny piece of a single timeline, but there are a number of timelines within a single project. Until a WU is returned, the next WU for that same timeline cannot be created.

("Timeline" = "Trajectory"
Each PRC is a separate trajectory. Each Gen is another piece of the same trajectory.)
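The dependency described above, where the next Gen of a trajectory cannot exist until the previous one is returned, can be sketched as a small data structure. This is only an illustration of the rule as stated in the post: the `Trajectory` class, its fields, and the project number are hypothetical and say nothing about how the actual F@H work servers are implemented.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One Project/Run/Clone combination; Gens are issued strictly in order."""
    project: int
    run: int
    clone: int
    next_gen: int = 0          # the Gen that will be handed out next
    outstanding: bool = False  # True while a WU is assigned but not yet returned

    def assign(self):
        """Hand out the next WU, or None if the previous Gen is still out."""
        if self.outstanding:
            return None
        self.outstanding = True
        return (self.project, self.run, self.clone, self.next_gen)

    def complete(self):
        """Record a returned WU, which makes the following Gen available."""
        if not self.outstanding:
            raise ValueError("no WU is currently assigned")
        self.outstanding = False
        self.next_gen += 1


if __name__ == "__main__":
    t = Trajectory(project=1234, run=0, clone=0)  # hypothetical project number
    print(t.assign())  # first Gen goes out
    print(t.assign())  # blocked until the first Gen comes back
    t.complete()
    print(t.assign())  # next Gen of the same trajectory
```

Different trajectories are independent of each other, so a project can keep many clients busy at once; it is only progress along any one trajectory that is limited by the turnaround time of each WU.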

Post by **shatteredsilicon** » Tue Jul 08, 2008 7:57 pm

Thanks for the clarification. That makes sense WRT usefulness.

As far as parallel performance scalability is concerned, the standard optimization paradigm says: "Vectorize inner loops, parallelize outer loops." Presumably, that is the paradigm followed in the SMP F@H client. Just out of interest - what compiler are F@H cores built with? ICC's optimizer can do vectorizing automatically for reasonably written code, as well as auto-parallelizing. Assembly can do the same, but I'm just wondering if leveraging a better compiler has been explored for F@H. I have personally seen speed improvements of up to 7x (700%) from using ICC to compile my own number crunching libraries (pure C++) compared to GCC. Just a thought. It might provide scope for completely avoiding MPI and some of the overheads. I'd try it myself, but F@H is closed source...

Post by **John Naylor** » Tue Jul 08, 2008 8:08 pm

You say reasonably written code... all the FAH cores are hand-coded to get the most out of the hardware they are using (well... maybe except the a1 SMP core lol)... so I guess that might negate some of the advantages of a new compiler. And besides, the Pande Group always wants more speed, so I would guess they regularly look at their compilers to see if a new one can make the code run more efficiently and therefore faster.

EDIT: I would also guess that the answer is no, new compilers cannot make the cores faster, for single core clients anyway... check the build dates, most are from 2006 on the older cores
