Benchmarks for SMP In XP32, XP64, Vista, Linux, etc. ?

Post by **7im** » Thu Feb 28, 2008 4:58 am

RAH wrote:I know in what I have run, the Linux SMP is faster. Though it still needs the 32libs for something.
...
Not enough difference for me to go one way or the other.

MPI in the Linux client needs 64 bit. The client is still 32 bit, hence the need for 32 bit support, but 64 bit OS and hardware.

@glassman, again, MPI was written for linux, so it runs faster in linux. Not so much overhead like in Windows. The MPI in windows is a hack at best, hence the slower speed. However, like most things in Windows stolen from other operating systems, they eventually make it work as well as in Linux or OSX.

There is also more than one version of MPI for Windows. One of the others may work better. That's how Windows will catch up.

Post by **bruce** » Thu Feb 28, 2008 6:25 am

TheGlassman wrote:
The fah smp client performance advantage in Linux is due to MPICH being written natively to linux, and only ported (kluged) to work in Windows (also a kluge in general). The Windows port of MPI is slower.
Funny, in Linux, mpiexec is asleep, 0% cpu and Windows shows 0% as well.

I guess you're not understanding that MPI is short for Message Passing Interface. It's a communications protocol. What are the CPUs doing when messages are being passed? They're waiting to get enough data so they can go back to work so 0% CPU is just what I'd expect. The "speed" of communications depends on how and where the data goes and how efficiently it is transferred.

There is a LOT of coordination data that has to flow between the four FahCore tasks (and back to whatever part of the system writes the checkpoints and things like that). All of the time that the FahCores are unable to process because they're waiting for MPI to provide the data is added to the time spent actually computing. I'm not at all surprised that the Windows "kludge" spends 15% more time just waiting.

Post by **TheGlassman** » Thu Feb 28, 2008 1:09 pm

Thanks Bruce, You've broken my heart. Thought my A64's were finally being used to their potential. Sigh.
Anyway, since the cores are the same, makes sense that something else is responsible.
So I'll bow to your knowledge and experience.
What I still don't understand is why the 4 a1 cores are almost always at 100% (add up to it) in Win SMP. If they were stalled 15% of the time I would expect it to show. In Linux, GNOME System Monitor shows a small percentage of unaccounted time; this could be where MPI is doing its stuff. However, both show mpiexec doing nothing (sleeping, by the way, is a red herring; even the F@H cores are shown as sleeping). Is that because they are being called by the folding cores, not the OS?
Would a 64 bit Win MPI be easier and faster than the 32 bit? When might we see 64 bit programs, Win and Linux?

Thanks again

Post by **bruce** » Thu Feb 28, 2008 8:33 pm

TheGlassman wrote:What I still don't understand is why the 4 a1 cores are almost always at 100% (add up to it) in Win SMP. If they were stalled 15% of the time I would expect it to show.

Expectations are not always correct. I have no direct knowledge of how MPI does its I/O, but I can give another example where your expectations would be wrong. The GPU client runs using DirectX9 to communicate between the PC and the GPU. From what I've read, the GPU is doing all of the folding calculations and the FahCore is just there to move data into the GPU VRAM and to accept the results back into main RAM, write disk files, report progress, etc. That FahCore and/or DX9 use 100% of a CPU even though the primary thing it's doing is waiting for I/O. (The CPU is busy while it's stalled because DX9 uses polling.) As I said, I don't know if the same sort of thing is going on in MPI, but it might be.
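To make the polling point concrete, here is a minimal Python sketch. It has nothing to do with the actual DX9 or MPI code; the function name `cpu_used_while_waiting` and the 0.3 second delay are made up for illustration. It waits for the same "I/O completes" signal twice, once by polling in a tight loop and once by blocking, and reports how much CPU time the process burned while doing no useful work.

```python
import threading
import time


def cpu_used_while_waiting(poll: bool, delay: float = 0.3) -> float:
    """Return the CPU seconds this process consumes while waiting `delay` seconds."""
    ready = threading.Event()
    # The timer is a stand-in for slow I/O finishing after `delay` seconds.
    timer = threading.Timer(delay, ready.set)
    start = time.process_time()  # CPU time of the whole process, not wall-clock time
    timer.start()
    if poll:
        # Polling: keep asking "is it ready yet?" without ever sleeping.
        while not ready.is_set():
            pass
    else:
        # Blocking: let the operating system put the thread to sleep until signalled.
        ready.wait()
    timer.join()
    return time.process_time() - start


print(f"polling : {cpu_used_while_waiting(True):.3f} CPU-seconds")
print(f"blocking: {cpu_used_while_waiting(False):.3f} CPU-seconds")
```

Both calls wait the same wall-clock time, but a task monitor would show the polling version as a busy core. That is why a "100%" reading by itself cannot tell you whether a process is computing or stalled.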

In Linux, GNOME System Monitor shows a small percentage of unaccounted time; this could be where MPI is doing its stuff. However, both show mpiexec doing nothing

Linux MPI is different code than the Windows MPI so it's probably doing things differently. The Linux monitor also reports things differently than the Windows Task Manager. All that really means is that we can't tell.

TheGlassman wrote:Would a 64 bit Win MPI be easier and faster than the 32 bit? When might we see 64 bit programs, Win and Linux?

Linux uses 64-bit MPI. Windows does not. That may or may not mean anything. I don't know how to apportion the differences.

GROMACS uses 99% floating point operations, which are identical in 32-bit code and 64-bit code. A change from 32-bit integers to 64-bit integers would not make any measurable difference but would require the support of twice as many versions of FAH's software.

Post by **TheGlassman** » Fri Feb 29, 2008 2:51 pm

As I said, I don't know if the same sort of thing is going on in MPI, but it might be. ... The Linux monitor also reports things differently than the Windows Task Manager. All that really means is that we can't tell.

Understood. I posted them so that, if they were worth anything, they would be there.

Linux uses 64-bit MPI. Windows does not. That may or may not mean anything. I don't know how to apportion the differences.

I'd say 15% if the cores are identical.

GROMACS uses 99% floating point operations, which are identical in 32-bit code and 64-bit code.

Really?!?! I thought all this time (4 years?) Gromacs was SSE. I do know it slows to a crawl if SSE is disabled, intentionally or unintentionally. Of course SMP always uses SSE even if the client reports it as disabled (no slowdown).

Assembly loops using SSE and 3DNow! Multimedia instructions are provided for x86 processors, resulting in exceptional performance on inexpensive PC workstations http://www.gromacs.org/content/view/101/33/

Tinker was FPU.

Updated release of StressCPU - we had a new cluster to burn-in!
Version 2.0 now supports both ia32 (32bit) as well as x86-64/em64t (64bit) platforms. It is multithreaded (both pthreads and win32 threads) by default and will automatically sense the number of CPUs on Linux, Mac OS X, and windows. It runs slightly hotter, in particular for x86-64 systems, the checks are better, and you can now set it for a fixed execution time, e.g. 12 hours. The package includes pre-compiled binaries for Windows, 32 and 64 bit Linux, and 32 as well as 64 bit OS X.

Don't know if this is the same code as being used in SMP by F@H, but Gromacs is working in that direction.

A change from 32-bit integers to 64-bit integers would not make any measurable difference but would require the support of twice as many versions of FAH's software.

One actually, and the current Linux64 SMP would be replaced. Have you looked at the download page lately? If the code isn't ready you are of course correct. I suspect they are already running 2 32bit integers at once through the SSE units. I find it hard to believe that the wider and extra 64 bit registers wouldn't help quite a bit in speed, and the Pande group has no problem supporting a new core if they think there is a benefit (QMD, Linux 64, GPU, PS3), accompanied by massive bonuses to get people to use them. The point is moot, of course, until Gromacs or someone else writes the software to support what is waiting in the installed base. Single cores have already been assigned as the modern Tinkers. Not worth the electricity to run them.

Anyway Bruce thanks, for your time. It has been enjoyable and informative as always.

Post by **7im** » Fri Feb 29, 2008 5:35 pm

TheGlassman wrote:...

A change from 32-bit integers to 64-bit integers would not make any measurable difference but would require the support of twice as many versions of FAH's software.

One actually, and the current Linux64 SMP would be replaced.

I suspect they are already running 2 32bit integers at once through the SSE units. I find it hard to believe that the wider and extra 64 bit registers wouldn't help quite a bit in speed...

Single cores have already been assigned as the modern Tinkers. Not worth the electricity to run them.

As Bruce mentioned, folding is all FPU, and almost no Integer (except in the case of Tinkers, and we don't use them any more). As such, doubling the bit path for integers would have a very small impact if any.

In the old forum, a member of Pande Group posted that they recompiled the CPU client to try 64 bit (when that hardware first hit the market) to see if there was a significant speed increase. IIRC, the increase was minuscule.

I think their limited development resources are better spent going after bigger performance targets, and they have... GPU, PS3, SMP. I also think they will add 64-bit support when that hardware becomes much more widespread in the user base than it is now, and they'll work that support into the code while updating other components, not as a specific upgrade.

Post by **bruce** » Fri Feb 29, 2008 8:17 pm

I was a bit too brief with my explanation. It's still fundamentally correct, but let me clarify what I really meant.

TheGlassman wrote:
GROMACS uses 99% floating point operations, which are identical in 32-bit code and 64-bit code.
Really?!?! I thought all this time (4 years?)Gromacs was SSE. I do know it slows to a crawl if SSE is disabled, intentionally or unintentionally. Of course SMP always uses SSE even if the client reports disabled. (no slowdown)

When I said floating point, I was including SSE and/or SSE2 although technically they're different. SSE and SSE2 instructions are identical in 32-bit and 64-bit code, too. SSE accelerates single precision floating point and SSE2 accelerates double precision floating point by applying one instruction to several values at once, but the data is still stored as floating point data, not as 32-bit or 64-bit integers, and the actual arithmetic results end up being the same. (I apologize for my sloppy language.)
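A small Python sketch of the sizes involved, for anyone who wants to check. The helper name `lanes` is made up for illustration; the only facts it relies on are that an SSE (XMM) register is 128 bits wide and that single and double precision values are 32 and 64 bits. The pointer size is printed only to show which number actually changes between a 32-bit and a 64-bit build.

```python
import struct

SSE_REGISTER_BITS = 128  # width of one XMM register


def lanes(value_bits: int) -> int:
    """How many values of the given width fit in one SSE register."""
    return SSE_REGISTER_BITS // value_bits


# Standard (non-native) sizes: these do not depend on the platform.
single_bits = struct.calcsize("<f") * 8
double_bits = struct.calcsize("<d") * 8

# Native pointer size: this is what differs between 32-bit and 64-bit builds.
pointer_bits = struct.calcsize("P") * 8

print(f"single precision: {single_bits} bits, {lanes(single_bits)} per SSE register")
print(f"double precision: {double_bits} bits, {lanes(double_bits)} per SSE register")
print(f"pointer size on this build: {pointer_bits} bits")
```

The floating point widths, and therefore the number of values handled per instruction, stay the same whichever build of Python you run this on; only the pointer line changes.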

Updated release of StressCPU - we had a new cluster to burn-in!
Version 2.0 now supports both ia32 (32bit) as well as x86-64/em64t (64bit) platforms. It is multithreaded (both pthreads and win32 threads) by default and will automatically sense the number of CPUs on Linux, Mac OS X, and windows. It runs slightly hotter, in particular for x86-64 systems, the checks are better, and you can now set it for a fixed execution time, e.g. 12 hours. The package includes pre-compiled binaries for Windows, 32 and 64 bit Linux, and 32 as well as 64 bit OS X.
Don't know if this is the same code as being used in SMP by F@H, but Gromacs is working in that direction.

Good point. I don't know what StressCPU is doing or why.

A change from 32-bit integers to 64-bit integers would not make any measurable difference but would require the support of twice as many versions of FAH's software.
One actually, and the current Linux64 SMP would be replaced. Have you looked at the download page lately? If the code isn't ready you are of course correct.

Also a good point.

Since the Linux/MacOS SMP client must be 64-bit, only one version is required, and it might as well be 64-bit. I'm not sure what sort of incompatibilities that might bring to the servers. What are the implications of a trajectory that contains one Gen run on 32-bit windows, the next Gen on 64-bit Linux, and the next Gen on something else? Would the WUs need to be different? Would there need to be a step that converts the length of the integers going in and going out? It's worth asking the Pande Group these questions.

I suspect they are already running 2 32bit integers at once through the SSE units. I find it hard to believe that the wider and extra 64 bit registers wouldn't help quite a bit in speed, and the Pande group has no problem supporting a new core if they think there is a benefit.

Technically, the parallel integer operations are processed with Multimedia instructions (aka - MMX) which all modern processors have, but that's not the point.

Let's assume that my guess of 99% floating point is correct (I really do not know). If we have one integer operation followed by 99 FP operations or 25 SSE operations or 50 SSE2 operations which is then followed by the next integer operation, there is absolutely nothing to be gained by delaying one integer operation so it can be done simultaneously with the next one. The slowest of APUs can finish the first operation while the floating point hardware is busy with the other 99/50/25 operations. As 7im has reported, there is a small gain, so the real code isn't as bad as my example, but I think you get the point.
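For a rough upper bound on that gain, here is a back-of-the-envelope sketch in Python. It uses Amdahl's law, which is a different (and cruder) argument than the overlap one above: it simply assumes the integer work is 1% of the run time and is not overlapped at all. The 1% figure is my guess from above, not a measurement, and the function names are made up for illustration.

```python
import math


def overall_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: total speedup when `fraction` of the run time gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def packed_ops(scalar_ops: int, values_per_instruction: int) -> int:
    """Instructions needed when each one handles several values at once."""
    return math.ceil(scalar_ops / values_per_instruction)


INTEGER_FRACTION = 0.01  # assumed share of run time spent on integer work (a guess)

for factor in (2.0, 10.0, float("inf")):
    gain = (overall_speedup(INTEGER_FRACTION, factor) - 1.0) * 100.0
    print(f"integer ops {factor}x faster -> about {gain:.2f}% faster overall")

# Where the 25 and 50 in the example come from: 99 scalar operations,
# packed 4 per SSE instruction or 2 per SSE2 instruction.
print(packed_ops(99, 4), packed_ops(99, 2))
```

Under those assumptions, doubling integer speed buys about half a percent, and even an infinitely fast integer unit cannot buy more than about 1%.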

Anyway Bruce thanks, for your time. It has been enjoyable and informative as always.

NP.

Post by **matheusber** » Wed Apr 02, 2008 1:57 am

7im wrote:The fah smp client performance advantage in Linux is due to MPICH being written natively to linux, and only ported (kluged) to work in Windows (also a kluge in general). The Windows port of MPI is slower.

I haven't looked into it much, but after reading this I don't see why there couldn't be a 32-bit version of the SMP client for Linux (and therefore FreeBSD, which is what I want most).

thanks,

matheus

Post by **7im** » Wed Apr 02, 2008 4:53 am

matheusber wrote:
7im wrote:The fah smp client performance advantage in Linux is due to MPICH being written natively to linux, and only ported (kluged) to work in Windows (also a kluge in general). The Windows port of MPI is slower.
I've never looked for too much, but when I read this I can't see why not to have a 32bits version of the smp client for linux (and therefore FreeBSD - is what I want the most)

thanks,

matheus

There is a 32 bit version of MPI for Linux, but it doesn't work well with a FAH client, or so I've read somewhere. That's why the MPI used in the Linux SMP client is 64 bit.

Post by **bruce** » Wed Apr 02, 2008 5:38 am

7im wrote:There is a 32 bit version of MPI for Linux, but it doesn't work well with a FAH client, or so I've read somewhere. That's why the MPI used in the Linux SMP client is 64 bit.

There are several 32-bit versions of MPI for Windows and look at all the trouble that they've caused.

Post by **matheusber** » Thu Apr 03, 2008 7:55 pm

7im wrote:
matheusber wrote:
7im wrote:The fah smp client performance advantage in Linux is due to MPICH being written natively to linux, and only ported (kluged) to work in Windows (also a kluge in general). The Windows port of MPI is slower.
I've never looked for too much, but when I read this I can't see why not to have a 32bits version of the smp client for linux (and therefore FreeBSD - is what I want the most)

thanks,

matheus
There is a 32 bit version of MPI for Linux, but it doesn't work well with a FAH client, or so I've read somewhere. That's why the MPI used in the Linux SMP client is 64 bit.

Well, from what I've seen of the Pande Group, they're too busy with everything else to make a FreeBSD SMP client as well, right?

Has anyone successfully run the FAH SMP client on amd64 FreeBSD under emulation?

thanks for all info

matheus

Post by **bruce** » Fri Apr 04, 2008 7:07 pm

matheusber wrote:well, for what I saw from the pande group, they're too much busy for quite everything and to make a FreeBSD SMP as well, right ?

any sucessfull try to make fah smp to work on a amd64 FreeBSD under emulation ?

The Pande Group are in the business of doing Molecular Simulation. They're not equipped to develop new versions of MPI; that is best left to people who study Computer Science. They'll use whatever is developed, provided it works when incorporated into their code.

Please do not double-post. You already are talking about a port for FreeBSD here.
