MPICH vs. OpenMP

alpha754293 · Post by **alpha754293** » Sun Jan 18, 2009 2:17 am

Why wouldn't the entire F@H project be written in FORTRAN (presumably either f77 or f90/f95) using OpenMP instead of MPICH?

Which compiler would you be using at that point? I think that PGI, Intel, and Sun all support OpenMP via the -omp flag.

alpha754293 · Post by **alpha754293** » Sun Jan 18, 2009 2:29 am

BTW...on the SMP FAQ, where it states:

"Isn't it needlessly complex to use MPI?

Unfortunately, there aren't other options right now (see the above)."

That's not ENTIRELY true.

OpenMP has been around for a while. (although I would definitely agree with you in that it wasn't available on Windows platforms for a rather long time), but now; it is fully supported by at least the Intel compiler and also PGI.

(Both of which support the OpenMP 3.0 spec).

That MIGHT be easier than using MPI. I don't know because I'm not a programmer; although I think that a lot of the HPC community still uses MPI (perhaps for some what of arachaic reasons).

Post by **Ivoshiee** » Sun Jan 18, 2009 8:22 am

Some changes are coming to FAH SMP client - SMP2. Maybe it is using OpenMP already, but closer details are still unknown. Just wait and see.

alpha754293 · Post by **alpha754293** » Sun Jan 18, 2009 8:47 pm

Ivoshiee wrote:Some changes are coming to FAH SMP client - SMP2. Maybe it is using OpenMP already, but closer details are still unknown. Just wait and see.

Well...from what I've been able to find out, there's still the great debate amongst the HPC community (sorta where I come from/where I'm heading into) about whether to use MPICH or OpenMP.

Supposedly, in terms of actual coding, OpenMP is a lot simplier, but on very large scale HPC applications (say...oh... > 10,000 cores), it gets a bit iffy. And considering that the latest and greatest supercomputers are sporting anywhere between 100,000 to 200,000+ cores, it's certainly quite interesting what's happening now.

From what I've read is that MPICH allows for more directed/pointed traffic, whereas with OpenMP, you almost need a monolithic OS installation in order for it really utilize the NUMA properly, and that sometimes just isn't possible/feasible.

http://www.hpccommunity.org/blogs/bearc ... openmp-98/

I think that it would be fan-fricken-tastic if they are able to read the number of cores/processors/processing units and then automatically set the omp_set_num_threads to that value so that it can automatically scale between systems.

I also won't be surprised if the SMP client ends up being the main client in the rather near future since parallelization is certainly getting more and more prevalent these days. (Finally!!!)

My suggestion would be to take a page of out HPC world from about oh..20 years ago, and mix it in with today's OpenMP 3.0 technology.

Pity that I don't code.

BTW...if the idea of implementing DeinoMPI was to get around the instabilities due to abnormal program termination; set the checkpoint timing to like 5 minutes or something and have it pick up from where it left off rather than restarting the WU.

7im · Post by **7im** » Mon Jan 19, 2009 3:27 am

alpha754293 wrote:...
BTW...if the idea of implementing DeinoMPI was to get around the instabilities due to abnormal program termination; set the checkpoint timing to like 5 minutes or something and have it pick up from where it left off rather than restarting the WU.

There was a lot of discussion about why MPICH was chosen when the client first came out, but I don't remember the details. A search of some old threads might turn something up.

As for Deino, it was an experiment to see if it was faster or more stable, because MPICH was problematic in Windows (MPICH is native to 'nix, not Win). It did fix the 0x7b (network interuption corrupted WUs), but it didn't go any faster.

As Ivo mentioned above, Pande Group is moving in a different direction for the SMP2 client (hints seen in Vijay's blog).

alpha754293 · Post by **alpha754293** » Mon Jan 19, 2009 7:48 am

7im wrote:[

Well...I can certainly understand the reasoning and logic behind choosing MPICH.

In traditional HPC, SMP, MPP, etc. class of programs, MPI and thus, MPICH is pretty much like a defacto standard. While I don't have specific numbers, but I would guess that MPICH would probably make up some like 80-90% of all parallel processing codes in the HPC world. There's some newer programs that use OpenMP instead because the technology has finally advanced and we're finally cycling through to the current generation of programmers, mathematicians, scientists, and engineers alike from about the 90s era that OpenMP is starting to come up to scratch/speed and implementation.

MPICH is more "historic" and these "newer paradigms" are trying to make it so that it's less programmer intensive in developing such applications.

On broad scale HPC, there's still a lot of alpha and beta-testing level work being done on it only because the supercomputers of today often exceed the number of processors/cores that a program can distribute itself onto (as in the case with MPICH), while on OpenMP, because of it's NUMA or NUMA-like architecture; you pretty much have to have a monolithic operating environment installation which is typically not practical.

However, for something like a distributed computing client like F@H, OpenMP might be a way to go (and as someone have already mentioned that it might be being already looked at and implemented for SMP2), I'm just pointing out that what it states in the FAQ isn't entirely true.

I'm pretty certain that there are other MPI (other than MPICH and DeinoMPI) that are being used, but ultimately, it also depends on whether the routines and the intrinsic calls can be parallelized.

Perhaps another way of doing it would be similiar to how Ansys CFX works in that it invokes a partitioner (e.g. MeTiS or something like that) and it actually decomposes the problem into smaller pieces, adding and internal interface between the partitions, and then solving the partitions via local or distributed MPICH. (PVM could work too, but in my experience, it is almost always significantly slower than just running MPICH.)

Very interesting though...I'd like to see how that works out.

And no, I wouldn't expect a different implementation of MPI to really speed anything up. You'd pretty much have to profile the program in order to find out what's going on and then you can pick a CPU architecture that would best be able to perform those operations the fastest since even with floating point operations, they're not all the same.

Post by **bruce** » Tue Jan 20, 2009 8:32 am

alpha754293 wrote:On broad scale HPC, there's still a lot of alpha and beta-testing level work being done on it only because the supercomputers of today often exceed the number of processors/cores that a program can distribute itself onto (as in the case with MPICH), while on OpenMP, because of it's NUMA or NUMA-like architecture; you pretty much have to have a monolithic operating environment installation which is typically not practical.

This is pretty much a non-issue for FAH. We're not talking about either cluster computing or a super-computer application. FAH-SMP runs only on multiple cores within a single computer (on a single copy of the OS). Inter-machine data transfers are disabled. The actual numbers of cores that I've seen range from 2, 3, 4, 8 and while cache structure does change the overall speed, it's generally just a given since it is not actively controlled by anyone. SMP allows all of your cores to cooperatively process the same WU.

Individual WUs are distributed to non-uniform hardware (your computer is different than mine) and collected before being redistributed, but that's not where MPI is being used.

alpha754293 · Post by **alpha754293** » Wed Jan 21, 2009 3:12 am

bruce wrote:
alpha754293 wrote:On broad scale HPC, there's still a lot of alpha and beta-testing level work being done on it only because the supercomputers of today often exceed the number of processors/cores that a program can distribute itself onto (as in the case with MPICH), while on OpenMP, because of it's NUMA or NUMA-like architecture; you pretty much have to have a monolithic operating environment installation which is typically not practical.
This is pretty much a non-issue for FAH. We're not talking about either cluster computing or a super-computer application. FAH-SMP runs only on multiple cores within a single computer (on a single copy of the OS). Inter-machine data transfers are disabled. The actual numbers of cores that I've seen range from 2, 3, 4, 8 and while cache structure does change the overall speed, it's generally just a given since it is not actively controlled by anyone. SMP allows all of your cores to cooperatively process the same WU.

Individual WUs are distributed to non-uniform hardware (your computer is different than mine) and collected before being redistributed, but that's not where MPI is being used.

I think that for the time being, it isn't an issue. But I think that given where CPU manufacturers are heading, it will be. And probably sooner than you think.

(And while I don't think that it would get as "bad" as the 80-core chip that Intel demonstrated, I would like to think that it wouldn't hurt to be prepared for something like that, and to make it as minimually intrusive for the end-user in utilizing the client.)

So, might as well start planning and preparing for > 8-core systems now. (I'll admit that I'm an odd ball in the sense that I already have plans to acquire a 16-core system within a year, if not a 32-core system.)

Don't get me wrong, there are certainly advantages to both MPICH and OpenMP. However, suffice it to say that if you want a truly versatile client, I would think that learning from the HPC world might garner some insight into which direction the F@H client should be heading. I'm also fairly certain that there are some very talented individuals working on the development as it stands, but it also probably can't hurt to learn from the best in the HPC "business". If it's good enough for the HPC world, it ought to be good enough for F@H.

Sidenote: 6-core Xeons are out.

I would have LOVED to be able to run F@H on a LAN distributed MPICH. Even MATLAB now has integrate "auto-parallelization" tools available.

Post by **bruce** » Wed Jan 21, 2009 3:41 am

alpha754293 wrote:I think that for the time being, it isn't an issue. But I think that given where CPU manufacturers are heading, it will be. And probably sooner than you think.

The issue is not so much how many cores can work cooperatively but that MPI is using memory to memory data transfers which precludes clusters that are only interconnected via LAN connections. The GPUs currently have thousands of shaders all working in parallel, so the numbers of CPU cores is not going to be a serious problem very soon.

Yes, the structure of the GPU client is rather different than the structure of the SMP client, but as the hardware moves closer together, I'm sure that the FahCore methodology will also converge.

Post by **VijayPande** » Wed Jan 21, 2009 7:02 am

I am confident that one (After looking at the Gromacs code) would see that it is in fact very, very difficult to port Gromacs to OpenMP (especially with any sort of good performance). By my estimates (as well as the Gromacs developers), OpenMP would likely get worse performance than we're getting right now (and would require many man-years to port to). For anyone who thinks porting Gromacs from MPI to OpenMP is easy, please go ahead give it a shot (go to http://www.gromacs.org to get the source code). It certainly would be nice to have (and would be useful for a wide range of people I think).

Therefore, we are pursuing other alternatives within the SMP2 project (more info to be released as we get closer to the release of the SMP2 client). Also, please keep in mind that we are very familiar with HPC methodology, etc, and are applying technologies where appropriate.

alpha754293 · Post by **alpha754293** » Wed Jan 21, 2009 8:26 pm

VijayPande wrote:I am confident that one (After looking at the Gromacs code) would see that it is in fact very, very difficult to port Gromacs to OpenMP (especially with any sort of good performance). By my estimates (as well as the Gromacs developers), OpenMP would likely get worse performance than we're getting right now (and would require many man-years to port to). For anyone who thinks porting Gromacs from MPI to OpenMP is easy, please go ahead give it a shot (go to http://www.gromacs.org to get the source code). It certainly would be nice to have (and would be useful for a wide range of people I think).

Therefore, we are pursuing other alternatives within the SMP2 project (more info to be released as we get closer to the release of the SMP2 client). Also, please keep in mind that we are very familiar with HPC methodology, etc, and are applying technologies where appropriate.

Thanks.

Good info to have.

*edit*
I did take a look at the GROMACS source and yea....it's all greek to me.

It does look like that it's almost written entirely in C and/or C++ though. (Although reading about it more, it said that some of the innermost loops are FORTRAN).

Also interestingly enough, GROMACS has a native benchmarking tool already, so I might just end up using that instead of benchmarking F@H (or both).

And apparently, you can also compile GROMACS pretty much for any *NIX system you can think of. According to their benchmarks page, they did run it on Solaris/UltraSPARC II and also AIX/POWER3 so...it's possible.

They even have specific instructions for IBM Bluegene\P and Bluegene\L (*thumbs up* NICEEEE).

7im · Post by **7im** » Thu Jan 22, 2009 7:14 am

Yes, but the native gromacs code does not exactly equal what fah uses in the fahcores. So recompiling that gromacs code is mostly an academic exercise.

And the realm of possibilities is far from reality. Sorry to appear to be a downer, but as mentioned already, pursuing a Sun or AIX port is much like the port to OpenMP. While yes, it is possible, it is not always helpful in the grand scheme.

codysluder · Post by **codysluder** » Thu Jan 22, 2009 9:13 pm

You also need to appreciate that even though you can run stand-alone Gromacs on a Sparc, that doesn't mean that it will be integrated into Folding@home. Porting the client and the middleware and then supporting it forever would use more of FAH's development time than many other things they might be doing and these other things could easily increase FAH production more than a few thousand sparc computers.

alpha754293 · Post by **alpha754293** » Thu Jan 22, 2009 11:53 pm

7im wrote:Yes, but the native gromacs code does not exactly equal what fah uses in the fahcores. So recompiling that gromacs code is mostly an academic exercise.

And the realm of possibilities is far from reality. Sorry to appear to be a downer, but as mentioned already, pursuing a Sun or AIX port is much like the port to OpenMP. While yes, it is possible, it is not always helpful in the grand scheme.

That's based on the link that Dr. Pande posted (and effectively sent me to) the gromacs.org page. I've actually been in contact with an Erik Lindahl in regards to benchmarking the systems (although he does mention a very good point in that most of the systems running the clients nowadays are Windows, Linux, and Macs.)

codysluder wrote:You also need to appreciate that even though you can run stand-alone Gromacs on a Sparc, that doesn't mean that it will be integrated into Folding@home. Porting the client and the middleware and then supporting it forever would use more of FAH's development time than many other things they might be doing and these other things could easily increase FAH production more than a few thousand sparc computers.

Gawd...I LOVE what a bunch of mindless drones you guys are.

While I can certainly understand and appreciate the lack of human resources in order to port the F@H program over to other platforms; I'd like to speculate at this point (however wrongly this might be) that this might just be a case of "you don't know because you've never tried."

And while the recent advances in porting over the client onto GPUs and PS3s, from what I've also read is that they're all still very specialized clients due to the architecture of the processors, and that while they return results faster, they can only do a limited subset of the simulations at hand, which means that the normal x86/x64 processors still have to do the remainder of them.

IF there is so much computational work that is left to be done, I am surprised that there wouldn't be a greater emphasis in illiciting computational assistance from some of the fastest mainframes. Yes, I would also agree and admit that (as the distributed.net project shows) that the cumulative computational effort of those would be very very small, BUT, on the other hand, for those of us that are going to be entering into the workforce within the next 5-10 years or so; where we will be placed in charge of those very same systems; WE might actually have the authority to make that call as to whether to put F@H on the mainframe or not.

As I've also mentioned earlier before, I'm not a programmer, and therefore; I am incapable of affecting change in such a manner where I can made code contributions to the project. However, if the intent is to be able to make this program usable (and I've read the 200+ pages of the GROMACS core, which actually looks like the same caliber of simluation stuff that I do for work) to the greatest number of people; then I am surprised that there hasn't been more emphasis or effort put into it (yes, I am making a presumption here and I could be VERY wrong about it, but I'm okay with being wrong.)

If it's not Windows/Linux/Mac on Intel/AMD (or in brevity PPC), it's like put the shaders on and let's bury ourselves to wallow in our self-pity.

I know that if I were a CTO or CIO, I'd be running F@H company wide. Oh wait. I can't, cuz there are clients for it. *rolls eyes*

Post by **kasson** » Fri Jan 23, 2009 1:13 am

It comes down quite simply to a matter of limited resources and what we judge to be best yield for our allocation of those resources. At this time, ports to architectures such as Sparc or POWER<X> are not a priority. If Sun, IBM, or a major supercomputing center were to offer the use of substantial compute resources and programmer time to make those ports happen, we would of course reconsider. Having FAH clients running on a given architecture is somewhat more complex than simply having a functional Gromacs port.

We make fairly extensive use of traditional HPC resources in addition to the distributed FAH clients and are quite familiar with those resources.

With regard to FORTRAN, there would be no performance advantage to re-coding in that language. Gromacs in its current form uses both the fastest numeric libraries and hand-coded assembly loops that yield faster per-processor performance than any other major MD code. In terms of parallelization, there are a number of options; the considerations for a FAH client are substantially different from standard HPC. For standard HPC, we typically use openMPI or proprietary/commercial libraries, depending on the architecture. D.E. Shaw Research has also developed a library (not yet released to the public) of point-to-point communication primitives that yields improved scaling on commodity clusters. But again this is not the environment that most FAH clients run in.

Folding Forum

MPICH vs. OpenMP

MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP

Re: MPICH vs. OpenMP