Project: 0 (Run 0, Clone 0, Gen 0)

Moderators: Site Moderators, FAHC Science Team

bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by bruce »

There are two Windows versions of MPI in use, with some slight differences, but in general some data is moved with in-core transfers and some passes through the loopback address at 127.0.0.1.

The fundamental problem is that with the older versions of the IP stack, if you drop the external network, the entire IP interface is disabled and rebuilt, causing errors even for 127.0.0.1 -- so the answer is: you cannot disable the external network interface, and DHCP cannot issue a new lease to that machine. If you shut down FAH, then disable the external network, then restart FAH, it will function without an external network until that WU finishes.
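To make that concrete, here is a minimal, self-contained C sketch (generic socket code, not FAH or MPI source) of roughly the kind of local TCP connection the MPI ranks open through 127.0.0.1. As long as the loopback interface survives, this works with the network cable unplugged; if the OS tears the whole IP stack down when the external NIC is disabled, even this purely local connect starts failing.

/* Generic sketch (not FAH or MPI source code): a local TCP round trip
 * over 127.0.0.1 only.  If the loopback interface is healthy this
 * succeeds with no external network at all; if the whole IP stack is
 * torn down when the external NIC is disabled, it fails. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    /* "Server" end: listen on 127.0.0.1 with an ephemeral port. */
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = 0;                       /* let the OS pick a port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 1) < 0) {
        perror("bind/listen on 127.0.0.1");  /* loopback itself is unusable */
        return 1;
    }
    getsockname(srv, (struct sockaddr *)&addr, &len);

    /* "Client" end: connect back over loopback only. */
    int cli = socket(AF_INET, SOCK_STREAM, 0);
    if (cli < 0 || connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect to 127.0.0.1");      /* the failure mode described above */
        return 1;
    }
    printf("loopback round trip OK on port %d\n", ntohs(addr.sin_port));
    close(cli);
    close(srv);
    return 0;
}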
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

bruce wrote:There are two Windows versions of MPI in use, with some slight differences, but in general some data is moved with in-core transfers and some passes through the loopback address at 127.0.0.1.

The fundamental problem is that with the older versions of the IP stack, if you drop the external network, the entire IP interface is disabled and rebuilt, causing errors even for 127.0.0.1 -- so the answer is: you cannot disable the external network interface, and DHCP cannot issue a new lease to that machine. If you shut down FAH, then disable the external network, then restart FAH, it will function without an external network until that WU finishes.
What about Linux?

I have no idea how the TCP/IP stack works in SLES10 SP2.

In Solaris, I think the network interfaces are separate, and it will print a message to the console, e.g. "e1000g0 link down"
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by bruce »

In Linux, some of the newer versions of the kernel have the improved IP stack, but there are a lot of versions still kicking around with the exact same problem. MacOS has not been fixed yet. Somebody else will have to tell you which specific Linux versions have the independent stacks.

As you probably know, the IP stack has a number of levels. Messages like "XXX link down" do not tell you what the procedure that recognized this fact had to do to continue using the other links. The programmers took some shortcuts and provided some common code that is used by all interfaces, and for most applications that does not cause a problem.
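As a rough illustration (generic Linux userspace code, nothing to do with FAH, and I'm assuming a kernel new enough to expose /sys/class/net/<iface>/operstate): the link state is tracked per interface, and the loopback device "lo" has its own entry separate from the external NIC, which is the independence being discussed here.

/* Rough sketch (not FAH code): print the per-interface link state that
 * newer Linux kernels expose under /sys/class/net/<iface>/operstate.
 * "lo" has its own entry (often reported as "unknown"), separate from
 * the external NIC's entry. */
#include <stdio.h>
#include <string.h>

static void print_state(const char *iface)
{
    char path[128], state[32] = "unreadable";
    FILE *f;

    snprintf(path, sizeof(path), "/sys/class/net/%s/operstate", iface);
    f = fopen(path, "r");
    if (f) {
        if (fgets(state, sizeof(state), f))
            state[strcspn(state, "\n")] = '\0';   /* strip the newline */
        fclose(f);
    }
    printf("%-6s %s\n", iface, state);
}

int main(void)
{
    print_state("lo");    /* loopback */
    print_state("eth0");  /* example name for the external NIC; adjust as needed */
    return 0;
}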
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

bruce wrote:In Linux, some of the newer versions of the kernel have the improved IP stack, but there are a lot of versions still kicking around with the exact same problem. MacOS has not been fixed yet. Somebody else will have to tell you which specific Linux versions have the independent stacks.

As you probably know, the IP stack has a number of levels. Messages like "XXX link down" do not tell you what the procedure that recognized this fact had to do to continue using the other links. The programmers took some shortcuts and provided some common code that is used by all interfaces, and for most applications that does not cause a problem.
What would be the fundamental reason for the core to use the loopback interface?

As I mentioned before, I can understand it if that were done in preparation for an HPC installation, where for local system testing it would use the loopback interface as the MPI system interconnect. But for a program such as F@H, unless it is inherent in the MPI requirements, I don't understand why it would need a loopback interface or an external network during the run.

Just a thought. Hmmm....*ponders*...
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by Ivoshiee »

alpha754293 wrote:
bruce wrote:In Linux, some of the newer versions of the kernel have the improved IP stack, but there are a lot of versions still kicking around with the exact same problem. MacOS has not been fixed yet. Somebody else will have to tell you which specific Linux versions have the independent stacks.

As you probably know, the IP stack has a number of levels. Messages like "XXX link down" do not tell you what the procedure that recognized this fact had to do to continue using the other links. The programmers took some shortcuts and provided some common code that is used by all interfaces, and for most applications that does not cause a problem.
What would be the fundamental reason for the core to use the loopback interface?

As I mentioned before, I can understand it if that were done in preparation for an HPC installation, where for local system testing it would use the loopback interface as the MPI system interconnect. But for a program such as F@H, unless it is inherent in the MPI requirements, I don't understand why it would need a loopback interface or an external network during the run.

Just a thought. Hmmm....*ponders*...
It is just MPI's nature to do things over the network.
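For example, even a trivial two-rank MPI program (a generic MPICH-style sketch, not the FAH core) moves its data through the MPI runtime, and with a TCP-based channel that means sockets, which for ranks on the same box usually means 127.0.0.1:

/* Generic MPICH-style sketch (not the FAH core): two ranks exchange an
 * integer through the MPI runtime.  With a TCP-based channel the data
 * travels over sockets even when both ranks run on the same machine,
 * typically via the loopback address.
 * Roughly: mpicc ping.c -o ping && mpirun -np 2 ./ping */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}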
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

Ivoshiee wrote:
alpha754293 wrote:
bruce wrote:In Linux, some of the newer versions of the kernel have the improved IP stack, but there are a lot of versions still kicking around with the exact same problem. MacOS has not been fixed yet. Somebody else will have to tell you which specific Linux versions have the independent stacks.

As you probably know, the IP stack has a number of levels. Messages like "XXX link down" do not tell you what the procedure that recognized this fact had to do to continue using the other links. The programmers took some shortcuts and provided some common code that is used by all interfaces, and for most applications that does not cause a problem.
What would be the fundamental reason for the core to use the loopback interface?

As I mentioned before, I can understand it if that were done in preparation for an HPC installation, where for local system testing it would use the loopback interface as the MPI system interconnect. But for a program such as F@H, unless it is inherent in the MPI requirements, I don't understand why it would need a loopback interface or an external network during the run.

Just a thought. Hmmm....*ponders*...
It is just MPI's nature to do things over the network.
Okay, so here's another question then: what compiler is PG currently using? Can/does it have support for OpenMP 3.0? If not, would it be feasible to evaluate any one of the following:
* Intel 11.0: Linux (x86), Windows (x86) and MacOS (x86)
* Sun Studio Express 11/08: Linux (x86) and Solaris (SPARC + x86)
* PGI 8.0: Linux (x86) and Windows (x86)
* IBM 10.1: Linux (POWER) and AIX (POWER)

Does anybody know if OpenMP requires a network interface as well, or can it do without one?
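(From the little I've seen, a plain OpenMP loop is just shared-memory threading inside one process, so it shouldn't need any sockets or network interface at all, unlike MPI. Something like this generic sketch, which has nothing to do with the FAH cores, is what I have in mind:)

/* Generic OpenMP sketch (not FAH code): the parallelism is pure
 * shared-memory threading within a single process: no sockets, no
 * loopback, no network interface involved.
 * Roughly: gcc -fopenmp sum.c -o sum && ./sum */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;
    int i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += 1.0 / (i + 1);

    printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}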

Does this also mean that, in theory, the F@H client can run in an HPC setting/environment using some kind of clustering and/or grid-engine management software because of the MPICH implementation?
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona
Contact:

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by 7im »

You may be a bit late with that question. The development of the SMP2 client is already underway, as noted in Vijay's Blog (NEWS). Then again, they might be using OpenMP already.

In theory, yes, the MPICH client could function in high-speed clusters, but the interconnects would have to be extremely fast between each node, as the amount of data traveling between nodes is very large. GBs of data... The 4- and 8-node "clusters" of a quad-core and dual quad-core system perform much better, and the SMP client doesn't currently scale well beyond that. So support for large clusters isn't a priority.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

7im wrote:You may be a bit late with that question. The development of the SMP2 client is already underway, as noted in Vijay's Blog (NEWS). Then again, they might be using OpenMP already.

In theory, yes, the MPICH client could function in high-speed clusters, but the interconnects would have to be extremely fast between each node, as the amount of data traveling between nodes is very large. GBs of data... The 4- and 8-node "clusters" of a quad-core and dual quad-core system perform much better, and the SMP client doesn't currently scale well beyond that. So support for large clusters isn't a priority.
Well, I remember briefly talking about it quite a while ago, and even then people said that it was already under development, but there were no details in terms of what they were using for the parallelization, only speculation (which, of course, is worse than having cold, hard data).

So I was just wondering if anybody knows of anything that might only be in the admin section of the forums or something?

I was also doing a little bit of research, and apparently they are preparing OpenMP to be CUDA-compatible, which might be an interesting approach: in theory, if it works, you would have an SMP GPU client.

From what (little) I know, in the HPC world MPI is still preferred for anything from roughly 10k-15k cores/processing units and up, whereas OpenMP is starting to pop up in > 1k monolithic systems because of the availability of IB and Myrinet along with 100G networks.

A friend of mine and I were speculating that future-generation processors will be more akin to the PowerXCell 8i and the GT200-series GPUs, where the ALUs and FPUs form the fundamental building blocks of the CPU; because those units themselves cannot distinguish between general (CPU-style) data and visual (GPU/VPU-style) data, the processors can become very fast and very flexible in terms of what they do.

It would also open up a lot more options in terms of what is possible architecturally and physically, including the amalgamation of CPU/GPU tasks as well as their physical and computational resources.

(In some ways, the UltraSPARC T2+ already does that. Not quite exactly the same, but they've already started.)
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by VijayPande »

OpenMP and MPI are very different. In our experience, it is difficult to get the kind of high performance out of OpenMP that one can get out of MPI, which is why MPI is used so frequently. Converting MPI code to OpenMP would be very difficult to do and would likely result in slower code anyway.
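To give a rough feel for how different the two models are, here is a generic side-by-side sketch (illustrative only, not Gromacs or FAH code): the same array sum is a one-line pragma in OpenMP, but an explicit decomposition plus a message-passing reduction in MPI.

/* Illustrative contrast only (not Gromacs or FAH code): the same sum in
 * the two programming models.
 * Roughly: mpicc -fopenmp contrast.c -o contrast && mpirun -np 2 ./contrast */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 1000000

/* OpenMP: shared memory, one process, a single pragma. */
static double sum_openmp(const double *x, int n)
{
    double s = 0.0;
    int i;
    #pragma omp parallel for reduction(+:s)
    for (i = 0; i < n; i++)
        s += x[i];
    return s;
}

int main(int argc, char **argv)
{
    static double x[N];
    int rank, size, i, first, last, chunk;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < N; i++)
        x[i] = 1.0;

    /* MPI: every rank sums its own slice, then the partial results are
     * combined with an explicit message-passing reduction. */
    chunk = N / size;
    first = rank * chunk;
    last = (rank == size - 1) ? N : first + chunk;
    for (i = first; i < last; i++)
        local += x[i];
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("OpenMP sum = %.0f, MPI sum = %.0f\n", sum_openmp(x, N), total);

    MPI_Finalize();
    return 0;
}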
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

VijayPande wrote:OpenMP and MPI are very different. In our experience, it is difficult to get the kind of high performance out of OpenMP that one can get out of MPI, which is why MPI is used so frequently. Converting MPI code to OpenMP would be very difficult to do and would likely result in slower code anyway.
I don't suppose there would be any way to make MPI more fault-tolerant to temporary TCP/IP disconnects, considering that the behavior is either inherent in MPI or in the TCP/IP stack of the host OS.

I might have to try distributed F@H using Cluster Knoppix or something.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by bruce »

alpha754293 wrote:I might have to try distributed F@H using Cluster Knoppix or something.
FAH SMP is not distributed FAH. The in-core transfers are going to have trouble on your cluster.

FAH is a massive cluster distributing work from Stanford at internet speeds, with each node being a CPU client, a GPU client, an SMP client, or a PS3 client. Even with the SMP client, it's not designed for a cluster of clusters. If there were some kind of configuration setting that allowed you to convert an SMP client into a cluster client, it would be EXTREMELY slow because of the massive amounts of data that are passed between the nodes.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

bruce wrote:
alpha754293 wrote:I might have to try distributed F@H using Cluster Knoppix or something.
FAH SMP is not distributed FAH. The in-core transfers are going to have trouble on your cluster.

FAH is a massive cluster distributing work from Stanford at internet speeds, with each node being a CPU client, a GPU client, an SMP client, or a PS3 client. Even with the SMP client, it's not designed for a cluster of clusters. If there were some kind of configuration setting that allowed you to convert an SMP client into a cluster client, it would be EXTREMELY slow because of the massive amounts of data that are passed between the nodes.
I dunno. I just thought that I would try it using a LiveCD. I'm not even entirely sure it would work: there used to be two cluster Live CDs, clusterKNOPPIX and ParallelKNOPPIX, that could do it, but the kernel may only be 32-bit, so it likely won't work. I'm still looking for others, because I know CentOS is used as an HPC/cluster OS, but I don't know whether it can do automatic clustering.

The idea just kind of came about because of the inherent network requirements of MPI, so I figured I'd try it and see how well (or how poorly) it works, if I can get it to work at all.

For my home systems, the plan over the next 3-5 years is to move almost all of my systems into a) rackmounts (more space- and power-efficient, as well as better for thermal management), b) probably a centralized server setup for the entire house (PXE boots across the board, two servers: one with SSDs for OS and app installations, the other purely disk going onto tape), and c) nearly the entire LAN onto InfiniBand 4x DDR (16 Gbps), since it would be more cost-effective than 12x QDR IB, a 100G network, or Myrinet (and faster, given that Myri-10G is the highest that goes so far), which would d) potentially allow me to run IPoIB and then KVMoIP so that everything basically runs through the IB interconnect.

The initial cost will be high, but ultimately it would simplify my entire setup across the board by pushing everything, including data, management, network, and communications, onto one layer (which I am pretty certain I can make HA or bonded).

That would also mean that, regardless of the physical location of a given system, it should see roughly a 2 GB/s interface. Nowhere near as fast as the onboard memory, but if the traffic is passing through the loopback anyway, it probably doesn't/can't use the host memory interface even for inter-core communications.

(BTW, if someone knows of a way for me to tag and monitor the number of memory I/Os from the FAH cores, let me know and I'll check it on my servers so that I can do a compare and contrast later.)

This would mean that any of the systems running an HPC OS (I'll eventually have to check for cross-compatibility) would form a true uber-mini HPC setup here, one that wouldn't care where the cores were on the IB network.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona
Contact:

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by 7im »

I'm not one to discourage experimentation, but unless you're already playing with some of that high end iron you mentioned, the interconnects aren't fast enough, and the cluster would not meet the short deadlines, even assuming you could hack the client enough to make it work on a cluster. However, experimentation is a reward in and of itself and those roadblocks should not stand in the way of your own self gratification.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by alpha754293 »

7im wrote:I'm not one to discourage experimentation, but unless you're already playing with some of that high end iron you mentioned, the interconnects aren't fast enough, and the cluster would not meet the short deadlines, even assuming you could hack the client enough to make it work on a cluster. However, experimentation is a reward in and of itself and those roadblocks should not stand in the way of your own self gratification.
Well, the only other option would be to ask if anybody else has already played around with distributed parallel processing over Myrinet or IB?

Judging from the type and style of responses that I have been getting, I am going to presume "no" at the moment, unless I am told differently.

Which would also mean that, for any questions pertaining to the performance loss (as a quantity), no data is available.

This would also tell me how I might need to structure and configure a proposed 768-core blade rack that can be used for F@H when I do not have other FEA/CFD jobs running on it.

So unless someone can present me with data that will help me answer those questions, it looks like I am going to have to do the testing myself.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona
Contact:

Re: Project: 0 (Run 0, Clone 0, Gen 0)

Post by 7im »

alpha754293 wrote:Well, the only other option would be to ask if anybody else has already played around with distributed parallel processing over Myrinet or IB?
Yes. Google is your friend. http://post.queensu.ca/~afsahi/PPRL/papers/CAC07.pdf

Now if you really meant to ask if anybody else has already played around with FAH distributed parallel processing over Myrinet or IB, that would be an entirely different question that only Pande Group might be able to answer. ;)
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.