Re: measuring/monitoring F@H
Posted: Mon Mar 30, 2009 5:33 pm
*edit*
Source: Huang, J. (2003). Understanding Gigabit Ethernet Performance on Sun Fire Systems. Sun Microsystems: Santa Clara, CA. Retrieved from http://www.sun.com/blueprints/0203/817-1657.pdf on March 30, 2009.
It states that in their tests on the Sun Fire 6800, the round-trip latency is 140 ms (or 70 ms one-way). So, based on that, I think the way for me to find out the latency between my systems is to ping them in both directions.
I'm currently averaging about 1 ms (or slightly less, at about 0.7 ms). It seems that when I start the ping command (both in Windows and in Linux), there's a bit of a lag (up to about 8 ms) before it settles down to the steady round-trip times. I'm doing a bit more data collecting now while the F@H standalone SMP clients are running in the background.
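In case anyone wants to repeat the measurement, here's a rough sketch of what I'm doing: ping the other box a bunch of times, throw away the first few samples (the warm-up lag I mentioned), and average the rest. It assumes the Linux ping output format, and the address 192.168.1.50 is just a placeholder for whichever node you're testing against.

import re
import subprocess

HOST = "192.168.1.50"   # placeholder -- substitute the other node's address
SAMPLES = 20            # number of pings to average
WARMUP = 3              # discard the first few, which tend to run long

times = []
for _ in range(SAMPLES):
    out = subprocess.run(["ping", "-c", "1", HOST],
                         capture_output=True, text=True).stdout
    m = re.search(r"time=([\d.]+) ms", out)   # Linux ping prints "time=0.7 ms"
    if m:
        times.append(float(m.group(1)))

if not times:
    raise SystemExit("no replies from " + HOST)

steady = times[WARMUP:]
print("all samples  : %.3f ms avg" % (sum(times) / len(times)))
print("after warmup : %.3f ms avg" % (sum(steady) / len(steady)))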
One of the two systems I'm going to be trying distributed parallel processing on is currently running Windows XP Pro 32-bit (Q9550, 2.83 GHz, quad-core, 8 GB RAM).
The other system (dual AMD Opteron 2220, 2.8 GHz, 8 GB RAM) is currently running RHEL4 WS.
I'm doing absolutely NOTHING to tweak/tune/refine any of the TCP parameters, although if you read the PDF cited above, you'd see that they go into changing the payload size, the TCP window size, reducing overhead, and a whole slew of other parameters. I'm not really going to bother with all that. Mostly because I'm lazy, but also because, while I'm pretty sure those settings affect the effective transfer rates, I don't think that on a real distributed parallel deployment people are going to spend that much time constantly tweaking them. I suppose if they had a real-time solver or something that could automate the whole process, then sure, let the system pick the parameters that maximize throughput; but I'm much too lazy (and don't really have the time) to do that extensive a study.
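Just to illustrate the kind of knobs the Sun paper is turning (and that I'm skipping), here's a minimal sketch of setting per-socket send/receive buffer sizes, which is what bounds the effective TCP window. The 1 MB figure is an arbitrary example, not a recommendation; I'm not doing any of this on my boxes.

import socket

BUF_SIZE = 1 << 20  # 1 MB buffers -- arbitrary example value

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)

# The OS may round or clamp these, so read back what it actually gave us.
print("SO_SNDBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()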
If I were to go for a distributed parallel processing deployment, I'd want to know just enough to get the cluster up and running and to know what triggers the load balancing (in ClusterKNOPPIX, for example, it used to be that you needed roughly 8 processes before it would start load balancing; I don't know the exact details, but that's what I noticed in earlier testing I did for work). Then set up the F@H SMP clients and let 'er rip (and watch the GbE switch light up like a Christmas tree).
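For what it's worth, here's a toy sketch of the "spawn enough work to trip the load balancer" idea: start N CPU-bound workers and let them run for a while. The N=8 figure is only my vague recollection of ClusterKNOPPIX's behaviour, not anything documented.

import time
from multiprocessing import Process

N_WORKERS = 8      # half-remembered threshold, purely illustrative
RUN_SECONDS = 60   # arbitrary burn time

def burn(seconds):
    """Keep one CPU busy for roughly `seconds` seconds."""
    end = time.time() + seconds
    x = 0
    while time.time() < end:
        x = (x + 1) % 1000003
    return x

if __name__ == "__main__":
    procs = [Process(target=burn, args=(RUN_SECONDS,)) for _ in range(N_WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()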
So my current presumption is that even with a constant 12 MB/s data flow, GbE is essentially designed to reach and sustain gigabit speeds, which would tend to mean it has to have low latency to manage that when nothing is being done to tweak the payload size, TCP window size, or anything else like that.
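A quick back-of-envelope check on that 12 MB/s figure (the flow rate is my assumption, not a measurement from the clustering tests):

flow_MB_s = 12.0              # assumed steady data flow between nodes
gbe_Mbit_s = 1000.0           # nominal gigabit line rate
flow_Mbit_s = flow_MB_s * 8   # bytes -> bits

print("flow: %.0f Mbit/s, about %.0f%% of the GbE line rate"
      % (flow_Mbit_s, 100 * flow_Mbit_s / gbe_Mbit_s))
# -> roughly 96 Mbit/s, i.e. ~10% of the link, so raw bandwidth shouldn't
#    be the bottleneck; latency is the bigger open question.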
Of course, I won't know until I actually do the clustering tests, BUT that's at least the current working presumption, and it suggests that it is actually quite possible to run F@H in a distributed parallel setting. I'm not entirely sure how the number of nodes would affect its overhead, but this COULD be good news for people who have a number of single- and dual-processor (core) systems lying around: they could string them together with cheap GbE and run the SMP client, getting a much higher PPD than the uniprocessor clients would be able to.
However unrealistic, IF this actually works and people start doing it, it would be nice to see a good portion of the uniprocessor units converted to SMP units, so we'd be able to throw more processors at them and run them more efficiently. *shrug* I dunno, just a thought.