alpha754293 wrote: Well, the only other option would be to ask if anybody else has already played around with distributed parallel processing over Myrinet or IB?
Now if you really meant to ask if anybody else has already played around with FAH distributed parallel processing over Myrinet or IB, that would be an entirely different question that only Pande Group might be able to answer.
Interesting paper.
iWARP shows higher latency regardless of the number of connections, whereas IB sits at roughly half the latency at nearly any connection count.
And IB also has much higher bandwidth (close to 2 GB/s), while iWARP can't even break the 1 GB/s mark.
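If anyone wants a rough feel for the same latency/bandwidth numbers on their own interconnect, a minimal MPI ping-pong along these lines would do it. This is just a sketch: it assumes mpi4py is installed, uses whatever interconnect your MPI library is configured for, and the message sizes are arbitrary.

[code]
# Minimal MPI ping-pong sketch (assumes mpi4py; run as, e.g.:
#   mpirun -np 2 python pingpong.py
# over whatever interconnect your MPI library is built for).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

for size in (1, 1024, 64 * 1024, 1024 * 1024):
    buf = bytearray(size)
    reps = 100
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    rtt = (MPI.Wtime() - t0) / reps
    if rank == 0:
        # one-way latency is roughly half the round trip;
        # bandwidth counts the full round trip's worth of data
        print(f"{size:>8} B  latency ~ {rtt / 2 * 1e6:.1f} us  "
              f"bandwidth ~ {2 * size / rtt / 1e6:.1f} MB/s")
[/code]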
BTW 7im, I couldn't find any recent reference to SMP2 in the project blog; the latest news I found on it was here, dated 21 Jan 2009:
viewtopic.php?f=16&t=8000&p=79603&hilit=SMP2#p79603
As a Mod, I'm sure you know that Pande Group does not typically comment on unreleased clients. A lack of recent public comments does not indicate a lack of progress, nor does it indicate progress, for whatever that's worth. I too wish they would comment more frequently sometimes, but often that just spurs more pestering and speculation, so it's not always helpful.
I hate to say it, but alpha754293 is, IMO, wasting his time in trying to run the FAH client on a cluster. I do know the SMP client is somewhat restricted as to what type of environment it will run on. So while it may be a waste of time, he won't waste much time at all, as he'll hit a roadblock rather quickly.
I don't know either. I think that it would be slow, but then again, I am not entirely sure how much intercore I/O there really is.
I have one system right now that's dual socket single core 1.8 GHz Opteron (file server) that's running the SMP client and gets only about 900 PPD or so.
So, on the one hand, if the MPI communication (size, length, and frequency) turns out to be heavy, then at least we will have some data on it.
Besides, I'm just using existing hardware, and this data will be useful (to me at least) for setting up an HPC-like structure for F@H as an added perk on top of the normal HPC FEA/CFD that I'd be running anyway, even over standard, conventional GbE.
In terms of OS options, there's a PelicanHPC install, and another one that I downloaded that was a LiveCD cluster Linux. I will probably have to play around with CentOS if I really want to get this going; I've done an install of it before and it looks a fair bit like RHEL. There are supposed to be clustering tools that come with it, but I've never played around with them, and the last time I tried clusterKNOPPIX was about 3 or 4 years ago, so things are probably quite different now.
Who knows. I will have to test/try it out, since apparently the openMOSIX that it was based on is no longer maintained either.
BUT... (to address/respond to 7im's points): in theory, I can set it to run deadlineless. Also in theory, this would let a bunch of really small, old computers work together, and with any sort of reasonable data collection you could actually build a neural-net application that ought to give a pretty good guesstimate as to whether or not you'd make the WU deadlines, based on the processor type/architecture/speed and your interconnect medium.
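Just to make that idea concrete, here's a very rough sketch of the kind of guesstimate I mean. Every number and the will_make_deadline helper are hypothetical; a real version would be fed benchmark data from the actual boxes and interconnect, which is where the collected data / neural net would come in.

[code]
# Back-of-the-envelope deadline check (all inputs hypothetical; real numbers
# would come from benchmarking the actual machines and interconnect).
def will_make_deadline(frames, sec_per_frame_single, n_nodes,
                       parallel_efficiency, deadline_days):
    """Estimate whether a WU finishes before its deadline.

    parallel_efficiency lumps the interconnect penalty into one factor
    (1.0 = perfect scaling; lower for GbE, higher for IB/Myrinet)."""
    sec_per_frame = sec_per_frame_single / (n_nodes * parallel_efficiency)
    total_days = frames * sec_per_frame / 86400.0
    return total_days <= deadline_days, total_days

# e.g. 100 frames, 20 min/frame on a single old P4, four of them over GbE
ok, days = will_make_deadline(100, 1200, 4, 0.6, deadline_days=4.0)
print(f"estimated {days:.1f} days -> {'makes' if ok else 'misses'} the deadline")
[/code]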
It's theoretically possible. And while most people probably wouldn't even consider running it this way, for others who may not have a quad-core system but do have a few older P4s lying around that can't use multiple sockets or cores, running it as a distributed parallel application might be a way to circumvent some of those limitations.
And yes, I'm aware of all the PG internal development stuff as well. I'm not saying that there's no progress; I'm just pondering which direction they're headed next.
'Course, if you ask me, I'd be porting it over to Roadrunner. But that's just me.
The only facts I know of to back up the assertion that FAH would be slow on a cluster (if it's possible to run at all) are some early data where the loopback adapter statistics were noted before and after a WU was run. Both the number of I/Os and the total amount of data transferred were extremely large. I may be wrong about this, but I do think that the initial MPI implementations were based on everything going through the IP stack, and later versions were optimized to use in-core transfers for the large blocks of data, though some signaling still goes through the stack. I'm not sure how much of this applies to Linux/MacOS or to either of the Windows implementations, though. You might get three different answers if you test it now.
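For anyone who wants to repeat that measurement, a rough sketch of the idea (Linux-only, since it just reads the loopback counters out of /proc/net/dev before and after a run):

[code]
# Snapshot the Linux loopback counters around a run (rough sketch; reads
# /proc/net/dev, so Linux only).
def lo_counters():
    with open("/proc/net/dev") as f:
        for line in f:
            if ":" not in line:
                continue  # skip the two header lines
            name, data = line.split(":", 1)
            if name.strip() == "lo":
                fields = data.split()
                # receive bytes/packets are fields 0/1, transmit bytes/packets 8/9
                return int(fields[0]), int(fields[1]), int(fields[8]), int(fields[9])
    raise RuntimeError("no loopback interface found")

before = lo_counters()
input("Run the work unit (or a few frames of it), then press Enter...")
after = lo_counters()
for label, b, a in zip(("rx bytes", "rx packets", "tx bytes", "tx packets"),
                       before, after):
    print(f"{label:<10} {a - b}")
[/code]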
bruce wrote: The only facts I know of to back up the assertion that FAH would be slow on a cluster (if it's possible to run at all) are some early data where the loopback adapter statistics were noted before and after a WU was run. Both the number of I/Os and the total amount of data transferred were extremely large. I may be wrong about this, but I do think that the initial MPI implementations were based on everything going through the IP stack, and later versions were optimized to use in-core transfers for the large blocks of data, though some signaling still goes through the stack. I'm not sure how much of this applies to Linux/MacOS or to either of the Windows implementations, though. You might get three different answers if you test it now.
Going from that, I once posted about the Deino x64 being available and wondered if it wouldn't be easier to use a single dependency (as I think Deino is also what's used for the Linux clients, and I guess also for MacOS). My hunch, therefore, is that SMP2 is a Deino SMP client which is more universal and therefore easier to maintain. Not sure how that relates to being more friendly/open to cluster computing, but I think Deino's directory staging would make it a lot easier.
I thought the built-in MPI support was based on Deino rather than MPICH. MPICH, as far as I can tell from the wiki, is an implementation of the MPI 1.1 standard; MPICH2 is an implementation of the MPI 2.1 standard, and Deino is a derivative of MPICH2. As I said, I was under the impression that Linux/MacOS did not ship with MPICH but rather with the Deino derivative. Guess I was wrong; I think I've been corrected on this before as well, I just couldn't remember it until you did so again.
Guess it's because I'm not familiar enough with Linux; someone saying it ships with something doesn't get retained in memory unless I can check it and see it for myself. Memory has always been a problem area for me, as has language.
From what I know, the MPI that usually ships is MPICH, because Deino is newer (than MPICH). Therefore, I have always assumed that the MPI implementations are MPICH unless otherwise specified.
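If you want to check which implementation a given box actually has on its PATH, something like this is a quick way. It's just a sketch: the --version flag identifies the build for Open MPI and recent MPICH/MPICH2 installs, though older MPICH1 wrapper scripts may not support it.

[code]
# Quick check of which MPI implementation is on the PATH (a sketch; the
# --version flag works for Open MPI and recent MPICH/MPICH2 builds, but
# older MPICH1 wrapper scripts may not support it).
import subprocess

try:
    result = subprocess.run(["mpirun", "--version"],
                            capture_output=True, text=True)
    banner = (result.stdout or result.stderr).strip()
    print(banner.splitlines()[0] if banner else "mpirun gave no version output")
except FileNotFoundError:
    print("no mpirun found on PATH")
[/code]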
If it's going through the IP stack, it's got to be quite slow even though it's loopback.
alpha754293 wrote: If it's going through the IP stack, it's got to be quite slow even though it's loopback.
Slower than memory-to-memory transfers, but faster than if an actual device had to move the data. The loopback device is a virtual device that just enqueues the buffer on the stack without actually moving the data anywhere, whereas even the ultra-fast connections that you're talking about have to actually move the data.
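If anyone is curious how fast the loopback path actually is on their own box, a quick localhost TCP throughput probe along these lines would show it (a sketch; the port number and transfer sizes are arbitrary):

[code]
# Rough localhost TCP throughput probe: how fast does data move through the
# IP stack when the "wire" is the loopback device? (port/sizes are arbitrary)
import socket, threading, time

SIZE = 256 * 1024 * 1024    # move 256 MB in total
CHUNK = 1 << 20             # 1 MB per send

def sink(port):
    srv = socket.socket()
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    remaining = SIZE
    while remaining > 0:
        data = conn.recv(min(CHUNK, remaining))
        if not data:
            break
        remaining -= len(data)
    conn.close()
    srv.close()

port = 50007  # arbitrary
t = threading.Thread(target=sink, args=(port,))
t.start()
time.sleep(0.2)  # give the listener a moment to come up

cli = socket.socket()
cli.connect(("127.0.0.1", port))
payload = b"\0" * CHUNK
start = time.time()
sent = 0
while sent < SIZE:
    cli.sendall(payload)
    sent += CHUNK
cli.close()
t.join()  # wait until the receiver has drained everything
print(f"loopback TCP: {SIZE / (time.time() - start) / 1e9:.2f} GB/s")
[/code]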
But even as a virtual device, it still has to go through the whole "translation" process from the core interconnect to the TCP/IP stack and back.
While you're right that it's quicker than actually pushing data through the PHY layer, I don't suppose that a real link would really be that much slower (limited by the interconnect medium/chip/controller).
A 16-pair, 1 GHz HT link is only 4 GB/s. 4x DDR IB is 2 GB/s. And if you could theoretically bond two 4x DDR IB links together, you'd get the same bandwidth as the HT link (with the added advantage that you can add or remove any system from the IB network/fabric, whereas you can't scale the HT link on the fly, dynamically, or hot).
*edit*
Over GbE, yes, it's slow. 10G or 100G or IB QDR or DDR -- mehhh... it starts to get kind of close to the HT link (unless you count coherent links, but then again, all coherent links basically buy you is total aggregate bandwidth anyway, and in theory you can do the same calculation/representation with 10G, 100G, or IB DDR/QDR).
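For what it's worth, here's the arithmetic behind those figures, as a quick sanity check (per direction, ignoring any protocol overhead other than IB's 8b/10b encoding):

[code]
# Back-of-the-envelope check of the link-bandwidth figures above
# (per-direction, ignoring protocol overhead beyond 8b/10b on IB).
ht_width_bits = 16        # 16-pair HT link
ht_clock_ghz  = 1.0       # 1 GHz clock, DDR -> 2 GT/s
ht_gbytes     = ht_width_bits / 8 * ht_clock_ghz * 2      # = 4.0 GB/s

ib_lanes      = 4         # 4x link
ib_ddr_gbit   = 5.0       # 5 Gbit/s signalling per lane (DDR)
ib_gbytes     = ib_lanes * ib_ddr_gbit * (8 / 10) / 8     # = 2.0 GB/s

print(f"HT 16-bit @ 1 GHz DDR : {ht_gbytes:.1f} GB/s")
print(f"IB 4x DDR             : {ib_gbytes:.1f} GB/s")
print(f"two bonded 4x DDR     : {2 * ib_gbytes:.1f} GB/s")
[/code]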