My Beowulf cluster
Posted: Wed Dec 21, 2011 5:30 pm
by ATG
My questions are in reference to a Beowulf style cluster that I built recently that I'm running FAH6 on.
The machines are all running Fully Automated Installation, a flavor of Debian Squeeze. They are not running FAH6 with message passing or any of that. Right now the cluster is more like a series of networked units with the main unit as the DHCP server.
Now the question is, my statistics aren't working quite correctly and I'm not sure what's happening. I'm trying to monitor them from my laptop, which is running Win7. I created a team (213973) and a passkey and all that, and I am getting a strange output from the stats.
The cluster has been running nonstop doing nothing but FAH for more than a week now, but I only see some work units on the stats page. The machines are not all the same, and it would take me a little while to explain the quirks with each unit. There are 4 total in the cluster, plus the laptop here and there when I'm on it, so a total of 5 machines should show up on the stats page, but I've only been seeing 4, and without correct timestamps.
I had the machines running with default settings for a while. Do the servers at Stanford log the MAC addresses for identification? If so, I'm not sure why I wouldn't be seeing the correct numbers, because I'm not doing any weird security things like using a temp MAC address when the systems boot.
Just your run of the mill DHCP server as the head and a dumb netgear switch connecting them.
One reason I want to make sure the stats are right is that I want to make sure my systems aren't doing overlapping work and wasting time. I should be getting tons of work done, and I want to know how much, so I know that the systems are functioning the way they should be.
I hope that's enough information about the systems so that my questions can be answered easily. By the way the team name is AaroneusTheGreat and the individual user name I've been using is the same as my forum name.
Re: My cluster
Posted: Thu Dec 22, 2011 5:22 am
by bruce
Well, considering your cluster as a series of networked machines is a good place to start because that's how FAH is going to treat it.
FAH does not keep track of MAC addresses. When a FAH6 client is installed on a Linux node, it creates a file called machinedependent.dat which identifies that node in the worldwide cluster called FAH. A work unit is assigned to that node, and the server expects to get it back from that same identifier. If that same node requests work without completing its current assignment, you're likely to get duplicated assignments, so don't duplicate or move machinedependent.dat between machines. That MIGHT be what's happening to your credits.
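One way to spot a copied machinedependent.dat is to compare the files byte for byte across nodes. This is only a minimal sketch: the node names and byte contents below are made up, nothing is assumed about the file's internal format, and treating "identical bytes" as "copied identity" is itself an assumption.

```python
import hashlib
from collections import defaultdict


def find_shared_identity_files(files_by_node):
    """Return groups of nodes whose identity file contents are byte-identical.

    files_by_node maps a node name to the raw bytes of its
    machinedependent.dat (collected however you like, e.g. over ssh).
    """
    nodes_by_digest = defaultdict(list)
    for node, content in files_by_node.items():
        digest = hashlib.sha256(content).hexdigest()
        nodes_by_digest[digest].append(node)
    # Only groups with more than one node indicate a shared file.
    return [sorted(nodes) for nodes in nodes_by_digest.values() if len(nodes) > 1]


# Hypothetical contents, purely for illustration.
sample = {
    "beowulf0": b"\x01\x02\x03\x04",
    "beowulf1": b"\x05\x06\x07\x08",
    "beowulf2": b"\x01\x02\x03\x04",  # pretend this one was copied from beowulf0
}
print(find_shared_identity_files(sample))
```

An empty list would mean every node has its own file; any group that comes back is a set of nodes worth checking by hand.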
Re: My cluster
Posted: Thu Dec 22, 2011 7:01 pm
by ATG
Right, as it is right now, the cluster is just a series of networked machines. It's well on its way to being a fully fledged Beowulf. I suppose you could call it a Beo-Puppy right now. It has a lot of the interfacing set up, but I've been having to learn a ton about setting up the interfaces in order to use them and I really have nothing to run on it yet to do much testing, so I'm going to have to program a few things myself. Now, that being said, how does one verify that the machine information is unique on each? I can't seem to find an identifier for the work units. The machines have been running for a few weeks, working on things, so I know they've sent stuff back to the servers at Stanford. Problem is, I don't know exactly what.
Also I tried installing FahMon on the head unit and ran into some big issues compiling and installing it. I searched the wikis and read some forum stuff, but I can't find what's wrong. I'm getting some errors during the make process, which doesn't make sense to me because it compiled fine, and the object code was spit out correctly, and it got through a ton of make stuff before encountering any errors during the make install step. It threw back a couple of errors whose meaning I wasn't sure of. So, I really don't know enough about tracking the progress to give you an idea whether or not the machines are overlapping work.
Now one thing that's probably important is that this cluster will be made from fat clients (I think that's the right term, they all have hard drives), so they don't pull the boot code from the head unit. I would think that because they are all running from their own OS images and their own instances of FAH, they would be able to run through the network and keep their individual identity. Please correct me if I've made any errors in my terms or any of that, I'm fairly new to a lot of this stuff.
Re: My cluster
Posted: Thu Dec 22, 2011 10:37 pm
by codysluder
I know nothing about clusters except the very basic concepts. Maybe my stupid questions will help focus on some of the possible issues.
When a task is started on a cluster, I think there must be some mechanism to migrate it to a specific node or nodes. I have no idea how that works, but I don't think FAH can use that capability. With a "fat client", can't you just install FAH6 on the local HD and run it on that node, totally independent of any other node? That may be what you're already doing, but it should work.
Each FAH client should create a FAHlog.txt showing what it's doing. At the top, you'll see your User Name, your Team Number, and a User ID. Whether you can find machinedependent.dat or not, the hexadecimal values in User ID are what bruce called the FAH node number. If each one is unique, you're fine.
Re: My cluster
Posted: Fri Dec 23, 2011 10:36 pm
by ATG
Oh very good! I looked at the log but honestly wasn't sure what I was looking at; some of the numbers and things meant nothing to me because I don't know much about the internals of FAH. Each unit does have its own HD and each has its own operating system. That is what I meant by "fat client"; you're right on the money. This does explain a bit about how to verify the data. How you described it is in fact how the units are running right now. And you were right about the software to migrate the tasks. The software I'm setting up for that is OpenMPI, and I was aware that FAH can't use it, probably due to the distributed nature of the program itself and because this is the standard FAH6, not some variation for what I'm running. It was your standard Linux based FAH installation.
Thanks for pointing out the hex code and the fact that they should be different. I may end up having to make up a user and passkey for each unit but I'm not sure. As it is I set them all up with the same info when it came to User Name and Passkey and Team, thinking that they would all contribute to the same statistic but keep their individual identity separate. By the way I now have 7 units in the cluster, so I really need to make sure that they are all working on different things so that the time isn't wasted, because the cluster will contribute a lot more now.
Thanks Bruce and Codysluder, any input is good input. I have a lot to learn, which is why I'm on here in the first place.
Re: My cluster
Posted: Sat Dec 24, 2011 1:50 am
by codysluder
ATG wrote: Thanks for pointing out the hex code and the fact that they should be different. I may end up having to make up a user and passkey for each unit but I'm not sure. As it is I set them all up with the same info when it came to User Name and Passkey and Team, thinking that they would all contribute to the same statistic but keep their individual identity separate. By the way I now have 7 units in the cluster, so I really need to make sure that they are all working on different things so that the time isn't wasted, because the cluster will contribute a lot more now.
I wouldn't recommend that you create separate user names and/or passkeys for each client. That's a waste of good points unless you have no other method of tracking them. The stats on the official Stanford pages will report the number of Active clients for that particular user name so at some level, they do keep track of the hexcode for User ID though you can only see the total by User Name.
Re: My cluster
Posted: Sun Dec 25, 2011 2:44 am
by ATG
I found out from the email I got when creating the passkey that I had been using a different user name with the passkey. I'm not sure that matters, but I changed it to reflect all the correct data and restarted all 5 of my clients where they left off. I'm going to add the 3 new clients as I get them up and running; right now they are pretty bad. I'm having to work around some hardware issues with software, and they are dinosaur machines. I believe that when my work units are finished on the clients I'll be getting the correct numbers. I'm assigning static IP addresses with my server right now, so there are no lease issues with dropping the IP and not getting a new one. Plus I need that done anyways to finish making this a real Beowulf, not a Beopuppy. I'm trademarking that term by the way (just kidding!).
I checked out the hex codes in the log files and they are all unique, and now that I have my configs setup correctly I think from here on out I'll be getting the correct data back. I'm not really interested in recognition, I really just want to see what this cluster can do when it would otherwise be idle. I am currently part of a gaming club at UNCC, so chances are I'll be able to recruit a bunch of fellow gamers with some pretty sweet systems to work on this project as well. We'll see where it goes. Thanks for the help figuring this out. I'll let you know in a couple of days when my work units finish if everything is running smoothly and if not I'm sure I'll be back with questions.
I wondered why I was only seeing a single user active as the client in the team, I think it may be that when the work units are finished and sent in, that's when it gets updated in the stats, but this is my own conjecture, so I don't know for sure. What I was able to intuit was that the team number is probably much more important when it comes to tracking all the units, so I think I'm on track there. Good news is I now know more what I'm doing when it comes to setting up the next three units, so there shouldn't be many problems there. Anyways I'm rambling. I'm about to get cracking setting up the next three units with Linux. Wish me luck! I'll need it! haha.
Re: My cluster
Posted: Tue Dec 27, 2011 5:12 pm
by ATG
I believe I have it all straightened out now. I have all 7 units up and running with most of the final setup done for the cluster to operate as a Beowulf. The FAH clients are set up to start at boot on all the machines and to get to work in the background. I've verified all the configs and such and it appears to be all in order. My work unit progress seems to be ticking along in the stats correctly so far. It only shows 5 CPUs active right now, but the way I figure it, it takes a day or two for these dino machines to finish units, and I had 4 units working consistently during the setup of the other three, plus my laptop, so I think it's all working fine.
Thanks for the help. Should I run into any unexpected issues going forward from here, I'll just post my questions in this thread so as not to clutter up the board, if that's the local policy for such things.
Re: My cluster
Posted: Wed Dec 28, 2011 2:54 pm
by ATG
Okay so weirdness in the stats. I've checked all my units, all of which run headless now, and I administrate them from my laptop. (Just learned how to do that, so cool!) I am still only seeing 5 active clients, and it should be more like 9 now. I have 7 machines in the cluster, plus one laptop and one desktop that run the stuff pretty much 24/7. (I leave the lappy on, on top of the cluster, so I can quickly grab it and log in if I need to.)
I checked out the logs and the unitinfo.txt files from the console, and the programs seem to be running just fine on the machines, however, the stats on my team page aren't showing up correctly. I'm wondering if there's something wrong on my end or if there might be something going on with the stats here. I'm not entirely certain. What can I do to figure that out?
It's a bit concerning that the stats aren't showing up even though I see my clients finishing work and autosending things in. I don't want to waste a bunch of processor time, and with all these machines running it would be quite a bit.
Any suggestions would be wonderful. Thanks in advance.
Re: My cluster
Posted: Wed Dec 28, 2011 3:37 pm
by bruce
Check the first page of each FAHlog. Confirm that all of the UserNames are identical and that all of the UserIDs are unique.
If that doesn't turn up anything, post the most recent Project/Run/Clone/Gen that was completed by each client.
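That check can be scripted once the top of each FAHlog has been collected. The following is a rough sketch only: it assumes header lines of the form "User name: NAME (Team N)" and "User ID: HEX", and the sample log text, user name and hex IDs below are invented for illustration rather than taken from real logs.

```python
import re

# Assumed header layout; adjust the patterns if your logs differ.
USER_NAME_RE = re.compile(r"User name:\s*(\S+)\s*\(Team\s*(\d+)\)")
USER_ID_RE = re.compile(r"User ID:\s*([0-9A-Fa-f]+)")


def read_identity(log_text):
    """Return (user_name, team, user_id) found in a log; None where missing."""
    name = USER_NAME_RE.search(log_text)
    uid = USER_ID_RE.search(log_text)
    return (
        name.group(1) if name else None,
        name.group(2) if name else None,
        uid.group(1).upper() if uid else None,
    )


def check_cluster(logs_by_node):
    """Check that all nodes share one name/team and that every User ID differs."""
    identities = {node: read_identity(text) for node, text in logs_by_node.items()}
    names = {(name, team) for name, team, _ in identities.values()}
    ids = [uid for _, _, uid in identities.values()]
    return {
        "same_user_and_team": len(names) == 1,
        "unique_user_ids": None not in ids and len(set(ids)) == len(ids),
    }


# Hypothetical log headers, one per node.
sample_logs = {
    "beowulf0": "[05:20:17] - User name: ExampleUser (Team 213973)\n"
                "[05:20:17] - User ID: 1A2B3C4D5E6F7081\n",
    "beowulf1": "[05:21:40] - User name: ExampleUser (Team 213973)\n"
                "[05:21:40] - User ID: 0F1E2D3C4B5A6978\n",
}
print(check_cluster(sample_logs))
```

If either value comes back False, the node-by-node identities are the place to look next.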
Re: My Beowulf cluster
Posted: Wed Dec 28, 2011 5:01 pm
by ATG
Ok some kind of weirdness. I looked at all the logs and all the client.conf files and found out that for some odd reason the client.conf files on a couple of units weren't right even though I set them up through the fah6 -configonly command. I verified all the user IDs; they're all different, and I also verified that my passkey is inserted correctly in each one. It may be fixed now. I don't know how long it'll take for any changes to show up in the stats. I set all the machines to small work units because I think they may have been getting messed up somehow working on normal size work units. I don't know, but I figure that small units would probably be safer for these old machines: by my logic there's less chance of messing up if you have many units working on small bits.
I haven't gotten a chance to look very closely at the Project/Run/Clone/Gen numbers, other than to verify they are different, unless I missed something, there shouldn't be a problem there. I'm not sure exactly how long it will take for my units to finish chewing through what they've got now and request a new smaller packet, but rest assured I'll be back here with the PRCG numbers if what I did doesn't fix it. I simply don't have the patience to go back through each unit right at the moment. O.o I haven't eaten breakfast yet so I'm off to do that.
Re: My Beowulf cluster
Posted: Wed Dec 28, 2011 5:35 pm
by bruce
Small units are classified based on the size of the download and upload so if you have a dial-up modem, they're a good idea. For the rest of us, it shouldn't matter, although they may be in short supply, at times.
Re: My Beowulf cluster
Posted: Wed Dec 28, 2011 5:43 pm
by PantherX
If you don't want to follow each WU, a fairly simple method is to note down the current PRCGs for each machine. Wait for 2 weeks and then post them here (or you can post them now and we can follow up after some time) and we can see which ones are done and which aren't. If we don't have a record for one of the WUs, you can then manually check that system.
Re: My Beowulf cluster
Posted: Fri Dec 30, 2011 9:06 pm
by ATG
Alright, I took the time to login to each one and check out the numbers, they are as follows:
beowulf0:
Project: 7705 (Run 27, Clone 9, Gen 10)
beowulf1:
Project: 6871 (Run 618, Clone 2, Gen 101)
beowulf2:
Project: 7704 (Run 16, Clone 4, Gen 17)
beowulf3:
Project: 6894 (Run 53, Clone 2, Gen 3)
beowulf4:
Project: 8001 (Run 21, Clone 54, Gen 13)
beowulf5:
Project: 8001 (Run 33, Clone 48, Gen 14)
beowulf6:
Project: 8001 (Run 106, Clone 42, Gen 13)
They all appear to be in order when it comes to them being different. I am not certain why, but my active CPU number still says 5, even though the logs show that I've had a ton of work sent in. On top of that, my WU count and points are going up fairly quickly, so I'm kind of confused as to why I am still only seeing 5 active CPUs.
I have not the faintest idea what's going on there. All of the clients seem to be running correctly. Not only that, but the correct user name Aaron_Miller is only showing 4 active CPUs, which would sort of make sense if this is a time issue, since I had 4 of the 7 units in the cluster running for much longer. If you have any idea what's going on here, any help would be much appreciated. I don't want to waste a bunch of useful processor time.
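For anyone who would rather check that by script than by eye, here is a small sketch that parses the Project/Run/Clone/Gen lines in the format posted above and reports any PRCG held by more than one node. The function names are made up for this example; the data is the list from this post.

```python
import re
from collections import defaultdict

PRCG_RE = re.compile(
    r"Project:\s*(\d+)\s*\(Run\s*(\d+),\s*Clone\s*(\d+),\s*Gen\s*(\d+)\)"
)


def parse_prcg(line):
    """Turn 'Project: P (Run R, Clone C, Gen G)' into a tuple of four ints."""
    match = PRCG_RE.search(line)
    if match is None:
        raise ValueError(f"not a PRCG line: {line!r}")
    return tuple(int(group) for group in match.groups())


def find_overlaps(prcg_by_node):
    """Map each PRCG that appears on more than one node to those nodes."""
    nodes_by_prcg = defaultdict(list)
    for node, line in prcg_by_node.items():
        nodes_by_prcg[parse_prcg(line)].append(node)
    return {prcg: nodes for prcg, nodes in nodes_by_prcg.items() if len(nodes) > 1}


reported = {
    "beowulf0": "Project: 7705 (Run 27, Clone 9, Gen 10)",
    "beowulf1": "Project: 6871 (Run 618, Clone 2, Gen 101)",
    "beowulf2": "Project: 7704 (Run 16, Clone 4, Gen 17)",
    "beowulf3": "Project: 6894 (Run 53, Clone 2, Gen 3)",
    "beowulf4": "Project: 8001 (Run 21, Clone 54, Gen 13)",
    "beowulf5": "Project: 8001 (Run 33, Clone 48, Gen 14)",
    "beowulf6": "Project: 8001 (Run 106, Clone 42, Gen 13)",
}
print(find_overlaps(reported))  # an empty dict means no two nodes share a WU
```

This only shows that the seven nodes hold different work units at this moment; it says nothing about whether the servers have credited them.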
A side note, I've nearly got my MPI setup correctly, so before too long I'll be able to run SMP/MPI type programs (I think that's the right terminology) on the cluster, and it will officially be a Beowulf cluster! Whoo!
Re: My Beowulf cluster
Posted: Fri Dec 30, 2011 11:38 pm
by bruce
For me to give you an update on those WUs, we'll need to wait until they've been completed.
beowulf0:
Project: 7705 (Run 27, Clone 9, Gen 10) Not completed by ATG yet.
beowulf1:
Project: 6871 (Run 618, Clone 2, Gen 101) Not completed yet.
beowulf2:
Project: 7704 (Run 16, Clone 4, Gen 17) Not completed by ATG yet.
beowulf3:
Project: 6894 (Run 53, Clone 2, Gen 3) Not completed yet.
beowulf4:
Project: 8001 (Run 21, Clone 54, Gen 13) Not completed yet.
beowulf5:
Project: 8001 (Run 33, Clone 48, Gen 14) Not completed by ATG yet.
beowulf6:
Project: 8001 (Run 106, Clone 42, Gen 13) Not completed yet.
I'm not sure if it's you or somebody else using the name ATG, but there are 4 different UserName/TeamNo combinations in the stats:
1 atg 8131 30 0
2 aTg 1432 37 36057
3 ATG 336 4 213973
4 aTg 70 1 35067