Yes, Stanford does have problems. The people running the project are underfunded, understaffed, and overworked students and faculty. They are not technology professionals (well, most aren't). As Bruce indicated, the project usually has many servers to hand out and collect work, so individual server uptime is not a critical issue. And even when that doesn't work, the FAH client is designed to cache the completed work unit, download new work from a different server, and keep processing until a solution is found. Even then, a FAH data packet does not get the same priority as a banking transaction data packet. In real life, the bank is the priority. (To me, the FAH data packet is more important, because I'm guaranteed to die someday, but never guaranteed to have a lot of money.)
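For anyone curious what that cache-and-fall-back behavior looks like, here is a rough sketch in Python. To be clear, this is not the actual FAH client code. The `Client` class, `ServerDown`, `fetch_work`, and `upload_result` are all made-up names for illustration, and the real client surely does a lot more.

```python
from collections import deque


class ServerDown(Exception):
    """Raised by the (hypothetical) transport functions when a server is unreachable."""


def try_servers(servers, action):
    """Call action(server) on each server in order.

    Returns the first successful result, or None if every server is down.
    """
    for server in servers:
        try:
            return action(server)
        except ServerDown:
            continue
    return None


class Client:
    """Toy client: caches finished work and keeps going when a server is down."""

    def __init__(self, servers, fetch_work, upload_result):
        self.servers = servers
        self.fetch_work = fetch_work        # hypothetical: fetch_work(server) -> work unit
        self.upload_result = upload_result  # hypothetical: upload_result(server, result) -> True
        self.pending = deque()              # finished results waiting to be uploaded

    def flush_pending(self):
        # Retry cached results; stop at the first one that no server accepts.
        while self.pending:
            result = self.pending[0]
            accepted = try_servers(
                self.servers, lambda s: self.upload_result(s, result)
            )
            if accepted is None:
                break
            self.pending.popleft()

    def step(self, compute):
        """Fetch one unit from any reachable server, process it, try to upload."""
        self.flush_pending()
        unit = try_servers(self.servers, self.fetch_work)
        if unit is None:
            return False  # nothing reachable right now; try again later
        self.pending.append(compute(unit))
        self.flush_pending()
        return True
```

The point of the sketch is that a single dead server costs nothing but a retry: finished work sits in `pending` until some server takes it, and new work comes from whichever server answers first.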
Brainstorm...
1. Huge donation of IBM servers, IBM service, or cash, or all three.
2. Patience while Stanford finishes installing, converting over to, and optimizing the new servers that $100,000 purchased for them last year.
3. Patience while the client and server code is rewritten from the ground up to be more up to date, reliable, and easier to maintain. (The 2nd half of #1 could help speed this part along.) The V7 client may help, but the new server code is a much bigger job.
4. Patience while Pande Group gets its new crop of researchers up to speed, and expands the research and the location of FAH servers to several other universities. (Co-location can be a great uptime helper, as long as the sites are well managed and well integrated.)
5. And if no patience is available, then tolerance is an acceptable alternative.
Any additions to what's already taking place?
P.S. The head of the project addressed a similar issue... and I hope he doesn't mind if I repost it...
VijayPande wrote: One has to put this all in perspective. Supercomputer centers have 10x to 100x the budget we have for operations and still are often down over the weekends for much longer than FAH is when something unexpected comes up. We have some very dedicated people in our team -- people willing to do fixes on weekends and holidays -- but they do have to sleep. Also, running a FAH server is not like running apache (it is a lot more complex and people aren't familiar with it), so hiring a 3rd party firm to manage off hours wouldn't work (or would be very, very expensive).
So, if you see a problem that isn't being fixed and it's in between 10:30pm and 7:30am pacific time, odds are it will have to wait until 8:30am pacific time or so for someone to deal with it. We've built a lot of redundancy into FAH operations, but there are limits to this too, especially in very early beta projects like GPU3. Hopefully with this in mind, people can have a better sense of when fixes can be made, and how hard we work to fix them as quickly as possible.