Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 12:21 am
by PantherX
As long as the WU hasn't passed the Preferred Deadline, I would suggest that you let the Client handle the upload. If the WU has passed the Preferred Deadline, then it will be reassigned, regardless of whether you upload the wuresult. Thus you have two choices:
A) Let it upload and get base points
B) Delete it and you will not be assigned any points. (This might count against the 80% return rate)

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 12:35 am
by VijayPande
MrNovi wrote:Any update about when someone is going to get off of their lazy butt and fix this?
Unfortunately, this isn't a "lazy butt" issue, but instead the machine is down due to problems with its RAID. We had people from the manufacturer on site today to fix it, but they didn't give an ETA.

However, I'm always looking to see what we can do to try to improve this sort of situation in the future. In this case, I have pushed more project managers to initiate upgrades to the new WS code (not a simple thing to do), which will give those projects access to a working CS, which should solve this issue. The main issue is that these upgrades really require a whole new project and are best done on a clean server, so many of the next steps for WS upgrades have been pushed back to wait for new hardware to come on line, which itself got slowed by other issues.

I'm not happy it's taking so long (and this post isn't meant to be an excuse for this problem), but I wanted to at least give you a better sense of what's going on behind the scenes and that we have a plan to address this sort of thing longer term.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 12:43 am
by MrNovi
This issue has been happening for years and every time we hear the same old crap that there will be a change in the way the servers work so this doesn't happen in the future. Funny how after several years it still hasn't been implemented and we still go through this crap time and time again.

If the manufacturer is having that much of a problem maybe it's time to look at a different one as this one seems to be fairly useless to me.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 1:10 am
by 7im
Where did Dr. Pande indicate the Mfr was having a big problem fixing the RAID? He did not. You wrongly assume facts not in evidence. All he said was they were there, and did not give an ETA.

The lack of an ETA is not uncommon. Even if the fix only took 2 minutes, it would still take most of the day to rebuild a multi-terabyte array.

Collection Servers worked in the v4 software (server software, not client). v5 fixed several other things, but broke the CS option. Now they are in the process of upgrading to the v6 server software, where CSs work again.

It has NOT been the same old story all those years. You just didn't follow along with all the changes. :roll:

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 1:18 am
by VijayPande
PS The issue of a CS system that can scale is not an easy one (the main issue there is that there have to be lots of checks to make sure people can't just cheat and return anything to the CS – to that end, we replicate data CS-side, and the scaling of that replication is the challenging part). Part of our motivation for completely redoing our backend infrastructure was to get this finally resolved.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 1:19 am
by bruce
MrNovi wrote:Any update about when someone is going to get off of their lazy butt and fix this? I have finished WU's backed up that won't upload. I don't want to waste my time folding WU's that won't be uploaded.
I doubt that talking like that will get you in good graces with the people who have to do the work.

I don't have any information other than what has already been shared here, but I did see the word "RAID" in a post, and I do know these servers have large RAID arrays. Even if parts are on-hand or are not needed (and we don't know if that's true), it can easily take as much as 24 hours to re-sync the filesystem. Let's hope it doesn't need to do that.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 1:30 am
by MrNovi
I really don't give a darn if it gets me in anyone's good graces. I just want to see RESULTS instead of broken promises for once.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 1:42 am
by k1wi
MrNovi wrote:I really don't give a darn if it gets me in anyone's good graces. I just want to see RESULTS instead of broken promises for once.
Really? Since when has Stanford promised that you'll never have any issues with sending or receiving work units?

The fact of the matter is, they are committed to continually evolving the server environment, but they are doing so against a moving target. They're continually improving their systems, but that doesn't mean they'll never have an issue, such as a RAID array falling over or a design that creates an unforeseen bug, perhaps purely due to scaling to an unforeseen size...

I would like to say thanks to Vijay for coming on here and keeping us up to date with the behind-the-scenes work that is going on. I am in awe of, and appreciate, the level of openness shown in discussing network/server issues.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 1:55 am
by World Control
Well, I'm giving my 24/7 boxes a rest. I've powered them off until this resolves (they could use a break). No sense in wasting electricity. I'm checking this thread for progress reports with my day-to-day computer. I hope my WUs upload when I power my 24/7's back up, but if they don't, I'll chalk it up to common misfortune and start again.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 2:50 am
by k1wi
I might also add to my previous post that it's not just us folders who want to get these results back quickly and grab the next lot; it's also in Stanford's interest. So I can imagine they're doing everything they can to get things running again. It sure sounds like it.

Vijay, I appreciate the scarcity of your time, but are these new servers coming on line part of the last big upgrade of servers that was talked about (cannot find when - possibly way back in September 09?) or is this part of an ongoing incremental upgrade cycle?

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 9:06 am
by shunter
I have 2 units waiting to upload from 10 am GMT yesterday. Both PCs have picked up and are working on other units, one of which has been completed and uploaded. I have 3 others working which fortunately seem not to be using this server, so all is not lost. It's just a pain in the posterior when you've done the work and don't get credited with max points, especially with the price of electricity in the UK.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 10:49 am
by Geoffric
Keep the Faith, after all it's only been roughly 24 hours.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 11:00 am
by AlphaWolf
Unfortunately 24 hours on WU's with a preferred deadline of ~65 hours is a considerable chunk of time. We understand that these things happen and that it's being dealt with as quickly as possible... it's just a little extra frustrating due to the bonus system. I'm sure it's even more frustrating for Stanford/Prof. Pande, so we should probably all keep that in mind.

Side effect of distributed computing -- distributed and highly parallel frustration when things go wrong ;)

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 1:52 pm
by Joshua_Mahr
If we wanted to host servers for F@H, is that possible? Do you need more servers? Just a question, because I have a few servers not doing anything.

Re: 171.64.65.54 - Reject [Emergency downtime]

Posted: Wed May 11, 2011 2:16 pm
by kasson
We appreciate donors' frustration with this server downtime (and trust me, it's mutual--we very much want this back up). Service personnel were onsite, and work on the server is continuing.
As Dr. Pande mentions, we are doing several higher-level changes to try to improve redundancy for work unit returns. That said, it's also good to consider the overall uptime of the FAH servers. If we have ~6 SMP servers and one of them goes offline for a day every ~3 months, that's roughly 0.2% downtime across those servers. Obviously, we'd like to do better than that still.
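The aggregate downtime figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch under stated assumptions: "~3 months" is taken as 90 days, every server-day is weighted equally, and the function name is made up for illustration. With those inputs the fleet-wide figure comes out a little under 0.2%, while the one affected server sees a bit over 1%.

```python
def aggregate_downtime_percent(servers: int, outage_days: float, period_days: float) -> float:
    """Percentage of total server-days lost to an outage (hypothetical helper)."""
    total_server_days = servers * period_days
    return 100.0 * outage_days / total_server_days


# Assumed inputs: 6 SMP servers, one 1-day outage per ~3 months (taken as 90 days).
fleet = aggregate_downtime_percent(servers=6, outage_days=1, period_days=90)
affected = aggregate_downtime_percent(servers=1, outage_days=1, period_days=90)

print(f"fleet-wide downtime:      {fleet:.2f}%")     # 0.19%
print(f"affected server downtime: {affected:.2f}%")  # 1.11%
```

The fleet-wide number reflects how often a randomly assigned WU would hit a down server; the per-server number is what a donor whose WUs all belong to that one server experiences.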

I do have some good news--the RAID is back up at the moment. I don't have a final status report from the hardware technicians, so it's possible that it may need to go back down later. But I've brought the FAH server up to try to clear the backlog of outstanding WU's.