171.64.65.54 - Reject [Emergency downtime]

Moderators: Site Moderators, FAHC Science Team

PantherX
Site Moderator
Posts: 6986
Joined: Wed Dec 23, 2009 9:33 am
Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB

Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
Location: Land Of The Long White Cloud

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by PantherX »

As long as the WU hasn't passed the Preferred Deadline, I would suggest that you let the Client handle the upload. If the WU has passed the Preferred Deadline, then it will be reassigned, regardless of whether you upload the wuresult or not. In that case you have two choices:
A) Let it upload and get base points
B) Delete it and you will not be assigned any points. (This might count against the 80% return rate)
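
The choice above can be sketched as a small decision function. This is only an illustration of the advice in this post: the function name, the returned strings, and the example timestamps are hypothetical and are not part of the F@H client.

```python
from datetime import datetime


def upload_advice(now: datetime, preferred_deadline: datetime) -> str:
    """Summarize the advice above (hypothetical helper, not client code)."""
    if now <= preferred_deadline:
        # Deadline not passed yet: no manual action is suggested.
        return "wait: let the client handle the upload"
    # Past the Preferred Deadline the WU is reassigned either way,
    # so the donor picks between base points and no points.
    return "choose: (A) upload for base points, or (B) delete for no points"


# Example with made-up timestamps.
deadline = datetime(2011, 6, 1, 12, 0)
print(upload_advice(datetime(2011, 5, 31, 12, 0), deadline))
print(upload_advice(datetime(2011, 6, 2, 12, 0), deadline))
```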
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time

VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by VijayPande »

MrNovi wrote:Any update about when someone is going to get off of their lazy butt and fix this?
Unfortunately, this isn't a "lazy butt" issue, but instead the machine is down due to problems with its RAID. We had people from the manufacturer on site today to fix it, but they didn't give an ETA.

However, I'm always looking to see what we can do to improve this sort of situation in the future. In this case, I have pushed more project managers to initiate upgrades to the new WS code (not a simple thing to do), which will give those projects access to a working CS, which should solve this issue. The main issue is that these upgrades really require a whole new project and are best done on a clean server, so many of the next steps for WS upgrades have been waiting for new hardware to come on line, which itself got slowed by other issues.

I'm not happy it's taking so long (and this post isn't meant to be an excuse for this problem), but I wanted to at least give you a better sense of what's going on behind the scenes and that we have a plan to address this sort of thing longer term.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
MrNovi
Posts: 18
Joined: Wed Mar 19, 2008 8:37 pm

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by MrNovi »

This issue has been happening for years and every time we hear the same old crap that there will be a change in the way the servers work so this doesn't happen in the future. Funny how after several years it still hasn't been implemented and we still go through this crap time and time again.

If the manufacturer is having that much of a problem maybe it's time to look at a different one as this one seems to be fairly useless to me.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by 7im »

Where did Dr. Pande indicate the Mfr was having a big problem fixing the RAID? He did not. You wrongly assume facts not in evidence. All he said was they were there, and did not give an ETA.

The lack of an ETA is not uncommon. Even if the fix only took 2 minutes, it would still take most of the day to rebuild a multi-terabyte array.

Collection Servers worked in the v4 software (server software, not client). v5 fixed several other things, but broke the CS option. Now they are in the process of upgrading to the v6 server software, where CSs work again.

It has NOT been the same old story all those years. You just didn't follow along with all the changes. :roll:
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by VijayPande »

PS The issue of a CS system that can scale is not an easy one (the main issue there is that there have to be lots of checks to make sure people can't just cheat and return anything to the CS; to that end, we replicate data CS-side, and the scaling of that replication is the challenging part). Part of our motivation for completely redoing our backend infrastructure was to get this finally resolved.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by bruce »

MrNovi wrote:Any update about when someone is going to get off of their lazy butt and fix this? I have finished WU's backed up that won't upload. I don't want to waste my time folding WU's that won't be uploaded.
I doubt that talking like that will get you in good graces with the people who have to do the work.

I don't have any information other than what has already been shared here, but I did see the word "RAID" in a post, and I do know these servers have large RAID arrays. Even if parts are on-hand or are not needed (and we don't know if that's true), it can easily take as much as 24 hours to re-sync the filesystem. Let's hope it doesn't need to do that.
MrNovi
Posts: 18
Joined: Wed Mar 19, 2008 8:37 pm

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by MrNovi »

I really don't give a darn if it gets me in anyone's good graces. I just want to see RESULTS instead of broken promises for once.
k1wi
Posts: 909
Joined: Tue Sep 22, 2009 10:48 pm

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by k1wi »

MrNovi wrote:I really don't give a darn if it gets me in anyone's good graces. I just want to see RESULTS instead of broken promises for once.
Really? Since when has Stanford promised that you'll never have any issues with sending or receiving work units?

The fact of the matter is, they are committed to continually evolving the server environment, but they are doing so with a moving target. They're continually improving their systems, but that doesn't mean they'll never have an issue, such as a RAID array falling over or a design that creates an unforeseen bug, perhaps purely due to scaling to an unforeseen size...

I would like to say thanks to Vijay for coming on here and keeping us up to date with the behind-the-scenes work that is going on. I am in awe of, and appreciate, the level of openness shown in discussing network/server issues.
World Control

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by World Control »

Well, I'm giving my 24/7 boxes a rest. I've powered them off until this resolves (they could use a break). No sense in wasting electricity. I'm checking this thread for progress reports with my day-to-day computer. I hope my WUs upload when I power my 24/7's back up, but if they don't, I'll chalk it up to common misfortune and start again.
k1wi
Posts: 909
Joined: Tue Sep 22, 2009 10:48 pm

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by k1wi »

I might also add to my previous post that it's not just us folders who want to get these results back quickly and grab the next lot; it's also in Stanford's interest. So I can imagine they're doing everything they can to get things running again. It sure sounds like it.

Vijay, I appreciate the scarcity of your time, but are these new servers coming on line part of the last big upgrade of servers that was talked about (cannot find when - possibly way back in September 09?) or is this part of an ongoing incremental upgrade cycle?
shunter
Posts: 84
Joined: Sun Apr 06, 2008 8:22 am
Location: Hertfordshire, United Kingdom

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by shunter »

I have 2 units that have been waiting to upload since 10 am GMT yesterday. Both PCs have picked up other units and are working on them, one of which has been completed and uploaded. I have 3 others working which fortunately seem not to be using this server, so all is not lost. It's just a pain in the posterior when you've done the work and don't get credited with max points, especially with the price of electricity in the UK.
Geoffric
Posts: 25
Joined: Sat Feb 12, 2011 7:30 am

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by Geoffric »

Keep the Faith, after all it's only been roughly 24 hours.
AlphaWolf
Posts: 13
Joined: Tue May 10, 2011 2:53 pm

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by AlphaWolf »

Unfortunately 24 hours on WU's with a preferred deadline of ~65 hours is a considerable chunk of time. We understand that these things happen and that it's being dealt with as quickly as possible... it's just a little extra frustrating due to the bonus system. I'm sure it's even more frustrating for Stanford/Prof. Pande, so we should probably all keep that in mind.

Side effect of distributed computing -- distributed and highly parallel frustration when things go wrong ;)
Joshua_Mahr
Posts: 11
Joined: Sun Mar 06, 2011 11:19 pm

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by Joshua_Mahr »

If we wanted to host servers for F@H, is that possible? Do you need more servers? Just a question, because I have a few servers not doing anything.
Main Rig : i7-2600k @5.0Ghz/Maximus 4 Extreme/GSkill DDR3-1600/ASUS GTX580 CUII/Vertex 2/HX850/NH-D14/Dragon Rider
Folder Rig : i7-2600k @4.75/ASUS P8P67 Pro/GSkill DDR3-2166/ASUS GTX570 x2 SLi/Vertex 2/HX850/H70/Haf X

kasson
Pande Group Member
Posts: 1459
Joined: Thu Nov 29, 2007 9:37 pm

Re: 171.64.65.54 - Reject [Emergency downtime]

Post by kasson »

We appreciate donors' frustration with this server downtime (and trust me, it's mutual--we very much want this back up). Service personnel were onsite, and work on the server is continuing.
As Dr. Pande mentions, we are making several higher-level changes to try to improve redundancy for work unit returns. That said, it's also good to consider the overall uptime of the FAH servers. If we have ~6 SMP servers and one of them goes offline for a day every ~3 months, that's about 0.2% downtime. Obviously, we'd still like to do better than that.
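
A quick check of this estimate, as a rough sketch: it assumes a 90-day quarter and counts downtime as lost server-days divided by total server-days across the pool. Both are assumptions of this example, and the function name is hypothetical. With these inputs the fraction comes out just under 0.2%.

```python
def downtime_fraction(servers: int, outage_days: float, period_days: float) -> float:
    """Fraction of total server-time lost when one server is down for outage_days."""
    total_server_days = servers * period_days
    return outage_days / total_server_days


# ~6 SMP servers, one of them offline for one day in roughly three months.
fraction = downtime_fraction(servers=6, outage_days=1, period_days=90)
print(f"{fraction:.3%}")
```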

I do have some good news--the RAID is back up at the moment. I don't have a final status report from the hardware technicians, so it's possible that it may need to go back down later. But I've brought the FAH server up to try to clear the backlog of outstanding WU's.