171.64.65.54 - Reject [Emergency downtime]
Moderators: Site Moderators, FAHC Science Team
- Site Moderator
- Posts: 6986
- Joined: Wed Dec 23, 2009 9:33 am
- Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB
Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
- Location: Land Of The Long White Cloud
Re: 171.64.65.54 - Reject [Emergency downtime]
As long as the WU hasn't passed the Preferred Deadline, I would suggest that you let the Client handle the upload. If the WU has passed the Preferred Deadline, it will be reassigned regardless of whether you upload the wuresult. Thus you have two choices (sketched in code below):
A) Let it upload and get base points
B) Delete it and you will not be assigned any points. (This might count against the 80% return rate)
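A minimal sketch of that decision logic, assuming a hypothetical stuck_wu_options helper and a timezone-aware deadline; this is illustrative only, not actual FAH client code:

```python
from datetime import datetime, timezone

def stuck_wu_options(preferred_deadline: datetime) -> str:
    """Hypothetical helper: what to do with a WU whose upload is stuck.

    `preferred_deadline` is assumed to be timezone-aware (UTC).
    """
    now = datetime.now(timezone.utc)
    if now <= preferred_deadline:
        # Before the Preferred Deadline: the client retries the upload
        # on its own, and full credit is still possible.
        return "let the client handle the upload"
    # Past the Preferred Deadline the WU is reassigned either way, so:
    #   A) upload anyway -> base points only
    #   B) delete        -> no points, and it may count against the
    #                       donor's 80% return rate
    return "choose A (base points) or B (no points)"
```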
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: 171.64.65.54 - Reject [Emergency downtime]
MrNovi wrote: Any update about when someone is going to get off of their lazy butt and fix this?
Unfortunately, this isn't a "lazy butt" issue, but instead the machine is down due to problems with its RAID. We had people from the manufacturer on site today to fix it, but they didn't give an ETA.
However, I'm always looking to see what we can do to improve this sort of situation in the future. In this case, I have pushed more project managers to initiate upgrades to the new WS code (not a simple thing to do), which will give those projects access to a working CS and should solve this issue. The main issue is that these upgrades really require a whole new project and are best done on a clean server, so many of the next steps for WS upgrades are waiting for new hardware to come on line, which itself got slowed by other issues.
I'm not happy it's taking so long (and this post isn't meant to be an excuse for this problem), but I wanted to at least give you a better sense of what's going on behind the scenes and that we have a plan to address this sort of thing longer term.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
Re: 171.64.65.54 - Reject [Emergency downtime]
This issue has been happening for years and every time we hear the same old crap that there will be a change in the way the servers work so this doesn't happen in the future. Funny how after several years it still hasn't been implemented and we still go through this crap time and time again.
If the manufacturer is having that much of a problem maybe it's time to look at a different one as this one seems to be fairly useless to me.
- Posts: 10179
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
Re: 171.64.65.54 - Reject [Emergency downtime]
Where did Dr. Pande indicate the manufacturer was having a big problem fixing the RAID? He did not; you are assuming facts not in evidence. All he said was that they were there and did not give an ETA.
The lack of an ETA is not uncommon. Even if the fix only took 2 minutes, it would still take most of the day to rebuild a multi-terabyte array.
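For a rough sense of scale, here is a back-of-the-envelope estimate; the array size and rebuild rate below are assumptions for illustration, not the actual server's specs:

```python
# Assumed figures: a 4 TB array resyncing at a sustained 100 MB/s.
array_tb = 4
rebuild_mb_s = 100

# TB -> MB, then divide by the rebuild rate to get seconds.
seconds = array_tb * 1_000_000 / rebuild_mb_s
print(f"~{seconds / 3600:.0f} hours")  # ~11 hours with these numbers
```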
Collection Servers worked in the v4 software (server software, not the client). v5 fixed several other things but broke the CS option. Now they are in the process of upgrading to the v6 server software, where CSs work again.
It has NOT been the same old story all those years. You just didn't follow along with all the changes.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: 171.64.65.54 - Reject [Emergency downtime]
PS: The issue of a CS system that can scale is not an easy one. The main problem is that there have to be lots of checks to make sure people can't just cheat and return anything to the CS; toward that end, we replicate data CS-side, and the scaling of that replication is the challenging part. Part of our motivation for completely redoing our backend infrastructure was to get this finally resolved.
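As a rough illustration of why that replication is the hard part to scale, here are some invented throughput figures (none of these are FAH's real numbers):

```python
# All three inputs are assumptions for the example.
results_per_day = 500_000  # WU returns per day, project-wide
avg_result_mb = 5          # size of one wuresult, in MB
replicas = 3               # copies kept CS-side for cross-checking

daily_tb = results_per_day * avg_result_mb * replicas / 1_000_000
print(f"~{daily_tb:.1f} TB/day of replicated result data")  # ~7.5 TB/day
```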
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
Re: 171.64.65.54 - Reject [Emergency downtime]
MrNovi wrote: Any update about when someone is going to get off of their lazy butt and fix this? I have finished WU's backed up that won't upload. I don't want to waste my time folding WU's that won't be uploaded.
I doubt that talking like that will get you in good graces with the people who have to do the work.
I don't have any information other than what has already been shared here, but I did see the word "RAID" in a post, and I do know these servers have large RAID arrays. Even if parts are on hand or are not needed (and we don't know if that's true), it can easily take as much as 24 hours to re-sync the filesystem. Let's hope it doesn't need to do that.
Posting FAH's log:
How to provide enough info to get helpful support.
Re: 171.64.65.54 - Reject [Emergency downtime]
I really don't give a darn if it gets me in anyone's good graces. I just want to see RESULTS instead of broken promises for once.
Re: 171.64.65.54 - Reject [Emergency downtime]
MrNovi wrote: I really don't give a darn if it gets me in anyone's good graces. I just want to see RESULTS instead of broken promises for once.
Really? Since when has Stanford promised that you'll never have any issues with sending or receiving work units?
The fact of the matter is, they are committed to continually evolving the server environment, but they're doing so against a moving target. They're continually improving their systems, but that doesn't mean they'll never have an issue, such as a RAID array falling over, or a design that creates an unforeseen bug, perhaps purely due to scaling to an unforeseen size...
I would like to say thanks to Vijay for coming on here and keeping us up to date with the behind-the-scenes work that is going on. I am in awe of, and appreciate, the level of openness shown in discussing network/server issues.
Re: 171.64.65.54 - Reject [Emergency downtime]
Well, I'm giving my 24/7 boxes a rest. I've powered them off until this resolves (they could use a break). No sense in wasting electricity. I'm checking this thread for progress reports with my day-to-day computer. I hope my WUs upload when I power my 24/7's back up, but if they don't, I'll chalk it up to common misfortune and start again.
Re: 171.64.65.54 - Reject [Emergency downtime]
I might also add to my previous post that it's not just us folders who want to get these results back quickly and grab the next lot; it's also in Stanford's interest. So I can imagine they're doing everything they can to get things running again. It sure sounds like it.
Vijay, I appreciate the scarcity of your time, but are these new servers coming on line part of the last big upgrade of servers that was talked about (cannot find when - possibly way back in September 09?) or is this part of an ongoing incremental upgrade cycle?
Re: 171.64.65.54 - Reject [Emergency downtime]
I have 2 units waiting to upload from 10 am GMT yesterday. Both PCs have picked up and are working on other units, one of which has been completed and uploaded. I have 3 others working which fortunately seem not to be using this server, so all is not lost. It's just a pain in the posterior when you've done the work and don't get credited with max points, especially with the price of electricity in the UK.
Re: 171.64.65.54 - Reject [Emergency downtime]
Keep the faith; after all, it's only been roughly 24 hours.
Re: 171.64.65.54 - Reject [Emergency downtime]
Unfortunately 24 hours on WU's with a preferred deadline of ~65 hours is a considerable chunk of time. We understand that these things happen and that it's being dealt with as quickly as possible... it's just a little extra frustrating due to the bonus system. I'm sure it's even more frustrating for Stanford/Prof. Pande, so we should probably all keep that in mind.
Side effect of distributed computing -- distributed and highly parallel frustration when things go wrong
- Posts: 11
- Joined: Sun Mar 06, 2011 11:19 pm
Re: 171.64.65.54 - Reject [Emergency downtime]
If we wanted to host servers for F@H, is that possible? Do you need more servers? Just a question, because I have a few servers not doing anything.
Re: 171.64.65.54 - Reject [Emergency downtime]
We appreciate donors' frustration with this server downtime (and trust me, it's mutual--we very much want this back up). Service personnel were onsite, and work on the server is continuing.
As Dr. Pande mentions, we are making several higher-level changes to try to improve redundancy for work unit returns. That said, it's also good to consider the overall uptime of the FAH servers: if we have ~6 SMP servers and one of them goes offline for a day every ~3 months, that's roughly 0.2% downtime (worked out below). Obviously, we'd still like to do better than that.
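Those numbers worked out, using exactly the post's own assumptions:

```python
servers = 6       # ~6 SMP servers
window_days = 90  # "~3 months"
down_days = 1     # one server offline for one day

downtime = down_days / (servers * window_days)
print(f"{downtime:.2%} downtime")  # 0.19%, i.e. roughly 0.2%
```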
I do have some good news--the RAID is back up at the moment. I don't have a final status report from the hardware technicians, so it's possible that it may need to go back down later. But I've brought the FAH server up to try to clear the backlog of outstanding WU's.