plfah1-*.mskcc.org temporarily going down for maintenance

Moderators: Site Moderators, FAHC Science Team

JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

plfah1-*.mskcc.org temporarily going down for maintenance

Post by JohnChodera »

`plfah1-*.mskcc.org` is having some temporary RAID issues, so the work servers are being suspended for a few hours.

~ The Chodera lab
JimboPalmer
Posts: 2521
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JimboPalmer »

Thank you for notifying us!

(It is always comforting to know it is not something we did)
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JohnChodera »

Update: The RAID rebuild failed and the controller may be faulty, but we're trying to reseat the cabling first in case that fixes the issue. If not, we'll replace the controller and start rebuilding the RAID, bringing the server back online once the rebuild is complete.

We've heard sporadic reports that the WUs did not have any unaffected servers listed as collection servers (CSs), so we're coordinating with other FAH Consortium labs to add more offsite collection servers and keep disruption minimal if this ever happens again.

More updates soon. Again, our apologies for the downtime here.

The affected server IP range is 140.163.4.231-140.163.4.235
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JohnChodera »

UPDATE: Reseating the cables did not resolve the issue, so Dell is dispatching a technician and parts within 4 hours today to replace the RAID controller, drawer, and SAS chain cable.

More updates on ETA for restoration once the RAID has started rebuilding.

~ The Chodera Lab
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JohnChodera »

The hardware vendor apparently doesn't consider the chassis drawer to be covered by our 4-hour onsite warranty, so it is having a replacement drawer shipped. Unfortunately, this means the earliest we expect to be back online after the RAID rebuild is Thu 25 Jan.

Apologies again for the disruption, and I'll update if there is any new information in the meantime.

~ The Chodera lab
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JohnChodera »

Update: Dell is dispatching a tech with the replacement part this morning! Hopefully we will be back online sooner than planned!

~ The Chodera Lab
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JohnChodera »

UPDATE: Our awesome Open Systems Group and datacenter team now have the hardware replaced and the RAID is rebuilding, with an ETA for completion of 60+ hours.

~ The Chodera lab
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JohnChodera »

UPDATE: Estimates suggest approximately 40 hours remain for the RAID rebuild.
JimboPalmer
Posts: 2521
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JimboPalmer »

Dr Chodera,

The donors get error messages with IP addresses, while you have reported the downtime of a server by DNS name. Would it be possible to give us the IP address so we could use your estimated time to rebuild to address donor issues?

Yours

Jimbo
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
Joe_H
Site Admin
Posts: 8224
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Studio M1 Max 32 GB smp6
Mac Hack i7-7700K 48 GB smp4
Location: W. MA

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by Joe_H »

These WSs are not currently taking any connections, so I doubt there will be any reports for a bit. But if you look at the Server Status page, the ones that are down are IP addresses 140.163.4.231-235 and 140.163.4.241-245.
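For anyone matching donor error messages against these ranges, here is a minimal sketch in Python. It assumes the two ranges quoted in this post are the complete list, which may not hold as servers come back online; the names `AFFECTED_RANGES` and `is_affected` are hypothetical and not part of any FAH tool.

```python
import ipaddress

# Ranges as quoted in this thread (both endpoints inclusive).
# Assumption: this is not an official or current list.
AFFECTED_RANGES = [
    ("140.163.4.231", "140.163.4.235"),
    ("140.163.4.241", "140.163.4.245"),
]


def is_affected(ip: str) -> bool:
    """Return True if the IPv4 address falls inside any listed range."""
    addr = ipaddress.ip_address(ip)
    return any(
        ipaddress.ip_address(low) <= addr <= ipaddress.ip_address(high)
        for low, high in AFFECTED_RANGES
    )
```

With this, `is_affected("140.163.4.233")` is true and `is_affected("140.163.4.236")` is false, so an address copied from a client log can be checked quickly before reporting a new problem.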
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JohnChodera »

WSs 140.163.4.231-233 are back online! Thanks for your patience.
JohnChodera
Pande Group Member
Posts: 467
Joined: Fri Feb 22, 2013 9:59 pm

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by JohnChodera »

I'll be sure to report IP addresses in the subject line next time. Sorry for the hassle!
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: plfah1-*.mskcc.org temporarily going down for maintenance

Post by ChristianVirtual »

Thanks for the efforts, and greetings to the IT support team.
Please contribute your logs to http://ppd.fahmm.net