Blame Murphy's Law, Excessive Hubris For Blackouts In SF
from the or-blame-the-PR-people,-your-choice dept
Lots of San Francisco-based web sites (Craigslist, Six Apart, Yelp, and Technorati, among others) have been experiencing some problems today, after power outages in the city took down 365 Main, a major hosting facility there. While the power company says it doesn't yet know what's caused the outages, we have a pretty good idea: a 365 Main press release that went out this morning, bragging about the two years of continuous uptime one of its customers has had since moving to the data center. And when these guys invoke Murphy's Law, they don't do it by half, either. The release also brags about the center's "unique billing system in which 365 Main only charges customers for the exact amount of power that is used" -- so presumably today will be free. But what really sealed their fate was this paragraph:
"To ensure uptime for key tenants such as RedEnvelope, 365 Main provides modern power and cooling infrastructure. The company's San Francisco facility includes two complete back-up systems for electrical power to protect against a power loss. In the unlikely event of a cut to a primary power feed, the state-of-the-art electrical system instantly switches to live back-up generators, avoiding costly downtime for tenants and keeping the data center continuously running."
Good to see those backup systems are working!


Reader Comments
Drunk and disorderly... ?
It hit digg a little while ago:
http://valleywag.com/tech/breakdowns/a-drunk-employee-kills-all-of-the-websites-you-care-about-282021.php
In that article it says that a "shitfaced drunk" employee did the damage to 40+ racks of equipment.
I dunno, but I think I'll take that article with a grain of salt for the time being.
(reply to this comment) (link to this comment)
Rats
I once worked for Motorola's Iridium project, which had spent millions of dollars on redundant power backup systems at their controlling facility. However, it was still not immune to the stray rat who bit into a power cable -- that would have taken the facility down for days. The facility was near the Potomac River, so there were lots of rats.
Russia's Baikonur Cosmodrome also had a serious problem with rats chewing cables at the time, and they dealt with the problem by keeping cats.
(reply to this comment) (link to this comment)
Murphy's Law - a correction
Murphy's Law is commonly stated: "Whatever can go wrong will go wrong." That is incorrect. It should be: "Whatever can go wrong may go wrong." Of course, if you persist in walking through mine fields...
Quoted from The Signature of God by John Dalmas
(reply to this comment) (link to this comment)
Could be related. =)
30K Without Power In SF
(reply to this comment) (link to this comment)
Bragging like that is almost as stupid as...
Bragging like that is almost as stupid as saying that a ship is "unsinkable". I can think of one such case where such a statement blew up in the bragger's face and royally pwned them.
Moral: God doesn't like braggers.
BMR777
(reply to this comment) (link to this comment)
"Excessive Hubris"... seems a bit redundant I'd say.
(reply to this comment) (link to this comment)
Re: Excessive Hubris
(reply to this comment) (link to this comment)
Hrmm, that's why I can't get on Craigslist. Still seems to be running slow and timing out now.
(reply to this comment) (link to this comment)
Craigslist is running slow and timing out because the server is now Craig's Mac that was in his closet.
(reply to this comment) (link to this comment)
Re:
Thank God it's a Mac; if it were IIS, it would just be timing out.
(reply to this comment) (link to this comment)
Duh...
So the servers were running on backup power. Big whoop. All of the switching and other important nodes were down all around the servers. No data IO at all. Doesn't really help to have backups unless the entire network has backups.
(reply to this comment) (link to this comment)
That wasn't Murphy's Law...
It was The Un-Speakable Law !!!
(reply to this comment) (link to this comment)
Imagine all the damage done to the environment due to the do-gooders in San Fran Sicko using their A/C.
Excessive BS is more like it
(reply to this comment) (link to this comment)
well duh!
(emphasis mine)
The power goes out and they trust electrical systems to switch to backup generators?
No wonder it went wrong. :-)
(reply to this comment) (link to this comment)
Re: well duh!
I'm pretty sure that whatever was doing the monitoring would have been powered by the backup, not by the live feed.
(reply to this comment) (link to this comment)
Re: well duh!
You power them off a UPS and use that to manage the switchover (humans just aren't fast or reliable enough).
Something clearly went wrong, however. I had a client once who had a similar failure: the batteries switched over to the generator, which started just fine, and all was OK, but a few minutes later the generator stopped.
Cause: whichever plank had installed the generator had wired up its electric diesel pump to the mains only, instead of to the generator's own electrical output... it drank itself dry ;0)
(reply to this comment) (link to this comment)
Murphy was an optimist
Whatever can go wrong, won't. Whatever can't go wrong, will.
(reply to this comment) (link to this comment)
Asking for it.
Dude, this was like the ultimate Murphy call. Couldn't their marketing guys have just lubed the company up, presented its backside, and shouted out "**** me Murphy! **** me long and hard!"
(reply to this comment) (link to this comment)
No mirroring?
If these major web-based companies are not mirroring and using simple disaster recovery by having dual facilities in major metro areas, they should be considering it now. They should be ashamed, and now they realize the cost of mirroring is a mere pittance compared to the embarrassment.
(reply to this comment) (link to this comment)
well, the backup can work all it wants to... the rest of the surrounding infrastructure has to work too, and if the rest of the city block is dead, they're still dead and aren't going anywhere.
(reply to this comment) (link to this comment)
backups?
It happens to the best of us. Why didn't 365 Main plan for this kind of problem? Google, Microsoft, Yahoo, all have enough redundancy in their systems to prevent this kind of thing happening. What if their drives got toasted? What would have happened to their data? They'd be spending a lot of money for data recovery, that's what. :)
(reply to this comment) (link to this comment)
Uhh... they probably run huge diesel gensets that are capable of putting out megawatts of power. There is no need for offsite mirrors. The diesels can be at peak load in under 1 minute. A UPS can hold you over for the time in between. Also, most facilities have automagical transfer switches. Proper maintenance of the gensets (that means running them under load banks, and having qualified people come in to do the oil and other fluids) will almost guarantee that your facility will be online in minutes rather than hours.
(reply to this comment) (link to this comment)
you think you're so smart...
My DC has redundant feeds from the street, redundant pipes, racks and racks of backup batteries. Didn't do a bit of good when a huge spike came in off the street and *vaporized* the emergency switching gear. I'm not kidding.
No batteries, no redundant grid, no amount of testing will guarantee a no-impact failover. I've seen multiple outages at multiple sites over 20 years. The answer is IT DEPENDS. With huge power feeds it gets complicated fast.
It is incredibly ironic that the PR folks put out a release like that the same day... but really, most hosting sites say the same kinds of things. You can take cheap shots if you want, doesn't mean you know anything. I often enjoy the hubris of the media!
(reply to this comment) (link to this comment)