JC / Railbird

Failure Happens

Nothing to do with racing, but I’m fascinated by the story of last Tuesday’s massive power outage at a San Francisco data center that wiped out Typepad, Technorati, Yelp, Craigslist, and a bunch of other Web 2.0 sites for several hours. Turns out, the impossible happened:

In this incident, latent defects caused three generators to fail during start-up. No customers were affected until a fourth generator failed 30 seconds later, which overloaded the surviving backup system and caused power failures to 3 of 8 customer areas.
What’s most interesting is that the redundant design of the system is what caused it to fail so completely. The failure of the fourth generator should have only brought down one area instead of three. This kind of cascade failure is common in complex & tightly coupled systems. In my experience, these sorts of failure-modes are often identified and then promptly dismissed as being “nearly impossible.” Unfortunately, the impossible often becomes reality. (O’Reilly Radar)

Actually, I guess the story is racing-related: Horseplayers well understand the impossible becoming reality, upsetting best-laid wagers, and anyone who plays multi-race exotics knows all about the necessity and danger of redundant design …