I really don't ever want to hear those words. Earlier this year, my then-favorite SaaS provider replaced their normal login page with the following message to their website's anxious visitors (including me and several of my consulting clients):
What a nightmare! I'm not exactly sure how long that outage lasted, but while it might only have been a few hours, it felt like weeks to me! I had projects to manage and clients to deal with. From b2b2dot0's perspective, we have orders to take, and those orders mean real revenue for our clients. It's against that backdrop that we've been building our operations infrastructure. So call me paranoid.
We've architected our production infrastructure to be far more fault tolerant and recoverable than our 99.5% SLA (about 3.6 hours of allowed downtime per month) requires.
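As a quick sanity check on that SLA math, here's a minimal sketch (the function name is mine, and it assumes a 30-day month for simplicity):

```python
# Downtime budget implied by an availability SLA, assuming a
# 30-day (720-hour) month. Names here are illustrative only.
def downtime_budget_hours(sla_percent, hours_in_period=720.0):
    """Hours of downtime permitted per period at the given SLA."""
    return hours_in_period * (1.0 - sla_percent / 100.0)

# A 99.5% SLA leaves about 3.6 hours of downtime per month.
print(downtime_budget_hours(99.5))
```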
We don't have any single points of failure, unless of course you count the fact that we don't have a backup data center (yet). But we do have redundant firewalls, network switches, physical servers, web servers, application servers, and databases. I drew the line at buying a redundant power feed to our rack: if our hosting provider, Hosted Solutions, has a problem bringing power to my rack, I expect I'll be getting a phone call anyway.
But just because we have a robust infrastructure in place doesn't mean it will perform as intended in a time of need. That's why we subject ourselves to periodic fire drills. While everyone else was in a Turkey Day coma this past weekend, we staged our first full-blown disaster recovery exercise.
So with an Excel spreadsheet in hand, we organized our test plan and proceeded to pull plugs, cables, and hard drives to simulate a few real-world disasters. Lest the world think that Agilists don't like documentation, here is a snapshot of what we used to guide our efforts:
The two test cases in this view of our test plan cover recovering from a network switch failure (the 37signals failure mode) and a physical server failure. The good news is that we passed both of those tests 🙂!
How did we do overall? Of the 19 planned scenarios, we passed 8, failed 3, and deferred 8 until next month's test, when we install some new hardware.
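The bookkeeping for a drill like this is simple enough to sketch; only the counts below come from our exercise, and the scenario statuses beyond them are illustrative:

```python
from collections import Counter

# Status of each of the 19 planned drill scenarios. The counts
# (8 passed, 3 failed, 8 deferred) match the actual exercise;
# the flat list layout is just for illustration.
statuses = ["pass"] * 8 + ["fail"] * 3 + ["deferred"] * 8

tally = Counter(statuses)
assert sum(tally.values()) == 19  # every planned scenario accounted for
print(dict(tally))  # {'pass': 8, 'fail': 3, 'deferred': 8}
```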
The failed tests uncovered some issues that we're addressing as we speak. While our production systems are operating normally (and have been for the past three months), those failures exposed some potential risks that we're going to manage better. After all, isn't that why you run these exercises? To verify that things work as intended and to uncover areas for improvement?
b2b2dot0 provides a mission-critical application to our clients, and we take that responsibility seriously!