Last month I published our first blog post holding us publicly accountable for our Availability Service Level Agreement (SLA). Today I'm going to discuss our May 2010 performance.
To begin with, you can see that May was our worst-performing month since the beginning of the year. While we still safely delivered above our 99.500% availability target, another 37 minutes of outage and we wouldn't have. So what happened? A blue moon struck twice.
We use two different SAS 70 Type II hosting providers, Hosted Solutions and Peak10, for our data center needs. The former hosts all of our core applications and the latter our clients' corporate websites, product catalogs and content. In the month of May, both of them lost the use of their backup Power Distribution Units, which ultimately interrupted our service! Those unplanned outages alone cost us close to 80 minutes of downtime. Add our planned maintenance outages to the calculation and you come up with 179 minutes of website unavailability.
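For anyone who wants to check the math, here is a back-of-the-envelope sketch of how downtime minutes translate into an availability percentage. Note this simple whole-month model is an illustration; our contractual measurement windows and exclusions may differ slightly from it.

```python
# Back-of-the-envelope availability math for May 2010 (31 days).
MINUTES_IN_MAY = 31 * 24 * 60      # 44,640 minutes in the month
DOWNTIME_MINUTES = 179             # planned + unplanned, from the figures above
SLA_TARGET = 99.500                # contractual availability target, in percent

availability = 100.0 * (MINUTES_IN_MAY - DOWNTIME_MINUTES) / MINUTES_IN_MAY
allowed_downtime = MINUTES_IN_MAY * (1 - SLA_TARGET / 100.0)

print(f"Measured availability: {availability:.3f}%")            # ~99.599%
print(f"Allowed downtime at {SLA_TARGET}%: {allowed_downtime:.1f} minutes")
```

Even in our worst month, 179 minutes of unavailability still leaves us above the 99.500% line, but uncomfortably close to the allowed-downtime budget.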
The good news is that in all cases, the unavailability came on Sunday mornings. While we never register ZERO traffic to our service, there is no doubt that Sunday mornings aren’t the most popular times to place and track orders.
To be sure, we had plenty of conversation during and after these events. Again, while we're still within our contractual obligations, we're not happy about how close we came to crossing the line. Here is what we learned in May and what we're going to do about it:
- Simplify our reporting. There are only two numbers on this month's chart: our target SLA availability (red horizontal line) and our actual availability (blue bars). I figure the simpler the numbers, the easier it will be to stay focused on the right things.
- No backup for the backup. While we could put in place backup power supplies to back up our data center's backup power supplies, that seemed a little crazy. This is the first power-related event we've had in the past 2.5 years, and such events are very rare. For now, this is a risk we're willing to take and no investments will be made. However, we will be monitoring our hosting providers closely :-).
- We've improved our communication plans (and unfortunately got to execute them in May) for outages longer than 15 minutes. We will be communicating with all of our clients every 15 minutes, whether we know something new or not. In the case of a prolonged outage, we feel as if there is no such thing as over-communicating.
- We're going to evolve towards zero-downtime maintenance windows. Most of our releases can be accomplished without ever taking down our clients' production servers. However, there are a few cases in which that's impossible…currently. We're going to develop the systems and procedures that allow us to perform all of our application maintenance without a disruption of service. That's going to take a while to accomplish, but the goal is now in place.
- Decrease recovery times. While we can’t prevent “stuff” from happening, our job is to mitigate the resultant negative impact on our clients and their customers. To that end, we’re going to be revisiting our disaster recovery plans and making sure that we’re comfortable with their design and our ability to execute them. Maybe I’ll even highlight one focus area that we’ve “re-visited” every month to illuminate our efforts.
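The zero-downtime maintenance goal above usually comes down to some form of rolling update: take one server out of rotation, upgrade it, verify it's healthy, and put it back before touching the next. The sketch below is a hypothetical illustration of that pattern, not our actual tooling; the `LoadBalancer` class, server names, and the `upgrade`/`healthy` hooks are all stand-ins for whatever real infrastructure would be used.

```python
# Hypothetical rolling-update sketch: upgrade servers one at a time so the
# pool as a whole never stops serving traffic.
class LoadBalancer:
    """Stand-in for a real traffic-management API."""
    def __init__(self, servers):
        self.in_rotation = set(servers)

    def drain(self, server):
        self.in_rotation.discard(server)   # stop routing new traffic here

    def restore(self, server):
        self.in_rotation.add(server)       # resume routing traffic here


def rolling_update(lb, servers, upgrade, healthy):
    """Upgrade each server in turn, keeping the rest in rotation."""
    for server in servers:
        lb.drain(server)
        upgrade(server)                    # apply the maintenance release
        if not healthy(server):            # verify before rejoining the pool
            raise RuntimeError(f"{server} failed health check; halting rollout")
        lb.restore(server)


# Example run with stub upgrade and health-check functions:
servers = ["app1", "app2", "app3"]
lb = LoadBalancer(servers)
rolling_update(lb, servers, upgrade=lambda s: None, healthy=lambda s: True)
print(sorted(lb.in_rotation))  # all three servers end up back in rotation
```

The key property is that at most one server is ever out of rotation, and a failed health check halts the rollout before a bad release spreads.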
That's it for May. Our fifth consecutive month (in 2010) with better-than-promised availability :-).