Reading between the lines of the AWS outage – Why you should be worried Part 1

TL;DR: There are many very real problems with AWS that are being ignored.  #1 is that they appear to have no plan for dealing with their ongoing software bugs.

There has been a surprising amount of talk about the most recent AWS outage (Oct 22, 2012). In truth, I was busy coding and didn’t even hear about the outage until it was almost over (my company runs partially on AWS, but in another region). From what I read on the Amazon status site, the scope sounded pretty limited; I didn’t see it as a major event. But the talk since then says otherwise.

Amazon has now released their complete post-mortem, and in reading it I was struck by several hidden truths that I think many people will miss.  I was an early closed beta tester of AWS (when you basically needed a personal invite to get in) and did over 30 top-to-bottom builds of complete production app stacks while I was with RoundHouse.  So I hope to provide some additional insight into what makes the last few AWS outages especially interesting.

What you should really worry about Part 1 – The bugs aren’t getting better

Widespread issues occurred because of (by my reading) 6 different software bugs.  That’s a lot.  This fact can be spun as both a positive and a negative.   Here’s what we would assume are the positives:

  • The bugs can (and will) be fixed.  Once they are, they won’t happen again.
  • By virtue of AWS’s advancing age, they have “shaken out” more bugs than any of their lesser competitors.  Similar bugs surely lie dormant in every other cloud provider’s code as well.  No provider is immune and every single one will experience downtime because of it.  In this regard AWS is ahead of the game.

But this line of thinking fails to address the underlying negatives:

  • Any codebase is always growing & evolving, and that means new bugs.  The alarming part is that each outage seems to surface previously unknown bugs.  The rate at which Amazon is letting new bugs through seems disconcertingly high.  It does no good to fix the old ones if other areas of the service have quietly added twice as many new ones.  If things were really being “fixed,” we would expect new bugs to show up less often, not more often.  After a certain point, we have to start assuming that the demonstrated error rate will continue.
  • When bugs like these happen, they often can’t be seen coming.  Their impact and scope can’t be anticipated, but both are typically very large.
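To make the first point concrete, here is a minimal sketch of the reasoning.  The bug counts and the `bug_trend` function are hypothetical illustrations of mine, not Amazon’s actual numbers; the idea is simply that if bugs were truly being “shaken out,” the count of new bugs per outage should trend downward over time.

```python
# Illustrative sketch only: the counts below are hypothetical, not figures
# from any AWS post-mortem. The point is the shape of the argument, not the data.

def bug_trend(counts):
    """Compare average new-bug counts in the earlier vs. later half of a
    sequence of outages; return 'improving' only if the later half is lower."""
    mid = len(counts) // 2
    early = sum(counts[:mid]) / mid
    late = sum(counts[mid:]) / (len(counts) - mid)
    return "improving" if late < early else "not improving"

# Hypothetical new-bug counts from four successive outages, ending with
# the 6 distinct bugs this post-mortem describes.
print(bug_trend([2, 3, 4, 6]))  # prints "not improving"
```

A rising count, as in this made-up sequence, is exactly the pattern the bullet above describes: fixing old bugs buys nothing if each new outage reveals more of them than the last.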

So far, we have not heard of any comprehensive plan intended to address this issue.  Ironically, AWS is consistently dropping the ball on their ‘root-cause’ analysis by failing the “Five Whys” test.  They’re stopping at about the 2nd or 3rd why without ever reaching the true root cause.

In the case of AWS, anecdotal evidence suggests they have not yet succeeded in building a development process that is accountable for the bugs it produces.  They’re letting too many through.  Yes, these bugs are very difficult to detect and test for, but Amazon needs to hold itself to a higher standard or outages like this will continue to occur.