Reading between the lines of the AWS outage – Why you should be worried Part 2

TL;DR: The real problem is that there isn’t enough spare capacity.  The us-east-1 region is too big, and during an outage there simply aren’t enough resources left to allow users to recover their sites.

In part 1 I discussed how the bug rate on AWS doesn’t seem to be getting better.  A few bugs aren’t necessarily a big deal.  For one, they’re expected given the trailblazing that AWS is doing and the incredibly hard & complex problems they’re solving.  This is a recipe for bad things to happen sometimes.  To account for this, the true selling point of AWS has always been “if something goes wrong, just get more instances from somewhere else and keep on running”.

In the early days (pre-EBS), good AWS architecture dictated that you had to be prepared for any instance to disappear at any time for any reason.  Had important data on it?  Better have four instances with copies of that data in different AZs & regions along with offsite backups, because your data could disappear at any time.  Done properly, you simply started a new instance to replace it, copied in your data, and went about your merry way.
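As a rough sketch of that replace-and-recover pattern, here is one way it can be scripted with the fog gem.  The credentials, AMI, flavor, region, and AZ below are placeholders, and the data-restore step is whatever your own backup strategy dictates.

require 'fog'

# Connect to a healthy region (placeholder credentials and region).
compute = Fog::Compute.new(
  :provider              => 'AWS',
  :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
  :region                => 'us-west-2'
)

# Launch a replacement instance in an unaffected AZ from your own baked AMI.
replacement = compute.servers.create(
  :image_id          => 'ami-abc123',
  :flavor_id         => 'm1.small',
  :availability_zone => 'us-west-2b'
)
replacement.wait_for { ready? }

# Then restore data onto it from your offsite copies (S3 pull, rsync, promoting
# a database replica, etc.) and swap it into service.
puts "Replacement #{replacement.id} is up at #{replacement.public_ip_address}"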

What has unfortunately happened is that nearly all customers are centralized in us-east-1.  This has many consequences for the architecture model described above.

Traffic Load

A very common thread in all of the us-east-1 outages over the last two years is that any time there is trouble, the API & management console become overloaded.  Every user will be trying to move and/or restore their services.  All at once.  And the API/console has shown itself to be extremely dependent on us-east-1.  PagerDuty went so far as to move to another region to de-correlate us-east-1 failures from their own failures.

Competition for Resources

Once again, by virtue of us-east-1 being the largest region, whenever there is an outage every customer will start trying to provision new capacity in other AZs.  But there is seldom enough capacity.  Inevitably in each outage there is an entry in the status updates that says “We’re adding more disks to expand EBS capacity”, or “We’re bringing more systems online to make more instances available”, and so forth.  You can’t really blame Amazon for this one: they can’t keep their current prices and still run the whole region at under 50% utilization.  But when lots of instances fail, or lots of disks fill up, or lots of IP addresses get allocated, there just aren’t enough left.

This is a painful side effect of forcing everyone to be centralized in the us-east-1 region.  us-west is split into us-west-1 & us-west-2 because those datacenters are too far apart to maintain the low-latency links needed to share a single regional designation.  us-east has a dozen or more datacenters, and because they are so close together, Amazon has been able to lump them all into a single ‘us-east-1’ instead of splitting them into a ‘us-east-1’ and a ‘us-east-2’.

But what happens when a bug affects multiple AZs in a region?  Suddenly, having all the AZs in a single region becomes a liability.  Too many people are affected at once and they have nowhere to go.  And all those organizations that have architected under the assumption that they can “just launch more instances somewhere else” are left with few options.

P.S. I know things are sounding a little negative, but stay tuned.  My goal here is first to identify the truly dangerous issues facing AWS, and then to describe the best ways to deal with them, as well as why I still think AWS is the absolute best cloud provider available.

Reading between the lines of the AWS outage – Why you should be worried Part 1

TL;DR: There are many very real problems with AWS that are being ignored.  #1 is that they appear to have no plan for dealing with their ongoing software bugs.

There has been a surprising amount of talk about the most recent AWS outage (Oct 22, 2012). In truth, I was busy coding and didn’t even hear about the outage until it was almost over (my company runs partially on AWS, but in another region). From what I read on the Amazon status site, the scope sounded pretty limited; I didn’t see it as a major event. But the talk since then says otherwise.

Amazon has now released their complete post-mortem, and in reading it I was struck by several hidden truths that I think many people will miss.  I was an early closed beta tester of AWS (back when you basically needed a personal invite to get in) and have done over 30 top-to-bottom builds of complete production app stacks while I was with RoundHouse.  So I hope to provide some additional insight into what makes the last few AWS outages especially interesting.

What you should really worry about Part 1 – The bugs aren’t getting better

Widespread issues occurred because of (by my reading) 6 different software bugs.  That’s a lot.  This fact can be spun as both a positive and a negative.  Here’s what we would assume to be the positives:

  • The bugs can (and will) be fixed.  Once they are, they won’t happen again.
  • By virtue of AWS’s advancing age, they have “shaken out” more bugs than any of their lesser competitors.  Similar bugs surely lie dormant in every other cloud provider’s code as well.  No provider is immune and every single one will experience downtime because of it.  In this regard AWS is ahead of the game.

But this line of thinking fails to address the underlying negatives:

  • Any codebase is always growing & evolving, and that means new bugs.  The alarming part is that each outage seems to uncover a fresh set of underlying bugs.  The rate at which Amazon lets new bugs through seems disconcertingly high.  It does no good to fix the old ones if other areas of the service have quietly added twice as many new ones.  If things were really being “fixed” we would expect new bugs to show up less often, not more often.  After a certain point, we have to start assuming that the demonstrated error rate will continue.
  • When bugs like this happen, they often can’t be seen coming.  The impact and scope can’t be anticipated, but they are typically very large.

So far, we have not heard of any sort of comprehensive plan intended to address this issue.  Ironically, AWS is consistently dropping the ball on their ‘root-cause’ analysis by failing the “Five Whys” test.  They’re stopping at about the 2nd or 3rd why without ever reaching the true root cause.

In the case of AWS, anecdotal evidence suggests they have not yet succeeded in building a development process that is accountable for the bugs it produces. They’re letting too many through.  Yes, these bugs are very difficult to detect and test for, but they need to hold themselves to a higher standard or outages like this will continue to occur.

Fixing Passenger error: PassengerLoggingAgent doesn’t exist

While doing a new install of Passenger & nginx, I ran into some strange errors:


2012/09/25 20:09:54 [alert] 2593#0: Unable to start the Phusion Passenger watchdog because it encountered the following error during startup: Unable to start the Phusion Passenger logging agent because its executable (/opt/ruby-enterprise-1.8.7-2011.12/lib/ruby/gems/1.8/gems/passenger-3.0.12/agents/PassengerLoggingAgent) doesn't exist. This probably means that your Phusion Passenger installation is broken or incomplete. Please reinstall Phusion Passenger (-1: Unknown error)

Our environment is using Ubuntu 12.04 LTS & Chef with the nginx::source & nginx::passenger_module recipes from the opscode cookbook. It turns out there were two root causes here that needed to be resolved:

  1. Even though the config explicitly specified version 3.0.12, the 3.0.17 passenger gem was also getting installed.  Some things were going to one place, some to another.
    • SOLUTION: I figured it’d just be easier to stick with the latest release, so I changed the setting to use 3.0.17 and uninstalled the old version.
  2. The PassengerLoggingAgent was not being installed (and was failing silently).

SOLUTION: It turned out that we were missing some libraries.  Building the passenger package manually showed the details:

root@w2s-web01:/opt/ruby-enterprise-1.8.7-2011.12/lib/ruby/gems/1.8/gems/passenger-3.0.17# rake nginx RELEASE=yes
g++ ext/common/LoggingAgent/Main.cpp -o agents/PassengerLoggingAgent -Iext -Iext/common -Iext/libev -D_REENTRANT -I/usr/local/include -DHASH_NAMESPACE="__gnu_cxx" -DHASH_NAMESPACE="__gnu_cxx" -DHASH_FUN_H="<hash_fun.h>" -DHAS_ALLOCA_H -DHAS_SFENCE -DHAS_LFENCE -Wall -Wextra -Wno-unused-parameter -Wno-parentheses -Wpointer-arith -Wwrite-strings -Wno-long-long -Wno-missing-field-initializers -g -DPASSENGER_DEBUG -DBOOST_DISABLE_ASSERTS ext/common/libpassenger_common.a ext/common/libboost_oxt.a ext/libev/.libs/libev.a -lz -lpthread -rdynamic
In file included from ext/common/LoggingAgent/LoggingServer.h:46:0,
from ext/common/LoggingAgent/Main.cpp:43:
ext/common/LoggingAgent/RemoteSender.h:31:23: fatal error: curl/curl.h: No such file or directory
compilation terminated.
rake aborted!
Command failed with status (1): [g++ ext/common/LoggingAgent/Main.cpp -o ag...]

So the libcurl development headers were the true issue.

apt-get install libcurl4-openssl-dev

was the solution.  Or in our case it was to add:

package 'libcurl4-openssl-dev'

to the nginx recipe in Chef (a fuller sketch of the change is just below).  I hope this helps someone else out there!
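For reference, here is roughly what the combined fix looks like in the cookbook.  This is only a sketch: the version attribute name is assumed from the opscode nginx cookbook’s passenger attributes, so verify it against the cookbook release you’re running.

# Attributes: pin passenger to a single version so the gem and the compiled
# nginx module can't drift apart (attribute name assumed, check your cookbook).
default['nginx']['passenger']['version'] = '3.0.17'

# Recipe: make sure the libcurl headers are present before the passenger
# build runs, so PassengerLoggingAgent compiles successfully.
package 'libcurl4-openssl-dev'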

AWS VPC Error: Client.InvalidParameterCombination

When trying to execute an ec2-run-instances command for a VPC, you need to specify both the subnet and the security group you want the instance to belong to:

ec2-run-instances ami-abc123 \
 --group sg-abc123 \
 --subnet subnet-abc123 \
 --private-ip-address 10.0.1.10 \
 .... your other params

However, doing so generates this error:

Client.InvalidParameterCombination: Network interfaces and an instance-level security groups may not be specified on the same request

I even found one lone report of someone else with this issue: https://forums.aws.amazon.com/message.jspa?messageID=368030

Luckily, my company has premium AWS support and a quick 10-minute chat got the answer I needed.  You must use the --network-attachment param, which takes the place of --group, --private-ip-address, and --subnet.

The resulting command looks like this:

ec2-run-instances ami-abc123 \
  --network-attachment :0:subnet-abc123::10.0.1.10:sg-abc123:: \
  .... your other params
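If you’re scripting this, a tiny helper keeps that cryptic colon-delimited string readable.  This is a hypothetical helper, and the field positions are simply copied from the example above, so double-check them against the ec2-api-tools documentation for your version.

# Builds the --network-attachment value with the same field positions as the
# example above (empty slots left empty).
def network_attachment(device_index, subnet_id, private_ip, security_group)
  ":#{device_index}:#{subnet_id}::#{private_ip}:#{security_group}::"
end

network_attachment(0, 'subnet-abc123', '10.0.1.10', 'sg-abc123')
# => ":0:subnet-abc123::10.0.1.10:sg-abc123::"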

Good luck, I hope this helps!

Where do you get hosting support?

For quite some time now, I’ve found that the options for good Rails hosting have been significantly lacking.  As a consultant/contractor on a huge range of projects, I’m often asked for advice, guidance, or help in choosing and setting up servers for a client.  Nearly every client or customer wants the same thing:

  1. Stability/reliability
  2. Flexibility/room to grow
  3. Someone to keep things running
  4. Someone to call when they need help

The first two are met by a lot of providers.  Tier IV datacenters, hardware redundancy, and virtualization are a dime a dozen nowadays, and building a good Rails stack is just about the same for everyone.

However, the rub is in #3 & #4.  As my colleagues, peers, and I already work full-time writing new applications, there is precious little time for system administration and support of completed projects and old apps.  Even with extensive automation, a small 1-3 person team can write many more Rails apps than they can support long-term.

Inevitably, the client wants to know “Who is going to keep things going once development is done?” and “Who can I call when things stop working?”.  Set them up on physical machines, VPSes, EC2, or anything else, and the developer is left with little choice but to help keep those servers running long-term.  Including late-night phone calls when something goes wrong.  And can you honestly say you are regularly doing all the little extra things that need to be done?  General maintenance?  Security patches?  Tuning?

Want an alternative?  AWS Premium support won’t touch your software stack.  Rackspace won’t support Rails.  Slicehost: no managed option at all.  There’s really only one player: just google ‘rails cloud support’.

So my question is: If you can get easy, scalable, on-demand hosting, why can’t you get easy, scalable, on-demand support? My answer to this issue is to launch a service that lets developers keep developing while someone else takes care of the system administration long-term:  RoundHouse Support.  Please read my public release announcement and then come check us out!

Announcing RoundHouse – Managed support for your host

I’m very happy today to announce a new community service available for Rails shops: RoundHouse – Server Management and Support.

RoundHouse is a cooperative solution for getting managed servers and system administration for your Rails stack, no matter what host you use.  We’re gathering a pool of specialists that you can call upon to get the help you need, whether that’s emergency support when you’re having server problems, regular day-to-day duties, or assistance in configuring a particularly difficult piece of software.

This is a service that provides freelancers, development shops, and companies alike an opportunity to focus on their product instead of on their hosting.  For developers this means freeing up more time to code.  For those running a website it means reliable service from a great group of experts.  For everyone it means having someone available whenever you need it.

Obviously this is a new offering, and system administration (much like your hosting provider) must be utterly reliable.  So we’re beginning by establishing a base set of clients to try out our service for free.  This gives us the chance to get established, continue to develop a solid organizational structure, and expand our brand.  For our customers, it means you’re going to get excellent sysadmin support at no charge while you learn about all the great things we can offer (and then hopefully recommend us to all your friends!).  So if you’ve been needing help setting up or running your Rails app, please contact us to get started.

We’re also looking to add additional members to our team as we continue to grow.  If you have expertise in system administration, elements of the Rails stack, or are just a great DevOp, please e-mail us at jobs@roundhousesupport.com and we can talk more!

Finally, please feel free to read my expanded rationale for how this service fits into the Rails ecosystem.

Presenting gem_cloner

Besides being a Ruby/Rails/Merb developer, I’m also a part-time sysadmin for a number of previous clients.  Usually I’m responsible for maintaining Rails stacks, either for apps that I’ve written or for another developer who doesn’t have as much Linux experience.

Lately, I’ve had to move a number of Rails installations to completely new/clean servers.  I’ve got lots of scripts for doing the initial setup of the stack, but one thing that keeps coming up, especially with older apps, is that the gem dependencies can be very finicky.  Installing the latest versions will almost certainly break something.  Plus, sometimes the system can have quite an extensive list of gems.

Yes, I know that the gems should be packaged with the app, but there are a lot of reasons that it doesn’t always happen or doesn’t always work.  To that end, I’ve found the most effective method is just to re-install the exact same set of gems on the new box as the old one.  To automate this process, I present: gem_cloner.

gem_cloner is a very tiny but useful script that will take the text output of `gem list` from one machine and execute the matching `gem install` commands on the new machine.  Usage is very simple:

  1. On the old machine, run `gem list > gems.txt`.
  2. Copy gems.txt to the new machine.
  3. Copy the gem_cloner.rb file to the same place.
  4. With sudo or as root, run `ruby gem_cloner.rb`.

The script will read that file and install the exact same gem versions.  You’ll definitely want to browse and tweak the script, possibly by adding ‘sudo‘ to the command call or adding ‘--no-rdoc --no-ri‘ (I personally use a gemrc to eliminate the doc files on production systems).
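For the curious, the core logic boils down to something like this.  It’s a simplified sketch rather than the actual script, so grab the real thing from GitHub.

# Read the saved `gem list` output and install each gem at every listed version.
# Lines look like: "rails (3.2.8, 3.1.0)"
File.readlines('gems.txt').each do |line|
  next unless line =~ /^(\S+) \((.+)\)/
  name, versions = $1, $2
  versions.split(', ').each do |version|
    system("gem install #{name} -v #{version} --no-rdoc --no-ri")
  end
end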

Fork, patch, & praise ad nauseam on GitHub, and drop me a line if you like it.

Rails exception monitoring

There has been an explosion lately of new ways to deal with tracking exceptions thrown in a production Rails app.  It used to be that you put in exception_notifier and went about your business.  But not too long ago I decided that I needed more.  Two things started this push:

  1. There was more than just me working on the project.  Several people needed easy access to the exceptions (at times), but I didn’t want to clutter their inboxes with e-mails for every exception.
  2. I would not be working on the project forever and others would be handling long-term maintenance.  This meant changing the e-mail addresses in the config file quite often, which just felt like ‘the wrong way’ to be doing it (the classic setup is sketched below).
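For context, the classic exception_notifier setup I’m talking about looked roughly like this.  It’s a sketch from memory of the Rails 2-era plugin, and the option names may differ between versions.

# config/environment.rb: every change to this recipient list meant another deploy.
ExceptionNotifier.exception_recipients = %w(me@example.com teammate@example.com)
ExceptionNotifier.sender_address       = %("Application Error" <app.error@example.com>)
ExceptionNotifier.email_prefix         = "[MyApp ERROR] "

# app/controllers/application_controller.rb
class ApplicationController < ActionController::Base
  include ExceptionNotifiable
end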

So I started looking for other alternatives.  Here’s an overview of the three I found:

Exception Logger

Let’s start with the positive.  Exception Logger is exactly what I was looking for: it makes tracking exceptions with multiple users easy.  Unfortunately, the install procedure is a mess.  The standard plugin just would not work for me.  Rails is notoriously bad at handling controllers/models/views that live inside a plugin, so Exception Logger really needs to be built on Engines for that kind of functionality.  Instead, it uses a number of hacks to try to get Rails to recognize the code in the plugin, and it just doesn’t work well.  Plus, in order to get added functionality (such as authentication) into the controller, you’re supposed to use a config file!  It’s a crazy, unclean mess.  So in order to get it operating, I was forced to simply take the controller, views, etc. and put them into the normal app tree.  This took a lot of setup and debugging, but I was able to get it running and now it works extremely well.  I have it in one high-use production app and I’m very happy with it there, but I don’t expect to use it much more.  You also lose the ability to e-mail notifications (at least not without hacking the plugin some more), so it’s only good in a few select cases.

Exceptional

Now we start getting to the fun stuff.  Several new exception monitoring applications have sprung up recently and I decided to check them out.  Exceptional is a hosted service, and although I’d really prefer to have my exceptions tracked locally on a per-app basis, having them aggregated does have its benefits.  It’s totally free, so sign up for a username, then create a new app profile to get an API key.  Install their plugin, paste in the key, and you’re set.  It runs in production mode and sends the exceptions off to their logging service.  It integrates with Lighthouse, Campfire, and Twitter (none of which I use, but I’m sure it helps others) and will also e-mail you the notifications.

Once again, though, some issues made it unusable.  It appeared to work great, but as I started doing some system administration, I ran into a number of problems.  Whenever the plugin loads (in production mode) it dumps a set of debug messages to stdout.  Every time I’d load a production console (for working on a few issues that could only be tested on the production/staging server) I’d get messages about its attempt to connect.  Then a few seconds later, interrupting whatever I had started, there would be more messages about the successful connection.  Unfortunately it does this when running rake tasks as well, so all my cron jobs that use rake tasks were now littered with these messages.  I was prepared to just edit the plugin to silence these messages, but instead I came across our next entry.

Hoptoad

Hoptoad is very simple, free, and nearly identical to Exceptional.  Sign up (you get your own subdomain), add an app profile, install the plugin, and copy in the API key.  It doesn’t have the extra integrations of Exceptional (though I’ll bet they’re coming), but it does everything I need and without the annoying messages.  It also lets you give extra users access to errors from certain apps only (Exceptional may do this as well, but it wasn’t immediately obvious; it appears to be one username only).  One thing I will really miss from Exceptional is that it tracked 404 errors as well.  Although 404s usually come from scan bots, some may very well be legitimate broken links on or to your site, so it’s nice to track them.
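For anyone trying it, the whole install really is just the plugin plus a one-block initializer.  This is a sketch of the typical hoptoad_notifier setup; substitute your own API key.

# config/initializers/hoptoad.rb: unhandled production exceptions get reported
# to your Hoptoad account.
HoptoadNotifier.configure do |config|
  config.api_key = 'your-api-key-here'
end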

Summary

Overall, I would highly recommend Exceptional or Hoptoad to everyone for all their exception tracking from now on.  Which one you use comes down to a matter of taste right now, and maybe your specific requirements.  I’m very excited to see what further features they add to differentiate themselves, and I am REALLY hoping to see integration with scoutapp somehow, as I think exception handling and performance monitoring go hand-in-hand.

Finally, please feel free to comment on your own experiences with any of these projects!

Comparison of Rails monitoring apps: FiveRuns vs NewRelic RPM vs Scout App

Monitoring your production Rails application is a very important part of deploying and operating a web app. There are several more general solutions that work very well: Nagios, Munin, etc. As of late, however, several Rails-specific options have come into common use. I’d like to discuss the three big players here:

FiveRuns RM-Manage

The FiveRuns client has been out for about a year and offers a terrific suite of monitoring: server load/memory, MySQL queries, Rails errors, etc. As of version 2.0 (which is in open beta and will be released for customers during RailsConf) it also supports monitoring your mongrels. It works great, but can get pretty expensive (they don’t publish their prices, but I’m paying $30/server). It is a good choice for most users.

NewRelic RPM

I’ve been beta testing the NewRelic RPM service for the past few months. It’s a decent service and very easy to install, but it is very limited. It will monitor your server load/memory, slow queries, etc., as will all the other monitoring tools. But beyond that it doesn’t offer much. You are limited to graphing only a 24-hour period of data, so you can’t see any kind of long-term trends.  They have an amazing backend system for collecting data and their site is the fastest and most responsive I’ve ever seen.  Once they get their UI front-end featureset to match their amazing data collection system, they’re going to be awesome.  As of today, they opened to the general public and released their pricing. It is based on the number of mongrel/thin instances, no matter how many servers (at least that’s my understanding). For a small to medium app running up to 40 mongrels (which would probably be 2-4 servers), you’ll wind up paying $250/month, as compared to $60-120 for FiveRuns. Overall, given the limited functionality and hefty price, I can’t yet recommend NewRelic. I hope to see it grow and quickly add more features to change my mind.

Scout App

The third option is Scout App. It offers the same suite of monitoring features as the others, but goes a step further by offering a huge range of additional plugins that let you customize its functionality and easily add extras such as restarting dying mongrels. You can also write your own plugins (see the sketch below). To add another scoop to this already monstrous sundae, Scout App is the cheapest of all. It will run you only $29.00/month for four servers.
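To give a feel for the plugin side, a custom Scout plugin is just a small Ruby class.  The sketch below follows the general Scout plugin conventions; check their docs for the exact API and how plugins get packaged and deployed.

# A minimal custom plugin: report how many mongrel processes are running so you
# can alert (or trigger a restart) when the count drops.
class MongrelCount < Scout::Plugin
  def build_report
    count = `ps aux | grep [m]ongrel_rails | wc -l`.to_i
    report(:running_mongrels => count)
  end
end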

If I had to pick one service to recommend, it would likely be Scout. They provide just about everything you can ask for out of a monitoring app at the lowest price point. If anybody else has experience with these services, please add your own comments!

Deploying Rails Applications by Ez

I recently received Ezra’s new book "Deploying Rails Applications" in the mail.  It is a terrific reference and I will be following up shortly with an in-depth review.

I was quite delighted to find that I had already done even the most advanced tasks regarding scaling at least once or twice (e.g. load-balanced app servers, clustered MySQL, Memcached, CDN for static files, & extensive benchmarking/profiling/optimization).