The PivotNine Blog

AWS Outage Doesn’t Change Anything

20 September 2015
Justin Warren

If the latest AWS outage changes anything in your approach to cloud adoption, then you're doing it wrong.
This is not the first AWS outage (I first wrote about one in 2011, back when I had hair), nor will it be the last. Nor will AWS be the only provider to suffer an outage at some point in the future. We've already seen outages from Office 365, Azure, SoftLayer, and Gmail.

Outages are a thing that happens, whether your computing takes place in your office, in co-location, or in ‘the cloud’, which is just a shorthand term for “someone else's computer”.

To think that putting applications ‘in the cloud’ magically makes everything better is naive at best.

A Trade-off

I've written about the resiliency trade-off before; to summarize, there are only two ways to approach this: assume robust, or assume fragile.

Writing an application that assumes all of the infrastructure it runs on is fragile and may fail at any moment is complex and difficult. So, for many years the dominant thinking in writing applications was to assume that the infrastructure was essentially perfect, which made writing the applications much simpler. This is the ‘assume robust’ model.

The trade-off was that we had to spend a lot of time and effort and money on making the infrastructure robust. So we have RAID, and clustering, and Tandem/HP NonStop, and redundancy, and a host of other techniques that help the infrastructure to stay online even when bits of it break.

It turns out this makes infrastructure pretty expensive.

Lately, the hyper-scale companies like Apple and Amazon have found that when you have lots of things, even if you've got robust hardware, things are still breaking all the time. It's simple math: if each thing has a 1% chance of breaking on any given day, and you have 100 of them, on average one of them will break every day.
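
A quick back-of-the-envelope sketch of that math, in Python. The 1% daily failure rate and the 100 components are purely illustrative numbers, and failures are assumed to be independent:

n = 100      # number of components
p = 0.01     # chance each component fails on a given day

# Expected failures per day: 100 * 0.01 = 1.0
expected_failures_per_day = n * p

# Chance that at least one component fails on a given day:
# 1 - 0.99^100, roughly 63%
p_at_least_one_failure = 1 - (1 - p) ** n

print(expected_failures_per_day)          # 1.0
print(round(p_at_least_one_failure, 3))   # 0.634

So even with individually reliable parts, at scale you should plan on replacing something roughly every day.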

So we may as well assume fragile and design around it. Then we can spend a lot less on infrastructure because we don't need any of the expensive redundancy bits. Instead, we buy cheap infrastructure (which is going to fail soon anyway) and write software that assumes the infrastructure it runs on will fail at any moment.
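
As a toy illustration of what ‘assume fragile’ looks like at the application level, here is a minimal retry-with-backoff sketch in Python. The call_flaky_service function, its failure rate, and the retry parameters are all hypothetical, just to show the shape of the idea:

import random
import time

def call_flaky_service(request):
    # Hypothetical stand-in for a call to infrastructure that may
    # fail at any moment.
    if random.random() < 0.2:
        raise ConnectionError("backend unavailable")
    return "ok: " + request

def call_with_retries(request, attempts=5, base_delay=0.5):
    # Assume the failure is transient and retry with exponential
    # backoff, rather than assuming the infrastructure never breaks.
    for attempt in range(attempts):
        try:
            return call_flaky_service(request)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

print(call_with_retries("GET /orders/42"))

And that's only the trivially easy end of the spectrum: one transient error on one call.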

See the issue with this? Writing software that handles all the different hardware failure modes, particularly at scale, is hard. Distributed systems are Capital-H Hard. And we still have to make the application perform correctly under ideal conditions, with the added challenge of having it work correctly under non-ideal conditions at the same time.

Hiring incredibly smart and capable people is a lot easier for massive global companies with huge amounts of money and strong brands than it is for mid-tier companies you've never heard of. So where do those mid-tier companies get the smart developers required to write these distributed applications?

And then you still have to write the apps. Which takes longer than writing comparatively simple apps, even for really smart people. Which costs money. Maybe more money than the pricey resilient infrastructure you decided was too expensive.

And now we're back where we started.

Thinking Required

Figuring out what to do here requires understanding what you're trying to achieve, and what the trade-offs are. That means understanding both the characteristics of your deployment choice (on-site, cloud, or a hybrid approach) and the characteristics of the application you then need to design, build, and run.

You need to have staff who can do all of these things, or purchase access to them from consultants (like me!) or vendors.

And you need to consider what happens when your choice of solution breaks, because it will, even if only temporarily.

But you were already doing all of this.

Weren't you?

This article first appeared on Forbes.com.