by David Snopek on March 1, 2017 - 10:26am

You probably noticed that many sites and apps were having serious problems yesterday due to an Amazon AWS outage.

Some sites/apps were completely down, and others had partial or reduced functionality. In the Drupal world, Pantheon was affected: sites didn't go down (huzzah for Pantheon!), but everything was in read-only mode for several hours, so users couldn't upload files to their sites and many dashboard functions didn't work.

Already, many are talking about how this outage is proof that the public cloud is a bad idea. Or, that Amazon messed up big time and maybe we should look at other cloud providers.

However, I'm going to argue that it wasn't Amazon's fault that this outage took down so many sites and apps.

And by extension, this isn't proof that the cloud is a bad idea, or that we should look to providers other than Amazon. The cloud is great, and so is Amazon AWS.

I'm going to argue that it's OUR fault -- the web developers who make all these great apps and sites -- that this outage broke the internet.

Please read more to find out why!

What exactly broke at Amazon?

My understanding is that Amazon S3, a service for storing individual files in the cloud, went down in the US-EAST-1 data center.

More precisely, most S3 operations in that data center failed, but it's my understanding that some percentage of them were successful. But the failure rate was so high that, even with lots and lots of retrying, many services that depended on it failed.

However, the most important thing about this outage is that it only affected Amazon S3 in a single Amazon data center.

Amazon S3 runs on 5 data centers in North America, 3 in Europe, 5 in Asia and 1 in South America.

When using Amazon S3, you can choose which data center you want to use, and there's nothing preventing you from using multiple data centers.

Note: there was lots of confusion yesterday. Many people had their EC2 servers running in US-WEST-1, but unknowingly had their S3 data in US-EAST-1. That led many to say S3 was down in US-WEST-1 too, but it's my understanding that it wasn't.
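If you're not sure which data center a given bucket actually lives in, it only takes a moment to check. Here's a minimal sketch using boto3 (the bucket name is hypothetical):

    import boto3

    s3 = boto3.client('s3')

    # Ask S3 which region this bucket actually lives in (bucket name is hypothetical).
    location = s3.get_bucket_location(Bucket='my-app-uploads')

    # For historical reasons, S3 reports US-EAST-1 as None.
    print(location['LocationConstraint'] or 'us-east-1')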

UPDATE (2017-03-02): Here's the full post-mortem explanation from Amazon.

Amazon gave us the tools to avoid this! We didn't use them...

Of course, some mistake was made in Amazon's US-EAST-1 data center. That's on them. (That said, the last major Amazon S3 outage was in 2011.)

But the reality of life, and especially in technology, is that mistakes happen. We can't build a bullet-proof system that never breaks.

What we can do is build systems that can tolerate the failure of one part and keep running.

The crux of my argument is that the real failure here was web developers building apps that relied on a single Amazon data center.

And we should know better. Even without Amazon making a mistake, there's always the possibility of a natural disaster knocking out a data center.

It's our (web developers') fault for not replicating data in S3 to multiple data centers -- and Amazon even gave us really easy tools to do it!

With a couple of clicks in the AWS console, you can configure an S3 "bucket" to replicate its data to another data center. Then, if your primary S3 data center goes down, you can simply change your site or app to point at the backup data center, and you're back in business.
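If you'd rather script it than click, here's a minimal sketch of the same setup using boto3. The bucket names, region and IAM role ARN are all hypothetical, and the role has to grant S3 permission to replicate objects on your behalf:

    import boto3

    # Client in the source bucket's region (region and bucket names are hypothetical).
    s3 = boto3.client('s3', region_name='us-east-1')

    # Versioning is a prerequisite for replication -- it must be enabled on BOTH
    # buckets; shown here for the source, the replica needs the same call with a
    # client in its own region.
    s3.put_bucket_versioning(
        Bucket='my-app-uploads',
        VersioningConfiguration={'Status': 'Enabled'},
    )

    # Replicate every object to a bucket that lives in a different region.
    s3.put_bucket_replication(
        Bucket='my-app-uploads',
        ReplicationConfiguration={
            'Role': 'arn:aws:iam::123456789012:role/s3-replication-role',
            'Rules': [{
                'ID': 'replicate-everything',
                'Prefix': '',  # empty prefix = all objects
                'Status': 'Enabled',
                'Destination': {'Bucket': 'arn:aws:s3:::my-app-uploads-replica'},
            }],
        },
    )

One caveat: replication only applies to objects uploaded after the rule is enabled, so existing data needs a one-time copy (for example, with the aws s3 sync command).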

Why did so many big websites and apps fail to set that up? Maybe Amazon is so stable that we all got complacent?

What I can say is that even at myDropWizard, we relied on a single S3 data center. However, that's going to change before this week is out! Maybe we should declare this "official" S3 replication week? ;-)

Amazon and the cloud are still great

The people on the internet saying that this outage is a sign that it's time to stop relying so strongly on the cloud or Amazon are just plain wrong. That's too extreme a leap.

But this is a sign that we've all been using the cloud wrong!

The fact that the cloud allows us to painlessly replicate our data and apps in multiple data centers is a great strength of the cloud! That's actually something that's quite hard to do with physical servers.

So, everyone:

  • First, calm down - it's not the end of the cloud
  • Second, go evaluate how your apps would react if your primary data center went down, either partially or completely :-)

Do you agree? Do you think I'm totally wrong? Have a data center outage story you'd like to share? Please leave a comment below!

Comments

Great article, and it helps explain how this all happened. BUT we keep moving toward a type of centralization, from a more distributed system to a more centralized distribution, even with the kind of redundancy you mention, i.e., coding to more than one system in the cloud, etc.

What I propose is that, as history has shown us, centralization and consolidation are good things that save money and make things more productive and efficient. BUT there is another side (since every positive has a negative, where balance is best), and that side tends to be given way too much to hold.

What I am saying is that even with the supposed redundancies, we are slowly and inexorably moving closer to a one-size-fits-all model, even if we say we aren't (can you say cognitive dissonance and confirmation bias), and that model will one day be the focal point in a "Cyber-War." I am military, and one of the reasons the military doesn't rely on a centralized model of leadership is that in the art of war, cutting off the head kills the body. If we continue to move in this direction just for the sake of savings (while at our current state we still have efficiency and productivity), we open ourselves to being destroyed when someone cuts off that consolidated single point (even several redundant points have limits, and if it connects to the Internet of Devices and Things, think cascading failure).

It is great and awesome that this event triggered your effort to recode things for better redundancy, and yet that doesn't speak to our effort and the gradual energy build-up, like a snowball rolling down the side of a mountain, that will one day, with the type of complacency you mention, lead us to a great catastrophic event.

Something to think about.

"It is great and awesome that this event triggered your effort to recode things for better redundancy, and yet that doesn't speak to our effort and the gradual energy build-up, like a snowball rolling down the side of a mountain, that will one day, with the type of complacency you mention, lead us to a great catastrophic event."

I think the greater point is that under Amazon's model, we all have a choice. While many choose simplicity, cost savings, and "efficiency", we still have that choice.

So, in a way, by offering those two choices, Amazon is increasing the diversity of choice. Not to be too much of an "Amazon Fan Boy", but it allows all sorts of smaller parties to choose whether they want ease/simplicity/cost vs. reliability/ubiquity.

Amazon even, to a point, allows us to pick and choose when and how we want to be scalable and/or reliable. Take the NFL, for instance. There is ONE day per year where any downtime at all would be catastrophic. The rest of the year, while still ridiculously popular, the site wouldn't be considered mission critical.

David's over-arching point is that Amazon *has* thought about these things. If they never anticipated downtime, there would only be a west coast datacenter. The problem with the events yesterday is that the rest of us don't think about those things. :)

I do not think you are wrong overall. But I would not be so quick to blame the developers; in some cases the client dictates lower costs. In that case, it falls to the client, provided the developer recommended the replication.

Yes, certainly when a client is involved, this can become more complicated. Hopefully, the developer explained to them the trade-offs with that decision, and maybe they'll revisit it now.

However, plenty of the affected sites weren't built for a client, but are an app built by the company itself, for example, Pantheon (not to pick on them too much -- I think they handled the outage great, considering). I hope that the developers responsible for Pantheon, and similar in-house apps (Quora, etc), are going to the AWS console right now to click the buttons to enable S3 replication to another Amazon data center. :-)

I think this is a good attitude to have for owners of distributed systems:

"go evaluate how your apps would react if your primary data center went down, either partially or completely"

I've spent a lot of my time lately working on orchestration of multiple different services (GitHub, CircleCI, Pantheon) and it is always worth analyzing how a failure in one system reaches the whole.

David Snopek puts his finger on a key point when saying 'hopefully, the developer explained to them the trade-offs.' This begs the question.

The cost-to-benefit graph of making infrastructure more robust is, impressionistically, exponential. It is not good enough for a developer to say to the client, 'here is a cost-benefit graph of hosting spend vs. robustness, and here is the point on the graph which I recommend for your use case.' The developer must also do a cost-benefit analysis which explains to the client why that particular level of spend on robust infrastructure is the right fit for that particular client, evaluated from a commercial point of view. And what developer actually does that, and tries to back the recommendation with real data?

I have tried to provide such figures for clients, and found it a huge struggle, mainly because I cannot find the data and do not know where to get it. My impression is that some developers, even if they know of some data, do not even make the attempt. They simply make a judgment based on 'industry practice' or 'gut feeling' or 'what similar clients with a responsible attitude would do.'

If someone can tell me how to explain hosting trade-offs to a client on the basis of solid, commercially relevant data, then I will begin to feel that the problem of inadequate hosting is (as we tend to feel) to be blamed on parsimonious clients. At present, it is just as much a failure of our industry to offer clients data-backed hosting recommendations which go beyond our feeling that sticking the site on $5 hosting, or in a single data centre, does not fit with our preferred way of working.

This is something we struggle with too. Our service is helping sites remain secure, online and without annoying little bugs. It's hard to convince clients that they need to pay to prevent a problem they don't have right now, i.e. "My site is up now, so why do I need to hire someone to keep it up?" or "No one has hacked my site yet, so why would I want a service to prevent it?" Clients tend to think reactively, after the problem has already happened, so it's tough to convince them of the value. We try to tell a story to get them to imagine that something bad has happened, and how much better the experience will be with our help, but it's hard to avoid that turning into "scare tactics" which we really don't want to use. :-)

Anyway, great point! Thanks for the comment!

Factor that (security setup, processes, maintenance, etc) into your hourly rate, then give it to your client for free.

That could definitely work if you're doing other active work for that client continuously! I could see it being problematic for a client that hasn't paid you for other work in a year, but a security issue comes up whose fix isn't trivial: do you still do it for them for free in recognition of what they paid you in the past?

Our model is a fixed monthly fee for maintenance and support, which they might not always use, but if something comes up they always have us on-call. It's not "free" but it's at least dependable and won't go up if they have a month where they need a lot of things, or there's a big site outage or security issue.

Agreed, but I see a difference between prevention (security processes and setup) and ongoing maintenance (security updates). I'm suggesting you incorporate the security processes and setups you've built over time as part of your deliverable. Automate it as much as possible, and share with your client what they're getting 'for free'. Again, I put that in quotes because they're really paying for it as part of your increase in hourly rate.

Marketing your ongoing maintenance packages to those one-and-done clients is tricky. I will say this: I haven't run into a client that hasn't panicked after seeing the default Drupal update message - "There is a security update for your version of Drupal. To ensure the security of your server, update immediately!" Maybe there's a way you can leverage similar language by automatically sending e-mails to clients that don't have your maintenance package every time your team has fixed/updated a new issue...

As for clients that don't pay - keep sending those security e-mails but ask for a retainer (equal to the amount they owe). Once they pay, say 'thanks', and cut them loose. Ain't nobody got time for that.

Thank you for your post. 100% on target.

Hi, David,

Thanks for bringing me here from my blog post on the issue, https://www.freelock.com/blog/john-locke/2017-03/6-things-consider-next-... . Overall, I wholeheartedly agree with your main point, to evaluate what might happen if your main data center goes down.

On a broader note, though, I think there is a danger in the entire industry standardizing on a couple big players. A single point of failure is a dangerous thing. Much better to have some variety, some heterogeneity in your systems. We have chosen to distribute our servers across 4 different services, with a self-hosted configuration management tool to make this relatively painless. (We happen to use Salt Stack, but pick your own CM...) If something managed to bring down the Amazon platform itself somehow, we could easily move to a provider that uses a different cloud platform -- and it wouldn't even take a full day to recover.

Cloud infrastructure is great, but you should diversify to more than just multiple data centers at a single provider...

Cheers,
John

Thanks for your comment!

"We have chosen to distribute our servers across 4 different services, with a self-hosted configuration management tool to make this relatively painless. (We happen to use Salt Stack, but pick your own CM...) If something managed to bring down the Amazon platform itself somehow, we could easily move to a provider that uses a different cloud platform -- and it wouldn't even take a full day to recover."

That is really, really cool :-)

"Cloud infrastructure is great, but you should diversify to more than just multiple data centers at a single provider..."

Absolutely! There's always more you can do, including storing some backups on another provider entirely, or rehearsing bringing your whole infrastructure back up on a totally different hosting provider, or even going as far as you guys have, etc.

That said - if you're using S3, it's so laughably easy to synchronize all your data to a 2nd data center (you just click some buttons in the AWS console), but for some reason this isn't a widespread practice. It's even relatively easy to set up some backup EC2 servers in a different data center (ex. replicating to a MySQL slave, so you have up-to-the-minute data in case you need to start up some application servers in that data center if your primary goes down).
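To give a rough idea of what "pointing at the backup data center" can look like in application code, here's a minimal sketch that reads from the primary bucket and falls back to the cross-region replica when the primary is failing -- the bucket names and regions are hypothetical:

    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    # Primary bucket and its cross-region replica (names and regions are hypothetical).
    BUCKETS = [
        ('us-east-1', 'my-app-uploads'),          # primary
        ('us-west-2', 'my-app-uploads-replica'),  # kept in sync by S3 replication
    ]

    def fetch_object(key):
        """Read an object from the primary bucket, falling back to the replica."""
        last_error = None
        for region, bucket in BUCKETS:
            s3 = boto3.client('s3', region_name=region)
            try:
                return s3.get_object(Bucket=bucket, Key=key)['Body'].read()
            except (BotoCoreError, ClientError) as error:
                last_error = error  # this region is failing; try the next one
        raise last_error

The same fallback idea applies to writes, though that's more involved (you'd have to reconcile the buckets once the primary comes back).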

So, while there's always more you can do, I feel like at a bare minimum we should at least leverage the multiple data centers of our existing cloud providers - that's one of the reasons cloud providers have multiple data centers in the first place. Diversifying more is obviously better, but so many apps (including quite big ones) stopped short of even that. If there are easy wins, let's take them!
