What caused the biggest AWS outage in years?

Protocol Enterprise

Welcome to Protocol Cloud, your comprehensive roundup of everything you need to know about the week in cloud and enterprise software. This week: a post-mortem on the big AWS outage, a rundown on re:Invent so far and Salesforce is now a Slacker.

The Big Story

Computers, amirite

Given the way this year has gone, it really shouldn't be surprising that AWS kicked off the biggest three-week stretch of its year — the re:Invent conference — with one of its worst service disruptions in recent memory.

Last Wednesday's outage knocked prominent AWS customers such as Adobe, iRobot and Roku offline for several hours. It was a noticeable blip in what had otherwise been a strong operational year for the cloud leader, especially during a period in which demand for cloud services skyrocketed. Over the holiday weekend, AWS released a lengthy post-mortem detailing what went wrong, and how it plans to prevent similar problems from occurring in the future.

Modern cloud computing is ridiculously complicated, and it's managed by people. And people make mistakes. Still, no cloud provider wants to be seen as unreliable, even if 99.97% uptime is considered problematic. Anyway, here's what you need to know about last week's outage:

  • AWS traced the problem to a service called Kinesis Data Streams, which collects a wide variety of data from different sources for analytical processing.
  • In the process of attempting to add server capacity for that service at its US-East-1 data center region in Northern Virginia — the oldest and most prominent facility in its arsenal — several things went wrong.
  • Amazingly, the new servers "caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration," according to AWS's report, which is a thing I didn't know could happen.
  • AWS was forced to restart all the Kinesis servers — which takes hours — and because several other widely used AWS services rely on Kinesis, a cascading series of issues unfolded.
  • Major services such as EC2 were not affected by this event, and the damage was confined to a single region, but there's no doubt these issues will have cost some AWS customers money.
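For readers wondering how adding servers can push every existing server over a thread limit at once: AWS's post-mortem describes front-end servers that each keep one operating-system thread per peer in the fleet. Here is a minimal sketch of that arithmetic. The function names and all the numbers are illustrative assumptions, not figures from AWS.

```python
# Sketch of the thread arithmetic, assuming each server holds one OS thread
# per peer in the fleet. All names and numbers here are hypothetical.

def threads_per_server(fleet_size: int, baseline_threads: int) -> int:
    """Threads on one server: its own baseline plus one per other server."""
    return baseline_threads + (fleet_size - 1)


def exceeds_limit(fleet_size: int, baseline_threads: int, os_thread_limit: int) -> bool:
    """True when every server in the fleet would be over the OS thread limit."""
    return threads_per_server(fleet_size, baseline_threads) > os_thread_limit


def max_fleet_size(baseline_threads: int, os_thread_limit: int) -> int:
    """Largest fleet that keeps each server at or under the limit."""
    return os_thread_limit - baseline_threads + 1


if __name__ == "__main__":
    limit, baseline = 1200, 200  # hypothetical OS limit and per-server baseline
    for size in (1000, 1001, 1002):
        print(size, threads_per_server(size, baseline), exceeds_limit(size, baseline, limit))
```

Because the count on each server depends on the size of the whole fleet, a capacity addition raises the number everywhere at the same moment. That is also why running fewer, larger servers lowers the per-server thread count.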

AWS plans to make several changes in the wake of this incident.

  • It increased the computing power and memory of the servers dedicated to Kinesis, which will allow it to reduce the number of servers needed to handle the load and avoid the operating-system thread limit.
  • It will also reconfigure how it manages operating systems across the service, which should raise that thread limit.
  • AWS uses a technique it calls "cellularization" to confine the issues that inevitably pop up in other services to a smaller number of users, but it has yet to implement that work for Kinesis. That will change shortly.
  • And AWS will fix bugs and reconfigure how two prominent services impacted by the outage — CloudWatch and Cognito — handle future errors from the Kinesis service, which they depend on to provide data for monitoring and identity management, respectively.
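The idea behind cellularization can be sketched in a few lines. This is a simplified illustration, not AWS's implementation: it assumes customers are assigned to independent cells by hashing a customer ID, and the names `cell_for` and `affected_customers` are hypothetical.

```python
import hashlib

# Sketch of cell-based isolation: each customer maps to exactly one cell,
# so a failure in one cell touches only the customers assigned to it.
# The hashing scheme and function names are assumptions for illustration.

def cell_for(customer_id: str, num_cells: int) -> int:
    """Deterministically map a customer to one of num_cells cells."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_customers(customers: list[str], failed_cell: int, num_cells: int) -> list[str]:
    """Customers whose traffic lands on the failed cell."""
    return [c for c in customers if cell_for(c, num_cells) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    hit = affected_customers(customers, failed_cell=0, num_cells=10)
    print(f"{len(hit)} of {len(customers)} customers affected")
```

With ten cells, a failure in one of them reaches roughly a tenth of customers instead of all of them, which is the smaller blast radius the report is after.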

So what can we learn from this incident, other than the fact that a lot of people don't really understand how cloud computing works in practice?

  • Both AWS and its customers have been increasingly interested in managed cloud services, where AWS does most of the heavy lifting required to accomplish a given task: Those services are easier for customers to use and more profitable for AWS.
  • But a lot of those managed services depend on other AWS services. AWS uses its own tools to build many of those higher-level services, and once the dominoes start falling they can be hard to stop.
  • Building an application across multiple clouds or multiple AWS regions would have made it easier for affected customers to quickly recover. But despite a plethora of products designed to help with that, it remains a difficult undertaking for most companies.
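The falling dominoes can be modeled as a walk over a dependency graph. The sketch below assumes a simple map of which service depends on which. The CloudWatch and Cognito dependencies on Kinesis come from AWS's report; the customer-facing entries are hypothetical examples.

```python
from collections import deque

# Sketch of a "blast radius" calculation: given a failed service, find
# everything that depends on it directly or indirectly.

def blast_radius(depends_on: dict[str, set[str]], failed: str) -> set[str]:
    """Return every service that transitively depends on the failed one."""
    # Invert the map so we can walk from a service to its dependents.
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


DEPENDS_ON = {
    "cloudwatch": {"kinesis"},             # from AWS's report
    "cognito": {"kinesis"},                # from AWS's report
    "customer_dashboard": {"cloudwatch"},  # hypothetical customer app
    "customer_login": {"cognito"},         # hypothetical customer app
    "static_site": set(),                  # hypothetical, no dependencies
}

if __name__ == "__main__":
    print(sorted(blast_radius(DEPENDS_ON, "kinesis")))
```

Running the same walk against your own architecture is a cheap way to see which managed services sit underneath more of your application than you expected.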

Cloud computing is built on the back of thousands of outages large and small, which people have learned from over the course of 15 years. If you get a few drinks in them, the people who know the messy details of how modern enterprise systems are architected reveal their amazement that the whole thing actually works as well as it does.

This Week on Protocol

The New Enterprise: Over the course of this week, in our latest Protocol Manual, we're taking a look at the people, companies and trends that are shaping enterprise computing. Earlier this week we shared the story behind Zoom's unprecedented scaling event, updated the state of cloud economics and examined what "lock-in" means in today's IT world. Today: Why data centers will never look the same again. And there's still more to come.

Re:Invent: It's AWS week, at least in the virtual sense, and CEO Andy Jassy's marathon keynote speech Tuesday morning laid out more evidence that hybrid cloud is here to stay. Although based on Jassy keynotes going back several years, it's hard for many of us to believe that this was the plan all along. More on AWS's announcements below.

People are typing: Salesforce's decision to shell out nearly $28 billion for Slack will go down as either the masterstroke of a decades-long M&A strategy conducted by Marc Benioff, or the moment at which the company that pioneered the cloud hit a ceiling. Either way, the next few years of enterprise SaaS won't be boring.

This Week in re:Invent

Here's a brief summary of some highlights from the first 36 hours of AWS re:Invent 2020:

Around the Cloud

Thanks for reading; see you next week.
