What caused the biggest AWS outage in years?
Welcome to Protocol Cloud, your comprehensive roundup of everything you need to know about the week in cloud and enterprise software. This week: a post-mortem on the big AWS outage, a rundown on re:Invent so far and Salesforce is now a Slacker.
The Big Story
Computers, amirite
Given the way this year has gone, it really shouldn't be surprising that AWS kicked off the biggest three-week stretch of its year — the re:Invent conference — with one of its worst service disruptions in recent memory.
Last Wednesday's outage knocked prominent AWS customers such as Adobe, iRobot and Roku offline for several hours. It was a noticeable blip in what had otherwise been a strong operational year for the cloud leader, especially during a period in which demand for cloud services skyrocketed. Over the holiday weekend, AWS released a lengthy post-mortem detailing what went wrong, and how it plans to prevent similar problems from occurring in the future.
Modern cloud computing is ridiculously complicated, and it's managed by people. And people make mistakes. Still, no cloud provider wants to be seen as unreliable, especially in a business where even 99.97% uptime is considered problematic. Anyway, here's what you need to know about last week's outage:
- AWS traced the problem to a service called Kinesis Data Streams, which collects a wide variety of data from different sources for analytical processing.
- In the process of attempting to add server capacity for that service at its US-East-1 data center region in Northern Virginia — the oldest and most prominent facility in its arsenal — several things went wrong.
- Amazingly, the new servers "caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration," according to AWS's report, which is a thing I didn't know could happen. (A sketch of how that failure mode plays out follows this list.)
- AWS was forced to restart all the Kinesis servers — which takes hours — and because several other widely used AWS services rely on Kinesis, a cascading series of issues unfolded.
- Major services such as EC2 were not affected by this event, and the damage was confined to a single region, but there's no doubt these issues will have cost some AWS customers money.
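To make that failure mode concrete, here's a minimal sketch, not AWS's code and with made-up numbers, assuming (as the post-mortem describes) that each front-end host keeps one OS thread open per peer in the fleet. Under that assumption, adding capacity raises the thread count on every host at once:

```python
# Illustrative only: hypothetical fleet sizes and limits, not AWS's real numbers.
OS_THREAD_LIMIT = 10_000     # assumed per-host thread ceiling set by OS configuration
BASELINE_THREADS = 500       # assumed threads for request handling, housekeeping, etc.

def threads_per_host(fleet_size: int) -> int:
    """Assume each host keeps one communication thread per peer, plus a fixed baseline."""
    return BASELINE_THREADS + (fleet_size - 1)

for fleet_size in (9_000, 9_400, 9_600):          # capacity being added over time
    used = threads_per_host(fleet_size)
    status = "OK" if used <= OS_THREAD_LIMIT else "over the limit on EVERY host"
    print(f"fleet={fleet_size:>5}  threads/host={used:>6}  -> {status}")
```

The point is that the limit isn't hit by one unlucky machine; it's hit fleet-wide the moment the fleet grows past a threshold, which is how a routine capacity addition can take down a whole service.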
AWS plans to make several changes in the wake of this incident.
- It increased the computing power and memory of the servers dedicated to Kinesis, which will allow it to reduce the number of servers needed to handle the load and avoid the operating-system thread limit.
- It will also reconfigure how it manages operating systems across the service, which should raise that thread limit.
- AWS uses a technique it calls "cellularization" to confine the issues that inevitably pop up in its services to a smaller number of users, but it has yet to implement that work for Kinesis. That will change shortly (a rough sketch of the idea follows this list).
- And AWS will fix bugs and reconfigure how two prominent services impacted by the outage — CloudWatch and Cognito — handle future errors from the Kinesis service, which they depend on to provide data for monitoring and identity management, respectively.
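For the curious, here's a rough sketch of the cellularization idea, not AWS's implementation: partition customers across independent cells so a bad deployment or overloaded fleet in one cell leaves everyone else untouched. The cell count and customer names below are hypothetical.

```python
import hashlib

NUM_CELLS = 8                      # hypothetical number of independent cells

def cell_for(customer_id: str) -> int:
    """Deterministically map a customer to one cell."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_CELLS

failed_cell = 3                    # pretend one cell's backend is having a very bad day
for customer in ("adobe", "irobot", "roku", "example-corp"):
    cell = cell_for(customer)
    impact = "degraded" if cell == failed_cell else "unaffected"
    print(f"{customer:>12} -> cell {cell} ({impact})")
```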
So what can we learn from this incident, other than the fact that a lot of people don't really understand how cloud computing works in practice?
- Both AWS and its customers have been increasingly interested in managed cloud services, where AWS does most of the heavy lifting required to accomplish a given task: Those services are easier for customers to use and more profitable for AWS.
- But a lot of those managed services depend on other AWS services. AWS uses its own tools to build many of those higher-level services, and once the dominoes start falling they can be hard to stop.
- Building an application across multiple clouds or multiple AWS regions would have made it easier for affected customers to recover quickly, as the sketch after this list illustrates. But despite a plethora of products designed to help with that, it remains a difficult undertaking for most companies.
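What does "build across regions" actually mean in practice? At its simplest, something like the sketch below: replicate the data ahead of time, and have the read path fall back to a second region when the primary errors out. The bucket names and regions here are hypothetical, and real cross-region replication and failover involve considerably more work than this.

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical primary and replica locations; assumes the data is already replicated.
REPLICAS = [
    ("us-east-1", "example-app-data-use1"),
    ("us-west-2", "example-app-data-usw2"),
]

def read_object(key: str) -> bytes:
    last_error = None
    for region, bucket in REPLICAS:
        try:
            s3 = boto3.client("s3", region_name=region)
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except (BotoCoreError, ClientError) as exc:
            last_error = exc       # this region is unhappy; try the next one
    raise last_error

# read_object("reports/latest.json")
```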
Cloud computing is built on the back of thousands of outages, large and small, that people have learned from over the course of 15 years. Get a few drinks into the people who know the messy details of how modern enterprise systems are architected, and they'll admit they're amazed the whole thing works as well as it does.
- Technology will fail; how you recover from failure is what matters. That is, so long as you don't make a habit of it.
A MESSAGE FROM MICROSOFT AZURE AND INTEL

What is confidential computing? There are ways to encrypt your data at rest and while in transit, but confidential computing protects the integrity of your data while it is in use. Data threats never rest, nor should the protection of your sensitive information.
This Week on Protocol
The New Enterprise: Over the course of this week, in our latest Protocol Manual, we're taking a look at the people, companies and trends that are shaping enterprise computing. Earlier this week we shared the story behind Zoom's unprecedented scaling event, updated the state of cloud economics and examined what "lock-in" means in today's IT world. Today: Why data centers will never look the same again. And there's still more to come.
Re:Invent: It's AWS week, at least in the virtual sense, and CEO Andy Jassy's marathon keynote speech Tuesday morning laid out more evidence that hybrid cloud is here to stay. Although, based on Jassy's keynotes from years past, it's hard for many of us to believe this was the plan all along. More on AWS' announcements below.
People are typing: Salesforce's decision to shell out nearly $28 billion for Slack will go down either as the masterstroke of a decades-long M&A strategy conducted by Marc Benioff, or as the moment at which the company that pioneered the cloud hit a ceiling. Either way, the next few years of enterprise SaaS won't be boring.
This Week in re:Invent
Here's a brief summary of some highlights from the first 36 hours of AWS re:Invent 2020:
- AWS will allow developers to rent Mac Minis in its cloud for software testing purposes, which could appeal to current customers designing for Apple hardware, if they don't mind the price.
- Amazon S3 got a big update. Data stored in the original cloud service from the original cloud company is now "strongly consistent," which is a complicated database way of saying that customers will no longer see short delays between the time when data is stored and the time when it is available to be read. (More on what that means in practice after this list.)
- Interested in the custom Graviton2 Arm chip? Now you'll be able to select a new option with faster networking speeds.
- There's a new service called Proton that will allow AWS customers to create standard deployment templates for their software developers to use when building new applications that re-use common resources.
- And there's a new monitoring service called Monitron, which sounds like Voltron's long-lost twin sibling.
- AWS Lambda should get cheaper: Customers of the serverless computing platform will now pay for their consumption in one-millisecond intervals, which should allow for significant cost savings compared to the 100-millisecond billing unit previously in place. A back-of-the-envelope example follows this list.
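Two of those items reward a closer look. On S3's strong consistency, the practical upshot is that the read-after-write dance is gone: a GET issued immediately after a successful PUT now returns the bytes you just wrote. A minimal sketch, with hypothetical bucket and key names:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket, key = "example-bucket", "config/flag.json"     # hypothetical names

s3.put_object(Bucket=bucket, Key=key, Body=b'{"feature_enabled": true}')
latest = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
assert latest == b'{"feature_enabled": true}'          # no stale-read window to code around
```

And on Lambda billing, a back-of-the-envelope calculation shows why shorter functions benefit most when the billing unit drops from 100 ms to 1 ms. The invocation counts and runtimes below are made up, and the per-GB-second figure is the published on-demand duration price at the time, so treat the dollar amounts as illustrative:

```python
import math

PRICE_PER_GB_SECOND = 0.0000166667      # published on-demand duration price; verify current rates
MEMORY_GB = 0.128                       # a 128 MB function
INVOCATIONS = 10_000_000                # hypothetical monthly volume

def duration_cost(duration_ms: float, granularity_ms: int) -> float:
    """Round the runtime up to the billing granularity, then price it per GB-second."""
    billed_ms = math.ceil(duration_ms / granularity_ms) * granularity_ms
    return (billed_ms / 1000) * MEMORY_GB * PRICE_PER_GB_SECOND

for duration in (8, 30, 120):           # hypothetical average runtimes in ms
    old = duration_cost(duration, 100) * INVOCATIONS
    new = duration_cost(duration, 1) * INVOCATIONS
    print(f"{duration:>4} ms avg: ${old:,.2f} -> ${new:,.2f} "
          f"({100 * (old - new) / old:.0f}% less duration cost)")
```

Functions that finish well under 100 ms see the biggest percentage drop, since they used to be rounded up the most.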
Around the Cloud
- Zoom posted another ridiculous quarter of revenue growth: revenue climbed 367% compared to the same period last year.
- ServiceNow acquired Element AI for an undisclosed amount, the latest in a series of AI-related acquisitions by the business-process management company.
- Microsoft backed down from a plan to let Microsoft 365 administrators see fine-grained details about their employees' work activity after an outcry about privacy concerns.
- How deep does antipathy run among AI professionals when it comes to Department of Defense AI work? Not very deep, according to Defense One.
- Security concerns about Docker containers appeared to be on the wane a year or so ago, but malware attacks targeting misconfigured servers are rising.
- There have been a lot of twists and turns in the 81-year history of Hewlett-Packard, but this is still a little surprising: Its former enterprise computing division, now known as HPE, is moving to Houston, Texas, a long way away from a small garage in Palo Alto.
Thanks for reading; see you next week.