By Tom Krazit

Protocol | Enterprise

That massive Slack outage this month? It started with an AWS networking error.

And problems with Slack's infrastructure meant things got far, far worse from there.

Slack

This sign was about as interactive as Slack's software on Jan. 4.

Photo: Justin Sullivan/Getty Images

The hours-long outage that kicked off the 2021 working year for Slack customers was the result of a cascading series of problems initially caused by network scaling issues at AWS, Protocol has learned.

According to a root-cause analysis that Slack distributed to customers last week, "around 6:00 a.m. PST we began to experience packet loss between servers caused by a routing problem between network boundaries on the network of our cloud provider." A source familiar with the issue confirmed that AWS Transit Gateway did not scale fast enough to accommodate the spike in demand for Slack's service the morning of Jan. 4, coming off the holiday break.

Slack declined to comment beyond confirming the authenticity of the report. AWS declined to comment.

Over the next hour, packet loss from the networking problems led Slack's servers to report a growing number of errors. As more and more servers stopped responding and were tagged as "unhealthy," the remaining healthy servers had to handle a growing share of demand. Slack engineers were not alerted to the problems until around 6:45 a.m. PT.

"By 7:00am PST there were an insufficient number of backend servers to meet our capacity needs," according to the report, and Slack went down hard across the world.
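The feedback loop described above, in which each server marked unhealthy pushes more load onto the ones that remain, can be illustrated with a toy model. This is only a sketch under stated assumptions, not a reconstruction of Slack's systems: the function name, the fleet size, the demand figure and the rule that the share of servers still passing health checks equals capacity divided by load are all hypothetical.

```python
def simulate_cascade(total_servers, demand, capacity_per_server, initially_unhealthy):
    """Return the healthy-server count after each round of health checks.

    Hypothetical model: demand is spread evenly over the healthy servers.
    While the fleet is overloaded, only the fraction capacity / load of the
    servers keeps passing health checks; the rest are marked unhealthy.
    """
    healthy = total_servers - initially_unhealthy
    history = [healthy]
    # Overloaded means demand / healthy > capacity_per_server.
    while healthy > 0 and demand > healthy * capacity_per_server:
        # survivors = healthy * (capacity / load) = capacity * healthy^2 / demand
        healthy = (capacity_per_server * healthy * healthy) // demand
        history.append(healthy)
    return history


# A small initial loss is absorbed by spare capacity...
print(simulate_cascade(100, 9000, 100, initially_unhealthy=5))
# [95]

# ...but a larger one tips the fleet into a collapse that feeds on itself.
print(simulate_cascade(100, 9000, 100, initially_unhealthy=15))
# [85, 80, 71, 56, 34, 12, 1, 0]
```

In this model the difference between a non-event and a full outage is whether the first wave of failures uses up the fleet's spare capacity; once it does, every round removes more servers than the last.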

Slack had a backup reserve of servers ready to go, but it ran into problems with the provisioning service it used to spin up and verify those backup servers; that service was not designed to bring more than 1,000 servers online in a short period of time. Slack was also unable to debug the issues properly because its observability service was affected by the same networking issues, according to the report.

Between 7 a.m. PT and roughly 8:15 a.m. PT, AWS increased the capacity of AWS Transit Gateway, and moved Slack from a shared system to a dedicated system, Slack told customers. Once Slack's problems with its provisioning system were fixed, the new servers found they had stable network connections, and service began to come back to normal over the next hour.

In its report, Slack promised customers it would improve several aspects of its architecture over the next few months, starting with a better alert system for packet loss and closer ties between its observability system and its provisioning service. It will also redesign the server-provisioning service to handle a similar type of event and set new rules around how its servers automatically scale in response to demand.

One thing that isn't yet clear is how folks at AWS coordinated their response to the outage: AWS, after all, is actually a Slack customer, since the two companies signed a sweeping partnership deal last June. For its part, Slack signed a five-year deal with AWS in 2018 that appears to cover the majority of its cloud computing needs through 2023.

Slack has run into problems in the past when a disproportionately large number of people try to log into its service all at once. A similar outage occurred on Halloween in 2017 when a coding error kicked Slack users offline and everybody tried to log back in at the same time. "It's similar to DDoSing yourself," former Slack director of infrastructure Julia Grace told me at the time.
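The "DDoSing yourself" effect is easy to see in a small simulation. The sketch below compares every client reconnecting in the same instant with clients that each wait a random delay first. Random jitter is a common general-purpose mitigation for this kind of login stampede; it is not something Slack's report or Grace describes, and the function names, client count and 60-second window are assumptions chosen for illustration.

```python
import random
from collections import Counter


def reconnect_times(num_clients, jitter_window_s, seed=0):
    """Hypothetical clients each pick a random delay in [0, jitter_window_s]."""
    rng = random.Random(seed)  # seeded so the illustration is repeatable
    return [rng.uniform(0, jitter_window_s) for _ in range(num_clients)]


def peak_logins_per_second(times):
    """Bucket reconnect times by whole second and return the busiest bucket."""
    buckets = Counter(int(t) for t in times)
    return max(buckets.values())


clients = 10_000

# No jitter: every client comes back in the same second.
stampede = reconnect_times(clients, jitter_window_s=0)
print(peak_logins_per_second(stampede))

# Jitter: the same clients spread their logins over a minute.
spread = reconnect_times(clients, jitter_window_s=60)
print(peak_logins_per_second(spread))
```

With no jitter the login tier sees all 10,000 attempts in one second; spread over a minute, the busiest second carries only a small fraction of that.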

Power

Google wants to help you get a life

Digital car windows, curved AR glasses, automatic presentations and other patents from Big Tech.

A new patent from Google offers a few suggestions.

Image: USPTO

Another week has come to pass, meaning it's time again for Big Tech patents! You've hopefully been busy reading all the new Manual Series stories that have come out this week and are now looking forward to hearing what comes after what comes next. Google wants to get rid of your double-chin selfie videos and find things for you as you sit bored at home; Apple wants to bring translucent displays to car windows; and Microsoft is exploring how much you can stress out a virtual assistant.

And remember: The big tech companies file all kinds of crazy patents for things, and though most never amount to anything, some end up defining the future.

Mike Murphy

Mike Murphy ( @mcwm) is the director of special projects at Protocol, focusing on the industries being rapidly upended by technology and the companies disrupting incumbents. Previously, Mike was the technology editor at Quartz, where he frequently wrote on robotics, artificial intelligence, and consumer electronics.

Sponsored Content

The future of computing at the edge: an interview with Intel’s Tom Lantzsch

An interview with Tom Lantzsch, SVP and GM, Internet of Things Group at Intel

Edge computing has been on the rise in the last 18 months – and accelerated amid the need for new applications to solve challenges created by the Covid-19 pandemic. Tom Lantzsch, Senior Vice President and General Manager of the Internet of Things Group (IoT) at Intel Corp., thinks there are more innovations to come – and wants technology leaders to think equally about data and the algorithms as critical differentiators.

In his role at Intel, Lantzsch leads the worldwide group of solutions architects across IoT market segments, including retail, banking, hospitality, education, industrial, transportation, smart cities and healthcare. And he's seen first-hand how artificial intelligence run at the edge can have a big impact on customers' success.

Protocol sat down with Lantzsch to talk about the challenges faced by companies seeking to move from the cloud to the edge, some of the surprising ways that Intel has found to help customers, and the next big breakthrough in this space.

What are the biggest trends you are seeing with edge computing and IoT?

A few years ago, there was a notion that the edge was going to be a simplistic model, where we were going to have everything connected up into the cloud and all the compute was going to happen in the cloud. At Intel, we had a bit of a contrarian view. We thought much of the interesting compute was going to happen closer to where data was created. And we believed, at that time, that camera technology was going to be the driving force – that just the sheer amount of content that was created would be overwhelming to ship to the cloud – so we'd have to do compute at the edge. A few years later – that hypothesis is in action and we're seeing edge compute happen in a big way.

Saul Hudson
Saul Hudson has deep experience in creating brand voice identity, especially in understanding and targeting messages about cutting-edge technologies. He enjoys commissioning, editing, writing and business development, and helping companies build passionate audiences and accelerate their growth. Hudson has reported from more than 30 countries, from war zones to boardrooms to presidential palaces. He has led multinational, multilingual teams and managed operations for hundreds of journalists. Hudson is a Managing Partner at Angle42, a strategic communications consultancy.
Protocol | Enterprise

AWS has avoided antitrust scrutiny. That could change soon.

Legislators and regulators are looking closely for evidence of contract pricing, self-preferencing and whether lock-in is hurting customers.

AWS, Microsoft and Google Cloud have all invested billions of dollars in cloud infrastructure.

Image: NurPhoto/Getty Images

The days of AWS flying under the antitrust radar are over.

Cloud computing has grown at a dizzying speed since 2006, when AWS launched its first cloud service. A generation of tech companies found themselves more than willing to pay handsomely to outsource their hardware and networking needs — as well as an ever-growing percentage of their software development tools — to the company.

Tom Krazit

Tom Krazit ( @tomkrazit) is a senior reporter at Protocol, covering cloud computing and enterprise technology out of the Pacific Northwest. He has written and edited stories about the technology industry for almost two decades for publications such as IDG, CNET, paidContent and GeekWire, and served as executive editor of Gigaom and Structure.

Transforming 2021

Blockchain, QR codes and your phone: the race to build vaccine passports

Digital verification systems could give people the freedom to work and travel. Here's how they could actually happen.

One day, you might not need to carry that physical passport around, either.

Photo: CommonPass

There will come a time, hopefully in the near future, when you'll feel comfortable getting on a plane again. You might even stop at the lounge at the airport, head to the regional office when you land and maybe even see a concert that evening. This seemingly distant reality will depend upon vaccine rollouts continuing on schedule, an open-sourced digital verification system and, amazingly, the blockchain.

Several countries around the world have begun to prepare for what comes after vaccinations. Swaths of the population will be vaccinated before others, but that hasn't stopped industries decimated by the pandemic from pioneering ways to get some people back to work and play. One of the most promising efforts is the idea of a "vaccine passport," which would allow individuals to show proof that they've been vaccinated against COVID-19 in a way that could be verified by businesses to allow them to travel, work or relax in public without a great fear of spreading the virus.

Mike Murphy


Sponsored Content

Building better relationships in the age of all-remote work

How Stripe, Xero and ModSquad work with external partners and customers in Slack channels to build stronger, lasting relationships.

Image: Original by Damian Zaleski

Every business leader knows you can learn the most about your customers and partners by meeting them face-to-face. But in the wake of Covid-19, the kinds of conversations that were taking place over coffee, meals and in company halls are now relegated to video conferences—which can be less effective for nurturing relationships—and email.

Email inboxes, with hard-to-search threads and siloed messages, not only slow down communication but are also an easy target for scammers. Earlier this year, Google reported seeing more than 18 million malware and phishing emails related to Covid-19 scams each day over a single week, along with more than 240 million daily spam messages.
