Enterprise

That massive Slack outage this month? It started with an AWS networking error.

And problems with Slack's infrastructure meant things got far, far worse from there.


This sign was about as interactive as Slack's software on Jan. 4.

Photo: Justin Sullivan/Getty Images

The hours-long outage that kicked off the 2021 working year for Slack customers was the result of a cascading series of problems initially caused by network scaling issues at AWS, Protocol has learned.

According to a root-cause analysis that Slack distributed to customers last week, "around 6:00 a.m. PST we began to experience packet loss between servers caused by a routing problem between network boundaries on the network of our cloud provider." A source familiar with the issue confirmed that AWS Transit Gateway did not scale fast enough to accommodate the spike in demand for Slack's service the morning of Jan. 4, coming off the holiday break.

Slack declined to comment beyond confirming the authenticity of the report. AWS declined to comment.

Over the next hour, packet loss from the networking problems led Slack's servers to report a growing number of errors. As more and more servers stopped responding and were tagged as "unhealthy," the remaining healthy servers had to absorb a growing share of demand. Slack engineers were not alerted to the problems until around 6:45 a.m. PT.
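The feedback loop in that paragraph can be sketched as a toy model. This is an illustration only, not Slack's or AWS's actual system: the function name, fleet size, per-server capacity and failure rate are all hypothetical, chosen to show how a steady trickle of servers failing health checks can tip the rest of the fleet into overload.

```python
def simulate_cascade(servers, capacity, demand, lost_per_round, rounds):
    """Return the healthy-server count after each round (toy model).

    All parameters are hypothetical:
      servers        - starting number of healthy servers
      capacity       - requests one server can handle per round
      demand         - total requests arriving per round
      lost_per_round - servers marked unhealthy each round due to packet loss
    """
    healthy = servers
    history = [healthy]
    for _ in range(rounds):
        # Packet loss makes some servers miss their health checks.
        healthy = max(0, healthy - lost_per_round)
        # If the survivors cannot cover demand, the excess load knocks
        # out additional servers (one server per `capacity` of excess,
        # rounded up).
        if healthy and demand > healthy * capacity:
            overload = demand - healthy * capacity
            healthy = max(0, healthy - -(-overload // capacity))
        history.append(healthy)
    return history


if __name__ == "__main__":
    print(simulate_cascade(servers=100, capacity=10, demand=800,
                           lost_per_round=5, rounds=6))
```

With these made-up numbers the fleet shrinks at a constant rate while it still has headroom, then loses servers faster each round once demand exceeds what the remaining servers can carry.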

"By 7:00am PST there were an insufficient number of backend servers to meet our capacity needs," according to the report, and Slack went down hard across the world.

Slack had a reserve of backup servers ready to go, but it ran into problems with the provisioning service it used to spin up and verify those servers, which was not designed to bring Slack up on more than 1,000 servers in a short period of time. Slack was also unable to debug the issues properly because its observability service was itself affected by the networking issues, according to the report.

Between 7 a.m. PT and roughly 8:15 a.m. PT, AWS increased the capacity of AWS Transit Gateway and moved Slack from a shared system to a dedicated system, Slack told customers. Once Slack's problems with its provisioning system were fixed, the new servers had stable network connections, and service began to come back to normal over the next hour.

In its report, Slack promised customers it would improve several aspects of its architecture over the next few months, starting with a better alert system for packet loss and closer ties between its observability system and its provisioning service. It will also redesign the server-provisioning service to handle a similar type of event and set new rules around how its servers automatically scale in response to demand.

One thing that isn't yet clear is how folks at AWS coordinated their response to the outage: AWS, after all, is actually a Slack customer, since the two companies signed a sweeping partnership deal last June. For its part, Slack signed a five-year deal with AWS in 2018 that appears to cover the majority of its cloud computing needs through 2023.

Slack has run into problems in the past when a disproportionately large number of people try to log in to its service all at once. A similar outage occurred on Halloween in 2017 when a coding error kicked Slack users offline and everybody tried to log back in at the same time. "It's similar to DDoSing yourself," former Slack director of infrastructure Julia Grace told me at the time.
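A common general-purpose mitigation for this kind of simultaneous reconnect surge is exponential backoff with random jitter, which spreads client retries out over time. The sketch below is a minimal illustration of that technique, not a description of how Slack's clients behave; the function name and the `base` and `cap` defaults are assumptions made for the example.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    "Full jitter" backoff: the upper bound doubles with each attempt
    (capped at `cap`), and the actual delay is drawn uniformly between
    zero and that bound so clients do not all retry at the same moment.
    `base` and `cap` are hypothetical defaults.
    """
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0, ceiling)


if __name__ == "__main__":
    for attempt in range(6):
        print(f"attempt {attempt}: wait {reconnect_delay(attempt):.2f}s")
```

Because each client draws its own random delay, a crowd of users that was disconnected at the same instant comes back gradually instead of all at once.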
