That massive Slack outage this month? It started with an AWS networking error.

And problems with Slack's infrastructure meant things got far, far worse from there.

This sign was about as interactive as Slack's software on Jan. 4.

Photo: Justin Sullivan/Getty Images

The hours-long outage that kicked off the 2021 working year for Slack customers was the result of a cascading series of problems initially caused by network scaling issues at AWS, Protocol has learned.

According to a root-cause analysis that Slack distributed to customers last week, "around 6:00 a.m. PST we began to experience packet loss between servers caused by a routing problem between network boundaries on the network of our cloud provider." A source familiar with the issue confirmed that AWS Transit Gateway did not scale fast enough to accommodate the spike in demand for Slack's service the morning of Jan. 4, coming off the holiday break.

Slack declined to comment beyond confirming the authenticity of the report. AWS declined to comment.

Over the next hour, packet loss caused by the networking problems led Slack's servers to report an increasing number of errors. As more and more servers stopped responding and were tagged as "unhealthy," the remaining healthy servers were forced to handle a growing share of demand. Slack engineers were not alerted to the problems until around 6:45 a.m. PT.
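The feedback loop described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a reconstruction of Slack's systems: the function name, the fleet size, the per-server capacity, and the rule that an overloaded fleet loses a fifth of its healthy servers per round are all hypothetical and do not come from Slack's report.

```python
def simulate_cascade(servers, capacity_per_server, demand, initially_unhealthy):
    """Toy model of a cascading overload.

    Returns the count of healthy servers after each round.
    All parameters and the failure rule are illustrative assumptions.
    """
    healthy = servers - initially_unhealthy
    history = [healthy]
    # Keep going while the healthy servers cannot absorb total demand.
    while healthy > 0 and demand > healthy * capacity_per_server:
        # Hypothetical rule: each overloaded round, one fifth of the
        # healthy servers (at least one) stop responding and are
        # tagged unhealthy, pushing more load onto the rest.
        lost = max(1, healthy // 5)
        healthy -= lost
        history.append(healthy)
    return history


if __name__ == "__main__":
    # A small initial loss is absorbed by spare capacity.
    print(simulate_cascade(100, 10, 900, 5))
    # A slightly larger initial loss tips the fleet into a spiral.
    print(simulate_cascade(100, 10, 900, 15))
```

In this toy model, the difference between a fleet that recovers and one that collapses is whether the first wave of unhealthy servers uses up the spare capacity. Once it does, every further loss raises the load on the servers that remain.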

"By 7:00am PST there were an insufficient number of backend servers to meet our capacity needs," according to the report, and Slack went down hard across the world.

Slack had a backup reserve of servers ready to go, but it began to discover problems with the provisioning service it used to spin up and verify those backup servers. That service was not designed to get Slack up and running on more than 1,000 servers in a short period of time. Slack was also unable to debug the issues properly because its observability service was affected by the networking issues as well, according to the report.

Between 7 a.m. PT and roughly 8:15 a.m. PT, AWS increased the capacity of AWS Transit Gateway and moved Slack from a shared system to a dedicated system, Slack told customers. Once Slack's problems with its provisioning system were fixed, the new servers had stable network connections, and service returned to normal over the next hour.

In its report, Slack promised customers it would improve several aspects of its architecture over the next few months, starting with a better alert system for packet loss and closer ties between its observability system and its provisioning service. It will also redesign the server-provisioning service to handle a similar type of event and set new rules around how its servers automatically scale in response to demand.

One thing that isn't yet clear is how folks at AWS coordinated their response to the outage: AWS, after all, is actually a Slack customer, since the two companies signed a sweeping partnership deal last June. For its part, Slack signed a five-year deal with AWS in 2018 that appears to cover the majority of its cloud computing needs through 2023.

Slack has run into problems in the past when a disproportionately large number of people try to log into its service all at once. A similar outage occurred on Halloween in 2017 when a coding error kicked Slack users offline and everybody tried to log back in at the same time. "It's similar to DDoSing yourself," former Slack director of infrastructure Julia Grace told me at the time.
