The hours-long outage that kicked off the 2021 working year for Slack customers was the result of a cascading series of problems initially caused by network scaling issues at AWS, Protocol has learned.
According to a root-cause analysis that Slack distributed to customers last week, "around 6:00 a.m. PST we began to experience packet loss between servers caused by a routing problem between network boundaries on the network of our cloud provider." A source familiar with the issue confirmed that AWS Transit Gateway did not scale fast enough to accommodate the spike in demand for Slack's service the morning of Jan. 4, coming off the holiday break.
Slack declined to comment beyond confirming the authenticity of the report. AWS declined to comment.
Over the next hour, the packet loss led Slack's servers to report a growing number of errors. As more and more servers were tagged "unhealthy" for failing to respond over the degraded network, the remaining healthy servers were forced to absorb an ever-larger share of demand. Slack engineers were not alerted to the problems until around 6:45 a.m. PT.
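That dynamic, in which health checks meant to protect users end up shrinking the server pool just as demand peaks, is a classic failure mode in load-balanced systems. The sketch below is a simplified illustration of the pattern, not Slack's actual infrastructure code; the fleet size, probe-loss rate and thresholds are assumptions chosen for the example.

```python
# Simplified illustration (not Slack's code) of how health checks can
# amplify a network problem: probes lost to packet loss look like dead
# servers, the pool shrinks, and the survivors absorb all of the load.
# The fleet size, loss rate and thresholds below are assumptions.
import random
from dataclasses import dataclass

PROBE_LOSS_RATE = 0.3      # assumed fraction of health probes dropped
UNHEALTHY_AFTER = 3        # consecutive failed probes before removal


@dataclass
class Server:
    name: str
    failed_probes: int = 0
    healthy: bool = True


def run_probe_round(fleet: list[Server]) -> None:
    """One round of health checks over a lossy network."""
    for server in fleet:
        if random.random() < PROBE_LOSS_RATE:        # probe lost in transit
            server.failed_probes += 1
            if server.failed_probes >= UNHEALTHY_AFTER:
                server.healthy = False                # pulled from rotation
        else:
            server.failed_probes = 0


def requests_per_healthy_server(total_requests: int, fleet: list[Server]) -> float:
    """The shrinking healthy pool still has to carry the full request volume."""
    healthy = [s for s in fleet if s.healthy]
    if not healthy:
        raise RuntimeError("no healthy backends left")
    return total_requests / len(healthy)


fleet = [Server(f"web-{i}") for i in range(100)]
for minute in range(6):                               # packet loss persists
    run_probe_round(fleet)
    print(minute, requests_per_healthy_server(1_000_000, fleet))
```

As the loop runs, per-server load stays flat at first and then climbs as servers are pulled from rotation, which is the shape of the problem Slack describes hitting between 6 a.m. and 7 a.m.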
"By 7:00am PST there were an insufficient number of backend servers to meet our capacity needs," according to the report, and Slack went down hard across the world.
Slack had a reserve of backup servers ready to go, but it soon ran into problems with the provisioning service used to spin up and verify those servers, which was not designed to bring more than 1,000 servers online in a short period of time. Slack was also unable to debug the issues properly because its observability service was affected by the same networking problems, according to the report.
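For a sense of why a provisioning pipeline built for routine scaling struggles here, a rough back-of-the-envelope model helps. The rate limit and verification time below are purely illustrative assumptions, not figures from Slack's report; the point is that time-to-capacity grows linearly with the number of servers requested, which matters a great deal when the existing fleet is already failing.

```python
# Back-of-the-envelope sketch, with made-up numbers, of why provisioning
# 1,000+ servers at once takes far longer than a routine scale-up when the
# provisioning service works through requests at a fixed rate.
import math

PROVISION_RATE_PER_MIN = 50   # assumed ceiling on servers provisioned per minute
VERIFY_MINUTES = 2            # assumed time to boot and health-verify a batch


def minutes_to_capacity(servers_needed: int) -> int:
    batches = math.ceil(servers_needed / PROVISION_RATE_PER_MIN)
    return batches * VERIFY_MINUTES


print(minutes_to_capacity(100))    # routine morning scale-up: ~4 minutes
print(minutes_to_capacity(1200))   # emergency fleet replacement: ~48 minutes
```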
Between 7 a.m. PT and roughly 8:15 a.m. PT, AWS increased the capacity of AWS Transit Gateway and moved Slack from a shared system to a dedicated one, Slack told customers. Once the problems with the provisioning system were fixed, the new servers came online with stable network connections, and service returned to normal over the following hour.
In its report, Slack promised customers it would improve several aspects of its architecture over the next few months, starting with a better alert system for packet loss and closer ties between its observability system and its provisioning service. It will also redesign the server-provisioning service to handle a similar type of event and set new rules around how its servers automatically scale in response to demand.
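The report doesn't spell out what the new packet-loss alerting will look like, but the general shape of such a check is straightforward. The sketch below is a hypothetical example of the idea, with a made-up threshold and sampling window; the rules Slack actually adopts may differ.

```python
# Hypothetical sketch of the kind of packet-loss alert the report describes
# adding; the 1% threshold and per-minute sampling are assumptions.
def should_page_oncall(loss_samples: list[float], threshold: float = 0.01) -> bool:
    """Page when average packet loss across the sample window exceeds the threshold.

    loss_samples holds one packet-loss fraction per minute, e.g. [0.0, 0.02, 0.05].
    """
    if not loss_samples:
        return False
    return sum(loss_samples) / len(loss_samples) > threshold


print(should_page_oncall([0.0, 0.0, 0.01]))    # False: within tolerance
print(should_page_oncall([0.02, 0.05, 0.08]))  # True: sustained loss, alert early
```

Alerting on the packet loss itself, rather than waiting for servers to start failing health checks, is what would have given engineers more than the roughly 45-minute head start they got on Jan. 4.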
One thing that isn't yet clear is how folks at AWS coordinated their response to the outage: AWS, after all, is actually a Slack customer, since the two companies signed a sweeping partnership deal last June. For its part, Slack signed a five-year deal with AWS in 2018 that appears to cover the majority of its cloud computing needs through 2023.
Slack has run into problems in the past when a disproportionately large number of people try to log into its service all at once. A similar outage occurred on Halloween in 2017 when a coding error kicked Slack users offline and everybody tried to log back in at the same time. "It's similar to DDoSing yourself," former Slack director of infrastructure Julia Grace told me at the time.