Atlassian’s Jira outage, first noticed last week, is somehow still affecting hundreds of the company’s customers and potentially thousands of developers around the world. Late Tuesday the company finally released more details about the cause and scope of the outage, proving once again that delaying the release of bad news to enterprise customers is worse than the bad news itself.
“Let me start by saying that this incident and our response time are not up to our standard, and I apologize on behalf of Atlassian,” said CTO Sri Viswanath in a blog post Tuesday.
He explained that last week Atlassian engineers attempted to deactivate an old app for Jira Service Management and Jira Software whose functionality is now fully integrated into the company’s current services. Internal communication problems and a bad deactivation script instead caused “sites for approximately 400 customers [to be] improperly deleted.” The incident also took out Confluence and Opsgenie, two Atlassian products that customers use to manage their own internal incident response systems.
Compounding the mistake was a lack of automated backup and recovery tools for an incident of this nature. As a result, Atlassian engineers must restore affected customers’ data manually to make sure nothing happens to the data of customers who were not affected by the initial incident. Expect that lack of automation to change later this year.
Gergely Orosz, a former Uber and Microsoft engineer, might have summed it up best:
“Outages happened, happen, and will happen. The root cause is less important in this case. What is important is how companies respond when things go wrong, and how quickly they do this. And speed is where the company failed first and foremost.”