Why Roblox was down for three days

Protocol Enterprise

Hello and welcome to Protocol Enterprise! Today: Roblox unpacks the complicated set of factors that derailed its gaming service last October, Uncle Pat goes to Ohio, and earnings season kicks off next week.

Spin up

Cloud companies have spent a lot of time over the last five years catering to companies reluctant to give up their data centers entirely, but stop me if you’ve heard this one: The pandemic changed that thinking. More than half of IT professionals surveyed by Aryaka plan to close their data centers over the next 24 months, according to Data Center Knowledge.

Lessons from the Roblox outage

Roblox is probably not the first name that comes to mind when it comes to thinking about enterprise tech, but the company operates a massive network to serve the 50 million extremely demanding preteen and teenage gamers on its platform. This week the company released a long, extremely detailed post-mortem on a big three-day outage last year that everyone working in enterprise infrastructure should read.

“The outage was unique in both duration and complexity,” Roblox said, and that’s an understatement. Three days is an eternity on the internet; Facebook was down for several hours one day last October and the world briefly lost its mind.

Roblox manages its own infrastructure, which is not unusual for a company founded in 2004.

  • The company has over 18,000 servers on that infrastructure and also deploys and manages its own storage and networking equipment.
  • It relies extensively on technology developed by HashiCorp, including Nomad, Vault and Consul.
  • Consul, which is part of a category of emerging enterprise technologies called service meshes, was at the center of the circumstances that led to the outage.

Like most outages, this one started innocuously but uncovered a novel bug deep inside the layers of software used to run Roblox’s infrastructure.

  • Service meshes like Consul function like traffic-control officers on a network, allowing individual microservices to talk to each other and exchange the data needed to do their jobs.
  • At first glance it seemed like a simple failure of the hardware running the Consul cluster, but after replacing all the servers, performance was still impacted.
  • A combined team of Roblox and HashiCorp engineers eventually figured out that design choices made in BoltDB, an open-source embedded key-value store that Consul uses to persist its Raft log, were causing the bottlenecks, and that the problem was only exposed because of Roblox’s unique architecture.
  • Part of the reason it took so long to diagnose the issue was that the team couldn’t determine whether Roblox’s choices or a flaw inside Consul was causing the problem; it turned out to be a little bit of both.
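
For readers who have not worked with service discovery, the traffic-control role described above can be sketched in a few lines. This is a toy illustration only: the ToyRegistry class, its method names, the service name and the addresses are all hypothetical, and none of it reflects Consul's actual API or Roblox's setup. It shows why every service-to-service call depends on the lookup layer answering quickly.

```python
import random
from collections import defaultdict


class ToyRegistry:
    """In-memory stand-in for a service-discovery layer (hypothetical, not Consul's API)."""

    def __init__(self):
        # service name -> {instance address: is the instance healthy?}
        self._instances = defaultdict(dict)

    def register(self, service, address):
        """Record a new instance of a service and assume it starts healthy."""
        self._instances[service][address] = True

    def mark_unhealthy(self, service, address):
        """Flag a known instance so lookups stop returning it."""
        if address in self._instances[service]:
            self._instances[service][address] = False

    def healthy(self, service):
        """Return the sorted addresses of all healthy instances."""
        return sorted(a for a, ok in self._instances[service].items() if ok)

    def resolve(self, service, rng=random):
        """Pick one healthy instance, or fail if none is available."""
        candidates = self.healthy(service)
        if not candidates:
            raise LookupError(f"no healthy instance of {service!r}")
        return rng.choice(candidates)


registry = ToyRegistry()
registry.register("inventory", "10.0.0.5:8080")
registry.register("inventory", "10.0.0.6:8080")
registry.mark_unhealthy("inventory", "10.0.0.5:8080")
print(registry.resolve("inventory"))  # prints 10.0.0.6:8080
```

Every caller goes through resolve before it can reach another service, so when the lookup layer itself slows down or fails, the services behind it become unreachable even if they are healthy.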

Roblox has made several changes to its infrastructure over the last three months.

  • “Running all Roblox backend services on one Consul cluster left us exposed to an outage of this nature,” the company said, and as a result it has added a second data center to run backend services and also plans to implement availability zones within those data center regions.
  • HashiCorp is also working on a new version of Consul that replaces BoltDB.
  • But despite the survey results at the top of today’s newsletter, sales reps at the Big Three cloud providers haven’t picked up any major new business from Roblox.
  • “In general we find public cloud to be a good tool for applications that are not performance and latency critical, and that run at a limited scale. However, for our most performance and latency critical workloads, we have made the choice to build and manage our own infrastructure on-prem,” the company said.

Sincere kudos to Roblox for posting such a detailed analysis of what was probably one of the worst incidents in the history of the company. Lucky for it, however, its core audience moves on quickly.

— Tom Krazit (email | twitter)

A MESSAGE FROM DATAIKU

At Dataiku, we know data isn’t a destination, so we build tools for what comes after. Dataiku is the only AI platform connecting data and doers, enabling anyone across organizations to transform data into real business impact. Because there is no soul in the machine, only in front of it.

Learn more

Intel bets on the Buckeye State

President Biden joined Intel CEO Pat Gelsinger Friday to unveil plans for Intel’s first new U.S. factory site in 40 years, a $20 billion investment in Ohio.

Construction is set to begin late this year on two factories, or fabs, at the 1,000-acre site near Columbus, Ohio, that Gelsinger called the “Silicon Heartland.” But the 3,000 jobs the two fabs will create are tied to a $52 billion subsidy package that has stalled in the House after passage by the Senate, Gelsinger said.

New fab construction is critical to the chip industry, which is buckling under pressure to make more chips than it ever has before. COVID-related supply shocks and unprecedented demand for consumer goods — all of which seem to need chips — have prodded the industry to find ways to manufacture more chips. Spending on new fab equipment will have grown 34% to $152 billion in 2021, according to IC Insights. And the Biden administration has made it a priority to attract more new factories to the U.S. because of national security and geopolitical concerns.

Intel expects the fabs to be fully operational in 2025.

— Max A. Cherney (email | twitter)

Coming next week

It’s time for the first round of 2022 earnings calls, with Microsoft, ServiceNow, Qualtrics and Intel all sharing their results from the December 2021 quarter.

Microsoft will present second-quarter earnings on Tuesday at 2:30pm PT.

ServiceNow will announce earnings on Wednesday at 2:00pm PT.

Qualtrics will share fiscal year results on Wednesday at 2:00pm PT.

Intel’s earnings will be announced Wednesday after the closing bell, at 2:00pm PT.

Around the enterprise

IBM finally pulled the plug on Watson Health, selling it to Francisco Partners after a failed decade-long campaign to gain traction as a healthcare IT player.

VMware staff are pushing back on the hire of AWS executive Joshua Burgin, who was investigated by Amazon for alleged discrimination towards a Black female employee.

Thanks for reading — see you Monday!
