aerial view of Facebook data center in Ireland
Photo: Facebook

What you can learn from Facebook's outage

Protocol Enterprise

Welcome to Protocol | Enterprise, your comprehensive roundup of everything you need to know about cloud and enterprise software. This Thursday: what others can learn from Facebook's Black Monday, an upcoming Protocol event with ServiceNow CEO Bill McDermott, and D-Wave's quantum-computing strategy enters a new era.

The Big Story

Black Monday

Facebook is a unique tech company in pretty much every interpretation of that phrase. So, unsurprisingly, the details behind its massive global outage on Monday were irresistibly fascinating to anyone who has ever been responsible for building and maintaining enterprise tech. And while some of the particulars were definitely unique to Facebook, this outage was certainly one for the history books.

The outage is easy enough to understand. On Tuesday, Facebook released more details about what caused the outage, after issuing a brief statement Monday evening that seemed mostly designed to counter unhinged conspiracy theories spreading on social media. (The irony!)

  • The company's blog post mostly confirmed what we already knew, as detailed by Cloudflare: With a routine maintenance query, Facebook somehow managed to (metaphorically) blow up the roads leading from the outside internet to the servers that run Facebook, Instagram, WhatsApp and other properties.
  • Facebook operates a massive network that includes its own data centers, like the one in Ireland pictured above, as well as smaller facilities called "points of presence" that are scattered around the world to collect inbound traffic and direct it across Facebook's private network to its eventual destination.
  • Computers and networking equipment have a tendency to fail for myriad reasons, and checking to see if anything is broken across that network is a routine part of the engineering staff's job.
  • But on Monday morning, that routine check was somehow executed as a command to withdraw all of the connections between Facebook's backbone network and the broader world.
  • And an audit tool that was supposed to detect potentially catastrophic errors in configuration changes failed because "a bug in that audit tool prevented it from properly stopping the command," the company said in its post.
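
To make the audit step concrete, here is a minimal sketch in Python of the kind of pre-flight check such a tool might perform. It is not Facebook's tool, whose internals the company did not describe: the names `Change`, `audit_change` and `MIN_REMAINING_FRACTION`, the link names and the 50% threshold are all assumptions invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical policy: refuse any change that would leave fewer than
# half of the currently active backbone links in service.
MIN_REMAINING_FRACTION = 0.5


@dataclass(frozen=True)
class Change:
    """A proposed configuration change (illustrative model only)."""
    description: str
    links_to_withdraw: frozenset


def audit_change(active_links: set, change: Change) -> bool:
    """Return True if the change is safe to run, False if it must be blocked."""
    if not active_links:
        # Nothing is up, so there is nothing that is safe to withdraw.
        return False
    remaining = active_links - change.links_to_withdraw
    return len(remaining) / len(active_links) >= MIN_REMAINING_FRACTION


active = {"link-a", "link-b", "link-c", "link-d"}
routine = Change("drain one link for maintenance", frozenset({"link-a"}))
runaway = Change("withdraw every backbone link", frozenset(active))

routine_ok = audit_change(active, routine)   # 3 of 4 links remain
runaway_ok = audit_change(active, runaway)   # 0 of 4 links remain
print(routine_ok, runaway_ok)
```

The point of the sketch is that the guard sits between the engineer's command and the network, so a bug in the guard itself removes the last line of defense. That is why tools like this arguably deserve the same testing rigor as the systems they protect.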

Facebook's infrastructure choices compounded the problem. Decisions made long ago about its internal architecture made recovering from this error far harder than it would have been for many other companies.

  • Facebook relies almost entirely on its own infrastructure and custom-built services for nearly everything it needs to run its operation, compared to other tech companies of its size and resources that use third-party providers for at least some, if not all, of their infrastructure needs.
  • That includes DNS servers, which live in those smaller point-of-presence facilities. Those servers tell Facebook's data centers where incoming requests for its content are coming from and also give browsers requesting "facebook.com" a computer-friendly route to that destination.
  • Facebook's DNS servers were designed to tell inbound requests for "facebook.com" to avoid a particular route to a data center if they detected a problem with that path, because any prolonged delay would result in a poor user experience. On normal days, there are way more working paths than faulty ones and it's easy to find a quick detour.
  • However, when all of those paths disappeared, those otherwise-operational DNS servers had no idea where Facebook's servers had gone, forcing them to return error messages to phones and browsers.
  • And to make matters even more difficult, Facebook's internal communications and disaster recovery tools relied on connections to the facilities that housed those DNS servers.
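
The failure mode described above can be shown with a toy model in Python. It assumes, as a simplification, that each DNS site keeps offering itself only while it can reach at least one data center. The site names, the counts and the functions `should_advertise` and `sites_still_advertising` are hypothetical and do not come from Facebook.

```python
def should_advertise(reachable_data_centers: int) -> bool:
    """A DNS site offers itself only if it can reach at least one data center."""
    return reachable_data_centers > 0


def sites_still_advertising(reachability: dict) -> list:
    """Return the DNS sites that would keep answering queries, sorted by name."""
    return sorted(
        site for site, count in reachability.items() if should_advertise(count)
    )


# Normal day: one site has lost its paths, and the others cover for it.
normal_day = {"pop-1": 3, "pop-2": 0, "pop-3": 2}
# Outage: the backbone is gone, so every site sees zero reachable data centers.
backbone_down = {"pop-1": 0, "pop-2": 0, "pop-3": 0}

normal_result = sites_still_advertising(normal_day)
outage_result = sites_still_advertising(backbone_down)
print(normal_result)
print(outage_result)
```

A rule that is sensible for each site on its own ("step aside if you look unhealthy") leaves nobody answering when every site fails the same check at once. One possible design response, not something Facebook has said it uses, is a floor that keeps a minimum number of sites advertising no matter what the health check reports.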

And its response was littered with roadblocks. Everything described so far happened in the span of about two minutes Monday morning as the West Coast work day got underway. Mistakes happen at web scale; it's how companies recover that matters, and that recovery was rockier than it had to be.

  • Somehow, Facebook's out-of-band connection to its servers — the normal backup plan when the primary network goes down — also failed. That meant physical access to one of its data-center facilities was required to fix the problem.
  • You can't just walk into a Facebook data center, or really any data center; access is strictly controlled at each step of the journey from the perimeter of those several-acre facilities down to the room where the equipment is stored.
  • So while Facebook didn't actually have to saw through its server cages to fix the problem, ensuring the right people with the right expertise were allowed to enter the closest building and access the relevant computers took more time than anyone would have liked.

Every big outage is a learning opportunity, even for a company like Facebook that appears unwilling to learn from its mistakes in other areas. Here are three big takeaways from this one:

  • Plan for the worst. Enterprises need a contingency plan for the complete loss of their computing resources or network connection, not just the loss of a data center or cloud region.
  • Hedge your bets. It's extremely unlikely that the entire internet will go down at the same time; hedging at least a few bets across multiple service providers could be worth the effort.
  • Check your priorities. There's no way to run an operation the size of Facebook without a serious amount of automation, which means code auditing tools like the one that failed to stop this outage need extra attention.

For some reason, though, the customary calls for #hugops in the wake of an outage were a little more muted than usual on Monday.

— Tom Krazit

This Week On Protocol

PCs A-OK: Windows 11 came out this week, as Microsoft's original cash cow hurtles into the future. Protocol's David Pierce talked to Microsoft's Panos Panay about some of the decisions behind the development of Windows 11 and why the work-from-home boom gave new life to the PC market.

Layaway 2.0: Our colleagues at Protocol | Fintech released the latest Protocol Manual on the surge of interest in "buy now, pay later" services and apps. Fintech could be an enormous opportunity for enterprise tech providers, because those transactions aren't going to process themselves.

Protocol Event

The Inside View with Bill McDermott

ServiceNow is quickly becoming one of enterprise technology's most well-known names. The company started by focusing on helping IT departments manage their workloads, but is quickly expanding to other verticals and, on the way, becoming a deeper rival to other software giants like Salesforce.

We'll talk to CEO Bill McDermott on Oct. 12 at 10 a.m. PT / 1 p.m. ET to learn what's ahead for the company and how it plans to hit $15 billion in annual revenue. RSVP here.

Five Questions For...

Idit Levine, founder and CEO, solo.io

What was the first computer that got you excited about technology?

The iPhone. The idea of a smartphone was not new and had been tried (and failed) before. But Steve Jobs did what he knew how to do best: He focused on the user experience. The iPhone changed the way we communicate and live our lives, and it is mind-blowing to think how much computing power we now literally hold in our hands.

If Protocol gave you $1 billion to start a new enterprise tech company from scratch today, what would you do?

The climate crisis is the single most important unmet challenge of our time. Mitigating the climate crisis and adapting to it requires innovative technological solutions that span multiple domains, including information technologies, artificial intelligence, bioengineering and advanced architecture and agriculture. Nobody knows better than an enterprise tech company how to combine a broad range of technologies to develop and execute solutions to hard problems.

What's your favorite pastime that doesn't involve a screen?

Before tech, I was a professional basketball player. To this day nothing calms me or helps me find focus like a challenging basketball match or a volleyball tournament.

Which enterprise tech legend motivates you the most?

Steve Jobs. Jobs understood that great ideas are only the beginning. Turning such ideas into truly successful products involves building the right team of excellent people, thinking hard about what the user needs (even if they don't know it yet), paying attention to details and uncompromising execution.

What will be the greatest challenge for enterprise tech over the coming decade?

Doing no harm. Information, communication and data-driven technologies, coupled with artificial intelligence, have the power to improve our everyday lives and provide us with opportunities for creativity, exploration and communication. But at the same time they pose great risks to our emotional well-being, our self-image, our privacy and our sense of community. Enterprise tech will have to develop the methodology, sensibility and openness to mitigate these risks.

Around the Enterprise

  • VMware unveiled new multicloud management tools built around its virtualization software during the first VMworld conference held under new CEO Raghu Raghuram this week.
  • New AWS CEO Adam Selipsky broke with tradition and acknowledged that some on-premises applications "will never move" to public cloud services. That's a departure from former AWS CEO Andy Jassy's usual line that everyone will be on the cloud "in the fullness of time."
  • Some news from solo.io and its founder and CEO Idit Levine, whom you just met in the section above: The company just raised a $135 million Series C round that values it at $1 billion.
  • Curious about all the talk about Web3 and what it might mean for application development? The New Stack took a look at this emerging technology.
  • This won't help the chip shortage: Samsung said its next-generation chip technology will be delayed into next year.
  • The U.S. government plans to introduce legislation that will require airports, airlines and railway transportation companies to adopt stricter cybersecurity measures amid fears of a repeat of the Colonial Pipeline ransomware attack.
  • There's still a shortage of cloud talent, making it very hard for companies to find the people they need to operate safely and efficiently, according to The Wall Street Journal.
  • D-Wave is pivoting. It long stood apart from other companies pursuing quantum computers with its annealing technology, but it now plans to build a gate-model quantum computer similar to the ones under development by IBM, Microsoft, Google and others.

Thanks for reading — see you Monday!
