Source Code: What matters in tech, in your inbox every morning

Protocol Cloud
Your weekly guide to the future of enterprise computing.

How Microsoft scrambled to fix its COVID cloud capacity crunch

Welcome to Protocol Cloud, your comprehensive roundup of everything you need to know about the week in cloud and enterprise software. This week: How Microsoft scrambled to fix its COVID cloud capacity crunch, devs are trying to improve the diversity of their lexicon, and ThousandEyes CEO Mohit Lad talks about overlooked cloud dependencies.

(Was this email forwarded to you? Sign up here to get Protocol Cloud every week.)

The Big Story

This is not a drill

What would you do if demand for several of your cloud services doubled overnight?

That sounds like an interesting hypothetical question for a planning session. Or a Google interview question. But it's what happened to Microsoft in March, as the pandemic took hold of Europe and the U.S. after devastating China earlier in the year.

  • Now that demand has stabilized, Microsoft on Tuesday released several details about the technical steps it took to handle the surge, in which usage almost doubled across several of its key services, including Teams, Virtual Desktop, and Xbox Live, during one week in early March.

Some of the details are fascinating. In a video, Microsoft Azure CTO Mark Russinovich explained how the company moved traffic around the globe and rewrote code to tweak the way some of its applications consume computing resources, all in just a few weeks.

  • For example, Microsoft turned off the little "your co-worker is typing" and read receipt notifications for Teams users in peak-demand regions, reducing the CPU capacity required to process those functions by 30% and returning that capacity to Azure customers.
  • The company also pleaded with local ISPs and video-game publishers to delay the release of game updates over Xbox Live until after business hours, freeing up computing capacity that could also be returned to Azure customers.
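To make the Teams change concrete, here is a minimal sketch of that kind of load shedding: switching off non-essential features once a region runs hot. Microsoft has not published its implementation, so the feature names, the threshold, and the `enabled_features` function are all assumptions made for illustration.

```python
# Hypothetical feature names and threshold; not Microsoft's actual configuration.
NON_ESSENTIAL = frozenset({"typing_indicator", "read_receipts"})
CPU_SHED_THRESHOLD = 0.85  # assumed fraction of regional CPU capacity in use


def enabled_features(all_features, cpu_utilization):
    """Return the features to keep serving in a region at the given CPU load."""
    if cpu_utilization >= CPU_SHED_THRESHOLD:
        # Region is at peak demand: drop the nice-to-have features.
        return set(all_features) - NON_ESSENTIAL
    return set(all_features)


features = {"chat", "calls", "typing_indicator", "read_receipts"}
print(sorted(enabled_features(features, 0.92)))
print(sorted(enabled_features(features, 0.40)))
```

The design choice worth noting is that core functions such as chat and calls are never candidates for shedding; only the features listed as non-essential are.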

Microsoft also changed its networking strategy as people left their office parks, with their local-area networks and better internet connections, and started working from home, which put a strain on wide-area networks.

  • Azure relies on four undersea cables between the U.S. and Europe to handle traffic traveling across the pond, and to increase capacity on one of those cables it "borrowed" some advanced networking equipment from another Azure region to quickly upgrade that connection.
  • It moved some Xbox Live activity from China and Europe back to the U.S. in order to free up Azure capacity for business customers in those regions.
  • Engineers quickly built a time-based system for managing traffic between regions, which automatically balanced the surges in traffic as people in one region woke up and started working while people in other regions were off the clock.
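A time-based balancer like the one in the last bullet could look roughly like the sketch below, which sends overflow traffic to regions that are outside local business hours. This is an illustration under stated assumptions, not Microsoft's system: the region names, UTC offsets, business-hours window, and function names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical regions and fixed UTC offsets (no daylight saving handling).
REGION_UTC_OFFSETS = {"us-west": -8, "europe-west": 1, "asia-east": 8}
BUSINESS_HOURS = range(8, 18)  # 08:00 to 17:59 local time, assumed peak window


def local_hour(region, now_utc):
    """Hour of the day in the region's local time."""
    return (now_utc + timedelta(hours=REGION_UTC_OFFSETS[region])).hour


def is_peak(region, now_utc):
    """True while the region is inside its assumed business hours."""
    return local_hour(region, now_utc) in BUSINESS_HOURS


def overflow_weights(now_utc):
    """Split overflow traffic evenly across regions that are off the clock."""
    off_peak = [r for r in REGION_UTC_OFFSETS if not is_peak(r, now_utc)]
    if not off_peak:
        return {}
    share = 1.0 / len(off_peak)
    return {region: share for region in off_peak}


morning_in_europe = datetime(2020, 3, 16, 9, 0, tzinfo=timezone.utc)
print(overflow_weights(morning_in_europe))
```

A real system would presumably weight regions by measured spare capacity and network distance rather than splitting evenly, but the clock-driven shift is the core idea.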

It also relied on the tried-and-true method of building capacity: buying all the servers it could get its hands on.

  • During its last earnings call, Microsoft cited supply-chain delays from hard-hit server manufacturing regions in China during January as another part of the scaling problem.
  • The trouble with upgrading amid the pandemic, though, was that installing new hardware required people, who had to figure out how to quickly upgrade racks of servers while staying six feet apart from each other.
  • Still, Azure added 12 new so-called "edge sites," essentially mini-datacenters that serve as entry points into the broader Azure network in parts of the world far from actual Azure data centers.

The experience validated a few modern application design philosophies. Microsoft now plans to shift Teams from virtual machines to containers, for instance, and said it found it easier to quickly scale and adjust because several Azure applications were designed around microservices.

Still, it's worth noting that Microsoft was the lone company among the Big Three cloud providers to endure this type of crunch during the first half of the year, and that it was already struggling with Azure capacity long before most people had heard of COVID-19.

  • The company deserves credit for quickly reacting to an unprecedented event, but if AWS and Google faced similar challenges this year, they kept them quiet.

Join Us Next Week

Workday

Protocol's Transformation of Work Summit

How can tech help identify and match in-demand skills with job opportunities? Speakers include Future of Work Caucus co-chairs Representative Lisa Blunt Rochester (D-DE) and Representative Bryan Steil (R-WI), CEO of Jobs for the Future Maria Flynn, CEO of Burning Glass Technologies Matthew Sigelman, CEO of Colorado State University Global Dr. Becky Takeda-Tinker, and Chief People Officer of Aon Lisa Stevens. Presented by Workday.

Register Now

This Week in Protocol

Open for business: MongoDB CEO Dev Ittycheria helped kick off years of hand-wringing about the future of open-source software when he changed the licensing policies around his company's open-source database in 2018. Surging sales of MongoDB Atlas, a commercial managed version of that database, suggest the strategy is paying off.

New normal: Microsoft has thousands of engineers, designers, and software developers who can help it react to an event like the pandemic, but what is a local brewery supposed to do? Protocol's Mike Murphy examined how small businesses are trying to use tech to stay afloat.

Privacy, interrupted: The downside of all this work-from-home tech is that your employer is now collecting more data about your work and personal habits than ever before. Issie Lapowsky spoke with several privacy experts who are concerned that this intrusion won't end with the pandemic.

Five Questions For...

ThousandEyes CEO Mohit Lad

What was your first tech job?

After completing my Ph.D. in computer science from UCLA, I got my first tech job in 2008 at a network performance startup in Santa Clara called Packet Design. The recession quickly caught up and I was laid off just two months in. It turned out to be a great thing as it forced me to think about what I really wanted to do and I ended up focusing on starting ThousandEyes.

What's the best piece of advice you could give to someone starting their first tech job?

Pick an area that you are passionate about and then find a company that speaks to you and whose mission you can get behind, rather than taking the highest paying job. Learn about what people in other departments do and how everything comes together.

Mac or PC?

I used to favor Linux but nowadays you will typically find me on a Mac since I can still use the terminal to do the things that I was used to on Linux, while still being able to benefit from the overall Mac OS experience. PCs have come a long way, though, and there are some very cool laptops that make me consider switching every now and then.

What was the biggest reason for the success of cloud computing over the past decade?

Ultimately I think the reason cloud has been so successful is that it allows for a focus on core competencies. If you are building a messaging app, you don't have to spend months building a data center, or if you are a startup focusing on acquiring customers, you don't have to spend days or weeks setting up [customer relationship management] — all you need is a browser.

What will be the biggest challenge for cloud computing over the coming decade?

In one of our first offices in San Francisco, where the lights would automatically turn off at 7 p.m. to conserve electricity, we automated the procedure using a script and Twilio to automatically dial a number and enter a code to turn them back on. Then one day the lights went off and we found out that our script was fine, but that Twilio was impacted by an Amazon outage on the East Coast. Our lights in our own office were out because of an outage on the other side of the country. The lesson learned? That the biggest challenge in cloud computing is the exponential amount of increased dependencies between different parties to and from the cloud, a lot of which people don't understand. That means more things will break in ways that are difficult to expect and, when they do, it will create massive disruptions similar to the impact felt every time Amazon suffers an outage.

Around The Cloud

Thanks for reading — see you next week.
