Adam Selipsky, AWS CEO, on stage during his keynote address for Amazon's AWS re:Invent 2021.
Photo: Amazon Web Services, Inc.

At AWS, Virginia is for outages

Protocol Enterprise

Hello and welcome to Protocol | Enterprise. Today: The common thread behind many AWS outages, an AI technique that does more with less, and oh my, MySQL.

Eastbound and down

Northern Virginia is hallowed ground in the history of the internet. A foundational network exchange was built there in the early 1990s and it remains to this day an important crossroads for the world’s data, as well as the home of AWS’ oldest, largest, most important and most troublesome cloud complex: US-East-1.

Something inside US-East-1 failed again Tuesday to dramatic effect, taking down lots of apps and websites (including Protocol, for about an hour) with an outage that lasted several hours and rippled across a number of AWS services. Even Amazon’s own delivery drivers were affected, with the outage causing around a day’s worth of shipping delays during this most wonderful time of the year.

AWS’ US-East-1 region has a checkered past: It’s been responsible for nine of the 17 major outages in AWS history, as tracked by Wojciech Gawroński on his blog, including the one that took down Slack and several other companies last year.

  • The flagship EC2 computing service was launched in US-East-1 in August 2006, and it is still the default region for many AWS services and external applications when they are first created.
  • Judging by a flood of social media posts and obvious outages across the internet Tuesday, something broke around 8 a.m. PT.
  • About 90 minutes later, AWS acknowledged problems in, you guessed it, US-East-1, and later in the day attributed those problems to “an impairment of several network devices” in the Bermuda Cloud Triangle.
  • As of Thursday morning, AWS had not released further details on the incident, which it declared mostly resolved at 4:35 p.m. PT on Tuesday.

Outages are a fact of life on the cloud: “Everything fails all the time,” no less an authority than Amazon CTO Werner Vogels has said many times. What’s notable about this one is that so many internal AWS systems were affected.

  • “This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates,” AWS said in an update at 11:26 a.m. PT.
  • While outages are a fact of life, dependencies are controllable.
  • While we still can’t tell exactly what happened Tuesday, it appears that some core internal AWS tools and services were designed in a way that depended on servers in US-East-1 and lacked a quick fallback plan (a rough sketch of what such a fallback can look like follows this list).
  • It took several hours after AWS said it had identified the root cause of the outage for all of its affected services to return to normal, while plenty of AWS customers in the region recovered more quickly.
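
For the curious, here is roughly what a quick fallback plan can look like at its simplest, as a hedged Python sketch using boto3. It assumes a DynamoDB global table replicated outside US-East-1 (the table and region names are made up), pins each client to an explicit region rather than inheriting the us-east-1 default, and tries a second region when the first is unreachable. It is not how AWS’s internal tooling works; it just shows the shape of the idea.

    # A minimal sketch of not depending on a single region: read from a
    # DynamoDB global table (assumed here to be replicated to both regions),
    # pinning each client to an explicit region instead of inheriting the
    # us-east-1 default, and falling back to the second region when the
    # first is unreachable. Table and region names are illustrative.
    import boto3
    from botocore.exceptions import ClientError, ConnectTimeoutError, EndpointConnectionError

    REGIONS = ["us-east-1", "us-west-2"]   # primary first, fallback second
    TABLE_NAME = "orders"                  # hypothetical global table

    def get_order(order_id: str) -> dict:
        last_error = None
        for region in REGIONS:
            table = boto3.resource("dynamodb", region_name=region).Table(TABLE_NAME)
            try:
                item = table.get_item(Key={"order_id": order_id}).get("Item")
                if item is not None:
                    return item
            except (ClientError, ConnectTimeoutError, EndpointConnectionError) as err:
                last_error = err   # note the failure, then try the next region
        raise RuntimeError(f"no configured region could serve the read: {last_error}")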

So is this a pertinent reminder to go multicloud? While salespeople from Microsoft and Google were no doubt working their phones and emailing clients Wednesday urging AWS customers to start thinking about it, the multicloud route isn’t as easy as it might seem.

  • “Before you even fantasize about multicloud for availability, you should be multi-AZ (availability zones) in multiple regions, and have maximized your resilience through proper application design/implementation, thoroughly tested through chaos engineering,” said Gartner’s Lydia Leong, who has seen more than anyone’s fair share of real-world cloud deployments.
  • Availability zones are isolated groups of data centers within a region like US-East-1, designed to keep applications available when equipment fails in one of them, but customers have to build applications with those zones in mind (a bare-bones illustration follows this list).
  • AWS data egress costs — even in some cases, data transfers across availability zones — can be prohibitively expensive for some types of applications, even after recent changes.
  • And different cloud companies accomplish similar tasks in different ways, which can force employees to learn a whole new idiosyncratic workflow to get the same job done.
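
And here is what the multi-AZ part of Leong’s advice looks like at its most basic, as a bare-bones Python/boto3 sketch: discover the zones that are currently available in a region and spread placements across them, rather than letting everything land in one zone. It leaves out all the real launch details and is only meant to illustrate the spread.

    # A bare-bones illustration of spreading placements across availability
    # zones rather than pinning everything to one: list the zones that are
    # currently available in a region and hand them out round-robin. Launch
    # details (AMIs, subnets, load balancers) are deliberately left out.
    from itertools import cycle
    import boto3

    def zone_cycle(region: str):
        ec2 = boto3.client("ec2", region_name=region)
        zones = [
            z["ZoneName"]
            for z in ec2.describe_availability_zones(
                Filters=[{"Name": "state", "Values": ["available"]}]
            )["AvailabilityZones"]
        ]
        return cycle(zones)   # endlessly round-robin over every available zone

    # Usage: give each new instance, task or replica the next zone in the cycle.
    zones = zone_cycle("us-east-1")
    print([next(zones) for _ in range(6)])   # alternates across us-east-1a, 1b, 1c, ...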

One thing everyone can do in the near future — even AWS — is reduce dependence on US-East-1 as a region, because that collection of data centers has been problematic for years.

  • Latency can be an important part of application performance, but AWS also offers a second eastern region, US-East-2, out of Ohio.
  • And if we’re being truthful, an awful lot of enterprise applications don’t need that kind of performance: A round trip to US-West-2 in Eastern Oregon, which never seems to go down and offers the same number of computing instance options as US-East-1, isn’t too much to ask in many cases (a quick way to check that round trip for yourself is sketched after this list).
  • While it seems likely that AWS has a plan to upgrade US-East-1 at some point, that process won’t be quick or easy.
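
If you want to put a number on that round trip, a rough Python sketch like the one below times a TCP handshake to each region’s public EC2 endpoint. It measures the network hop only, not your application, but it is usually enough to see whether Oregon is really out of reach.

    # Rough latency check: time a TCP handshake (DNS lookup included) to each
    # region's public EC2 endpoint. This measures the network hop, not your
    # application, but it's enough to compare Virginia, Ohio and Oregon.
    import socket
    import time

    ENDPOINTS = {
        "us-east-1 (Virginia)": "ec2.us-east-1.amazonaws.com",
        "us-east-2 (Ohio)": "ec2.us-east-2.amazonaws.com",
        "us-west-2 (Oregon)": "ec2.us-west-2.amazonaws.com",
    }

    for region, host in ENDPOINTS.items():
        start = time.perf_counter()
        with socket.create_connection((host, 443), timeout=5):
            elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{region}: {elapsed_ms:.0f} ms to open a connection")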

Virginia seems like a perfectly nice place. You just might not want to run your apps there.

— Tom Krazit (email | twitter)

A MESSAGE FROM WORKPLACE FROM META

Whether you work on the top floor or the shop floor, Workplace celebrates who you are and what you can bring to your business. Discover the place where you can be more you.

Learn more

This week on Protocol

re:Invent recap: Thanks to Liz Fong-Jones of Honeycomb, Sheila Gulati of Tola Capital and Corey Quinn of The Duckbill Group for joining me Wednesday morning to talk about last week’s AWS re:Invent conference and some hot-button issues heading into 2022. You can find a video of the session here, and look out for future references to what Gulati called “the cloud oligarchs” in upcoming editions of Protocol | Enterprise.

More with less: One of the main reasons the Big Data movement acquired that nickname is that most artificial intelligence techniques require massive amounts of data to even get started. Protocol’s Kate Kaye reported on a new technique called “few-shot learning” that Meta is testing to help it moderate content on Facebook, especially content written in languages for which it doesn’t have as much training data as it does for others.
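
To make the idea concrete (and to be clear, this is a generic illustration, not Meta’s system), few-shot classification can be as simple as averaging the embeddings of a handful of labeled examples into a per-class prototype and labeling new text by whichever prototype it lands closest to. The embed() function below is a random stand-in for a real multilingual encoder.

    # A toy illustration of few-shot classification (not Meta's system):
    # average the embeddings of a handful of labeled examples into one
    # "prototype" per class, then label new text by the nearest prototype.
    # embed() is a random stand-in for a real multilingual text encoder.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(text)) % (2**32))  # placeholder only
        return rng.standard_normal(16)

    def build_prototypes(examples: dict) -> dict:
        # One prototype per label, built from just a few examples each.
        return {label: np.mean([embed(t) for t in texts], axis=0)
                for label, texts in examples.items()}

    def classify(text: str, prototypes: dict) -> str:
        v = embed(text)
        return min(prototypes, key=lambda label: np.linalg.norm(v - prototypes[label]))

    prototypes = build_prototypes({
        "violates_policy": ["a post that breaks the rules", "another violating post"],
        "benign": ["an ordinary post", "another ordinary post"],
    })
    print(classify("a brand-new post to check", prototypes))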

No alt: It’s far past time for businesses operating on the internet to improve the accessibility of their products and services, because many of us don’t see the world in the same way as those who designed that tech. Protocol’s Aisha Counts looked at the debate over the best way to accomplish that task: Do we layer accessibility features on existing tech, or blow everything up and start over?

Upcoming at Protocol

Gaming platforms have traditionally been defined by their hardware, from arcades to personal computers to home consoles — and now, mobile phones. But cloud gaming, the rise of AR/VR and the promise of the metaverse have begun to redefine the very nature of gaming platforms and revolutionize the nature of play.

Join Protocol’s Nick Statt next Tuesday, Dec. 14, at 10 a.m. PT/ 1 p.m. ET for a virtual event discussing the future of our entertainment platforms with Frederic Descamps, CEO and co-founder of Manticore Games; Chris Mahoney, senior manager of central product development at Zynga; and Kellee Santiago, director of external publishing at Niantic. RSVP here.

Around the enterprise

Microsoft once again flexed its power over Office customers by preparing to charge them 20% more unless they switch to annual plans, after raising prices earlier this year.

AWS opened a new cloud region for government customers on the West Coast, offering yet another reason to stay out of Virginia.

Ansible was key to IBM’s Red Hat deal, and the infrastructure-as-code tool is now available on Microsoft Azure.

Data-center giant Equinix acquired MainOne, one of the top internet infrastructure providers in Africa, for $320 million.

“MySQL is a pretty poor database, and you should strongly consider using Postgres instead,” said Steinar Gunderson, an Oracle engineer who was working on MySQL until leaving the company this week.

Thanks for reading — see you Monday!
