How Snap rebuilt the infrastructure that now supports 347 million daily users

Snapchat relied on microservices and a multicloud strategy to overhaul its technology approach as it grew.

Jerry Hunter

Jerry Hunter, senior vice president of engineering at Snap, told Protocol about its infrastructure.

Photo: Snap

In 2017, 95% of Snap’s infrastructure was running on Google App Engine. Then came the Annihilate FSN project.

Snap, which launched in 2011, was built on GAE — FSN (Feelin-So-Nice) was the name for the original back-end system — and the majority of Snapchat’s core functionality was running within a monolithic application on it. While the architecture initially was effective, Snap started encountering issues when it became too big for GAE to handle, according to Jerry Hunter, senior vice president of engineering at Snap, where he runs Snapchat, Spectacles and Bitmoji as well as all back-end or cloud-based infrastructure services.

“Google App Engine wasn't really designed to support really big implementations,” Hunter, who joined the company in late 2016 from AWS, told Protocol. “We would find bugs or scaling challenges when we were in our high-scale periods like New Year's Eve. We would really work hard with Google to make sure that we were scaling it up appropriately, and sometimes it just would hit issues that they had not seen before, because we were scaling beyond what they had seen other customers use.”

Today, less than 1.5% of Snap’s infrastructure sits on GAE, a serverless platform for developing and hosting web applications, after the company broke apart its back end into microservices backed by other services inside of Google Cloud Platform (GCP) and added AWS as its second cloud computing provider. Snap now picks and chooses which workloads to place on AWS or GCP under its multicloud model, playing the competitive edge between them.

The Annihilate FSN project came with the recognition that microservices would provide a lot more reliability and control, especially from a cost and performance perspective.

“[We] basically tried to make the services be as narrow as possible and then backed by a cloud service or multiple cloud services, depending on what the service we were providing was,” Hunter said.

Snapchat now has 347 million daily active users who send billions of short videos, send photos called Snaps or use its augmented-reality Lenses.

Its new architecture has resulted in a 65% reduction in compute costs, and Hunter said he has come to deeply understand the importance of having competitors in Snap’s supply chain.

“I just believe that providers work better when they've got real competition,” said Hunter, who left AWS as a vice president of infrastructure. “You just get better … pricing, better features, better service. We're cloud-native, and we intend on staying that way, and it's a big expense for us. We save a lot of money by having two clouds.”

The Annihilate FSN process wasn’t without at least one failed hypothesis. Hunter mistakenly thought that Snap could write its applications on one layer and that layer would use the cloud provider that best fit a workload. That proved to be way too hard, he said.

“The clouds are different enough in most of their services and changing rapidly enough that it would have taken a giant team to build something like that,” he said. “And neither of the cloud providers were interested at all in us doing that, which makes sense.”

Instead, Hunter said, there are three types of services that he looks at from the cloud.

“There's one which is cloud-agnostic,” he said. “It's pretty much the same, regardless of where you go, like blob storage or [content-delivery networks] or raw compute on EC2 or GCP. There's a little bit of tuning if you're doing raw compute but, by and large, those services are all pretty much equal. Then there's sort of mixed things where it's mostly the same, but it really takes some engineering work to modify a service to run on one provider versus the other. And then there's things that are very cloud-specific, where … only one cloud offers it and the other doesn't. We have to do this process of understanding where we're going to spend our engineering resources to make our services work on whichever cloud that it is.”


Snap’s current architecture also has resulted in reduced latency for Snapchatters.

In its early days, Snap had its back-end monolith hosted in a single region in the middle of the United States — Oklahoma — which impacted performance and the ability for users to communicate instantly. If two people living a mile apart in Sydney, Australia, were sending Snaps to each other, for example, the video would have to traverse Australia's terrestrial network and an undersea cable to the United States, be deposited in a server in Oklahoma and then backtrack to Australia.

“If you and I are in a conversation with each other, and it's taking seconds or half a minute for that to happen, you're out of the conversation,” Hunter said. “You might come back to it later, but you've missed that opportunity to communicate with a friend. Alternatively, if I have just the messaging stack sitting inside of the data center in Sydney … now you're traversing two miles of terrestrial cable to a data center that's practically right next to you, and the entire transaction is so much faster.”

If I want to experiment and move something to Sydney or Singapore or Tokyo, I can just do it.

Snap wanted to regionalize its services where it made sense. The only way to do that was by using microservices and understanding which services were useful to have close to the customer and which ones weren't, Hunter said.

“Customers benefit by having data centers be physically closer to them because performance is better,” he said. “CDNs can cover a lot of the broadcast content, but when doing one-on-one communications with people — people send Snaps and Snap videos — those are big chunks of data to move through the network.”

That ability to switch regions is one of the benefits of using cloud providers, Hunter said.

“If I want to experiment and move something to Sydney or Singapore or Tokyo, I can just do it,” he said. “I'm just going to call them up and say, ‘OK, we're going to put our messaging stack in Tokyo,’ and the systems are all there, and we try it. If it turns out it doesn't actually make a difference, we turn that service off and move it to a cheaper location.”

Delta Force

Snap has built more than 100 services for very specific functions, including Delta Force.

In 2016, any time a user opened the Snapchat app, it would download or redownload everything, including stories that a user had already looked at but hadn’t yet timed out in the app.

“It was … a naive deployment of just ‘download everything so that you don't miss anything,’” Hunter said. “Delta Force goes and looks at the client … finds out all the things that you've already downloaded and are still on your phone, and then only downloads the things that are net-new.”

This approach had other benefits.

“Of course, that turns out to make the app faster,” Hunter said. “It also costs us way less, so we reduced our costs enormously by implementing that single service.”

Open source

Snap uses open-source software to create its infrastructure, including Kubernetes for service development, Spinnaker for its application team to deploy software, Spark for data processing and memcached/KeyDB for caching. “We have a process for looking at open source and making sure we're comfortable that it's safe and that it's not something that we wouldn't want to deploy in our infrastructure,” Hunter said.

Snap also uses Envoy, an edge and service proxy and universal data plane designed for large, microservice service-mesh architectures.

“I actually feel like … the way of the future is using a service mesh on top of your cloud to basically deploy all your security protocols and make sure that you've got the right logins and that people aren't getting access to it that shouldn't,” Hunter said. “I'm happy with the Envoy implementations giving us a great way of managing load when we're moving between clouds.”

Cloud primitives, ‘moving fast’ and cost camp

Hunter prefers using primitives or simple services from AWS and Google Cloud rather than managed services. A Snap philosophy that serves it well is the ability to move very fast, Hunter said.

“I don't expect my engineers to come back with perfectly efficient systems when we're launching a new feature that has a service as a back end,” he said, noting many of his team members previously worked for Google or Amazon. “Do what you have to do to get it out there, let's move fast. Be smart, but don't spend a lot of time tuning and optimizing. If that service doesn't take off, and it doesn't get a lot of use, then leave it the way it is. If that service takes off, and we start to get a lot of use on it, then let's go back and start to tune it.”

Our total compute cost is so large that little bits of tuning can have really large amounts of cost savings for us.

It’s through that tuning process of understanding how a service operates where cycles of cloud usage can be reduced and result in instant cost savings, according to Hunter.

“Our total compute cost is so large that little bits of tuning can have really large amounts of cost savings for us,” he said. “If you're not making the sort of constant changes that we are, I think it's fine to use the managed services that Google or Amazon provide. But if you're in a world where we're constantly making changes — like daily changes, multiple-times-a-day changes — I think you want to have that technical expertise in house so that you can just really be on top of things.”

Three factors figure into Snap’s ability to reap cost savings: the competition between AWS and Google Cloud, Snap’s ability to tweeze out costs as a result of its own work and going back to the cloud providers and looking at their new products and services.

“We're in a state of doing those three things all the time, and between those three, [we save] many tens of millions of dollars,” Hunter said.

Snap holds a “cost camp” every year where it asks its engineers to find all the places where costs possibly could be reduced.

“We take that list and prioritize that list, and then I cut people loose to go and work on those things,” he said. “On an annual basis depending on the year, it's many tens of millions dollars of cost savings.”

Adding a third cloud provider and advice on going multicloud

Snap has considered adding a third cloud provider, and it could still happen some day, although the process is pretty challenging, according to Hunter.

“It's a big lift to move into another cloud, because you've got those three layers,” he said. “The agnostic stuff is pretty straightforward, but then once you get to mixed and cloud-specific, you've got to go hire engineers that are good at that cloud, or you've got to go train your team up on … the nuances of that cloud.”

Enterprises considering adding another cloud provider need to make sure they have the engineering staff to pull it off: 20 to 30 dedicated cloud people as a starting point, Hunter said.

“It's not cheap, and second, that team has to be pretty sophisticated and technical,” he said. “If you don't have a big deployment, it's probably not worth it. I think about a lot of the customers I used to serve when I was in AWS, and the vast majority of them, their implementations … were serving their company's internal stuff, and it wasn't gigantic. If you're in that boat, it's probably not worth the extra work that it takes to do multicloud.”


Judge Zia Faruqui is trying to teach you crypto, one ‘SNL’ reference at a time

His decisions on major cryptocurrency cases have quoted "The Big Lebowski," "SNL," and "Dr. Strangelove." That’s because he wants you — yes, you — to read them.

The ways Zia Faruqui (right) has weighed on cases that have come before him can give lawyers clues as to what legal frameworks will pass muster.

Photo: Carolyn Van Houten/The Washington Post via Getty Images

“Cryptocurrency and related software analytics tools are ‘The wave of the future, Dude. One hundred percent electronic.’”

That’s not a quote from "The Big Lebowski" — at least, not directly. It’s a quote from a Washington, D.C., district court memorandum opinion on the role cryptocurrency analytics tools can play in government investigations. The author is Magistrate Judge Zia Faruqui.

Keep ReadingShow less
Veronica Irwin

Veronica Irwin (@vronirwin) is a San Francisco-based reporter at Protocol covering fintech. Previously she was at the San Francisco Examiner, covering tech from a hyper-local angle. Before that, her byline was featured in SF Weekly, The Nation, Techworker, Ms. Magazine and The Frisc.

The financial technology transformation is driving competition, creating consumer choice, and shaping the future of finance. Hear from seven fintech leaders who are reshaping the future of finance, and join the inaugural Financial Technology Association Fintech Summit to learn more.

Keep ReadingShow less
The Financial Technology Association (FTA) represents industry leaders shaping the future of finance. We champion the power of technology-centered financial services and advocate for the modernization of financial regulation to support inclusion and responsible innovation.

AWS CEO: The cloud isn’t just about technology

As AWS preps for its annual re:Invent conference, Adam Selipsky talks product strategy, support for hybrid environments, and the value of the cloud in uncertain economic times.

Photo: Noah Berger/Getty Images for Amazon Web Services

AWS is gearing up for re:Invent, its annual cloud computing conference where announcements this year are expected to focus on its end-to-end data strategy and delivering new industry-specific services.

It will be the second re:Invent with CEO Adam Selipsky as leader of the industry’s largest cloud provider after his return last year to AWS from data visualization company Tableau Software.

Keep ReadingShow less
Donna Goodison

Donna Goodison (@dgoodison) is Protocol's senior reporter focusing on enterprise infrastructure technology, from the 'Big 3' cloud computing providers to data centers. She previously covered the public cloud at CRN after 15 years as a business reporter for the Boston Herald. Based in Massachusetts, she also has worked as a Boston Globe freelancer, business reporter at the Boston Business Journal and real estate reporter at Banker & Tradesman after toiling at weekly newspapers.

Image: Protocol

We launched Protocol in February 2020 to cover the evolving power center of tech. It is with deep sadness that just under three years later, we are winding down the publication.

As of today, we will not publish any more stories. All of our newsletters, apart from our flagship, Source Code, will no longer be sent. Source Code will be published and sent for the next few weeks, but it will also close down in December.

Keep ReadingShow less
Bennett Richardson

Bennett Richardson ( @bennettrich) is the president of Protocol. Prior to joining Protocol in 2019, Bennett was executive director of global strategic partnerships at POLITICO, where he led strategic growth efforts including POLITICO's European expansion in Brussels and POLITICO's creative agency POLITICO Focus during his six years with the company. Prior to POLITICO, Bennett was co-founder and CMO of Hinge, the mobile dating company recently acquired by Match Group. Bennett began his career in digital and social brand marketing working with major brands across tech, energy, and health care at leading marketing and communications agencies including Edelman and GMMB. Bennett is originally from Portland, Maine, and received his bachelor's degree from Colgate University.


Why large enterprises struggle to find suitable platforms for MLops

As companies expand their use of AI beyond running just a few machine learning models, and as larger enterprises go from deploying hundreds of models to thousands and even millions of models, ML practitioners say that they have yet to find what they need from prepackaged MLops systems.

As companies expand their use of AI beyond running just a few machine learning models, ML practitioners say that they have yet to find what they need from prepackaged MLops systems.

Photo: artpartner-images via Getty Images

On any given day, Lily AI runs hundreds of machine learning models using computer vision and natural language processing that are customized for its retail and ecommerce clients to make website product recommendations, forecast demand, and plan merchandising. But this spring when the company was in the market for a machine learning operations platform to manage its expanding model roster, it wasn’t easy to find a suitable off-the-shelf system that could handle such a large number of models in deployment while also meeting other criteria.

Some MLops platforms are not well-suited for maintaining even more than 10 machine learning models when it comes to keeping track of data, navigating their user interfaces, or reporting capabilities, Matthew Nokleby, machine learning manager for Lily AI’s product intelligence team, told Protocol earlier this year. “The duct tape starts to show,” he said.

Keep ReadingShow less
Kate Kaye

Kate Kaye is an award-winning multimedia reporter digging deep and telling print, digital and audio stories. She covers AI and data for Protocol. Her reporting on AI and tech ethics issues has been published in OneZero, Fast Company, MIT Technology Review, CityLab, Ad Age and Digiday and heard on NPR. Kate is the creator of RedTailMedia.org and is the author of "Campaign '08: A Turning Point for Digital Media," a book about how the 2008 presidential campaigns used digital media and data.

Latest Stories