How Snap rebuilt the infrastructure that now supports 347 million daily users

Snapchat relied on microservices and a multicloud strategy to overhaul its technology approach as it grew.

Jerry Hunter, senior vice president of engineering at Snap, told Protocol about its infrastructure.

Photo: Snap

In 2017, 95% of Snap’s infrastructure was running on Google App Engine. Then came the Annihilate FSN project.

Snap, which launched in 2011, was built on GAE — FSN (Feelin-So-Nice) was the name for the original back-end system — and the majority of Snapchat’s core functionality was running within a monolithic application on it. While the architecture initially was effective, Snap started encountering issues when it became too big for GAE to handle, according to Jerry Hunter, senior vice president of engineering at Snap, where he runs Snapchat, Spectacles and Bitmoji as well as all back-end or cloud-based infrastructure services.

“Google App Engine wasn't really designed to support really big implementations,” Hunter, who joined the company in late 2016 from AWS, told Protocol. “We would find bugs or scaling challenges when we were in our high-scale periods like New Year's Eve. We would really work hard with Google to make sure that we were scaling it up appropriately, and sometimes it just would hit issues that they had not seen before, because we were scaling beyond what they had seen other customers use.”

Today, less than 1.5% of Snap’s infrastructure sits on GAE, a serverless platform for developing and hosting web applications. The company broke apart its back end into microservices backed by other services inside Google Cloud Platform (GCP) and added AWS as its second cloud computing provider. Snap now picks and chooses which workloads to place on AWS or GCP under its multicloud model, playing the two providers’ competition to its advantage.

The Annihilate FSN project came with the recognition that microservices would provide a lot more reliability and control, especially from a cost and performance perspective.

“[We] basically tried to make the services be as narrow as possible and then backed by a cloud service or multiple cloud services, depending on what the service we were providing was,” Hunter said.

Snapchat now has 347 million daily active users who send billions of short videos and photos called Snaps or use its augmented-reality Lenses.

Its new architecture has resulted in a 65% reduction in compute costs, and Hunter said he has come to deeply understand the importance of having competitors in Snap’s supply chain.

“I just believe that providers work better when they've got real competition,” said Hunter, who left AWS as a vice president of infrastructure. “You just get better … pricing, better features, better service. We're cloud-native, and we intend on staying that way, and it's a big expense for us. We save a lot of money by having two clouds.”

The Annihilate FSN process wasn’t without at least one failed hypothesis. Hunter mistakenly thought that Snap could write its applications on one layer and that layer would use the cloud provider that best fit a workload. That proved to be way too hard, he said.

“The clouds are different enough in most of their services and changing rapidly enough that it would have taken a giant team to build something like that,” he said. “And neither of the cloud providers were interested at all in us doing that, which makes sense.”

Instead, Hunter said, there are three types of services that he looks at from the cloud.

“There's one which is cloud-agnostic,” he said. “It's pretty much the same, regardless of where you go, like blob storage or [content-delivery networks] or raw compute on EC2 or GCP. There's a little bit of tuning if you're doing raw compute but, by and large, those services are all pretty much equal. Then there's sort of mixed things where it's mostly the same, but it really takes some engineering work to modify a service to run on one provider versus the other. And then there's things that are very cloud-specific, where … only one cloud offers it and the other doesn't. We have to do this process of understanding where we're going to spend our engineering resources to make our services work on whichever cloud that it is.”


Snap’s current architecture also has resulted in reduced latency for Snapchatters.

In its early days, Snap had its back-end monolith hosted in a single region in Oklahoma, in the middle of the United States, which hurt performance and users’ ability to communicate instantly. If two people living a mile apart in Sydney, Australia, were sending Snaps to each other, for example, the video would have to traverse Australia's terrestrial network and an undersea cable to the United States, be deposited in a server in Oklahoma and then backtrack to Australia.

“If you and I are in a conversation with each other, and it's taking seconds or half a minute for that to happen, you're out of the conversation,” Hunter said. “You might come back to it later, but you've missed that opportunity to communicate with a friend. Alternatively, if I have just the messaging stack sitting inside of the data center in Sydney … now you're traversing two miles of terrestrial cable to a data center that's practically right next to you, and the entire transaction is so much faster.”

Snap wanted to regionalize its services where it made sense. The only way to do that was by using microservices and understanding which services were useful to have close to the customer and which ones weren't, Hunter said.

“Customers benefit by having data centers be physically closer to them because performance is better,” he said. “CDNs can cover a lot of the broadcast content, but when doing one-on-one communications with people — people send Snaps and Snap videos — those are big chunks of data to move through the network.”

That ability to switch regions is one of the benefits of using cloud providers, Hunter said.

“If I want to experiment and move something to Sydney or Singapore or Tokyo, I can just do it,” he said. “I'm just going to call them up and say, ‘OK, we're going to put our messaging stack in Tokyo,’ and the systems are all there, and we try it. If it turns out it doesn't actually make a difference, we turn that service off and move it to a cheaper location.”

Delta Force

Snap has built more than 100 services for very specific functions, including Delta Force.

In 2016, any time a user opened the Snapchat app, it would download or redownload everything, including stories that the user had already viewed but that hadn’t yet timed out in the app.

“It was … a naive deployment of just ‘download everything so that you don't miss anything,’” Hunter said. “Delta Force goes and looks at the client … finds out all the things that you've already downloaded and are still on your phone, and then only downloads the things that are net-new.”

This approach had other benefits.

“Of course, that turns out to make the app faster,” Hunter said. “It also costs us way less, so we reduced our costs enormously by implementing that single service.”
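The idea behind Delta Force, as Hunter describes it, can be sketched in a few lines. This is only an illustration of the general "send what the client is missing" technique, not Snap's implementation: the `Story` type, the `net_new_stories` function, the story IDs and the byte sizes are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    """Hypothetical stand-in for a piece of downloadable content."""
    story_id: str
    size_bytes: int


def net_new_stories(available: list[Story], cached_ids: set[str]) -> list[Story]:
    """Return only the stories the client does not already hold."""
    return [s for s in available if s.story_id not in cached_ids]


def bytes_to_send(stories: list[Story]) -> int:
    """Total payload size for a list of stories."""
    return sum(s.size_bytes for s in stories)


# Hypothetical data: what the server could send vs. what is still on the phone.
available = [Story("a", 400_000), Story("b", 250_000), Story("c", 900_000)]
cached_ids = {"a", "c"}

naive = bytes_to_send(available)                               # download everything
delta = bytes_to_send(net_new_stories(available, cached_ids))  # download only net-new
```

With these made-up numbers, the naive approach transfers all three stories while the delta approach transfers only story "b", which is the source of both the speedup and the cost reduction Hunter mentions.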

Open source

Snap uses open-source software to create its infrastructure, including Kubernetes for service development, Spinnaker for its application team to deploy software, Spark for data processing and memcached/KeyDB for caching. “We have a process for looking at open source and making sure we're comfortable that it's safe and that it's not something that we wouldn't want to deploy in our infrastructure,” Hunter said.

Snap also uses Envoy, an edge and service proxy and universal data plane designed for large, microservice service-mesh architectures.

“I actually feel like … the way of the future is using a service mesh on top of your cloud to basically deploy all your security protocols and make sure that you've got the right logins and that people aren't getting access to it that shouldn't,” Hunter said. “I'm happy with the Envoy implementations giving us a great way of managing load when we're moving between clouds.”

Cloud primitives, ‘moving fast’ and cost camp

Hunter prefers using primitives or simple services from AWS and Google Cloud rather than managed services. A Snap philosophy that serves it well is the ability to move very fast, Hunter said.

“I don't expect my engineers to come back with perfectly efficient systems when we're launching a new feature that has a service as a back end,” he said, noting many of his team members previously worked for Google or Amazon. “Do what you have to do to get it out there, let's move fast. Be smart, but don't spend a lot of time tuning and optimizing. If that service doesn't take off, and it doesn't get a lot of use, then leave it the way it is. If that service takes off, and we start to get a lot of use on it, then let's go back and start to tune it.”

That tuning process, which depends on understanding how a service operates, is where cycles of cloud usage can be reduced, resulting in instant cost savings, according to Hunter.

“Our total compute cost is so large that little bits of tuning can have really large amounts of cost savings for us,” he said. “If you're not making the sort of constant changes that we are, I think it's fine to use the managed services that Google or Amazon provide. But if you're in a world where we're constantly making changes — like daily changes, multiple-times-a-day changes — I think you want to have that technical expertise in house so that you can just really be on top of things.”

Three factors figure into Snap’s ability to reap cost savings: the competition between AWS and Google Cloud, Snap’s ability to tweeze out costs through its own tuning work, and its practice of going back to the cloud providers to look at their new products and services.

“We're in a state of doing those three things all the time, and between those three, [we save] many tens of millions of dollars,” Hunter said.

Snap holds a “cost camp” every year where it asks its engineers to find all the places where costs possibly could be reduced.

“We take that list and prioritize that list, and then I cut people loose to go and work on those things,” he said. “On an annual basis depending on the year, it's many tens of millions of dollars of cost savings.”

Adding a third cloud provider and advice on going multicloud

Snap has considered adding a third cloud provider, and it could still happen some day, although the process is pretty challenging, according to Hunter.

“It's a big lift to move into another cloud, because you've got those three layers,” he said. “The agnostic stuff is pretty straightforward, but then once you get to mixed and cloud-specific, you've got to go hire engineers that are good at that cloud, or you've got to go train your team up on … the nuances of that cloud.”

Enterprises considering adding another cloud provider need to make sure they have the engineering staff to pull it off: 20 to 30 dedicated cloud people as a starting point, Hunter said.

“It's not cheap, and second, that team has to be pretty sophisticated and technical,” he said. “If you don't have a big deployment, it's probably not worth it. I think about a lot of the customers I used to serve when I was in AWS, and the vast majority of them, their implementations … were serving their company's internal stuff, and it wasn't gigantic. If you're in that boat, it's probably not worth the extra work that it takes to do multicloud.”
