How Netflix and Uber helped create the data lakehouse by preserving an open-source tradition

Tech giants built the data lakehouse out of necessity. Now their open-source foundations are being commercialized by other companies, including Dremio, which launched wide availability of its Dremio Cloud service Wednesday.


Photo: Elif Kandemir/Unsplash

The big data compute team at Netflix was dealing with some pesky data aggravations a few years ago.

“Earlier this week, we had somebody go in and rename a column, and in one engine they were getting results, and in the other ones they were getting null,” said Daniel Weeks, then a Netflix engineering manager, speaking at a 2019 developers’ conference. Weeks headed that team, and he and his colleagues were building a new way to resolve the kinds of inconsistencies between data-processing engines that had prevented smoother analysis of the data rushing into the Netflix streaming service.

“We have more and more users coming in, and they need to not be bothered by these problems,” he said of the growing data team at Netflix at that time.

The new approach that was under construction at Netflix, with help from developers at companies including Apple and Salesforce, became an open-source standard for table formats in analytic data sets called Apache Iceberg. Back in 2019, Weeks predicted that iterative improvements to Iceberg from the open-source community would help ensure issues like the one he described wouldn't happen again.

He was right. While most companies don’t need to perform business analytics on top of tens of petabytes of data the way Netflix does, data architectures including Iceberg and Hudi — a system incubated inside Uber to solve similar problems — now form the foundation of products sold to other enterprises as so-called data lakehouses.
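The column-rename failure Weeks described can be sketched in a few lines. This is a toy illustration, not Iceberg's implementation: the `Field` class, the sample rows and both reader functions are hypothetical. It assumes one engine matches columns by name while another matches them by a stable numeric ID, which is the approach the Apache Iceberg table specification takes to schema evolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int  # stable identifier that survives renames
    name: str      # human-readable name, free to change


# One data file written BEFORE the rename. The same row is stored twice here
# only so that both lookup styles can be shown side by side.
OLD_FILE_BY_NAME = [{"title": "Example Movie", "mins": 135}]
OLD_FILE_BY_ID = [{1: "Example Movie", 2: 135}]

# Current table schema: field 2 was renamed from "mins" to "runtime_minutes".
CURRENT_SCHEMA = [Field(1, "title"), Field(2, "runtime_minutes")]


def read_by_name(rows, schema):
    """Name-based engine: the new name is absent from old files, so it reads None."""
    return [{f.name: row.get(f.name) for f in schema} for row in rows]


def read_by_id(rows, schema):
    """ID-based engine: old files still match because the ID never changed."""
    return [{f.name: row.get(f.field_id) for f in schema} for row in rows]


by_name = read_by_name(OLD_FILE_BY_NAME, CURRENT_SCHEMA)
by_id = read_by_id(OLD_FILE_BY_ID, CURRENT_SCHEMA)
```

In this sketch the name-based reader returns a null for the renamed column while the ID-based reader still returns the stored value, which mirrors the "results in one engine, null in the others" symptom.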

Dremio, which calls itself a lakehouse company, announced Wednesday that its Dremio Cloud data lakehouse platform — based in part on Apache Iceberg — is now widely available.

“A lakehouse needs to be open source: That’s why Iceberg has started to get so much momentum,” said Tomer Shiran, founder and chief product officer at Dremio. Companies that need to perform business analytics on top of huge amounts of data such as Netflix, Apple and Salesforce helped build Apache Iceberg, Shiran said, “because these companies needed something like that. Tech companies have been at the leading edge in terms of adopting this kind of architecture.”

Right now, open-source data lakehouse architectures are following a pattern seen with other data standards built or used inside large Silicon Valley tech companies before businesses began moving data to the cloud. For instance, before Yahoo spun out its open-source data analytics software Hadoop as a new company more than a decade ago, companies including eBay and Facebook used it internally.

Another foundational open-source data technology, Kafka, was developed inside LinkedIn. The business social networking company funded Confluent in 2014 to commercialize the use of Kafka. And Databricks, a fast-growing data vendor, also launched its own lakehouse-style open-source project in 2019 called Delta Lake.

How the lakehouse evolved

What vendors today call “the lakehouse” is, to many data professionals, just an evolved version of the data lake that combines elements of the traditional data warehouse. A data lake is essentially a receptacle for ingesting information, such as website activity data showing what movie content people perused, or data associated with trips taken through a ride-hailing app.

The lakehouse provides a structural layer on top of the otherwise raw and chaotic data stored in a data lake, allowing data scientists and others to perform analytics processes such as querying the data without having to move it first into a more structured warehouse environment.

“Moving data can be very expensive from system to system,” said Ben Ainscough, head of AI and Data Science at business intelligence tech company Domo.

Dremio aims to make information sitting inside data lakes more operational with the lakehouse features it made available Wednesday, including a query engine called Sonar and a second system called Arctic, which helps developers and data scientists keep track of changes made to data. The company is providing free versions of its lakehouse and the other new features, though large enterprises that need support services, advanced security integrations or other customizations have to pay.

Arctic is a collaborative data archive system allowing data scientists and engineers to store and access information that reflects how data is used or changed. Known as metadata, this information provides details such as where the data was sourced or what specific time it was ingested or manipulated.

“Observability of what’s changing is important,” said Shiran. “If somebody goes and changes something, a lot of things can go wrong even with good intent.”

The archived metadata stored in the system might show what data was used inside an analytics tool like Tableau to help a company decide whether to buy certain materials or products, for example. Or it could be used when a data scientist wants to run a query to learn more about the most recent action taken inside a data folder.
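The idea of archived change metadata can be made concrete with a small sketch. This is not Arctic's API: `ChangeRecord`, `MetadataLog` and every field name are hypothetical, and the sketch assumes the only details kept are the ones the article mentions (where data came from, what was done to it, and when).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class ChangeRecord:
    table: str      # which data set was touched
    action: str     # e.g. "ingest", "rename_column", "delete_rows"
    author: str     # who made the change
    source: str     # where the data came from
    at: datetime    # when the change happened


class MetadataLog:
    """Append-only log of changes; existing records are never edited."""

    def __init__(self) -> None:
        self._records: List[ChangeRecord] = []

    def record(self, table: str, action: str, author: str,
               source: str, at: datetime) -> ChangeRecord:
        rec = ChangeRecord(table, action, author, source, at)
        self._records.append(rec)
        return rec

    def history(self, table: str) -> List[ChangeRecord]:
        """All changes to one table, oldest first."""
        matches = [r for r in self._records if r.table == table]
        return sorted(matches, key=lambda r: r.at)

    def latest(self, table: str) -> Optional[ChangeRecord]:
        """The most recent change to a table, or None if it was never touched."""
        past = self.history(table)
        return past[-1] if past else None


log = MetadataLog()
log.record("orders", "ingest", "etl-job", "warehouse-export",
           datetime(2022, 5, 2, 9, 0, tzinfo=timezone.utc))
log.record("orders", "rename_column", "analyst-a", "manual-edit",
           datetime(2022, 5, 3, 14, 30, tzinfo=timezone.utc))
last_change = log.latest("orders")
```

Here `last_change` answers the "most recent action" question from the paragraph above, and `history("orders")` gives the full trail that a dashboard's numbers could be traced back through.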

“The real power comes from managing the metadata information,” said Venkat Venkataramani, who managed engineering teams that built Facebook's online data systems from 2008 to 2015 and is now CEO and co-founder of Rockset, a company that provides a database for building applications for real-time data. Open-source data architectures built to help solve the needs of tech giants — such as Iceberg and Hudi — keep track of metadata in a standardized way, Venkataramani said.

To get full value from their investment in infrastructure and software to collect, store, manage and analyze data, businesses want multiple people and departments to be able to access and manipulate the same corpus of information. Historically, that has required copying the data and moving it so multiple users could work on it at the same time. That creates the risk that changes made to one copy are never reflected in the others and cannot be tracked. The constant influx of new and updated information adds even more complexity.

“People have been shouting about data silos for effectively ever,” said Boris Jabes, co-founder of Census, which makes software to help companies operationalize data for analytics. What’s different today, Jabes said, is that sales, marketing or other teams can each run their own data workloads separately on the same storage layer. “There’s a lot more infrastructure that can be shared now,” he said.

Uber built Hudi out of necessity

The data architecture teams inside Netflix and Uber aimed to alleviate problems associated with data silos by developing projects like Iceberg and Hudi, which were later contributed to the Apache Software Foundation.

When Vinoth Chandar, founder and CEO of Onehouse, worked at Uber as a senior staff engineer and manager of its data team starting in 2014, “we ran into this predicament,” he said: People from disparate divisions realized that one team’s data may have reflected recent updates while other teams’ data did not. That meant the teams had been analyzing what was happening in specific cities using different versions of the data.

“It had a very profound impact on how we talked about something,” Chandar said.

At the time, Uber had a data warehouse stored on-premises, and used data infrastructure including Hadoop to manage all the analytics and machine-learning algorithms it was building to do things like decide how trip prices should change when it rains.

By building new data-management processes on top of the data lake where data initially entered its system, the company was able to help keep track of data changes and process data faster so all its teams were talking about the same data, Chandar said. That approach for bringing core functionality of Uber’s data warehouse to its data lake was referred to inside Uber as a “transactional data lake,” he said. They named it Hudi (pronounced like hoodie), an acronym for Hadoop Upserts, Deletes and Incrementals.

“It's merely the capabilities Hudi added on top of vanilla Hadoop or cloud storage,” said Chandar.
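The three capabilities in the name (upserts, deletes and incrementals) can be sketched with a toy in-memory table. This is not Hudi's API or storage layout: `ToyTransactionalTable`, its methods and the sample trip records are hypothetical. The sketch assumes each write is stamped with an increasing commit number so that readers can ask for only what changed since a given commit.

```python
from typing import Dict, List, Optional, Tuple


class ToyTransactionalTable:
    """Keyed table where every write is stamped with a commit number."""

    def __init__(self) -> None:
        # key -> (commit number of last write, row or None if deleted)
        self._rows: Dict[str, Tuple[int, Optional[dict]]] = {}
        self._commit = 0

    def _next_commit(self) -> int:
        self._commit += 1
        return self._commit

    def upsert(self, records: Dict[str, dict]) -> int:
        """Insert new keys and overwrite existing ones in a single commit."""
        commit = self._next_commit()
        for key, row in records.items():
            self._rows[key] = (commit, dict(row))
        return commit

    def delete(self, keys: List[str]) -> int:
        """Mark keys as deleted; the tombstone keeps the delete visible to readers."""
        commit = self._next_commit()
        for key in keys:
            if key in self._rows:
                self._rows[key] = (commit, None)
        return commit

    def snapshot(self) -> Dict[str, dict]:
        """Current state of the table, with deleted keys left out."""
        return {k: row for k, (_, row) in self._rows.items() if row is not None}

    def incremental(self, since_commit: int) -> Dict[str, Optional[dict]]:
        """Only keys written after since_commit; None means the key was deleted."""
        return {k: row for k, (c, row) in self._rows.items() if c > since_commit}


trips = ToyTransactionalTable()
c1 = trips.upsert({"trip-1": {"city": "SF", "fare": 12.0},
                   "trip-2": {"city": "NYC", "fare": 20.0}})
c2 = trips.upsert({"trip-1": {"city": "SF", "fare": 14.5}})  # late fare correction
c3 = trips.delete(["trip-2"])
changes_since_first_load = trips.incremental(c1)
```

A downstream team that last read the table at commit `c1` only needs `changes_since_first_load` to catch up, so every team ends up looking at the same corrected fare rather than a stale copy.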

After incubating Hudi for several years inside Uber, the company contributed the project to Apache in 2019, and it has evolved through the work of an open-source community not unlike the one built around Iceberg.

Chandar’s startup Onehouse, which raised $8 million in seed funding in February, provides a managed service for companies using its Hudi-based lakehouse product.

In the past, only the Ubers and Facebooks of the world could afford the hardware and software infrastructure necessary to run these types of technologies in their own data centers. Today, the more widespread cloud-centric data ecosystem is ripe for broader adoption of those technologies by other businesses, said Rockset’s Venkataramani. Iceberg and Hudi were designed to work in cloud environments, where companies can afford to manage large volumes of data and can easily estimate the cost of running queries and analytics on that data, Venkataramani said, and that has lifted the barriers to adoption.

“It’s the market demanding projects like Hudi and Iceberg,” he said.

That could bode well for Weeks, the former Netflix engineer who helped create Iceberg. Just last year, along with two other former Netflix data wranglers who also helped create Iceberg, he co-founded Tabular, a startup building a data platform using Iceberg.

