How Netflix and Uber helped create the data lakehouse by preserving an open-source tradition

Tech giants built the data lakehouse out of necessity. Now their open-source foundations are being commercialized by other companies, including Dremio, which made its Dremio Cloud service widely available Wednesday.


The big data compute team at Netflix was dealing with some pesky data aggravations a few years ago.

“Earlier this week, we had somebody go in and rename a column, and in one engine they were getting results, and in the other ones they were getting null,” said Daniel Weeks, then a Netflix engineering manager, speaking at a 2019 developers’ conference. Weeks, who headed that team, was working with others on a new way to solve the sorts of data-processing engine complexities that had prevented smoother analysis of the data rushing into the Netflix streaming service.

“We have more and more users coming in, and they need to not be bothered by these problems,” he said of the growing data team at Netflix at that time.

The new approach that was under construction at Netflix, with help from developers at companies including Apple and Salesforce, became an open-source standard for table formats in analytic data sets called Apache Iceberg. Back in 2019, Weeks predicted that iterative improvements to Iceberg from the open-source community would help ensure issues like the one he described wouldn't happen again.
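To make the renamed-column problem concrete, here is a minimal sketch of one way a table format can keep readers consistent: identify each column by a stable ID and treat the name as a label in the schema. Everything in it (the `Table` class, `rename_column`, `read`, and the sample rows) is hypothetical and written for illustration. It is not Iceberg's API or Netflix's code.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Hypothetical toy table: the schema maps a stable column ID to its current name.
    schema: dict[int, str]
    # Rows are stored keyed by column ID, never by column name.
    rows: list[dict[int, object]] = field(default_factory=list)

    def rename_column(self, old: str, new: str) -> None:
        """Rename a column by updating the schema only; stored rows are untouched."""
        for col_id, name in self.schema.items():
            if name == old:
                self.schema[col_id] = new
                return
        raise KeyError(old)

    def read(self, column: str) -> list[object]:
        """Resolve the column name to its ID, then read the values by ID."""
        ids = [col_id for col_id, name in self.schema.items() if name == column]
        if not ids:
            raise KeyError(column)
        return [row.get(ids[0]) for row in self.rows]


# Illustrative data only.
table = Table(schema={1: "title", 2: "minutes_watched"})
table.rows.append({1: "show_a", 2: 51})
table.rows.append({1: "show_b", 2: 58})

table.rename_column("minutes_watched", "watch_minutes")
print(table.read("watch_minutes"))  # [51, 58]
```

Because every reader resolves names through the same schema, a rename in this toy model cannot leave one engine returning values while another returns null.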

He was right. While most companies don’t need to perform business analytics on top of tens of petabytes of data the way Netflix does, data architectures including Iceberg and Hudi — a system incubated inside Uber to solve similar problems — now form the foundation of products sold to other enterprises as so-called data lakehouses.

Dremio, which calls itself a lakehouse company, announced Wednesday that its Dremio Cloud data lakehouse platform — based in part on Apache Iceberg — is now widely available.

“A lakehouse needs to be open source: That’s why Iceberg has started to get so much momentum,” said Tomer Shiran, founder and chief product officer at Dremio. Companies such as Netflix, Apple and Salesforce, which need to perform business analytics on top of huge amounts of data, helped build Apache Iceberg, Shiran said, “because these companies needed something like that. Tech companies have been at the leading edge in terms of adopting this kind of architecture.”

Right now, open-source data lakehouse architectures are following a pattern seen with other data standards built or used inside large Silicon Valley tech companies before businesses began moving data to the cloud. For instance, more than a decade ago, before Yahoo spun out its open-source data analytics software Hadoop as a new company, companies including eBay and Facebook used it internally.

Another foundational open-source data technology, Kafka, was developed inside LinkedIn. The business social networking company funded Confluent in 2014 to commercialize the use of Kafka. And Databricks, a fast-growing data vendor, also launched its own lakehouse-style open-source project in 2019 called Delta Lake.

How the lakehouse evolved

What vendors today call “the lakehouse” is, to many data professionals, just an evolved version of the data lake that combines elements of the traditional data warehouse. A data lake is essentially a receptacle for ingesting information, such as website activity data showing what movie content people perused, or data associated with trips taken through a ride-hailing app.

The lakehouse provides a structural layer on top of the otherwise raw and chaotic data stored in a data lake, allowing data scientists and others to perform analytics processes such as querying the data without having to move it first into a more structured warehouse environment.

“Moving data can be very expensive from system to system,” said Ben Ainscough, head of AI and Data Science at business intelligence tech company Domo.

Dremio aims to make information sitting inside data lakes more operational with its new lakehouse features available Wednesday, including a query engine called Sonar and another system called Arctic, which helps developers and data scientists keep track of changes made to data. The company is providing free versions of its lakehouse and the other new features, though large enterprises in need of support services and advanced integrations for security or other customizations have to pay.

Arctic is a collaborative data archive system allowing data scientists and engineers to store and access information that reflects how data is used or changed. Known as metadata, this information provides details such as where the data was sourced or what specific time it was ingested or manipulated.

“Observability of what’s changing is important,” said Shiran. “If somebody goes and changes something, a lot of things can go wrong even with good intent.”

The archived metadata stored in the system might show what data was used inside an analytics tool like Tableau to help a company decide whether to buy certain materials or products, for example. Or it could be used when a data scientist wants to run a query to learn more about the most recent action taken inside a data folder.
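As a rough illustration of that kind of query, the sketch below keeps a log of who did what to which path and when, then looks up the most recent action under a folder. The `ChangeLog` and `ChangeRecord` names, the paths and the timestamps are all assumptions made for this example. This is not how Arctic is implemented, and it does not use any Dremio API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ChangeRecord:
    # Hypothetical metadata entry: what happened, where, by whom, and when.
    path: str
    action: str
    author: str
    at: datetime


class ChangeLog:
    """Toy append-only metadata log (illustrative, not a real product API)."""

    def __init__(self) -> None:
        self._records: list[ChangeRecord] = []

    def record(self, path: str, action: str, author: str, at: datetime) -> None:
        self._records.append(ChangeRecord(path, action, author, at))

    def latest(self, folder: str) -> Optional[ChangeRecord]:
        """Return the most recent change under a folder, or None if there is none."""
        matches = [r for r in self._records if r.path.startswith(folder)]
        return max(matches, key=lambda r: r.at, default=None)


# Illustrative entries only.
log = ChangeLog()
log.record("sales/2022/orders", "ingest", "pipeline",
           datetime(2022, 9, 1, 8, 0, tzinfo=timezone.utc))
log.record("sales/2022/orders", "rename_column", "analyst",
           datetime(2022, 9, 2, 14, 30, tzinfo=timezone.utc))
log.record("marketing/campaigns", "ingest", "pipeline",
           datetime(2022, 9, 3, 9, 0, tzinfo=timezone.utc))

last = log.latest("sales/")
print(last.action, last.author)  # rename_column analyst
```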

“The real power comes from managing the metadata information,” said Venkat Venkataramani, who managed engineering teams that built Facebook's online data systems from 2008 to 2015 and is now CEO and co-founder of Rockset, a company that provides a database for building applications for real-time data. Open-source data architectures built to help solve the needs of tech giants — such as Iceberg and Hudi — keep track of metadata in a standardized way, Venkataramani said.

To get full value from investing in infrastructure and software to collect, store, manage and analyze data, businesses want to enable multiple people and departments to access and manipulate the same corpus of information. But historically that has required copying the data and moving it so multiple users could work on it at the same time. That risks changes being made to one copy of the data that are not reflected in another and cannot be tracked. The constant influx of new and updated information adds even more complexity.

“People have been shouting about data silos for effectively ever,” said Boris Jabes, co-founder of Census, which makes software to help companies operationalize data for analytics. What’s different today, Jabes said, is that sales, marketing or other teams can each run their own data workloads separately on the same storage layer. “There’s a lot more infrastructure that can be shared now,” he said.

Uber built Hudi out of necessity

The data architecture teams inside Netflix and Uber aimed to alleviate problems associated with data silos by developing projects like Iceberg and Hudi, which were later contributed to the Apache Software Foundation.

When Vinoth Chandar, founder and CEO of Onehouse, worked at Uber as a senior staff engineer and manager of its data team starting in 2014, “we ran into this predicament,” he said: People from disparate divisions realized that one team’s data may have reflected recent updates while another’s did not. That meant teams had been conducting analysis to understand what was happening in specific cities based on different data.

“It had a very profound impact on how we talked about something,” Chandar said.

At the time, Uber had a data warehouse stored on-premises, and used data infrastructure including Hadoop to manage all the analytics and machine-learning algorithms it was building to do things like decide how trip prices should change when it rains.

By building new data-management processes on top of the data lake where data initially entered its system, the company was able to keep track of data changes and process data faster so all its teams were talking about the same data, Chandar said. That approach for bringing core functionality of Uber’s data warehouse to its data lake was referred to inside Uber as a “transactional data lake,” he said. The team named the project Hudi (pronounced like hoodie), an acronym for Hadoop Upserts, Deletes and Incrementals.

“It's merely the capabilities Hudi added on top of vanilla Hadoop or cloud storage,” said Chandar.
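The three capabilities in the acronym can be sketched in a few lines. The toy table below supports upserts (insert or update by key), deletes, and incremental reads (only what changed after a given commit). The `VersionedTable` class, its method names and the trip records are hypothetical and serve only to illustrate the ideas. They are not Hudi's API or file layout, and a real system would store this on Hadoop or cloud storage rather than in memory.

```python
from typing import Optional


class VersionedTable:
    """Toy keyed table illustrating upserts, deletes and incremental reads."""

    def __init__(self) -> None:
        self._commit = 0
        # key -> (commit number of last change, record, or None if deleted)
        self._rows: dict[str, tuple[int, Optional[dict]]] = {}

    def upsert(self, key: str, record: dict) -> int:
        """Insert the record, or overwrite it if the key already exists."""
        self._commit += 1
        self._rows[key] = (self._commit, record)
        return self._commit

    def delete(self, key: str) -> int:
        """Mark the key as deleted so that readers can see the deletion."""
        self._commit += 1
        self._rows[key] = (self._commit, None)
        return self._commit

    def snapshot(self) -> dict[str, dict]:
        """Current view of the table, with deleted keys hidden."""
        return {k: rec for k, (_, rec) in self._rows.items() if rec is not None}

    def changes_since(self, commit: int) -> dict[str, Optional[dict]]:
        """Incremental read: keys changed after `commit` (None means deleted)."""
        return {k: rec for k, (c, rec) in self._rows.items() if c > commit}


# Illustrative data only.
trips = VersionedTable()
trips.upsert("trip-1", {"city": "SF", "fare": 12.5})
checkpoint = trips.upsert("trip-2", {"city": "NYC", "fare": 20.0})
trips.upsert("trip-1", {"city": "SF", "fare": 14.0})  # update an existing key
trips.delete("trip-2")

print(trips.snapshot())                 # {'trip-1': {'city': 'SF', 'fare': 14.0}}
print(trips.changes_since(checkpoint))  # {'trip-1': {'city': 'SF', 'fare': 14.0}, 'trip-2': None}
```

A downstream job that remembers its last checkpoint can process only the changed keys instead of rescanning the whole table, which is the point of the incremental read.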

After incubating Hudi for several years inside Uber, the company contributed the project to Apache in 2019, and it has evolved through the work of an open-source community not unlike the one built around Iceberg.

Chandar’s startup Onehouse, which raised $8 million in seed funding in February, provides a managed service for companies using its Hudi-based lakehouse product.

In the past it was only the Ubers or Facebooks of the world that could afford the hardware and software infrastructure necessary to use these types of technologies in their own data centers, but today the more widespread cloud-centric data ecosystem is ripe for broader adoption of those technologies by other businesses, said Rockset’s Venkataramani. Because Iceberg and Hudi were designed to work in cloud environments, where companies can afford to manage large volumes of data and easily estimate costs of performing queries and analytics using that data, Venkataramani said, the barriers to adoption have been lifted.

“It’s the market demanding projects like Hudi and Iceberg,” he said.

That could bode well for Weeks, the former Netflix engineer who helped create Iceberg. Just last year, along with two other former Netflix data wranglers who also helped create Iceberg, he co-founded Tabular, a startup building a data platform using Iceberg.
