How Netflix and Uber helped create the data lakehouse by preserving an open-source tradition

Tech giants built the data lakehouse out of necessity. Now their open-source foundations are being commercialized by other companies, including Dremio, which launched wide availability of its Dremio Cloud service Wednesday.


The big data compute team at Netflix was dealing with some pesky data aggravations a few years ago.

“Earlier this week, we had somebody go in and rename a column, and in one engine they were getting results, and in the other ones they were getting null,” said Daniel Weeks, then a Netflix engineering manager, speaking at a 2019 developers’ conference. As head of that team, he was working with others to build a new way to solve the sorts of data-processing engine complexities that had prevented smoother analysis of the data rushing into the Netflix streaming service.

“We have more and more users coming in, and they need to not be bothered by these problems,” he said of the growing data team at Netflix at that time.

The new approach that was under construction at Netflix, with help from developers at companies including Apple and Salesforce, became Apache Iceberg, an open-source table format standard for analytic data sets. Back in 2019, Weeks predicted that iterative improvements to Iceberg from the open-source community would help ensure issues like the one he described wouldn't happen again.
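The column-rename problem Weeks described can be illustrated with a minimal Python sketch. Iceberg tracks each column by a unique field ID rather than by name; everything else here (the `DATA_FILE` layout, the `read_by_name` and `read_by_id` helpers, the sample show titles) is hypothetical and is not Iceberg's API.

```python
# Toy illustration only: a "data file" written while the column was
# still called "title". Each value is stored under a stable field ID.
DATA_FILE = {
    "written_schema": {1: "title", 2: "views"},  # field ID -> name at write time
    "rows": [{1: "Show A", 2: 120}, {1: "Show B", 2: 75}],
}

# The table's current schema after someone renames "title" to "show_title".
CURRENT_SCHEMA = {1: "show_title", 2: "views"}


def read_by_name(data_file, column):
    """Engine that matches columns by the name stored in the file."""
    name_to_id = {name: fid for fid, name in data_file["written_schema"].items()}
    fid = name_to_id.get(column)  # None when the file predates the rename
    return [row.get(fid) for row in data_file["rows"]]


def read_by_id(data_file, schema, column):
    """Engine that resolves the name to a field ID via the current schema."""
    name_to_id = {name: fid for fid, name in schema.items()}
    fid = name_to_id[column]
    return [row[fid] for row in data_file["rows"]]


by_name = read_by_name(DATA_FILE, "show_title")
by_id = read_by_id(DATA_FILE, CURRENT_SCHEMA, "show_title")
print(by_name)  # [None, None]
print(by_id)    # ['Show A', 'Show B']
```

The name-matching reader returns nulls for files written before the rename, while the ID-based reader still finds the data, which is roughly the mismatch between engines that Weeks described.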

He was right. While most companies don’t need to perform business analytics on top of tens of petabytes of data the way Netflix does, data architectures including Iceberg and Hudi — a system incubated inside Uber to solve similar problems — now form the foundation of products sold to other enterprises as so-called data lakehouses.

Dremio, which calls itself a lakehouse company, announced Wednesday that its Dremio Cloud data lakehouse platform — based in part on Apache Iceberg — is now widely available.

“A lakehouse needs to be open source: That’s why Iceberg has started to get so much momentum,” said Tomer Shiran, founder and chief product officer at Dremio. Companies that need to perform business analytics on top of huge amounts of data such as Netflix, Apple and Salesforce helped build Apache Iceberg, Shiran said, “because these companies needed something like that. Tech companies have been at the leading edge in terms of adopting this kind of architecture.”

Right now, open-source data lakehouse architectures are following a pattern seen with other data standards built or used inside large Silicon Valley tech companies before businesses began moving data to the cloud. For instance, more than a decade ago, before Yahoo spun out its open-source data analytics software Hadoop as a new company, companies including eBay and Facebook used it internally.

Another foundational open-source data technology, Kafka, was developed inside LinkedIn. The business social networking company funded Confluent in 2014 to commercialize the use of Kafka. And Databricks, a fast-growing data vendor, also launched its own lakehouse-style open-source project in 2019 called Delta Lake.

How the lakehouse evolved

What vendors today call “the lakehouse” is, to many data professionals, just an evolved version of the data lake that combines elements of the traditional data warehouse. A data lake is essentially a receptacle for ingesting information, such as website activity data showing what movie content people perused, or data associated with trips taken through a ride-hailing app.

The lakehouse provides a structural layer on top of the otherwise raw and chaotic data stored in a data lake, allowing data scientists and others to perform analytics processes such as querying the data without having to move it first into a more structured warehouse environment.

“Moving data can be very expensive from system to system,” said Ben Ainscough, head of AI and Data Science at business intelligence tech company Domo.

Dremio aims to make information sitting inside data lakes more operational with the lakehouse features it made available Wednesday, including a query engine called Sonar and another system called Arctic, which helps developers and data scientists keep track of changes made to data. The company is providing free versions of its lakehouse and the other new features, though large enterprises in need of support services and advanced integrations for security or other customizations have to pay.

Arctic is a collaborative data archive system allowing data scientists and engineers to store and access information that reflects how data is used or changed. Known as metadata, this information provides details such as where the data was sourced or what specific time it was ingested or manipulated.

“Observability of what’s changing is important,” said Shiran. “If somebody goes and changes something, a lot of things can go wrong even with good intent.”

The archived metadata stored in the system might show what data was used inside an analytics tool like Tableau to help a company decide whether to buy certain materials or products, for example. Or it could be used when a data scientist wants to run a query to learn more about the most recent action taken inside a data folder.
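As a rough sketch of the kind of query described above, the Python below looks up the most recent action recorded for a data folder. The record fields, folder paths and the `latest_change` helper are made up for illustration and do not reflect how Arctic stores or exposes metadata.

```python
from datetime import datetime

# Hypothetical change records; field names and values are made up.
CHANGE_LOG = [
    {"path": "sales/2022/", "action": "append", "by": "etl-job",
     "at": datetime(2022, 7, 1, 9, 0)},
    {"path": "sales/2022/", "action": "rename_column", "by": "analyst",
     "at": datetime(2022, 7, 2, 14, 30)},
    {"path": "marketing/", "action": "delete_rows", "by": "engineer",
     "at": datetime(2022, 7, 3, 8, 15)},
]


def latest_change(log, folder):
    """Return the most recent record for the folder, or None if there is none."""
    matches = [rec for rec in log if rec["path"].startswith(folder)]
    return max(matches, key=lambda rec: rec["at"], default=None)


latest = latest_change(CHANGE_LOG, "sales/")
print(latest["action"], latest["by"])  # rename_column analyst
```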

“The real power comes from managing the metadata information,” said Venkat Venkataramani, who managed engineering teams that built Facebook's online data systems from 2008 to 2015 and is now CEO and co-founder of Rockset, a company that provides a database for building applications for real-time data. Open-source data architectures built to help solve the needs of tech giants — such as Iceberg and Hudi — keep track of metadata in a standardized way, Venkataramani said.

In order to get full value from investing in infrastructure and software to collect, store, manage and analyze data, businesses want to enable multiple people and departments to access and manipulate the same corpus of information. But historically that has required copying the data and moving it so multiple users could access it and work on it at the same time, which risks changes being made to one version of the data that are not reflected in another and are not trackable. The constant influx of new and updated information adds even more complexity.

“People have been shouting about data silos for effectively ever,” said Boris Jabes, co-founder of Census, which makes software to help companies operationalize data for analytics. What’s different today, Jabes said, is that sales, marketing or other teams can each run their own data workloads separately on the same storage layer. “There’s a lot more infrastructure that can be shared now,” he said.

Uber built Hudi out of necessity

The data architecture teams inside Netflix and Uber aimed to alleviate problems associated with data silos by developing projects like Iceberg and Hudi, which were later contributed to the Apache Software Foundation.

When Vinoth Chandar, founder and CEO of Onehouse, worked at Uber as a senior staff engineer and manager of its data team starting in 2014, “we ran into this predicament,” he said: People from disparate divisions realized that one team’s data may have reflected recent updates while other teams’ data did not. That meant the teams had been analyzing what was happening in specific cities based on different data.

“It had a very profound impact on how we talked about something,” Chandar said.

At the time, Uber had a data warehouse stored on-premises, and used data infrastructure including Hadoop to manage all the analytics and machine-learning algorithms it was building to do things like decide how trip prices should change when it rains.

By building new data-management processes on top of the data lake where data initially entered its system, the company was able to keep track of data changes and process data faster so all its teams were talking about the same data, Chandar said. That approach for bringing core functionality of Uber’s data warehouse to its data lake was referred to inside Uber as a “transactional data lake,” he said. The team named it Hudi (pronounced like hoodie), an acronym for Hadoop Upserts, Deletes and Incrementals.
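The three operations in the Hudi name can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Hudi's implementation or API: the `ToyTable` class, its method names and the sample trip records are all hypothetical.

```python
class ToyTable:
    """Toy keyed table with a commit log; not Hudi's implementation or API."""

    def __init__(self):
        self.rows = {}   # record key -> latest row
        self.log = []    # (commit number, operation, key, row or None)
        self.commit = 0

    def upsert(self, key, row):
        """Insert the row, or overwrite it if the key already exists."""
        self.commit += 1
        self.rows[key] = row
        self.log.append((self.commit, "upsert", key, row))
        return self.commit

    def delete(self, key):
        """Remove the row for this key, recording the deletion."""
        self.commit += 1
        self.rows.pop(key, None)
        self.log.append((self.commit, "delete", key, None))
        return self.commit

    def incremental(self, since_commit):
        """Return only the changes made after the given commit."""
        return [entry for entry in self.log if entry[0] > since_commit]


trips = ToyTable()
trips.upsert("trip-1", {"city": "SF", "fare": 12.0})
checkpoint = trips.upsert("trip-2", {"city": "NYC", "fare": 20.0})
trips.upsert("trip-1", {"city": "SF", "fare": 14.5})  # late fare correction
trips.delete("trip-2")

changes = trips.incremental(checkpoint)
print(trips.rows)
print([(op, key) for _, op, key, _ in changes])
```

A downstream team that last read the table at `checkpoint` can pick up just the two later changes instead of rescanning everything, which is the idea behind keeping every team on the same, current data.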

“It's merely the capabilities Hudi added on top of vanilla Hadoop or cloud storage,” said Chandar.

After incubating Hudi for several years inside Uber, the company contributed the project to Apache in 2019, and it has evolved through the work of an open-source community not unlike the one built around Iceberg.

Chandar’s startup Onehouse, which raised $8 million in seed funding in February, provides a managed service for companies using its Hudi-based lakehouse product.

In the past it was only the Ubers or Facebooks of the world that could afford the hardware and software infrastructure necessary to use these types of technologies in their own data centers, but today the more widespread cloud-centric data ecosystem is ripe for broader adoption of those technologies by other businesses, said Rockset’s Venkataramani. Because Iceberg and Hudi were designed to work in cloud environments, where companies can afford to manage large volumes of data and easily estimate costs of performing queries and analytics using that data, Venkataramani said, the barriers to adoption have been lifted.

“It’s the market demanding projects like Hudi and Iceberg,” he said.

That could bode well for Weeks, the former Netflix engineer who helped create Iceberg. Just last year, along with two other former Netflix data wranglers who also helped create Iceberg, he co-founded Tabular, a startup building a data platform using Iceberg.


Kate Kaye

Kate Kaye is an award-winning multimedia reporter digging deep and telling print, digital and audio stories. She covers AI and data for Protocol. Her reporting on AI and tech ethics issues has been published in OneZero, Fast Company, MIT Technology Review, CityLab, Ad Age and Digiday and heard on NPR. Kate is the creator of RedTailMedia.org and is the author of "Campaign '08: A Turning Point for Digital Media," a book about how the 2008 presidential campaigns used digital media and data.
