The big data compute team at Netflix was wrestling with some nagging data problems a few years ago.
“Earlier this week, we had somebody go in and rename a column, and in one engine they were getting results, and in the other ones they were getting null,” said Daniel Weeks, then a Netflix engineering manager, speaking at a 2019 developers’ conference. As head of that team, he and others were building a new way to solve those sorts of data-processing engine complexities that had prevented smoother analysis of the data rushing into the Netflix streaming service.
“We have more and more users coming in, and they need to not be bothered by these problems,” he said of the growing data team at Netflix at that time.
The new approach that was under construction at Netflix, with help from developers at companies including Apple and Salesforce, became Apache Iceberg, an open-source table format standard for large analytic datasets. Back in 2019, Weeks predicted that iterative improvements to Iceberg from the open-source community would help ensure issues like the one he described wouldn’t happen again.
He was right. While most companies don’t need to perform business analytics on top of tens of petabytes of data the way Netflix does, data architectures including Iceberg and Hudi — a system incubated inside Uber to solve similar problems — now form the foundation of products sold to other enterprises as so-called data lakehouses.
Dremio, which calls itself a lakehouse company, announced Wednesday that its Dremio Cloud data lakehouse platform — based in part on Apache Iceberg — is now widely available.
“A lakehouse needs to be open source: That’s why Iceberg has started to get so much momentum,” said Tomer Shiran, founder and chief product officer at Dremio. Companies that need to perform business analytics on top of huge amounts of data such as Netflix, Apple and Salesforce helped build Apache Iceberg, Shiran said, “because these companies needed something like that. Tech companies have been at the leading edge in terms of adopting this kind of architecture.”
Right now, open-source data lakehouse architectures are following a pattern seen with other data standards built or used inside large Silicon Valley tech companies before businesses began moving data to the cloud. For instance, years before Yahoo spun out its open-source data analytics software Hadoop as a new company, businesses including eBay and Facebook were already using it internally.
Another foundational open-source data technology, Kafka, was developed inside LinkedIn; its creators left the business social networking company in 2014 to found Confluent and commercialize Kafka. And Databricks, a fast-growing data vendor, launched its own lakehouse-style open-source project in 2019 called Delta Lake.
How the lakehouse evolved
What vendors today call “the lakehouse” is, to many data professionals, just an evolved version of the data lake that combines elements of the traditional data warehouse. A data lake is essentially a receptacle for ingesting information, such as website activity data showing what movie content people perused, or data associated with trips taken through a ride-hailing app.
The lakehouse provides a structural layer on top of the otherwise raw and chaotic data stored in a data lake, allowing data scientists and others to perform analytics processes such as querying the data without having to move it first into a more structured warehouse environment.
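To make the idea concrete: a table format is essentially a small manifest layer that tells a query engine which raw files in the lake make up a table, so the engine can read them in place instead of copying the data into a warehouse first. The following toy Python sketch illustrates that idea only — the dictionaries, file paths and the `scan` function are all hypothetical, not the actual interface of Iceberg, Dremio or any other system.

```python
# Toy sketch of a "table format" layer over a data lake.
# All names and structures here are hypothetical, for illustration only.

# Raw files sitting in the lake (think object storage), keyed by path.
lake = {
    "events/part-0.parquet": [{"user": "a", "title": "Movie X"}],
    "events/part-1.parquet": [{"user": "b", "title": "Movie Y"}],
}

# The table format adds a manifest: which files belong to the table.
manifest = {
    "table": "events",
    "files": ["events/part-0.parquet", "events/part-1.parquet"],
}

def scan(manifest, lake):
    """Query the table in place: read only the manifest-listed files."""
    for path in manifest["files"]:
        yield from lake[path]

rows = list(scan(manifest, lake))
```

Because every engine consults the same manifest, two different query engines reading this table see the same set of files — the class of inconsistency Weeks described at Netflix.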
“Moving data can be very expensive from system to system,” said Ben Ainscough, head of AI and data science at business intelligence tech company Domo.
Dremio aims to make information sitting inside data lakes more operational with its new lakehouse features available Wednesday, including a query engine called Sonar and a system called Arctic that helps developers and data scientists keep track of changes made to data. The company is providing free versions of its lakehouse and the other new features, though large enterprises in need of support services and advanced integrations for security or other customizations have to pay.
Arctic is a collaborative data archive system allowing data scientists and engineers to store and access information that reflects how data is used or changed. Known as metadata, this information provides details such as where the data was sourced or what specific time it was ingested or manipulated.
“Observability of what’s changing is important,” said Shiran. “If somebody goes and changes something, a lot of things can go wrong even with good intent.”
The archived metadata stored in the system might show what data was used inside an analytics tool like Tableau to help a company decide whether to buy certain materials or products, for example. Or it could be used when a data scientist wants to run a query to learn more about the most recent action taken inside a data folder.
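The snapshot idea behind this kind of metadata tracking can be sketched in a few lines of Python. This is a hypothetical toy, not Arctic's or Iceberg's real interface: every change to the table is recorded as a new snapshot, so you can both audit who changed what and reconstruct the table as it looked at any earlier point.

```python
# Toy metadata log in the spirit of a table format's snapshot history.
# Hypothetical sketch only: real systems (Iceberg, Arctic) track far
# richer metadata, such as schemas, partitions and file statistics.

snapshots = []

def commit(files, author, operation):
    """Record a new table snapshot and return its id."""
    snapshots.append({"files": list(files),
                      "author": author,
                      "op": operation})
    return len(snapshots) - 1

s0 = commit(["part-0.parquet"], "etl-job", "initial load")
s1 = commit(["part-0.parquet", "part-1.parquet"], "etl-job", "append")

def files_at(snapshot_id):
    """'Time travel': the files that made up the table at a snapshot."""
    return snapshots[snapshot_id]["files"]

def history():
    """Audit trail: who changed the table, and how."""
    return [(i, s["author"], s["op"]) for i, s in enumerate(snapshots)]
```

The `history` view is the observability Shiran describes: when something goes wrong, the metadata shows which change introduced it.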
“The real power comes from managing the metadata information,” said Venkat Venkataramani, who managed engineering teams that built Facebook's online data systems from 2008 to 2015 and is now CEO and co-founder of Rockset, a company that provides a database for building applications for real-time data. Open-source data architectures built to help solve the needs of tech giants — such as Iceberg and Hudi — keep track of metadata in a standardized way, Venkataramani said.
In order to get full value from investing in infrastructure and software to collect, store, manage and analyze data, businesses want multiple people and departments to be able to access and manipulate the same corpus of information. Historically, that has required copying the data and moving it so multiple users could work on it at the same time, which risks changes made to one copy going untracked and unreflected in the others. The constant influx of new and updated information adds even more complexity.
“People have been shouting about data silos for effectively ever,” said Boris Jabes, co-founder of Census, which makes software to help companies operationalize data for analytics. What’s different today, Jabes said, is that sales, marketing or other teams can each run their own data workloads separately on the same storage layer. “There’s a lot more infrastructure that can be shared now,” he said.
Uber built Hudi out of necessity
The data architecture teams inside Netflix and Uber aimed to alleviate problems associated with data silos by developing projects like Iceberg and Hudi, which were later contributed to the Apache Software Foundation.
When Vinoth Chandar, founder and CEO of Onehouse, worked at Uber as a senior staff engineer and manager of its data team starting in 2014, “we ran into this predicament,” he said: People in disparate divisions realized that one team’s data reflected recent updates while other teams’ did not. That meant each team had been analyzing what was happening in specific cities based on different data.
“It had a very profound impact on how we talked about something,” Chandar said.
At the time, Uber had a data warehouse stored on-premises, and used data infrastructure including Hadoop to manage all the analytics and machine-learning algorithms it was building to do things like decide how trip prices should change when it rains.
By building new data-management processes on top of the data lake where data initially entered its system, the company was able to help keep track of data changes and process data faster so all its teams were talking about the same data, Chandar said. That approach for bringing core functionality of Uber’s data warehouse to its data lake was referred to inside Uber as a “transactional data lake,” he said. They named it Hudi (pronounced like hoodie), an acronym for Hadoop Upserts, Deletes and Incrementals.
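The core operations in Hudi's name can be illustrated with a toy keyed table. This is a hypothetical sketch of the idea only — nothing like Hudi's actual API — showing how an upsert merges incoming records into existing ones by key (updating matches, inserting new keys) rather than appending duplicate copies, and how a delete removes records in place.

```python
# Toy illustration of upserts and deletes on a keyed table: the core
# idea behind "Hadoop Upserts, Deletes and Incrementals" (Hudi).
# Hypothetical sketch for illustration, not Hudi's actual API.

# Current state of the table, keyed by record id.
table = {"trip-1": {"fare": 10.0}, "trip-2": {"fare": 7.5}}

def upsert(table, records):
    """Merge incoming records by key: update matches, insert new keys."""
    for key, row in records.items():
        table[key] = {**table.get(key, {}), **row}

def delete(table, keys):
    """Remove records by key, ignoring keys that are absent."""
    for key in keys:
        table.pop(key, None)

upsert(table, {"trip-2": {"fare": 9.0},   # update an existing trip
               "trip-3": {"fare": 4.0}})  # insert a new one
delete(table, ["trip-1"])
```

Because updates land on the existing records instead of piling up as fresh copies, every team querying the table sees the same, current version of each record.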
“It's merely the capabilities Hudi added on top of vanilla Hadoop or cloud storage,” said Chandar.
After incubating Hudi for several years inside Uber, the company contributed the project to Apache in 2019, and it has evolved through the work of an open-source community not unlike the one built around Iceberg.
Chandar’s startup Onehouse, which raised $8 million in seed funding in February, provides a managed service for companies using its Hudi-based lakehouse product.
In the past, only the Ubers or Facebooks of the world could afford the hardware and software infrastructure needed to run these types of technologies in their own data centers, but today’s cloud-centric data ecosystem is ripe for broader adoption by other businesses, said Rockset’s Venkataramani. Because Iceberg and Hudi were designed to work in cloud environments, where companies can afford to manage large volumes of data and easily estimate the cost of queries and analytics on that data, he said, the barriers to adoption have fallen.
“It’s the market demanding projects like Hudi and Iceberg,” he said.
That could bode well for Weeks, the former Netflix engineer who helped create Iceberg. Just last year, along with two other former Netflix data wranglers who also helped create Iceberg, he co-founded Tabular, a startup building a data platform using Iceberg.