With Delta Lake, Databricks sparks an open-source nerd war and customer confusion

Databricks insists its Delta Lake database technology is open source, but critics say it’s not open source in spirit, and that could cost businesses time and money. This could all be part of the Databricks playbook as it prepares to go public.

An old enterprise tech debate had come to the cloud database wars.

Illustration: Christopher T. Fong/Protocol

Had the dispute erupted in a bar, it might have led to a sloppy brawl. But this was a virtual spat about wonky database tech philosophy, and instead of throwing left and right hooks, these contenders sparred with pithy sarcasm and thumbs-up emojis.

James Malone, senior manager of product management at Snowflake, took the first sneaky jab earlier this year. While introducing Snowflake’s support for Iceberg, an open-source table format, he emphasized its genuinely open, open-source status.

“Many data architectures can benefit from a table format, and in my view, #ApacheIceberg is the one to choose - it's (actually) open, has a vibrant and growing ecosystem, and is designed for interoperability,” he wrote in a January LinkedIn post.

He didn’t have to mention Delta Lake by name. Another database table format originally created by Snowflake competitor Databricks, Delta Lake has attracted less interest and engagement from the open-source developer community than Iceberg has. There already had been plenty of chatter among database wranglers questioning its open-source cred.

Databricks software engineers knew a dig at their baby when they saw it, and it got their dander up. They quickly came to Delta’s defense. A shouting match in sarcastic text ensued about the distinctions between a truly open-source project and one that’s proprietary.

John Lynch, field CTO at Databricks, poked Malone, pointing out in the same LinkedIn thread that Snowflake’s own software is itself proprietary. He posted a link to Delta Lake’s source code on GitHub, the go-to home for open-source software collaboration. A smiley face emoji punctuated the burn.

“It’s not open source. It’s open code,” responded Malone about Delta Lake.

“We don’t need to get into semantics James,” shot back Spencer Cook, financial services solutions architect at Databricks.

But this public display was about more than just developers and engineers picking sides in a tired debate, one that has recurred throughout the last 15 years of enterprise tech and the hundreds of open-source projects that drove the industry’s growth.

“Nerd wars are always fun. But there are some very objective differences in the approach that the Apache Iceberg project has taken versus the Databricks Delta Lake approach,” said Billy Bosworth, CEO of Dremio, whose company has highlighted its use of Iceberg in its own products.

Open and shut

Malone and other database engineers say there is confusion among their customers around what parts of Delta Lake are open source. They say Databricks puts up roadblocks to Delta’s full capabilities, forcing users to choose between paying for access to its full performance and breadth of features — or getting stuck with limited capabilities when implementing Delta’s open-source code.

They complain that even though Delta Lake lives on GitHub as an open-source project, Databricks employees wield undue control over decisions to make adjustments to its code without public review. They say that Iceberg — another database table format born inside Netflix and now managed by the open-source Apache Software Foundation — has fostered a more diverse community of contributors from a much wider array of companies than Delta.

The criticism of Delta Lake’s open-source status is “not totally a fair assessment,” said Denny Lee, head of developer relations at Databricks, who said the project has over 200 contributors from 70 different organizations. “Thousands of our customers — non-Databricks employees — are active in the community because Delta Lake is critical for the reliability of their data pipelines and we continue to add features based on their feedback,” he said.

However, open-source purists argue that a truly free and open-source project would not seek engagement from “customers,” but rather a wider community of collaborators. Ultimately, some say this quasi-open-source approach — however much it rubs some database builders the wrong way — is all part of the Databricks playbook.

“It gets a little confusing sometimes when you're trying to distinguish between the Databricks version of Delta Lake, and then what they've open-sourced in the open-source version of Delta Lake,” Bosworth said.

The confusion trickles up from the people building databases for queries and analytics to business decision-makers, said Malone. “We’ve heard that confusion from customers,” he said regarding Delta Lake, which Snowflake does support along with Iceberg. “A customer will want to make sure their workload will run reliably. It becomes a critical component. It has serious implications for how you’re running a business,” he said.

“At best, when features are missing, users likely have to rework their code when they switch between proprietary and open-code versions,” Malone said. At worst, he said customers are “locked into a paid version and that fact is not made clear.” He added, “There has not been anything done to address that confusion.”

Ali Ghodsi, co-founder and CEO of Databricks, responded to the criticism in a statement sent to Protocol: “Our platform documentation explains which performance features are only available on Databricks, but all of the features for reading, writing, and managing data are open and usable in this wide ecosystem of other products.” He added that Databricks is planning “a big announcement around open-source Delta Lake” at the company’s conference later this month.

Foundational questions

Although Iceberg and Delta Lake both attempt to fulfill the same data table formatting needs, there are distinctions that can affect a company’s bottom line, Bosworth said. “It's an architectural decision of the type where you live with it for about a decade or more when you make it. So, it's a very critical point in the architecture: to pause and ask, ‘Am I building my foundation on something that I'm going to be comfortable with for the next decade in my organization?’” he said.

Amid squabbles over Delta Lake, momentum is growing behind Iceberg. Along with adoption by Dremio and Snowflake, AWS added Iceberg support to its Athena query service, a feature that became generally available in April.

Google Cloud also gave Iceberg the nod, choosing to support it first over Delta in its new lakehouse product, BigLake. “We are supporting Iceberg first with BigLake because that’s the demand that we see on GCP,” Gerrit Kazmaier, vice president for Database, Data Analytics and Looker at Google, told Protocol. However, he added that GCP has limited support for Delta “because Databricks is available on GCP, and there are some Databricks ‘interop’ scenarios with BigQuery.”

Support in places like AWS, GCP and Snowflake could inspire developers to add Iceberg to their toolset while possibly passing over Delta, said Bosworth, who was a developer in the first decade of his career. “You don't want to miss the cool kids’ party. People underestimate the psychological impact of the developer decisions.”

Coolness is one thing, but getting a job matters, too. “A lot of developers like to be on the front edge of those waves as they emerge. A lot of developers know they won't go wrong with open-source projects on their resume,” he added.

Still, some companies have not warmed up to Iceberg.

Microsoft and its customers have cozied up to Delta Lake instead, said James Serra, a data and AI solutions architect at Microsoft who helps its customers build solutions in its Azure cloud platform. “When it comes to Iceberg, I honestly haven’t seen any customers at all using it. Over time, especially in the last year, everybody is going, in our world, to Delta Lake.”

Because of that customer interest, he said, Microsoft updated its products to incorporate the open-source version of Delta while adding its own improved data storage and performance features.

'Delta Lake is not a Databricks project'

When Delta users run into problems, issues are sometimes addressed by Databricks employees and treated almost like IT or customer service tickets, rather than through the collaborative tinkering common in many open-source communities. When bugsbunny1101 posted issue #1129 in the Delta Lake GitHub project in May noting “inconsistent behavior between opensource delta and databricks runtime,” another user added, “I'm experiencing the exact same issue.”

Two Databricks software engineers chimed in saying they were investigating the issue. “We at Delta Lake haven't forgotten about this issue,” wrote Scott Sandre, a Databricks software engineer, in late May. “We are working away on the next Delta Lake release, and are hoping to get it out by the Data and AI summit next month,” he continued, alluding to his company’s upcoming conference.

Serra said Delta Lake might not satisfy the criteria of a genuinely open-source project, in part because “it is not widely contributed to.” But that might not matter, he said. “You could say it’s still a really good solution because Databricks is contributing to it and they’ve made it work really well.”

While many contributors to Delta Lake are from Databricks, people from other companies including Esri, IBM and Microsoft have collaborated in its community on GitHub.

“It’s first important to note that while Databricks has built on top of Delta Lake within our Lakehouse Platform to advance query performance, Delta Lake is not a Databricks project,” Ghodsi said, noting that Delta Lake is managed by the Linux Foundation and people from AWS, Comcast, Google and Tableau contribute code to it.

Revisiting Spark’s quasi-open-source playbook

Databricks has an inherent conflict of interest in Delta Lake, said Ryan Blue, co-founder and CEO of data platform startup Tabular and a former Netflix database engineer who helped build Iceberg. Because Databricks sells access to its compute engine while also offering a storage format like Delta, he said, the company is likely to steer people toward its own compute services for better performance.

“Everyone sees the vision of this multi-engine future,” Blue said, explaining why Tabular is built on Iceberg. “We’re saying we’re going to be neutral to the compute engine because that’s what’s in our customer’s interest.”

But delivering performance enhancements through the paid version is indeed the Databricks strategy. “The difference is in the performance,” Lee told Protocol. “Databricks has done things to make the query performance much faster, but that has nothing to do with the format.” He acknowledged the confused perception of Delta Lake is understandable because “Delta Lake was originally proprietary [in] 2017 before it was made open source in 2019.”

Indeed, with Delta Lake, the co-founders of Databricks seem to be running in reverse the same pseudo-open-source play they used to monetize the open-source user base that had built up around Apache Spark, the popular open-source project they started in 2009. That time, they packaged improved features for Spark into a better-performing paid product, forming the foundation of Databricks, which launched in 2013.

“We quickly realized only open source would fuel really big growth,” Ghodsi said in a 2021 conversation with Forbes regarding Spark. “The challenge, though, was getting anyone to pay for our product.” The profit-driven compromise was what Ghodsi himself called “SaaS open source,” wherein Databricks charges customers to update and operate the product while contributing “constantly to the open-source version of Databricks that’s entirely free.”

“You can say they’re trying to do the same thing with Delta Lake,” Serra said.

“This seems to me like slightly disingenuous behavior,” said Armon Petrossian, CEO of data transformation and analytics company Coalesce, who said some companies seem to establish open-source projects in order to generate a community around them, then pull a bait-and-switch by converting those projects to paid products or steering users toward a better, paid version.

“We’ve seen the concept of open source evolve over the years where what was some altruistic intention of being able to support users [has become] a go-to-market motion,” Petrossian said.

“I never see [Databricks] as ever being dishonest or manipulative,” Bosworth said. “I don't think it's in any sense a nefarious sort of thing. It's just their business model. And that's okay.”

If anything, the confusion and contention around Delta Lake illustrate that there are many interpretations of what “open” means in relation to software technology.

“Open comes in a lot of flavors. There's open source; there's open formats; and there's open standards,” Bosworth said. “You can conceptually have a very open system that's based on open standards and open protocols, and open formats, files and things like that — but no open-source software.”

“Trying to define open source is hard,” Malone said. “This is not necessarily a new problem.”
