With Delta Lake, Databricks sparks an open-source nerd war and customer confusion

Databricks insists its Delta Lake database technology is open source, but critics say it’s not open source in spirit, and that could cost businesses time and money. This could all be part of the Databricks playbook as it prepares to go public.


Illustration: Christopher T. Fong/Protocol

Had the dispute erupted in a bar, it might have led to a sloppy brawl. But this was a virtual spat about wonky database tech philosophy, and instead of throwing left and right hooks, these contenders sparred with pithy sarcasm and thumbs-up emojis.

James Malone, senior manager of Product Management at Snowflake, took the first sneaky jab earlier this year. While introducing Snowflake’s support for Iceberg, an open-source database table format, he emphasized its genuinely open, open-source status.

“Many data architectures can benefit from a table format, and in my view, #ApacheIceberg is the one to choose - it's (actually) open, has a vibrant and growing ecosystem, and is designed for interoperability,” he wrote in a January LinkedIn post.

He didn’t have to mention Delta Lake by name. Another database table format originally created by Snowflake competitor Databricks, Delta Lake has attracted less interest and engagement from the open-source developer community than Iceberg has. There already had been plenty of chatter among database wranglers questioning its open-source cred.

Databricks software engineers knew a dig at their baby when they saw it, and it got their dander up. They quickly came to Delta’s defense. A shouting match in sarcastic text ensued about the distinctions between a truly open-source project and one that’s proprietary.

An old enterprise tech debate had come to the cloud database wars.

John Lynch, field CTO at Databricks, poked Malone, pointing out in the same LinkedIn thread that Snowflake’s own software is itself proprietary. He posted a link to Delta Lake’s source code on GitHub, the go-to home for open-source software collaboration. A smiley face emoji punctuated the burn.

“It’s not open source. It’s open code,” responded Malone about Delta Lake.

“We don’t need to get into semantics James,” shot back Spencer Cook, financial services solutions architect at Databricks.

But this public display was about more than just developers and engineers picking sides in a tired debate that has recurred throughout the last 15 years of enterprise tech, a period whose growth was driven by hundreds of open-source projects.

“Nerd wars are always fun. But there are some very objective differences in the approach that the Apache Iceberg project has taken versus the Databricks Delta Lake approach,” said Billy Bosworth, CEO of Dremio, whose company has highlighted its use of Iceberg in its own products.

Open and shut

Malone and other database engineers say there is confusion among their customers around what parts of Delta Lake are open source. They say Databricks puts up roadblocks to Delta’s full capabilities, forcing users to choose between paying for access to its full performance and breadth of features — or getting stuck with limited capabilities when implementing Delta’s open-source code.

They complain that even though Delta Lake lives on GitHub as an open-source project, Databricks employees wield undue control over its code and can make changes without public review. They say that Iceberg, another database table format that was born inside Netflix and is now managed by the Apache Software Foundation, has fostered a more diverse community of contributors from a much wider array of companies than Delta.

The criticism of Delta Lake’s open-source status is “not totally a fair assessment,” said Denny Lee, head of developer relations at Databricks, who said the project has over 200 contributors from 70 different organizations. “Thousands of our customers — non-Databricks employees — are active in the community because Delta Lake is critical for the reliability of their data pipelines and we continue to add features based on their feedback,” he said.

However, open-source purists argue that a truly free and open-source project would not seek engagement from “customers,” but rather a wider community of collaborators. Ultimately, some say this quasi-open-source approach — however much it rubs some database builders the wrong way — is all part of the Databricks playbook.

“It gets a little confusing sometimes when you're trying to distinguish between the Databricks version of Delta Lake, and then what they've open-sourced in the open-source version of Delta Lake,” Bosworth said.

The confusion trickles up from the engineers who build databases for queries and analytics to business decision-makers, Malone said. “We’ve heard that confusion from customers,” he said regarding Delta Lake, which Snowflake does support along with Iceberg. “A customer will want to make sure their workload will run reliably. It becomes a critical component. It has serious implications for how you’re running a business,” he said.

“At best, when features are missing, users likely have to rework their code when they switch between proprietary and open-code versions,” Malone said. At worst, he said customers are “locked into a paid version and that fact is not made clear.” He added, “There has not been anything done to address that confusion.”

Ali Ghodsi, co-founder and CEO of Databricks, responded to the criticism in a statement sent to Protocol: “Our platform documentation explains which performance features are only available on Databricks, but all of the features for reading, writing, and managing data are open and usable in this wide ecosystem of other products.” He added that Databricks is planning “a big announcement around open-source Delta Lake” at the company’s conference later this month.

Foundational questions

Although Iceberg and Delta Lake both attempt to fulfill the same data table formatting needs, there are distinctions that can affect a company’s bottom line, Bosworth said. “It's an architectural decision of the type where you live with it for about a decade or more when you make it. So, it's a very critical point in the architecture: to pause and ask, ‘Am I building my foundation on something that I'm going to be comfortable with for the next decade in my organization?’” he said.

Amid squabbles over Delta Lake, momentum is growing behind Iceberg. Along with adoption by Dremio and Snowflake, AWS added Iceberg support to its Athena query service, and that support was made broadly available in April.

Google Cloud also endorsed Iceberg by choosing to support it before Delta in its new lakehouse product, BigLake. “We are supporting Iceberg first with BigLake because that’s the demand that we see on GCP,” Gerrit Kazmaier, vice president for Database, Data Analytics and Looker at Google, told Protocol. However, he added that GCP has limited support for Delta “because Databricks is available on GCP, and there are some Databricks ‘interop’ scenarios with BigQuery.”

Support in places like AWS, GCP and Snowflake could inspire developers to add Iceberg to their toolset, while possibly dismissing Delta, said Bosworth, who worked as a developer for the first decade of his career. “You don't want to miss the cool kids’ party. People underestimate the psychological impact of the developer decisions.”

Coolness is one thing, but getting a job matters, too. “A lot of developers like to be on the front edge of those waves as they emerge. A lot of developers know they won't go wrong with open-source projects on their resume,” he added.

Still, some companies have not warmed up to Iceberg.

Microsoft and its customers have cozied up to Delta Lake instead, said James Serra, a data and AI solutions architect at Microsoft who helps its customers build solutions in its Azure cloud platform. “When it comes to Iceberg, I honestly haven’t seen any customers at all using it. Over time, especially in the last year, everybody is going, in our world, to Delta Lake.”

Because of that customer interest, he said, Microsoft updated its products to incorporate the open-source version of Delta while adding its own improved data storage and performance features.

'Delta Lake is not a Databricks project'

Sometimes when Delta users run into problems, the issues are not resolved through the collaborative tinkering common in many open-source communities. Instead, Databricks employees address them, treating them almost like IT or software customer service tickets. When bugsbunny1101 posted issue #1129 in the Delta Lake GitHub project in May noting “inconsistent behavior between opensource delta and databricks runtime,” another user added, “I'm experiencing the exact same issue.”

Two Databricks software engineers chimed in saying they were investigating the issue. “We at Delta Lake haven't forgotten about this issue,” wrote Scott Sandre, a Databricks software engineer, in late May. “We are working away on the next Delta Lake release, and are hoping to get it out by the Data and AI summit next month,” he continued, alluding to his company’s upcoming conference.

Serra said Delta Lake might not satisfy the criteria of a genuinely open-source project, in part because “it is not widely contributed to.” But that might not matter, he said. “You could say it’s still a really good solution because Databricks is contributing to it and they’ve made it work really well.”

While many contributors to Delta Lake are from Databricks, people from other companies including Esri, IBM and Microsoft have collaborated in its community on GitHub.

“It’s first important to note that while Databricks has built on top of Delta Lake within our Lakehouse Platform to advance query performance, Delta Lake is not a Databricks project,” Ghodsi said, noting that Delta Lake is managed by the Linux Foundation and people from AWS, Comcast, Google and Tableau contribute code to it.

Revisiting Spark’s quasi-open-source playbook

Databricks has an inherent conflict of interest in Delta Lake, said Ryan Blue, co-founder and CEO of data platform startup Tabular and a former Netflix database engineer who helped build Iceberg. He said that because Databricks sells access to its compute engine while also offering a data storage product like Delta, the company is likely to steer people toward its compute services to get better performance.

“Everyone sees the vision of this multi-engine future,” Blue said, explaining why Tabular is built on Iceberg. “We’re saying we’re going to be neutral to the compute engine because that’s what’s in our customer’s interest.”

But delivering performance enhancements through the paid version is indeed the Databricks strategy. “The difference is in the performance,” Lee told Protocol. “Databricks has done things to make the query performance much faster, but that has nothing to do with the format.” He acknowledged the confused perception of Delta Lake is understandable because “Delta Lake was originally proprietary [in] 2017 before it was made open source in 2019.”

Indeed, with Delta Lake, the co-founders of Databricks seem to be running in reverse the same pseudo-open-source play they used to monetize the open-source user base that had built up around Apache Spark, the popular open-source project they started in 2009. That time, they packaged improved features for Spark into a better-performing paid product, forming the foundation of Databricks, which launched in 2013.

“We quickly realized only open source would fuel really big growth,” Ghodsi said in a 2021 conversation with Forbes regarding Spark. “The challenge, though, was getting anyone to pay for our product.” The profit-driven compromise was what Ghodsi himself called “SaaS open source,” wherein Databricks charges customers to update and operate the product while contributing “constantly to the open-source version of Databricks that’s entirely free.”

“You can say they’re trying to do the same thing with Delta Lake,” Serra said.

“This seems to me like slightly disingenuous behavior,” said Armon Petrossian, CEO of data transformation and analytics company Coalesce, who said some companies seem to establish open-source projects in order to generate a community around them, then pull a bait-and-switch by converting those projects to paid products or steering users toward a better, paid version.

“We’ve seen the concept of open source evolve over the years where what was some altruistic intention of being able to support users [has become] a go-to-market motion,” Petrossian said.

“I never see [Databricks] as ever being dishonest or manipulative,” Bosworth said. “I don't think it's in any sense a nefarious sort of thing. It's just their business model. And that's okay.”

If anything, the confusion and contention around Delta Lake illustrate that there are many interpretations of what “open” means in relation to software technology.

“Open comes in a lot of flavors. There's open source; there's open formats; and there's open standards,” Bosworth said. “You can conceptually have a very open system that's based on open standards and open protocols, and open formats, files and things like that — but no open-source software.”

“Trying to define open source is hard,” Malone said. “This is not necessarily a new problem.”

