The New Database

It’s the golden age of databases. It can’t last.

Startups are reaping huge funding rounds. But money alone won't be enough to top the current market leaders.

Massive amounts of data equal massive amounts of opportunity for database companies.

Credit: Jasmin Merdan / Getty Images

This story is part of "The New Database," a Protocol special report.

It's a great time to be a database company. Money is flowing into the sector in historic amounts, creating a rush of new startups that command huge valuations. In 2020, startups developing traditional databases, which couple the processing engine with storage, took in $2.3 billion in funding across 54 deals, up from $849 million in 2019, according to data from CB Insights. That number doesn't even encompass the newer entities that are decoupling compute from the repositories.

No two companies embody this reality better than Snowflake, which had a ground-breaking IPO in September, and Databricks, which is gearing up for its own potentially blockbuster public offering some time this year. But as those two companies compete to become a one-stop data shop for organizations, smaller rivals are also trying to carve out their own niche. More traditional database providers like MongoDB are still gaining traction with new cloud-based products. And looming over all of this are competing efforts by the cloud giants: AWS, Microsoft and Google.

"There's so many options now. In the last 15 years, it's been much more accelerated than it's ever been," Andy Pavlo, an associate professor at Carnegie Mellon University, told Protocol. "It's the golden age of databases."

But industry insiders don't expect this period of free-flowing cash, sky-high valuations and a litany of vendors to last. Instead, they expect a wave of consolidation within the next decade, similar to what's been happening with enterprise software companies in Salesforce and Microsoft's sights — or what happened with the NoSQL industry.

When that happens is anyone's guess. But CEOs from companies like Dremio and Starburst — which are smaller in comparison to the likes of Databricks, Snowflake and MongoDB — are confident they will remain independent vendors in the long run.

"There's unquestionably a proliferation in the market of databases in 2021. And these things go in cycles, so there'll be a consolidation push," said Cockroach Labs CEO Spencer Kimball. "If we can solve the operational, relational needs of a company and take them into this [new] way of doing data architecture, we'll win a substantial fraction of the largest market in software."

And as larger enterprises add more and more vendors to the mix, other up-and-comers see an opportunity to serve as the glue that can bring separate piles of corporate information together, banking on their neutrality as the differentiator.

"People will have data in Snowflake. And they'll have data in S3. And they'll have some data in a relational database," said Starburst co-founder Matt Fuller. "We can be that engine that really provides access to all your data … We're not going to go into the space of trying to force everyone to move all their data."

Larger competitors, however, view the future of the industry much differently.

"The infrastructure players are the ones that are going to have a hard time. I wouldn't put a dollar in Starburst or Cockroach," said Snowflake product chief Christian Kleinerman.

Doing it all

The database wars span decades. And for many years, it was a competition between the likes of Oracle, SAP, IBM and others. But now, as more enterprises move to the cloud and the amount of data they generate explodes, the industry is entering a new phase.

The old technology boundaries are fading away, changing the definition of what it means to be a database. The engine processing the information is becoming as important as the repository where it is stored, and a pivot to open architectures is challenging the closed systems that gained prominence in the past two decades.

"Enterprises are recognizing there's a new breed of cloud-data platform companies that don't make [them] compromise and give [them] more flexibility over time," said MongoDB Chief Product Officer Sahir Azam. "It's the early part of a consolidation of capabilities. As an end customer, you can't rationalize, integrate and manage 50 different vendors to have a cloud data architecture."

And now, more independent vendors are trying to round out their product portfolios. Snowflake, for example, gained prominence by providing SQL-based data queries over its data warehouse, a service craved by business analysts who need very structured information to deliver up-to-the-minute dashboard updates on the status of the business. It's not as helpful for data scientists who rely on a much broader swath of data to power machine-learning algorithms that try to predict the future or unearth previously unknown correlations. Still, Snowflake's approach helped propel the company to a historic IPO and its current market cap of roughly $74 billion.

Some rivals, however, think Snowflake's tech is not much better than that of legacy providers. It doesn't let users share information openly with non-Snowflake customers, which is also an oft-cited criticism of vendors like Oracle. The company, however, believes the decision to double down on one file format gives it more flexibility to upgrade the product to the benefit of its users.

Customers appear to agree, given that the company claims a net retention rate of 168%. But Snowflake has also invested in creating a more open architecture. It supports what it calls "Reader" accounts, effectively free agreements that enable non-paying users to receive data sets.

That won't be enough for some customers. Databricks, for example, envisions a different future where data can be stored more easily and shared more freely. The $28 billion startup touts its ability to let users dump their data in one place — what it calls the "data lakehouse" — and connect into almost any outside software they'd need. That alleviates the challenge of copying and moving the data, which is a time-intensive effort and compliance nightmare, especially given the increasingly patchwork approach to consumer privacy laws.

While it may not be the primary mode of storage now, Databricks CEO Ali Ghodsi and others believe "the data lakehouse" will increasingly serve as the key repository. Even Bill Inmon, an industry icon who is heralded as the father of the data warehouse, agrees. One reason, according to advocates, is that the tech makes it much easier for anyone in an organization to tap the information. Data scientists can use it for the raw information needed to build algorithms that back everything from ecommerce sites to drug discovery research, for example. That's what Databricks specializes in.

Rivals say those types of queries represent just a small portion of a company's overall data needs. But it's easy to see how that market will continue to grow as AI matures and becomes more ubiquitous within enterprises.

Now, Snowflake and Databricks are trying to wade into each other's territories. Snowflake is offering more AI capabilities, though largely through partnerships with the cloud hyperscalers. Databricks recently began offering SQL-based query options, a decision that came after much internal debate over whether it was worth it to invest in an area Snowflake already excels in.

"Our focus is AI and data science," Ghodsi told Protocol at a recent event. But "we saw that more and more people are actually plugging in BI tools to Databricks asking questions about the past, which is a thing we didn't have support for."

There are, of course, other frontrunners to consider. Palantir, for example, is reporting strong growth since its own IPO last year. But the data analytics company has largely focused on the federal market and doesn't appear to have the enterprise penetration that Databricks and Snowflake do.

The hybrid question

Ultimately, Databricks and Snowflake's main competitors probably aren't each other, but rather Microsoft, AWS and Google.

Each cloud provider offers its own database products, along with AI and analytics compute engines. Microsoft has Access and Azure ML. AWS, which has a ton of different cloud database products, sells S3, Athena and SageMaker. And Google has BigQuery.

Those are just a sample of the menu of applications the companies are adding to complement their cloud storage. And data leaders at enterprises expect that, as those services improve, customers will invest more heavily in the same vertical stack instead of adding more third-party vendors to the mix. Whirlpool CIO Dani Brown previously told Protocol, for example, that "when it comes to exposing data … Google is our main cloud environment."

But consolidation has potential shortcomings. Companies are increasingly adopting a multi-cloud strategy alongside their own on-premises storage systems. So while it may be convenient to run data stored within Azure on Azure ML, for example, there is bound to be information stored in other systems. And independent vendors say the cloud providers don't make it easy to move the data out.

"BigQuery runs on hardware that is not even part of GCP. It runs on things that are very unique. How are you going to be able to do a great BigQuery on Amazon? Unclear," said Kleinerman.

Hyperscalers, however, say that type of situation isn't going to be that common. Most organizations that were early cloud adopters, for example, are likely to have AWS as a sizable, if not majority, portion of their cloud spend. The company doesn't see that changing despite all the chatter about the hybrid environment. And that's why it continues to invest so heavily in its data analytics and AI tools, which many industry experts will say are sector-leading products, as a way to further entice customers to stay in the ecosystem.

"It's rare that we see a workload that spans multiple clouds in a single application," said Rahul Pathak, the vice president of analytics at AWS.

The obvious advantages are surely going to outweigh the downsides for some customers. And the reality is many organizations are likely to incorporate multiple vendors. Startup bank Current, for example, uses Memorystore and BigQuery from Google, as well as MongoDB Atlas, Neo4j and MySQL from Oracle and others.

"At the end of the day, those technologies serve extremely specific use cases," said Trevor Marshall, Current's chief technology officer.

Still, as the cloud providers' database businesses grow larger, a surge of smaller startups is nipping at their heels. And that could change the makeup of enterprise data architecture in the future.

'You don't need to be the best analytics database'

In the shadow of Snowflake's IPO and Databricks's $1 billion funding round in February, other startups were also raking in big investments.

Dremio raised $135 million at a $1 billion valuation. Cockroach Labs raised $160 million at a $2 billion valuation. Starburst raised $100 million at a $1.2 billion valuation. And Redis Labs raised $110 million at a valuation of more than $2 billion.

It's perhaps no surprise the sector saw such attention after Snowflake's historic IPO.

"Watching its rise to be so successful and so big as a company that does not have the name of cloud service provider was really monumental for the market," said Dremio CEO Billy Bosworth. "That's what opened up the door for a lot of investors to say it's not only possible but it's determined that the future is going to be these ISVs that … can be their own big standalone companies."

But instead of becoming a jack-of-all-trades like Snowflake and Databricks and competing with the larger rivals, some see an advantage in specializing.

Cockroach Labs, for example, touts its ability to store and analyze transactional data, an area where rivals could have a harder time competing. The tech gets complicated, but essentially as the amount of data grew over the past decade, it became difficult for databases to handle both transactional and analytical data processing. While data lakehouse advocates say that tech solves that issue, Cockroach Labs's Kimball said it will be difficult to convince companies like JPMorgan Chase, one of its customers, of that approach.

"Are we as good as Snowflake at doing really huge aggregations and analytics queries? If we became that good, we'd essentially be building their product," he said. "You don't need to be also the best analytics database in order to win that plurality of the market."

Others are banking on their openness to stand apart by providing customers the opportunity to analyze their data wherever they want, effectively removing the database itself from the equation and just focusing on the increasingly important analysis or machine-learning layers.

Starburst, for example, rolled out its own SQL-based query tool in May to achieve just that. Right now, however, it's only available on AWS, which shows the long road ahead for a decades-old industry that is now in the nascent stages of the next generation.

But it gets to the heart of the transformation underway toward a more open architecture, one that Fuller and others think could end up undermining the success of Snowflake and others in the long run. Databricks, for example, introduced a new feature in May that allows users to share data sets with any other program that supports its Delta Lake open-source project.

What's happening in the database world is not much different from what's happened in tech in general over the past few decades. New technology is giving rise to a rush of competitors that are all trying to define the future. A key difference, however, is just how much businesses are going to rely on advanced data analytics. That is bound to create massive demand, which means that several vendors will be able to flourish.

Still, history is littered with companies that failed to execute or anticipate the innovations of the future fast enough to prevent upstarts from surpassing them. (Salesforce and Oracle, for example.) One thing is certain: The big data revolution isn't slowing down. And that means the war over managing it and putting the information to use will only get more fierce.