October 26, 2021
Data mesh, augmented analytics and software-defined open APIs will be part of the next evolution, members of Protocol's Braintrust say.
Good afternoon! We're in your inbox a little early this week talking data warehouses and data lakes. In the last few years, the features and functions of each have shown some signs of convergence, so we asked the experts to tell us how these tools will be used in the future and what new services might be baked in. Questions or comments? Send us a note at email@example.com. We'll be back on Thursday with another roundtable!
CIO at Oracle
As data models are converging, data is becoming much more democratized, both in how it's created and used. With more convergence and consolidation of data platforms, businesses can analyze all types of data faster and more effectively with less complexity to drive business value.
Building upon data lakes, warehouses and lakehouses, a data mesh architecture will help businesses address this opportunity, allowing data products in different physical locations to interact safely and securely. The data mesh architecture complements these technologies but takes a radically different approach. Its decentralized nature promises to help companies reuse and recombine data more easily, helping businesses increase their return on data capital.
This unified architecture helps businesses query and analyze all data. Both patterns and their services enable customers to move their data where they need it, whether in a hybrid or multi-cloud environment, bringing together the richness of the data warehouse with the breadth of popular open source, enabling teams to use the right tool for the job within a unified architecture. This meshed architecture allows businesses to connect, ingest, understand (using AI) and combine new data with existing data sources, providing more real-time and predictive insights to help businesses make more informed decisions.
In our cloud, we're already seeing companies big and small complement their data lakehouses with a data mesh.
Chief Product Officer at Hitachi Vantara
Data warehouses have long served as repositories for structured data. They're good at answering questions you already know to ask. But with lots of new data, you don't always know what to ask, which is how data lakes came about. You can throw anything into a data lake and delay pulling from it until you know what schema to apply.
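The "delay pulling from it until you know what schema to apply" idea is often called schema-on-read. Here is a minimal sketch of it in Python, under stated assumptions: the records, field names and the `read_with_schema` helper are hypothetical and purely illustrative, not any vendor's API.

```python
import json

# Raw records land in the "lake" as-is; no schema is enforced at write time.
# (Illustrative data only.)
raw_lake = [
    '{"device": "pump-1", "temp_c": 71.5, "ts": "2021-10-26T09:00:00"}',
    '{"device": "pump-2", "vibration": 0.12}',
    '{"user": "alice", "action": "login"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    Keeps only records that contain every field named in the schema,
    and casts each kept field to the requested type.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows


# The question ("which devices reported a temperature?") is only decided now,
# long after the data was written.
temperature_schema = {"device": str, "temp_c": float}
temperature_rows = read_with_schema(raw_lake, temperature_schema)
```

The same raw lines can be read again later with a different schema, for example `{"user": str}`, without reloading or restructuring anything. A warehouse would instead reject or reshape records at load time.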
Now, we have lakehouses, or a hybrid of the data lake and the data warehouse. The idea is you can build the same low-cost storage model of a data lake but with the transactional benefit of a data warehouse. The best of both worlds.
The reality, however, is that in any large enterprise, you have multiple data repositories that need a comprehensive data fabric that can be layered on top. We believe organizations will increasingly build a data fabric on top of these lakehouses, lakes and other repositories to provide a 360-degree view of all data within the business.
Organizations are going to do that through a combination of an enterprise data catalog and comprehensive data integration technology. Only then can an enterprise unify all data into a single view of IT/OT data through AI-driven data cataloging. They can then apply data integration and data orchestration to pull in data from these repositories and draw meaningful insights. So, an intelligent data catalog becomes even more important as the enterprise adds data stores like lakehouses to its environment.
CTO at Digital.ai
Most companies have both data architecture models deployed to serve distinct user communities within the organization. While the models show signs of greater convergence, there will still be continued innovation within the context of each model for the foreseeable future. Any convergence will be functionally incomplete in the near term, with complex data not fully supported by data warehouses to the extent required by emerging predictive operational applications.
Consequently, additional services introduced will support existing "sweet spots" of the two models and use cases that are predicated on a converged alternative. In the former, we're seeing new capabilities for augmented analytics in the structured data warehouse/BI area just as we're seeing ML use cases that provide real-time predictions against highly varietal data in semi-structured data lakes.
In the converged model, where a single architecture must meet the needs of all the relevant users, we see a range of capabilities emerging. On one hand, in modern data lakes, we see capabilities such as data frames more readily enabling the use of BI tools, as well as advances in data streaming processes to take on some of the batch-oriented ETL workloads. On the other hand, in modern cloud data warehouses, we see capabilities such as reverse ETL to write back analytical content into operational systems, a secular tool-independent metrics layer, new data sharing and governance capabilities, embedded AI capabilities and new visualization tools allowing data scientists/analysts to rapidly source data and serve insights in the languages of their choice.
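For readers unfamiliar with reverse ETL, the following is a rough sketch of the pattern: an aggregate is computed in the analytical store and written back to an operational one. Everything here is an assumption made for illustration. Two in-memory SQLite databases stand in for the warehouse and the operational system, and the table names, columns and the `reverse_etl` function are hypothetical.

```python
import sqlite3

# Two in-memory databases stand in for the warehouse and the operational app.
warehouse = sqlite3.connect(":memory:")
crm = sqlite3.connect(":memory:")

# Analytical side: raw order facts.
warehouse.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 40.0), (1, 60.0), (2, 15.0)],
)

# Operational side: customer records with an empty lifetime_value column.
crm.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, lifetime_value REAL)"
)
crm.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)


def reverse_etl(source, target):
    """Compute a metric in the warehouse and write it back to the CRM."""
    totals = source.execute(
        "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
    ).fetchall()
    target.executemany(
        "UPDATE customers SET lifetime_value = ? WHERE id = ?",
        [(total, customer_id) for customer_id, total in totals],
    )
    target.commit()
    return len(totals)


synced = reverse_etl(warehouse, crm)
```

In a real deployment the write-back would go through the operational system's own API rather than a direct table update, but the direction of flow is the same: insight moves from the analytical store back to where people work.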
Chief Technology Officer at Yellowbrick Data
The convergence of data lakes and warehouses doesn't solve the orthogonal problem of how to govern, search and securely share data. The ability to describe, publish, maintain and find data products remains key, and services need to be developed on top of these technologies to make it easy for data consumers to access and trust the data they need to solve business problems.
Emerging concepts such as data mesh suggest that data should be managed by the subject matter experts that created the data, and published across the business via well-understood APIs. There need to be services in place to enable the data stewards in the lines of business to manage their own data products and self-service means for consumers in other lines of business to access the data stored in the data warehouse and lake.
Without this style of data governance, we will be guilty of repeating the mistakes of the past, where the data lake becomes a dumping ground for data that grows untrustworthy over time and useless to the business because nobody owns it. It's critically important that the teams that created the data remain responsible for the through-life maintenance of their own data products.
This is far from just a technology problem to solve. It requires cultural and process changes across the organization too. I believe that companies that embrace such ideas will be more agile and more competitive than their peers.
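The data-product idea described above (owned by the team that created the data, published and discoverable through a well-understood interface) can be sketched in a few lines. This is a toy illustration under stated assumptions, not a data mesh implementation: the `DataProduct` and `Catalog` classes, their methods and the sample product are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A published dataset with a named owner (hypothetical structure)."""
    name: str
    owner_team: str
    description: str
    rows: list = field(default_factory=list)


class Catalog:
    """A minimal self-service registry for publishing and finding products."""

    def __init__(self):
        self._products = {}

    def publish(self, product):
        # Ownership is mandatory: an unowned product is how a lake
        # turns into a dumping ground.
        if not product.owner_team:
            raise ValueError("every data product needs an owning team")
        self._products[product.name] = product

    def search(self, keyword):
        """Return names of products whose name or description match."""
        keyword = keyword.lower()
        return sorted(
            p.name
            for p in self._products.values()
            if keyword in p.name.lower() or keyword in p.description.lower()
        )

    def read(self, name):
        """Return a copy of the product's rows so consumers cannot mutate it."""
        return list(self._products[name].rows)


catalog = Catalog()
catalog.publish(
    DataProduct(
        name="orders_daily",
        owner_team="sales-ops",
        description="Daily order totals by region",
        rows=[{"region": "EMEA", "total": 120}],
    )
)
```

A consumer in another line of business could then call `catalog.search("order")` to find the product and `catalog.read("orders_daily")` to use it, while the owning team stays on record as responsible for its upkeep.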
Sr. Director of Product Management at Google Cloud
Customers are seeking to migrate, modernize and transform their data and application landscapes using cloud services. We're now seeing a convergence of traditional product categories, such as data warehouses and data lakes, that is driven by cloud capabilities. Customers need a broader integration of solutions beyond data warehouses and data lakes, one that accounts for scale and breadth: incorporating AI and ML, with an open and cost-effective strategy that works for data of any type (unstructured, structured), at any speed (batch and streaming), for any workload (analytical and transactional) and for more use cases (geospatial, for example). In addition to working with all of this, customers also ask for unified and intelligent capabilities for data governance and management.
CEO at Qumulo
Just because you can converge the two doesn't mean you should.
Could you use a screwdriver as a hammer? Sure, but it won't work as well. Instead of combining both capabilities into one tool, you should have a great screwdriver and a great hammer in your toolbox. Data lakes were invented for a reason: You need an effective tool for managing unstructured data.
Data warehousing vendors are interested in unstructured data because it's the fastest-growing data source. But data warehouses were fundamentally built for structured data, which is a very different world from unstructured data. As unstructured data grows exponentially, managing all that data will become more difficult, and a combined data warehouse and data lake runs the risk of making data infrastructure less capable — like a dull multipurpose tool.
As time passes, customers will increasingly seek ways of creating value from the entirety of their data — both unstructured and structured. Data platforms that are software-defined with open APIs and allow for easy, low-cost data movement will help these two very distinct types of data interoperate. A single tool or even a single vendor will not be the answer.
Head of Product at Boomi
Organizations need data to survive. Data lake and data warehouse models offer a strong foundation for data management, but having a high quantity of data doesn't necessarily ensure quality of data — or the ability to understand it all.
The next step is to make sure those insights are actionable, and that's where intelligent connectivity and automation services can be layered in. Data stores must first be connected together — a critical process that eliminates silos and establishes secure, seamless and up-to-date information on every piece of data. Once everything is connected, businesses can leverage intelligent automation to create a standardized process of cleansing and preparing data so teams can reap actionable value. Combined, intelligent connectivity and automation can generate insights across the entire digital ecosystem so businesses can drive innovation at scale.
Partner at Sapphire Ventures
There is incredible gravity to data lakes and data warehouses, so customers are naturally layering additional services on top. With the maturation of open data ecosystems such as data lakes, innovation will thrive and benefit from the standards that are emerging. Here are a few.
Data catalog and governance: Companies will need to continue investing in tools and frameworks that help govern and manage data and metadata, such as Atlan or Sapphire portfolio company Alation. While each vendor will have basic capabilities, we will see services spanning multicloud vendors and a diverse set of data-management tools to deliver a common semantic layer and shared usage.
The mixing of real-time and near real-time data: We will increasingly see the ability to run queries against your data lakes and on streaming data with tools like Redpanda and Kafka, now that vendors like Dremio (a Sapphire portfolio company) are closing the performance and latency gap of data lakes.
There will be continued growth of analytics and machine-learning services taking advantage of data residing in a common destination, while capitalizing on the compute performance of these vendors. Specifically, I think there will be more data engineering tools and future innovations focusing on enterprise use cases. Also, low-code/no-code tools will help additional people gather insight from their data.
It has never been more exciting to think about the possibilities for providing meritocracy to data access and understanding and we have barely started — and the convergence of this space makes this a possibility.
Kevin McAllister (@k__mcallister) is a Research Editor at Protocol, leading the development of Braintrust. Prior to joining the team, he was a rankings data reporter at The Wall Street Journal, where he oversaw structured data projects for the Journal's strategy team.