July 12, 2022
Avoiding lock-in, prioritizing tooling and treating data as a product can help organizations get ahead of the curve, members of Protocol's Braintrust say.
Good afternoon! In today's edition, we asked experts about how to set up systems to ensure that data collected doesn't live and die with just one use and what companies can do to ensure that they're set up to repurpose data effectively. Questions or comments? Send us a note at email@example.com
CTO at Pure Storage
More and more enterprises are embracing meaningful data strategies to drive innovation, with the ability to reuse and repurpose data serving as a core component of long-term success. Successful planning for reuse and flexibility is about preserving optionality and agility and avoiding lock-in: lock-in to one application platform or service, lock-in to closed data formats that are difficult to transform in and out of, and lock-in caused by data gravity, where a system or environment can no longer serve or move its data.
By planning for the flexibility to move and share data across applications and environments, organizations are better prepared to react to new, previously unplanned demands on their data. Data backup and recovery, for example, are becoming common areas where organizations find opportunities for data reuse: backup and retention copies of data, originally created to meet business continuity requirements, are now being fed into modern analytics tools to drive ongoing business outcomes. Analytics is another area driving greater data sharing and reuse, with open formats such as Parquet and Apache Iceberg allowing a single data set to be shared and analyzed by different applications.
To plan for data reusability, organizations should assess and invest in data platforms and technologies that genuinely enable flexible modes of use and that can scale capacity and processing to handle greater future demands.
Co-founder and chief product officer at Arize AI
Given data is the lifeblood of nearly every industry and the foundation of modern machine learning, an increasing number of companies are setting their sights on creating a “golden data set” — data that is perfectly reusable, clean, integrated and compliant — as a mission-critical task. To that end, much-needed investments in data governance, data lakes, clear lineage tracking and data observability tools to automatically surface issues are accelerating.
There’s just one problem: Seeking perfection, companies often end up delivering perfectly reusable data that no one actually uses. Ironically, the best way to ensure data reusability is to spend less time on planning and processes and more on flexibly arming internal customers of data with what they need to make models work and establish feedback loops with real-world systems for ongoing active learning. In that sense, investing in machine learning is a great shortcut for ensuring data reusability because it forces you to learn from past data and continuously improve by default.
Reusability aside, many organizations simply need to better understand their data. Teams deploying deep-learning models (like ones that scan images of store shelves to automate inventory orders), for example, often lack visibility into how the model performs in the real world until either a human labeler checks a small subset of individual predictions (was milk really out of stock?) or something goes wrong (customers complain). Even tech companies with sophisticated teams struggle to have AI consistently flag things like hate speech; better insights and monitoring, not just reusability, are needed.
Global lead for data science & machine learning engineering at Accenture
Our survey of 850 C-suite executives across 20 industries found 57% regard AI as a critical enabler of their strategic priorities. CEOs and CIOs are pushing for data reusability because it can accelerate speed and scale. But there’s a gap between wanting to reuse data and possessing the mechanical ability to achieve that goal. Closing that gap effectively calls for the following three steps.
- Connect data to business value: Making data reusable is costly and time-consuming. The data used must have the potential to solve the business problem for which it was created. For instance, consider pragmatically whether the data provides monetary return, improves customer experience or generates savings. Broadly speaking, if a data set can tackle many use cases, it is likely to provide a large return, and the reverse is true as well.
- Create a scalable data management platform: We’ve historically focused on collecting as much data as possible, which has led to data silos and to different references for the same data point, both of which can require significant cleanup. A scalable data management platform makes it easier to reuse data.
- Recognize the false economy of reusability: If a client needs a specific subset of data for one use case, it may be more efficient to make the raw data reusable, rather than engineering a new product from that data.
At its core, making data reusable requires prioritizing which data should be shareable, mapping it to expected returns, before engineering for scalability.
Client partner at Fractal Analytics
There is a lot of duplication in data creation, and a lot of waste from not using the right data to make decisions. Companies are often unaware that data already exists within their own systems: they don’t know where to ask for it, how to access it or whether it adds value to a decision. Institutionalizing data-driven decision-making at the organizational level is the main challenge to tackle here.
Treating data as a product, cataloguing it, and giving it a storefront-like experience and social context (i.e., data marketplaces) will be fundamental to enabling data reusability within the enterprise, between enterprises and their partners, and across enterprises to drive new possibilities.
As experimentation happens at an unprecedented pace, data will be the core pivot that lends enterprises the agility and fluidity to pursue new possibilities. Data as a product will let the knowledge gained from each innovation be leveraged across the enterprise and fuel further imagination. The social context means powerful recommendation engines that show data scientists, business analysts and CDOs how data products are being used across experiments, plus data product forums for collaborative discussion, unlocking more and more value across the enterprise community.
Chief solutions officer at Cohesity
Data reuse is a key part of any data strategy. But creating copies of data for purposes of running analytics or doing testing and development can create additional challenges for any business, particularly enterprises. How do you secure those copies? How do you ensure they’re not inadvertently leaking sensitive information? How do you ensure they are not driving your costs up exponentially?
All of these issues point to challenges associated with fragmentation. The more data that’s proliferated across the organization, the harder it is to understand exactly where copies are located, and the greater the potential surface area for cyber attacks. Each time an organization copies data for reuse, the security postures for the data need to be replicated as well. This can become costly, burdensome and harder to achieve as the data grows organically.
Properly planning for data reusability requires a modern data management strategy focused on limiting fragmentation. Restricting the proliferation of data is critical to reducing risk, so organizations should turn to next-gen data management solutions that can consolidate data and make it available for reuse without replicating it. Other factors to consider include the ability to create zero-cost copies, to classify data so as to prevent leakage and to move data where it needs to be, all of which help balance data reuse with data security.
General manager and director of engineering of Document AI at Scale AI
Many organizations lack a single source of truth for data, with each business unit developing its own methods to extract the information it needs. This causes major issues: systems that pursue the same goal separately for different business units, differing results from the same data, duplicated costs and employees building the same systems multiple times, all of which reduce productivity. That’s why, when developing a data reusability plan, a clear path to the source is key.
First, organizations must identify which business units need to be isolated for security and which can share data, minimizing isolation as much as possible. From there, organizations should extract the needed information from high-quality data, leveraging machine learning, and ensure the systems used can demonstrate where the information comes from: the source. Each unit can then draw on this source of truth, eliminating redundancies and inefficiencies, while business units that can’t share data for security reasons keep their data and sources isolated.
Head of engineering research at DataStax
Data is relevant in a moment: real time (the fast lane) or historic (the deep lane). Data teams that prioritize data access by time are able to answer questions faster and better.
The hardest questions to answer are the ones that require data about right now.
The teams that prepare their data with time in mind are the ones capable of providing real-time insights. In other words, if you can answer questions like "What is the most popular product now?" you will also be able to answer questions like "What was the most popular product yesterday, last week or last quarter?"
Orienting your data by time makes sure it is actionable now, and reusable later.
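A minimal standard-library sketch of this fast-lane/deep-lane idea (the class, product names and timestamps are hypothetical, not from DataStax): if events are bucketed by time as they arrive, the same structure answers both "now" and historic questions.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

class TimeBucketedCounts:
    """Bucket purchase events by hour so real-time ('fast lane')
    and historic ('deep lane') questions use one structure."""

    def __init__(self):
        # hour bucket -> product -> count
        self.buckets = defaultdict(lambda: defaultdict(int))

    def record(self, product, when):
        hour = when.replace(minute=0, second=0, microsecond=0)
        self.buckets[hour][product] += 1

    def most_popular(self, start, end):
        totals = defaultdict(int)
        for hour, counts in self.buckets.items():
            if start <= hour < end:
                for product, n in counts.items():
                    totals[product] += n
        return max(totals, key=totals.get) if totals else None

now = datetime(2022, 7, 12, 15, 30, tzinfo=timezone.utc)
store = TimeBucketedCounts()
store.record("milk", now)
store.record("milk", now)
store.record("eggs", now - timedelta(days=1))

# "What is the most popular product now?" (roughly the last hour)
top_now = store.most_popular(now - timedelta(hours=1), now + timedelta(hours=1))
# "What was it yesterday?" is answered from the exact same data.
top_yesterday = store.most_popular(now - timedelta(days=2), now - timedelta(hours=12))
```

The design choice is simply that the time window is a query parameter rather than something baked into the storage, so today's "now" query becomes next week's historic one for free.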
Kevin McAllister (@k__mcallister) is a Research Editor at Protocol, leading the development of Braintrust. Prior to joining the team, he was a rankings data reporter at The Wall Street Journal, where he oversaw structured data projects for the Journal's strategy team.