November 10, 2022
Data profiling, dependency tracking, and third-party auditing can make a huge difference, members of Protocol's Braintrust say.
Good afternoon! In today's edition, we asked a group of experts to tell us about the actions that executives can take to improve data quality in short order. Questions or comments? Send us a note at email@example.com
CTO at Cloudera
The key to improving data quality is implementing robust dependency tracking for data sets with quality as a first-class metadata annotation. The lack of quality in a data set can have severe and extensive consequences across any enterprise, from broken reports to incorrect predictions and everything in between. It can also be a significant source of waste and even missed service-level agreements. There are two essential pieces that together enable a quick and efficient resolution of any data quality issues: lineage and annotations.
Increasingly, data sets are no longer hidden behind poorly designed or unnecessarily complicated reports but are a part of a “network” of use cases within companies. For example, a pricing model might depend on a customer data set, an order history data set, and a catalog data set. The catalog data set might depend on a vendor feed underneath. If a data quality problem is identified in the catalog data set, users need the ability to trace both provenance — the origin of the data — and the impact on the downstream consumers who are affected. Additionally, by annotating data sets with quality measures using metadata — the data about the data — quality itself can be conveyed along a data dependency graph as a metric. Jobs, models, and reports can do pre-checks for data quality, triggering automated actions when issues are detected.
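The idea of carrying quality along a dependency graph can be sketched in a few lines. This is an illustrative toy, not any product's API: the data set names follow the pricing-model example above, and the rule that a data set's effective quality is the minimum of its own score and its upstreams' is an assumption chosen for simplicity.

```python
# Toy sketch of quality propagation along a data dependency graph.
# Dataset names and the min() aggregation rule are illustrative assumptions.

DEPENDENCIES = {
    "pricing_model": ["customers", "order_history", "catalog"],
    "catalog": ["vendor_feed"],
}

# Quality scores (0.0-1.0) annotated as metadata on source data sets.
QUALITY = {
    "customers": 0.98,
    "order_history": 0.95,
    "vendor_feed": 0.60,  # a known problem upstream
}

def effective_quality(dataset: str) -> float:
    """A data set is only as good as its worst upstream dependency."""
    upstream = DEPENDENCIES.get(dataset, [])
    own = QUALITY.get(dataset, 1.0)
    if not upstream:
        return own
    return min(own, min(effective_quality(d) for d in upstream))

def precheck(dataset: str, threshold: float = 0.9) -> bool:
    """Pre-check a job's input and trigger an automated action on failure."""
    score = effective_quality(dataset)
    if score < threshold:
        print(f"ALERT: {dataset} quality {score:.2f} below {threshold}")
        return False
    return True

precheck("pricing_model")  # fails: vendor_feed drags catalog down to 0.60
```

A job or report would call such a pre-check before running, so a bad vendor feed blocks the pricing model automatically instead of silently corrupting its output.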
Senior vice president of engineering at Confluent
Focus on usage. In our work and personal lives, we amass data at a phenomenal rate. More often than we would like, this data has gaps where related parts are missing, is subtly incorrect because of invalid or unexpected values, or is corrupted by bugs in software or failing hardware.

The best way to find issues and thus improve quality is to use your data. Why pay for all the validation without reaping the benefits of great analytics results to improve your business?
Start with simple queries to understand whether the overall shape of your data matches your expectations. This most basic analysis will both surface insights you may be unaware of and expose basic quality issues: gaps, duplicates, and incorrect values.
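Those shape checks need not be elaborate. Here is a minimal sketch using plain Python; the sample records and the specific checks (null fields, duplicate keys, negative amounts) are illustrative assumptions, not a prescribed profiling suite:

```python
# Minimal "shape" checks on a data set. Sample records are illustrative.
from collections import Counter

records = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "acme", "amount": -5.0},    # suspect value
    {"order_id": 2, "customer": "globex", "amount": 40.0},  # duplicate id
    {"order_id": 3, "customer": None, "amount": 75.0},      # gap
]

def profile(rows):
    """Count rows, null fields, duplicate keys, and out-of-range values."""
    ids = Counter(r["order_id"] for r in rows)
    return {
        "row_count": len(rows),
        "null_customers": sum(1 for r in rows if r["customer"] is None),
        "duplicate_ids": [k for k, n in ids.items() if n > 1],
        "negative_amounts": sum(1 for r in rows if r["amount"] < 0),
    }

print(profile(records))
# {'row_count': 4, 'null_customers': 1, 'duplicate_ids': [2], 'negative_amounts': 1}
```

The same counts map directly to SQL aggregates (`COUNT(*)`, `GROUP BY ... HAVING COUNT(*) > 1`) when the data lives in a warehouse.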
Next, I would encourage you to look for simple machine-learning questions you can ask which may provide valuable insights from your data. Getting good ML results will require clean data and thus is a great forcing function for identifying issues and getting them fixed.
Continually updating data validation tests, making copies, and regularly verifying your backups is an arduous and ever-growing expense. But the more you work with data, the better your chance of confirming that it is valid and in a useful form. Working with data also highlights which of it goes unused, which creates money-saving opportunities: stop collecting it, or reduce its retention period.
Co-founder and chief business officer at Vendia
Most companies today are understandably focused on what happens within their four walls, but if you depend on partners in order to run your business (think shipping, manufacturing, financials, etc.), then it’s also vital to consider the critical business data that’s created outside of your company. The biggest challenge that businesses have when it comes to data quality between multiple parties is that each partner has their own truth. By using an accurate, trusted, and auditable single version of truth like next-gen blockchains, everyone has access to the same real-time data regardless of their individual tech stacks. This way companies can drastically improve data quality and avoid costly reconciliation by simply agreeing on a single version of truth with their partners.
Chief product officer at Informatica
In manufacturing, understanding the quality of raw materials improves the efficiency of creating the finished product. In data management projects, data profiling serves the same purpose. Data profiling is an essential initial step that can dramatically reduce the time and cost it takes to plan and execute data management projects.
Before you can integrate data or use it in a data warehouse, CRM, ERP, or analytics applications, you need a full understanding of its content, quality, and structure — not only as it relates to its original source, but also in the context of what your integration or migration effort is hoping to achieve. We have seen many organizations make assumptions about the data that turned out to be wrong!
Data profiling enables you to discover and analyze data anomalies across all data sources and test assumptions about the data. It finds hidden data problems that put projects at risk. It handles data quality discrepancies before they become a problem. And by leveraging AI/ML, data quality rules can be automatically created and applied to relevant data sets based on the results of the data profiling.
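The idea of deriving validation rules from profiling results can be sketched simply. The "learned" rule below is just an observed numeric range from a trusted sample, an illustrative stand-in for what a real tool might infer with AI/ML:

```python
# Hedged sketch: deriving a simple validation rule from profiled data.
# A real profiling tool would infer richer rules; this is a toy example.

def learn_rules(values):
    """Infer a numeric range rule from a trusted historical sample."""
    clean = [v for v in values if v is not None]
    return {"min": min(clean), "max": max(clean)}

def apply_rule(rule, value):
    """Flag values that fall outside the learned range."""
    return value is not None and rule["min"] <= value <= rule["max"]

history = [12.0, 15.5, 11.2, 14.8, 13.1]
rule = learn_rules(history)    # {'min': 11.2, 'max': 15.5}
print(apply_rule(rule, 14.0))  # True: within the observed range
print(apply_rule(rule, 99.0))  # False: an anomaly worth reviewing
```

Once learned, such rules can run automatically against every new batch of the relevant data set, turning a one-time profiling exercise into continuous monitoring.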
Companies that use data profiling in the early stages of their data management initiatives typically achieve significant ROI by reducing the effort required to complete the project and, more importantly, by increasing the quality of project results.
Co-founder & chief technology officer at ThoughtSpot
The biggest mistake people often make is thinking of data quality as a threshold to be met at a specific point in time and waiting for that perfect moment to start using the data. Data quality is a journey, and that journey is greatly accelerated by exposing the data to real users even when you think the quality of the data is not up to par. Of course, you cannot break the trust of the end users, so you have to clearly label poor quality data before you expose it. But when you do that properly, nothing cleans data better than shining a 10,000-watt spotlight on it and exposing it to a group of end users who have a real strategic need for that data.
CTO at Alation
Require data to have a description of use, context, and meaning. That may seem very counterintuitive. Most people relate data quality to the data being physically flawed, such as a Social Security number that includes alphabetic characters or a phone number that is all nines. In reality, modern applications, databases, and data manipulation tools do a pretty good job of resolving these obvious issues. They still exist, but the much more significant data quality issue is the slippery one of data being fit for purpose. Fit for purpose means knowing whether data is the right and correct data for the question being asked or the new analysis being run.
Consider, for instance, using a data set to analyze election results without first understanding that it was originally created to answer a different question, which caused it to be filtered by specific demographics or augmented with third-party synthetic data. The data is not wrong; it's just not fit for purpose, and it will cause a cascading effect of poor decision-making. A common workaround to not knowing if data is fit for purpose is to simply create new data starting with the originating source. This has its own negative effects and enormous costs. The best answer: require descriptive metadata be provided and maintained with the data, so future consumers can easily understand what it is, why it was created, and how it can be used. That is exactly the role of an enterprise data catalog with a well-run data governance process.
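A sketch of the descriptive metadata that should travel with a data set might look like the following. The field names and the election-data example are illustrative assumptions, not any catalog product's schema:

```python
# Illustrative sketch of descriptive metadata attached to a data set.
from dataclasses import dataclass, field

@dataclass
class DatasetDescriptor:
    name: str
    purpose: str                # why the data was created
    provenance: str             # where it came from
    transformations: list = field(default_factory=list)  # filters, augmentations
    fit_for: list = field(default_factory=list)          # sanctioned uses

votes = DatasetDescriptor(
    name="precinct_results",
    purpose="Turnout study for one demographic segment",
    provenance="State election board extract, 2020",
    transformations=["filtered to ages 18-29", "augmented with synthetic rows"],
    fit_for=["youth turnout analysis"],
)

def check_fit(ds: DatasetDescriptor, intended_use: str) -> bool:
    """Warn before reusing data outside its documented purpose."""
    ok = intended_use in ds.fit_for
    if not ok:
        print(f"WARNING: {ds.name} not documented as fit for {intended_use!r}")
    return ok

check_fit(votes, "statewide election results")  # prints a warning
```

With this record in a catalog, a future analyst sees at a glance that the data was filtered and augmented, and can decide whether it fits a new question before building on it.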
Chief data and analytics officer at Alteryx
It’s hard to single out one step as the most important for improving data quality, since a lot depends on where your organization is on the journey. The first, and likely most important, step may sound counterintuitive: get as many people as you can to start using the data you have. That’s right, getting people to exercise your incomplete, messy, imperfect data is the first step to fixing it. If people don’t use the data, there is no incentive to improve it.
As people work on projects and start to use data, they will quickly see the challenges, which leads to the next step: an ecosystem that gives these practitioners a way to wrangle and cleanse data for their purposes. There are great low-code/no-code tools that make this incredibly fast and easy. The final piece of the puzzle is to set up a process to master these fixes for the enterprise, so that as improvements are made, they can be leveraged by the broader organization.
If you take all these steps together, you will be democratizing analytics within your organization. We work with almost half of the largest 2,000 companies in the world, implementing this approach in their organizations and watching them rapidly deliver ROI. While no enterprise has perfect data, most data is good enough to drive powerful results.
Senior managing director and global lead, Accenture Applied Intelligence at Accenture
Data management has long been siloed, which led to data professionals and data scientists inefficiently engineering the type of information they had for each use case. The future of data management requires integrating data across the enterprise, and in turn appointing a C-suite leader that is accountable for ensuring the organization’s data quality with an eye toward the business. In some companies, this would be an additional responsibility for an existing C-suite leader, and in others, it would mean having a defined role, like a Chief Data Officer.
This executive must be empowered to align not just data management, but also AI strategy, with measurable business goals. This includes having strategies in place to capture, store and process data that fuels AI. For example, rather than measuring data quality across the thousands of data points that a company may house, it is critical that relevant data is aligned to use cases across lines of business needs for better consumption. This leader can also establish key governance models and practices that enable curation and management of domain-centric data for ease of access and consumption with trusted data quality.
From my conversations with CEOs in our latest Accenture research, there is an opportunity – and desire – for companies’ most senior leaders to increase their AI expertise and adopt a clear vision for data and AI’s value. The payoff? Our latest analysis of AI among 1,200 global companies found those that are the most AI-mature enjoy 50% greater revenue growth compared to their peers.
Kevin McAllister ( @k__mcallister) is a Research Editor at Protocol, leading the development of Braintrust. Prior to joining the team, he was a rankings data reporter at The Wall Street Journal, where he oversaw structured data projects for the Journal's strategy team.