Protocol's experts on the biggest questions in tech.

What’s the hardest part about using synthetic data correctly?

Making the data realistic, achieving workable distributions and studying potential biases are all hurdles that companies need to clear, members of Protocol's Braintrust say.

Good afternoon! Today we're talking synthetic data with the experts, and we asked them to let us in on the hardest parts of using it correctly. Questions or comments? Send us a note at

Vivek Raju Muppalla

Director of Synthetic Services at Scale AI

Generating synthetic data is an involved process, and providers must address three major challenges for successful adoption.

First is data generation: Synthetic data is often not realistic enough. This leads models to learn details that appear only in synthetic data and then fail to perform well on real images. Over the past few years, the industry has made a lot of progress in improving realism and bridging the gap via refinement; however, more is needed to generalize across domains at low cost.

The second is explainability and reproducibility. Proving the correctness of synthetic data compared to real data is a critical step in the process. For example, consider the task of detecting a scratch on a car: The scratch size, its location, lighting conditions, etc. are a few of the many variables that impact the data quality and thus model performance. Tracking the impact of each of these variables and generating data in a deterministic way is key to a reliable system. Incorrectly created data can also lead to compliance and legal issues.
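One way to picture the deterministic generation described above is a seeded parameter sampler whose output doubles as a manifest. This is a minimal sketch, not Scale AI's method: the `ScratchParams` fields, their value ranges and the `sample_params` helper are all hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ScratchParams:
    """Hypothetical variables for one synthetic scratch image."""
    length_mm: float
    x: float          # normalized horizontal position, 0..1
    y: float          # normalized vertical position, 0..1
    light_lux: float


def sample_params(seed: int, n: int) -> list[ScratchParams]:
    """Draw n parameter sets from a local RNG so the same seed gives the same batch."""
    rng = random.Random(seed)  # local generator, no shared global state
    return [
        ScratchParams(
            length_mm=round(rng.uniform(5, 200), 1),   # assumed range
            x=round(rng.random(), 3),
            y=round(rng.random(), 3),
            light_lux=round(rng.uniform(100, 10_000)),  # assumed range
        )
        for _ in range(n)
    ]


batch_a = sample_params(seed=42, n=3)
batch_b = sample_params(seed=42, n=3)

# A manifest of plain dicts can be stored next to the rendered images,
# so each image can be traced back to the variables that produced it.
manifest = [asdict(p) for p in batch_a]
```

Storing the seed and the manifest with each batch is what makes it possible to regenerate the data later and to study the effect of one variable at a time.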

Last but not least are challenges around bias and privacy. Even if the data is correct, it can still be biased. If developers are not attentive and do not iterate on the data throughout the process, biases can creep in and be amplified. While we need the data to match real-world distributions, it cannot tie back to or expose the original data, especially in a privacy-focused use case.

Kimberly Powell

Vice president of Healthcare at Nvidia

Evaluating if your team is correctly generating and using synthetic data is a multistep process. Data set bias is a significant challenge, and the use of synthetic data can potentially amplify that if not done correctly.

Generating synthetic data with high validity requires robust, generalizable models built over carefully engineered data sets. The opportunity to bring synthetic data exists across industries, but one particularly important application is in health care. Developers need to ensure generative models aren’t simply memorizing the data they were trained on, which is especially crucial for medical records to ensure patient privacy.
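A simple memorization check, sketched here under strong assumptions, is to measure how close each synthetic record sits to its nearest training record and flag the ones that are nearly identical. The two-column records, the distance threshold and the function names below are invented for illustration; a real audit would need properly scaled features and a threshold justified for the data at hand.

```python
import math


def nearest_distance(record, training):
    """Euclidean distance from one synthetic record to its closest training record."""
    return min(math.dist(record, t) for t in training)


def memorized_fraction(synthetic, training, threshold):
    """Share of synthetic records lying within `threshold` of some training record."""
    flagged = [s for s in synthetic if nearest_distance(s, training) <= threshold]
    return len(flagged) / len(synthetic)


# Made-up (age, systolic blood pressure) pairs, not real patient data.
training = [(63.0, 120.0), (45.0, 135.0), (71.0, 110.0)]
synthetic = [(63.0, 120.0), (50.0, 128.0), (70.0, 140.0)]

fraction = memorized_fraction(synthetic, training, threshold=1.0)
```

In this toy example only the first synthetic record is an exact copy of a training record, so one in three records is flagged.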

Researchers need to take the time to ensure that generated data sets have the desired distribution, either reflective of the real world or the intended data science challenge. For example, you’d want to build a synthetic data set of a variety of patients to study: not a homogeneous group, but a diverse cohort that reflects the real world. This is key to combatting bias and creating solutions that work for everyone.
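Checking a generated cohort against a desired distribution can be as simple as comparing group proportions. The sketch below uses total variation distance as one possible measure; the age bands and target shares are made-up numbers, not figures from the article.

```python
from collections import Counter


def proportions(labels):
    """Share of each group label in a list of records."""
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def total_variation(p, q):
    """Half the sum of absolute differences between two discrete distributions."""
    groups = set(p) | set(q)
    return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in groups)


# Hypothetical target mix for the cohort.
target = {"18-39": 0.4, "40-64": 0.4, "65+": 0.2}

# Hypothetical synthetic cohort that over-represents younger patients.
synthetic_ages = ["18-39"] * 7 + ["40-64"] * 2 + ["65+"] * 1

gap = total_variation(proportions(synthetic_ages), target)
```

A gap of 0 means the cohort matches the target exactly and a gap of 1 means no overlap at all; here the gap is 0.3, which would suggest rebalancing before training.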

Finally, the clearest indicator of success is how models built on synthetic data sets perform when applied to real-world challenges, so that performance needs to be monitored.

François Candelon

Managing director and senior partner at BCG; global director at the BCG Henderson Institute

The use of synthetic data is on the rise, with more and more companies deciding to rely on its benefits: cheaper access, privacy protection and simulation of rare real events. However, many are learning that synthetic data use comes at a cost, both at operational and societal levels.

Operationally, companies must keep in mind that synthetic data doesn’t match direct real-world measurements, creating a risk that AI systems miss the impact of secondary data. Similarly, it can, at times, even amplify biases in trained algorithms.

Additionally, ensuring trust in algorithms is already difficult, and synthetic data adds a layer of uncertainty for society. Feeding AI systems with "non-real" data may create challenges for companies struggling to gain societal approval in using AI at scale; this is what I call the AI social license.

For example, we can draw a parallel with the use of animals over humans in drug development. At times, animal testing is seen as easier and, like synthetic data, carries few privacy concerns. Nevertheless, pharmaceutical companies still test on real humans down the road, ensuring side effects are fully assessed and reassuring patients and regulators of effectiveness. The same can be said for using real-world data over synthetic data.

Overall, to tackle these challenges, companies should blend real-world data with synthetic data when preparing and training data, use transfer learning techniques that help algorithms maintain high efficiency when applied to new forms of data sets and ensure quality testing so that an AI system’s outcomes are compared against real environments.

Eric Haller

Executive vice president and general manager of Identity, Fraud and DataLabs at Experian

Generally, as it relates to fraud prevention, synthetic data can be very useful in detecting patterns and trends, against which you can build models, but it doesn't easily replace the use of raw data for tracking known offending entities or attributes. For example, capturing specific addresses, email addresses, phone numbers, etc. that have been associated with fraud and placing those into watchlists or negative lists is still a very powerful and common practice, even in consortia environments. In these cases it is possible to use hashed or encrypted data, but not necessarily synthetic data.
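To illustrate why hashed data still works for watchlists while synthetic data does not, here is a minimal sketch of a negative list keyed by HMAC-SHA256 tokens. It is not a description of Experian's systems: the shared key, the sample values and the helper names are assumptions, and the light trimming and lowercasing is itself a design choice that discards some detail from the raw value.

```python
import hashlib
import hmac

# Hypothetical secret shared by consortium members; a real key would be managed securely.
SHARED_KEY = b"example-consortium-key"


def token(value: str) -> str:
    """Keyed hash of a lightly cleaned identifier (trimmed and lowercased)."""
    cleaned = value.strip().lower()
    return hmac.new(SHARED_KEY, cleaned.encode("utf-8"), hashlib.sha256).hexdigest()


# The negative list stores only tokens, never the raw identifiers.
watchlist = {token("fraudster@example.com"), token("+1-555-0100")}


def is_listed(value: str) -> bool:
    """True if the identifier's token appears on the negative list."""
    return token(value) in watchlist


hit = is_listed(" Fraudster@Example.com ")
miss = is_listed("someone@example.com")
```

Because the same identifier always produces the same token, exact matches against known offenders still work. A synthetic record has no such link to a specific real identifier, which is why it cannot serve the same purpose.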

The other interesting part of this is that fraudsters will often, intentionally or unintentionally, misspell addresses, names, company names, employer names or other free-form attributes in order to bypass fraud controls. Any process that converts that data to a synthetic form for downstream analysis can have the unfortunate consequence of losing the intelligence inherent in the raw data. The same issue exists even with data normalization technologies (address normalization, etc.). It's like taking Clorox to the crime scene.

Wei Wang

Chair at the Association for Computing Machinery Special Interest Group on Knowledge Discovery and Data Mining (ACM SIGKDD) and professor in Computer Science at University of California, Los Angeles

There are many challenges in using synthetic data to train AI/ML systems, one of which is how to make synthetic data truly resemble the scenario we want to model. This requires the synthetic data generator to fully capture the complex relationships between different entities, and between these entities and an environment that may be only partially observed. There are several questions we need to answer:

1) How might we identify these complex relationships from the usually limited observations collected from a real scenario? There could be spatial, temporal, causal and correlation relationships between two or more entities, and they may evolve over time.

2) How might we identify unknown confounding factors? We might know only very little about these factors and the roles they may play, and may not be able to directly observe them.

3) How might we accurately model these complex relationships and confounding factors for data generation? This remains an active research area with limited success in some special cases. Most data generators tend to oversimplify or sometimes even overlook these complexities.

4) How might we validate the synthetic data? For example, how do we prove that the data generator does not introduce bias?
