Can Synthetic Data Save AI From Bias and Model Drift? Techopedia

Businesses and organizations around the world are rapidly settling into the harsh reality of safely deploying artificial intelligence.

After bursting onto the global scene and fueling the biggest tech hype of the past decade, AI presented itself as a magical black box that could do almost anything, although “black box” carries negative connotations of its own, chief among them not knowing what is going on inside.

Still, progress marches forward, and the revolution to integrate AI into operations everywhere is irresistible.

But companies are learning, often the hard way, that AI risks are abundant, and — when mismanaged — the risks far outweigh the benefits.

The biggest problems with AI? Compliance, data consent, copyright issues, training data, and bias. Synthetic data — created artificially — can mimic real-world data and could be the key to unlocking AI’s full potential. But can it save AI?

Key Takeaways

Synthetic data offers several benefits over traditional data. It is cheaper and customizable, can reduce bias and privacy concerns, and allows for diverse scenario testing.
However, it requires careful validation and human oversight to ensure that outputs are realistic and that ethical considerations are met.
Synthetic data is valuable in various fields like healthcare (faster clinical trials), autonomous vehicles (simulating rare events), and finance (preserving data privacy).
It can also help mitigate model drift by incorporating a wider range of scenarios into training.

Another Letter of Warning, How Synthetic Data Responds

On June 4, 2024, former and current employees from OpenAI and Google DeepMind released a public letter urging leading AI companies to allow their workers to speak their minds freely about the risks of AI.

When Data Anonymization Holds Back AI Innovation

Studies warn that AI in healthcare is a double-edged sword: it enables breakthrough advances but also raises the possibility that a patient’s personal medical records can be accessed by many individuals.

Healthcare is not the only sector facing this problem. From government to finance and research, numerous industries struggle to deploy AI because of the high data anonymization and data accuracy standards they must meet to operate.

Torsten Staab, principal engineering fellow and head of Innovation and AI at Nightwing, an intelligence services company working to advance national security interests, spoke to Techopedia about the issue.

“Synthetic data can also be algorithmically designed to exclude personally identifiable information, which might be irrelevant for certain model training tasks anyway, thus eliminating potential privacy concerns.”

By avoiding the cloning of potentially copyrighted materials, the risk of copyright infringement can also be lowered significantly, Staab explained.

“Synthetic data can also be used to help train models in a more ethical, controlled manner, preventing models from unfairly targeting or favoring a specific set of outputs.”

Staab warned that, despite this potential, synthetic data is not a silver bullet.

“Checks and balances, in the form of human oversight, must be put in place to ensure that the algorithms used to generate synthetic data are unbiased and produce realistic outputs.”

Feeding non-representative, unrealistic synthetic data into a machine-learning model could potentially create even more harm. “To reduce bias, consent, copyright, and privacy conflicts, there must be a balance between the use of synthetic and real-world data,” Staab said.
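Staab’s point about designing synthetic data to exclude personally identifiable information can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the patient records, the column names, and the choice to model each feature as an independent bell curve. It is not a method described by Staab or Nightwing.

```python
import random
import statistics

# Hypothetical patient records; "name" and "patient_id" are PII,
# the remaining columns are the features a model would train on.
real_records = [
    {"name": "Patient A", "patient_id": "ID-001", "age": 54, "systolic_bp": 131},
    {"name": "Patient B", "patient_id": "ID-002", "age": 61, "systolic_bp": 142},
    {"name": "Patient C", "patient_id": "ID-003", "age": 47, "systolic_bp": 125},
    {"name": "Patient D", "patient_id": "ID-004", "age": 68, "systolic_bp": 150},
]
FEATURES = ["age", "systolic_bp"]  # the generator only ever reads these columns


def fit(records, features):
    """Summarize each feature by its mean and standard deviation."""
    params = {}
    for feature in features:
        values = [record[feature] for record in records]
        params[feature] = (statistics.mean(values), statistics.stdev(values))
    return params


def sample(params, n, seed=0):
    """Draw n synthetic records, one independent Gaussian per feature."""
    rng = random.Random(seed)
    return [
        {feature: round(rng.gauss(mu, sigma)) for feature, (mu, sigma) in params.items()}
        for _ in range(n)
    ]


synthetic_records = sample(fit(real_records, FEATURES), n=1000)
```

Because the generator never reads the name or ID columns, they cannot appear in the output. The sketch also shows why Staab calls for checks and balances: sampling each feature independently discards the relationship between age and blood pressure, and nothing here offers a formal privacy guarantee, so the result would still need validation against real-world data.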

Synthetic Data in the Pharma Industry: Better, Faster, Cheaper

Amber Gosney, a managing director within FTI Consulting’s Information Governance, Privacy & Security practice, spoke to Techopedia about synthetic data in the pharma industry.

Gosney referred to studies that show that in the clinical trials space a synthetic data set can be more useful or valuable than anonymized data. The Accenture report “Faster and cheaper clinical trials” says that an operating model that effectively integrates synthetic data into clinical trial design is essential for pharma companies to stay ahead of the game.

“Synthetic data can remain in the same (i.e. structured) format as the original data set and is often faster to produce than using regular anonymizing techniques,” Gosney said.

“It can also help with scaling problems, such as with rare diseases where the number of participants for a clinical trial might be very low.”

Gosney explained that a clinical trial data set could also be made “fairer” for under-represented groups in the trial that might otherwise experience disproportionate outcomes of a drug or product.

Model Drift: Real-World Data Vs Synthetic Data

‘Model drift’ is a machine learning (ML) term that refers to the degradation in performance and accuracy of an ML or AI system usually caused by widening gaps between the training data, knowledge base data, and models’ output data.

For example, when the global pandemic hit, organizations around the world soon discovered that their AI models were drifting, providing inaccurate or misleading outputs.

The reason was the sudden data shift and the changes in behavior that COVID-19 triggered worldwide. This wave of unfamiliar data rendered existing models ineffective. All AI systems, if not managed, updated, and monitored, tend to drift as the world keeps producing new information.
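One common way to monitor for the kind of data shift described above is to compare the distribution of a feature at training time with its distribution in production. The Python sketch below uses the Population Stability Index (PSI) for that comparison. The feature values are simulated, and the function name `psi`, the bin count, and the floor applied to empty bins are illustrative assumptions rather than anything taken from the sources quoted in this article.

```python
import math
import random


def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a newer sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are identical

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            # Values outside the reference range go into the first or last bin.
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor keeps empty bins from causing log(0) or division by zero.
        return [max(count / len(values), 1e-6) for count in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


rng = random.Random(0)
training = [rng.gauss(100, 15) for _ in range(5000)]     # simulated pre-shift feature
same_regime = [rng.gauss(100, 15) for _ in range(5000)]  # new data, same behavior
shifted = [rng.gauss(130, 25) for _ in range(5000)]      # simulated post-shift behavior

psi_stable = psi(training, same_regime)
psi_shifted = psi(training, shifted)
```

A widely used rule of thumb treats a PSI below 0.1 as stable and a PSI above 0.25 as a significant shift that warrants investigation or retraining. Those cutoffs are conventions, not guarantees. In this simulation, `psi_stable` falls well under the lower threshold and `psi_shifted` lands far above the upper one.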

Badeev from Trevolution recognized that synthetic data may lack the complexity and richness that real-world data offers.

“However, synthetic data can be generated to include rare events, ensuring models are exposed to scenarios that might be underrepresented or non-existent in real-world data.”

Badeev said that, for example, in autonomous driving, synthetic data can simulate severe weather conditions and unusual or extreme driving scenarios, which might be lacking in real-world data but critical for safer operation.
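The idea of filling gaps in real-world data with simulated rare events can be sketched in a few lines of Python. The driving log, the weather labels, the visibility ranges, and the `augment` helper below are all hypothetical. A production simulator would model far more than a single visibility figure.

```python
import random
from collections import Counter

rng = random.Random(42)

# Hypothetical "real" driving log: severe weather is almost absent.
real = [
    {"weather": "clear", "visibility_m": rng.uniform(800, 2000), "synthetic": False}
    for _ in range(980)
]
real += [
    {"weather": "heavy_fog", "visibility_m": rng.uniform(20, 150), "synthetic": False}
    for _ in range(20)
]

# Assumed visibility ranges for each rare condition (illustrative values only).
RARE_SCENARIOS = {"heavy_fog": (20, 150), "blizzard": (5, 100)}


def augment(records, scenarios, min_count=150):
    """Add simulated records until each rare condition has at least min_count examples."""
    counts = Counter(record["weather"] for record in records)
    augmented = list(records)
    for weather, (lo, hi) in scenarios.items():
        for _ in range(max(0, min_count - counts[weather])):
            augmented.append(
                {"weather": weather, "visibility_m": rng.uniform(lo, hi), "synthetic": True}
            )
    return augmented


training_set = augment(real, RARE_SCENARIOS)
share = Counter(record["weather"] for record in training_set)
```

Here the 20 real fog records are topped up to 150, and 150 blizzard records are created from nothing, which is the “non-existent in real-world data” case. Keeping a `synthetic` flag on every record is a deliberate choice: it lets a team evaluate the model on real data only and check that the simulated scenarios have not pulled it away from reality.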

Staab from Nightwing added that synthetic data can augment limited real-world data sets with a broader range of scenarios, improving a model’s accuracy and robustness and significantly lowering training costs.

He added:

“A car manufacturer’s ability to train its self-driving car algorithms on billions of miles of synthetically generated roads and complex traffic scenarios provides a significant competitive advantage.”

However, training a model with synthetic training data that is biased or unrepresentative of real-world conditions could reduce the model’s output accuracy to the point where the model becomes useless or even a liability. Staab warned that a model could drift in the wrong direction, effectively decoupling it from reality.

The Bottom Line

Synthetic data shines in crafting controlled experiments for AI models. It allows researchers to probe a model’s response to specific inputs, offering a window into its decision-making process.

Even more valuable, synthetic data enables testing models across diverse scenarios, ensuring consistent and predictable behavior. This is critical in safety-sensitive fields like healthcare, finance, and autonomous vehicles, where model reliability is paramount.

However, while synthetic data is a powerful tool, it’s not a magic solution. A balanced approach that incorporates real-world data and human oversight remains essential.

Real-world data grounds the model in the complexities of the actual environment in which it will operate. Human expertise serves as a crucial check, ensuring the model’s goals are aligned with ethical considerations and real-world applications.