The Pitfalls of Training AI With Made-Up Data


Artificial intelligence (AI) development is often hampered by a lack of access to real-world data, so models are frequently trained on AI-generated data. AI's impressive capabilities rely on deep learning from data, and much of that data is synthetic. It isn't a perfect match for the real thing, but it remains one of the most practical ways to train AI models for various tasks.

AI is growing up, entering our lives and the workplace as the possibility of an Einstein in your pocket catches on.


Whether it is writing an essay, creating complex artwork, reviewing policies, creating custom code, or writing an after-dinner speech for you, it’s already beginning to transform how we work and live.

However, artificial intelligence (AI) depends solely on data to do what it does.


Take the prompt “Create me a picture of a rose” as an example. Before it can get to work, the AI first needs to learn from the data available to it.

It needs to learn about the typical rose shape, colors, design, petal arrangement — all the characteristics that make a rose a rose.

What is the source of the data from which it learns? Often, it is AI-generated data, also known as synthetic data.


Training an Artificial Intelligence

While our focus today is training an AI system with AI-generated data, generally, an AI system is trained with a mix of AI-generated and real-world data.

The process is designed around the constraints of legal, ethical, and secrecy considerations in acquiring real-world data.

But data is critical if you are to generate realistic AI systems — synthetic news readers, for example — and given the lack of real-world data, generating synthetic data, which imitates real-world data, becomes vital.

For example, an AI system might be able to generate a detailed image of a cockpit in an airplane, but it will not match exactly the image of a real-world cockpit.

Step 1: Generating Synthetic Data

The source AI system generates synthetic data that is used to train the target AI model, which could be a neural network or another machine learning algorithm.

The synthetic data should be as close as possible to real-world data so that the target AI system can learn about the object the data describes, including details such as shapes, colors, and configuration.
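As a rough illustration of this step, the sketch below stands in for the source AI system with something far simpler: a Gaussian fitted to a handful of real measurements. The petal-length values and the function names are hypothetical, and a real generator would be a trained model rather than a single distribution.

```python
import random
import statistics

def fit_gaussian(real_values):
    """Estimate mean and standard deviation from a small real-world sample."""
    return statistics.mean(real_values), statistics.stdev(real_values)

def generate_synthetic(real_values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real sample."""
    mu, sigma = fit_gaussian(real_values)
    rng = random.Random(seed)  # seeded so the output is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical real measurements, e.g. rose petal lengths in centimetres.
real_petal_lengths = [4.9, 5.1, 5.0, 5.3, 4.8]

# Five real values become a thousand synthetic ones with similar statistics.
synthetic = generate_synthetic(real_petal_lengths, n=1000)
```

The synthetic values imitate the real sample's average and spread, but they can only be as good as the small sample they were fitted to.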

Step 2: Training Data Preparation

The synthetic data is mixed with appropriate real-world data. For example, the AI-generated image of an airplane cockpit dashboard is combined with the actual image of a cockpit dashboard.

This gives the AI learning model an opportunity to learn from both kinds of data. It can learn not only to identify the component parts of the data, for example, the fuel meter and the altimeter, but also to distinguish between synthetic and real-world data.
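One way to picture the mixing step is to tag every item with where it came from before shuffling the two sources together. This is a minimal sketch under assumptions: the file names, the record layout, and the function names are all hypothetical.

```python
import random

def build_training_set(real_items, synthetic_items, seed=0):
    """Combine real and synthetic items, keeping a provenance tag on each."""
    records = [{"data": item, "source": "real"} for item in real_items]
    records += [{"data": item, "source": "synthetic"} for item in synthetic_items]
    random.Random(seed).shuffle(records)  # avoid feeding one source in a block
    return records

def synthetic_fraction(records):
    """Share of the training set that is synthetic, between 0.0 and 1.0."""
    return sum(r["source"] == "synthetic" for r in records) / len(records)

# Hypothetical file names for one real and three generated cockpit images.
mixed = build_training_set(
    ["real_cockpit_01.png"],
    ["synth_cockpit_01.png", "synth_cockpit_02.png", "synth_cockpit_03.png"],
)
print(synthetic_fraction(mixed))  # prints 0.75
```

Keeping the provenance tag is what would let a model, or a reviewer, tell synthetic items from real ones later.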

Step 3: Training the AI Model

The target AI model learns from the mixed data set.

For example, suppose the objective is to enable the AI model to learn about different types of dog images. An acceptable outcome is that it can identify each dog’s breed and categorize the breeds as sheepdogs, hounds, etc.

The AI model is given a limited collection of real dog images and a larger collection of synthetic ones.

The learning model studies the various characteristics and parameters in the data and learns to recognize patterns and draw inferences.

For example, dogs with short tails might be identified as Dobermans, or those with prominent and acutely triangular ears might be identified as German Shepherds.
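To make the idea of learning from features concrete, here is a minimal sketch of a nearest-centroid classifier trained on a mix of real and synthetic examples. It is an assumption chosen for brevity, not the method any particular system uses, and the feature values (tail length and ear height in centimetres) are invented for illustration.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    return tuple(sum(column) / len(points) for column in zip(*points))

def train(examples):
    """Group (features, label) pairs by label and average each group."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical features: (tail length in cm, ear height in cm).
real = [((6.0, 9.0), "doberman"), ((30.0, 12.0), "german_shepherd")]
synthetic = [
    ((5.0, 8.5), "doberman"), ((7.0, 9.5), "doberman"),
    ((28.0, 11.5), "german_shepherd"), ((32.0, 12.5), "german_shepherd"),
]

model = train(real + synthetic)
print(predict(model, (8.0, 9.0)))  # prints "doberman"
```

Real image models learn their own features from pixels rather than being handed two measurements, but the training loop has the same shape: examples in, a decision rule out.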

The learning model also learns not to overgeneralize from a single parameter. For example, Dobermans typically have short tails, but not all dogs with short tails are Dobermans.
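The short-tail point can be checked with simple counting. In the sketch below, the labelled dogs are made up for illustration; it compares how often a Doberman has a short tail with how often a short-tailed dog is a Doberman.

```python
# Hypothetical labelled dogs: (breed, has_short_tail).
dogs = [
    ("doberman", True), ("doberman", True), ("doberman", True),
    ("corgi", True), ("corgi", True), ("boxer", True),
    ("german_shepherd", False), ("german_shepherd", False),
]

def p_short_tail_given(breed, data):
    """Fraction of dogs of the given breed that have a short tail."""
    tails = [short for b, short in data if b == breed]
    return sum(tails) / len(tails)

def p_breed_given_short_tail(breed, data):
    """Fraction of short-tailed dogs that belong to the given breed."""
    breeds = [b for b, short in data if short]
    return breeds.count(breed) / len(breeds)

print(p_short_tail_given("doberman", dogs))        # prints 1.0
print(p_breed_given_short_tail("doberman", dogs))  # prints 0.5
```

In this toy data every Doberman has a short tail, yet only half of the short-tailed dogs are Dobermans, so a rule based on tail length alone would be wrong half the time.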

Using Data in the Real World

One of the most notable real-world examples of AI trained with AI-generated data is PilotNet, the self-driving car project by NVIDIA.

PilotNet is a deep learning system that learns about real-time driving from both synthetic data and observations of human drivers, who drive a special car designed to collect data on driving, road conditions, traffic signs, lane markings, vehicles, and pedestrians.

Driving is a complex task because it involves both skills and decision-making within an extremely short period of time. As the human driver drives the car, PilotNet gathers data, and the relevant data is marked as highlighted pixels.

The deep learning system behind the self-driving car must control the driving based on the highlighted pixels that identify various objects on the road, such as pedestrians, traffic signals, and vehicles.

Benefits of Synthetic Data

The main benefits of training AI with synthetic data are:

  • As stated, real-life data is hard to acquire because of various constraints, making synthetic data your best bet. Quality synthetic data that can get as close as possible to real data is the best source of learning for AI learning models.
  • With synthetic data, you avoid the risks of confidentiality or secrecy breaches that come with real-life data. Even when legally obtained with consent, real-life data comes with strings attached.
  • Synthetic data makes it possible to explore many different scenarios. For example, for a self-driving car, synthetic data can help explore driving on a congested street or a highway – without needing to get on the road.
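The scenario-exploration benefit can be sketched as enumerating combinations of conditions that a simulator would then render. The road types come from the example above; the weather and time-of-day values, and all names in the code, are assumptions added for illustration.

```python
import itertools

ROAD_TYPES = ["congested_street", "highway"]
WEATHER = ["clear", "rain", "fog"]   # assumed conditions, not from a real system
TIME_OF_DAY = ["day", "night"]       # assumed conditions, not from a real system

def enumerate_scenarios():
    """Build one scenario description per combination of conditions."""
    return [
        {"road": road, "weather": weather, "time": time}
        for road, weather, time in itertools.product(ROAD_TYPES, WEATHER, TIME_OF_DAY)
    ]

scenarios = enumerate_scenarios()
print(len(scenarios))  # prints 12
```

Collecting all twelve combinations on real roads would mean waiting for the right weather and traffic; a synthetic generator can produce each one on demand.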

Limitations and Issues

The defining feature of synthetic data is also its limitation: regardless of quality, it is not real-world data.

An AI model may take longer to learn about real-world objects from synthetic data.

Synthetic data can contain errors and biases that lead to unintended training outcomes because the data doesn’t match real-world use cases.

For example, synthetic data on credit scores and loan applications may be wrong, biased against specific communities, or inaccurate because it’s out of sync with the latest changes in data laws.

The outcome could be not only unintended but also dangerous.
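One simple check for the kind of bias described above is to compare approval rates between groups in the synthetic data before training on it. The sketch below is a minimal illustration with made-up records and hypothetical group labels; real fairness audits use richer measures than a single gap.

```python
def approval_rate(records, group):
    """Fraction of applications from the given group that were approved."""
    rows = [r for r in records if r["group"] == group]
    return sum(r["approved"] for r in rows) / len(rows)

def parity_gap(records, group_a, group_b):
    """Absolute difference in approval rates between two groups."""
    return abs(approval_rate(records, group_a) - approval_rate(records, group_b))

# Hypothetical synthetic loan applications for two groups, "A" and "B".
synthetic_loans = [
    {"group": "A", "approved": True}, {"group": "A", "approved": True},
    {"group": "A", "approved": True}, {"group": "A", "approved": False},
    {"group": "B", "approved": True}, {"group": "B", "approved": False},
    {"group": "B", "approved": False}, {"group": "B", "approved": False},
]

gap = parity_gap(synthetic_loans, "A", "B")
print(gap)  # prints 0.5
```

A large gap does not prove the data is unfair, but it is a signal that the synthetic records deserve a closer look before a model learns from them.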

However, synthetic data, despite limits, is still the best available data source on which AI models can learn.

Even so, business organizations might be extremely wary about using AI in sensitive use cases such as medical treatment, social issues, and loan applications.

The Bottom Line

Acquiring real-world data appears to be a major hindrance to the training of AI models, and data acquisition faces obstacles of many kinds.

Considering the remarkable things AI can do, governments, corporations, and research institutions need to work out how to enable AI systems to parse real-time data and strip out the parts that, if processed, might cause real-world problems.

However, in the meantime, synthetic data — used carefully — is better than nothing.


Kaushik Pal

Kaushik is a technical architect and software consultant, having over 23 years of experience in software analysis, development, architecture, design, testing and training industry. He has an interest in new technology and innovation areas. He focuses on web architecture, web technologies, Java/J2EE, open source, WebRTC, big data and semantic technologies. He has demonstrated his expertise in requirement analysis, architecture design & implementation, technical use case preparation, and software development. His experience has spanned different domains like insurance, banking, airlines, shipping, document management and product development, etc. He has worked with a wide variety of technologies starting from mainframe (IBM S/390),…