Synthetic Data

What Does Synthetic Data Mean?

Synthetic data is data that is generated mathematically from a statistical model rather than collected from real-world events. It plays an important role in finance, healthcare and artificial intelligence (AI), where it is used to protect personally identifiable information (PII) in raw data and to fabricate large volumes of new data for training machine learning (ML) algorithms.


One common way to create synthetic data is to fit a sequence of statistical regression models, one for each variable in a real-world data source. New data generated from the fitted models approximately preserves the statistical properties of the originating data, but its values do not correspond to a specific record, person or device.

Synthetic data gives data scientists and analysts quick access to additional data and can reduce the compliance burden of working with sensitive records. Its varied uses include:

  • Machine learning (ML) — synthetic data can be used to quickly create additional data that statistically resembles the originating raw data.
  • Analytics — synthetic data can be used to build large datasets by extrapolating information from relatively small datasets.
  • Compliance — synthetic data can be used to provide data privacy by de-coupling the information a record contains from its originating source.
  • Information security — synthetic data can be used to populate honeypots with fabricated data that's realistic enough to attract attackers.
  • Software development — synthetic data can be used in quality assurance (QA) to test code changes in a sandbox environment.

Techopedia Explains Synthetic Data

Perhaps the clearest way to explain synthetic data is by contrast: it is not “real” data produced naturally in the real world (“IRL” or “in the meatspace,” as practitioners sometimes call the non-digital world). Synthetic data is created without an actual real-world event driving it.

For example, a platform collects a real set of identifiers about each customer who uses it. An engineer could instead create the same kinds of identifiers for a fictional customer and load them into the system, and that would be an example of synthetic data.
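As a rough illustration of the fictional-customer example, the sketch below fabricates a few customer records using only the Python standard library. The `Customer` fields, the name lists and the `fake_customer` helper are assumptions made for this example; an actual platform would define its own identifiers.

```python
import random
import uuid
from dataclasses import dataclass, asdict

@dataclass
class Customer:
    customer_id: str
    name: str
    email: str
    signup_year: int

# Small hypothetical name pools; any list of plausible names would do.
FIRST_NAMES = ["Alex", "Dana", "Jordan", "Morgan", "Riley"]
LAST_NAMES = ["Ng", "Patel", "Rossi", "Silva", "Young"]

def fake_customer(rng):
    """Create one fictional customer that matches the real record layout."""
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    return Customer(
        # Seeded random bits keep the generated IDs reproducible between runs.
        customer_id=str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        name=f"{first} {last}",
        # example.com is reserved for documentation, so no real inbox is hit.
        email=f"{first}.{last}@example.com".lower(),
        signup_year=rng.randint(2015, 2024),
    )

rng = random.Random(42)
customers = [fake_customer(rng) for _ in range(5)]
for c in customers:
    print(asdict(c))
```

Records like these have the same shape as real ones, which is what makes them usable in a QA sandbox or a honeypot, but no real person stands behind any of them.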

Synthetic data is easier to understand in terms of how it is used in machine learning and similar technologies. The key is how the data is generated: unlike real data, which is recorded as events happen, synthetic data has to be deliberately created.

Synthetic data is a fundamental concept in new data technologies that make use of non-authentic, invented or automatically generated data rather than data produced by real-world events. Contrasting real and synthetic data makes it possible to understand more about how machine learning and other new forms of artificial intelligence work.

The use of synthetic data is likely to be a major issue in the development of future test and training data sets for machine learning technologies such as neural networks.


Margaret Rouse