What Does Synthetic Data Mean?
Synthetic data is data that is generated mathematically from a statistical model. It plays an important role in finance, healthcare and artificial intelligence (AI), where it is used to protect personally identifiable information (PII) in raw data and to fabricate large amounts of new data for training machine learning (ML) algorithms.
One common way to create synthetic data is to fit sequential statistical regression models to each variable in a real-world data source. New data generated from the regression models will have approximately the same statistical properties as the originating data, but its values will not correspond to a specific record, person or device.
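The sequential regression idea can be illustrated with a minimal sketch. It assumes purely numeric columns, ordinary least squares as the regression model, and normally distributed noise; production synthesizers typically use richer models. The function name `synthesize` and the two-column "age" and "income" dataset are hypothetical and stand in for a real-world source.

```python
import numpy as np

def synthesize(real, n_samples, seed=0):
    """Sketch of sequential regression synthesis for numeric columns.

    real: 2-D array-like, one row per record, one column per variable.
    Returns an array of n_samples fabricated rows.
    """
    rng = np.random.default_rng(seed)
    real = np.asarray(real, dtype=float)
    n_rows, n_cols = real.shape
    synthetic = np.empty((n_samples, n_cols))

    # First variable: sample from a normal fitted to its marginal distribution.
    synthetic[:, 0] = rng.normal(
        real[:, 0].mean(), real[:, 0].std(ddof=1), n_samples
    )

    for j in range(1, n_cols):
        # Fit variable j on variables 0..j-1 (plus an intercept) in the real data.
        X_real = np.column_stack([np.ones(n_rows), real[:, :j]])
        coef, *_ = np.linalg.lstsq(X_real, real[:, j], rcond=None)
        residuals = real[:, j] - X_real @ coef
        sigma = residuals.std(ddof=j + 1)

        # Predict from the already synthesized variables and add noise,
        # so no generated value is copied from a real record.
        X_syn = np.column_stack([np.ones(n_samples), synthetic[:, :j]])
        synthetic[:, j] = X_syn @ coef + rng.normal(0.0, sigma, n_samples)

    return synthetic

# Hypothetical stand-in for a real-world source: age and income for 500 people.
demo_rng = np.random.default_rng(42)
age = demo_rng.normal(40, 10, 500)
income = 1000 * age + demo_rng.normal(0, 5000, 500)
real = np.column_stack([age, income])

synthetic = synthesize(real, n_samples=2000)

# The age/income correlation should be similar in both datasets.
corr_real = np.corrcoef(real, rowvar=False)[0, 1]
corr_syn = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation: {corr_real:.3f}, synthetic correlation: {corr_syn:.3f}")
```

Each variable is modeled conditionally on the ones before it, which is why the relationship between columns carries over even though every row is newly sampled.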
Synthetic data gives data scientists and analysts quick access to additional data and reduces the compliance burden that comes with handling raw personal data. Its varied uses include:
- Machine learning (ML) -- synthetic data can be used to quickly create additional data that statistically resembles the originating raw data.
- Analytics -- synthetic data can be used to build large datasets by extrapolating information from relatively small datasets.
- Compliance -- synthetic data can be used to provide data privacy by de-coupling the information a record contains from its originating source.
- Information security -- synthetic data can be used to populate honeypots with fabricated data that's realistic enough to attract attackers.
- Software development -- synthetic data can be used in quality assurance (QA) to test code changes in a sandbox environment.