Extended definition
Synthetic data is artificially generated data that reproduces the statistical properties of a real dataset without exposing the original records. The idea is to learn the distribution that generated the real data and then sample new, plausible examples from it. Figueira and Vaz (2022) organize the techniques into two broad families: traditional, statistics-based methods such as Bayesian networks, tree-based methods, and Synthpop, and deep-learning methods such as generative adversarial networks, variational autoencoders, diffusion models, and, more recently, language models. The quality of a synthetic dataset is assessed along three dimensions that tend to be in tension: fidelity, the statistical resemblance to the real data; utility, the performance on real data of models trained on the synthetic data; and privacy, measured by the risk of re-identifying individuals in the original set. Murtaza and colleagues (2023), in their review of the healthcare domain, show that there is no single quality metric and that the choice of generator depends on the desired balance among these three dimensions.
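As a minimal illustration of the "learn the distribution, then sample" idea, the sketch below fits a two-dimensional Gaussian to a toy "real" dataset and draws new records from it. The toy data, the Gaussian assumption, and the function names (`fit_gaussian`, `sample_gaussian`, `correlation`) are assumptions made for illustration and do not come from the cited works; practical generators are considerably richer. The last lines run a crude fidelity check by comparing the correlation in the real and synthetic samples.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate the mean vector and 2x2 covariance from (x, y) rows."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def sample_gaussian(mean, cov, n, rng):
    """Draw n records using the Cholesky factor of the 2x2 covariance."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 ** 2)
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
    return out


def correlation(rows):
    """Pearson correlation between the two columns."""
    _, (sxx, sxy, syy) = fit_gaussian(rows)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)

# Toy "real" data (assumed): y depends linearly on x, plus noise.
real = []
for _ in range(2000):
    x = rng.gauss(50, 10)
    real.append((x, 2 * x + rng.gauss(0, 10)))

# Learn the distribution, then sample new records from it.
mean, cov = fit_gaussian(real)
synthetic = sample_gaussian(mean, cov, 2000, rng)

# Crude fidelity check: does the synthetic sample keep the correlation?
real_corr = correlation(real)
synth_corr = correlation(synthetic)
print(f"real corr={real_corr:.3f}, synthetic corr={synth_corr:.3f}")
```

A matching correlation covers only one aspect of fidelity; it says nothing yet about utility or privacy, which need their own checks.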
When it applies
Synthetic data applies when real data is scarce, expensive to obtain, or too sensitive to circulate. It applies to privacy-preserving sharing: a synthetic dataset can be published and reused where the original data, protected by legislation, could not. It applies to data augmentation, generating additional examples to train models when the real sample is small, and to balancing rare classes. It applies to system development and testing, providing realistic data without exposing real records. Dankar and Ibrahim (2021) offer practical guidelines for making synthetic data genuinely useful, showing that preprocessing, tuning, and utility measurement directly affect the quality of the result. In research, synthetic data enables reproducibility by allowing a substitute to be shared when the real data cannot be opened.
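The rare-class balancing case can be sketched as follows. This is a deliberately naive example, assuming independent per-feature Gaussians fitted to the minority class; it is not a method from the cited papers, and the helper name `synthesize_minority` and the toy data are hypothetical.

```python
import random
import statistics


def synthesize_minority(minority, n_new, rng):
    """Sample n_new records from independent per-feature Gaussians
    fitted to the minority class (feature correlations are ignored)."""
    columns = list(zip(*minority))
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [tuple(rng.gauss(mu, sd) for mu, sd in params) for _ in range(n_new)]


rng = random.Random(1)

# Toy dataset (assumed): 500 majority records and 20 minority records.
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(20)]

# Generate enough synthetic minority records to match the majority size.
extra = synthesize_minority(minority, len(majority) - len(minority), rng)
balanced_minority = minority + extra
print(len(majority), len(balanced_minority))
```

The 480 added records are drawn entirely from what the 20 real minority records support, so the quality of the balanced set is bounded by that small sample.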
When it does not apply
Synthetic data does not apply as an automatic guarantee of privacy. A generator that overfits the original may memorize and reveal real records, so privacy has to be measured explicitly rather than assumed. It does not apply as a perfect substitute for real data: fidelity is partial, and subtle patterns, tail correlations, and longitudinal structures tend to be lost, which Murtaza and colleagues (2023) flag as a recurrent limitation. It does not apply without task-based validation: a dataset that looks statistically faithful may train models that fail on real data. It does not apply to creating nonexistent information; synthetic data amplifies and protects what is already in the data, but does not invent new signal. And it does not apply unaudited when the source set is biased: the generator inherits that bias and can amplify it, a risk that requires auditing before use.
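One simple way to measure memorization, rather than assume its absence, is to compute for each synthetic record the distance to its closest real record: distances of zero or near zero suggest that real rows are being replayed. The sketch below is a heuristic check under stated assumptions, not a formal privacy guarantee; the two "generators" are hypothetical stand-ins built for the example, and the function name `distance_to_closest_record` is illustrative.

```python
import math
import random


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, Euclidean distance to the nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


rng = random.Random(2)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# Hypothetical generator A memorizes: it replays real rows verbatim.
memorized = [rng.choice(real) for _ in range(50)]
# Hypothetical generator B samples fresh points from the same distribution.
fresh = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]

dcr_memorized = distance_to_closest_record(memorized, real)
dcr_fresh = distance_to_closest_record(fresh, real)

# Exact copies show up as a distance of zero.
leaked = sum(d == 0.0 for d in dcr_memorized)
print(f"memorizing generator: {leaked}/50 exact copies of real rows")
print(f"fresh generator: min distance = {min(dcr_fresh):.4f}")
```

In practice the threshold for "too close" depends on the feature scales and the data, so the zero-distance test here catches only verbatim copies.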
Applications by field
- Health: privacy-preserving sharing of patient data, with re-identification risk assessment before release.
- Finance: generation of transactional data for fraud detection and system testing without exposing customer records.
- Computer vision: synthetic image data for augmentation and for rare scenarios that are hard to collect.
- Research and reproducibility: a publishable substitute for a sensitive dataset, allowing replication without opening the real data.
Common pitfalls
The first pitfall is assuming privacy without measuring it: synthetic is not a synonym for anonymous, and generators that memorize can leak real records. The second is trusting statistical fidelity alone without testing utility on the target task. The third is ignoring the trade-off among the three dimensions: maximizing privacy usually degrades fidelity and utility, and the balance point is a decision, not a default. The fourth is inheriting the source data’s bias without auditing it, propagating inequality to the trained models. The fifth is treating synthetic data as a creator of new signal: it only reorganizes and protects information already present, and it does not replace collecting real data when real data is what is missing.
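To guard against the second pitfall, utility can be tested directly by training on synthetic data and evaluating on held-out real data, then comparing against a model trained on real data. The sketch below assumes a toy two-class dataset, a per-class Gaussian generator, and a nearest-centroid classifier; all of these, along with the function names, are illustrative choices rather than anything prescribed by the cited works.

```python
import math
import random
import statistics


def make_real(n_per_class, rng):
    """Toy two-class data (assumed): class 0 near (0, 0), class 1 near (3, 3)."""
    return {
        0: [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_per_class)],
        1: [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(n_per_class)],
    }


def synthesize(data, n_per_class, rng):
    """Toy generator: independent per-feature Gaussians fitted per class."""
    out = {}
    for label, rows in data.items():
        params = [(statistics.mean(c), statistics.stdev(c)) for c in zip(*rows)]
        out[label] = [tuple(rng.gauss(m, s) for m, s in params)
                      for _ in range(n_per_class)]
    return out


def fit_nearest_centroid(data):
    """Return a mapping from label to the mean feature vector of that class."""
    return {label: tuple(statistics.mean(c) for c in zip(*rows))
            for label, rows in data.items()}


def accuracy(model, data):
    """Fraction of rows whose nearest centroid carries the true label."""
    correct = total = 0
    for label, rows in data.items():
        for row in rows:
            pred = min(model, key=lambda k: math.dist(row, model[k]))
            correct += pred == label
            total += 1
    return correct / total


rng = random.Random(3)
real_train = make_real(500, rng)
real_test = make_real(500, rng)  # held out, never seen by the generator
synthetic_train = synthesize(real_train, 500, rng)

trtr = accuracy(fit_nearest_centroid(real_train), real_test)
tstr = accuracy(fit_nearest_centroid(synthetic_train), real_test)
print(f"train on real, test on real:      {trtr:.3f}")
print(f"train on synthetic, test on real: {tstr:.3f}")
```

A small gap between the two accuracies suggests the synthetic set preserves what this particular task needs; a large gap signals lost utility even if summary statistics look faithful.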