2 Taxonomy and use cases

This primer presents a taxonomy that categorizes the main uses of synthetic data by intended purpose, based on use cases. While the categories and examples below are neither exhaustive nor mutually exclusive, they are meant to help organizations and decision-makers understand how to leverage synthetic data in their sector.

TABLE 1: Synthetic data uses

Purpose (why might synthetic data be used?): To enhance privacy
To safeguard privacy and confidentiality by generating statistically equivalent datasets that maintain analytical utility while eliminating the risk of exposing sensitive personal information.
Example use cases:
Official statistics: When confidentiality requirements prevent statistical offices from sharing granular microdata from official statistics, the use of synthetic data enables access to detailed data while maintaining privacy.[11]

Purpose (why might synthetic data be used?): To bridge data deficiencies
To fill critical data gaps caused by scarcity or bias in organic data.
Example use cases:
Healthcare and demographics: Organic datasets often exclude vulnerable populations – under-representing low-resource languages,[12] rare genetic conditions and entire demographic groups in clinical research. Synthetic data can improve model fairness and health equity by offering representative examples[13] and simulating under-represented groups in clinical trials.[14]
Child behaviour modelling: Synthetic data can replicate children’s behaviour patterns for safer research and development without the direct involvement of minors.[15]
Financial inclusion: Synthetic data can correct lending practices driven by gender-biased data to provide fairer access to financial services for women in emerging markets.[16]
Criminal justice: Synthetic data can ensure diverse demographic representation that reduces racial bias in predictive policing models.

Purpose (why might synthetic data be used?): To improve model performance
To provide realistic, diverse inputs during model training for accuracy improvements.
Example use cases:
Diversification of training data: AI model developers often seek to diversify training datasets to deliver more accurate or representative outputs. For example, ByteDance reportedly uses synthetic data to augment training datasets for its LLMs.[17]
Add noise to training data: For some use cases, adding synthetic noise during training can improve model robustness,[18] prevent overfitting[19] or help assess the impact of existing noise on model performance.[20]

Purpose (why might synthetic data be used?): For scenario modelling and forecasting
To generate realistic future scenarios or to create precise virtual replicas of physical systems to simulate, analyse and optimize real-world performance (digital twins).
Example use cases:
Crisis modelling: By simulating catastrophic events – prolonged blackouts, supply chain collapses, pandemic outbreaks – policy-makers can stress-test responses and build resilience across climate adaptation, urban planning, epidemiology and critical infrastructure.
Robotics and autonomous vehicles: Synthetic modelling in robotics offers safe alternatives to plan for scenarios where it is impossible to obtain sufficient training data, such as real-world testing to detect near-misses or failures.[21] For example, Waymo created a digital twin of San Francisco’s streetscapes to test autonomous vehicle behaviour in diverse driving conditions, including those not easily encountered – or safe to replicate – in real life.[22]

Synthetic Data: The New Data Frontier 6
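As an illustration of the "add noise to training data" use case in Table 1, the following is a minimal sketch of noise-based augmentation. It is not taken from the report or from any cited source: the function names, the Gaussian noise model, the noise level `sigma` and the toy data are all assumptions chosen for illustration.

```python
import random


def add_gaussian_noise(features, sigma, rng):
    """Return a copy of one feature vector with zero-mean Gaussian noise added."""
    return [value + rng.gauss(0.0, sigma) for value in features]


def augment_with_noise(samples, labels, copies=2, sigma=0.05, seed=0):
    """Append `copies` noisy variants of every sample; labels are reused unchanged."""
    rng = random.Random(seed)  # fixed seed so the augmentation is reproducible
    augmented_samples = [list(s) for s in samples]  # keep the original samples
    augmented_labels = list(labels)
    for features, label in zip(samples, labels):
        for _ in range(copies):
            augmented_samples.append(add_gaussian_noise(features, sigma, rng))
            augmented_labels.append(label)
    return augmented_samples, augmented_labels


# Toy, made-up data (hypothetical): two samples with two numeric features each.
samples = [[0.2, 1.0], [0.8, 0.1]]
labels = [0, 1]
noisy_samples, noisy_labels = augment_with_noise(samples, labels)
print(len(noisy_samples), len(noisy_labels))  # 6 6
```

In practice, the appropriate noise distribution and magnitude depend on the data and the model, and would need to be validated against held-out organic data rather than assumed.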

Synthetic Data 2025