Synthetic Data 2025
2 Taxonomy and use cases
This primer presents a taxonomy that categorizes
the main uses of synthetic data by intended
purpose, based on use cases. While the categories
and examples below are not exhaustive or mutually
exclusive, they seek to guide organizations and
decision-makers to better understand how to
leverage synthetic data in their sector.
TABLE 1 Synthetic data uses
Columns: Purpose (why might synthetic data be used?); Example use cases
To enhance privacy
To safeguard privacy and confidentiality by
generating statistically equivalent datasets that
maintain analytical utility while eliminating the risk of
exposing sensitive personal information.
Official statistics: When confidentiality requirements prevent
statistical offices from sharing granular microdata from official
statistics, the use of synthetic data enables access to detailed data
while maintaining privacy.11
To bridge data deficiencies
To fill critical data gaps caused by scarcity or bias in
organic data.
Healthcare and demographics: Organic datasets often exclude
vulnerable populations – under-representing low-resource languages,12
rare genetic conditions and entire demographic groups in clinical
research. Synthetic data can improve model fairness and health
equity by offering representative examples13 and simulating under-
represented groups in clinical trials.14
Child behaviour modelling: Synthetic data can replicate children’s
behaviour patterns for safer research and development without the
direct involvement of minors.15
Financial inclusion: Synthetic data can correct lending practices driven
by gender-biased data to provide fairer access to financial services for
women in emerging markets.16
Criminal justice: Synthetic data can ensure diverse demographic
representation that reduces racial bias in predictive policing models.
To improve model performance
To provide realistic, diverse inputs during model
training for accuracy improvements.
Diversification of training data: AI model developers are often
looking to diversify training datasets to deliver more accurate or
representative outputs. For example, ByteDance reportedly uses
synthetic data to augment training datasets for its LLMs.17
Add noise to training data: For some use cases, adding synthetic
noise during training can improve model robustness,18 prevent
overfitting19 or help assess the impact of existing noise on model
performance.20
For scenario modelling and forecasting
To generate realistic future scenarios or to create
precise virtual replicas of physical systems
to simulate, analyse and optimize real-world
performance (digital twins).
Crisis modelling: By simulating catastrophic events – prolonged
blackouts, supply chain collapses, pandemic outbreaks – policy-makers
can stress-test responses and build resilience across climate
adaptation, urban planning, epidemiology and critical infrastructure.
Robotics and autonomous vehicles: Synthetic modelling in robotics
offers a safe way to prepare for scenarios in which sufficient
training data cannot be obtained, such as real-world testing to detect
near-misses or failures.21 For example, Waymo created a digital twin of
San Francisco’s streetscapes to test autonomous vehicle behaviour in
diverse driving conditions, including those not easily encountered – or
safe to replicate – in real life.22
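The "Add noise to training data" use case in the table can be illustrated with a minimal Python sketch. It assumes numeric feature rows and zero-mean Gaussian noise; the function name add_gaussian_noise and the parameters sigma, copies and seed are hypothetical choices for illustration, not a method described in this primer. Suitable noise levels depend on the data and the model.

```python
import random


def add_gaussian_noise(rows, sigma=0.1, copies=1, seed=0):
    """Return the original rows followed by `copies` noisy variants of each row.

    Assumption: every feature is numeric and zero-mean Gaussian noise with
    standard deviation `sigma` is appropriate for all of them.
    """
    rng = random.Random(seed)  # fixed seed so the augmentation is repeatable
    augmented = [list(row) for row in rows]  # keep the organic rows unchanged
    for _ in range(copies):
        for row in rows:
            augmented.append([x + rng.gauss(0.0, sigma) for x in row])
    return augmented


# Toy training set: three rows with two numeric features each.
train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
augmented = add_gaussian_noise(train, sigma=0.05, copies=2)
print(len(augmented))  # 3 original rows + 2 noisy copies of each = 9
```

Keeping the original rows alongside the noisy copies makes it possible to compare a model trained on train with one trained on augmented, which is one way to assess how noise affects performance.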
Synthetic Data: The New Data Frontier