Synthetic Data 2025
Synthetic data:
Core definitions1
Broadly speaking, data2 can be defined as a
collection of facts, numbers, words or observations,3
either structured or unstructured, and generated
through a variety of interactions or processes (e.g.
commercial transactions, creative or industrial
processes, human behaviour or environmental
measurements). It may be collected directly via
human interaction or indirectly from devices, systems
or networks, and is stored through technological
means. Data processing and data analysis then allow
for the transformation of raw datasets into insights
for decision-making. But not all data is created equal,
making it essential for stakeholders to understand
the origin, purpose, authenticity and reliability of the
data that feeds their decision-making and artificial
intelligence (AI) systems.
Organic data refers to data that is generated
naturally and collected directly from real sources,
through authentic and unaltered interactions,
behaviours or phenomena, such as user actions
on a website, sensor readings of an environment or
transactions within a system.4 While organic data
may be filtered, aggregated or organized for easier
analysis, its core features and statistical properties
remain faithful to real-world events.
Synthetic data refers to data that is generated
by artificial means, such as statistical or algorithmic
methods, or by AI. Synthetic data is created rather
than collected from real-world sources, with the aim
of addressing varied challenges of data unavailability,
scarcity, privacy or representativeness. For specific
use cases, synthetic data reproduces the key
statistical characteristics, structure or distribution
of organic data and is generated through methods
like statistical modelling, machine learning (ML)
algorithms, simulations or hybrid approaches.5 While
initially developed to enhance privacy,6 it has many
other uses as a substitute or complement when real-
world data is unavailable, impractical or suboptimal.7
Like data in general, synthetic data can take many
forms, each defined by its underlying generation
method and the challenge it seeks to solve:
– AI-generated data is a type of synthetic
data produced by AI models, including ML
methods and generative models (e.g. generative
adversarial networks or GANs; large language
models or LLMs).8 It is often created to replicate
real-world data for specific tasks, such as
image generation, text synthesis or content
creation, but can also be produced purely for
the enjoyment of a user. It can support data
augmentation, model training and the creation
of datasets for AI systems.9 AI-generated data
is distinct in that it comes from AI models,
which allow for more creativity and reasoning
in their outputs but also introduce unique risks, such as
a lack of transparency or hallucinations.
– Simulated data is a type of synthetic data
that is generated through traditional modelling,
simulation techniques and AI methods. It aims
to accurately represent the characteristics
and behaviour of real-world phenomena (e.g.
physical systems, economic models or digital
twins). While the line between simulated data
and other forms of synthetic data is increasingly
blurring, a distinctive feature of simulated data
is its focus on representing real-world dynamics
and behaviours, often for testing, scenario
analysis or risk assessment.
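As a rough illustration of the simulated data described in the list above, the following Python sketch generates sensor-style readings from a toy model of a physical system. The cooling model, the function name simulate_cooling and all parameter values are assumptions made for this example only; they do not come from the source, and real simulations are typically far more detailed.

```python
import random


def simulate_cooling(t_start, t_ambient, rate, steps, noise_sd, seed=0):
    """Simulate noisy temperature readings of an object cooling toward ambient.

    Toy model (an assumption for illustration): at each step the true
    temperature moves toward the ambient temperature by the fraction `rate`,
    and the simulated sensor adds Gaussian noise with standard deviation
    `noise_sd`.
    """
    rng = random.Random(seed)  # fixed seed so the simulated run is reproducible
    temperature = t_start
    readings = []
    for _ in range(steps):
        temperature += rate * (t_ambient - temperature)
        readings.append(temperature + rng.gauss(0.0, noise_sd))
    return readings


# Hypothetical scenario: an object at 90 degrees cooling in a 20-degree room.
readings = simulate_cooling(
    t_start=90.0, t_ambient=20.0, rate=0.1, steps=50, noise_sd=0.5
)
```

No value here is collected from a real sensor; the data reflects the dynamics encoded in the model, which is why such data tends to be used for testing and scenario analysis rather than as a record of actual events.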
Hybrid datasets, meanwhile, combine organic
and synthetic data, often with the goal of creating
more robust datasets, for example to fill gaps where
real data is missing or under-represented in clinical
trial data.10 Conversely, they may contain synthetic
data refined through human-in-the-loop editing to
enhance its accuracy or align it more closely with
the real world.
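The idea of a hybrid dataset can be sketched in a few lines of Python. The example below fits a normal distribution to a small organic sample, draws synthetic values from it to fill a gap, and labels every record with its provenance. The function names, the choice of a normal distribution and the sample values are all assumptions made for illustration; practical generators usually model far richer structure than a single mean and standard deviation.

```python
import random
import statistics


def fit_and_sample(organic, n, seed=0):
    """Fit a normal distribution to organic values and draw n synthetic ones."""
    mean = statistics.mean(organic)
    sd = statistics.stdev(organic)  # sample standard deviation, needs >= 2 values
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]


def build_hybrid(organic, target_size, seed=0):
    """Top up an organic sample with synthetic values, labelling provenance."""
    missing = max(0, target_size - len(organic))
    synthetic = fit_and_sample(organic, missing, seed)
    return (
        [(value, "organic") for value in organic]
        + [(value, "synthetic") for value in synthetic]
    )


# Made-up values standing in for an under-represented group in a dataset.
organic_ages = [34.0, 41.0, 29.0, 52.0, 47.0, 38.0]
hybrid = build_hybrid(organic_ages, target_size=10)
```

Keeping the provenance label on each record is one simple way to let downstream users tell organic from synthetic entries when assessing the origin and reliability of the data.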