Synthetic Data 2025
As synthetic data becomes increasingly prevalent in
the broader data ecosystem, it can also be misused
or exploited to generate false representations
and undermine public trust. Moreover, the
strategic value of synthetic data can be realized
only if it is implemented responsibly. The very
dimensions synthetic data is designed to improve,
such as privacy, accuracy and representation,
can instead be degraded if its risks are not
addressed. The most critical issues include:23
– Representativeness and bias. If synthetic data
is generated from biased or non-inclusive sources,
or without the involvement of under-represented
groups, models built from such data can
perpetuate or even amplify existing inequalities.
For example, biased healthcare training datasets
may result in misdiagnosis or unequal care.
– Accuracy and utility. High-quality source
data and robust generative processes are
critical; otherwise, errors can be introduced in
downstream systems. For example, synthetic
images meant to train computer vision systems
must reflect realistic lighting, motion and
occlusion characteristics.
– Model collapse. Overreliance on synthetic
data in generative AI model training can lead
to deterioration in model performance, as it
may not capture real-world complexity – a
risk known as “model collapse” or “model
autophagy.” Clear identification, documentation
and traceability of synthetic data are vital to
maintain transparency and model integrity.
– Provenance and traceability. Without reliable
metadata or provenance tracking, users cannot
assess a dataset’s origin or determine whether it
is synthetic, AI-generated or a mix of both, and
so cannot make informed decisions. Poor
traceability also increases the risk of model
collapse if synthetic data is unknowingly
integrated into the AI training set.
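In practice, provenance tracking can start with a small, machine-readable record shipped alongside each dataset. The sketch below is a minimal illustration in Python; the `ProvenanceRecord` class and its field names are hypothetical assumptions for this example, not a published schema such as a formal provenance standard:

```python
from dataclasses import dataclass, asdict
from typing import Optional
import hashlib
import json

# Hypothetical minimal provenance record for a dataset release.
# Field names are illustrative, not drawn from any standard.
@dataclass
class ProvenanceRecord:
    dataset_id: str
    origin: str                 # "real", "synthetic" or "mixed"
    generator: Optional[str]    # tool or model that produced synthetic rows, if any
    source_checksum: str        # hash of the source data the generator was trained on

def make_record(dataset_id: str, origin: str,
                generator: Optional[str], source_bytes: bytes) -> ProvenanceRecord:
    # Checksum ties the record to a specific source snapshot,
    # so consumers can detect silent substitution of the source data.
    checksum = hashlib.sha256(source_bytes).hexdigest()
    return ProvenanceRecord(dataset_id, origin, generator, checksum)

record = make_record("patients-v2", "synthetic", "ctgan-demo", b"...source data...")
print(json.dumps(asdict(record), indent=2))
```

A downstream training pipeline could then refuse, or at least flag, any dataset whose record marks it as synthetic, which directly mitigates the model-collapse risk of unknowingly ingesting synthetic data.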
– Privacy and confidentiality. While synthetic
data is often seen as privacy-preserving, the
level of protection depends on the generative
method. Poorly anonymized datasets can
leak sensitive information about individuals or
groups or be vulnerable to deanonymization,
especially when linked to external datasets in
unsecured environments.
– Misuse and deception. A well-known risk with
AI-generated synthetic media is the potential
for misuse to create deepfakes or other
convincing but deceptive material. Without clear
legal frameworks for attribution and labelling,
malicious actors can exploit synthetic or AI-
generated data to impersonate individuals,
spread disinformation or violate consent.
– Erosion of public trust. As synthetic content
proliferates, public scepticism about data
authenticity grows – even for organic data.
This “liar’s dividend” threatens societal trust
and is worsened by poor disclosure or
traceability practices.