Synthetic Data 2025

Page 8 of 14 · WEF_Synthetic_Data_2025.pdf

As synthetic data becomes increasingly prevalent in the broader data ecosystem, it can also be misused or exploited to generate false representations and undermine public trust. Additionally, the strategic value of synthetic data can only be leveraged if its implementation is carried out responsibly. Dimensions that synthetic data is designed to improve, such as privacy, accuracy and representation, can instead be worsened if its risks are not addressed. The most critical issues include:23

– Representativeness and bias. If synthetic data is generated from biased or non-inclusive sources, or without the involvement of under-represented groups, models built from such data can perpetuate or even amplify existing inequalities. For example, biased healthcare training datasets may result in misdiagnosis or unequal care.

– Accuracy and utility. High-quality source data and robust generative processes are critical; otherwise, errors can propagate into downstream systems. For example, synthetic images meant to train computer vision systems must reflect realistic lighting, motion and occlusion characteristics.

– Model collapse. Overreliance on synthetic data in generative AI model training can degrade model performance, as synthetic data may not capture real-world complexity, a risk known as “model collapse” or “model autophagy”. Clear identification, documentation and traceability of synthetic data are vital to maintaining transparency and model integrity.

– Provenance and traceability. Without reliable metadata or provenance tracking, users cannot assess a dataset’s origin or determine whether it is synthetic, AI-generated or a mix of both, and so cannot make informed decisions. Poor traceability also increases the risk of model collapse if synthetic data is unknowingly integrated into AI training sets.

– Privacy and confidentiality. While synthetic data is often seen as privacy-preserving, the level of protection depends on the generative method. Poorly anonymized datasets can leak sensitive information about individuals or groups, or be vulnerable to deanonymization, especially when linked to external datasets in unsecured environments.

– Misuse and deception. A well-known risk with AI-generated synthetic media is its potential misuse to create deepfakes or other convincing but deceptive material. Without clear legal frameworks for attribution and labelling, malicious actors can exploit synthetic or AI-generated data to impersonate individuals, spread disinformation or violate consent.

– Erosion of public trust. As synthetic content proliferates, public scepticism about data authenticity grows, even for organic data. This “liar’s dividend” threatens societal trust and is worsened by poor disclosure or traceability practices.
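The model-collapse dynamic described above can be illustrated with a stylized simulation (not from the report): a toy “generator” that fits a normal distribution to its training data and slightly underweights rare tail events, as real generative models often do. When each generation trains only on the previous generation’s synthetic output, the spread of the data shrinks and real-world variety is gradually lost. All names and parameters below are illustrative assumptions.

```python
import random
import statistics

def train_generator(samples):
    """'Train' a toy generative model: fit a normal distribution to the data."""
    return statistics.fmean(samples), statistics.stdev(samples)

def generate(mu, sigma, n, clip=2.0):
    """Sample from the fitted model. The clip models a generator that
    underweights rare tail events (an assumed, common failure mode)."""
    out = []
    while len(out) < n:
        x = random.gauss(mu, sigma)
        if abs(x - mu) <= clip * sigma:
            out.append(x)
    return out

random.seed(42)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]  # the "real" data

spreads = []
for generation in range(10):
    mu, sigma = train_generator(data)
    data = generate(mu, sigma, 2000)  # next model trains only on synthetic data
    spreads.append(statistics.stdev(data))

# Each generation loses a little tail mass, so the distribution narrows:
print(f"spread: gen 1 = {spreads[0]:.2f}, gen 10 = {spreads[-1]:.2f}")
```

This is why the report stresses clear identification of synthetic data: if generations of synthetic output are unknowingly fed back into training, the feedback loop compounds.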
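On provenance and traceability, one lightweight practice is to attach a machine-readable provenance record to each dataset so downstream users can tell whether it is organic, synthetic or mixed. The sketch below shows a hypothetical minimal record as a JSON sidecar; the field names and values are illustrative assumptions, not a standard schema such as W3C PROV.

```python
import json
from datetime import date

# Hypothetical provenance sidecar for a dataset; all fields are illustrative.
provenance = {
    "dataset": "patient_visits_v3",           # illustrative dataset name
    "content_type": "synthetic",              # "organic", "synthetic" or "mixed"
    "generator": "tabular-gan-0.4",           # assumed tool that produced the data
    "source_datasets": ["patient_visits_v2"], # lineage back to upstream data
    "created": date(2025, 1, 15).isoformat(),
    "synthetic_fraction": 1.0,                # share of records that are synthetic
}

record = json.dumps(provenance, indent=2)
```

A record like this lets a consumer check `content_type` before adding the dataset to a training corpus, directly addressing the feedback risk the report flags.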