Synthetic Data 2025

Page 9 of 14 · WEF_Synthetic_Data_2025.pdf

Recommendations

Given the outlined risks and challenges, key decision-makers must carefully consider the trade-offs when deciding whether and how to leverage synthetic data. Typically, in organizations, the use of synthetic data is shaped by a collaboration between developers and adopters (e.g. researchers and data scientists) with regulators and policy-makers (e.g. legal advisors, executive leadership and policy teams).

For developers and adopters:

– Prioritize model quality: Implement quality assessment protocols to ensure that synthetic data generation models reflect relevant dimensions while also preserving real-world data distributions and meeting privacy and fairness standards.
– Invest in robust traceability and provenance: Implement robust systems to track data origins and transformations, using metadata to identify synthetic elements and their sources. Upfront investment is essential, as retroactive tracing can be costly or impossible.
– Ensure transparency: Make data generation processes transparent to distinguish synthetic data from organic data.
– Implement technical safeguards: Techniques like watermarking, cryptographic provenance or dataset “nutrition labels” build trust and should be combined with human oversight for high-risk applications.
– Diversify stakeholder engagement: Involve diverse communities in governance to identify risks, enhance legitimacy and assess for biases, especially with marginalized groups.
– Mitigate model collapse: Avoid relying solely on synthetic data for training AI models. Use hybrid approaches that combine synthetic and organic data,[24] and incorporate self-correction mechanisms based on organic data distributions.[25]

For regulators and policy-makers:

– Tailor governance: Not all synthetic data is created equal. Governance frameworks must distinguish between synthetic data intended to replicate real-world distributions and AI-generated data created for entertainment, expression or model training. Regulations should specify intended use, impact and safeguards for each category.
– Develop context-aware standards: Support the creation of sector-specific standards and related benchmarks, such as those for responsible AI, which safeguard an organization’s long-term ability to innovate responsibly. It would be useful to build on efforts by privacy regulators (e.g. the European Commission,[26] Personal Data Protection Commission Singapore[27] and the Information Commissioner’s Office in the United Kingdom[28]) and international organizations (e.g. the United Nations), especially for confidentiality-protecting public data releases.
– Promote education and capacity-building: Provide guidance for developers, regulators and decision-makers on when and how to use synthetic data responsibly, by using tools like impact assessments, provenance checklists and red-teaming exercises.

Synthetic Data: The New Data Frontier 9
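
As an illustration of two of the recommendations for developers and adopters above (tagging records with provenance metadata, and training on a hybrid of organic and synthetic data rather than on synthetic data alone), a minimal Python sketch might look like the following. The `Record` fields, the `build_hybrid_set` helper and the 50% default cap are hypothetical choices made for this example and do not come from the report.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """One data row plus illustrative provenance metadata (hypothetical schema)."""
    value: float
    is_synthetic: bool               # distinguishes synthetic rows from organic ones
    source: str                      # hypothetical identifier of where the row came from
    generator: Optional[str] = None  # hypothetical generator name; None for organic rows


def build_hybrid_set(organic, synthetic, max_synthetic_share=0.5, seed=0):
    """Keep every organic record and add synthetic ones up to the given share.

    The cap is an assumption of this sketch, not a threshold from the report.
    """
    if not 0.0 <= max_synthetic_share < 1.0:
        raise ValueError("max_synthetic_share must be in [0, 1)")
    # Solve s / (s + o) <= share for s, the number of synthetic rows allowed.
    limit = int(max_synthetic_share * len(organic) / (1.0 - max_synthetic_share))
    rng = random.Random(seed)
    sampled = rng.sample(synthetic, min(limit, len(synthetic)))
    mixed = list(organic) + sampled
    rng.shuffle(mixed)  # avoid ordering the set by origin
    return mixed


if __name__ == "__main__":
    organic = [Record(float(i), False, "survey") for i in range(8)]
    synthetic = [Record(float(i), True, "generator-output", "demo-model") for i in range(20)]
    mixed = build_hybrid_set(organic, synthetic, max_synthetic_share=0.5)
    share = sum(r.is_synthetic for r in mixed) / len(mixed)
    print(f"{len(mixed)} records, synthetic share {share:.2f}")
```

Because every record carries its own `is_synthetic` flag and `source`, the synthetic share of any derived dataset can be audited later without retroactive tracing. A production system would more likely rely on an established provenance standard than on an ad hoc schema like this one.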