Synthetic Data 2025
Endnotes
1. The World Economic Forum’s network of Global Future Councils (GFCs) is the world’s foremost multistakeholder and
interdisciplinary knowledge network dedicated to promoting innovative thinking to shape a more resilient, inclusive and
sustainable future. Learn more about the GFC on Data Frontiers here:
https://initiatives.weforum.org/global-future-council-on-data-frontiers/home.
2. For the purposes of this publication, data is considered at the level of datasets or collections, rather than individual
elements such as pixels, images and documents.
3. IBM. (2024, October 1). What is data? https://www.ibm.com/think/topics/data.
4. Johnston, P. (2025, April 1). The Role of Organic and Synthetic Data in AI Safety and Security. ActiveFence.
https://www.activefence.com/the-role-of-organic-and-synthetic-data-in-ai-safety-and-security.
5. Synthetic data may be created through a variety of techniques, including statistical models (e.g. generative adversarial
networks or GANs, diffusion models, and generative AI models such as large language models or LLMs); mathematical
models (e.g. physical simulations and differential equations); rule-based systems; and hybrid approaches. See
https://aws.amazon.com/what-is/synthetic-data/ for discussion of various generation techniques.
6. Rubin, D. (1993). Discussion: Statistical Disclosure Limitation. Journal of Official Statistics. 9: 461–468.
7. European Data Protection Supervisor. (n.d.). Synthetic Data.
https://www.edps.europa.eu/press-publications/publications/techsonar/synthetic-data_en.
8. Cao, Y., et al. (2023, March 7). A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from
GAN to ChatGPT. arXiv:2303.04226v1.
9. Goyal, M., & Mahmoud, Q. H. (2024). A Systematic Review of Synthetic Data Generation Techniques Using Generative AI.
Electronics, 13(17), 3509.
10. IBM. (2023, January 31). What is synthetic data? https://www.ibm.com/think/topics/synthetic-data.
11. United Nations Economic Commission for Europe. (2023). Synthetic Data for Official Statistics: A Starter Guide.
https://unece.org/statistics/publications/synthetic-data-official-statistics-starter-guide.
12. Peppin, A., et al. (2025, May 27). The Multilingual Divide and its Impact on Global AI Safety. arXiv:2505.21344v1.
13. Peng, C., et al. (2023, May 22). A Study of Generative Large Language Model for Medical Research and Healthcare.
arXiv:2305.13523v1; Juwara, L., et al. (2024, April 12). An evaluation of synthetic data augmentation for mitigating
covariate bias in health data. doi: 10.1016/100946.
14. El Kababji, S., et al. (2025, March 5). Augmenting Insufficiently Accruing Oncology Clinical Trials Using Generative
Models: Validation Study. doi: 10.2196/66821.
15. Terblanche, C., et al. (2024, July 11). The development of synthetic child speech in three South African languages. doi:
10.1080/2374312.
16. World Economic Forum. (2024, September). Advancing Data Equity: An Action-Oriented Framework.
https://www3.weforum.org/docs/WEF_Advancing_Data_Equity_2024.pdf.
17. ByteDance. (2025, May 12). Bytedance’s Seed1.5-Embedding Model Achieves Sota in Retrieval: Training Details Unveiled.
https://seed.bytedance.com/en/blog/bytedance-s-seed1-5-embedding-model-achieves-sota-in-retrieval-training-details-unveiled.
18. Karpukhin, V., et al. (2019, February 5). Training on Synthetic Noise Improves Robustness to Natural Noise in Machine
Translation. arXiv:1902.01509v1.
19. Shuryak, I. (2017). Advantages of Synthetic Noise and Machine Learning for Analyzing Radioecological Data Sets. PLoS
ONE, 12(1), e0170007. https://doi.org/10.1371/journal.pone.0170007.
20. De Vries & Thierens (2024, September 23). Generating the Ground Truth: Synthetic Data for Soft Label and Label Noise
Research. arXiv:2309.04318v2.
21. Stanford Online. (2025, April 30). Stanford Seminar - Towards Open World Robot Safety [Video]. YouTube.
https://www.youtube.com/watch?app=desktop&v=zRq_3f4qrcU.
22. Tancik, M., et al. (2022, February 10). Block-NeRF: Scalable Large Scene Neural View Synthesis. arXiv:2202.05263v1.
23. De Wilde, P., et al. (2024, February 29). Recommendations on the Use of Synthetic Data to Train AI Models. United
Nations University.
24. Kazdan, J., et al. (2025, March 17). Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World.
arXiv:2410.16713v4.
25. Gillman, N., et al. (2024, June 10). Self-Correcting Self-Consuming Loops for Generative Model Training. arXiv:2402.07087v3.
26. European Commission. (n.d.). Data protection. https://commission.europa.eu/law/law-topic/data-protection_en.
27. Personal Data Protection Commission Singapore. (n.d.). Privacy Enhancing Technology (PET): Proposed Guide on Synthetic
Data Generation. https://www.pdpc.gov.sg/help-and-resources/2024/07/proposed-guide-on-synthetic-data-generation.
28. United Kingdom Information Commissioner’s Office (ICO). (n.d.). Guidance on Privacy-Enhancing Technologies.
https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-sharing/privacy-enhancing-technologies/.
Synthetic Data: The New Data Frontier