Synthetic data: Core definitions¹

Broadly speaking, data² can be defined as a collection of facts, numbers, words or observations,³ either structured or unstructured, and generated through a variety of interactions or processes (e.g. commercial transactions, creative or industrial processes, human behaviour or environmental measurements). It may be collected directly via human interaction or indirectly from devices, systems or networks, and is stored through technological means. Data processing and data analysis then allow for the transformation of raw datasets into insights for decision-making. But not all data is created equal, making it essential for stakeholders to understand the origin, purpose, authenticity and reliability of the data that feeds their decision-making and artificial intelligence (AI) systems.

Organic data refers to data that is generated naturally and collected directly from real sources, through authentic and unaltered interactions, behaviours or phenomena, such as user actions on a website, sensor readings of an environment or transactions within a system.⁴ While organic data may be filtered, aggregated or organized for easier analysis, its core features and statistical properties remain faithful to real-world events.

Synthetic data refers to data that is generated by artificial means, such as statistical or algorithmic methods, or by AI. Synthetic data is created rather than collected from real-world sources, with the aim of addressing varied challenges of data unavailability, scarcity, privacy or representativeness.
For specific use cases, synthetic data reproduces the key statistical characteristics, structure or distribution of organic data and is generated through methods like statistical modelling, machine learning (ML) algorithms, simulations or hybrid approaches.⁵ While initially developed to enhance privacy,⁶ it has many other uses as a substitute or complement when real-world data is unavailable, impractical or suboptimal.⁷

Like data in general, synthetic data can take many forms, each defined by its underlying generation method and the challenge it seeks to solve:

– AI-generated data is a type of synthetic data produced by AI models, including ML methods and generative models (e.g. generative adversarial networks or GANs; large language models or LLMs).⁸ It is often created to replicate real-world data for specific tasks, such as image generation, text synthesis or content creation, but can also be produced purely for the enjoyment of a user. It can support data augmentation, model training and the creation of datasets for AI systems.⁹ AI-generated data is distinct because it comes from AI models, which allow for more creativity and reasoning in their outputs, but it also introduces distinctive risks, such as a lack of transparency or hallucinations.

– Simulated data is a type of synthetic data that is generated through traditional modelling, simulation techniques and AI methods. It aims to accurately represent the characteristics and behaviour of real-world phenomena (e.g. physical systems, economic models or digital twins). While the line between simulated data and other forms of synthetic data is increasingly blurred, a distinctive feature of simulated data is its focus on representing real-world dynamics and behaviours, often for testing, scenario analysis or risk assessment.

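The idea of reproducing the key statistical characteristics of organic data through statistical modelling can be illustrated with a minimal sketch. It assumes a single numeric column, a normal distribution as the model and invented sample values. The names `fit_gaussian` and `generate_synthetic` are hypothetical and do not come from any method described in this paper. Real generators typically handle many variables, dependencies between them and privacy constraints that this sketch ignores.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of the organic data."""
    return statistics.fmean(values), statistics.stdev(values)


def generate_synthetic(values, n, seed=0):
    """Draw n synthetic values from a normal model fitted to the organic data.

    Assumption: a normal distribution is an adequate model for this column.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mu, sigma = fit_gaussian(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Invented "organic" readings, e.g. sensor temperatures in degrees Celsius.
organic = [20.1, 19.8, 21.3, 20.7, 19.5, 20.9, 21.1, 20.3]
synthetic = generate_synthetic(organic, n=5000, seed=42)

# Compare summary statistics of the organic and synthetic samples.
print("organic   mean/stdev:", statistics.fmean(organic), statistics.stdev(organic))
print("synthetic mean/stdev:", statistics.fmean(synthetic), statistics.stdev(synthetic))
```

None of the synthetic values is copied from the organic sample, yet the two sets should show similar means and spreads. That similarity holds only for the properties the chosen model captures, which is why the fit between generation method and use case matters.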
Hybrid datasets, meanwhile, combine organic and synthetic data, often with the goal of creating more robust datasets, for example to fill gaps where real data is missing or under-represented in clinical trial data.¹⁰ Conversely, they may contain synthetic data refined through human-in-the-loop editing to enhance its accuracy or align it more closely with the real world.
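
A hybrid dataset of the gap-filling kind described above might be assembled roughly as follows. This is a sketch under stated assumptions, not a recommended method. The records, the field names `group`, `value` and `source`, the helper `build_hybrid` and the noise level are all invented for illustration. The synthetic rows are produced by naive resampling with random noise, which stands in for whatever generator a real project would use.

```python
import random
from collections import Counter


def build_hybrid(organic_rows, target_per_group, noise_sd=0.5, seed=0):
    """Top up under-represented groups with synthetic rows and label provenance.

    Assumption: each row is a dict with a "group" label and a numeric "value".
    """
    rng = random.Random(seed)
    # Keep every organic row and mark where it came from.
    hybrid = [dict(row, source="organic") for row in organic_rows]

    by_group = {}
    for row in organic_rows:
        by_group.setdefault(row["group"], []).append(row)

    for group, rows in by_group.items():
        shortfall = max(0, target_per_group - len(rows))
        for _ in range(shortfall):
            template = rng.choice(rows)
            # Naive generator: copy a real row and perturb its value.
            hybrid.append({
                "group": group,
                "value": template["value"] + rng.gauss(0, noise_sd),
                "source": "synthetic",
            })
    return hybrid


# Invented records: group B is under-represented relative to group A.
organic_rows = [
    {"group": "A", "value": 5.1}, {"group": "A", "value": 4.9},
    {"group": "A", "value": 5.3}, {"group": "A", "value": 5.0},
    {"group": "B", "value": 6.2}, {"group": "B", "value": 6.0},
]
hybrid = build_hybrid(organic_rows, target_per_group=4)
print(Counter((row["group"], row["source"]) for row in hybrid))
```

Keeping an explicit `source` field on every row is the design choice worth noting. It lets downstream users see which records are organic and which are synthetic, which supports the understanding of origin and authenticity that this section calls for.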

Synthetic Data 2025