2.2 Curate diverse, high-quality datasets

Data is crucial for developing equitable, accurate and fair AI models. Various data-related challenges exist, including data accessibility, imbalance and ownership. Different methodologies are being implemented globally to address these issues.
TABLE 2: Key challenges in curating high-quality and diverse datasets

Key challenge: Access to high-quality data
Examples of successful initiatives:
- Open data platforms: Government programmes are encouraging open data sharing through the mutual exchange of public and private datasets.
- Synthetic data: Synthetic data is being used where diverse datasets are unavailable, particularly to meet model training requirements.
- Transparent multi-sided data markets: Marketplaces that allow for the structured exchange of data are helping to free data currently locked away within large platforms.

Key challenge: Addressing current data inequity
Examples of successful initiatives:
- Diverse and inclusive regional datasets: Capturing and curating datasets that represent local communities ensures that regional knowledge and insight are represented within AI model development.
- Digital language banks: Governments, the private sector and non-governmental organizations (NGOs) are collaborating to capture differences in idioms, cultural norms and religious considerations to build diverse language training datasets.
- Data equity approach: The adoption of a data equity approach across industries is helping to ensure that data represents all parts of the population.

Key challenge: Increasing data ownership considerations
Examples of successful initiatives:
- International data sharing agreements: Cross-border data flows within bilateral or multilateral trade agreements are accelerating the pace of innovation and AI product deployment while protecting national interests.
- Data residency requirements: National security and data protections are being governed through data residency requirements, which are also shaping regional AI infrastructure investment.

Key challenge: Keeping pace with advancements in AI
Examples of successful initiatives:
- Data governance frameworks: Frameworks and data protection rules are providing robust guidelines to ensure data accuracy, reliability, consistency, licensing and compliance across all stages of the AI development life cycle.
- Consensus on data quality: National and regional collaboration can help build parameters for collecting high-quality data, including the timeliness, accuracy, completeness, representativeness and consistency of metadata.

Key challenge: Lack of trust in AI
Examples of successful initiatives:
- Robust AI disclosure requirements: Disclosure requirements are being developed to ensure that individuals and organizations understand when outputs are AI-derived, while providing greater transparency on sources.
- New guidance for the thresholds of data collection: Refreshed guidelines on data privacy are being adopted to address risks related to personal data collected by AI.
- Opt-in/opt-out approaches: Organizations are exploring opt-in/opt-out mechanisms that let individuals make an informed choice between the benefits of AI usage and their preference not to engage.
Curating diverse and high-quality datasets requires a
coordinated action plan involving many stakeholders.
A preliminary set of five capabilities frames how this
strategic objective can be delivered:
Available and accessible data
To realize the transformative potential of AI, data must
be available and accessible for AI model development
so that AI can truthfully and accurately represent the
spectrum of communities it aims to empower. In the
context of inclusive AI, it is important to consider
the sensitivity, relationship, originality and value of
data for model development. Globally, governments
have committed to the United Nations Global Digital Compact, which emphasizes the need for
multistakeholder cooperation for the development and
deployment of open data, software and AI models.
Fugaku LLM, for example, is a Japan-based open-source LLM developed by a public-private and academic partnership.12 The model was trained on over 380 billion tokens of data, and significant effort was made to ensure that at least 60% of the training data originated in Japan for a Japanese audience.
When data is not available, artificially generated data (known as “synthetic data”) can bridge the gap.13 However, while synthetic data can be helpful, training a model purely on synthetic data can result in narrow model outputs, eroding the very diversity that the synthetic data is meant to address.14
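To make this balance concrete, the sketch below shows one way a data-curation team might audit a proposed training mix before training: enforcing a minimum share of locally sourced tokens (in the spirit of Fugaku LLM's 60% target) and capping the share of synthetic tokens so that synthetic data supplements rather than dominates the corpus. The 30% synthetic cap, the helper names and the example sources are illustrative assumptions, not prescriptions from this report.

```python
# Minimal sketch (illustrative only): auditing a proposed training corpus so that
# synthetic data stays a supplement and locally sourced data meets a minimum share.
# The 60% local and 30% synthetic thresholds are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    tokens: int          # number of training tokens contributed
    is_synthetic: bool   # artificially generated rather than collected
    is_local: bool       # originates from the target region or language community

def check_mix(sources, min_local=0.60, max_synthetic=0.30):
    """Return (ok, report) for a proposed training mix."""
    total = sum(s.tokens for s in sources)
    local = sum(s.tokens for s in sources if s.is_local) / total
    synthetic = sum(s.tokens for s in sources if s.is_synthetic) / total
    ok = local >= min_local and synthetic <= max_synthetic
    return ok, f"local share {local:.0%}, synthetic share {synthetic:.0%}"

if __name__ == "__main__":
    # Hypothetical mix roughly matching the 380-billion-token scale cited above.
    mix = [
        DataSource("regional web crawl", 230_000_000_000, is_synthetic=False, is_local=True),
        DataSource("open multilingual corpus", 110_000_000_000, is_synthetic=False, is_local=False),
        DataSource("synthetic dialogue data", 40_000_000_000, is_synthetic=True, is_local=True),
    ]
    ok, report = check_mix(mix)
    print(("OK: " if ok else "Rebalance needed: ") + report)
```

The same kind of check could be extended to other curation criteria in Table 2, such as minimum shares for underrepresented languages or for clearly licensed sources.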