2.2 Curate diverse, high-quality datasets

Data is crucial for developing equitable, accurate and fair AI models. Various data-related challenges exist, including data accessibility, imbalance and ownership. Different methodologies are being implemented globally to address these issues.
TABLE 2: Key challenges in curating high-quality and diverse datasets

Key challenge: Access to high-quality data
Examples of successful initiatives:
- Open data platforms: Government programmes are encouraging open data sharing through the mutual exchange of public and private datasets.
- Synthetic data: Synthetic data is being used where diverse datasets are unavailable, particularly to meet model training requirements.
- Transparent multi-sided data markets: Marketplaces that allow for the structured exchange of data are helping to free data currently locked away within large platforms.

Key challenge: Addressing current data inequity
Examples of successful initiatives:
- Diverse and inclusive regional datasets: Capturing and curating datasets that represent local communities ensures that regional knowledge and insight are represented within AI model development.
- Digital language banks: Governments, the private sector and non-governmental organizations (NGOs) are collaborating to capture differences in idioms, cultural norms and religious considerations to build diverse language training datasets.
- Data equity approach: The adoption of a data equity approach across industries is helping to ensure that data represents all parts of the population.

Key challenge: Increasing data ownership considerations
Examples of successful initiatives:
- International data sharing agreements: Cross-border data flows within bilateral or multilateral trade agreements are accelerating the pace of innovation and AI product deployment while protecting national interests.
- Data residency requirements: National security and data protections are being governed through data residency requirements, which are also shaping regional AI infrastructure investment.

Key challenge: Keeping pace with advancements in AI
Examples of successful initiatives:
- Data governance frameworks: Frameworks and data protection rules are providing robust guidelines to ensure data accuracy, reliability, consistency, licensing and compliance across all stages of the AI development life cycle.
- Consensus on data quality: National and regional collaboration can help build parameters for collecting high-quality data, including the timeliness, accuracy, completeness, representativeness and consistency of metadata.

Key challenge: Lack of trust in AI
Examples of successful initiatives:
- Robust AI disclosure requirements: Disclosure requirements are being developed to ensure that individuals and organizations understand when outputs are AI-derived, while providing greater transparency on sources.
- New guidance for the thresholds of data collection: Refreshed guidelines on data privacy are being adopted to address risks related to personal data collected by AI.
- Opt-in/opt-out approaches: Organizations are exploring opt-in/opt-out mechanisms that let individuals make an informed choice between the benefits of AI usage and their preference not to engage.
Curating diverse and high-quality datasets requires a
coordinated action plan involving many stakeholders.
A preliminary set of five capabilities frames how this
strategic objective can be delivered:
Available and accessible data
To realize the transformative potential of AI, data must
be available and accessible for AI model development
so that AI can truthfully and accurately represent the
spectrum of communities it aims to empower. In the
context of inclusive AI, it is important to consider
the sensitivity, relationship, originality and value of
data for model development. Globally, governments
have committed to the United Nations Global Digital Compact, which emphasizes the need for
multistakeholder cooperation for the development and
deployment of open data, software and AI models.
Fugaku LLM, for example, is a Japan-based open-source LLM developed by a public-private and academic partnership.12 The model was trained on over 380 billion tokens of data, and significant effort was made to ensure that at least 60% of the training data originated in Japan for a Japanese audience.
When data is not available, artificially generated data (known as “synthetic data”) can bridge the gap.13 However, while synthetic data can be helpful, training a model purely on synthetic data can result in narrow model outputs, eroding the very diversity that the synthetic data is meant to address.14
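To make this balance concrete, the sketch below shows one way a data-curation team might audit a proposed training mix before training: enforcing a minimum share of locally sourced tokens (in the spirit of Fugaku LLM's 60% target) and capping the share of synthetic tokens so that synthetic data supplements rather than dominates the corpus. The 30% synthetic cap, the helper names and the example sources are illustrative assumptions, not prescriptions from this report.

```python
# Minimal sketch (illustrative only): auditing a proposed training corpus so that
# synthetic data stays a supplement and locally sourced data meets a minimum share.
# The 60% local and 30% synthetic thresholds are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class DataSource:
    name: str
    tokens: int          # number of training tokens contributed
    is_synthetic: bool   # artificially generated rather than collected
    is_local: bool       # originates from the target region or language community

def check_mix(sources, min_local=0.60, max_synthetic=0.30):
    """Return (ok, report) for a proposed training mix."""
    total = sum(s.tokens for s in sources)
    local = sum(s.tokens for s in sources if s.is_local) / total
    synthetic = sum(s.tokens for s in sources if s.is_synthetic) / total
    ok = local >= min_local and synthetic <= max_synthetic
    return ok, f"local share {local:.0%}, synthetic share {synthetic:.0%}"

if __name__ == "__main__":
    # Hypothetical mix roughly matching the 380-billion-token scale cited above.
    mix = [
        DataSource("regional web crawl", 230_000_000_000, is_synthetic=False, is_local=True),
        DataSource("open multilingual corpus", 110_000_000_000, is_synthetic=False, is_local=False),
        DataSource("synthetic dialogue data", 40_000_000_000, is_synthetic=True, is_local=True),
    ]
    ok, report = check_mix(mix)
    print(("OK: " if ok else "Rebalance needed: ") + report)
```

The same kind of check could be extended to other curation criteria in Table 2, such as minimum shares for underrepresented languages or for clearly licensed sources.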