AI Agents in Action: Foundations for Evaluation and Governance 2025
such as reasoning, planning and tool use. Once
validated in sandbox environments that mirror
real-world tasks, agents may progress to controlled
deployment, where they are integrated into workflows
under close monitoring, with safeguards in place to
confirm that their actions align with human judgement
or established decision-making processes. Full
deployment should follow only once
reliability has been demonstrated, with fallback
mechanisms and defined human oversight. Audit
logs are central throughout this life cycle, providing
structured records of agent activity and the rationale
behind it. Audit logs also support governance
by enabling oversight and accountability, aiding
debugging by tracing errors and points of failure,
and helping inform evaluation.
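As a sketch of the kind of structured record an audit log might hold, the entry below captures an agent action alongside its stated rationale. The field names and helper are illustrative assumptions, not a standard schema:

```python
import json
import time
import uuid

def audit_record(agent_id, action, rationale, inputs, outcome):
    """Build one structured audit-log entry for an agent action.
    Field names here are illustrative, not a standard schema."""
    return {
        "entry_id": str(uuid.uuid4()),   # unique id for traceability
        "timestamp": time.time(),        # when the action occurred
        "agent_id": agent_id,
        "action": action,                # e.g. a tool call or decision
        "rationale": rationale,          # the agent's stated reasoning
        "inputs": inputs,                # what the agent acted on
        "outcome": outcome,              # what resulted
    }

# Hypothetical entry for a coding co-pilot running a test suite
record = audit_record(
    "copilot-1",
    "tool_call:run_tests",
    "verify patch before proposing it",
    {"patch_id": "p-42"},
    "passed",
)
print(json.dumps(record, indent=2))
```

Records in this shape support the three uses named above: oversight (who did what, when), debugging (tracing the inputs and rationale behind a failure) and evaluation (aggregating outcomes over time).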
The following principles support this life cycle of
agent evaluation:
– Contextualization: Reflect the tools,
workflows and edge cases the agent will
encounter in practice.
– Multidimensional assessment: Define success
across various factors, including accuracy,
robustness, latency tolerance, compliance
and user trust.
– Temporal and behavioural monitoring: Track
performance over time to detect regressions,
shifts in behaviour, or failures to adapt to
evolving inputs.
Emerging evaluation tools are increasingly applied
in enterprise settings to support the continuous
assessment of agentic systems, helping to track
reasoning, compare outcomes to expectations
and detect anomalies that are overlooked by
traditional testing. Major cloud providers have also
started embedding such frameworks into their AI
platforms, highlighting the importance of deployer-
side evaluation for adoption.
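One minimal form of such temporal monitoring can be sketched as a rolling comparison of an agent's recent success rate against an established baseline. The window size, baseline and tolerance below are illustrative assumptions, not recommended values:

```python
from collections import deque

class RegressionMonitor:
    """Flag a regression when recent success rate drops below baseline."""

    def __init__(self, baseline_rate, window=50, tolerance=0.10):
        self.baseline = baseline_rate
        self.window = deque(maxlen=window)  # most recent task outcomes
        self.tolerance = tolerance

    def record(self, success: bool) -> bool:
        """Record one task outcome; return True if a regression is flagged."""
        self.window.append(success)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data to judge yet
        rate = sum(self.window) / len(self.window)
        return rate < self.baseline - self.tolerance

# Hypothetical outcome stream whose behaviour shifts midway
monitor = RegressionMonitor(baseline_rate=0.90, window=20)
outcomes = [True] * 15 + [False] * 10
flags = [monitor.record(ok) for ok in outcomes]
```

In practice, deployed monitors track many more signals than a single success rate (latency, tool errors, behavioural drift), but the pattern of comparing a recent window against a baseline is the same.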
By approaching evaluation as a structured, context-
aware and continuous process, organizations can
more effectively determine whether an agent is fit
for deployment.
To illustrate how these principles apply in practice,
the following illustration examines a coding co-pilot
agent, applying the evaluation dimensions from a
deployer's perspective and showing how task-level
and system-level metrics can be used to assess
reliability, safety and overall performance in an
operational setting.
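As a sketch, task-level metrics such as task success rate, completion time and tool-call success can be computed from logged evaluation episodes. The data and field names below are hypothetical:

```python
# Hypothetical evaluation episodes for a coding co-pilot agent;
# field names and values are illustrative.
episodes = [
    {"task_ok": True,  "seconds": 41.0,  "tool_calls": 5,  "tool_errors": 0},
    {"task_ok": True,  "seconds": 63.5,  "tool_calls": 8,  "tool_errors": 1},
    {"task_ok": False, "seconds": 120.0, "tool_calls": 12, "tool_errors": 4},
]

# Fraction of tasks completed successfully
task_success_rate = sum(e["task_ok"] for e in episodes) / len(episodes)

# Average wall-clock time per task
mean_completion_time = sum(e["seconds"] for e in episodes) / len(episodes)

# Fraction of tool invocations that did not error
tool_call_success = 1 - (sum(e["tool_errors"] for e in episodes)
                         / sum(e["tool_calls"] for e in episodes))

print(f"task success rate:  {task_success_rate:.2f}")
print(f"mean completion:    {mean_completion_time:.1f}s")
print(f"tool call success:  {tool_call_success:.2f}")
```

System-level assessment would layer further signals on top of these, such as edge-case robustness and user-trust indicators, as reflected in the figure below.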
Effective evaluation depends on close collaboration
between providers and adopters, where transparent
documentation, model specifications and performance
reports from providers enable deployers to validate
reliability, identify risks and apply safeguards
throughout the system life cycle.
The results form an integrated performance
profile that informs subsequent risk assessment
and governance.
FIGURE 8: Foundations for AI agent evaluation and governance – evaluation criteria

[Figure: four panels.
Classification dimensions: Function; Predictability; Use case; Autonomy; Environment; Authority; Role.
Evaluation criteria: Capabilities; Task success rate; Task completion time; Tool call success; Edge case robustness; Error types; Trust indicators; and more.
Risk assessment life cycle: Define the use context; Identify risks; Analyse risks; Evaluate risks; Manage risks.
Progressive governance practices: Access control; Human oversight; Monitoring & logging; Traceability & identity; Trustworthiness & explainability; Testing & validation; Legal & compliance; Manual redundancy; Long-term management; and more.]