2.2 Evaluation
Robust evaluation is crucial for assessing
agent performance and limitations across
diverse contexts.
As organizations begin deploying agents with
different functional roles, the need for structured
evaluation becomes more important. This section
explores how evaluation methodologies are evolving
to reflect this growing complexity.
Agent “evaluation” refers to the measurement of an
AI agent’s performance and operation in representative
contexts, generating evidence about how well it
achieves intended functions, under what conditions
and with what limitations. Robust evaluation
frameworks are therefore essential for building
trust in AI agents’ performance. By providing clear,
multidimensional assessments of agent capabilities
and limitations, evaluations can help organizations
develop appropriate expectations and confidence
in agentic systems.
While the evaluation of foundation models such
as LLMs is supported by a rich landscape of
standardized benchmarks,18,19,20 agent evaluation
remains nascent. Unlike static models, agents
operate as orchestrated systems that combine tool
use, memory, decision-making and user interaction,
a combination that exceeds the scope of traditional benchmarks.
In response, several agent-specific capability
benchmarks have begun to emerge:
– AgentBench: Tests agents in interactive environments such as web browsing and games, and is useful for evaluating real-time decision-making and adaptability21
– SWE-bench: Evaluates an agent’s ability to resolve GitHub issues in open-source repositories, providing real-world measures of reasoning, code modification and system integration22
– HCAST: Compares agent performance to that of human developers on tasks such as programming, offering calibrated insights into agent coding capabilities23
Although these emerging benchmarks offer valuable
signals, they are typically built for academic or
research settings, where tasks are predefined,
environments are static and outcomes are often
deterministic. They rarely capture operational
realities such as ambiguous success criteria or
dynamic workflows.
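To make concrete what such benchmark-style testing typically involves, the sketch below shows a minimal harness in which tasks are predefined and success is judged by a deterministic check. The Task schema, the run_benchmark function and the single-call agent interface are illustrative assumptions for this sketch, not the actual APIs of AgentBench, SWE-bench or HCAST.

from dataclasses import dataclass
from typing import Callable, List

# Illustrative schema: real benchmarks define much richer task and environment specifications.
@dataclass
class Task:
    task_id: str
    prompt: str                     # predefined instruction given to the agent
    check: Callable[[str], bool]    # deterministic pass/fail check on the agent's final output

def run_benchmark(agent: Callable[[str], str], tasks: List[Task]) -> float:
    """Run each predefined task once and return the overall pass rate.

    `agent` is a hypothetical end-to-end callable; deployed agents would
    instead involve multi-step tool use, memory and environment feedback.
    """
    passed = sum(task.check(agent(task.prompt)) for task in tasks)
    return passed / len(tasks) if tasks else 0.0

# Example usage with a trivially scripted "agent" standing in for a real system.
tasks = [Task("t1", "Reply with the word OK", lambda out: out.strip() == "OK")]
print(run_benchmark(lambda prompt: "OK", tasks))   # -> 1.0

The deterministic check is precisely the element that mirrors the static, predefined character of research benchmarks described above, and it is what breaks down when success criteria are ambiguous or workflows change at runtime.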
Evaluation requires clear performance metrics that
capture both task-level and system-level outcomes.
Examples include task success rate, completion
time, error types, tool call success, throughput,
robustness against edge cases and user trust
indicators. These metrics help establish whether
the system delivers its functions reliably and provide
the operational evidence that later informs risk
assessment and governance decisions.
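As a simple illustration of how several of these metrics can be derived from run logs, the sketch below aggregates a hypothetical per-run record into task success rate, mean completion time, tool call success rate and an error-type breakdown. The AgentRun fields and the summarise function are assumptions made for illustration, not a standard logging schema, and the sketch assumes a non-empty batch of runs.

from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

# Hypothetical per-run record; real deployments would log far richer traces.
@dataclass
class AgentRun:
    succeeded: bool                   # did the agent complete the intended task?
    duration_s: float                 # wall-clock completion time in seconds
    tool_calls: int                   # total tool invocations in the run
    tool_call_failures: int           # tool invocations that errored
    error_type: Optional[str] = None  # e.g. "timeout" or "wrong_output"; None if successful

def summarise(runs: List[AgentRun]) -> dict:
    """Aggregate task-level and system-level metrics from a batch of runs."""
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / len(runs),
        "mean_completion_time_s": mean(r.duration_s for r in runs),
        "tool_call_success_rate": 1 - sum(r.tool_call_failures for r in runs) / max(total_calls, 1),
        "error_breakdown": {
            e: sum(1 for r in runs if r.error_type == e)
            for e in {r.error_type for r in runs if r.error_type}
        },
    }

Metrics such as throughput, robustness against edge cases and user trust indicators would require additional instrumentation, for example load testing, adversarial test suites or user surveys, rather than per-run logs alone.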
Providers benchmark systems to assess technical
maturity, while procurers and deployers are
responsible for ensuring that agents operate
safely and compliantly within specific industry,
organizational and operational contexts. Therefore,
deployment environments provide the most
accurate ground truth, but deployers often lack the
resources to design comprehensive benchmarks. In
many cases, this makes collaboration with providers
essential to establishing meaningful metrics.
An effective provider-focused evaluation should begin
with a technical screening of baseline capabilities,