2.2 Evaluation
Robust evaluation is crucial for assessing
agent performance and limitations across
diverse contexts.
As organizations begin deploying agents with
different functional roles, the need for structured
evaluation becomes more important. This section
explores how evaluation methodologies are evolving
to reflect this growing complexity.
Agent “evaluation” refers to the measurement of an
AI agent’s performance and operation in representative
contexts, generating evidence about how well it
achieves intended functions, under what conditions
and with what limitations. Robust evaluation
frameworks are therefore essential for building
trust in AI agents’ performance. By providing clear,
multidimensional assessments of agent capabilities
and limitations, evaluations can help organizations
develop appropriate expectations and confidence
in agentic systems.
While the evaluation of foundation models such
as LLMs is supported by a rich landscape of
standardized benchmarks,18,19,20 agent evaluation
remains nascent. Unlike static models, agents
operate as orchestrated systems that combine tool
use, memory, decision-making and user interaction,
a combination that exceeds the scope of traditional benchmarks.
In response, several agent-specific capability
benchmarks have begun to emerge:
– AgentBench: Tests agents in interactive environments such as web browsing and games, and is useful for evaluating real-time decision-making and adaptability21
– SWE-bench: Evaluates an agent’s ability to resolve GitHub issues in open-source repositories, providing real-world measures of reasoning, code modification and system integration22
– HCAST: Compares agent performance to that of human developers on tasks such as programming, offering calibrated insights into agent coding capabilities23
Although these emerging benchmarks offer valuable
signals, they are typically built for academic or
research settings, where tasks are predefined,
environments are static and outcomes are often
deterministic. They rarely capture operational
realities such as ambiguous success criteria or
dynamic workflows.
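To make concrete what such benchmark-style testing typically involves, the sketch below shows a minimal harness in which tasks are predefined and success is judged by a deterministic check. The Task schema, the run_benchmark function and the single-call agent interface are illustrative assumptions for this sketch, not the actual APIs of AgentBench, SWE-bench or HCAST.

from dataclasses import dataclass
from typing import Callable, List

# Illustrative schema: real benchmarks define much richer task and environment specifications.
@dataclass
class Task:
    task_id: str
    prompt: str                     # predefined instruction given to the agent
    check: Callable[[str], bool]    # deterministic pass/fail check on the agent's final output

def run_benchmark(agent: Callable[[str], str], tasks: List[Task]) -> float:
    """Run each predefined task once and return the overall pass rate.

    `agent` is a hypothetical end-to-end callable; deployed agents would
    instead involve multi-step tool use, memory and environment feedback.
    """
    passed = sum(task.check(agent(task.prompt)) for task in tasks)
    return passed / len(tasks) if tasks else 0.0

# Example usage with a trivially scripted "agent" standing in for a real system.
tasks = [Task("t1", "Reply with the word OK", lambda out: out.strip() == "OK")]
print(run_benchmark(lambda prompt: "OK", tasks))   # -> 1.0

The deterministic check is precisely the element that mirrors the static, predefined character of research benchmarks described above, and it is what breaks down when success criteria are ambiguous or workflows change at runtime.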
Evaluation requires clear performance metrics that
capture both task-level and system-level outcomes.
Examples include task success rate, completion
time, error types, tool call success, throughput,
robustness against edge cases and user trust
indicators. These metrics help establish whether
the system delivers its functions reliably and provide
the operational evidence that later informs risk
assessment and governance decisions.
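As a simple illustration of how several of these metrics can be derived from run logs, the sketch below aggregates a hypothetical per-run record into task success rate, mean completion time, tool call success rate and an error-type breakdown. The AgentRun fields and the summarise function are assumptions made for illustration, not a standard logging schema, and the sketch assumes a non-empty batch of runs.

from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

# Hypothetical per-run record; real deployments would log far richer traces.
@dataclass
class AgentRun:
    succeeded: bool                   # did the agent complete the intended task?
    duration_s: float                 # wall-clock completion time in seconds
    tool_calls: int                   # total tool invocations in the run
    tool_call_failures: int           # tool invocations that errored
    error_type: Optional[str] = None  # e.g. "timeout" or "wrong_output"; None if successful

def summarise(runs: List[AgentRun]) -> dict:
    """Aggregate task-level and system-level metrics from a batch of runs."""
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "task_success_rate": sum(r.succeeded for r in runs) / len(runs),
        "mean_completion_time_s": mean(r.duration_s for r in runs),
        "tool_call_success_rate": 1 - sum(r.tool_call_failures for r in runs) / max(total_calls, 1),
        "error_breakdown": {
            e: sum(1 for r in runs if r.error_type == e)
            for e in {r.error_type for r in runs if r.error_type}
        },
    }

Metrics such as throughput, robustness against edge cases and user trust indicators would require additional instrumentation, for example load testing, adversarial test suites or user surveys, rather than per-run logs alone.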
Providers benchmark systems to assess technical
maturity, while procurers and deployers are
responsible for ensuring that agents operate
safely and compliantly within specific industry,
organizational and operational contexts. Therefore,
deployment environments provide the most
accurate ground truth, but deployers often lack the
resources to design comprehensive benchmarks. In
many cases, this makes collaboration with providers
essential to establishing meaningful metrics.
An effective provider-focused evaluation should begin
with a technical screening of baseline capabilities,