AI Agents in Action: Foundations for Evaluation and Governance 2025
such as reasoning, planning and tool use. Once
validated in sandbox environments that mirror
real-world tasks, agents may progress to controlled
deployment, where they are integrated into workflows
under close monitoring, with safeguards in place to
confirm that their actions align with human judgement
or established decision-making processes. Full
deployment should follow only once
reliability has been demonstrated, with fallback
mechanisms and defined human oversight. Audit
logs are central throughout this life cycle, providing
structured records of agent activity and the rationale
behind it. Audit logs also support governance
by enabling oversight and accountability, aiding
debugging by tracing errors and points of failure,
and helping inform evaluation.
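As a sketch of the kind of structured record an audit log might hold, the entry below captures an agent action alongside its stated rationale. The field names and helper are illustrative assumptions, not a standard schema:

```python
import json
import time
import uuid

def audit_record(agent_id, action, rationale, inputs, outcome):
    """Build one structured audit-log entry for an agent action.
    Field names here are illustrative, not a standard schema."""
    return {
        "entry_id": str(uuid.uuid4()),   # unique id for traceability
        "timestamp": time.time(),        # when the action occurred
        "agent_id": agent_id,
        "action": action,                # e.g. a tool call or decision
        "rationale": rationale,          # the agent's stated reasoning
        "inputs": inputs,                # what the agent acted on
        "outcome": outcome,              # what resulted
    }

# Hypothetical entry for a coding co-pilot running a test suite
record = audit_record(
    "copilot-1",
    "tool_call:run_tests",
    "verify patch before proposing it",
    {"patch_id": "p-42"},
    "passed",
)
print(json.dumps(record, indent=2))
```

Records in this shape support the three uses named above: oversight (who did what, when), debugging (tracing the inputs and rationale behind a failure) and evaluation (aggregating outcomes over time).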
The following principles support this life cycle of
agent evaluation:
– Contextualization: Reflect the tools,
workflows and edge cases the agent will
encounter in practice.
– Multidimensional assessment: Define success
across various factors, including accuracy,
robustness, latency tolerance, compliance
and user trust.
– Temporal and behavioural monitoring: Track
performance over time to detect regressions,
shifts in behaviour, or failures to adapt to
evolving inputs.
Emerging evaluation tools are increasingly applied
in enterprise settings to support the continuous
assessment of agentic systems, helping to track
reasoning, compare outcomes to expectations
and detect anomalies that are overlooked by
traditional testing. Major cloud providers have also
started embedding such frameworks into their AI
platforms, highlighting the importance of deployer-
side evaluation for adoption.
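One minimal form of such temporal monitoring can be sketched as a rolling comparison of an agent's recent success rate against an established baseline. The window size, baseline and tolerance below are illustrative assumptions, not recommended values:

```python
from collections import deque

class RegressionMonitor:
    """Flag a regression when recent success rate drops below baseline."""

    def __init__(self, baseline_rate, window=50, tolerance=0.10):
        self.baseline = baseline_rate
        self.window = deque(maxlen=window)  # most recent task outcomes
        self.tolerance = tolerance

    def record(self, success: bool) -> bool:
        """Record one task outcome; return True if a regression is flagged."""
        self.window.append(success)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data to judge yet
        rate = sum(self.window) / len(self.window)
        return rate < self.baseline - self.tolerance

# Hypothetical outcome stream whose behaviour shifts midway
monitor = RegressionMonitor(baseline_rate=0.90, window=20)
outcomes = [True] * 15 + [False] * 10
flags = [monitor.record(ok) for ok in outcomes]
```

In practice, deployed monitors track many more signals than a single success rate (latency, tool errors, behavioural drift), but the pattern of comparing a recent window against a baseline is the same.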
By approaching evaluation as a structured, context-
aware and continuous process, organizations can
more effectively determine whether an agent is fit
for deployment.
To illustrate how these principles apply in practice,
the following illustration examines a coding co-pilot
agent, applying the evaluation dimensions from a
deployer's perspective and showing how task-level
and system-level metrics can be used to assess
reliability, safety and overall performance in an
operational setting.
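As a sketch, task-level metrics such as task success rate, completion time and tool-call success can be computed from logged evaluation episodes. The data and field names below are hypothetical:

```python
# Hypothetical evaluation episodes for a coding co-pilot agent;
# field names and values are illustrative.
episodes = [
    {"task_ok": True,  "seconds": 41.0,  "tool_calls": 5,  "tool_errors": 0},
    {"task_ok": True,  "seconds": 63.5,  "tool_calls": 8,  "tool_errors": 1},
    {"task_ok": False, "seconds": 120.0, "tool_calls": 12, "tool_errors": 4},
]

# Fraction of tasks completed successfully
task_success_rate = sum(e["task_ok"] for e in episodes) / len(episodes)

# Average wall-clock time per task
mean_completion_time = sum(e["seconds"] for e in episodes) / len(episodes)

# Fraction of tool invocations that did not error
tool_call_success = 1 - (sum(e["tool_errors"] for e in episodes)
                         / sum(e["tool_calls"] for e in episodes))

print(f"task success rate:  {task_success_rate:.2f}")
print(f"mean completion:    {mean_completion_time:.1f}s")
print(f"tool call success:  {tool_call_success:.2f}")
```

System-level assessment would layer further signals on top of these, such as edge-case robustness and user-trust indicators, as reflected in the figure below.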
Effective evaluation depends on close collaboration
between providers and adopters, where transparent
documentation, model specifications and performance
reports from providers enable deployers to validate
reliability, identify risks and apply safeguards
throughout the system life cycle.
The results form an integrated performance
profile that informs subsequent risk assessment
and governance.
FIGURE 8: Foundations for AI agent evaluation and governance – evaluation criteria

[Figure: four panels.
Classification dimensions: Function; Predictability; Use case; Autonomy; Environment; Authority; Role.
Evaluation criteria: Capabilities; Task success rate; Task completion time; Tool call success; Edge case robustness; Error types; Trust indicators; and more.
Risk assessment life cycle: Define the use context; Identify risks; Analyse risks; Evaluate risks; Manage risks.
Progressive governance practices: Access control; Human oversight; Monitoring & logging; Traceability & identity; Trustworthiness & explainability; Testing & validation; Legal & compliance; Manual redundancy; Long-term management; and more.]