AI Agents in Action: Foundations for Evaluation and Governance (2025)
FIGURE 10: Foundations for AI agent evaluation and governance: progressive governance practices. [Diagram showing four elements of the foundation — define the use; classification dimensions (function, use case, autonomy, environment, authority, role, predictability); evaluation criteria (capabilities and trust indicators such as tool call success, edge case robustness, task success rate, task completion time, error types, and more); and the risk assessment life cycle (define context, identify risks, analyse risks, evaluate risks, manage risks) — underpinned by progressive governance practices: access control, trustworthiness & explainability, traceability & identity, monitoring & logging, legal & compliance, manual redundancy, long-term management, human oversight, testing & validation, and more.]

Progressive governance practices
For all agents, regardless of their level of autonomy,
authority or the complexity of their operational
context, specific governance mechanisms should
serve as a baseline for adoption. At a minimum,
every agent should operate under strict access
control based on the principle of least privilege,
with clear task boundaries that prevent unnecessary
system or data access. Basic legal and compliance
checks, such as data protection impact
assessments and privacy compliance reviews,
are necessary to ensure alignment with regulatory
obligations. In addition, technical controls such as
input and output filters can help constrain agent
behaviour by screening potentially harmful, irrelevant
or non-compliant interactions before they propagate
through the system.
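The baseline controls described above — least-privilege access and input/output filtering — can be illustrated with a minimal sketch. All names here (the tool allowlist, the blocked patterns, the function names) are hypothetical illustrations, not part of the report:

```python
import re

# Least privilege: each agent is granted only the tools its task requires.
# Agent and tool names are illustrative.
ALLOWED_TOOLS = {
    "invoice-agent": {"read_invoice", "send_summary"},
    "support-agent": {"search_kb", "draft_reply"},
}

# Screening patterns for potentially harmful or non-compliant content,
# applied to both inputs and outputs before they propagate.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),             # SSN-like strings
    re.compile(r"(?i)ignore previous instructions"),  # prompt injection
]


def authorize_tool_call(agent_id: str, tool: str) -> bool:
    """Deny any tool not explicitly granted to this agent."""
    return tool in ALLOWED_TOOLS.get(agent_id, set())


def filter_text(text: str) -> tuple[bool, str]:
    """Return (allowed, reason); screen text at the agent boundary."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return False, f"blocked by pattern: {pattern.pattern}"
    return True, "ok"
```

In a real deployment the allowlist and filter rules would come from policy configuration rather than hard-coded constants, but the deny-by-default shape of `authorize_tool_call` is the core of the least-privilege principle.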
Prior to deployment, agents should undergo
sandbox or controlled pilot testing using non-
production data to validate expected behaviour
and mitigate unintended effects. All actions and
planning should be recorded in an audit log for
traceability, supported by monitoring tools or alerts
tailored to the agent’s overall profile. This enables early detection of anomalies while balancing the privacy and surveillance risks associated with monitoring at scale. Human
oversight, through policy reviews, audit log analysis
and supervisory triggers, helps ensure alignment
with organizational priorities. Unique identifiers and
output tagging support attribution, performance
tracking and post-incident analysis. In practice,
the depth of safeguards should scale with the
agent’s autonomy, authority, complexity of context
and overall impact. Higher-risk systems require
proportionally greater investment in monitoring and
oversight, with a deliberate balance between human
review and automated, continuous monitoring.
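Audit logging, unique identifiers and output tagging can also be sketched minimally. The class and field names below are hypothetical, assuming a simple in-memory store in place of a real append-only log:

```python
import time
import uuid

class AuditLog:
    """Records agent actions with unique entry identifiers for
    traceability, attribution and post-incident analysis."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def record(self, agent_id: str, action: str, detail: str) -> dict:
        """Append one traceable entry and return it."""
        entry = {
            "entry_id": str(uuid.uuid4()),
            "agent_id": agent_id,
            "action": action,
            "detail": detail,
            "timestamp": time.time(),
        }
        self.entries.append(entry)
        return entry

    def tag_output(self, agent_id: str, output: str) -> str:
        """Attach a provenance tag so the output can be traced back
        to the agent and log entry that produced it."""
        entry = self.record(agent_id, "emit_output", output)
        return f"[agent:{agent_id} entry:{entry['entry_id']}] {output}"


log = AuditLog()
tagged = log.tag_output("support-agent", "Your ticket has been escalated.")
```

A production system would write entries to durable, tamper-evident storage and feed them to the monitoring and alerting tools mentioned above; the point here is only that every output carries an identifier linking it back to a logged action.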
By embedding these measures into the life cycle
of all agents, organizations establish a governance
baseline that can scale proportionally with complexity
and risk. This foundation helps address immediate
operational safety and compliance needs, creating
the structures and practices upon which more
advanced, context-specific governance mechanisms
can be layered as agents become more autonomous,
integrated and capable.