AI Agents in Action: Foundations for Evaluation and Governance (2025)


2 Foundations for AI agent evaluation and governance

A structured foundation for evaluating and governing AI agents enables consistent assessment and oversight across contexts.

As AI agents mature and adoption increases, a functional understanding of their roles and properties is taking shape. Rather than classifying agents solely by modality (e.g. text, speech, vision) or domain (e.g. customer service, decision support, workflow orchestration), it is more effective to evaluate them according to their intended purpose, core properties and operating context. This approach creates a clearer foundation for assessing impacts and designing safeguards proportionate to an agent’s role.

Systematic classification is important because it provides a common basis for comparing agents, anticipating risks and linking evaluation and governance decisions to the realities of how an agent operates. Without it, oversight risks becoming inconsistent, reactive or disconnected from an agent’s actual capabilities and environment.

To establish this foundation, this report introduces four foundational pillars which, in combination, provide a structured approach to assessment and adoption:

– Classification: establish the agent’s characteristics and operational context to inform downstream assessment.
– Evaluation: generate evidence of performance and limitations in representative settings.
– Risk assessment: analyse potential harm using classification and evaluation as inputs.
– Governance: translate classification, evaluation and risk assessment results into safeguards and accountability proportionate to the agent’s profile.

These foundations apply to diverse AI agents, encompassing both virtual and embodied systems in different operational contexts. They provide a consistent basis for assessing performance, identifying risks and establishing governance mechanisms that scale with an agent’s autonomy, authority and function.
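The four pillars above can be read as a simple assessment pipeline, where each stage consumes the outputs of the previous ones. A minimal sketch follows; all class, field and threshold names (e.g. `AgentProfile`, `autonomy`, the 0.8 score cut-off) are illustrative assumptions, not definitions from this report.

```python
from dataclasses import dataclass

@dataclass
class AgentProfile:
    """Classification: the agent's characteristics and operating context."""
    purpose: str            # e.g. "decision support"
    autonomy: str           # e.g. "supervised", "semi-autonomous"
    operating_context: str  # e.g. "claims triage"

@dataclass
class EvaluationResult:
    """Evaluation: evidence of performance and limitations."""
    metric: str   # e.g. "accuracy"
    score: float  # measured in a representative setting
    setting: str  # the setting in which the evidence was gathered

def assess_risk(profile, evidence):
    """Risk assessment: derive potential harms from classification and evaluation."""
    risks = []
    if profile.autonomy != "supervised":
        risks.append(f"unsupervised action in {profile.operating_context}")
    # Illustrative threshold: flag any metric below 0.8 as a limitation.
    risks += [f"low {e.metric} in {e.setting}" for e in evidence if e.score < 0.8]
    return risks

def governance_safeguards(risks):
    """Governance: translate identified risks into proportionate safeguards."""
    return [f"safeguard and accountability owner for: {r}" for r in risks]

profile = AgentProfile("decision support", "semi-autonomous", "claims triage")
evidence = [EvaluationResult("accuracy", 0.72, "edge-case claims")]
safeguards = governance_safeguards(assess_risk(profile, evidence))
```

The point of the sketch is the dependency order: governance decisions trace back through risk assessment to the classification and evaluation evidence that justify them, mirroring the "continuous and parallel progression" described in the text.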
To address classification, evaluation, risk assessment and governance, it is useful to distinguish between two main stakeholder perspectives:13

– Provider: organizations or individuals that supply AI systems, platforms or tools. Their responsibilities include ensuring that products are developed and maintained in accordance with responsible and ethical guidelines, and that the necessary documentation and support are provided.
– Adopter: individuals within an organization who use AI systems, with responsibilities that include procurement and deployment. Procurement involves acquiring AI solutions for organizational use, conducting due diligence and ensuring that all AI agent solutions comply with organizational policies and regulatory requirements. Deployment involves implementing AI systems in accordance with documented requirements and plans, while ensuring that the risks and impacts of the AI agent are properly assessed and managed.

The adopter depends on the provider for transparent documentation, model and system specifications, and sufficient performance and risk information to support responsible deployment and oversight throughout the system life cycle.

The four pillars form a continuous and parallel progression in which classification provides structure, evaluation establishes evidence, risk assessment identifies and mitigates potential harms, and governance translates those insights into safeguards and accountability.