AI Agents in Action: Foundations for Evaluation and Governance (2025)
FIGURE 10: Foundations for AI agent evaluation and governance: progressive governance practices. [Diagram showing four elements of the foundation — define the use; classification dimensions (function, use case, autonomy, environment, authority, role, predictability); evaluation criteria (capabilities and trust indicators such as tool call success, edge case robustness, task success rate, task completion time, error types, and more); and the risk assessment life cycle (define context, identify risks, analyse risks, evaluate risks, manage risks) — underpinned by progressive governance practices: access control, trustworthiness & explainability, traceability & identity, monitoring & logging, legal & compliance, manual redundancy, long-term management, human oversight, testing & validation, and more.]

Progressive governance practices
For all agents, regardless of their level of autonomy,
authority or the complexity of their operational
context, specific governance mechanisms should
serve as a baseline for adoption. At a minimum,
every agent should operate under strict access
control based on the principle of least privilege,
with clear task boundaries that prevent unnecessary
system or data access. Basic legal and compliance
checks, such as data protection impact
assessments and privacy compliance reviews,
are necessary to ensure alignment with regulatory
obligations. In addition, technical controls such as
input and output filters can help constrain agent
behaviour by screening potentially harmful, irrelevant
or non-compliant interactions before they propagate
through the system.
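The baseline controls described above — least-privilege access and input/output filtering — can be illustrated with a minimal sketch. All names here (the tool allowlist, the blocked patterns, the function names) are hypothetical illustrations, not part of the report:

```python
import re

# Least privilege: each agent is granted only the tools its task requires.
# Agent and tool names are illustrative.
ALLOWED_TOOLS = {
    "invoice-agent": {"read_invoice", "send_summary"},
    "support-agent": {"search_kb", "draft_reply"},
}

# Screening patterns for potentially harmful or non-compliant content,
# applied to both inputs and outputs before they propagate.
BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),             # SSN-like strings
    re.compile(r"(?i)ignore previous instructions"),  # prompt injection
]


def authorize_tool_call(agent_id: str, tool: str) -> bool:
    """Deny any tool not explicitly granted to this agent."""
    return tool in ALLOWED_TOOLS.get(agent_id, set())


def filter_text(text: str) -> tuple[bool, str]:
    """Return (allowed, reason); screen text at the agent boundary."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return False, f"blocked by pattern: {pattern.pattern}"
    return True, "ok"
```

In a real deployment the allowlist and filter rules would come from policy configuration rather than hard-coded constants, but the deny-by-default shape of `authorize_tool_call` is the core of the least-privilege principle.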
Prior to deployment, agents should undergo
sandbox or controlled pilot testing using non-
production data to validate expected behaviour
and mitigate unintended effects. All actions and
planning should be recorded in an audit log for
traceability, supported by monitoring tools or alerts
tailored to the agent’s overall profile. This enables early detection of anomalies while balancing the privacy and surveillance risks associated with monitoring at scale. Human
oversight, through policy reviews, audit log analysis
and supervisory triggers, helps ensure alignment
with organizational priorities. Unique identifiers and
output tagging support attribution, performance
tracking and post-incident analysis. In practice,
the depth of safeguards should scale with the
agent’s autonomy, authority, complexity of context
and overall impact. Higher-risk systems require
proportionally greater investment in monitoring and
oversight, with a deliberate balance between human
review and automated, continuous monitoring.
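Audit logging, unique identifiers and output tagging can also be sketched minimally. The class and field names below are hypothetical, assuming a simple in-memory store in place of a real append-only log:

```python
import time
import uuid

class AuditLog:
    """Records agent actions with unique entry identifiers for
    traceability, attribution and post-incident analysis."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def record(self, agent_id: str, action: str, detail: str) -> dict:
        """Append one traceable entry and return it."""
        entry = {
            "entry_id": str(uuid.uuid4()),
            "agent_id": agent_id,
            "action": action,
            "detail": detail,
            "timestamp": time.time(),
        }
        self.entries.append(entry)
        return entry

    def tag_output(self, agent_id: str, output: str) -> str:
        """Attach a provenance tag so the output can be traced back
        to the agent and log entry that produced it."""
        entry = self.record(agent_id, "emit_output", output)
        return f"[agent:{agent_id} entry:{entry['entry_id']}] {output}"


log = AuditLog()
tagged = log.tag_output("support-agent", "Your ticket has been escalated.")
```

A production system would write entries to durable, tamper-evident storage and feed them to the monitoring and alerting tools mentioned above; the point here is only that every output carries an identifier linking it back to a logged action.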
By embedding these measures into the life cycle
of all agents, organizations establish a governance
baseline that can scale proportionally with complexity
and risk. This foundation helps address immediate
operational safety and compliance needs, creating
the structures and practices upon which more
advanced, context-specific governance mechanisms
can be layered as agents become more autonomous,
integrated and capable.