AI Agents in Action: Foundations for Evaluation and Governance (2025)
CASE STUDY 2
Coding co-pilot – evaluation
Agent characteristics
1. Function: Assists human developers with code generation and debugging
2. Role: scale from Specialist to Generalist
3. Predictability: scale from Deterministic to Non-deterministic
4. Autonomy: scale from Low to High
5. Authority: scale from Low to High

Operational context
6. Use case
7. Environment: scale from Simple to Complex

Coding co-pilot
A coding co-pilot operates in the software development
domain, assisting programmers within their coding
environment by generating, completing and debugging
code to improve productivity and reduce errors.
Coding co-pilot – evaluation
Evaluation starts with controlled tests in development
environments to verify productivity gains while ensuring
safety, reliability and compliance. Evaluation follows several
key steps, including:
– Contextualization: Testing across coding tasks such as code generation, debugging and documentation to reflect real workflows
– Performance: Measuring task success rate, completion time and error frequency, along with system metrics like tool-call success
– Robustness: Exposing the agent to ambiguous or conflicting code to assess recovery, error handling and adaptability
– Human trust: Gathering user feedback on reliability and usefulness
– Monitoring: Using continuous logging to detect performance drift, anomalous tool use or regressions after deployment
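The performance and monitoring steps above can be sketched in code. The following is a minimal illustration, not taken from the report: the record fields (`succeeded`, `seconds`, `errors`, `tool_calls`, `tool_calls_ok`) and the `drift_alert` threshold are hypothetical choices for how task success rate, completion time, error frequency and tool-call success might be aggregated from evaluation logs and compared against a pre-deployment baseline.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical record of one evaluated coding task; the field names
# are illustrative, not defined by the report.
@dataclass
class TaskResult:
    succeeded: bool      # was the task completed correctly?
    seconds: float       # wall-clock completion time
    errors: int          # errors surfaced during the attempt
    tool_calls: int      # total tool invocations
    tool_calls_ok: int   # tool invocations that returned without error

def performance_metrics(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate the report's performance metrics over a test batch."""
    total_calls = sum(r.tool_calls for r in results)
    return {
        "task_success_rate": mean(r.succeeded for r in results),
        "mean_completion_time_s": mean(r.seconds for r in results),
        "error_frequency": mean(r.errors for r in results),
        "tool_call_success_rate": (
            sum(r.tool_calls_ok for r in results) / total_calls
            if total_calls else 1.0
        ),
    }

def drift_alert(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.05) -> list[str]:
    """Flag success-rate metrics that dropped more than `tolerance`
    below the baseline -- a simple post-deployment regression check."""
    return [
        key for key in ("task_success_rate", "tool_call_success_rate")
        if baseline[key] - current[key] > tolerance
    ]
```

In practice the same aggregation would run continuously over production logs, with `drift_alert` comparing each monitoring window against the controlled-test baseline to surface regressions.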