AI Agents in Action: Foundations for Evaluation and Governance, 2025

Page 20 of 34 · WEF_AI_Agents_in_Action_Foundations_for_Evaluation_and_Governance_2025.pdf

CASE STUDY 2: Coding co-pilot

A coding co-pilot operates in the software development domain, assisting programmers within their coding environment by generating, completing and debugging code to improve productivity and reduce errors.

Agent characteristics
1. Function: Assists human developers with code generation and debugging
2. Role: scale from Specialist to Generalist
3. Predictability: scale from Deterministic to Non-deterministic
4. Autonomy: scale from Low to High
5. Authority: scale from Low to High

Operational context
6. Use case: Coding co-pilot
7. Environment: scale from Simple to Complex

Coding co-pilot – evaluation

Evaluation starts with controlled tests in development environments to verify productivity gains while ensuring safety, reliability and compliance. Evaluation follows several key steps, including:
– Contextualization: Testing across coding tasks such as code generation, debugging and documentation to reflect real workflows
– Performance: Measuring task success rate, completion time and error frequency, along with system metrics like tool-call success
– Robustness: Exposing the agent to ambiguous or conflicting code to assess recovery, error handling and adaptability
– Human trust: Gathering user feedback on reliability and usefulness
– Monitoring: Using continuous logging to detect performance drift, anomalous tool use or regressions after deployment
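The performance and monitoring steps above can be sketched as a small metrics pipeline. This is a minimal illustration, not the report's method: the record fields (`success`, `seconds`, `errors`, `tool_calls_ok`, `tool_calls_total`), the drift rule and the tolerance value are all assumptions introduced here.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RunRecord:
    # One logged co-pilot task run; all field names are illustrative assumptions.
    success: bool          # did the agent complete the coding task?
    seconds: float         # task completion time
    errors: int            # errors observed during the run
    tool_calls_ok: int     # successful tool invocations
    tool_calls_total: int  # all tool invocations

def evaluate(runs):
    """Aggregate the performance metrics named in the case study:
    task success rate, completion time, error frequency, tool-call success."""
    total_calls = sum(r.tool_calls_total for r in runs)
    return {
        "task_success_rate": mean(r.success for r in runs),
        "avg_completion_seconds": mean(r.seconds for r in runs),
        "error_frequency": mean(r.errors for r in runs),
        "tool_call_success": (
            sum(r.tool_calls_ok for r in runs) / total_calls if total_calls else 1.0
        ),
    }

def drift_alert(baseline, current, tolerance=0.05):
    """Continuous-monitoring check (assumed rule): flag any metric whose
    current value has degraded beyond `tolerance` versus the baseline.
    Rates degrade downward; times and error counts degrade upward."""
    worse_if_lower = {"task_success_rate", "tool_call_success"}
    alerts = []
    for name, base in baseline.items():
        cur = current[name]
        if name in worse_if_lower:
            degraded = cur < base - tolerance          # absolute drop in a rate
        else:
            degraded = cur > base * (1 + tolerance)    # relative rise in time/errors
        if degraded:
            alerts.append(name)
    return alerts
```

In practice the baseline would come from the controlled pre-deployment tests and `drift_alert` would run periodically over fresh production logs, feeding the monitoring step described above.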