Agent Evaluations

Evaluate Claude agents credibly. Learn how to use task-based eval suites (SWE-bench, METR, AgentBench, Anthropic-published agent evals); how to choose success metrics (pass rate, partial credit, cost per success); how to build regression suites for production agents; how offline and online evals differ; how evals connect to production monitoring (success rate per task class, drift detection); and how weak eval design fails.
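As a rough illustration of the success metrics named above, the sketch below computes pass rate, mean partial credit, cost per success, and pass rate per task class from a list of graded attempts. The names EpisodeResult, summarize, and pass_rate_by_class are hypothetical, and so is the convention that cost per success means total spend across all attempts (failed ones included) divided by the number of passes; other definitions are reasonable.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One graded agent attempt (hypothetical record shape)."""
    task_class: str   # e.g. "bugfix"; the labels here are illustrative
    score: float      # partial credit in [0, 1]; 1.0 means fully solved
    cost_usd: float   # total spend for this attempt


def summarize(results, pass_threshold=1.0):
    """Aggregate pass rate, mean partial credit, and cost per success."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    successes = sum(1 for r in results if r.score >= pass_threshold)
    total_cost = sum(r.cost_usd for r in results)
    return {
        "pass_rate": successes / n,
        "mean_partial_credit": sum(r.score for r in results) / n,
        # Assumption: failed attempts still count toward the cost.
        "cost_per_success": total_cost / successes if successes else float("inf"),
    }


def pass_rate_by_class(results, pass_threshold=1.0):
    """Pass rate per task class, the kind of breakdown a monitor might track."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.task_class] += 1
        if r.score >= pass_threshold:
            passed[r.task_class] += 1
    return {cls: passed[cls] / total[cls] for cls in total}
```

Calling summarize on the results of one eval run and comparing the output against a stored baseline is one simple way to turn these numbers into a regression check; the threshold for "pass" is a parameter because some suites treat partial credit as a pass and others do not.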