How to Build Evals for AI Agents

If you’re here from the talk, welcome! Add me on LinkedIn.

The slides are available under a CC BY-SA 4.0 license. Feel free to use them for your own purposes.

If you remember nothing else from the talk, remember these three things:

Your test set is your spec. If a behavior isn’t in the eval, it isn’t a requirement. Make the set diverse on purpose — happy path, boring edge cases, known failure modes, adversarial inputs, and “things the agent shouldn’t do”.
Evals are never done. Pull real production traces into your test set every week, or it will drift away from your users.
Evals are not a substitute for good infrastructure. A broken retry loop, a missing timeout, or a tool call without an idempotency key will hurt you more than a 2% accuracy regression. Evals tell you what is wrong; infrastructure decides what happens when something is wrong.
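
To make the first lesson concrete, here is a minimal Python sketch of a test set that doubles as a spec. It is an illustration under stated assumptions, not the code from the talk: the category names, the `EvalCase` structure, and the `toy_agent` stand-in are all hypothetical. It shows two things: checking that every category you care about actually has cases, and reporting the pass rate per category rather than as a single number.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

# Hypothetical category names, mirroring the kinds of cases listed above.
REQUIRED_CATEGORIES = {
    "happy_path",
    "edge_case",
    "known_failure",
    "adversarial",
    "should_not_do",
}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str
    prompt: str
    check: Callable[[str], bool]  # True if the agent's output is acceptable


def missing_categories(cases: list[EvalCase]) -> set[str]:
    """Return the required categories that have no case in the test set."""
    return REQUIRED_CATEGORIES - {case.category for case in cases}


def pass_rate_by_category(
    cases: list[EvalCase], agent: Callable[[str], str]
) -> dict[str, float]:
    """Run every case through the agent and report the pass rate per category."""
    passed: Counter[str] = Counter()
    total: Counter[str] = Counter()
    for case in cases:
        total[case.category] += 1
        if case.check(agent(case.prompt)):
            passed[case.category] += 1
    return {category: passed[category] / total[category] for category in total}


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent: refuses anything that mentions 'delete'."""
    if "delete" in prompt.lower():
        return "REFUSED"
    return "TRIAGED: " + prompt


# A deliberately incomplete test set: two required categories are missing.
CASES = [
    EvalCase("hp-1", "happy_path", "Login page returns 500",
             lambda out: out.startswith("TRIAGED")),
    EvalCase("edge-1", "edge_case", "",
             lambda out: out.startswith("TRIAGED")),
    EvalCase("no-1", "should_not_do", "Delete the production database",
             lambda out: out == "REFUSED"),
]

if __name__ == "__main__":
    print("missing categories:", sorted(missing_categories(CASES)))
    print("pass rates:", pass_rate_by_category(CASES, toy_agent))
```

The coverage check is the point here: a category with zero cases is a requirement nobody wrote down. New production traces would be appended to `CASES` under whichever category they belong to, so the same check keeps working as the set grows.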

If you want to try out your newfound knowledge on a very simple agentic task, take a look at my AI Triage Bot via KittenClaw. Your coding agent can run the bot interactively and develop both the solution and the evals as it goes. It’s a great sandbox to play in.

Reading List

The eval space is moving fast, but a few pieces have aged well. Start at the top, work down — these are written for builders, not researchers, and assume no ML background.

LLM Evaluation: a Beginner’s Guide by Evidently AI — if “LLM-as-a-judge” isn’t yet a phrase you’d use in a sentence, read this one first. It assumes no ML background and lays out the vocabulary the rest of this list takes for granted.
An LLM-as-Judge Won’t Save The Product — Fixing Your Process Will by Eugene Yan — the closest single piece of writing to the spirit of this talk: the process around your evals matters more than the cleverness of any individual grader. Pairs directly with Lesson 3.
Your AI Product Needs Evals by Hamel Husain — the post that crystallised the field; case-study driven and refreshingly opinionated.
A Field Guide to Rapidly Improving AI Products by Hamel Husain — tactical follow-up, with the look-at-your-data flywheel I gestured at in Lesson 2. See also his article Using LLM-as-a-Judge For Evaluation, which explains the whole process in exhaustive detail.

For more detailed reading:

Task-Specific LLM Evals that Do & Don’t Work by Eugene Yan — secondary reading once the process post lands; a tour of evaluation dimensions with real examples of where each grader breaks.
Simon Willison’s evals tag — a running commentary on what works and what doesn’t in practice. Great for absorbing taste in small doses.
Scaling Up “Vibe Checks” for LLMs by Shreya Shankar (Stanford MLSys #97) — the same lessons in talk form, if you’d rather watch than read.

As always, turn your critical thinking skills on and carefully engage with the claims each source makes. The field is young enough that any “best practice” you read today may be embarrassing in eighteen months. The way you find out is by running the evals yourself.

If you liked my talk, feel free to check out other talks I’ve given. A natural next stop is Infrastructure-based Safety for Your ‘Claw, which picks up where Lesson 3 leaves off.
