July 2, 2026 · lectures

How to Build Evals for AI Agents

June 2026

If you’re here from the talk, welcome! Add me on LinkedIn.

The slides are available under the CC BY-SA 4.0 license. Feel free to use them for your own purposes.

If you remember nothing else from the talk, remember these three things:

  1. Your test set is your spec. If a behavior isn’t in the eval, it isn’t a requirement. Make the set diverse on purpose — happy path, boring edge cases, known failure modes, adversarial inputs, and “things the agent shouldn’t do”.
  2. Evals are never done. Pull real production traces into your test set every week, or it will drift away from your users.
  3. Evals are not a substitute for good infrastructure. A broken retry loop, a missing timeout, or a tool call without an idempotency key will hurt you more than a 2% accuracy regression. Evals tell you what is wrong; infrastructure decides what happens when something is wrong.
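To make the third point concrete, here is a minimal Python sketch of a tool call wrapped in a bounded retry loop, with a per-attempt timeout and a single idempotency key reused across retries. Everything named here is hypothetical: `call_tool_safely`, its parameters, and the `tool(payload, idempotency_key=..., timeout_s=...)` signature are illustrations, not an API from the talk or from any library. The sketch also assumes the tool enforces the timeout itself and raises `TimeoutError`, and that the downstream service deduplicates requests by idempotency key.

```python
import time
import uuid
from typing import Any, Callable, Dict

# Hypothetical tool signature (an assumption for this sketch):
#   tool(payload, idempotency_key=..., timeout_s=...) -> result
ToolFn = Callable[..., Any]


def call_tool_safely(
    tool: ToolFn,
    payload: Dict[str, Any],
    max_attempts: int = 3,
    timeout_s: float = 5.0,
    backoff_s: float = 0.5,
) -> Any:
    """Call `tool` with bounded retries, a timeout, and one idempotency key."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    # Generate the key once, outside the loop, so every retry carries the
    # same key and the downstream service can drop duplicate side effects.
    idempotency_key = str(uuid.uuid4())

    for attempt in range(1, max_attempts + 1):
        try:
            # The tool is assumed to honor timeout_s and raise TimeoutError.
            return tool(
                payload,
                idempotency_key=idempotency_key,
                timeout_s=timeout_s,
            )
        except TimeoutError:
            if attempt == max_attempts:
                # Bounded: surface the failure instead of looping forever.
                raise
            # Exponential backoff between attempts: 1x, 2x, 4x, ...
            time.sleep(backoff_s * 2 ** (attempt - 1))
```

The key design choice is where the idempotency key is created. If it were generated inside the loop, each retry would look like a brand-new request, and a call that timed out after the side effect had already happened (a payment, an email) could run twice.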

Reading List

The eval space is moving fast, but a few pieces have aged well. Start at the top, work down — these are written for builders, not researchers, and assume no ML background.

For more detailed reading:

As always, turn your critical thinking skills on and carefully engage with the claims each source makes. The field is young enough that any “best practice” you read today may be embarrassing in eighteen months. The way you find out is by running the evals yourself.


If you liked my talk, feel free to check out other talks I’ve given. A natural next stop is Infrastructure-based Safety for Your ‘Claw, which picks up where Lesson 3 (evals are not a substitute for good infrastructure) leaves off.