Skip to main content

Command Palette

Search for a command to run...

What model checking taught me about evaluating AI coding agents

Why a unit test is the wrong tool for a nondeterministic agent.

Updated
3 min read
A
I'm Alex Voloshin. I help engineering teams operationalize AI coding agents. This blog is the long-form companion to ai-assets — a public repo of 26 agents, 53 skills, 45 eval rubrics with 270 calibration samples, 18 hooks, and 32 user-invocable workflows that work across Claude Code, Codex, and Windsurf. What you'll find here: - Tri-vendor parity write-ups. Engineering teams adopting AI coding agents shouldn't have to lock into one runtime to get production-grade tooling. Posts here track concrete differences between Claude Code, Codex, and Windsurf, and the patterns that survive across all three. - Eval methodology. Most public discourse on AI coding-agent evaluation is hand-wavy. I write about specific rubrics, calibration sample design, and the failure modes calibration data catches in production. - Formal-methods × agentic dev. LTL, model checking, and formal requirements review apply to agent specifications more directly than people think. Posts at the intersection of academic CS and practical agent orchestration. - Production war stories. Concrete failure modes, post-mortem-style write-ups, and the patterns that grew out of fixing them. What you won't find here: - Generic "AI is changing everything" takes - Tutorials for first-time agent users (better resources exist for that) - Vendor-locked content disguised as agnostic advice 15+ years building production software. Currently formalizing the CS foundation through an MS in Computer Science. Subscribe below if you want one well-edited post per ~2 weeks. No spam, no growth-hack frequency.

A unit test asks one question: did this run pass? That works when code is deterministic. An LLM coding agent is not. The same prompt produces different code each time, so one passing run proves almost nothing.

Model checking, a technique from formal verification, offers a better lens, and it changed how I write eval rubrics.

What does model checking actually do?

It verifies a system by checking properties across all of its possible behaviors, not by running one example. The hard part is never the checking. It is specifying the property: stating precisely what "correct" means before you look at any output.

Two kinds of property matter. A safety property says something bad never happens, and a violation is one concrete trace you can point at. A liveness property says something good eventually happens. Name the property first, and the rest follows.

Why does this fit AI coding agents?

A nondeterministic agent behaves like a system with a huge space of possible runs, not a function with one output. A pass/fail test samples a single run and tells you nothing about the others.

So here is the reframe: an eval rubric is a property. Writing a good rubric is specifying a property precisely, the same hard part as in model checking. Safety properties become "never" rubrics: never commit a secret, never edit a file outside scope. Liveness properties become "eventually" rubrics: the spec eventually gets fully handled.

A worked example

Take the property "the agent never modifies a file outside the task's scope." That is a safety property: a bad thing that must never happen, and a violation is a single trace.

As an eval rubric it becomes scope-adherence: score 0 if any out-of-scope file was touched, full score if not. The calibration "bad" sample is the counterexample, a recorded run where the agent "helpfully" fixed an unrelated file. The "good" sample is a run that stayed in bounds.

Notice what changed. The rubric is no longer a vague "did it stay focused?" It is a precise property with a sharp violation condition. [ALEX: swap in a real rubric from ai-skills here if you have a sharper one.]

What to take from this

  1. Name the property before you read the output. Without a stated property you are not evaluating, only reacting.

  2. Sort each property into safety or liveness. "Never" and "eventually" score differently.

  3. Treat your worst calibration sample as a counterexample. It shows the exact trace that breaks the property.

FAQ

Do I need formal methods to use this? No. You need the discipline, not the notation: state the property precisely, before grading.

Isn't a rubric still just one check? A unit test checks one run. A property holds, or fails, across many. That is the shift: from "did this pass?" to "does this property hold?"

The eval rubrics I build on this idea are public, in the repo: github.com/alex-voloshin-dev/ai-skills.

More from this blog