What model checking taught me about evaluating AI coding agents
Why a unit test is the wrong tool for a nondeterministic agent.
A unit test asks one question: did this run pass? That works when code is deterministic. An LLM coding agent is not. The same prompt produces different code each time, so one passing run proves almost nothing.
Model checking, a technique from formal verification, offers a better lens, and it changed how I write eval rubrics.
What does model checking actually do?
It verifies a system by checking properties across all of its possible behaviors, not by running one example. The hard part is never the checking. It is specifying the property: stating precisely what "correct" means before you look at any output.
Two kinds of property matter. A safety property says something bad never happens, and a violation is one concrete trace you can point at. A liveness property says something good eventually happens. Name the property first, and the rest follows.
Why does this fit AI coding agents?
A nondeterministic agent behaves like a system with a huge space of possible runs, not a function with one output. A pass/fail test samples a single run and tells you nothing about the others.
So here is the reframe: an eval rubric is a property. Writing a good rubric is specifying a property precisely, the same hard part as in model checking. Safety properties become "never" rubrics: never commit a secret, never edit a file outside scope. Liveness properties become "eventually" rubrics: the spec eventually gets fully handled.
A worked example
Take the property "the agent never modifies a file outside the task's scope." That is a safety property: a bad thing that must never happen, and a violation is a single trace.
As an eval rubric it becomes scope-adherence: score 0 if any out-of-scope file was touched, full score if not. The calibration "bad" sample is the counterexample, a recorded run where the agent "helpfully" fixed an unrelated file. The "good" sample is a run that stayed in bounds.
Notice what changed. The rubric is no longer a vague "did it stay focused?" It is a precise property with a sharp violation condition. [ALEX: swap in a real rubric from ai-skills here if you have a sharper one.]
What to take from this
Name the property before you read the output. Without a stated property you are not evaluating, only reacting.
Sort each property into safety or liveness. "Never" and "eventually" score differently.
Treat your worst calibration sample as a counterexample. It shows the exact trace that breaks the property.
FAQ
Do I need formal methods to use this? No. You need the discipline, not the notation: state the property precisely, before grading.
Isn't a rubric still just one check? A unit test checks one run. A property holds, or fails, across many. That is the shift: from "did this pass?" to "does this property hold?"
The eval rubrics I build on this idea are public, in the repo: github.com/alex-voloshin-dev/ai-skills.

