How I evaluate AI coding agents: the rubric taxonomy

If you ship an AI coding agent into a team's workflow, asking "did this output look right?" is not enough. You need rubrics: scorable properties that define what good and bad mean before you grade anything. I maintain 48 of them in a public repo. Here's how I organize them.
This is the hub post for the "Evaluating AI coding agents" series. Each later post goes deep on one bucket or one rubric.
What is each rubric trying to catch?
A rubric is a property an output should satisfy (see Post B for the model-checking framing). Each one has a score scale and six paired calibration samples: three "good" examples that should pass and three "bad" ones that should fail. Pairing good samples with bad ones keeps the judge honest in production.
The 48 rubrics fall into three buckets: cross-cutting checks, harness checks, and per-workflow checks.
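To make the shape concrete, here is a minimal sketch of a rubric as a data structure. It is an illustration only, not the repo's actual schema: the class names, field names, and the `has_balanced_calibration` helper are all hypothetical, and the 5.0 ceiling is assumed from the "1.0 / 5.0" score quoted in this post.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    """The three buckets; every rubric belongs to exactly one."""
    BASE = "base"            # cross-cutting checks
    META_TOOL = "meta-tool"  # harness checks
    WORKFLOW = "workflow"    # per-workflow checks


@dataclass(frozen=True)
class Sample:
    name: str            # hypothetical identifier for a calibration sample
    expected_pass: bool  # True for a "good" sample, False for a "bad" one


@dataclass(frozen=True)
class Rubric:
    name: str
    bucket: Bucket
    samples: tuple[Sample, ...]
    scale_max: float = 5.0  # assumption: scores are reported out of 5.0

    def has_balanced_calibration(self) -> bool:
        """Check for exactly three good and three bad calibration samples."""
        good = sum(1 for s in self.samples if s.expected_pass)
        bad = len(self.samples) - good
        return good == 3 and bad == 3
```

Keeping the bucket as a single enum field, rather than a list of tags, mirrors the rule that a rubric sits in one bucket only.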
The three buckets
Base rubrics: cross-cutting
These check fundamentals that apply across many outputs: evidence-backed claims, schema-conformance, spec-following. One bad calibration sample is named opinion-without-evidence (score 1.0 / 5.0); it catches an output that asserts a position with no supporting code, doc link, or data. Most outputs are scored against several base rubrics, not just one.
Meta-tool rubrics: harness checks
These check the harness itself, not the developer-facing output. When a workflow spawns a sub-agent, did the spawn payload conform to the schema? When a hook fired, did it produce the expected event log? It is a small bucket, but these are the rubrics that catch silent breakage in the agent machinery.
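A spawn-payload check can be as small as the sketch below. The required fields (`agent`, `task`, `inputs`) are hypothetical placeholders, since this post does not spell out the real schema, and a production harness would more likely use a full schema validator than hand-rolled type checks.

```python
# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"agent": str, "task": str, "inputs": dict}


def payload_errors(payload: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the payload conforms."""
    errors = []
    for field_name, expected in REQUIRED_FIELDS.items():
        if field_name not in payload:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected):
            errors.append(
                f"wrong type for {field_name}: expected {expected.__name__}"
            )
    return errors
```

Returning every violation, instead of stopping at the first, gives the judge a complete picture of how far a payload drifted from the schema.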
Per-workflow rubrics: one per major workflow
One rubric per major workflow: analyze, code-review, spike, feature-design, develop, bugfix, and so on. Each is tailored to what that workflow should produce. A bad sample for code-review is mixed-with-security-scan.score-1.4.md: a review that drifted into a security audit and lost focus. A good sample for spike is auth-openidconnect-vs-oauth.score-4.4.md: a clean, decisive go/no-go.
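The two sample filenames above carry their expected score in the name. Assuming that `<slug>.score-<n.n>.md` pattern holds for other samples too (an inference from these two names, not a documented convention), a small parser could recover the slug and score like this. The function name is hypothetical.

```python
import re

# Assumed pattern, inferred from the two filenames quoted in this post.
SAMPLE_NAME = re.compile(r"^(?P<slug>.+)\.score-(?P<score>\d+\.\d+)\.md$")


def parse_sample_name(filename: str) -> tuple[str, float]:
    """Split a calibration sample filename into (slug, expected score)."""
    match = SAMPLE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a calibration sample filename: {filename!r}")
    return match.group("slug"), float(match.group("score"))


slug, score = parse_sample_name("mixed-with-security-scan.score-1.4.md")
print(slug, score)
```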
How I use the taxonomy
When I add a new workflow, I ask three questions. Which base rubrics already apply? Does the harness call any new sub-agents that need a meta-tool rubric? What is the one per-workflow rubric tailored to this workflow's output? If a workflow cannot answer the third, it is not ready to ship.
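The three questions map onto a simple readiness check. This is a sketch under assumed names (`WorkflowPlan`, its fields, `ready_to_ship`, and the rubric identifiers in the usage lines are all hypothetical); the only rule it encodes from the text is that a workflow without its own per-workflow rubric is not ready.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkflowPlan:
    name: str
    base_rubrics: list[str] = field(default_factory=list)       # question 1
    meta_tool_rubrics: list[str] = field(default_factory=list)  # question 2
    workflow_rubric: Optional[str] = None                       # question 3

    def ready_to_ship(self) -> bool:
        """A workflow ships only once it has its own per-workflow rubric."""
        return self.workflow_rubric is not None


plan = WorkflowPlan(name="bugfix", base_rubrics=["evidence-backed-claims"])
print(plan.ready_to_ship())  # False: the third question is still unanswered
plan.workflow_rubric = "bugfix"
print(plan.ready_to_ship())  # True
```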
FAQ
Why three buckets, not themes like security or style? Themes cross all three buckets. A rubric belongs to exactly one bucket by what it scores against: the harness, a workflow, or any output.
Where do the six paired calibration samples come in? Post D walks one rubric in full: its score scale and the boundary each pair draws.
The 48 rubrics live in the public repo: github.com/alex-voloshin-dev/ai-skills/tree/main/plugin/eval.

