How I evaluate AI coding agents: the rubric taxonomy

I'm Alex Voloshin. I help engineering teams operationalize AI coding agents. This blog is the long-form companion to ai-skills, a public repo of 26 agents, 77 skills, 48 eval rubrics with 288 calibration samples, 18 hooks, 12 rules, and 32 user-invocable workflows that work across Claude Code, Codex, and Windsurf.

What you'll find here:

- Tri-vendor parity write-ups. Engineering teams adopting AI coding agents shouldn't have to lock into one runtime to get production-grade tooling. Posts here track concrete differences between Claude Code, Codex, and Windsurf, and the patterns that survive across all three.
- Eval methodology. Most public discourse on AI coding-agent evaluation is hand-wavy. I write about specific rubrics, calibration sample design, and the failure modes calibration data catches in production.
- Formal-methods × agentic dev. LTL, model checking, and formal requirements review apply to agent specifications more directly than people think. Posts at the intersection of academic CS and practical agent orchestration.
- Production war stories. Concrete failure modes, post-mortem-style write-ups, and the patterns that grew out of fixing them.

What you won't find here:

- Generic "AI is changing everything" takes
- Tutorials for first-time agent users (better resources exist for that)
- Vendor-locked content disguised as agnostic advice

17+ years building production software. Subscribe below if you want one well-edited post per ~2 weeks. No spam, no growth-hack frequency.

If you ship an AI coding agent into a team's workflow, "did this output look right?" is not enough. You need rubrics: scorable properties that say what good and bad mean before you grade anything. I keep 48 of them, public. Here's how I organize them.

This is the hub post for the "Evaluating AI coding agents" series. Each later post goes deep on one bucket or one rubric.

What is each rubric trying to catch?

A rubric is a property an output should satisfy (see Post B for the model-checking framing). Each one has a score scale and six paired calibration samples: three "good" examples that should pass and three "bad" ones that should fail. Pairing passing and failing examples keeps the judge honest in production: if the judge passes a bad sample or fails a good one, the judge has drifted.
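One way to picture the structure just described is a small data model plus a check that the judge still agrees with the labels. This is a minimal sketch, not the repo's actual format: the class names, the sample names, the judge scores, and the `pass_threshold` of 3.0 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationSample:
    name: str
    expected_pass: bool  # True for a "good" sample, False for a "bad" one


@dataclass(frozen=True)
class Rubric:
    name: str
    max_score: float
    samples: tuple[CalibrationSample, ...]


def disagreements(rubric: Rubric, judge_scores: dict[str, float],
                  pass_threshold: float) -> list[str]:
    """Return the samples where the judge's verdict contradicts the label."""
    wrong = []
    for sample in rubric.samples:
        judged_pass = judge_scores[sample.name] >= pass_threshold
        if judged_pass != sample.expected_pass:
            wrong.append(sample.name)
    return wrong


# Hypothetical rubric: three good and three bad samples, as in the text.
rubric = Rubric(
    name="example-rubric",
    max_score=5.0,
    samples=(
        CalibrationSample("good-1", True),
        CalibrationSample("good-2", True),
        CalibrationSample("good-3", True),
        CalibrationSample("bad-1", False),
        CalibrationSample("bad-2", False),
        CalibrationSample("bad-3", False),
    ),
)

# Made-up judge scores; the judge wrongly passes "bad-3".
scores = {"good-1": 4.5, "good-2": 4.1, "good-3": 3.8,
          "bad-1": 1.0, "bad-2": 2.2, "bad-3": 3.6}

print(disagreements(rubric, scores, pass_threshold=3.0))  # ['bad-3']
```

An empty list means the judge agrees with every label. Any name in the list points at a sample worth re-reading before trusting the judge's scores on real output.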

The 48 rubrics fall into three buckets: cross-cutting checks, harness checks, and per-workflow checks.

The three buckets

Base rubrics: cross-cutting

These check fundamentals that apply across many outputs: evidence-backed claims, schema conformance, and spec-following. One bad calibration sample is named opinion-without-evidence (score 1.0 / 5.0); it catches an output that asserts a position with no supporting code, doc link, or data. Most outputs are scored against several base rubrics, not just one.

Meta-tool rubrics: harness checks

These check the harness itself, not the developer-facing output. When a workflow spawns a sub-agent, did the spawn payload conform to the schema? When a hook fired, did it produce the expected event log? It is a small bucket, but these are the rubrics that catch silent breakage in the agent machinery.
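As an illustration of the spawn-payload question, here is a minimal sketch of a conformance check. The schema is hypothetical: the field names `agent`, `task`, and `inputs` are assumptions for the example, not the harness's real payload format.

```python
# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"agent": str, "task": str, "inputs": dict}


def spawn_payload_errors(payload: dict) -> list[str]:
    """Return every way the payload deviates from the hypothetical schema."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            errors.append(
                f"wrong type for {name}: expected {expected_type.__name__}"
            )
    # Unknown fields are flagged too, so schema drift does not pass silently.
    for name in payload:
        if name not in REQUIRED_FIELDS:
            errors.append(f"unexpected field: {name}")
    return errors


ok = {"agent": "reviewer", "task": "review the diff", "inputs": {}}
broken = {"agent": "reviewer", "task": 42, "extra": True}

print(spawn_payload_errors(ok))      # []
print(spawn_payload_errors(broken))
```

Returning a list of errors rather than a boolean makes the failure readable in a log, which matters when the breakage would otherwise be silent.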

Per-workflow rubrics: one per major workflow

One rubric per major workflow: analyze, code-review, spike, feature-design, develop, bugfix, and so on. Each is tailored to what that workflow should produce. A bad sample for code-review is mixed-with-security-scan.score-1.4.md: a review that drifted into a security audit and lost focus. A good sample for spike is auth-openidconnect-vs-oauth.score-4.4.md: a clean, decisive go/no-go.
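The two sample filenames above appear to encode the expected score in the name. Assuming that `<slug>.score-<number>.md` pattern holds for other samples (an inference from just these two names, not a documented convention), a small parser could recover the slug and score:

```python
import re

# Assumed pattern, inferred from the two filenames quoted in the text.
SAMPLE_NAME = re.compile(r"^(?P<slug>.+)\.score-(?P<score>\d+(?:\.\d+)?)\.md$")


def parse_sample_name(filename: str) -> tuple[str, float]:
    """Split a calibration sample filename into (slug, expected score)."""
    match = SAMPLE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a calibration sample name: {filename!r}")
    return match.group("slug"), float(match.group("score"))


print(parse_sample_name("mixed-with-security-scan.score-1.4.md"))
print(parse_sample_name("auth-openidconnect-vs-oauth.score-4.4.md"))
```

Raising on a non-matching name is a deliberate choice here: a misnamed sample should fail loudly rather than be skipped.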

How I use the taxonomy

When I add a new workflow, I ask three questions. Which base rubrics already apply? Does the harness call any new sub-agents that need a meta-tool rubric? What is the one per-workflow rubric tailored to this workflow's output? If I cannot answer the third, the workflow is not ready to ship.
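The three questions can be sketched as a small pre-ship check. Everything here is hypothetical: the `WorkflowPlan` fields and the note wording are illustrative names, not part of the repo. Following the text, only the third question blocks shipping; the first two produce notes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkflowPlan:
    name: str
    base_rubrics: list[str] = field(default_factory=list)       # question 1
    new_subagents: list[str] = field(default_factory=list)      # question 2
    meta_tool_rubrics: list[str] = field(default_factory=list)  # question 2
    workflow_rubric: Optional[str] = None                       # question 3


def review(plan: WorkflowPlan) -> tuple[bool, list[str]]:
    """Return (ready_to_ship, notes). Only question 3 blocks shipping."""
    notes = []
    if not plan.base_rubrics:
        notes.append("no base rubrics selected yet")
    if plan.new_subagents and not plan.meta_tool_rubrics:
        notes.append("new sub-agents have no meta-tool rubric yet")
    ready = plan.workflow_rubric is not None
    if not ready:
        notes.append("no per-workflow rubric: not ready to ship")
    return ready, notes


draft = WorkflowPlan(name="example-workflow", base_rubrics=["evidence"])
print(review(draft))  # (False, ['no per-workflow rubric: not ready to ship'])

draft.workflow_rubric = "example-workflow-rubric"
print(review(draft))  # (True, [])
```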

FAQ

Why three buckets, not themes like security or style? Themes cross all three buckets. A rubric belongs to exactly one bucket by what it scores against: the harness, a workflow, or any output.

Where do the six paired calibration samples come in? Post D walks through one rubric in full: its score scale and the boundary each pair draws.

The 48 rubrics live in the public repo: github.com/alex-voloshin-dev/ai-skills/tree/main/plugin/eval.

Evaluating AI coding agents

Part 1 of 1

Rubrics, calibration samples, and the taxonomy I use to evaluate AI coding-agent output across workflows.