Building an eval suite (and catching my own tests being wrong)

Phase 4 of my four-weekend AI engineering curriculum. The final phase. The one most engineers handwave through and most interviewers actually care about: how do you know if your model is good?

Up to now I’d been testing my agent by talking to it. Type a question, eyeball the response, decide if I’m happy. That works at 5 questions, breaks at 50, and dies completely the moment you ask “did my last change make things worse?”

This phase replaces vibes with measurement. By the end I had:

A hand-curated golden set of 20 cases covering five behavior categories
An eval runner that scores each case deterministically and with an LLM judge
A markdown report per run for diffing changes
LangFuse self-hosted via Docker, with full traces of every agent AND judge call

The most useful thing I built, though, wasn’t any of the above. It was the iteration on the eval itself — catching my own tests being too strict, catching my judge being biased, catching a real agent regression hiding underneath the noise.

This is that story.

The mental shift

Vibes-based testing answers “did this work?” with a hunch. Measurement-based testing answers it with a number.

Before:  ask → look → "feels right" → ship
After:   golden set → automated scoring → pass rate → ship when ≥ baseline

What separates “I built an AI feature” from “I built an AI feature I know works” is the second line: a pass rate you can compare against a baseline.

The pipeline

Four pieces, each with one job.

1. The golden set is hand-written. The curriculum was loud about this: never auto-generate eval cases with the same model you’re testing. You’ll just measure what it can already answer. I wrote 20 cases across five categories: 5 easy retrieval (“how do I refund?”), 5 multi-tool flows (“refund order ORD-1004”), 3 out-of-scope (“what’s the weather?”), 4 ambiguous (“just help me”), 3 edge cases (empty input, very long input, prompt injection).

Each case has:

An input (the user message)
An ordered expected_tools list — order matters because chains matter
must_contain / must_not_contain substring assertions
An expected_behavior description for the judge

2. The eval runner loads the golden set, runs each case through the agent loop directly (no HTTP), captures the final response and the ordered list of tools called. ~150 lines.

3. Deterministic scoring checks each case three ways: right tools in the right order, response contains required phrases, response avoids forbidden ones. Each check is binary; all three must pass.

4. LLM-as-judge is a separate Claude call that scores the response on four rubric questions: was it correct, did it avoid hallucination, did it take the appropriate action, was the tone good. All binary.
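Pieces 1 and 3 above could be sketched roughly as follows. The case field names come from the list above; the `id` field, the `lookup_order` tool name, the exact-sequence tool match, and the case-insensitive substring comparison are assumptions made for illustration, not a record of the actual runner.

```typescript
// Sketch of a golden-set case and its deterministic score.
// `id`, `lookup_order`, and all example values are hypothetical.
interface GoldenCase {
  id: string;
  input: string;               // the user message
  expected_tools: string[];    // ordered, because chains matter
  must_contain: string[];      // substrings the reply must include
  must_not_contain: string[];  // substrings the reply must avoid
  expected_behavior: string;   // plain-language description for the judge
}

interface AgentRun {
  response: string;       // final user-facing reply
  toolsCalled: string[];  // tool names in call order
}

interface DeterministicScore {
  toolsOk: boolean;
  containsOk: boolean;
  avoidsOk: boolean;
  pass: boolean;
}

function scoreDeterministic(c: GoldenCase, run: AgentRun): DeterministicScore {
  // Assumed rule: exact sequence match (same tools, same order, nothing extra).
  const toolsOk =
    c.expected_tools.length === run.toolsCalled.length &&
    c.expected_tools.every((name, i) => name === run.toolsCalled[i]);

  // Assumed rule: substring checks ignore case.
  const reply = run.response.toLowerCase();
  const containsOk = c.must_contain.every((s) => reply.indexOf(s.toLowerCase()) !== -1);
  const avoidsOk = c.must_not_contain.every((s) => reply.indexOf(s.toLowerCase()) === -1);

  // Each check is binary; all three must pass.
  return { toolsOk, containsOk, avoidsOk, pass: toolsOk && containsOk && avoidsOk };
}

const refundCase: GoldenCase = {
  id: "multi-001-refund",
  input: "refund order ORD-1004",
  expected_tools: ["lookup_order", "create_ticket"],
  must_contain: ["ticket"],
  must_not_contain: ["I can't help"],
  expected_behavior: "Looks up the order, then opens a refund ticket.",
};
```

Returning the three sub-results alongside `pass` keeps the report useful: a failed case shows whether the tools, the required phrases, or the forbidden phrases were the problem.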

A case passes overall only if the deterministic checks AND the judge both pass it.
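A minimal sketch of how the judge verdict and the overall result might be combined. The four field names are hypothetical labels for the four rubric questions, and requiring all four to be true for a judge pass is an assumption; the rubric only says each answer is binary.

```typescript
// Hypothetical field names for the four binary rubric questions.
interface JudgeVerdict {
  correct: boolean;
  no_hallucination: boolean;
  appropriate_action: boolean;
  good_tone: boolean;
}

// Assumption: the judge passes a case only when every rubric answer is true.
function judgePasses(v: JudgeVerdict): boolean {
  return v.correct && v.no_hallucination && v.appropriate_action && v.good_tone;
}

// Overall pass requires the deterministic checks and the judge to both pass.
function casePasses(deterministicPass: boolean, verdict: JudgeVerdict): boolean {
  return deterministicPass && judgePasses(verdict);
}
```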

The first run looked suspiciously good

First score: 80% (16 of 20). That smelled wrong. First eval runs are almost never that clean.

The judge was Haiku 4.5 — same as the agent. Haiku was judging Haiku. Same-model judging produces lenient scores almost by construction; the judge is predisposed to think the agent’s responses are fine because that’s what it would have written.

The fix was a one-line change: point the judge at Sonnet 4.6 instead. Cost went from ~$0.03 per run to ~$0.13 (Sonnet is roughly 4× the per-token price). The new score: 45%.

Forty-five. From 80. Same agent. Different judge.

That gap is the bias you’re paying for when you use one model for both roles. Bigger judge for smaller worker is now my default for any future agent project.

But Sonnet had its own biases

The first Sonnet run flagged many things as hallucinations:

“The agent says ‘PayPal 1–3 days, bank transfer 7–14 days’ — these specific timelines aren’t verified as coming from the tool result. Could be hallucinated.”

The agent wasn’t hallucinating. Those timelines come from the help-center article that search_articles returned. The judge didn’t see the tool results — it only saw the final response. From its perspective, every specific number looked invented.

The fix: pass the actual tool inputs AND outputs to the judge prompt. With that context, the judge can verify facts against what tools actually returned. The hallucination flags dropped from being everywhere to being exactly where they should be.

A second bias: for out-of-scope queries like “what’s 17 × 23?”, the judge marked correctness=false because the agent declined to answer. But declining was the expected behavior. The rubric needed to say so explicitly. I rewrote that rule to clarify: “for cases where expected behavior is to decline, declining IS correct.”
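Both fixes, tool context and the explicit decline rule, could live in the function that builds the judge prompt. The sketch below is one possible shape under assumed names (`ToolCall`, `buildJudgePrompt`) and assumed wording; only the decline rule is quoted from above.

```typescript
// Hypothetical record of one tool call made by the agent.
interface ToolCall {
  name: string;
  input: unknown;
  output: unknown;
}

function buildJudgePrompt(
  userInput: string,
  expectedBehavior: string,
  toolCalls: ToolCall[],
  finalResponse: string
): string {
  // Serialize every tool input and output so the judge can check facts against them.
  const toolSection =
    toolCalls.length === 0
      ? "(no tools were called)"
      : toolCalls
          .map(
            (t, i) =>
              `${i + 1}. ${t.name}\n   input: ${JSON.stringify(t.input)}\n   output: ${JSON.stringify(t.output)}`
          )
          .join("\n");

  return [
    "You are grading a support agent's reply.",
    `User message: ${userInput}`,
    `Expected behavior: ${expectedBehavior}`,
    "Tool calls made by the agent:",
    toolSection,
    `Final response: ${finalResponse}`,
    "Flag a hallucination only if a claim is not supported by the tool outputs above.",
    "For cases where expected behavior is to decline, declining IS correct.",
  ].join("\n");
}
```

With the tool output in the prompt, a timeline like "PayPal 1-3 days" can be checked against what `search_articles` returned instead of being treated as invented.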

Then I caught my own tests being wrong

A pattern emerged: deterministic check failing (must_contain: ["ORD-1004"] missing from response) while the judge passed (response is fine).

When deterministic and judge disagree consistently, one of them is wrong. In this case, my deterministic assertion was wrong. I’d written the test as if the agent must always echo the order ID back in its reply. But the user had just typed ORD-1004 in their message. The created ticket includes ORD-1004 in its description. Forcing the agent to also say “for ORD-1004” in the user-facing reply is performative — and it’s the kind of arbitrary specificity that creeps into eval sets when the writer projects their own preferred phrasing onto the test.

I relaxed the assertions and the calibration disagreements went away.
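Disagreements like this are easy to surface mechanically. A small sketch, with hypothetical type and function names, that lists every case where the two scorers split and in which direction:

```typescript
// Hypothetical per-case outcome from one eval run.
interface CaseOutcome {
  id: string;
  deterministicPass: boolean;
  judgePass: boolean;
}

type DisagreementKind = "det-fail-judge-pass" | "det-pass-judge-fail";

interface Disagreement {
  id: string;
  kind: DisagreementKind;
}

function findDisagreements(outcomes: CaseOutcome[]): Disagreement[] {
  const result: Disagreement[] = [];
  for (const o of outcomes) {
    if (o.deterministicPass === o.judgePass) continue; // scorers agree, nothing to review
    result.push({
      id: o.id,
      kind: o.deterministicPass ? "det-pass-judge-fail" : "det-fail-judge-pass",
    });
  }
  return result;
}
```

As a rough heuristic, `det-fail-judge-pass` tends to point at an over-strict assertion, while `det-pass-judge-fail` tends to point at a judge missing context or an assertion that is too loose. Either way the case deserves a manual look before the number is trusted.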

This is one of the most common eval mistakes: “I would say X, so the agent must say X.” No — the agent must do something correct; X is one valid version of correct.

The one real agent bug

Even after fixing the calibration issues, one case persistently failed: multi-004-wrong-item. The user said “ORD-1001 — I think I got the wrong tamper.” The agent looked up the order, then asked a clarifying question instead of opening a ticket. Same Phase 3 failure mode I’d already fixed for refund queries, regressed for the “wrong item” wording.

The fix lived in the create_ticket tool description: I added an explicit rule that when the user has clearly described a concrete order problem, the agent should open the ticket without asking clarification — the human picking up the ticket can ask follow-ups, and asking the user to repeat themselves is bad service.

This is the demo for the post: here’s the regression, here’s the one-line fix, here’s the eval going from ✗ to ✓.

The finding I didn’t expect: non-determinism is the headline

I ran the same eval twice in a row. Same prompts. Same inputs. The pass rate moved by 3 cases between runs.

amb-004-multi-intent: failed in run N, passed in run N+1
oos-002-math: passed in run N, then in N+1 the agent said “That’s math outside my wheelhouse, but the answer is 391” — declining and answering in the same breath
edge-002-very-long: handled both topics in run N, only handled one in N+1

Same agent. Different luck. Single-run pass rates are noisy: a 3-case swing on a 20-case set is 15 percentage points. To measure real quality, you’d need to run each case 3+ times and report the median or worst case. The “80%” or “55%” numbers in my run reports are point estimates with a real range of roughly ±10 percentage points around them.
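A sketch of what aggregating repeated runs could look like. All names here are hypothetical. For binary outcomes with an odd number of repeats, the median is the same as a majority vote, so the sketch reports a majority rate alongside the worst-case rate and lists the cases that flipped.

```typescript
// Hypothetical shape: one entry per case, one boolean per repeated run.
interface RepeatedResult {
  id: string;
  passes: boolean[];
}

interface RepeatSummary {
  total: number;
  worstCaseRate: number; // share of cases that passed every run
  majorityRate: number;  // share of cases that passed more than half of their runs
  flakyIds: string[];    // cases that both passed and failed across runs
}

function summarizeRepeats(results: RepeatedResult[]): RepeatSummary {
  let worstCasePasses = 0;
  let majorityPasses = 0;
  const flakyIds: string[] = [];

  for (const r of results) {
    const runs = r.passes.length;
    const passCount = r.passes.filter((p) => p).length;
    if (runs > 0 && passCount === runs) worstCasePasses++;
    if (passCount * 2 > runs) majorityPasses++;
    if (passCount > 0 && passCount < runs) flakyIds.push(r.id);
  }

  const total = results.length;
  return {
    total,
    worstCaseRate: total === 0 ? 0 : worstCasePasses / total,
    majorityRate: total === 0 ? 0 : majorityPasses / total,
    flakyIds,
  };
}
```

The gap between `majorityRate` and `worstCaseRate` is a direct read on how much of the score is luck, and `flakyIds` is the list of cases to look at first.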

The curriculum doesn’t call this out directly. It should. If you report a single eval run as your model’s accuracy, your number has more measurement error than you’re admitting.

Observability — the part everyone skips

I self-hosted LangFuse via Docker. Six services (postgres, redis, clickhouse, minio, langfuse-web, langfuse-worker), about 100 lines of docker-compose.langfuse.yml. First-time setup including image pulls took ~30 minutes.

The instrumentation principle that matters: every LLM call gets a trace.

I initially instrumented only the agent loop. Then I realized the judge IS an LLM call, costing real Sonnet money, with its own potential failure modes — and I’d forgotten to trace it. I added a generation span to the judge so its inputs, outputs, scores, and token usage all land in the same trace tree as the agent it’s judging.
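The principle can be illustrated without any tracing library. The sketch below is a hypothetical in-memory trace, not the Langfuse SDK: every LLM call goes through one wrapper, so neither the agent nor the judge can run without leaving a span behind. The calls are synchronous only to keep the example short; real LLM calls would be asynchronous.

```typescript
// Hypothetical in-memory trace for one eval case. This is NOT the Langfuse SDK.
interface LlmResult {
  text: string;
  inputTokens: number;
  outputTokens: number;
}

type SpanKind = "agent" | "judge";

interface GenerationSpan {
  kind: SpanKind;
  model: string;
  input: string;
  output: string;
  inputTokens: number;
  outputTokens: number;
}

class CaseTrace {
  readonly spans: GenerationSpan[] = [];

  constructor(readonly caseId: string) {}

  // Wrap an LLM call and record its input, output, and token usage.
  generation(kind: SpanKind, model: string, input: string, call: (input: string) => LlmResult): LlmResult {
    const result = call(input);
    this.spans.push({
      kind,
      model,
      input,
      output: result.text,
      inputTokens: result.inputTokens,
      outputTokens: result.outputTokens,
    });
    return result;
  }

  // Total tokens for one half of the pipeline (agent or judge).
  tokens(kind: SpanKind): number {
    let total = 0;
    for (const s of this.spans) {
      if (s.kind === kind) total += s.inputTokens + s.outputTokens;
    }
    return total;
  }
}
```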

Now when a case fails, I can open it in LangFuse and see:

What the agent was thinking at each iteration
What each tool returned
What the judge saw and scored
Token cost for both halves, broken down

That’s the actual debugging loop. Without it, “this case used to pass and now doesn’t” is a guess. With it, you can see exactly which step diverged from last run.

What I’d do differently

Run each eval case 3 times from the start and report the worst case. That surfaces run-to-run noise instead of pretending it isn’t there.
Pass the system prompt to the judge. Some “hallucination” flags were the agent following system-prompt instructions the judge couldn’t see.
Track per-case token cost in the report. Cost variance across cases is itself a signal worth surfacing.

The honest summary

Four runs of this phase, in order:

Run	What changed	Deterministic pass	Judge pass	Both pass
1	Haiku judging Haiku	75%	55%	45%
2	Switched to Sonnet judge	75%	55%	45%
3	Judge got tool context + better rubric; eval calibration relaxed	80%	75%	70%
4	Agent fix for wrong-item routing	80%	75%	70%

The agent didn’t actually get much better between runs. What got better was my ability to measure it accurately. Eval is iterative — you debug the test as much as the agent, because the test is an artifact you wrote and can get wrong too.

If I were interviewing tomorrow and someone asked “how do you know if your AI feature is good?” — this is the answer I’d want to give. Not a number. A process. A way of asking the question that produces a usable answer.

Follow-up: making the eval provider-agnostic

I went ahead and made the eval provider-agnostic. Three structural changes turned the pipeline into something that can compare provider combinations cleanly.

1. The agent loop split into a dispatcher. lib/agent.ts is now pure types plus a runAgent() function that dispatches to runAgentAnthropic() or runAgentOllama() based on the CHAT_PROVIDER env var. Mirrors the pluggable-LLM pattern from Phase 2. The route handler and the eval runner didn’t change — they call runAgent and don’t care which provider is underneath.

2. Ollama tool-calling — the wire-level differences that bit. Implementing the Ollama agent loop surfaced three concrete differences from Anthropic that aren’t obvious from skimming the docs:

Tool format follows OpenAI conventions. { type: "function", function: { name, description, parameters } } vs Anthropic’s { name, description, input_schema }. Same JSON Schema underneath, different wrapper.
Tool results go back as { role: "tool", content } messages, not as content blocks inside a user message. Anthropic uses content arrays; Ollama uses a separate role.
Streaming uses NDJSON, and tool_calls typically arrive in the final chunk (with done: true) rather than incrementally. The parser has to buffer text deltas while watching for tool_calls to land on the done event.

3. Judge gets the same treatment. A new JUDGE_PROVIDER env var picks between Anthropic and Ollama for the LLM-as-judge call. Anthropic still defaults — stronger judge means less bias — but a fully-local eval is now one env-var flip away. For Ollama I added format: "json" to the request, which constrains output to valid JSON. Without it, smaller models routinely wrap responses in markdown fences and the strict parser falls over.
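The wire-level differences above can be sketched as a handful of small helpers. Function and type names are hypothetical, and the NDJSON helper takes the whole response body as one string for brevity; a real streaming parser would also have to handle lines split across network chunks.

```typescript
type JsonSchema = Record<string, unknown>;

interface AnthropicTool { name: string; description: string; input_schema: JsonSchema; }
interface OllamaTool {
  type: "function";
  function: { name: string; description: string; parameters: JsonSchema };
}

// Same JSON Schema underneath, different wrapper.
function toOllamaTool(tool: AnthropicTool): OllamaTool {
  return {
    type: "function",
    function: { name: tool.name, description: tool.description, parameters: tool.input_schema },
  };
}

// Anthropic: the tool result is a content block inside a user message.
function anthropicToolResult(toolUseId: string, content: string) {
  return {
    role: "user" as const,
    content: [{ type: "tool_result" as const, tool_use_id: toolUseId, content }],
  };
}

// Ollama: the tool result is its own message with role "tool".
function ollamaToolResult(content: string) {
  return { role: "tool" as const, content };
}

interface OllamaToolCall { function: { name: string; arguments: Record<string, unknown> }; }
interface OllamaChunk {
  message?: { content?: string; tool_calls?: OllamaToolCall[] };
  done?: boolean;
}

// Buffer text deltas and collect tool_calls from whichever chunk carries them.
function collectNdjson(body: string): { text: string; toolCalls: OllamaToolCall[] } {
  let text = "";
  let toolCalls: OllamaToolCall[] = [];
  for (const line of body.split("\n")) {
    if (line.trim() === "") continue;
    const chunk = JSON.parse(line) as OllamaChunk;
    if (chunk.message && chunk.message.content) text += chunk.message.content;
    if (chunk.message && chunk.message.tool_calls) toolCalls = toolCalls.concat(chunk.message.tool_calls);
  }
  return { text, toolCalls };
}

// Judge request body: format "json" asks Ollama to constrain output to valid JSON.
function ollamaJudgeBody(model: string, prompt: string) {
  return {
    model,
    messages: [{ role: "user" as const, content: prompt }],
    stream: false,
    format: "json" as const,
  };
}
```

Collecting `tool_calls` from any chunk, rather than only the one marked `done: true`, is a deliberately defensive choice: it still works if a model emits them earlier in the stream.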

I also moved every model name into env, with .env.example as the source of truth. Change a model by editing .env.local, not source code.
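A sketch of what env-driven configuration plus the dispatcher might look like. `CHAT_PROVIDER` and `JUDGE_PROVIDER` are the variables named above; `CHAT_MODEL`, `JUDGE_MODEL`, the `EvalConfig` shape, the default of Anthropic for the chat side, and the stub runner bodies are assumptions. The environment is passed in as a plain object (in Node that would be `process.env`), and the runners are synchronous stubs standing in for the real provider loops.

```typescript
type Provider = "anthropic" | "ollama";
type Env = Record<string, string | undefined>;

interface EvalConfig {
  chatProvider: Provider;
  judgeProvider: Provider;
  chatModel: string;
  judgeModel: string;
}

function readProvider(env: Env, key: "CHAT_PROVIDER" | "JUDGE_PROVIDER"): Provider {
  // Assumption: both sides fall back to Anthropic when the variable is unset.
  const raw = (env[key] || "anthropic").toLowerCase();
  if (raw === "anthropic" || raw === "ollama") return raw;
  throw new Error(`Unknown ${key}: ${raw}`);
}

function readRequired(env: Env, key: string): string {
  const value = env[key];
  if (!value) throw new Error(`${key} is not set; see .env.example`);
  return value;
}

function loadEvalConfig(env: Env): EvalConfig {
  return {
    chatProvider: readProvider(env, "CHAT_PROVIDER"),
    judgeProvider: readProvider(env, "JUDGE_PROVIDER"),
    chatModel: readRequired(env, "CHAT_MODEL"),   // hypothetical variable name
    judgeModel: readRequired(env, "JUDGE_MODEL"), // hypothetical variable name
  };
}

interface AgentResult {
  provider: Provider;
  model: string;
  response: string;
  toolsCalled: string[];
}

// Stubs standing in for the real provider-specific agent loops.
function runAgentAnthropic(input: string, model: string): AgentResult {
  return { provider: "anthropic", model, response: `stub reply to: ${input}`, toolsCalled: [] };
}

function runAgentOllama(input: string, model: string): AgentResult {
  return { provider: "ollama", model, response: `stub reply to: ${input}`, toolsCalled: [] };
}

// Callers use runAgent and never learn which provider is underneath.
function runAgent(input: string, config: EvalConfig): AgentResult {
  return config.chatProvider === "ollama"
    ? runAgentOllama(input, config.chatModel)
    : runAgentAnthropic(input, config.chatModel);
}
```

Failing loudly on a missing model name or an unknown provider is deliberate: a silent fallback would let an eval run against the wrong model and still produce a plausible-looking report.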

The 4-way comparison matrix

With both halves pluggable, four interesting experiments are one env-var flip apart:

CHAT_PROVIDER	JUDGE_PROVIDER	What it answers
anthropic	anthropic	Baseline — the numbers above in this post
ollama	anthropic	Agent-provider comparison (judge held constant)
anthropic	ollama	How a local judge sees the same outputs
ollama	ollama	Fully local — what zero-budget actually delivers

The middle two are the genuinely interesting experiments. “How much quality do I lose going local on the agent?” and “How much does my measurement instrument shape my conclusions?” are different questions and they shouldn’t be conflated — switching both at once mixes two changes and you can’t attribute deltas to either.

A small but important rule I gave myself: when comparing agent providers, keep the judge constant. And vice versa. Otherwise pass-rate changes blend agent-quality deltas with judge-strictness deltas.
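That rule is simple enough to encode. A sketch with hypothetical names: build the four cells, then treat a pair of cells as a valid comparison only when exactly one of the two providers differs.

```typescript
type Provider = "anthropic" | "ollama";

interface Cell {
  chat: Provider;
  judge: Provider;
}

const PROVIDERS: Provider[] = ["anthropic", "ollama"];

// Every combination of chat provider and judge provider.
function buildMatrix(): Cell[] {
  const cells: Cell[] = [];
  for (const chat of PROVIDERS) {
    for (const judge of PROVIDERS) {
      cells.push({ chat, judge });
    }
  }
  return cells;
}

// A pass-rate delta is attributable only when exactly one side changed.
function isAttributable(a: Cell, b: Cell): boolean {
  const chatDiffers = a.chat !== b.chat;
  const judgeDiffers = a.judge !== b.judge;
  return chatDiffers !== judgeDiffers;
}

// All unordered pairs of cells that make a clean comparison.
function attributablePairs(cells: Cell[]): [Cell, Cell][] {
  const pairs: [Cell, Cell][] = [];
  for (let i = 0; i < cells.length; i++) {
    for (let j = i + 1; j < cells.length; j++) {
      if (isAttributable(cells[i], cells[j])) pairs.push([cells[i], cells[j]]);
    }
  }
  return pairs;
}
```

Of the six possible pairs among four cells, four change exactly one provider. The two that change both (the baseline against fully local, and the two mixed cells against each other) are the ones to avoid reading deltas from.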

What’s next

The comparison numbers themselves. Llama 3.1 8B is already pulled from Phase 2; Qwen 2.5 14B is downloading for the judge side. Once both are in place, I’ll run the four cells of the matrix above and write up the diff.

The blog post for that one writes itself.