From Tests to Evals

The Gap Tests Do Not Close

You have automated tests. They pass. Your application does what the code tells it to do: the API returns the right format, the parser extracts the right fields, the UI renders the right components. Everything is green.

Then a user reads the AI-generated summary on your page and says: "This is wrong."

The code ran correctly. The output was incorrect. Your tests verified that the system behaved as specified. Nothing verified that its output accurately reflected the source material.

This is the gap. Tests verify behavior: the code does what you told it to do. Evals verify correctness: the output says what it should say. If your application has any AI-generated content (summaries, recommendations, analysis, descriptions), tests alone leave a blind spot.

What Is an Eval?

An evaluation harness (eval) runs your system against inputs where you already know the correct answer, then checks whether the system got it right.

| | Tests | Evals |
| --- | --- | --- |
| Question asked | Does the code behave as specified? | Does the output accurately reflect the source material? |
| Source of truth | Your acceptance criteria | Expert judgment about what the output should say |
| What it catches | Regressions, broken behavior | Mischaracterized information, missing key details, misleading summaries |
| When it runs | Every code change | Every change that affects AI-generated output |
| Example | "Given a search query, the results page displays matching entries" | "Given this input data, the AI-generated summary captures all three key findings and does not add claims the source does not support" |

Tests and evals are not competing concepts. They are complementary. Tests protect against breaking what works. Evals protect against being confidently wrong.
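To make the contrast concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `generate_summary` is a hypothetical stand-in for your AI-backed step (it returns canned text so the example runs offline), and the findings are invented.

```python
def generate_summary(report: str) -> dict:
    """Hypothetical stand-in for an AI-backed summarizer (canned output)."""
    return {
        "title": "Q3 report",
        "summary": "Revenue grew 12%. Churn fell to 3%.",
    }


def check_behavior(result: dict) -> bool:
    """Test-style check: is the response shaped the way the spec says?"""
    return (
        set(result) == {"title", "summary"}
        and isinstance(result["summary"], str)
        and len(result["summary"]) > 0
    )


def check_correctness(result: dict, key_findings: list[str]) -> list[str]:
    """Eval-style check: which expert-listed findings are missing?"""
    text = result["summary"].lower()
    return [f for f in key_findings if f.lower() not in text]


result = generate_summary("full report text goes here")
missing = check_correctness(
    result, ["revenue grew 12%", "churn fell to 3%", "hiring freeze"]
)
print(check_behavior(result))  # the behavior check is satisfied
print(missing)                 # the eval lists what the summary left out
```

The behavior check passes because the response has the right shape. The correctness check still reports a missing finding, which is the gap a test alone would not surface.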

Golden Datasets: The Answer Key

The mechanism that makes evals work is a golden dataset: a curated set of real inputs paired with expert-verified expected outputs. Think of it as the answer key for your system.

You build a golden dataset by collecting representative inputs to your system and having a domain expert define what the correct output looks like for each one. If your application summarizes reports, an expert reviews actual reports and writes what a correct summary should capture. If your application generates recommendations, an expert defines what a good recommendation includes for each scenario.
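One possible shape for a golden dataset is a list of small records, one per input. The sketch below is an assumption, not a standard format: the field names (`must_include`, `must_not_include`, `reviewed_by`) and the sample rows are hypothetical, and the data is stored as JSON Lines only because it is easy to review in a diff.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str                         # a real, representative input
    must_include: tuple[str, ...]           # what the expert says a correct output captures
    must_not_include: tuple[str, ...] = ()  # claims the source does not support
    reviewed_by: str = ""                   # who verified the expected output


def load_golden(jsonl: str) -> list[GoldenCase]:
    """Parse one JSON object per non-empty line into GoldenCase records."""
    cases = []
    for line in jsonl.splitlines():
        if not line.strip():
            continue
        raw = json.loads(line)
        cases.append(GoldenCase(
            case_id=raw["id"],
            input_text=raw["input"],
            must_include=tuple(raw["must_include"]),
            must_not_include=tuple(raw.get("must_not_include", [])),
            reviewed_by=raw.get("reviewed_by", ""),
        ))
    return cases


# Invented sample rows; in practice these come from a file the expert maintains.
SAMPLE = """
{"id": "report-001", "input": "Q3 report text", "must_include": ["revenue", "churn"], "must_not_include": ["layoffs"], "reviewed_by": "analyst-a"}
{"id": "report-002", "input": "Incident report text", "must_include": ["root cause"]}
"""

golden = load_golden(SAMPLE)
print(len(golden), golden[0].case_id)
```

Keeping the reviewer's name next to each case makes it easier to trace an expected output back to the person who defined it.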

Two grading methods:

  • Deterministic checks compare outputs to expected values directly. Did the system extract the right category? Does the summary mention all three key findings? These are binary (pass or fail), fast, and reproducible.
  • AI-graded rubrics use a language model to evaluate subjective qualities. Is the summary clear? Does it accurately represent the source? Is anything misleading? These handle nuance that string comparison cannot, but they are slower and need periodic calibration against human judgment.
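A deterministic check can be as small as a normalized string comparison. The following sketch assumes summaries are plain text and that an expert has listed the phrases a correct output must contain; the function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not fail a check."""
    return re.sub(r"\s+", " ", text).strip().lower()


def category_matches(actual: str, expected: str) -> bool:
    """Did the system extract the right category?"""
    return normalize(actual) == normalize(expected)


def mentions_all(summary: str, findings: list[str]) -> bool:
    """Does the summary mention every key finding?"""
    text = normalize(summary)
    return all(normalize(f) in text for f in findings)


def adds_no_forbidden_claims(summary: str, forbidden: list[str]) -> bool:
    """Does the summary avoid claims the source does not support?"""
    text = normalize(summary)
    return not any(normalize(f) in text for f in forbidden)


summary = "Revenue grew\n12% and churn fell."
print(category_matches("  Billing ", "billing"))
print(mentions_all(summary, ["revenue grew 12%", "churn fell"]))
print(adds_no_forbidden_claims(summary, ["layoffs"]))
```

Substring matching is deliberately crude: a correct paraphrase will fail it. That limitation is one reason to keep expected phrases short and specific, such as a number or a named entity.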

The practical pattern: start with deterministic checks for everything that has a verifiable answer. Layer AI-graded rubrics only for subjective dimensions like clarity and completeness. In many applications, a large share of the eval surface can be covered deterministically.
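The layering can be sketched as a two-stage grader: run the cheap deterministic check first, and call the model-based judge only when it passes. The `Judge` signature, the rubric wording, and `stub_judge` are all assumptions; the stub exists only so the example runs without a model, and a real judge would wrap whichever model API you use. `agreement_rate` is one simple way to do the periodic calibration against human judgment.

```python
from typing import Callable

# Assumed interface: (source, summary, criterion) -> True if the criterion is met.
Judge = Callable[[str, str, str], bool]

RUBRIC = (
    "The summary is clear.",
    "The summary accurately represents the source.",
    "Nothing in the summary is misleading.",
)


def grade(source: str, summary: str, required: list[str], judge: Judge) -> dict:
    missing = [t for t in required if t.lower() not in summary.lower()]
    if missing:
        # Deterministic failure: skip the slower judge call entirely.
        return {"passed": False, "stage": "deterministic", "detail": missing}
    failed = [c for c in RUBRIC if not judge(source, summary, c)]
    return {"passed": not failed, "stage": "rubric", "detail": failed}


def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Calibration: fraction of cases where the judge agrees with a human grader."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def stub_judge(source: str, summary: str, criterion: str) -> bool:
    """Placeholder judge so the sketch runs offline; replace with a model call."""
    return len(summary.split()) >= 3


print(grade("source text", "Revenue grew strongly", ["revenue"], stub_judge))
print(grade("source text", "Short", ["revenue"], stub_judge))
print(agreement_rate([True, False, True, True], [True, True, True, True]))
```

If the agreement rate drops on a sample of human-graded cases, that is a signal to revisit the rubric wording or the judge prompt before trusting its scores.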

In Your AI Assistant

You can build an eval harness with your AI coding assistant. Describe your inputs and expected outputs, and ask it to generate a test suite that compares actual output against the golden dataset. Start with deterministic checks: "Does the output contain these key terms? Does it match this expected structure?" Add AI-graded rubrics later for subjective quality dimensions.
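The kind of harness an assistant might produce can be quite small. This is a minimal sketch under stated assumptions: `fake_system` stands in for your real AI-backed function, the golden cases are invented, and the `expected_terms` field name is hypothetical.

```python
from typing import Callable


def run_evals(system: Callable[[str], str], golden: list[dict]) -> dict:
    """Run the system on every golden input and record which key terms are missing."""
    results = []
    for case in golden:
        output = system(case["input"])
        missing = [
            term for term in case["expected_terms"]
            if term.lower() not in output.lower()
        ]
        results.append({"id": case["id"], "passed": not missing, "missing": missing})
    passed = sum(r["passed"] for r in results)
    return {"passed": passed, "total": len(results), "results": results}


def fake_system(text: str) -> str:
    """Stand-in for the AI step: 'summarizes' by keeping only the first sentence."""
    return "Summary: " + text.split(".")[0] + "."


GOLDEN = [
    {"id": "case-1",
     "input": "Revenue grew 12%. Costs were flat.",
     "expected_terms": ["revenue grew 12%"]},
    {"id": "case-2",
     "input": "The outage lasted two hours. The root cause was a bad config.",
     "expected_terms": ["root cause"]},
]

report = run_evals(fake_system, GOLDEN)
print(f"{report['passed']}/{report['total']} cases passed")
for r in report["results"]:
    if not r["passed"]:
        print(r["id"], "is missing:", r["missing"])
```

Passing the system in as a callable keeps the harness independent of how the output is produced, so the same golden dataset can be rerun after a prompt or model change.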

Try It

If your application has any AI-generated output (summaries, descriptions, recommendations, analysis), you can start building an eval right now.

Ask your AI coding assistant:

My application generates [describe your AI-generated output].
I want to build an eval harness that checks whether the output
is correct, not just whether the code runs.

Help me:
1. Identify 3-5 representative inputs I could use as test cases
2. Define what "correct" output looks like for each one
3. Write deterministic checks that verify the output captures
   the key information

In Your AI Assistant

This works best when you start small. Pick one AI-generated output, define what "correct" means for three to five inputs, and build deterministic checks. You can expand the golden dataset over time as you discover edge cases.
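Growing the dataset can be a one-function job. The sketch below assumes golden cases are plain dictionaries with hypothetical `id`, `input`, and `expected_terms` fields; it rejects incomplete or duplicate entries so that each newly discovered edge case arrives with a definition of "correct".

```python
import json


def add_case(golden: list[dict], case: dict) -> list[dict]:
    """Return a new list with `case` appended, rejecting incomplete or duplicate entries."""
    for key in ("id", "input", "expected_terms"):
        if not case.get(key):
            raise ValueError(f"golden case is missing {key!r}")
    if any(existing["id"] == case["id"] for existing in golden):
        raise ValueError(f"duplicate golden case id: {case['id']}")
    return golden + [case]


def to_jsonl(golden: list[dict]) -> str:
    """Serialize one case per line so additions show up as single-line diffs."""
    return "\n".join(json.dumps(c, sort_keys=True) for c in golden)


golden = [
    {"id": "case-1", "input": "Q3 report text", "expected_terms": ["revenue"]},
]

# An edge case found in production: an input with no findings to report.
golden = add_case(golden, {
    "id": "case-2",
    "input": "Report with no notable changes",
    "expected_terms": ["no notable changes"],
})
print(to_jsonl(golden))
```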

What if my application has no AI-generated output?

If your application does not generate summaries, recommendations, or analysis, evals may not apply to your product today. But the concept still applies to your development process. Any time AI generates code, documentation, or configuration for your project, you can evaluate whether the output is correct. The advanced track explores this further: evaluating not just your product but the skills and processes that build it.
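As one illustration of evaluating AI-generated project artifacts, the sketch below checks a generated JSON configuration against keys and types you already know it must have. The config keys (`timeout_seconds`, `retries`) and the function name are hypothetical, and this covers only the deterministic part of such an eval.

```python
import json


def check_generated_config(raw: str, required: dict[str, type]) -> list[str]:
    """Return a list of problems found in AI-generated JSON config (empty means OK)."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top level is not an object"]
    problems = []
    for key, expected_type in required.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"wrong type for {key}")
    return problems


REQUIRED = {"timeout_seconds": int, "retries": int}

generated = '{"timeout_seconds": 30, "retries": "3"}'
print(check_generated_config(generated, REQUIRED))
```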

Key Insight

Tests verify behavior: the code does what you told it to do. Evals verify correctness: the output says what it should say. Golden datasets are the answer key: expert-verified input/output pairs that define what "correct" looks like. Start with deterministic checks for verifiable answers, then layer AI-graded rubrics for subjective quality. The same discipline you applied to acceptance criteria (define "done" before you build) extends to evals: define "correct" before you ship.