Jun 26, 2026

You Can't Unit-Test an AI Agent. You Can Regression-Gate It.

I run 32 published scrapers. 2,190 production runs between them. Every one of those is deterministic code, and I test it the way you test deterministic code: feed it a fixture, assert parsed == expected, done. Same input, same output, forever.

Then you bolt an LLM step onto the end. A summary pass over the scraped thread, an extraction over the page. And the first test you write for it is red on the first run. Not because the agent is wrong. Because it phrased the answer differently than the string you pinned.

That is the whole trap. The reflex from deterministic code, assert output == golden, is the wrong tool for a non-deterministic output, and most teams discover this by writing the test, watching it flake, and deleting it. Then they ship the agent with no test at all.

You can’t unit-test the text. You can regression-gate the invariants inside it. That distinction is the article, and the stdlib-only script further down shows the difference and then shows where it breaks.

TL;DR

A strict assert agent_output == golden fails on correct answers that were merely reworded or reordered. In the demo below, it fails all 6 runs, including the 2 that are correct.
A rubric gate asserts invariants instead: facts that must be present (must-include) and strings that must never appear (must-exclude, plus secret shapes). It passes 3 of 6 and catches all 3 real regressions.
It is a tripwire, not a proof. The demo prints a 6th run that is semantically wrong but satisfies every rule, and the gate lets it through. On purpose.
The script is stdlib-only and deterministic, and the fixtures are synthetic and labelled. Run it twice, get the same bytes.

Why the unit test goes red on a correct answer

Here is the captured output of one agent on one task: summarize a Trustpilot refund-dispute thread. Two runs, both correct:

Run 1: The thread is about a refund the customer never received. They were told the 14 days return window had closed, and shipping was the sticking point.

Run 2: Shipping is the core complaint. The customer says the 14 days window should not apply, and they are still owed a refund.

Same facts. Refund never issued, 14-day window, shipping is the friction. A human grading these would pass both. A string equality check fails both, because neither matches the golden string you happened to capture the day you wrote the test.
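A minimal sketch of that failure, using the two runs above and the golden string from the demo script. The helper name carries_all_facts is illustrative, and it uses a plain lowercase substring check, which is looser than the word-bounded match the real gate uses.

```python
# The golden string the naive test was pinned to (one specific wording).
GOLDEN = (
    "The customer is disputing a refund that was never issued. They were told "
    "the 14 days return window had passed, and shipping is the main point of "
    "contention."
)
RUN1 = (
    "The thread is about a refund the customer never received. They were "
    "told the 14 days return window had closed, and shipping was the sticking "
    "point."
)
RUN2 = (
    "Shipping is the core complaint. The customer says the 14 days window "
    "should not apply, and they are still owed a refund."
)
FACTS = ["refund", "14 days", "shipping"]


def carries_all_facts(text):
    """True when every required fact appears somewhere in the text."""
    lowered = text.lower()
    return all(fact in lowered for fact in FACTS)


for name, text in (("run1", RUN1), ("run2", RUN2)):
    print(name, "| equals golden:", text == GOLDEN,
          "| carries all facts:", carries_all_facts(text))
# run1 | equals golden: False | carries all facts: True
# run2 | equals golden: False | carries all facts: True
```

Both runs fail the equality check and both carry every fact a human grader would look for.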

This is not a temperature-zero problem you can config your way out of. Reword the prompt, bump the model version, change one upstream token in the context, and the surface text moves. OpenAI’s own evaluation guide says it plainly: “Generative AI is variable. Models sometimes produce different output from the same input, which makes traditional software testing methods insufficient for AI architectures.” The International AI Safety Report 2026 (arXiv 2602.21012, chaired by Yoshua Bengio, February 2026) makes the broader version of the point: there is an evaluation gap, where performance on pre-deployment tests does not reliably predict real-world behavior. Different scale, same root. The output you are asserting against is not a thing you control byte-for-byte.

So == is out. The instinct after that is usually “well, then I can’t test it,” and that is where the gate gets deleted.

The move: assert the invariants, not the string

You don’t control the wording. You do control what the answer is required to carry and what it must never carry. Those are invariants, and invariants are deterministic even when the text around them is not.

For the refund thread, three facts have to survive every rephrasing or it is a regression:

the word refund
the 14 days window
shipping

And some strings must never appear, no matter how the model phrases things:

a refusal boilerplate like “I cannot assist” (the model bailed instead of answering)
a secret shape like sk- or AKIA (a key leaked out of the context into the answer)

Now the verdict is deterministic again. Not “does the text match,” but “are the required facts present and are the forbidden strings absent.” The subject is non-deterministic. The gate over it is not.

That is a regression gate, not a unit test. It does not assert the agent is right. It asserts the agent did not break a rule you already know about. For catching regressions between prompt versions, that is exactly the question you want answered.

The gate script

Stdlib only: re and nothing else. No network, no clock, no randomness. The six runs are static strings, so the script prints the same bytes every time. They are synthetic fixtures with ground-truth labels, not a dump of a real agent. I want the demo reproducible on your machine, which means no live sampling.

"""regression_gate.py — a deterministic gate over a non-deterministic agent.

You cannot `assert agent_output == golden` on an LLM step: the text gets
rephrased, reordered, and re-politened on every run, so a strict `==` fails
on answers that are actually fine. This compares two approaches on the SAME
six captured runs of one agent task:

  NAIVE  : strict `output == golden` (the reflex from deterministic code)
  RUBRIC : must-include invariants must all be present, no must-exclude
           phrase may appear (both case-insensitive, word-bounded), AND no
           secret-shaped prefix may appear (prefix match, left-bounded only).

The rubric gate is a TRIPWIRE: it catches when a declared invariant breaks.
It does NOT prove the answer is correct. run6 is a wrong answer that still
satisfies every rule, and it is printed below as a PASS — on purpose.

stdlib only (re). No network, no RNG, no clock, no subprocess, no env reads.
The six runs are STATIC strings, not sampled live, so the output is
byte-for-byte reproducible. Synthetic fixtures, not a real agent log.
"""

import re

# ---------------------------------------------------------------------------
# The rubric for one task: "summarize this Trustpilot refund-dispute thread".
#   must_include      = facts the answer is REQUIRED to carry (regression if
#                       dropped).
#   must_exclude      = phrases that must NEVER appear (refusal boilerplate).
#                       Matched word-bounded and as FULL phrases: the rule is
#                       "I cannot assist", not a bare "cannot", so it does not
#                       fire on a legit "cannot be refunded".
#   forbidden_prefix  = secret SHAPES that must never appear (sk-..., AKIA...).
#                       Matched as a prefix, NOT word-bounded: a leaked key has
#                       no clean right edge, it runs straight into junk.
# Splitting these two is the whole point: a word-bounded check would let a
# leaked "sk-live-7Qd..." walk past, because the right boundary never lands.
# ---------------------------------------------------------------------------
RUBRIC = {
    "must_include": ["refund", "14 days", "shipping"],
    "must_exclude": ["I cannot assist", "I'm unable to help"],
    "forbidden_prefix": ["sk-", "AKIA"],
}

# ---------------------------------------------------------------------------
# Six CAPTURED runs of the same agent on the same input. Same task, different
# wording / different failure, exactly what you get across prompt versions and
# temperatures. These are hand-written fixtures, labelled with ground truth.
# ---------------------------------------------------------------------------
RUNS = [
    # run1 — correct, rephrased differently from the golden string.
    ("run1",
     "The thread is about a refund the customer never received. They were "
     "told the 14 days return window had closed, and shipping was the sticking "
     "point.",
     "correct, reworded"),

    # run2 — correct, but sentences in a different ORDER.
    ("run2",
     "Shipping is the core complaint. The customer says the 14 days window "
     "should not apply, and they are still owed a refund.",
     "correct, reordered"),

    # run3 — REGRESSION: dropped a required fact ("14 days" missing).
    ("run3",
     "Customer wants a refund and is unhappy about shipping costs on the "
     "returned item.",
     "regression: dropped '14 days'"),

    # run4 — REGRESSION: model bailed into a boilerplate refusal.
    ("run4",
     "I cannot assist with summarizing this content. Please consult the "
     "original thread for refund and shipping details within 14 days.",
     "regression: refusal boilerplate"),

    # run5 — REGRESSION: a secret from the context leaked into the answer.
    ("run5",
     "Summary: refund dispute, 14 days window, shipping. (debug api_key="
     "sk-live-7Qd2mZ leaked from context)",
     "regression: secret leak"),

    # run6 — HONEST FLOOR: every required word is present, no forbidden string,
    # so the rubric PASSES it. But the meaning is inverted: it claims the
    # refund was ISSUED, which the thread never says. Token-present, sense-wrong.
    ("run6",
     "The refund was already issued within 14 days and shipping was fully "
     "covered, so the customer has no remaining complaint.",
     "WRONG meaning, but all tokens present"),
]

# The golden string the naive test was written against (one specific wording).
GOLDEN = (
    "The customer is disputing a refund that was never issued. They were told "
    "the 14 days return window had passed, and shipping is the main point of "
    "contention."
)


def run_naive(output, golden):
    """The reflex from deterministic code: exact string equality."""
    return output == golden


def _present(needle, haystack):
    """Word-bounded, case-insensitive substring check.

    Word boundaries stop 'refund' from matching inside 'refundable'. They do
    NOT stop a bare 'cannot' rule from firing on 'cannot be refunded', which
    is why the refusal rules are full phrases such as 'I cannot assist'.
    """
    pattern = r"(?<![a-z0-9])" + re.escape(needle.lower()) + r"(?![a-z0-9])"
    return re.search(pattern, haystack.lower()) is not None


def run_gated(output, rubric):
    """Deterministic verdict over a non-deterministic answer.

    PASS iff every must_include is present, no must_exclude phrase appears, and
    no forbidden_prefix (secret shape) appears. Reasons name the invariant that
    fired. Returns (passed, reasons).
    """
    reasons = []
    for needle in rubric["must_include"]:
        if not _present(needle, output):
            reasons.append("missing required: " + repr(needle))
    for needle in rubric["must_exclude"]:
        if _present(needle, output):
            reasons.append("forbidden present: " + repr(needle))
    for prefix in rubric["forbidden_prefix"]:
        # Prefix match only: a secret has a clean LEFT edge but bleeds into
        # junk on the right, so we do not require a trailing word boundary.
        if re.search(r"(?<![a-z0-9])" + re.escape(prefix.lower()),
                     output.lower()) is not None:
            reasons.append("secret-shaped token present: " + repr(prefix))
    return (len(reasons) == 0, reasons)


def main():
    print("=" * 64)
    print("regression gate: one task, six captured agent runs")
    print("=" * 64)

    naive_pass = 0
    rubric_pass = 0
    rubric_fail_lines = []

    for name, output, truth in RUNS:
        if run_naive(output, GOLDEN):
            naive_pass += 1
        passed, reasons = run_gated(output, RUBRIC)
        if passed:
            rubric_pass += 1
        else:
            rubric_fail_lines.append("  " + name + " FAIL  " + reasons[0])

    print()
    print("NAIVE  exact-match gate:  %d/%d PASS" % (naive_pass, len(RUNS)))
    print("  the golden string matches 0 of the 6, including 2 correct answers")
    print("  that were merely reworded (run1) or reordered (run2). A test this")
    print("  brittle gets deleted by the third red build.")
    print()
    print("RUBRIC regression gate:   %d/%d PASS" % (rubric_pass, len(RUNS)))
    for line in rubric_fail_lines:
        print(line)
    print()
    print("It caught the three real regressions: a dropped fact, a refusal,")
    print("and a leaked secret. It passed run1 and run2, the reworded and")
    print("reordered correct answers the naive test failed.")
    print()
    print("-" * 64)
    print("HONEST FLOOR — run6 PASSED and is semantically WRONG:")
    print("-" * 64)
    print('  run6 says the refund "was already issued" — the thread never does.')
    print("  Every required word is present and nothing forbidden appears,")
    print("  so the rubric lets it through. The gate checks token PRESENCE,")
    print("  not MEANING. A wrong answer that name-drops the required words")
    print("  slips past.")
    print()
    print("  This invariant gate catches regressions on the rules you declared.")
    print("  It does NOT prove the answer is correct: semantic errors that")
    print("  satisfy the rubric pass. Put an LLM-judge or a human spot-check")
    print("  above it. It is a tripwire, not a proof of correctness.")


if __name__ == "__main__":
    main()

Run it with python3 -I regression_gate.py. This is the exact, unedited output:

================================================================
regression gate: one task, six captured agent runs
================================================================

NAIVE  exact-match gate:  0/6 PASS
  the golden string matches 0 of the 6, including 2 correct answers
  that were merely reworded (run1) or reordered (run2). A test this
  brittle gets deleted by the third red build.

RUBRIC regression gate:   3/6 PASS
  run3 FAIL  missing required: '14 days'
  run4 FAIL  forbidden present: 'I cannot assist'
  run5 FAIL  secret-shaped token present: 'sk-'

It caught the three real regressions: a dropped fact, a refusal,
and a leaked secret. It passed run1 and run2, the reworded and
reordered correct answers the naive test failed.

----------------------------------------------------------------
HONEST FLOOR — run6 PASSED and is semantically WRONG:
----------------------------------------------------------------
  run6 says the refund "was already issued" — the thread never does.
  Every required word is present and nothing forbidden appears,
  so the rubric lets it through. The gate checks token PRESENCE,
  not MEANING. A wrong answer that name-drops the required words
  slips past.

  This invariant gate catches regressions on the rules you declared.
  It does NOT prove the answer is correct: semantic errors that
  satisfy the rubric pass. Put an LLM-judge or a human spot-check
  above it. It is a tripwire, not a proof of correctness.

Reading the result

The naive gate scores 0 of 6. The golden string did not match a single captured run, and two of those runs are correct answers. A test that is red on every build carries no signal, which is why it gets deleted.

The rubric gate scores 3 of 6, and the three failures are the three things you actually want a test to catch:

run3 dropped the 14-day window. A required fact went missing between prompt versions. That is a real regression, and the gate named it: missing required: '14 days'.
run4 is a refusal. The model gave up and returned boilerplate instead of a summary. The gate caught the forbidden phrase.
run5 leaked a key. An sk-live-... token from the context bled into the answer. The gate caught the secret shape.

And the two it passed, run1 and run2, are the reworded and reordered correct answers the naive test choked on. That is the trade you wanted: stop failing on wording, start failing on broken invariants.

One detail in the code earns its place. The secret check is a separate list from the refusal phrases, and it is matched as a prefix, not word-bounded. I learned that the hard way building this: when I had sk- in the word-bounded must-exclude list, run5 passed. The trailing word boundary never landed, because sk- runs straight into live-7Qd2mZ with no clean right edge. A leaked key is not a tidy word. So the secret shapes get their own prefix match, and that one split is the difference between the leak being caught and being missed. The left edge still requires a non-alphanumeric boundary, so a key fused straight onto a preceding letter, like apikeysk-live..., would still slip past. In production you could drop the left lookbehind on the prefix list too, at the cost of sk- then firing inside ordinary words such as risk-based. Small bug, exactly the kind a rubric gate exists to surface, with a smaller one waiting right behind it.
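The three matching strategies in that paragraph can be compared side by side. This is a sketch, not part of the demo script: word_bounded, left_bounded_prefix and bare_prefix are hypothetical helper names, and the fused and benign strings are made-up inputs chosen to hit each edge.

```python
import re


def word_bounded(needle, text):
    """Both edges must be non-alphanumeric (what must_exclude uses)."""
    pattern = r"(?<![a-z0-9])" + re.escape(needle.lower()) + r"(?![a-z0-9])"
    return re.search(pattern, text.lower()) is not None


def left_bounded_prefix(prefix, text):
    """Only the left edge is checked (what forbidden_prefix uses)."""
    pattern = r"(?<![a-z0-9])" + re.escape(prefix.lower())
    return re.search(pattern, text.lower()) is not None


def bare_prefix(prefix, text):
    """No edge checks at all: plain substring."""
    return prefix.lower() in text.lower()


leak = "debug api_key=sk-live-7Qd2mZ leaked from context"
fused = "apikeysk-live-7Qd2mZ"          # hypothetical: key glued to a word
benign = "a risk-based refund policy"   # hypothetical: no secret at all

print(word_bounded("sk-", leak))           # False: 'l' follows, leak missed
print(left_bounded_prefix("sk-", leak))    # True:  '=' precedes, leak caught
print(left_bounded_prefix("sk-", fused))   # False: 'y' precedes, leak missed
print(bare_prefix("sk-", fused))           # True:  fused leak caught
print(bare_prefix("sk-", benign))          # True:  false alarm on "risk-based"
print(left_bounded_prefix("sk-", benign))  # False: no false alarm
```

No single strategy wins on all three inputs, which is the trade the paragraph describes: each tightening of one edge opens a miss or a false alarm on the other.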

The case it misses, and I am printing it

run6 passed. run6 is wrong.

It says the refund was already issued and shipping was fully covered and the customer has no complaint. The thread says the opposite: the refund was never issued. The summary inverted the meaning. But it contains the word “refund,” it contains “14 days,” it contains “shipping,” and it contains no refusal and no secret. Every invariant holds. The gate passes it.

This is the floor of the method, and it is printed in the same stdout as the wins, on purpose. A rubric checks token presence, not meaning. A wrong answer that name-drops the required words walks straight through. If I only showed you the 3 caught regressions, I would be selling you the gate as a correctness proof. It is not one.
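A small tally makes that scope concrete. The sketch below hard-codes the ground-truth labels and the gate verdicts from the output above; framing them as precision and recall is an illustration added here, not something the demo script computes.

```python
# Ground truth from the fixture labels: run6 is wrong even though it passed.
TRULY_BAD = {"run3", "run4", "run5", "run6"}
# Verdicts copied from the printed output of the rubric gate.
GATE_FAILED = {"run3", "run4", "run5"}

caught = TRULY_BAD & GATE_FAILED        # bad runs the gate failed
missed = TRULY_BAD - GATE_FAILED        # bad runs the gate passed
false_alarms = GATE_FAILED - TRULY_BAD  # good runs the gate failed

precision = len(caught) / len(GATE_FAILED)
recall = len(caught) / len(TRULY_BAD)

print("caught:      ", sorted(caught))
print("missed:      ", sorted(missed))
print("false alarms:", sorted(false_alarms))
print("precision: %.2f  recall: %.2f" % (precision, recall))
# caught:       ['run3', 'run4', 'run5']
# missed:       ['run6']
# false alarms: []
# precision: 1.00  recall: 0.75
```

On these six fixtures every failure the gate raises is real, and one bad answer in four gets through. Six hand-written strings are not a benchmark, so read the numbers as a description of this demo and nothing more.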

So be precise about what this buys you. The invariant gate catches regressions on the rules you declared. It does not prove the answer is correct. Semantic errors that satisfy the rubric pass. That sentence is the whole honest scope of the tool, and it is why the verdict above prints it rather than hiding it.

There are three more edges worth naming before you ship this:

The rubric is hand-maintained. Who writes the must-include list, and does it drift as the product changes? If the task grows a fourth required fact and nobody updates the rubric, the gate goes quietly stale. That is an open question, not a solved one.
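Substring matching has false-positive risk. I word-bounded the includes so “refund” does not fire inside “refundable,” and I wrote the refusal rules as full phrases, because a bare “cannot” rule would fire on a legitimate “cannot be refunded” even with word boundaries. That covers the obvious cases. It does not cover everything: word-bounding cuts both ways, so an answer that says “refunded” no longer satisfies a required “refund,” and a careless rule will flag a legitimate answer.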
This is not a replacement for the integration tests around the agent. The parsing, the retries, the IO, the tool calls: that is deterministic code, and it gets the strict assert == it deserves. The rubric gate is only for the one non-deterministic text output at the end.
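The matching edge in the list above is easy to reproduce. The sketch below copies the word-bounded check from the demo script under the name present; any_present and the idea of listing accepted variants per fact are assumptions added for illustration, not something the script does.

```python
import re


def present(needle, haystack):
    """Word-bounded, case-insensitive check (same pattern as the demo)."""
    pattern = r"(?<![a-z0-9])" + re.escape(needle.lower()) + r"(?![a-z0-9])"
    return re.search(pattern, haystack.lower()) is not None


def any_present(variants, haystack):
    """Hypothetical extension: a fact counts if any listed variant appears."""
    return any(present(variant, haystack) for variant in variants)


answer = ("The customer was never refunded; the 14 days window and "
          "shipping are both disputed.")
policy = "This order cannot be refunded."

# Word-bounding cuts both ways: "refunded" does not satisfy "refund".
print(present("refund", answer))                               # False
print(any_present(("refund", "refunded", "refunds"), answer))  # True

# A bare "cannot" rule fires on a legitimate sentence; the full phrase does not.
print(present("cannot", policy))           # True
print(present("I cannot assist", policy))  # False
```

Listing variants trades one maintenance problem for another: the rubric gets more forgiving, and also longer to keep current.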

What sits above this gate is the semantic check the gate cannot do: an LLM-as-a-judge pass, or a human spot-check on a sample. The rubric is the cheap, fast, deterministic first layer that fails the build on a leaked key or a dropped fact in milliseconds. The expensive judge runs on what survives. Layered, not either-or.
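One way to wire the two layers together, as a sketch under stated assumptions: layered_verdict, stub_gate and stub_judge are hypothetical names, and the stub judge is a one-line placeholder standing in for an LLM-as-a-judge call or a human review, not a real semantic check.

```python
def layered_verdict(output, cheap_gate, judge):
    """Run the cheap deterministic gate first; call the judge only on survivors."""
    passed, reasons = cheap_gate(output)
    if not passed:
        return False, reasons  # the judge is never invoked
    if not judge(output):
        return False, ["judge rejected the answer"]
    return True, []


def stub_gate(output):
    """Simplified stand-in for the rubric gate: secret-prefix check only."""
    if "sk-" in output.lower():
        return False, ["secret-shaped token present: 'sk-'"]
    return True, []


judge_calls = []


def stub_judge(output):
    """Placeholder for the expensive layer. Records each call it receives."""
    judge_calls.append(output)
    return "already issued" not in output  # toy rule, NOT a real judge


outputs = [
    "Summary: refund dispute, 14 days window, shipping. api_key=sk-live-7Qd2mZ",
    "The refund was already issued within 14 days and shipping was covered.",
    "The thread is about a refund the customer never received.",
]
results = [layered_verdict(text, stub_gate, stub_judge) for text in outputs]
for result in results:
    print(result)
print("judge calls:", len(judge_calls))
# (False, ["secret-shaped token present: 'sk-'"])
# (False, ['judge rejected the answer'])
# (True, [])
# judge calls: 2
```

The leaked-key answer never reaches the judge, so the expensive layer only pays for answers that already cleared the declared rules.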

What to do with this on Monday

If you have an agent step with no test because the obvious test flaked and got deleted, you do not need the judge yet. You need the tripwire.

Pick one task. Write down the three or four facts the answer must always carry, and the handful of strings it must never carry: refusal boilerplate, secret prefixes, a competitor’s name, whatever your domain forbids. That is your rubric. Capture a few real outputs, run the gate, watch it pass the reworded-but-correct ones and fail the broken ones. It is one stdlib-only script and it runs offline.
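For the gate to fail a build it has to return a non-zero exit code, which the demo script does not do. A minimal sketch of that last step follows; broken_rules and gate_exit_code are hypothetical names, the matching is a plain substring check rather than the word-bounded one, and the two captured strings are taken from the fixtures.

```python
def broken_rules(output, must_include, must_exclude):
    """Return the list of declared rules this output breaks (empty = pass)."""
    text = output.lower()
    reasons = []
    for needle in must_include:
        if needle.lower() not in text:
            reasons.append("missing required: " + repr(needle))
    for needle in must_exclude:
        if needle.lower() in text:
            reasons.append("forbidden present: " + repr(needle))
    return reasons


def gate_exit_code(captured, must_include, must_exclude):
    """Print one line per failing output; return 1 if any failed, else 0."""
    failures = 0
    for name, output in captured:
        reasons = broken_rules(output, must_include, must_exclude)
        if reasons:
            failures += 1
            print(name + " FAIL  " + "; ".join(reasons))
    return 1 if failures else 0


CAPTURED = [
    ("ok", "Shipping is the core complaint. The customer says the 14 days "
           "window should not apply, and they are still owed a refund."),
    ("dropped", "Customer wants a refund and is unhappy about shipping costs "
                "on the returned item."),
]
MUST_INCLUDE = ["refund", "14 days", "shipping"]
MUST_EXCLUDE = ["I cannot assist", "sk-"]

code = gate_exit_code(CAPTURED, MUST_INCLUDE, MUST_EXCLUDE)
print("exit code:", code)
# dropped FAIL  missing required: '14 days'
# exit code: 1
```

In a CI job the returned value would be passed to sys.exit so that a broken invariant turns the build red.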

It will not tell you the agent is right. It will tell you the moment the agent stopped doing something it used to do, which is the entire job of a regression test, and it is more than the zero tests most agent steps ship with.

This is the same shape as a validator I wrote for a different problem: a deterministic verdict over a subject you do not control. There the subject was an HTTP response that returned 200 OK with garbage in the body. Here it is the agent’s final text. Same instinct, different layer.

Follow for the numbers from the next batch of runs. And tell me: what is the one invariant you would put in a must-exclude list for your agent, the string that should never, ever show up in an answer? I read every comment.

Written with AI assistance (the demo, fixtures, and prose were drafted with an LLM and then run, verified, and edited by me). The six runs are synthetic fixtures, not a real agent log, and they are labelled as such in the code. The 2,190-run figure is from my own Apify actor dashboard.
