Your AI Agent Trusts a 200 OK. I Logged How Often the Page Was Garbage


Yesterday I handed an agent a web_fetch tool. It fetched a page, got back a 200 and a screenful of text, and confidently built a plan on it. The text was a Cloudflare “Just a moment…” screen. The agent never noticed.

That’s the failure I want to fix today. Not the fetch. The trust. Your tool returns status=200 and a non-empty string, and your agent treats that string as “what the page said.” Most of the time it is. Sometimes it’s a challenge wall, an empty shell, or a body cut off mid-stream, and the agent reasons on garbage with full confidence and zero error.

So here’s a 40-line gate that sits between the fetch tool and the agent and answers one question: is this blob usable as content at all? It tags every fetch OK / BLOCKED / EMPTY_SHELL / TRUNCATED before your model ever reads it. Real, deterministic output below.

Quick answer: A web-fetch tool’s 200 OK + non-empty body does not mean “usable content.” That body can be an anti-bot challenge, an empty JS shell, an “access denied” notice, or a truncated stream, all served as 200. sanity_check(text, url, status) runs zero network calls and tags the blob OK / BLOCKED / EMPTY_SHELL / TRUNCATED before your model reads it. Garbage becomes an explicit signal, not a silent input to reasoning.

This is for anyone building agents with web access (LangChain, a ReAct loop, an MCP tool, your own while loop) who has watched the model be confidently wrong and couldn’t tell why. The why is often this: the page lied with a 200, and nothing checked.

The artifact first: four verdicts on six blobs

Here’s the real output, copy-pasted, not cleaned up. The gate ran against six fixtures: one real captured page (example.com) and five synthetic bodies I hand-wrote to reproduce failure classes I’ve hit in production. The synthetic ones are labeled synthetic so you don’t mistake them for live pulls.

VERDICT      KIND       URL
----------------------------------------------------------------
OK           real       https://example.com
                        → len=353B ratio=0.57
BLOCKED      synthetic  https://shop.example/product/991
                        → soft-block marker 'just a moment...' (status=200)
BLOCKED      synthetic  https://api.example/v2/orders
                        → soft-block marker 'access denied' (status=200)
EMPTY_SHELL  synthetic  https://app.example/dashboard
                        → visible≈0B ratio=0.00 (markup, no content)
TRUNCATED    synthetic  https://blog.example/crawl-bill
                        → no </html> / mid-tag end …'were blunt: most of the spend was on <sp'
EMPTY_SHELL  synthetic  https://example.com/empty
                        → empty body (status=200)
----------------------------------------------------------------
1 of 6 blobs were usable content  ::  {'OK': 1, 'BLOCKED': 2, 'EMPTY_SHELL': 2, 'TRUNCATED': 1}

Every one of those six rows was a 200. One was content. The other five were the kinds of 200 that look fine to a fetch tool and ruin a downstream plan. The gate gives each a name your agent can branch on instead of a string it has to believe.

I ran this on Python 3.13.5. No third-party imports, no network. Run python3 fetch_sanity.py and you get the same bytes. I checked: two consecutive runs hash to the same md5. That matters, and I’ll come back to why.

Why “200 and not empty” is the wrong success check

Tool contracts lie by omission. A fetch tool’s idea of success is usually two things: the HTTP status was 2xx, and the body wasn’t empty. Both can be true while the body is useless.
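A minimal sketch of that two-part contract, to show how little it rules out. `naive_fetch_ok` is a hypothetical name for whatever your tool does today, and the challenge body is a short hand-written sample, not a live capture.

```python
def naive_fetch_ok(status: int, body: str) -> bool:
    """The usual tool-level idea of success: 2xx status and a non-empty body."""
    return 200 <= status < 300 and bool(body.strip())


# Synthetic challenge page (hand-written for illustration).
challenge = "<html><head><title>Just a moment...</title></head><body></body></html>"

print(naive_fetch_ok(200, challenge))  # True: the wall passes as success
print(naive_fetch_ok(200, ""))         # False: only the literal empty body fails
```

Of the four shapes described next, this check only ever catches the last one.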

I run scrapers in production: roughly 2,190 runs across 32 published actors, the Trustpilot one alone at 962 runs. The failure that cost me the most debugging time wasn’t a 500 or a timeout. Those are loud; you catch them. It was the 200 that came back with a body that wasn’t the page. Four shapes show up again and again:

  • The challenge wall. Cloudflare’s “Just a moment…”, Akamai’s “Access Denied”, a generic “verify you are human”, served with status 200, not 403. The bytes are real HTML. They’re just not your page.
  • The empty shell. A single-page app ships <div id="root"></div> and three script tags. The data renders in a browser. Your raw fetch got the skeleton. (Predicting that before you fetch is its own post; see below.)
  • The truncated body. A size cap, a dropped connection, or a stalled stream cut the response short, and you got the first 8 KB of a 40 KB page. Looks like a page. Ends mid-sentence.
  • The literal empty 200. Some servers answer a blocked or rate-limited request with 200 and a body of "". I once watched a scraper return empty arrays for days because nothing raised an error on a soft-blocked request that came back 200-and-nothing.

Here’s the agentic twist that makes this worse than in a plain scraper. A scraper that gets garbage produces a bad row, and a human eventually eyeballs the table. An agent that gets garbage feeds it straight into its own next decision (calling another tool, writing a summary, answering the user) with no human in the loop. The silent failure compounds. There’s no exception to catch and no retry to trigger, because as far as every layer can tell, the fetch succeeded.

The fix is not smarter prompting. It’s a checkpoint that turns a silent 200 into an explicit verdict before the model reasons.

What the gate actually checks

The gate is a heuristic, and I’d rather hand you a blunt one I trust than a clever one I can’t reason about. It checks three things, in order, and stops at the first hit.

1. Soft-block markers. A short list of strings that mean “this is a wall, not a page”: just a moment..., enable javascript and cookies to continue, attention required, access denied, verify you are human, cf-ray, the captcha vendors. Match any (case-insensitive) and the verdict is BLOCKED, even at status 200, especially at status 200. These are strings I’ve watched come back from real targets dressed as success.

m = _BLOCK_RE.search(low)
if m:
    return "BLOCKED", f"soft-block marker {m.group(0)!r} (status={status})"

2. Visible-text-to-markup ratio. Strip <script> and <style>, strip the remaining tags, measure what readable text is left versus the size of the whole blob. A real article is mostly words. An empty shell is mostly markup, a ratio near zero. The verdict is EMPTY_SHELL in two cases: the body is literally empty, or it’s markup with under ~200 bytes of visible text and a ratio under 0.10. Note the second clause: a short blob that’s almost all readable text (a terse JSON reply, a one-line message) has a high ratio, so it stays OK, not EMPTY_SHELL. The shell verdict is specifically for “lots of markup, almost no words.”
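To see the ratio separate the two cases, here is a small standalone sketch of the same two-pass stripping. The function mirrors the helper in the full script; the two sample bodies are made up for illustration.

```python
import re

# Pass 1 drops <script>/<style> elements with their contents,
# pass 2 drops every remaining tag, then whitespace is collapsed.
SCRIPT_STYLE_RE = re.compile(r"(?is)<(script|style)\b.*?</\1>")
ANY_TAG_RE = re.compile(r"(?s)<[^>]+>")


def visible_ratio(text: str) -> float:
    """Fraction of the blob that is still readable text after stripping markup."""
    if not text:
        return 0.0
    stripped = ANY_TAG_RE.sub(" ", SCRIPT_STYLE_RE.sub(" ", text))
    visible = re.sub(r"\s+", " ", stripped).strip()
    return len(visible) / len(text)


# Illustrative bodies (not captured from a real site).
shell = '<html><body><div id="root"></div><script src="/main.js"></script></body></html>'
terse_json = '{"error": "not found"}'

print(f"{visible_ratio(shell):.2f}")       # 0.00: all markup, nothing to read
print(f"{visible_ratio(terse_json):.2f}")  # 1.00: short, but every byte is text
```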

3. Truncation. If the body opened an <html> tree and never closed it, or ends mid-tag, it got cut off. Verdict TRUNCATED. The reason string echoes the last 40 characters so you can see where it stopped: …'were blunt: most of the spend was on <sp' is a body that died partway through a <span> tag.
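The two truncation signals can be sketched on their own like this. `looks_truncated` is a hypothetical helper name, and the sample bodies are hand-written; the logic follows the check in the full script.

```python
import re

# A "<" followed by a letter with no closing ">" before the end of the body.
MID_TAG_RE = re.compile(r"<[a-z][^>]*$", re.IGNORECASE)


def looks_truncated(body: str) -> bool:
    """True if an <html> tree was opened but never closed, or the body ends mid-tag."""
    low = body.lower()
    unclosed_tree = "<html" in low and "</html>" not in low
    ends_mid_tag = bool(MID_TAG_RE.search(body.rstrip()))
    return unclosed_tree or ends_mid_tag


# Illustrative bodies.
complete = "<html><body><p>All here.</p></body></html>"
cut_tree = "<html><body><p>The numbers were blunt: most of the"
cut_fragment = "<p>most of the spend was on <sp"

print(looks_truncated(complete))      # False
print(looks_truncated(cut_tree))      # True: <html> opened, never closed
print(looks_truncated(cut_fragment))  # True: ends inside a tag
```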

Nothing tripped? OK. The agent may proceed.

That’s the whole decision surface. Notice what it is not doing. It does not validate any field’s value: no checksums, no ranges, no cross-field logic. It does not compare this fetch to a previous one or track a schema over time. It does not decide whether you needed a browser. It answers exactly one question (“is this a page at all?”) and gets out of the way. The narrowness is the feature.

The thresholds are deliberately blunt

200 bytes of visible text. A ratio under 0.10. These aren’t fitted to anything. They’re “is there clearly almost no content here,” with everything above left as OK. Tune them to your traffic: a site that ships a thin-but-real <title> and a 150-byte summary on a page bigger than about 1.5 KB (so the ratio lands under 0.10) will trip EMPTY_SHELL at these numbers, which might be a false alarm for you. Lower the floor. The point isn’t my constants. It’s that “is this usable content” is answerable from bytes you already have, before the model spends a token on them.
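If you do tune them, it may help to pull the two constants out as parameters. The sketch below is a simplified stand-in for the shell check, not the code from the script: `is_empty_shell`, `min_visible`, and `max_ratio` are names I made up, and it takes pre-computed lengths so the arithmetic is easy to follow.

```python
def is_empty_shell(visible_len: int, total_len: int, has_markup: bool,
                   min_visible: int = 200, max_ratio: float = 0.10) -> bool:
    """Shell verdict with both thresholds exposed (defaults match the post)."""
    if total_len == 0 or (visible_len == 0 and not has_markup):
        return True  # literal empty or whitespace-only body
    ratio = visible_len / total_len
    # Both conditions must hold: very little text AND mostly markup.
    return has_markup and visible_len < min_visible and ratio < max_ratio


# A thin-but-real page: 150 visible bytes inside a 2,000-byte document.
print(is_empty_shell(150, 2000, True))                   # True at the defaults
print(is_empty_shell(150, 2000, True, min_visible=100))  # False with a lower floor
print(is_empty_shell(150, 1000, True))                   # False: ratio is 0.15
```

Either constant can clear a false alarm; which one to move depends on whether your real pages are short or just markup-heavy.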

And the soft-block list is a denylist, so it’s never complete. A challenge page with wording I haven’t seen sails through as OK. The list catches the vendors I’ve met across the fleet; it won’t catch a custom wall some site rolls tomorrow. More on that in the failure modes.

Wiring it into an agent loop

The gate is a pure function, so it drops in wherever your tool returns. The pattern: fetch, gate, branch.

text, status = my_web_fetch(url)          # your existing tool
verdict, reason = sanity_check(text, url, status)

if verdict == "OK":
    observation = text                    # let the agent reason on it
else:
    observation = f"[fetch unusable: {verdict}] {reason}"   # tell the agent the truth

The second branch is the whole point. Instead of handing the model a challenge page and hoping it notices, you hand it [fetch unusable: BLOCKED] soft-block marker 'just a moment...'. Now the model knows the observation failed and can do something sane: try a different source, escalate to a browser-based fetch, or tell the user it couldn’t read the page, instead of confidently summarizing a captcha screen.

That’s the move: convert a silent success into a spoken failure. Models are good at reacting to an error they can see. They’re terrible at noticing one nobody told them about.
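One way to package the fetch, gate, branch pattern so it can be exercised offline is to inject both the fetcher and the gate, with an optional second fetcher to escalate to. This is a sketch under assumptions: `gated_observation`, `fake_fetch`, `fake_browser_fetch`, and `toy_gate` are all hypothetical stand-ins, and in real use you would pass your own tool and `sanity_check`.

```python
from typing import Callable, Optional, Tuple

Fetch = Callable[[str], Tuple[str, int]]            # url -> (text, status)
Gate = Callable[[str, str, int], Tuple[str, str]]   # (text, url, status) -> (verdict, reason)


def gated_observation(url: str, fetch: Fetch, gate: Gate,
                      fallback: Optional[Fetch] = None) -> str:
    """Fetch, gate, and retry once through `fallback` if the first blob is unusable."""
    text, status = fetch(url)
    verdict, reason = gate(text, url, status)
    if verdict != "OK" and fallback is not None:
        text, status = fallback(url)
        verdict, reason = gate(text, url, status)
    if verdict == "OK":
        return text
    return f"[fetch unusable: {verdict}] {reason}"


# Offline stand-ins (assumptions for the demo, not real tools).
def fake_fetch(url: str) -> Tuple[str, int]:
    return "<html><title>Just a moment...</title></html>", 200


def fake_browser_fetch(url: str) -> Tuple[str, int]:
    return "Rendered article text.", 200


def toy_gate(text: str, url: str, status: int) -> Tuple[str, str]:
    if "just a moment..." in text.lower():
        return "BLOCKED", f"soft-block marker 'just a moment...' (status={status})"
    return "OK", f"len={len(text)}B"


url = "https://shop.example/product/991"
print(gated_observation(url, fake_fetch, toy_gate))
print(gated_observation(url, fake_fetch, toy_gate, fallback=fake_browser_fetch))
```

The first call returns the explicit failure string; the second returns the fallback's text because that blob passes the gate.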

Where this lives in the chain (and the two siblings)

This gate is one checkpoint in a longer pipeline, and it’s easy to confuse with neighbors, so here are the seams.

If you want to predict whether a page will even come back as a shell before you fetch it, that’s a different tool that reads the raw response shape. I wrote a 30-line probe that tells you if a page needs a browser. That one runs before the fetch. This one runs after: it doesn’t switch renderers or decide how to fetch. It just tells the agent that the blob it got back is a skeleton.

And the fetch tool itself, the thing that produced the 200, is the 60-line MCP web_fetch server I built yesterday. That post ends with an honest warning: it “does not beat anti-bot systems” and returns “nothing useful” on a challenged page. This gate is the answer to “what do you do with that nothing?” You gave the agent eyes; now you teach it not to trust a forgery.

Where this gate is wrong

It’s a heuristic. I’d rather tell you the misses than let you find them in an agent that’s already in front of a user.

  • A challenge page in wording I haven’t met. The denylist catches Cloudflare, Akamai, the common captchas. A site that rolls its own “please hold” page with novel text will pass as OK. There’s no clean way around this short of a model-based classifier, which costs tokens on every fetch: the opposite of the point.
  • A short page that is genuinely the content. A terse HTML status page or a stub doc wrapped in heavy boilerplate markup can trip EMPTY_SHELL when it’s exactly what you asked for. (A bare JSON or plain-text reply has no tags and a high ratio, so it stays OK.) Once the visible text is under the floor and the ratio is under 0.10, the check can’t tell “empty shell” from “small real page.” Tune the floor, or skip the ratio check for endpoints you know return short bodies.
  • A truncated body that happens to close its tags. If the cut-off landed right after a </html> (rare, but possible with a buffered proxy) the truncation check misses it. Length-versus-Content-Length would catch that, but my fetch tool doesn’t always have a reliable Content-Length, so I left it out rather than half-implement it.
  • A truncated body with no markup at all. The truncation check only fires on HTML (it looks for an unclosed <html> tree or a mid-tag cut). A JSON or plain-text response that got chopped mid-array has no tags to read, so it sails through as OK. For JSON endpoints, pair this with a json.loads in a try and treat a parse failure as TRUNCATED yourself.
  • A blocked page with a 403. This gate is for the 200-shaped lie. If your fetch tool already raises on 4xx/5xx (mine does), those never reach here, which is correct. The gate exists for the failures your status check waves through.
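Two of the misses above have cheap add-on checks, sketched here under assumptions. `check_json_body` and `shorter_than_declared` are hypothetical helpers, not part of the script. The first is only meant for endpoints you already know return JSON (a challenge page would also fail to parse). The second assumes your fetch tool can hand you the raw bytes as received, before any Content-Encoding is undone, plus the Content-Length header value if there was one.

```python
import json
from typing import Optional, Tuple


def check_json_body(text: str) -> Tuple[str, str]:
    """For known-JSON endpoints: treat a parse failure as TRUNCATED."""
    if not text.strip():
        return "EMPTY_SHELL", "empty body"
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return "TRUNCATED", f"JSON parse failed at char {exc.pos}: {exc.msg}"
    return "OK", f"len={len(text)}B"


def shorter_than_declared(raw_body: bytes, content_length: Optional[str]) -> bool:
    """True only when a usable Content-Length promised more bytes than arrived."""
    try:
        declared = int(content_length)
    except (TypeError, ValueError):
        return False  # header missing or unparseable: no opinion
    return len(raw_body) < declared


print(check_json_body('{"orders": [1, 2, 3]}')[0])   # OK
print(check_json_body('{"orders": [1, 2,')[0])       # TRUNCATED
print(shorter_than_declared(b"x" * 8192, "40960"))   # True
print(shorter_than_declared(b"x" * 8192, None))      # False
```

Both stay pure functions of data the fetch already produced, so they fit the same no-network rule as the gate.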

That last point is the design line. The gate doesn’t replace your status handling. It catches the class your status handling is structurally blind to: success codes carrying non-content.

Why “no network” isn’t a footnote

The gate makes zero network calls on purpose. sanity_check(text, url, status) is a pure function of its inputs: same blob in, same verdict out. That buys three things. Tests pin a fixture to a verdict and never flake. The output above is reproducible, byte for byte (I checked the md5). And a gate that called out to a live anti-bot site to “confirm” a block would add latency, egress, and a second thing that can fail. The blob already arrived. Everything we need to judge it is in the bytes.
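Here is roughly what "pin a fixture to a verdict" and "hash the output" can look like. It is a sketch, not my test suite: `toy_gate` is a deliberately tiny stand-in so the block runs on its own, and `PINNED` and `report_digest` are names invented for the example. Point the same loop at the real `sanity_check` and fixtures.

```python
import hashlib


def toy_gate(text, url, status):
    """Tiny stand-in for sanity_check so this example is self-contained."""
    if not text.strip():
        return "EMPTY_SHELL", f"empty body (status={status})"
    if "access denied" in text.lower():
        return "BLOCKED", f"soft-block marker 'access denied' (status={status})"
    return "OK", f"len={len(text)}B"


# Each fixture is pinned to the verdict it must always produce.
PINNED = [
    ("<h1>Access Denied</h1>", 200, "BLOCKED"),
    ("", 200, "EMPTY_SHELL"),
    ("Plain readable text.", 200, "OK"),
]


def report_digest(gate) -> str:
    """Render every verdict to text and hash it, so two runs can be compared."""
    lines = []
    for body, status, _expected in PINNED:
        verdict, reason = gate(body, "https://example.com", status)
        lines.append(f"{verdict}\t{reason}")
    return hashlib.md5("\n".join(lines).encode("utf-8")).hexdigest()


for body, status, expected in PINNED:
    verdict, _reason = toy_gate(body, "https://example.com", status)
    assert verdict == expected, (body, verdict)

# No clock, no network, no randomness: the digest is identical on every run.
assert report_digest(toy_gate) == report_digest(toy_gate)
```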

Same discipline as every checkpoint I ship: the browser probe, the schema canary, the field sanity checks are all pure functions over data you already have. It’s what lets the next person re-run them and get my exact result instead of taking my word.

What I’d do on Monday

Put the gate right after your fetch tool returns, before the result becomes an observation.

  1. Gate every fetch, not just the ones that look wrong. The whole problem is that the bad ones look fine. A 200 with a screen of HTML is exactly what a challenge wall looks like.
  2. Feed the verdict to the agent, not just your logs. [fetch unusable: BLOCKED] in the observation lets the model route around the failure. A line in your logs that the model never sees does not.
  3. Tune the soft-block list to your own traffic. Watch what your targets actually send back as a 200 for a week, and add the strings you see. The list in the file is my fleet’s, not yours.
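For step 3, a minimal sketch of extending the list without hand-escaping regex characters. `build_block_re` is a hypothetical helper, `BASE_MARKERS` is just a two-entry subset of the list in the script, and the extra phrases are invented examples of what a week of logs might surface.

```python
import re

# Subset of the script's markers, already written as regex patterns.
BASE_MARKERS = [r"just a moment\.\.\.", r"access denied"]


def build_block_re(extra_phrases):
    """Compile one case-insensitive regex from the base patterns plus literal phrases."""
    escaped = [re.escape(p) for p in extra_phrases]  # additions are matched as plain text
    return re.compile("|".join(BASE_MARKERS + escaped), re.IGNORECASE)


# Made-up strings standing in for whatever your own targets send back.
block_re = build_block_re([
    "please hold while we check your browser",
    "unusual traffic (code 17)",
])

print(bool(block_re.search("<h1>Please hold while we check your browser</h1>")))  # True
print(bool(block_re.search("Unusual traffic (code 17) detected")))                # True
print(bool(block_re.search("An ordinary article about gardening.")))              # False
```

Escaping the additions means a phrase with parentheses or dots matches literally instead of silently changing the pattern.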

I’ll be straight about the limit: this catches the common shapes of a 200 that isn’t content. It will not catch a clever, novel wall, and it can’t tell a tiny real page from an empty one without your help on the threshold. It’s a floor, not a fortress. But most agent loops I’ve seen don’t even have the floor. They hand the model the string and hope.

Here’s the open question I haven’t solved cleanly. The EMPTY_SHELL check and the soft-block list are both content-shape signals: they look at the blob in isolation. But the strongest signal that a fetch failed is often relative. This page is 200 bytes when the same URL gave 40 KB yesterday, or every URL on this host suddenly returns the same challenge string. That’s drift across fetches, and a pure per-blob function can’t see it without state. If you’ve found a cheap way to fold “this looks wrong compared to last time” into a per-call gate without dragging a database into your agent loop, I genuinely want to see it.
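One possible direction, offered as a sketch rather than an answer: keep a small bounded map of last-known-good sizes in process memory and ask it one question per fetch. `SizeDriftTracker` and its parameters are hypothetical, the 10x shrink threshold is arbitrary, the state is lost on restart, and it gives up the pure-function property that the per-blob gate has.

```python
from collections import OrderedDict


class SizeDriftTracker:
    """Remember the last trusted body size per URL in a bounded in-memory map."""

    def __init__(self, max_urls: int = 1000, shrink_factor: float = 0.1):
        self._sizes = OrderedDict()
        self.max_urls = max_urls
        self.shrink_factor = shrink_factor

    def check(self, url: str, body_len: int) -> bool:
        """Return True if this body is far smaller than the last trusted one."""
        previous = self._sizes.get(url)
        suspicious = previous is not None and body_len < previous * self.shrink_factor
        if not suspicious:
            # Only trusted sizes update the baseline, so one bad fetch
            # does not become the new normal.
            self._sizes[url] = body_len
            self._sizes.move_to_end(url)
            if len(self._sizes) > self.max_urls:
                self._sizes.popitem(last=False)  # evict the least recently updated URL
        return suspicious


tracker = SizeDriftTracker()
print(tracker.check("https://blog.example/crawl-bill", 40_000))  # False: first sighting
print(tracker.check("https://blog.example/crawl-bill", 200))     # True: 200x smaller
print(tracker.check("https://blog.example/crawl-bill", 39_000))  # False: back to normal
```

It does nothing for the "every URL on this host returns the same challenge string" case, which would need a per-host counter of its own.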

What’s the worst 200 OK your agent ever believed, and what tipped you off that the page was garbage? 👇



Full script (fetch_sanity.py, stdlib only, no network, the exact file I ran):

#!/usr/bin/env python3
"""fetch_sanity.py — one gate between a web-fetch tool and your agent's reasoning.

A fetch tool can return HTTP 200 and a non-empty body that is still NOT content:
an anti-bot challenge page, an empty JS shell, an "access denied" notice, or a
body that got cut off mid-stream. The agent treats all of it as "the page said
this" and plans on garbage. This function answers ONE question before reasoning:
is the returned blob usable as content at all?

    sanity_check(text, url, status) -> (verdict, reason)
    verdict in {"OK", "BLOCKED", "EMPTY_SHELL", "TRUNCATED"}

Pure function. No network, no I/O, deterministic: same (text, url, status) in,
same verdict out. Run it, diff it, trust it. It is a heuristic, not an oracle —
the "Where this gate is wrong" section of the post is honest about the misses.
"""
import re

# Soft-block markers: a 200 body that is really a challenge / denial wall.
# Lowercased substring/regex match against the body. Each one is a real string
# I have seen come back with status 200 instead of 403.
BLOCK_MARKERS = [
    r"just a moment\.\.\.",          # Cloudflare interstitial
    r"enable javascript and cookies to continue",
    r"attention required",           # Cloudflare block page title
    r"access denied",
    r"verify you are (?:a )?human",
    r"are you a robot",
    r"complete the security check",
    r"cf-ray",                       # Cloudflare ray id leaks into the body
    r"px-captcha|hcaptcha|g-recaptcha|/recaptcha/",
    r"request unsuccessful\. incapsula",
]
_BLOCK_RE = re.compile("|".join(BLOCK_MARKERS), re.IGNORECASE)

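# Pass 1: <script>/<style> elements, removed together with their contents.
_TAG_RE = re.compile(r"(?is)<(script|style)\b.*?</\1>")
# Pass 2: any remaining tag, leaving only the text between tags.
_ANYTAG_RE = re.compile(r"(?s)<[^>]+>")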


def _visible_ratio(text):
    """Fraction of the blob that is visible text after stripping script/style/tags.
    A real article is mostly words; an empty JS shell is mostly markup."""
    if not text:
        return 0.0
    stripped = _TAG_RE.sub(" ", text)
    stripped = _ANYTAG_RE.sub(" ", stripped)
    visible = re.sub(r"\s+", " ", stripped).strip()
    return len(visible) / len(text)


def sanity_check(text, url, status):
    """Return (verdict, reason) for one fetched blob. No network calls."""
    body = text or ""
    low = body.lower()

    # 1) BLOCKED — a soft-block / challenge / denial wall served as 200.
    m = _BLOCK_RE.search(low)
    if m:
        return "BLOCKED", f"soft-block marker {m.group(0)!r} (status={status})"

    # 2) EMPTY_SHELL — almost nothing to read. Either a literal empty body
    #    (200 + ""), or markup with the content rendered client-side: a raw
    #    fetch handed the agent a skeleton, not a page.
    ratio = _visible_ratio(body)
    has_markup = "<" in body and ">" in body
    visible_len = int(ratio * len(body))
    if visible_len < 200:
        if not body.strip():
            return "EMPTY_SHELL", f"empty body (status={status})"
        if has_markup and ratio < 0.10:
            return "EMPTY_SHELL", f"visible≈{visible_len}B ratio={ratio:.2f} (markup, no content)"

    # 3) TRUNCATED — body cut off mid-stream: opened an HTML tree but never
    #    closed it, or ends mid-tag.
    if has_markup:
        opened_html = "<html" in low
        closed_html = "</html>" in low
        ends_mid_tag = bool(re.search(r"<[a-z][^>]*$", body.rstrip(), re.IGNORECASE))
        if (opened_html and not closed_html) or ends_mid_tag:
            tail = body.rstrip()[-40:].replace("\n", " ")
            return "TRUNCATED", f"no </html> / mid-tag end …{tail!r}"

    # 4) OK — nothing tripped. The agent may reason on this.
    return "OK", f"len={len(body)}B ratio={ratio:.2f}"


# ---------------------------------------------------------------------------
# Fixtures. One REAL captured body (example.com, fetched once and pasted in as a
# byte-for-byte string so this stays offline/deterministic) + five SYNTHETIC
# bodies hand-written to reproduce failure classes I have hit in production.
# Synthetic ones are labeled (synthetic) so nobody mistakes them for a live pull.
# ---------------------------------------------------------------------------

# Real: the actual body of https://example.com (RFC-style sample page, public,
# unchanging). Captured once, hardcoded so the gate needs no network.
EXAMPLE_COM = (
    "<!doctype html><html><head><title>Example Domain</title></head><body>"
    "<div><h1>Example Domain</h1><p>This domain is for use in illustrative "
    "examples in documents. You may use this domain in literature without prior "
    "coordination or asking for permission.</p>"
    "<p><a href=\"https://www.iana.org/domains/example\">More information...</a>"
    "</p></div></body></html>"
)

# Synthetic: a Cloudflare "Just a moment..." interstitial served with status 200.
CF_CHALLENGE = (
    "<!DOCTYPE html><html lang=\"en-US\"><head><title>Just a moment...</title>"
    "<meta http-equiv=\"refresh\" content=\"390\"></head><body>"
    "<div class=\"main-wrapper\"><h1>example.com</h1>"
    "<h2>Checking if the site connection is secure</h2>"
    "<p>example.com needs to review the security of your connection before "
    "proceeding.</p></div>"
    "<!-- cf-ray: 8e2a1f0c9d4e7b21-FRA --></body></html>"
)

# Synthetic: an "Access Denied" wall (Akamai-style) returned as 200.
ACCESS_DENIED = (
    "<html><head><title>Access Denied</title></head><body>"
    "<h1>Access Denied</h1>"
    "<p>You don't have permission to access this resource.</p>"
    "<p>Reference #18.abcd1234.1718200000</p></body></html>"
)

# Synthetic: an empty SPA shell — all markup, the content arrives via JS.
JS_SHELL = (
    "<!doctype html><html><head><meta charset=\"utf-8\">"
    "<link rel=\"stylesheet\" href=\"/static/app.css\"></head><body>"
    "<div id=\"root\"></div>"
    "<script src=\"/static/runtime.js\"></script>"
    "<script src=\"/static/vendor.js\"></script>"
    "<script src=\"/static/main.js\"></script></body></html>"
)

# Synthetic: a real article body that got cut off mid-stream (size cap / dropped
# connection). Opens <html>, never closes it, ends mid-tag.
TRUNCATED_BODY = (
    "<!doctype html><html><head><title>How we cut the crawl bill</title></head>"
    "<body><article><h1>How we cut the crawl bill 82%</h1>"
    "<p>We started by measuring the per-run cost of a headless browser across "
    "every target in the fleet. The first surprise was how often we paid for "
    "Chrome on pages that answered a plain GET in eighty milliseconds. The second "
    "was the cost of the pages that came back empty. We logged each run and "
    "tallied the verdicts, and the numbers were blunt: most of the spend was on "
    "<sp"
)

FIXTURES = [
    ("https://example.com",                 EXAMPLE_COM,     200, "real"),
    ("https://shop.example/product/991",    CF_CHALLENGE,    200, "synthetic"),
    ("https://api.example/v2/orders",       ACCESS_DENIED,   200, "synthetic"),
    ("https://app.example/dashboard",       JS_SHELL,        200, "synthetic"),
    ("https://blog.example/crawl-bill",     TRUNCATED_BODY,  200, "synthetic"),
    ("https://example.com/empty",           "",              200, "synthetic"),
]


if __name__ == "__main__":
    tally = {}
    print(f"{'VERDICT':<12} {'KIND':<10} URL")
    print("-" * 64)
    for url, body, status, kind in FIXTURES:
        verdict, reason = sanity_check(body, url, status)
        tally[verdict] = tally.get(verdict, 0) + 1
        print(f"{verdict:<12} {kind:<10} {url}")
        print(f"{'':<12} {'':<10} → {reason}")
    usable = tally.get("OK", 0)
    n = len(FIXTURES)
    print("-" * 64)
    print(f"{usable} of {n} blobs were usable content  ::  {tally}")

Written by Aleksey Spinov. I run scrapers in production (2,190 runs across 32 published actors, the Trustpilot one at 962) and write up the failures the tutorials skip. The gate, the fixtures, and every verdict above were produced and verified by me on Python 3.13.5; the output shown is the real run, not a mock-up.

AI disclosure: drafted with AI assistance. The code was run locally (stdlib only, no third-party deps, no network); the stdout in this post is the actual output, and the synthetic fixtures are labeled as synthetic throughout.

