HTTP 200 Is a Lie: A 30-Line Schema Canary for Source Drift


A scraper that returns HTTP 200 is not a scraper that returns good data. Those are two different claims, and almost every monitoring setup I’ve seen conflates them.

Here’s the failure mode nobody writes code for. The source you scrape quietly changes. A field gets renamed, a number comes back as a string, one column goes blank. Your request still gets a 200. Your parser doesn’t throw. Your job exits green. And from that day forward, every scheduled run feeds slightly-wrong records into your corpus. No alarm. No stack trace. Just slow rot.

I’ve got 2,190 production scraper runs behind me, 962 of them on a single Trustpilot review scraper (that’s a raw lifetime run counter on my Apify profile, knotless_cadence, as of May 2026; not a controlled study, just a long-running meter). And the thing I learned running one source that many times isn’t about proxies or rate limits. It’s this: the most expensive failures are the ones that don’t fail.

This post is about catching them cheap. There’s a 30-line stdlib validator further down. Copy it, run it, drop it into your pipeline today.

TL;DR

  • HTTP 200 means “the server answered,” not “the data is the shape you expect.” Monitoring the status code misses silent source drift entirely.
  • Drift is a field renamed, a type changed (5 → "5"), or a value gone empty. Your parser tolerates all three and keeps writing.
  • The fix is a contract on the record’s shape — declared once, checked every run. In typed REST APIs this exists. In HTML scraping there’s usually no contract at all.
  • Below: a canary() in stdlib Python (no jsonschema), the real output of a run, and the honest limits of where this breaks.

Why “it didn’t crash” is the wrong success signal

When people monitor a scraper, they watch three things: did the request succeed (2xx), did the job finish, did the row count look roughly right. All three can be green while the data is garbage.

Think about what a scraper actually does. It fetches bytes, then a parser pulls fields out of those bytes by position or selector or key. The fetch layer is loud: a 403, a timeout, a connection reset all throw, and you notice. The parse layer is quiet. If you ask BeautifulSoup for .select_one(".review-text") and the source renamed that class last Tuesday, you don’t get an exception. You get None. Your code guards that None with a conditional, or wraps the .get_text() call in a try, or just stores the empty string. The run finishes. The row count is unchanged. Everything looks fine.

That gap, loud fetch layer and silent parse layer, is the whole problem. And it widens over time, because sources change on their own schedule, not yours.
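
To make the quiet parse layer concrete, here is a minimal stdlib sketch. The two HTML snippets, the class names, and the extract_text helper are invented for illustration, and a regex stands in for a real selector lookup only to keep the example dependency-free.

```python
import re

# Hypothetical markup: the same review before and after a class rename.
OLD_HTML = '<div class="review-text">Great service</div>'
NEW_HTML = '<div class="review-body">Great service</div>'

# Illustration only: a regex stand-in for a CSS selector lookup.
PATTERN = re.compile(r'<div class="review-text">(.*?)</div>')

def extract_text(html):
    """Return the review text, or "" when the selector matches nothing."""
    match = PATTERN.search(html)
    return match.group(1) if match else ""   # the guard that hides the drift

before = extract_text(OLD_HTML)
after = extract_text(NEW_HTML)
print(repr(before))
print(repr(after))
```

Both calls return normally. The second one just returns an empty string, which is exactly the kind of record that gets written without anyone noticing.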

There’s a fresh post making the rounds on Hacker News (24 May 2026, modest at 2 points) called “How API Drift Silently Breaks Data Pipelines.” The author’s framing is good: your pipeline is only as stable as the contracts those upstream APIs promised. They name the exact failure modes I keep hitting: fields renamed (user_id → userId), types converted (floats becoming strings), structure shifts. But the post stops at the diagnosis. It describes the disease and doesn’t hand you the thermometer. And it’s written for typed REST APIs, where at least a contract existed to drift away from.

HTML scraping has it worse. There is no contract. A REST API at least published a shape once. A web page promises you nothing. You reverse-engineered the structure by reading the DOM, and the site owner has zero obligation to keep it stable, or any awareness that you depend on it. So the contrarian point of this whole post: the place that needs a contract most is the place that has none. So add one.

The contract idea (it’s smaller than you think)

A contract is just: what does a healthy record look like? For a review scraper, a healthy record has a non-empty name string, a numeric rating, a non-empty text string, and a date string. That’s it. Four fields, three properties each.

The trick is to assert that shape on every run, on the output of your parser, and to treat a violation as a signal — not an exception that kills the job, but a measurement you can graph and alert on. You’re not validating that the network worked. You’re validating that the data is still the data.

People reach for jsonschema here. Fine tool. But it’s a dependency, a learning curve, and honestly overkill for “are my four fields still the right four fields.” I wanted something I could paste into any collector in five minutes with nothing but the standard library. So I wrote it as a plain dict and two small functions.

The canary

Here’s the whole thing. json is the only import, and it is there only to print the report.

"""Schema canary — assert the SHAPE of a parsed record, not just HTTP 200.
stdlib only (no jsonschema): a field contract + a drift report over a batch.
"""
import json

# 1) The CONTRACT: what a healthy record must look like. Declared ONCE.
#    type is checked structurally; "nonempty" is a separate, stricter gate.
CONTRACT = {
    "name":   {"type": str,          "required": True, "nonempty": True},
    "rating": {"type": (int, float), "required": True, "nonempty": True},
    "text":   {"type": str,          "required": True, "nonempty": True},
    "date":   {"type": str,          "required": True, "nonempty": True},
}

def is_empty(v):
    if v is None:
        return True
    if isinstance(v, str) and v.strip() == "":
        return True
    if isinstance(v, (list, dict)) and len(v) == 0:
        return True
    return False

def check_record(rec, contract=CONTRACT):
    """Drift findings for ONE record: (field, problem)."""
    findings = []
    for field, rule in contract.items():
        if field not in rec:
            if rule["required"]:
                findings.append((field, "MISSING_KEY"))
            continue
        val = rec[field]
        if not isinstance(val, rule["type"]):
            findings.append((field, "WRONG_TYPE"))
            continue
        if rule.get("nonempty") and is_empty(val):
            findings.append((field, "EMPTY"))
    for k in rec:                       # a renamed/new field = schema shift upstream
        if k not in contract:
            findings.append((k, "UNEXPECTED_KEY"))
    return findings

def canary(records, contract=CONTRACT, fail_ratio=0.05):
    """Aggregate over a batch. Trip when the drift ratio crosses fail_ratio."""
    total = len(records)
    drifted, by_problem = 0, {}
    for r in records:
        f = check_record(r, contract)
        if f:
            drifted += 1
            for _, problem in f:
                by_problem[problem] = by_problem.get(problem, 0) + 1
    ratio = drifted / total if total else 0.0
    return {
        "total": total,
        "drifted": drifted,
        "drift_ratio": round(ratio, 3),
        "by_problem": by_problem,
        "tripped": ratio > fail_ratio,
    }
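
One edge case in the rating rule is worth checking before relying on it: in Python, bool is a subclass of int, so a rating of True would pass an isinstance check against (int, float). A possible stricter test is sketched here; is_number is a hypothetical helper, not part of the canary above.

```python
def is_number(value):
    """True for int/float, but not for bool (which subclasses int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

# The plain isinstance check accepts a boolean as a number.
plain_check_accepts_bool = isinstance(True, (int, float))
print(plain_check_accepts_bool)

# The stricter helper rejects it while still accepting real numbers.
print(is_number(True), is_number(5), is_number(4.5), is_number("5"))
```

Whether this matters depends on the source. If a boolean can never plausibly show up in that field, the simpler rule is fine.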

Four kinds of drift, and each maps to a real thing a source does:

  • MISSING_KEY: a required field disappeared. The source dropped a column, or your selector now matches nothing and your parser didn’t even set the key.
  • WRONG_TYPE: the classic. rating came back as "5" instead of 5. In Python, a downstream sum() or >= 4 comparison raises a TypeError much later, far from the cause; in looser tooling the value can be silently coerced wrong.
  • EMPTY: the key exists, the type is right, but the value is blank. This is the quietest failure. Nothing looks broken structurally; the data is just gone. On a long-running source this is the one that rots a corpus for weeks before anyone notices.
  • UNEXPECTED_KEY: a field you didn’t expect showed up. Usually means the source renamed something. The old key vanishes (MISSING_KEY) and a new one appears (UNEXPECTED_KEY) in the same record. That pairing is a fingerprint of a rename.
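
The rename fingerprint from the last bullet can be pulled out of one record's findings mechanically. This is a sketch under one assumption of mine: a rename is only reported when exactly one key is missing and exactly one is unexpected, since anything else is ambiguous. rename_candidates is a hypothetical helper name.

```python
def rename_candidates(findings):
    """Pair a MISSING_KEY field with an UNEXPECTED_KEY field from one record."""
    missing = [field for field, problem in findings if problem == "MISSING_KEY"]
    unexpected = [field for field, problem in findings if problem == "UNEXPECTED_KEY"]
    # Only the unambiguous one-to-one case is reported as a likely rename.
    if len(missing) == 1 and len(unexpected) == 1:
        return [(missing[0], unexpected[0])]
    return []

# Findings in the (field, problem) shape that check_record returns.
findings = [("name", "EMPTY"), ("rating", "WRONG_TYPE"),
            ("text", "MISSING_KEY"), ("body", "UNEXPECTED_KEY")]
pairs = rename_candidates(findings)
print(pairs)
```

Treat the result as a hint for a human, not a fact. A dropped field and an unrelated new field in the same release would produce the same pairing.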

The fail_ratio matters. You don’t want to trip on one weird record out of fifty thousand; sources have always had a few malformed rows. You want to trip when drift becomes systematic. Set it to whatever your baseline noise is plus a margin. I’ll show why a per-batch ratio beats a per-record assert in a second.
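
One way to turn "baseline noise plus a margin" into a number is sketched below. The choice of the worst known-good run as the baseline, the 0.02 margin, and the history values are all assumptions for illustration, not measurements.

```python
def suggest_fail_ratio(baseline_ratios, margin=0.02):
    """Highest drift ratio seen on known-good runs, plus a safety margin."""
    if not baseline_ratios:
        raise ValueError("need at least one known-good run")
    return max(baseline_ratios) + margin

# Illustrative drift ratios from runs you have manually confirmed as healthy.
history = [0.010, 0.004, 0.020, 0.015]
threshold = suggest_fail_ratio(history)
print(round(threshold, 3))
```

Using the maximum is the conservative choice. A mean or a high quantile would trip earlier, at the cost of more false alarms.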

Run it: green, then tripped

Demo scenario, and I tried to make it honest rather than dramatic. A batch of 48 review records. On day one, every record matches the contract. Weeks later the source quietly reshapes its output. Only 3 of the 48 records carry the new shape: rating became a string, text got renamed to body, and one record’s name came back blank. The server still returns 200. The parser still doesn’t crash.

Here’s the actual output, copied straight from the terminal (Python 3.11.9, stdlib only):

HEALTHY: {"total": 48, "drifted": 0, "drift_ratio": 0.0, "by_problem": {}, "tripped": false}
DRIFTED: {"total": 48, "drifted": 3, "drift_ratio": 0.062, "by_problem": {"WRONG_TYPE": 3, "MISSING_KEY": 3, "UNEXPECTED_KEY": 3, "EMPTY": 1}, "tripped": true}
one bad record -> [('name', 'EMPTY'), ('rating', 'WRONG_TYPE'), ('text', 'MISSING_KEY'), ('body', 'UNEXPECTED_KEY')]

Read the DRIFTED line. Only 6.2% of records drifted, three out of forty-eight. A row-count monitor sees 48 records both days and says nothing. An HTTP monitor sees 200 both days and says nothing. The canary trips, because 3/48 = 0.0625 (reported as 0.062) is over the 0.05 threshold, and by_problem tells you what drifted: three wrong types, three missing keys, three unexpected keys, and one empty value.

Now look at the last line, the single bad record. One source change produced four findings on one row: name went empty, rating flipped to a string, text vanished, body appeared. That text MISSING_KEY plus body UNEXPECTED_KEY pairing? That’s a rename, caught without you knowing in advance which field the source would touch. You declared what should be there. Everything else is signal.

I’ll be straight about what this demo is and isn’t. The 6.2% is a constructed example, not a measurement from a specific incident. I built the batch to show a partial drift crossing a 5% threshold. The point of the number is the mechanism (a small fraction of drift still trips a ratio-based canary), not the value itself.
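
The arithmetic behind that trip is small enough to check by hand. The sketch below repeats it with the demo's numbers; the variable names are mine. Note that the comparison uses the unrounded ratio, and that Python's round() sends an exact tie to the even digit, which is why the report shows 0.062 rather than 0.063.

```python
drifted, total, fail_ratio = 3, 48, 0.05

ratio = drifted / total          # exactly 0.0625
reported = round(ratio, 3)       # the value stored under "drift_ratio"
tripped = ratio > fail_ratio     # the comparison uses the unrounded ratio

print(ratio, reported, tripped)

# One fewer drifted record stays under the same threshold.
two_of_48_trips = (2 / total) > fail_ratio
print(two_of_48_trips)
```

So with a batch of 48 and a 5% threshold, the tripwire sits between two and three bad records.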

Why a ratio, not a hard assert

Your first instinct might be: just assert isinstance(rec["rating"], int) in the parser and let it blow up. Don’t. Two reasons.

First, real sources are noisy. A handful of records are always malformed: a user left a field blank, a date is missing, whatever. If you hard-assert per record, you’re paging yourself at 3 a.m. for noise that was always there. The ratio absorbs baseline weirdness and only fires on a trend.

Second, a crash loses the run. If you blow up on record 12,000 of 50,000, you’ve thrown away the 11,999 good records you already pulled. The canary is a report, not an exception. You finish the run, write the good rows, and emit a drift metric beside the data. Then your existing alerting (the thing that already watches your job metrics) decides what to do with a tripped: true.

That’s the design choice I’d defend: drift detection is monitoring, not control flow. Measure it, graph it, alert on the trend. Don’t let it kill the job that’s still producing useful data.
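
As a sketch of "report, not exception", the run below keeps its good rows and returns a metric beside them instead of raising. looks_healthy is a deliberately tiny stand-in for the full contract check, and finish_run and the sample batch are hypothetical names and data.

```python
import json

def looks_healthy(rec):
    """Stand-in check (hypothetical): rating must be an int or float."""
    return isinstance(rec.get("rating"), (int, float))

def finish_run(records, fail_ratio=0.05):
    """Keep the good rows, count the bad ones, and return a metric dict."""
    good = [r for r in records if looks_healthy(r)]
    drifted = len(records) - len(good)
    ratio = drifted / len(records) if records else 0.0
    metric = {"rows": len(records), "drifted": drifted,
              "drift_ratio": round(ratio, 3), "tripped": ratio > fail_ratio}
    return good, metric

batch = [{"rating": 5}, {"rating": "5"}, {"rating": 4}, {"rating": 3.5}]
good_rows, metric = finish_run(batch)
print(json.dumps(metric))        # emitted beside the data, never raised
```

The caller decides what to do with good_rows and with metric separately, which keeps alerting policy out of the parser.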

The cost math: why this compounds

Here’s the part that makes it worth your afternoon. A scraper isn’t a one-shot script; it’s a recurring job. Say you scrape a source daily and the layout drifts on a Tuesday. Without a canary, the gap between “drift started” and “human noticed” is however long until someone downstream complains about weird numbers. In my experience that’s rarely under a week, often a month, because the data still looks plausible. Empty strings and stringified numbers don’t jump out in a dashboard.

So the bill isn’t one bad run. It’s (days undetected) × (records per run) poisoned rows, plus the backfill to fix them, plus the trust hit when whoever consumes the data realizes a slice of it was wrong for a month. The canary turns “noticed in a month” into “noticed on the next run.” That’s the entire ROI. It collapses the detection window from weeks to one cycle, for thirty lines and zero dependencies.
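
The poisoned-row part of that bill is a one-line multiplication. The sketch below uses made-up numbers (a daily job pulling 5,000 records) and adds a drift_share parameter, which is my assumption for the case where only a fraction of each run is affected.

```python
def poisoned_rows(days_undetected, records_per_run, drift_share=1.0):
    """Rough count of bad rows written by a daily job before anyone notices."""
    return round(days_undetected * records_per_run * drift_share)

# Illustrative numbers only, not measurements.
without_canary = poisoned_rows(30, 5000)   # drift noticed after a month
with_canary = poisoned_rows(1, 5000)       # drift noticed on the next run
print(without_canary, with_canary)
```

The backfill and the trust hit are harder to put a number on, so the function leaves them out.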

A second reason 200 lies: deliberate slop

There’s a darker version of this worth knowing about. Some site operators have started intentionally serving garbage to bots they don’t like. There’s a post from late May 2026 (also on HN, 6 points) titled “Webmaster, it’s time to serve slop to AI crawlers,” where the author wires up a tiny language model to detect crawler user-agents and reply with “incoherent AI-generated content.” Their words: “Since these bots are looking to scrape data, why not serve them some data? But not just any data, let’s serve them some slop.”

If that’s pointed at you, you get a clean 200 and a body full of plausible-looking nonsense. A status-code monitor is blind to it. A shape monitor isn’t your full defense here either, because well-crafted slop can match your contract’s shape. But a contract is the floor: the fastest, cheapest tripwire, and it catches the lazy version (empties, type flips, structural changes) immediately. Natural redesign and adversarial slop hit the same blind spot, and the canary covers the structural part of both.

Honest limits

This is a tripwire, not a guarantee. Be clear-eyed about where it doesn’t help:

  • Value drift inside a valid shape. If rating stays an integer but the source starts returning 3 where it used to return 5, the shape is fine and the canary says nothing. Catching that needs distribution checks (mean/quantile of rating over time), which is a different, heavier tool.
  • You still have to maintain the contract. If you legitimately add a field, you update CONTRACT, or you’ll get UNEXPECTED_KEY noise. That’s a feature (it forces you to acknowledge schema changes) but it’s also work.
  • Per-record only. This validates individual records, not relationships between them (e.g., “reviews should be newer than the account”). Cross-record invariants are out of scope on purpose, to keep the canary small.
  • No magic for adversarial sources. As above, shape-matching slop will pass. If you suspect deliberate poisoning, you need content-level sanity checks too.
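
For the first limit in that list, the distribution check can start very small. This sketch compares the mean rating of a current batch against a baseline. The mean_shift name, the 0.5 tolerance, and both sample lists are assumptions for illustration; a real check would likely look at quantiles as well.

```python
from statistics import mean

def mean_shift(baseline, current, tolerance=0.5):
    """Flag when the mean of `current` moves more than `tolerance` from baseline."""
    shift = mean(current) - mean(baseline)
    return {"shift": round(shift, 2), "flagged": abs(shift) > tolerance}

# Illustrative ratings: every value is a valid integer, so the shape canary passes.
last_month = [5, 4, 5, 5, 3, 4, 5, 4]
this_week = [3, 3, 2, 3, 4, 3, 3, 2]
result = mean_shift(last_month, this_week)
print(result)
```

Every record in this_week satisfies the contract, and the values have still moved by a point and a half. That is the gap a shape check cannot see.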

I’d rather ship the 30-line tripwire that catches most silent structural drift today than wait for the perfect data-quality framework. You can always add distribution checks later. You can’t un-rot a corpus you didn’t know was rotting.

What to do Monday

Pick your noisiest recurring scraper. Write down what one healthy record looks like — four or five fields, types, which must be non-empty. Paste the canary, point CONTRACT at those fields, and call canary(records) right after your parse step, before you write. Emit drift_ratio and by_problem as a metric next to your row count. Wire tripped into whatever already pages you.

That’s the whole move. You’re no longer trusting that a 200 means the data is good. You’re checking.


Written by Alexey Spinov. I run production scrapers — 2,190 lifetime runs, 962 on one Trustpilot source (profile: apify.com/knotless_cadence). Code in this post was run on Python 3.11.9; the output blocks are copied from real runs. Drafted with AI assistance and edited/verified by me — I don’t publish numbers or output I haven’t run myself.


