Your Scraper Got Clean Data. The Site Lied to It.


Your scraper ran clean. HTTP 200 on every request. The schema validated. Every price sat in a sane range, every date was in the past, every ISBN had thirteen digits. Zero errors, zero retries. You shipped the dataset.

And every value in it was a lie the site fed you on purpose, because it knew you were a bot.

That’s the failure I want to talk about. Not a block. Not a captcha. Not a crash. The site looked at your traffic, decided not to fight you, and instead handed back a 200 full of plausible garbage — values engineered to pass every check you have. This is the one corner of data quality where valid and true come apart, and almost every detector people write only measures the first one.

TL;DR

  • A site that detects a scraper can serve a 200 with a flawless schema and values that pass every sanity rule — and are deliberately fabricated. This is documented, shipping anti-bot behavior, not a hypothetical.
  • Status codes and sanity checks can’t catch it. They answer “is my pipeline correct?” The poisoned row’s question is “is the source telling the truth?” No range check answers that.
  • The fix is grounding: check each row against an independent invariant the source can’t fake by making the value look plausible — an ISBN-13 checksum, price * qty == line_total, a real second origin.
  • The trap I didn’t expect: naive cross-source consensus gets fooled too. “Three sources agree” means nothing if all three are mirrors of one poisoned first-party page. Independence is the signal, not the vote count.
  • The numbers in the demo below are a deterministic synthetic dataset, not a measurement of any real site. What’s real is the volume that earns me the right to talk about plausible-but-false data: 2,190 production runs across 32 actors, one Trustpilot scraper at 962.

Why I get to talk about plausible lies

I run production scrapers. Thirty-two published actors, 2,190 runs logged in production as of this week (my own Apify dashboard — that’s the live counter, not a rounded brag). One of them, a Trustpilot review scraper, has run 962 times against the same kind of source.

That last one matters more than the total, and here’s why. Reviews are the textbook class of data where plausible and true are different problems. A fake review and a real review look identical at the schema level: both have a star rating in 1–5, a date, a body in fluent English, an author handle. Sanity passes on both. The entire job of scraping that surface well is knowing that a clean, well-formed, perfectly typed record can still be fabricated. After 962 runs hitting that wall, “the value looks right and is still false” isn’t a thought experiment to me. It’s the default assumption.

Now let me be honest about what I don’t have. I do not have a clean, published figure for how many poisoned rows we’ve actually caught in the wild, or what share of any specific site’s responses are fake. That number is not available, and I’m not going to invent it. The 2,190 / 962 is the part that’s real: it’s the exposure that makes the failure class familiar. I’ll show the detection mechanism below on a dataset you can run yourself in two seconds, so nothing rests on a number you can’t check.

This is not the other failures in the series

If you’ve read the rest of this series, draw the boundary hard, because these failures rhyme and the fixes don’t.

This is not a broken status code. HTTP 200 lying about the shape of a response is the schema canary — that catches a 200 whose structure has drifted. Here the shape is flawless. The canary would report HEALTHY.

This is not a field that violates a sanity rule: a price of $0, a date in the future, a language that doesn’t match the country. That post catches values that look wrong. This is the opposite: every field looks right, passes every one of those sanity checks, and the value is still a deliberate lie the source served because it had identified you as a bot.

It isn’t a crash you resume from — there’s no crash, the job exits 0. The question those posts answer is “is my pipeline correct?” This one asks something none of them touch: “is the source telling the truth?” And no status code, no schema validator, no range check will ever answer that.

New axis: trust. Not the shape of a response, not a field’s plausibility, not a crash. Whether the values themselves are real.

This is shipping, not a thought experiment

The reason this isn’t paranoia: the anti-bot industry already does it on purpose. The polite framing is “tarpit” or “decoy content” — when a site fingerprints a crawler, instead of returning a 403 it returns a 200 full of generated content that wastes your time and pollutes your dataset. Cloudflare shipped a feature in 2025 that feeds suspected AI crawlers a maze of AI-generated decoy pages on purpose; their own writeup frames it as serving believable-but-irrelevant data instead of blocking (Cloudflare, AI Labyrinth, 2025). Bruce Schneier has been cataloguing the broader version — deliberate data poisoning aimed at scrapers and the models trained on them (Schneier on Security, 2025).

So the threat model flipped. The old story was “they’ll block you.” The new story is “they’ll let you in and lie to you,” because a poisoned dataset that you trust is worth more to them than a blocked request you’ll just retry from another IP.

Why status and sanity are silent by design

Here’s the uncomfortable part. The poisoned row passes your checks not because your checks are bad, but because they were built to answer a different question.

A status code answers: did the transport succeed? 200 — yes. A schema canary answers: is the structure intact? All keys present, types correct, no nulls — yes. A sanity rule answers: is this value inside the range a real value would fall in? Price is $44.99, qty is 1, date is last month — yes, yes, yes.

Every one of those is a question about form. None of them is a question about correspondence to reality. And an adversary who controls the response can satisfy every form check trivially — that’s the whole point of serving a decoy instead of a block. They’re not sending you malformed junk that trips a validator. They’re sending you a beautifully formed lie.

Sanity catches accidents: the source glitched, a field came back null, a scraper bug doubled a value. It does not catch adversarial fabrication, because fabrication is designed to look sane. Validity is not truth. You can’t range-check your way to trust.
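To make the three form questions concrete, here is a minimal sketch of them as code. The row, the required-key table, and the ranges are all invented for illustration; they are not from a real site and not from the demo dataset. The point is only that a row with a wrong ISBN check digit clears all three.

```python
from datetime import date

# Hypothetical row: every value here is made up for illustration.
row = {
    "sku": "EXAMPLE-1",
    "isbn13": "9780306406158",  # 13 digits, 978 prefix, but the check digit is wrong
    "price": 44.99,
    "qty": 1,
    "line_total": 44.99,
    "published": date(2020, 1, 15),
}

# Assumed schema: key name -> expected Python type.
REQUIRED = {"sku": str, "isbn13": str, "price": float, "qty": int,
            "line_total": float, "published": date}


def transport_ok(status_code):
    """Did the transport succeed?"""
    return status_code == 200


def schema_ok(r):
    """Is the structure intact: all keys present, all types correct?"""
    return all(k in r and isinstance(r[k], t) for k, t in REQUIRED.items())


def sanity_ok(r):
    """Is each value inside the range a real value would fall in?"""
    return (
        len(r["isbn13"]) == 13 and r["isbn13"].isdigit()
        and 0 < r["price"] < 10_000
        and r["qty"] >= 1
        and r["published"] < date.today()
    )


form_checks = [transport_ok(200), schema_ok(row), sanity_ok(row)]
print(form_checks)
```

All three return True for this row. None of them ever looks at whether the ISBN is a real one.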

What actually catches it: grounding to an invariant

So if you can’t trust the value, what can you trust? An invariant the source can’t satisfy just by making the number look plausible — something that has to be computed against an independent reference and will break if the value was made up.

Three kinds are cheap and shockingly effective:

  1. A self-checking identifier. An ISBN-13 isn’t just thirteen digits. Its last digit is a checksum over the first twelve. A fabricated ISBN that “looks real” but has a randomly chosen last digit satisfies the checksum only about one time in ten.
  2. A cross-field arithmetic invariant. price * qty == line_total. Each field can be individually plausible and the math can still be incoherent — which is exactly what happens when values get swapped around.
  3. Real independent corroboration. Not “how many sources cite this,” but “how many distinct origins do.” Three citations that all resolve to the same first-party domain are one source wearing three hats.

Here are the probe’s checks. Pure stdlib, no network, no browser, no keys, no randomness: a deterministic synthetic dataset stands in for the collected rows so you get the exact output I did. The “collection” is a hardcoded list because the mechanism, grounding to an invariant, doesn’t depend on the transport one bit.

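def isbn13_checksum_ok(isbn13):
    """An ISBN-13 is valid only if sum(d_i * w_i) % 10 == 0, weights alternating
    1, 3. Sanity asks 'is it 13 digits'. This asks 'is it a REAL ISBN'."""
    # Guard the shape first: int() would raise on hyphens or letters, and a
    # shorter string could satisfy the modulus by accident.
    if len(isbn13) != 13 or not (isbn13.isascii() and isbn13.isdigit()):
        return False
    digits = [int(c) for c in isbn13]
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0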


def arithmetic_invariant_ok(r, tol=0.01):
    """price * qty must equal line_total. Each field can be individually plausible
    and still violate this — incoherent values don't survive the cross-field math."""
    return abs(r["price"] * r["qty"] - r["line_total"]) <= tol


def registrable_domain(host):
    """Crude eTLD+1: last two labels. 'm.vendor-a.com' -> 'vendor-a.com'.
    In production use the public suffix list."""
    return ".".join(host.split(".")[-2:])


def independent_corroboration_ok(r, min_independent=2):
    """'3 sources agree' means nothing if all 3 are mirrors of one origin.
    Count DISTINCT registrable domains, not citations."""
    distinct = {registrable_domain(h) for h in r["sources_citing"]}
    return len(distinct) >= min_independent

Each row gets graded TRUSTED or POISONED(reason). Note what the checks do not need: they never need to know which field is the lie, or what the “correct” value was. They only need the invariant to hold. That’s the property that makes grounding work where sanity can’t — it doesn’t model the truth, it models a constraint the truth obeys and a lie usually breaks.
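The grading step isn’t listed above, so here is a minimal sketch of how it could look. The reason labels are the ones that appear in the demo output; the check order, the first-failure-wins rule, and the four example rows are my assumptions for illustration, not the 12-row demo dataset.

```python
def isbn13_checksum_ok(isbn13):
    digits = [int(c) for c in isbn13]
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def arithmetic_invariant_ok(r, tol=0.01):
    return abs(r["price"] * r["qty"] - r["line_total"]) <= tol


def registrable_domain(host):
    # Crude eTLD+1, same two-label shortcut as the probe.
    return ".".join(host.split(".")[-2:])


def independent_corroboration_ok(r, min_independent=2):
    distinct = {registrable_domain(h) for h in r["sources_citing"]}
    return len(distinct) >= min_independent


# (reason label, check) pairs. The order is an assumption.
CHECKS = [
    ("isbn13_checksum", lambda r: isbn13_checksum_ok(r["isbn13"])),
    ("arithmetic_invariant", arithmetic_invariant_ok),
    ("false_consensus_single_origin", independent_corroboration_ok),
]


def grade(r):
    """Return 'TRUSTED', or 'POISONED(reason)' for the first invariant that fails."""
    for reason, check in CHECKS:
        if not check(r):
            return f"POISONED({reason})"
    return "TRUSTED"


# Hypothetical rows, invented for this sketch.
two_origins = ["store.vendor-a.com", "www.vendor-b.org"]
rows = [
    {"sku": "EX-1", "isbn13": "9780306406157", "price": 10.00, "qty": 2,
     "line_total": 20.00, "sources_citing": two_origins},
    {"sku": "EX-2", "isbn13": "9780306406158", "price": 10.00, "qty": 2,
     "line_total": 20.00, "sources_citing": two_origins},
    {"sku": "EX-3", "isbn13": "9780306406157", "price": 34.99, "qty": 3,
     "line_total": 79.99, "sources_citing": two_origins},
    {"sku": "EX-4", "isbn13": "9780306406157", "price": 10.00, "qty": 2,
     "line_total": 20.00,
     "sources_citing": ["store.vendor-a.com", "m.vendor-a.com", "cdn.vendor-a.com"]},
]

verdicts = {r["sku"]: grade(r) for r in rows}
for sku, verdict in verdicts.items():
    print(sku, verdict)
```

Returning on the first failure keeps one reason per row, which matches how the demo output reports them. A production version would more likely collect every failing reason.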

The twist: naive consensus gets fooled too

The obvious upgrade, the one most people reach for, is consensus. “Don’t trust one source. Cross-check three. If they agree, it’s true.” It feels bulletproof.

It isn’t, and this is the part I want you to take away even if you forget the rest. I fed the probe a row that three sources corroborate, and it’s still poisoned. All three “sources” resolve to the same registrable domain: store.vendor-a.com, m.vendor-a.com, cdn.vendor-a.com. One poisoned first-party page, mirrored across a storefront, a mobile host, and a CDN. A naive consensus check counts three agreements and votes confidently for the lie.

Consensus measures agreement. Agreement is not independence. If everyone is quoting the same poisoned origin, unanimity is exactly what you’d expect — and exactly what you should distrust. The fix is to count distinct origins, not distinct URLs: collapse each citation to its registrable domain and require at least two that don’t trace back to the same place. That’s the difference between “three sources said so” and “three independent sources said so,” and adversaries live in the gap between those two sentences.
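The gap between those two sentences fits in a few lines. This is a sketch: `naive_consensus` is a hypothetical stand-in for the vote-counting check most people write, and the two-label domain collapse is the same crude shortcut the probe uses.

```python
def registrable_domain(host):
    # Crude eTLD+1: last two labels only.
    return ".".join(host.split(".")[-2:])


def naive_consensus(citations, min_votes=2):
    """Counts distinct hosts that agree. This is the check that gets fooled."""
    return len(set(citations)) >= min_votes


def independent_consensus(citations, min_independent=2):
    """Counts distinct origins by collapsing each host to its registrable domain."""
    return len({registrable_domain(h) for h in citations}) >= min_independent


mirrors = ["store.vendor-a.com", "m.vendor-a.com", "cdn.vendor-a.com"]

print(naive_consensus(mirrors))        # three distinct hosts
print(independent_consensus(mirrors))  # one registrable domain
```

The first call returns True and the second returns False: same citations, opposite verdicts.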

The live run

Twelve “collected” rows. All twelve pass the schema canary (#5) and all twelve pass field sanity (#7) — by construction, so the old detectors stay silent. Three are poisoned, each caught by exactly one grounding rule. Run the script and you get:

=== POISON CHECK (deterministic synthetic dataset, not a measurement of a real site) ===
records collected        : 12
passed schema canary (#5): 12   <- shape is perfect on ALL
passed field sanity (#7) : 12   <- every value looks RIGHT; sanity is silent
--------------------------------------------------------
grounding checks (truth, not validity):
  ISBN-13 checksum        : 1 failed
  price*qty == line_total : 1 failed
  independent corroborat. : 1 failed (3 "sources" all point to one first-party)
--------------------------------------------------------
TRUSTED                  : 9
POISONED                 : 3
  - sku DEMO-0007  reason=isbn13_checksum
  - sku DEMO-0009  reason=arithmetic_invariant
  - sku DEMO-0011  reason=false_consensus_single_origin
========================================================
verdict: 3 clean, well-formed, sanity-passing rows are fabricated.

Read what that output is actually saying:

  • passed schema canary 12 and passed field sanity 12. All twelve clear every check from the earlier posts in this series. The schema canary and the sanity validator are structurally blind here — not broken, just answering a different question.
  • ISBN-13 checksum: 1 failed. DEMO-0007 has a thirteen-digit ISBN with a valid 978 prefix. Sanity loves it. The check digit doesn’t satisfy the weighted-mod-10 rule, so the number was made up.
  • price*qty == line_total: 1 failed. DEMO-0009 has a sane price, a positive integer qty, and a positive total — each field passes on its own. 34.99 * 3 is 104.97, not the 79.99 in the row. The values are individually plausible and jointly incoherent.
  • independent corroborat: 1 failed. DEMO-0011 is the consensus trap: three citations, one origin. The vote says trust it. Independence says don’t.
  • verdict: 3 ... fabricated among nine clean rows. The poison is invisible to status, shape, and sanity, and visible only to grounding.

Where this breaks (and I’m not going to oversell it)

Grounding is not a lie detector. It’s narrower than that, and the limits matter more than the wins.

You need an invariant, and not all data has one. A book has an ISBN checksum. A line item has arithmetic. A free-text Trustpilot review has neither. There is no checksum on “the food was cold and the staff were rude.” For pure text with no internal constraint and no independent anchor, grounding has nothing to grab — and that’s exactly the surface where poisoning is easiest. I can’t hand you a 30-line probe for that. I’d be lying if I said I could.

Cross-source corroboration needs sources that are genuinely separate. My domain collapse catches the lazy mirror case. It does not catch a determined adversary who plants the same lie across genuinely distinct domains, or a real ecosystem where everyone honestly syndicates from one upstream feed. Independence is a spectrum, and registrable-domain is a crude proxy for it.
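The crude proxy also fails in the other direction: under a multi-label suffix such as co.uk, the two-label shortcut collapses unrelated sites into one “origin” and would flag an honest row. Here is a sketch of the problem and a partial patch. The three-entry suffix set is a tiny hand-picked illustration, not the public suffix list, and the host names are invented.

```python
# Tiny illustrative subset. The real public suffix list has thousands of entries.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def crude_domain(host):
    """The probe's shortcut: last two labels."""
    return ".".join(host.split(".")[-2:])


def less_crude_domain(host):
    """Keep three labels when the last two form a known multi-label suffix."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


# Two hypothetical, unrelated sites under the same public suffix.
hosts = ["shop.alpha.co.uk", "www.beta.co.uk"]

crude = {crude_domain(h) for h in hosts}
better = {less_crude_domain(h) for h in hosts}
print(crude)
print(better)
```

The crude version sees a single origin, co.uk, where there are two. The patched version separates them, but only for suffixes it happens to know about.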

A perfectly consistent fabrication beats this. Grounding catches incoherent lies — a value that breaks a constraint. If the adversary fabricates the ISBN and recomputes a valid checksum, and makes the arithmetic close, the invariant holds and the probe says TRUSTED. This raises the cost of the lie, which is the realistic goal. It does not make lying impossible.
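To see how little work a consistent lie takes, here is a sketch of the adversary’s side. `isbn13_check_digit` is a helper I’m introducing for illustration, and the twelve-digit prefix is invented.

```python
def isbn13_check_digit(first12):
    """Compute the check digit that makes a 12-digit prefix a valid ISBN-13."""
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(first12))
    return (10 - total % 10) % 10


def isbn13_checksum_ok(isbn13):
    digits = [int(c) for c in isbn13]
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


fabricated_prefix = "978123456789"  # invented, not a real book
forged = fabricated_prefix + str(isbn13_check_digit(fabricated_prefix))

print(forged)                      # 9781234567897
print(isbn13_checksum_ok(forged))  # True: the invariant holds on a made-up number
```

One extra line on the adversary’s side defeats the checksum. Catching this needs an external anchor, such as a lookup of the ISBN against an independent catalog.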

So treat it as what it is: a cheap filter that catches the common, lazy poisoning and forces an adversary to do real, coordinated work to get past it. That’s a good trade for 30 lines. It is not trust by itself.

What to do Monday

Three moves, smallest first:

  1. Add every invariant your data already carries. Checksummed identifiers (ISBN, IBAN, EAN, VAT numbers), cross-field arithmetic (price * qty == total, percentages summing to 100), referential constraints (a foreign key that resolves). These are free — the constraint already exists in the domain; you just have to check it. Most pipelines never do.
  2. Count distinct origins, not citations. If you corroborate across sources, collapse each to its registrable domain before you count. “Three sources” that share a domain is one source. The public suffix list does this properly; the two-label hack in the demo is the starter version.
  3. Know which of your columns have no anchor, and flag them. The honest output isn’t only “this row is poisoned.” It’s also “this field has no invariant I can ground it against, so I can’t vouch for it.” A pipeline that knows what it can’t verify is worth more than one that pretends everything is fine.
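The third move can start as a plain lookup table. A minimal sketch, assuming a hypothetical mapping from each column to the invariants that ground it (the column names are illustrative):

```python
# Hypothetical mapping: column -> names of the invariants that ground it.
GROUNDING = {
    "isbn13": ["isbn13_checksum"],
    "price": ["arithmetic_invariant"],
    "qty": ["arithmetic_invariant"],
    "line_total": ["arithmetic_invariant"],
    "review_body": [],     # free text: nothing to check against
    "author_handle": [],
}


def unanchored_columns(columns, grounding=GROUNDING):
    """Columns with no invariant at all, including ones missing from the map."""
    return sorted(c for c in columns if not grounding.get(c))


def coverage_report(row, grounding=GROUNDING):
    """Label each field so downstream users see what was never verified."""
    return {col: ("grounded" if grounding.get(col) else "unverifiable")
            for col in row}


flagged = unanchored_columns(GROUNDING)
print(flagged)
report = coverage_report({"isbn13": "9780306406157", "review_body": "cold food"})
print(report)
```

Shipping the report alongside the dataset lets consumers see which fields were checked and which were only collected.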

You don’t need a fraud-detection platform for the first pass. You need to use the constraints your own data already ships with — most of us collect them and then never check them.


One thing I haven’t solved: how do you ground free text at all? Reviews, descriptions, comments — the highest-value, most-poisoned surface — have no checksum and often no independent anchor. The only handles I’ve found are weak: distribution drift across a corpus, stylometric oddities, timing clusters. None of them work on a single row, and all of them are gameable. If you’ve actually caught a fabricated free-text record in production — not a malformed one, a fabricated one — I want to know what signal you used. I read every comment.

Follow for the next numbers from the run log. And tell me: what’s the most convincing fake data a source ever fed your scraper — the one that passed every check you had?


Written by Aleksei Spinov — I run production scrapers (2,190 runs across 32 actors; one Trustpilot scraper at 962). Proof: blog.spinov.online and my Apify profile.

AI disclosure: drafted with AI assistance, then edited, fact-checked, and the code run and verified by me. The dataset is synthetic and deterministic; the output above is real stdout from executing the script.

