Your Scraper Collected 50 Rows. There Were 4,000.
A scraper can pass every check you wrote and still be wrong about the one thing you actually care about: how much it collected.
No exception. No 500. No broken row. Exit code 0, logs green, every field valid. And the set on disk is a fraction of what the site actually has. I have run scrapers in production enough times to stop trusting a green run on its own, and this is the failure that taught me to count.
TL;DR
- A paginated source can serve fewer rows than it claims and never throw — page caps, hidden offset limits, infinite scroll that “ends” early.
- Your status check (200), schema check (valid row), and byte check (you got data) all pass. None of them counts records.
- The tell: declared total vs unique ids collected. Or, when there’s no declared total, the page that quietly repeats an earlier page.
- Below is a 40-line probe you can run right now. On a source that caps at 1,500 of a declared 4,000, it returned VERDICT: INCOMPLETE (missing 2500 rows).
- This is a completeness check, not a correctness check. Different layer, different bug.
What actually goes wrong
You write the loop everyone writes. Walk ?page=1, ?page=2, keep going until a page comes back empty. Stop. Save. Done.
The source has other plans. It says it has 4,000 records — the count is right there in the envelope, or in a “Showing 4,000 results” line in the HTML. But it only ever hands out real data for the first 30 pages. Page 31 doesn’t error. It doesn’t return empty either. It returns page 1 again. Still HTTP 200. Still 50 valid rows. Your loop has no reason to stop, so it grinds on until its own page budget runs out, collects a pile of rows, and exits clean.
You now have 5,000 rows in hand and feel great about it. Looks like plenty. The catch: only 1,500 are unique. The page cap fed you the same first page over and over, and those duplicates hid the shortfall behind a big-looking row count. That is the exact shape of “50 rows passed every check while 4,000 existed” — the scraper saw a lot of rows and trusted the volume.
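The arithmetic behind those numbers is worth a second look. A minimal sketch, assuming the scraper’s own safety budget is 100 pages (an assumption at this point in the post; it matches the budget the probe further down uses):

```python
# Scenario numbers from the text. PAGE_BUDGET = 100 is an assumption that
# matches the safety budget used by the probe later in this post.
PAGE_SIZE = 50
HIDDEN_PAGE_CAP = 30
PAGE_BUDGET = 100
DECLARED_TOTAL = 4000

raw_rows = PAGE_BUDGET * PAGE_SIZE          # every page returns 50 valid rows
unique_rows = HIDDEN_PAGE_CAP * PAGE_SIZE   # only the first 30 pages are distinct
duplicate_rows = raw_rows - unique_rows     # page 1 served again on pages 31-100
missing_rows = DECLARED_TOTAL - unique_rows

print(raw_rows, unique_rows, duplicate_rows, missing_rows)  # 5000 1500 3500 2500
```

The raw count grows with the page budget, not with the source. Double the budget and the run looks twice as successful while the unique count stays at 1,500.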
This is a completeness check, not a correctness check
Quick scope, because this lands next to three failures I’ve written about and it is none of them. An HTTP 200 that lies, with a junk body behind it, is the schema canary problem. A wrong field inside a valid row is a clean row that’s still wrong, a different problem with its own fix. And bytes you paid for that returned nothing are a cost problem; this is a count problem. Here the run is green and every row is correct. What’s wrong is the number of rows: you collected fewer than exist, and nothing threw. This check lives between your scraper and the source’s own claim about how many records there are. It is not about resume, crashes, ETags, 304s, or whether the data went stale. Just one question: did you get all of it?
That distinction matters because the tools that catch the other three are blind here. A status check sees 200 and is happy. A schema check sees a valid row and is happy. A byte counter sees data flowing and is happy. None of them ever asks “is this all of it.” That question needs its own line of code.
Where I keep meeting this
Listing sources. Anything paginated where the platform decides how deep you’re allowed to go. The scraper I’ve leaned on most for this — a Trustpilot review collector — has 962 production runs behind it, and reviews are paginated to the bone. “Showing N of M,” page after page, with the platform free to stop serving real pages whenever it wants. That’s the genre where the declared count and the collected count drift apart, and where a green run means almost nothing on its own.
I want to be precise about what I’m claiming, because the cheap version of this post would inflate it. I am not going to tell you “page caps cost me X rows on site Y” — I don’t keep a clean tally of how many runs hit a silent cap specifically, so I won’t invent one. What I’ll stand behind: across 2,190 production runs, the failure that scared me most wasn’t the loud one. The loud ones page you. This one ships a confident, half-empty dataset into something downstream and waits.
The probe
Here’s the whole thing. Pure stdlib, no network, no browser. The mock source lies the way real ones do, so you can watch the probe catch it before you wire it to your own fetch.
import hashlib

PAGE_SIZE = 50
DECLARED_TOTAL = 4000   # what the envelope claims exists
HIDDEN_PAGE_CAP = 30    # server silently refuses real data past this page
PAGE_BUDGET = 100       # every real scraper has a safety budget; so do we
# 30 pages * 50 = 1,500 reachable rows out of a declared 4,000


def mock_api(page):
    """One page, 1-based. The bug: any page past the cap serves page 1 again,
    still HTTP 200 with a valid envelope. No error, no empty page."""
    served = page if page <= HIDDEN_PAGE_CAP else 1  # <-- the silent cap
    start = (served - 1) * PAGE_SIZE
    rows = [{"id": start + i, "name": f"item-{start + i:05d}"}
            for i in range(PAGE_SIZE)]
    return {"total": DECLARED_TOTAL, "page": page, "rows": rows}


def page_fingerprint(rows):
    ids = ",".join(str(r["id"]) for r in rows)
    return hashlib.sha1(ids.encode()).hexdigest()[:12]


def scrape_naive():
    """Walk pages until one looks empty. It never looks empty here, so we
    stop on the page budget and exit clean -- like real code does."""
    collected, first_fp, cap_at_page = [], None, None
    page = 1
    while page <= PAGE_BUDGET:
        rows = mock_api(page)["rows"]
        if not rows:
            break
        fp = page_fingerprint(rows)
        if page == 1:
            first_fp = fp
        elif fp == first_fp and cap_at_page is None:
            cap_at_page = page - 1  # page K repeats page 1 -> cap is K-1
        collected.extend(rows)
        page += 1
    return collected, first_fp, cap_at_page, page - 1
Two checks do the work, and they cover the two cases you actually meet.
Path A — you have a declared total. Compare it to your unique ids, not your raw count. Raw count is the thing the duplicates inflate; unique ids is the thing that tells the truth.
Path B — there is no declared total. Plenty of sources don’t give you one. Then the anchor is the fingerprint: the page that repeats an earlier page is exactly where the source quietly looped you. No total needed.
def main():
    collected, first_fp, cap_at_page, pages_walked = scrape_naive()
    unique_ids = len({r["id"] for r in collected})
    declared = DECLARED_TOTAL
    completeness = unique_ids / declared if declared else 1.0
    print("=== COMPLETENESS PROBE ===")
    print(f"declared total (envelope) : {declared}")
    print(f"rows collected (raw) : {len(collected)}")
    print(f"unique ids collected : {unique_ids}")
    print(f"pages walked : {pages_walked}")
    print(f"page-1 fingerprint : {first_fp}")
    if cap_at_page is not None:
        print(f"page {cap_at_page + 1} repeats page 1 -> "
              f"SILENT PAGE CAP at page {cap_at_page}")
    verdict = "INCOMPLETE" if unique_ids < declared else "OK"
    print(f"completeness ratio : {unique_ids}/{declared} = {completeness:.3f}")
    print(f"VERDICT : {verdict} (missing {declared - unique_ids} rows)")


# Entry point so the file runs as a script.
if __name__ == "__main__":
    main()
Run it. This is the captured output from my machine, Python 3.13.5, no edits:
=== COMPLETENESS PROBE ===
declared total (envelope) : 4000
rows collected (raw) : 5000
unique ids collected : 1500
pages walked : 100
page-1 fingerprint : 323c5cd0274b
page 31 repeats page 1 -> SILENT PAGE CAP at page 30
completeness ratio : 1500/4000 = 0.375
VERDICT : INCOMPLETE (missing 2500 rows)
Read it line by line
rows collected (raw) : 5000 is the trap. Five thousand rows feels like a win. It’s the number a naive run brags about.
unique ids collected : 1500 is the truth. The page cap fed back page 1 from page 31 onward, so 3,500 of those 5,000 rows are duplicates. Strip them and you have 1,500.
page 31 repeats page 1 -> SILENT PAGE CAP at page 30 is the second detector earning its place. It found the cap without trusting the declared total at all — useful for every source that won’t tell you how many records it has.
completeness ratio : 1500/4000 = 0.375 is the headline. You collected 37.5% of what the source itself says exists. Three-eighths.
VERDICT : INCOMPLETE (missing 2500 rows) is the one boolean you bolt onto your run today. Green exit code, INCOMPLETE verdict. Those two are allowed to disagree, and when they do, the verdict is right.
What to do with this on Monday
Add the unique-id-vs-declared check to your pipeline and fail the run loud when the ratio drops below whatever floor you trust. I’d start strict — anything under 0.95 gets a human — and loosen it once you know a given source’s normal drift.
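A minimal sketch of that gate, assuming you already have the unique-id count and the declared total in hand. The names `IncompleteScrape` and `check_completeness` are hypothetical, not from any library, and the 0.95 default is only the starting floor suggested above.

```python
class IncompleteScrape(Exception):
    """Raised when unique ids fall below the trusted share of the declared total."""


def check_completeness(unique_ids, declared_total, floor=0.95):
    """Return the completeness ratio, or raise if it is under `floor`."""
    if not declared_total:
        # No usable total: this check cannot run, so say so instead of passing.
        raise ValueError("no declared total; use a page-repeat check instead")
    ratio = unique_ids / declared_total
    if ratio < floor:
        raise IncompleteScrape(
            f"{unique_ids}/{declared_total} = {ratio:.3f}, below floor {floor}"
        )
    return ratio


# Numbers from the probe run: 1,500 unique ids against a declared 4,000.
try:
    check_completeness(1500, 4000)
except IncompleteScrape as exc:
    message = f"INCOMPLETE: {exc}"
    print(message)  # a real pipeline would exit non-zero or page someone here
```

Raising instead of logging is the design choice that matters: a warning in a green log is exactly the kind of signal this failure hides behind.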
If the source gives no total, keep the fingerprint check. The page that repeats an earlier page is a free signal that the source stopped serving you real data. Cheap to compute, hard to fake.
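One gap worth noting: the probe above only compares each page against page 1, so a source that loops back to some other page would slip past it. A minimal sketch that remembers every earlier fingerprint instead; `find_first_repeat` and `fake_pages` are hypothetical names, and the tiny fake source exists only to show the shape.

```python
import hashlib


def page_fingerprint(rows):
    """Short hash of the ordered ids on one page (same idea as the probe's helper)."""
    ids = ",".join(str(r["id"]) for r in rows)
    return hashlib.sha1(ids.encode()).hexdigest()[:12]


def find_first_repeat(pages):
    """Return (repeat_page, original_page) for the first page whose fingerprint
    matches ANY earlier page, or None. `pages` is an iterable of row lists;
    page numbers are 1-based."""
    seen = {}  # fingerprint -> page number where it first appeared
    for number, rows in enumerate(pages, start=1):
        if not rows:
            break  # a genuinely empty page ends the walk
        fp = page_fingerprint(rows)
        if fp in seen:
            return number, seen[fp]
        seen[fp] = number
    return None


def fake_pages():
    """Hypothetical source: five real pages of 3 rows, then page 2 served again."""
    for served in [1, 2, 3, 4, 5, 2]:
        start = (served - 1) * 3
        yield [{"id": start + i} for i in range(3)]


print(find_first_repeat(fake_pages()))  # (6, 2)
```

The cost is one short hash per page held in a dict, which is small next to the rows themselves.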
And stop reporting raw row count as success. Report unique ids against the declared total, or against your own previous high-water mark for that source. Raw count is the number that lies to you the most cheerfully.
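For sources with no declared total, a high-water mark can stand in for one. A minimal sketch, assuming a small JSON state file is an acceptable place to keep it; the function name, the file layout, and the 0.95 tolerance are all hypothetical choices, not something from my production setup.

```python
import json
import os
import tempfile
from pathlib import Path


def check_high_water(source, unique_ids, state_path, tolerance=0.95):
    """Compare this run's unique-id count with the best count seen so far.

    Returns (ok, previous_mark). The stored mark only ever moves up."""
    path = Path(state_path)
    marks = json.loads(path.read_text()) if path.exists() else {}
    previous = marks.get(source, 0)
    ok = unique_ids >= previous * tolerance
    if unique_ids > previous:
        marks[source] = unique_ids
        path.write_text(json.dumps(marks, indent=2))
    return ok, previous


# Demo in a throwaway directory: a full run, then a silently capped one.
with tempfile.TemporaryDirectory() as tmp:
    state = os.path.join(tmp, "high_water.json")
    first_run = check_high_water("reviews", 4000, state)
    second_run = check_high_water("reviews", 1500, state)

print(first_run)   # (True, 0)
print(second_run)  # (False, 4000)
```

This only catches a regression against a run that was itself complete. If the very first run hit the cap, the mark starts low and stays quiet.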
One thing I’m still unsure about, and I’ll say so plainly: the fingerprint trick assumes the source repeats a whole prior page. Some caps don’t loop — they just return a final partial page and stop, or shuffle order so no two pages match exactly. I haven’t found one clean detector that covers every flavor of silent cutoff. If you’ve hit a cap shape that slips past both the unique-id check and the page-repeat check, that’s the case I most want to hear about.
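For the shuffled-order case specifically, one partial angle is to stop comparing whole pages and count how many new ids each page contributes. This is a sketch under stated assumptions, not a detector I have validated in production: it assumes ids are stable across requests, the function names are hypothetical, and it does nothing for a source that returns a short final page and simply stops.

```python
def new_ids_per_page(pages):
    """Yield (page_number, new_id_count) for each page, 1-based."""
    seen = set()
    for number, rows in enumerate(pages, start=1):
        ids = {r["id"] for r in rows}
        fresh = ids - seen
        seen |= ids
        yield number, len(fresh)


def first_dry_page(pages, patience=2):
    """Return the first page of a run of `patience` consecutive pages that
    added no new ids, or None if no such run exists."""
    dry_start, dry = None, 0
    for number, fresh in new_ids_per_page(pages):
        if fresh == 0:
            if dry == 0:
                dry_start = number
            dry += 1
            if dry >= patience:
                return dry_start
        else:
            dry = 0
    return None


# Hypothetical source: three real pages, then old ids served in a new order,
# so no page fingerprint ever matches an earlier one.
shuffled = [
    [{"id": i} for i in (0, 1, 2, 3)],
    [{"id": i} for i in (4, 5, 6, 7)],
    [{"id": i} for i in (8, 9, 10, 11)],
    [{"id": i} for i in (3, 2, 1, 0)],
    [{"id": i} for i in (5, 4, 7, 6)],
]

print(first_dry_page(shuffled))  # 4
```

The `patience` parameter is there because a live source can legitimately repeat a few rows across a page boundary when new records arrive mid-crawl, so a single dry page is weaker evidence than two in a row.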
Written by Alexey Spinov. I run production scrapers — 2,190 runs across 32 published actors, the Trustpilot collector alone at 962 — and I write up the failures that a green run hides. This post was drafted with AI assistance and edited, fact-checked, and run by me; the probe output above is captured from a real run on my machine, not generated.
Follow for the next batch of numbers from real runs. And tell me in the comments: what’s the worst silently-incomplete dataset you’ve shipped before you noticed? I read every one.