Scraping All the Text Is the Easy 10%. Keeping the Corpus Worth Training On Is the Other 90% — Notes From 962 Runs
There’s a good ScrapingBee guide from May 19, 2026 — “How to scrape all text from a website for LLM training,” by Ilya Krukowski. It walks you through sitemaps, text extraction, proxies, concurrency, retries. Solid. If you do exactly what it says, you will end up with a folder full of text.
And that’s the part that fools people.
Getting the text out is the easy 10%. I’ve watched the other 90% eat teams alive, and it has almost nothing to do with extraction. It’s what happens after the first successful run — when the same scraper runs again next week, and the week after, and you slowly discover that most of what you’re collecting is stuff you already have.
I’m going to make a claim that sounds like a downer and is actually liberating: on a recurring scrape, the hard problem isn’t fetching pages. It’s not re-counting the same page as new data. Duplicates under different URLs. Pages that didn’t change but got re-downloaded anyway. Stale content quietly rotting between runs. None of that shows up in a tutorial that ends at “save to disk.” All of it shows up around run 50.
Let me tell you where I’m getting this, because I’m allergic to people who hand-wave “in production.”
Where my numbers come from (so you can decide if I’m full of it)
I run scrapers on Apify. My public profile is apify.com/knotless_cadence. The numbers I’ll use here come from my own Apify dashboard, as of May 2026 — they’re the raw lifetime run counters the platform shows me, not a sampled estimate, not rounded up for a headline:
- 2,190 production runs total, summed across 32 published actors.
- 962 of those runs are one actor — a Trustpilot review scraper. That’s my single most-run thing, and it’s the one I’ll lean on, because reviews are the perfect trap for this whole problem.
Why reviews? Because a review site is built to show you the same content under twenty different URLs. Sort by recent. Sort by rating. Filter by 1-star. Page 2 of the 5-star filter overlaps page 1 of “most recent.” Same review, same text, five URLs. A naive “scrape all the text” crawler treats every one of those as a new document. Multiply that by 962 runs and you don’t have a dataset — you have a hall of mirrors.
I should be honest about what 962 runs is and isn’t. It is not 962 distinct sites. It’s the same actor, fired over and over, mostly against the same kinds of pages. Which is exactly why it’s the right teacher for recurring collection — the failure mode I’m describing only appears when you scrape something more than once.
The number that should scare you isn’t mine
Here’s the part I can’t take credit for, and it’s the most important fact in this whole article.
In 2021, Katherine Lee, Daphne Ippolito, Nicholas Carlini and colleagues at Google and UPenn published “Deduplicating Training Data Makes Language Models Better” (arXiv:2107.06499, later at ACL 2022). They measured duplication inside C4 — the cleaned Common Crawl corpus that a generation of models trained on. Not a sketchy scrape. A flagship, already-cleaned dataset.
What they found, in their own measurements:
- About 3.04% of C4’s training examples are near-duplicates (their NearDup measurement), and that’s after the cleaning pass.
- A single 61-word English sentence appears over 60,000 times in C4.
- One near-duplicate cluster holds ~250,933 examples — and C4 has 280 such clusters with more than 5,000 near-dupes each.
Their conclusion: models trained on the deduplicated version “emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy.”
Sit with that. The duplication wasn’t a rounding error in some intern’s weekend scrape. It was baked into a corpus that serious people had already filtered, and it measurably hurt the models trained on it. If an already-filtered reference dataset is 3.04% near-dupes, what do you think your raw, re-run-every-Monday scrape looks like?
I don’t have a clean percentage for my own corpora, and I won’t invent one — the honest answer is “it depends entirely on the site and how often you re-run.” But the direction is not in doubt, and it’s the direction nobody puts in the tutorial.
The 90% you don’t see, broken into three failures
When I say “the other 90%,” I mean three specific things that all wear the costume of “more data.”
1. The same page under N URLs (URL-level duplication). ?utm_source=newsletter, #comments, trailing slashes, query params in different orders, session IDs. Byte-for-byte the same page, and your crawler counts each as a new document. This is the cheapest to fix and the most ignored.
2. The same content re-downloaded even though nothing changed (wasted re-collection). This one’s quieter. The page genuinely didn’t change since last run, but you fetched it, parsed it, hashed it, and then decided it was a dup. You paid for the whole round trip to learn nothing. Across 962 runs that’s a staggering amount of bandwidth, parse time, and proxy budget spent re-confirming things you already knew.
3. Content decay (the dataset rots in place). The page changed, but in the wrong direction — the review got edited, the product page now 404s and redirects to a generic category, the article added a cookie wall. Your old copy is now subtly wrong, and your new copy might be worse. Nobody talks about decay because it doesn’t throw an error. It just slowly poisons the well.
The first two are dedup problems. The third is a freshness problem. The tools overlap, which is why I’ll handle them together.
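Decay is the one failure in that list that a plain hash comparison does not explain: the hash tells you the page changed, not whether it changed for the worse. A rough sketch of two cheap warning signs follows. The function name decay_flags, the flag labels, and the 0.5 shrink ratio are all illustrative assumptions, not measured values, and a real pipeline would need site-specific rules on top.

```python
from urllib.parse import urlsplit

def decay_flags(requested_url: str, final_url: str,
                old_text: str, new_text: str,
                shrink_ratio: float = 0.5) -> list:
    """Return heuristic warning labels for a re-fetched page (assumed thresholds)."""
    flags = []
    # A redirect to a different path often means the original page is gone
    # (for example, a product page now landing on a category page).
    requested_path = urlsplit(requested_url).path.rstrip("/")
    final_path = urlsplit(final_url).path.rstrip("/")
    if requested_path != final_path:
        flags.append("redirected")
    # A sharp drop in extracted text can indicate a cookie wall or a stub page.
    if old_text and len(new_text) < shrink_ratio * len(old_text):
        flags.append("shrunk")
    return flags

print(decay_flags("https://example.com/p/42", "https://example.com/category",
                  "a long product description " * 20, "Accept cookies"))
```

A flagged page is a candidate for review, not an automatic drop: keeping the older copy and marking it unverified is often safer than overwriting it with a worse one.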
The contrarian bit: conditional GET is a data-quality tool, not a politeness one
I wrote a different piece recently arguing that conditional GET — ETag / If-None-Match, Last-Modified / If-Modified-Since, the 304 Not Modified response — is how you stay polite to a source. That’s true. (It’s all in RFC 9110, §13 “Conditional Requests” — rfc-editor.org/rfc/rfc9110.html — if you want the actual spec.)
But here’s the angle I missed at the time, and the reason this is a separate article and not a footnote: conditional GET is also your first line of dedup. A 304 doesn’t just save the server work. It tells you, the collector, “this is the exact same content you already have — don’t even consider it new.” It’s the cheapest dedup signal in existence, handed to you by the HTTP spec, and most “scrape all the text” pipelines throw it away because they fetch with a bare GET every single time.
So the politeness story and the data-quality story are the same mechanism viewed from two ends of the wire. One protects the source. The other protects your corpus. You get both for the price of caching one header.
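A minimal sketch of that mechanism, using only urllib. The cache layout (a dict keyed by URL that stores etag and last_modified) and the names conditional_headers and fetch_if_changed are assumptions made for illustration; the opener parameter exists only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request
from typing import Optional

def conditional_headers(entry: Optional[dict]) -> dict:
    """Build request validators from what a previous run stored for this URL."""
    headers = {}
    if entry:
        if entry.get("etag"):
            headers["If-None-Match"] = entry["etag"]
        if entry.get("last_modified"):
            headers["If-Modified-Since"] = entry["last_modified"]
    return headers

def fetch_if_changed(url: str, cache: dict, opener=urllib.request.urlopen):
    """Return (status, body). body is None when the server answers 304."""
    req = urllib.request.Request(url, headers=conditional_headers(cache.get(url)))
    try:
        with opener(req) as resp:
            body = resp.read()
            # Remember the validators so the next run can ask "has this changed?"
            cache[url] = {
                "etag": resp.headers.get("ETag"),
                "last_modified": resp.headers.get("Last-Modified"),
            }
            return resp.status, body
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; treat it as "skip".
        if err.code == 304:
            return 304, None
        raise
```

A (304, None) result means the document never enters the dedup pipeline at all: no parse, no hash, no comparison. Persisting the cache dict between runs (a JSON file or a key-value store) is left out of the sketch.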
That said — and I want to be careful here — conditional GET only catches the case where the server is honest about caching. Plenty of sites send no ETag, no Last-Modified, or send a fresh one on every request out of misconfiguration. For those, the 304 layer does nothing, and you fall back to hashing the content yourself. Which brings me to code.
The 40-line corpus deduper (stdlib only — run it now)
This is the layer that sits after fetch and before “save to dataset.” Three passes, cheapest first:
- Canonical URL — collapse tracking params, fragments, trailing slashes, param order. Catches failure #1 for free.
- Exact content hash — SHA-256 of normalized text. Catches the same content arriving under a genuinely different URL.
- Near-dup via MinHash — for “almost the same” pages (a review edited by one word, a template page with one swapped field).
No dependencies. Python 3.11. Paste and run:
```python
import hashlib, re, urllib.parse
from typing import Iterable

_TRACKING = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
             "gclid", "fbclid", "ref", "ref_src", "mc_cid", "mc_eid"}

def canonical_url(raw: str) -> str:
    p = urllib.parse.urlsplit(raw.strip())
    # Drop tracking params (matched case-insensitively) and sort the rest,
    # so parameter order no longer produces distinct URLs.
    q = sorted((k, v) for k, v in urllib.parse.parse_qsl(p.query)
               if k.lower() not in _TRACKING)
    path = p.path.rstrip("/") or "/"
    # Lowercase only scheme and host: paths and query values can be
    # case-sensitive, and folding them would merge genuinely different pages.
    return urllib.parse.urlunsplit(
        (p.scheme.lower(), p.netloc.lower(), path, urllib.parse.urlencode(q), ""))

def normalize_text(t: str) -> str:
    return re.sub(r"\s+", " ", t.lower()).strip()

def content_hash(t: str) -> str:
    return hashlib.sha256(normalize_text(t).encode("utf-8")).hexdigest()

_MASK = (1 << 64) - 1

def _h(token: str, seed: int) -> int:
    # Seeded 64-bit hash: the blake2b salt plays the role of one MinHash permutation.
    d = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                        salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(d, "little") & _MASK

def shingles(t: str, k: int = 5) -> set:
    w = normalize_text(t).split()
    if len(w) < k:
        return {" ".join(w)} if w else set()
    return {" ".join(w[i:i+k]) for i in range(len(w)-k+1)}

def minhash(t: str, num_perm: int = 64) -> tuple:
    sh = shingles(t)
    if not sh:
        return tuple([0]*num_perm)
    return tuple(min(_h(s, seed) for s in sh) for seed in range(num_perm))

def jaccard_est(a: tuple, b: tuple) -> float:
    if not a or not b:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def dedup(docs: Iterable[dict], near_threshold: float = 0.85):
    seen_urls, seen_hashes, kept_sigs = set(), set(), []
    kept, url_dups, exact_dups, near_dups = [], 0, 0, 0
    for d in docs:
        # Pass 1: canonical URL.
        cu = canonical_url(d["url"])
        if cu in seen_urls:
            url_dups += 1
            continue
        seen_urls.add(cu)
        # Pass 2: exact hash of normalized text.
        ch = content_hash(d["text"])
        if ch in seen_hashes:
            exact_dups += 1
            continue
        seen_hashes.add(ch)
        # Pass 3: MinHash near-dup against every signature kept so far.
        sig = minhash(d["text"])
        if any(jaccard_est(sig, ks) >= near_threshold for ks in kept_sigs):
            near_dups += 1
            continue
        kept_sigs.append(sig)
        kept.append({"url": cu, "text": d["text"]})
    return kept, dict(url_dups=url_dups, exact_dups=exact_dups, near_dups=near_dups)
```
I ran this against a synthetic re-run: one real review, the same review re-collected under three tracking-param URLs, a verbatim re-scrape under a new path, a one-word edit, and one genuinely different review. The actual output:
in: 7 documents
kept: 3 unique
dropped -> url_dups=3 exact_dups=1 near_dups=0
Seven in, three out. The URL layer ate three. The hash layer ate one. Honest data engineering, the whole thing reproducible on your machine in under a minute.
The part where the threshold bit me
Notice near_dups=0. The one-word edit — “refund within two days” became “three days” — did not get caught.
I expected it to. My first instinct was “the code’s wrong.” It isn’t. The text was only 17 words long, which gives just thirteen 5-word shingles, and changing one word breaks every shingle that touches it: up to five of the thirteen, fewer when the edit sits near either end of the text. I measured the actual Jaccard estimate between the two: 0.81. My threshold was 0.85. So the edited copy stayed in the kept set, exactly as the threshold dictates.
Then I ran the same one-word edit on a normal-length review — about 90 words, the length a real review actually is — and the estimate jumped to 0.89, comfortably over the line. Caught.
The lesson isn’t “tune the threshold until the demo looks good.” The lesson is: MinHash near-dup is unreliable on short text, and 0.85 is not a magic number. On long pages it’s forgiving; on tweet-length snippets it’s twitchy. If your corpus is mostly short reviews, lower the threshold and expect false positives. If it’s long articles, 0.85 is fine. I’d rather show you the demo failing honestly than fake a green checkmark — because the failing demo is the thing that’ll actually save you when your own near-dup pass behaves “weirdly.”
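The length sensitivity is plain set arithmetic, and it can be checked without MinHash at all. The sketch below assumes every word in the text is distinct and uses a hypothetical helper, jaccard_after_one_word_edit. It computes the exact shingle Jaccard, not the 64-hash estimate, so a measured estimate will wander a few points around these values.

```python
def shingle_set(words: list, k: int = 5) -> set:
    """k-word shingles as tuples, using the same windowing as the deduper."""
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_after_one_word_edit(n_words: int, edit_index: int, k: int = 5) -> float:
    """Exact Jaccard between a text of distinct words and a copy with one word replaced."""
    original = [f"w{i}" for i in range(n_words)]
    edited = list(original)
    edited[edit_index] = "EDITED"
    a, b = shingle_set(original, k), shingle_set(edited, k)
    return len(a & b) / len(a | b)

print(round(jaccard_after_one_word_edit(17, 8), 3))   # 0.444 (mid-text edit, 5 shingles broken)
print(round(jaccard_after_one_word_edit(17, 15), 3))  # 0.733 (edit near the end, 2 broken)
print(round(jaccard_after_one_word_edit(90, 45), 3))  # 0.89  (mid-text edit on a long review)
```

With S shingles and b of them broken, the exact value is (S - b) / (S + b). On 17 words the position of the edit swings the result widely; at 90 words a mid-text edit gives 81/91, which lines up with the 0.89 measured above.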
What I’d actually wire up on a recurring scrape
Asymmetric advice, because the failures aren’t symmetric:
- Always canonicalize URLs before you fetch. It’s free and it kills the biggest source of fake “new” documents.
- Always keep a conditional-GET cache (ETag/Last-Modified → 304). It’s your cheapest dedup and your politeness layer in one. RFC 9110 §13 has the exact semantics.
- Hash content as the fallback for sites that won’t give you cache headers honestly.
- Use MinHash near-dup sparingly — it’s the most expensive pass and the most likely to misfire. Reserve it for templated/lightly-edited pages, and tune the threshold to your text length, not to a blog post’s default.
- Stamp every kept document with a fetch timestamp and a content hash. When you eventually ask “is this corpus stale?”, you’ll want both. I didn’t do this on early actors and regretted it — re-deriving “when did this last actually change” after the fact is miserable.
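The stamping in that last bullet can be sketched as a small upsert. The store layout (a dict keyed by canonical URL), the field names fetched_at and last_changed, and the function name stamp are assumptions for illustration; any persistent key-value store would play the same role.

```python
import hashlib
import re
from datetime import datetime, timezone
from typing import Optional

def content_hash(text: str) -> str:
    """SHA-256 of lowercased, whitespace-collapsed text."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def stamp(store: dict, url: str, text: str, now: Optional[datetime] = None) -> str:
    """Upsert one document and report whether it is new, unchanged, or changed."""
    now = now or datetime.now(timezone.utc)
    digest = content_hash(text)
    prev = store.get(url)
    if prev is None:
        status, last_changed = "new", now
    elif prev["content_hash"] == digest:
        # Same content: refresh fetched_at but keep the older change date.
        status, last_changed = "unchanged", prev["last_changed"]
    else:
        status, last_changed = "changed", now
    store[url] = {
        "content_hash": digest,
        "fetched_at": now,
        "last_changed": last_changed,
        "text": text,
    }
    return status
```

Keeping fetched_at and last_changed as separate fields is the point: the first answers "when did I last look?", the second answers "when did this last actually change?", and only the pair lets you judge staleness later.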
That’s it. None of this is clever. All of it is the stuff that separates “I scraped a website once” from “I’ve run the same scraper 962 times and the dataset didn’t turn into 40% copies of itself.”
So, is “scrape all the text” wrong?
No. Krukowski’s guide is a good map of the first 10%. I just think the title of the real job is different: it’s not “scrape all the text,” it’s “scrape all the text once, and prove the rest is genuinely new.” The extraction is a solved problem with a dozen good tutorials. The dedup-and-decay layer is where corpora live or die, and it’s the part the tutorials end right before.
Lee and her co-authors proved it costs you real model quality. My 962 runs proved it costs you real money and bandwidth re-confirming pages you already had. Both arrows point the same way. Build the boring layer.
I’m Aleksey — I run production scrapers, including a Trustpilot actor that’s logged 962 runs (apify.com/knotless_cadence). If you’ve got a recurring scrape that’s quietly bloating into copies of itself — or a scraping→LLM pipeline where the data quality is the bottleneck, not the extraction — I’ll build the dedup/freshness layer that keeps your corpus worth training on. Tell me the site and the cadence: spinov001@gmail.com.
Written with AI assistance; all numbers, code, and the demo output are mine and reproducible.