Your Scraper Died at Row 12,000. The Rerun Pattern.
My scraper died at row 12,000 of 50,000, three hours in. The crash itself was cheap. A process gets OOM-killed, a quota trips, a machine reboots, it happens. The expensive part came next: I re-ran it. From zero. And paid, in time and in requests, for the 11,999 rows I already had sitting on disk.
That second bill is the one nobody writes code for. This post is the code. It’s about 40 lines of stdlib Python that let a crashed job pick up where it died, fetching only the missing rows and writing zero duplicates, plus the real captured output of a run that crashes and a rerun that finishes it cleanly.
To be clear about scope: this is the run after the crash — how to restart a long job so it finishes the work it lost without re-fetching what it already pulled and without writing a row twice. It is not retry/backoff inside a single request (that’s a different post of mine), not schema-drift detection (the post where I said “a crash loses the run” — this is the part where you get the run back), not a budget kill-switch that stops a runaway, and explicitly not conditional-GET / ETag / “skip unchanged pages” — that’s freshness, a separate question entirely. Just: your job died mid-way, the clock and the bill are still running, how do you resume cheap and clean.
TL;DR
- A long scrape that dies at hour 3 of 4 didn’t lose one request. It lost the whole run. Retry doesn’t help here; resume does.
- The fix is three small things: a stable idempotency key per item, a checkpoint cursor written atomically to disk, and an upsert instead of a blind append.
- I ran a 5,000-row local job, killed it at row 3,000, and reran it. The rerun fetched only the missing 2,000 and wrote zero duplicates. Final output: 5,000 rows, 5,000 unique. Real captured output below.
- Across 2,190 production runs (962 on a single Trustpilot source), long jobs do die mid-way. The cost that bites isn’t the crash — it’s paying to re-collect everything you already had.
- It’s stdlib Python. No DB, no framework, no paid API. You can reproduce both runs in about five seconds.
Retry is the wrong layer
Retry fixes a request. It does nothing for a job. That’s the whole confusion.
When a scraper flakes, the reflex advice is “add retries with backoff.” Good advice, for the layer it lives at. A 429, a connection reset, a slow socket: retry the request a few times with jitter and most transient failures evaporate. I’m a believer; I wrote a whole post on closing resources and retrying inside a run.
But think about what actually happened at row 12,000. The process is gone. The Python interpreter that held your in-memory list of results, your retry counter, your for loop: all of it, evaporated. There is no request to retry, because there is no process left to retry it in. Retry operates inside a run. The thing that died is the run.
So the recovery layer isn’t the request. It’s the job. And the canonical “just add retries” advice quietly skips that level, because at the request layer everything looks handled.
I bumped into this exact gap once and walked straight past it. In an earlier post about silent schema drift, I argued for making a data-shape check non-fatal, and the reason I gave was: “a crash loses the run. If you blow up on record 12,000 of 50,000, you’ve thrown away the 11,999 good records you already pulled.” True. But I used it only as an argument to not crash this run. I never said what happens on the next run after a crash you didn’t prevent. This post is that next run.
What makes a rerun cheap?
A rerun is cheap when it does only the work the first run didn’t finish. To get there you need three things, and none of them is fancy.
1. A stable idempotency key, per item. Not the row number. The row number is a lie the moment you skip something: skip 3,000 rows and item 3,001 is now “row 1” of the rerun. Key off something the source gives you: an id, a URL, a SKU. Mine is (source, item_id). With that, the second run can ask “do I already have this exact item?” and get a truthful yes/no regardless of order.
2. A checkpoint cursor, written atomically. You want to flush progress to disk as you go, not hold it in memory where a crash takes it down with the process. The subtle part: writing the cursor itself can crash mid-write, leaving you with a truncated, useless file. The fix is write-to-temp-then-rename: os.replace() swaps the file in a single atomic step as long as the temp file sits on the same filesystem as the target, so the cursor on disk is always either the old complete value or the new complete value, never a half-written one. Even a crash during a checkpoint can’t corrupt your recovery state.
3. An upsert, not a blind append. A naive scraper opens its output and appends every row it scrapes. Rerun it and every row you already had is written a second time. The pattern instead reads which keys are already written and skips them (strictly an insert-if-absent, since existing rows are never updated). The corpus stays clean no matter how many times the job restarts.
That’s it. Stable key, atomic cursor, skip-what-you-have. The cleverness is in not being clever.
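To make the first point concrete, here is a small sketch of how a position-based key collides across a crash and a rerun while a (source, item_id) key does not. The five-item list and the variable names are made up for illustration; only idem_key mirrors the function used later in this post.

```python
def idem_key(item):
    # Same shape as the post's key: built from what the source provides.
    return f"{item['source']}:{item['item_id']}"

# Illustrative data: five items; the first run crashed after three of them.
items = [{"source": "demo", "item_id": i} for i in range(100, 105)]
first_run = items[:3]
rerun = items[3:]          # a rerun that walks only what is left

# Position keys restart at 0 on the rerun, so they collide with run 1.
pos_first = {f"row:{n}" for n, _ in enumerate(first_run)}
pos_rerun = {f"row:{n}" for n, _ in enumerate(rerun)}
position_collisions = sorted(pos_first & pos_rerun)

# Stable keys never collide, because each one names a specific item.
id_first = {idem_key(it) for it in first_run}
id_rerun = {idem_key(it) for it in rerun}
stable_collisions = sorted(id_first & id_rerun)

print(position_collisions)  # ['row:0', 'row:1']
print(stable_collisions)    # []
```

The two colliding position keys point at different items in each run, which is exactly the case where a "do I already have this?" check gives a wrong answer.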
The pattern
Pure stdlib. The “scrape” here is a deterministic generator of 5,000 items instead of a network call — on purpose, so you can run it yourself in seconds with no proxies, no keys, no target site. The mechanic being demonstrated (key + checkpoint + upsert + delta rerun) doesn’t depend on the transport; swap work_items() for your real fetch and the recovery logic is unchanged.
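The output file is line-per-record JSON (JSON Lines). That choice matters: each row is appended on its own line and flushed, so a crash of the process costs you at most one half-written final line, and the loader below skips exactly that. One caveat: flush() hands the row to the operating system, which is enough to survive a killed process; surviving a power loss or a hard reboot would also need an os.fsync() on the output file.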
import os, json

TOTAL = 5000
CRASH_AT = 3000
OUT = "scrape_output.jsonl"
CURSOR = "cursor.json"

def work_items():
    # The "scrape". Stable per-item id, NOT a row counter.
    for i in range(TOTAL):
        yield {"source": "demo", "item_id": i, "payload": f"value-{i:05d}"}

def idem_key(item):
    return f"{item['source']}:{item['item_id']}"  # stable across reruns

def load_done_keys(path):
    # Rebuild what's already written. The output FILE is the source of truth.
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # half-written final line from the crash: skip it
            done.add(f"{rec['source']}:{rec['item_id']}")
    return done

def checkpoint(path, last_index):
    # Atomic: temp file + os.replace. A crash mid-write can't truncate it.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"last_index": last_index}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
The loop ties them together. On a resume, it loads the done-keys first, then walks the same item stream and skips anything already on disk — the upsert — appending only the delta and checkpointing the cursor as it goes:
def run(resume):
    done = load_done_keys(OUT) if resume else set()
    fetched_this_run = duplicates = 0
    with open(OUT, "a", encoding="utf-8") as out:
        for idx, item in enumerate(work_items()):
            key = idem_key(item)
            if key in done:          # already have it -> skip (upsert)
                duplicates += 1      # a blind-append script re-writes here
                continue
            if (not resume) and idx == CRASH_AT:
                raise RuntimeError(f"simulated crash at index {idx}")
            out.write(json.dumps({**item, "result": item["payload"].upper()}) + "\n")
            out.flush()
            done.add(key)
            fetched_this_run += 1
            if idx % 500 == 0:
                checkpoint(CURSOR, idx)
Notice the duplicates += 1 line. On the first run it never fires. On the rerun it fires once for every item already on disk — that counter is the proof that a blind-append version of this script would have written those rows a second time, and this one didn’t.
The full runnable file (with the cursor reader, the summary print, and the --resume flag) is at the bottom.
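One gap worth flagging: the demo crash lands between rows, so the half-written-final-line case is never exercised. If a real crash does leave a partial line with no trailing newline, opening the file in append mode would glue the next record onto it, and the loader would then skip that merged line as invalid JSON. A minimal sketch of one way to guard against that, assuming it runs once before the resume loop opens the output; trim_partial_tail is a hypothetical helper, not part of the script in this post, and it reads the whole file for brevity.

```python
import os
import tempfile

def trim_partial_tail(path):
    """Cut off a trailing line that has no newline terminator.

    Returns the number of bytes removed (0 if the file was already clean).
    """
    if not os.path.exists(path):
        return 0
    with open(path, "rb+") as f:
        data = f.read()
        if not data or data.endswith(b"\n"):
            return 0
        keep = data.rfind(b"\n") + 1   # 0 when no complete line exists
        f.truncate(keep)
        return len(data) - keep

# Small self-contained demonstration on a throwaway file.
complete = b'{"source": "demo", "item_id": 0}\n'
partial = b'{"source": "demo", "ite'          # simulated half-written row
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "scrape_output.jsonl")
    with open(path, "wb") as f:
        f.write(complete + partial)
    removed = trim_partial_tail(path)
    with open(path, "rb") as f:
        repaired = f.read()

print(removed == len(partial), repaired == complete)
```

The dropped row is simply absent from the done-keys afterwards, so the normal resume logic fetches it again.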
Run it: crash, then resume
Two commands. First run starts fresh and dies at index 3,000. Second run resumes. Here’s the actual terminal, copy-pasted, not cleaned up:
########## RUN 1 (fresh, will crash) ##########
Traceback (most recent call last):
  File "resume_demo.py", line 126, in <module>
    run(resume=a.resume)
  File "resume_demo.py", line 99, in run
    raise RuntimeError(f"simulated crash at index {idx}")
RuntimeError: simulated crash at index 3000
exit code: 1
########## state on disk after crash ##########
rows in output: 3000
cursor: {"last_index": 2500}
########## RUN 2 (--resume) ##########
=== RUN 2 (--resume) summary ===
resumed from cursor index : 2500
items already on disk : 3000
fetched this run : 2000
duplicate writes avoided : 3000
final rows in output : 5000
exit code: 0
And the independent check on the file afterward, because a summary that prints its own numbers is a summary you shouldn’t trust:
total lines in output: 5000
unique item_ids: 5000
duplicate lines (should be 0): 0
Read the second run’s numbers. fetched this run: 2000, not 5,000. The rerun touched only the rows the crash lost. duplicate writes avoided: 3000 means every row that was already on disk got skipped instead of re-written. final rows: 5000, and unique item_ids: 5000 from the independent count, so the job is genuinely complete with nothing doubled. The crash cost me the last 2,000 rows and exactly nothing else.
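The check's output is shown above but not the check itself. A minimal sketch of what such an independent count could look like, assuming the same JSON Lines layout and the same (source, item_id) key; verify is a hypothetical name, and the three-line sample file exists only to make the example runnable on its own.

```python
import json
import os
import tempfile
from collections import Counter

def verify(path):
    """Return (total_lines, unique_keys, duplicate_lines) for a JSONL output."""
    counts = Counter()
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            total += 1
            rec = json.loads(line)  # a corrupt line should fail loudly here
            counts[(rec["source"], rec["item_id"])] += 1
    duplicates = sum(n - 1 for n in counts.values())
    return total, len(counts), duplicates

# Illustrative sample: item 1 appears twice, as a blind append would leave it.
rows = [
    {"source": "demo", "item_id": 0},
    {"source": "demo", "item_id": 1},
    {"source": "demo", "item_id": 1},
]
with tempfile.TemporaryDirectory() as d:
    sample = os.path.join(d, "sample.jsonl")
    with open(sample, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r) + "\n")
    result = verify(sample)

print(result)  # (3, 2, 1)
```

Unlike the loader, this version deliberately does not swallow JSON errors: a verification pass is the place where a damaged line should stop you.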
One honest wrinkle, because I’d rather point it out than have you spot it. The crash happened at index 3,000, but the cursor on disk said 2500. They disagree by 500. That’s not a bug, it’s the design. The cursor is checkpointed every 500 rows, but every row is flushed to the output file the instant it’s written. So the output file, not the cursor, is the real source of truth: load_done_keys() rebuilds progress from the 3,000 rows actually on disk, and the cursor is just a cheap hint. If I had trusted the cursor alone and restarted at index 2,501, I’d have re-fetched the 499 rows (indices 2,501 through 2,999) I already had, and a blind append would have written them twice. Trusting the durable output instead, I re-fetched zero. Pick the more durable record as your truth.
What this costs you in production
The reason I care about this isn’t the demo. It’s that long jobs really do die, and I have the run counter to say so.
I run scrapers in production: 2,190 runs across 32 published actors, with one Trustpilot review scraper at 962 runs by itself (that’s a lifetime run meter on my Apify profile, knotless_cadence, as of mid-2026; not a controlled study, just a long-running counter). When you run one source nearly a thousand times, you stop asking if a multi-hour job will get interrupted and start assuming it will. OOM on a big page batch, a proxy pool hiccup, a quota wall, a deploy that restarts the worker: any of them ends the process, and the process is the run.
Here’s the cost math, on the numbers from my opening. A 50,000-row job that dies at row 12,000. Rerun-from-zero re-pays for all 50,000, and 24% of that work (the 12,000 rows you’d already collected) is pure waste. Now flip the crash point. The longer a job runs, the more chances it has had to hit something, so a late crash is the case to plan for. A job that dies at row 40,000 of 50,000 and reruns from zero re-collects 40,000 rows you already had: you pay 80% of the bill a second time to recover the last 20%. Resume-the-delta pays for 10,000. That’s the whole pitch: the later the crash, the more brutal the rerun-from-zero penalty, and the more a stable key plus a durable output saves you.
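The arithmetic in that paragraph fits in a few lines. This is only a sketch of the comparison, counting cost in rows fetched and assuming every row costs the same; rerun_cost is a hypothetical helper name.

```python
def rerun_cost(total_rows, rows_on_disk):
    """Compare a rerun from zero with a resume, measured in rows fetched."""
    from_zero = total_rows                      # re-fetch everything
    resume = total_rows - rows_on_disk          # fetch only the delta
    wasted_share = rows_on_disk / total_rows    # share of the rerun already owned
    return from_zero, resume, wasted_share

early = rerun_cost(50_000, 12_000)
late = rerun_cost(50_000, 40_000)

print(early[0], early[1], f"{early[2]:.0%}")  # 50000 38000 24%
print(late[0], late[1], f"{late[2]:.0%}")     # 50000 10000 80%
```

The rerun-from-zero column never changes, while the resume column shrinks as the crash moves later, which is where the savings come from.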
You might reasonably say: 5,000 rows on one laptop isn’t a distributed production crawl. Fair. It isn’t. The mechanic is identical, but a single-machine flat file is the simplest place it lives, not the only one, and the next section is exactly where this version stops being enough.
Where this breaks
I’d rather hand you the failure modes than let you find them at row 12,000 of your own job.
- The source gives you no stable id. The whole pattern hangs off the key. If items have no id, URL, or natural unique field, you’re stuck choosing a worse one: content hash (breaks the instant the content legitimately changes), position (breaks the instant you skip), or fuzzy match (breaks in ways you won’t notice for weeks). This is the genuinely hard part, and I haven’t solved it cleanly.
- One machine, one file. Reading done-keys from a local file is fine for a single worker. Run the same job across a pool of workers and a flat file becomes a race: two workers can both read “not done,” both fetch, both write. At that point the done-keys set has to live somewhere shared and atomic: Redis SETNX, a unique constraint in Postgres, an INSERT ... ON CONFLICT DO NOTHING. Same idea, different home for the key.
- Flat-file upsert is not a database. load_done_keys() reads the whole output to rebuild the set. At a few thousand rows that’s instant. At tens of millions it’s a startup cost you’ll feel, and the right move is a real keyed store, where “have I seen this key” is an index lookup, not a file scan.
- It doesn’t help if the source died, not your job. Resume assumes the data is still there to re-fetch. If the site is down, rate-limiting you to a crawl, or has removed the rows since your first pass, resuming cleanly still gets you an incomplete corpus. The pattern recovers your failure, not the world’s.
That last one is the boundary I keep relearning. A clean resume is a promise about your bookkeeping, not about the source.
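For the "different home for the key" idea, here is a rough sketch using the stdlib sqlite3 module as a stand-in. It is an assumption for illustration only: SQLite does not make a good shared store across machines, but a primary key plus INSERT OR IGNORE has the same claim-once shape that a Postgres unique constraint with ON CONFLICT DO NOTHING would give you. open_store, claim, and the done table are hypothetical names.

```python
import sqlite3

def open_store(path):
    """Open (or create) a tiny keyed store with one unique column."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS done (key TEXT PRIMARY KEY)")
    conn.commit()
    return conn

def claim(conn, key):
    """Return True only for the first caller that records this key."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO done (key) VALUES (?)", (key,)
        )
    return cur.rowcount == 1   # 0 means the key was already present

store = open_store(":memory:")           # in-memory only for the demo
first = claim(store, "demo:3001")
second = claim(store, "demo:3001")
other = claim(store, "demo:3002")
store.close()

print(first, second, other)  # True False True
```

Note that this records a key when it is claimed, not when its row is safely written. A worker that crashes between the two leaves a claimed key with no data, so a fuller version would need to track completion separately.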
What I’d change on Monday
If you run anything that takes more than a few minutes, do these three before the next long job:
- Pick the idempotency key before you write the scraper. It’s the one decision the whole recovery story depends on, and it’s free to get right up front and expensive to retrofit. If the source has a stable id, use it. If it doesn’t, that’s a design problem to solve now, not at row 12,000.
- Make your output durable per-row and treat it as the source of truth. Append line-per-record and flush. Then “what have I already done” is a question you answer from disk, not from a process that might not exist anymore. The cursor is a hint; the output is the truth.
- Make the rerun the default way you finish a job, not the emergency. A job you can stop and resume at any row is a job you can also run in cheap chunks, pause for a deploy, or split across a maintenance window. Resume isn’t just crash insurance — it’s what makes a long job something you can actually operate.
Open question I haven’t solved cleanly, and I’d genuinely like your answer: what’s your idempotency key when the source gives you no stable id? Content hash, scroll position, fuzzy match on a few fields — every option I’ve tried has a failure mode that shows up weeks later in a way that’s painful to debug. What do you actually use in prod?
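For reference, the content-hash option named above could be sketched like this. It does not answer the open question; it only makes the trade-off visible. The field names and the content_key helper are hypothetical, and the choice of which fields to hash is exactly the part that tends to go wrong.

```python
import hashlib
import json

def content_key(item, fields):
    """Hash a chosen subset of fields into a fallback idempotency key."""
    subset = {name: item.get(name) for name in fields}
    # Canonical form: sorted keys and fixed separators, so dict order
    # and whitespace cannot change the hash.
    canonical = json.dumps(
        subset, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative record with one volatile field (helpful_votes).
review = {
    "author": "a.b",
    "posted": "2026-01-05",
    "text": "Great service",
    "helpful_votes": 3,
}
fields = ("author", "posted", "text")

same_item = content_key(review, fields) == content_key(
    {**review, "helpful_votes": 9}, fields      # volatile field changed
)
edited_item = content_key(review, fields) == content_key(
    {**review, "text": "Great service!"}, fields  # hashed field changed
)

print(same_item)    # True
print(edited_item)  # False
```

Leaving the volatile field out keeps the key stable across reruns, but a legitimate edit to a hashed field makes the same item look new, which is the failure mode described in the "Where this breaks" list.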
Full script (resume_demo.py, stdlib only — run python3 resume_demo.py then python3 resume_demo.py --resume):
#!/usr/bin/env python3
"""resume_demo.py: resume a crashed job WITHOUT re-fetching or double-writing.

The "scrape" is a deterministic local generator (no network, no browser, no paid
API). The mechanic shown (idempotency key + atomic checkpoint + upsert +
delta-only rerun) does not depend on the transport, so anyone can reproduce this.

    python3 resume_demo.py            # run 1: crashes at CRASH_AT
    python3 resume_demo.py --resume   # run 2: finishes only the missing delta
"""
import os, sys, json, argparse

TOTAL = 5000                  # items in the whole job
CRASH_AT = 3000               # run 1 dies right before processing this index
CHECKPOINT_EVERY = 500        # flush the cursor to disk this often
OUT = "scrape_output.jsonl"   # durable, line-per-record output
CURSOR = "cursor.json"        # last-known-good progress marker

def work_items():
    """The 'scrape'. Yields 5,000 records with a STABLE per-item id."""
    for i in range(TOTAL):
        yield {"source": "demo", "item_id": i, "payload": f"value-{i:05d}"}

def idem_key(item):
    """Idempotency key = (source, item_id). Stable across reruns, not position."""
    return f"{item['source']}:{item['item_id']}"

def load_done_keys(path):
    """Rebuild the set of keys already written. The output file is the truth."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # half-written final line from the crash: skip it
            done.add(f"{rec['source']}:{rec['item_id']}")
    return done

def checkpoint(cursor_path, last_index):
    """Atomic write: temp file + os.replace. Crash mid-write can't truncate it."""
    tmp = cursor_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"last_index": last_index}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, cursor_path)

def read_cursor(cursor_path):
    if not os.path.exists(cursor_path):
        return -1
    with open(cursor_path, "r", encoding="utf-8") as f:
        return json.load(f).get("last_index", -1)

def process_one(item):
    """The 'expensive' step. In prod this is the network fetch you pay for."""
    return {**item, "result": item["payload"].upper()}

def run(resume):
    done = load_done_keys(OUT) if resume else set()
    if not resume:
        for p in (OUT, CURSOR):
            if os.path.exists(p):
                os.remove(p)
    resumed_from = read_cursor(CURSOR) if resume else -1
    fetched_this_run = 0
    duplicates = 0
    out = open(OUT, "a", encoding="utf-8")
    try:
        for idx, item in enumerate(work_items()):
            key = idem_key(item)
            if key in done:          # idempotency: already have it, skip
                duplicates += 1      # a blind-append script would re-write here
                continue
            if (not resume) and idx == CRASH_AT:
                raise RuntimeError(f"simulated crash at index {idx}")
            rec = process_one(item)
            out.write(json.dumps(rec) + "\n")
            out.flush()
            done.add(key)
            fetched_this_run += 1
            if idx % CHECKPOINT_EVERY == 0:
                checkpoint(CURSOR, idx)
        checkpoint(CURSOR, TOTAL - 1)
    finally:
        out.close()
    final_rows = len(load_done_keys(OUT))
    label = "RUN 2 (--resume)" if resume else "RUN 1 (fresh)"
    print(f"=== {label} summary ===")
    print(f"resumed from cursor index : {resumed_from}")
    print(f"items already on disk : {len(done) - fetched_this_run}")
    print(f"fetched this run : {fetched_this_run}")
    print(f"duplicate writes avoided : {duplicates}")
    print(f"final rows in output : {final_rows}")

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--resume", action="store_true",
                    help="continue an existing output instead of starting fresh")
    a = ap.parse_args()
    run(resume=a.resume)
Written by Aleksey Spinov. I write up the cost and failure math from real production scraping — 2,190 runs and counting. Follow for the next one, and tell me your idempotency key for a source with no stable id — I read every comment.
AI disclosure: drafted with AI assistance; the pattern, the script, and every number in this post were produced and verified by me. The Python here was run locally (stdlib, no third-party deps); the crash, the resume, and the independent file check shown are the real output, not a mock-up.