I've Run 2,190 Production Scrapes. The Framework You Pick Isn't What Breaks — Here's What Actually Does
Every spring the same article comes back. This year it’s ScrapingBee’s “8 Best Scrapy Alternatives for 2026” — a clean, well-written piece ranking Crawlee, Playwright, and managed APIs by JavaScript support, anti-bot handling, and operating cost. Good article. I have no quarrel with it.
My quarrel is with the question underneath it.
Because the question everyone keeps asking — Scrapy or Playwright or Crawlee? — is the wrong one to spend a week on. I’ve watched scrapers run 2,190 times in production across 32 actors on Apify. One of them, a Trustpilot review scraper, has 962 runs by itself. (Those are raw lifetime counters from my own dashboard — apify.com/knotless_cadence, as of May 2026, not a sampled estimate.) And here’s the thing the framework debate never tells you: across all of those runs, the framework almost never decided whether a scrape lived or died.
Three other things did. The same three, over and over, whether the code underneath was Scrapy, Playwright, or a hand-rolled requests loop. They’re not framework features. They’re disciplines. And nobody’s selling them, so nobody writes the comparison post.
Let me walk through all three. Then I’ll give you a 30-line piece of code for the worst one, and the actual output from running it — including the part where I got it wrong the first time.
Why this isn’t another “pick the right tool” post
I want to be precise about the claim, because it’s easy to overstate it into something dumb.
I’m not saying tools don’t matter. If you’re scraping a React app that renders everything client-side, requests + BeautifulSoup will fail and Playwright won’t — that’s a real, framework-shaped difference. The ScrapingBee piece is right that JavaScript-heavy sites are the number-one reason teams leave Scrapy.
What I’m saying is narrower and, I think, more useful: once you’ve picked a tool that can technically reach the page, the framework stops being the variable that determines whether the job survives a 50,000-page run. At that point you’re not fighting the framework. You’re fighting three failure modes that every framework leaves wide open by default, because they can’t fix them for you. They live in how you drive the tool, not in the tool.
I’ll tell you how I learned this: by losing scrapes to all three, in production, with a real client’s data on the line. Not in a benchmark.
Leak #1: “Wait until the page is done loading” is not a thing you can wait for
This is the one that cost me the most embarrassment, so it goes first.
When I started running browser-based scrapes, I did what the tutorials show. I navigated, then waited for the network to go quiet: wait_until="networkidle" in Playwright, the moral equivalent of page.waitForNavigation({waitUntil: 'networkidle0'}) in Puppeteer. It reads like common sense. Wait until the page is done. Then scrape. What could be cleaner?
Here’s what’s cleaner: nothing, because “done” isn’t a state most modern pages ever reach.
A lot of the sites worth scraping have something polling in the background. Analytics beacons every few seconds. A websocket keeping a feed warm. A “live” counter re-fetching itself. To the browser, the network is never idle — there’s always one more request 4 seconds out. So networkidle does one of two things: it hangs until your timeout fires, or it gets lucky in a gap and fires before the content you actually wanted has rendered. Both are bugs. The first looks like a slow scraper. The second looks like a flaky scraper, which is worse, because it passes in your tests and fails at 3 a.m. on run number 600.
I don’t have to argue this from my own scars, though I have them. Playwright’s own documentation marks the option as discouraged, in plain words. From the page.goto() reference:
‘networkidle’ - DISCOURAGED consider operation to be finished when there are no network connections for at least 500 ms. Don’t use this method for testing, rely on web assertions to assess readiness instead.
(That’s verbatim from playwright.dev/docs/api/class-page; the same line repeats under goBack, goForward, and reload.) When the people who built the wait condition tell you not to use it, that’s not a hot take. That’s a maintainer waving you off a footgun.
The fix is boring, which is why it works. Don’t wait for the network. Wait for the specific thing you came for. The price row. The review card. The “results loaded” element. Wait for that to be present and stable, and ignore everything else the page is doing. In Playwright that’s page.wait_for_selector(...) or an expect(...).to_be_visible() assertion; in any tool it’s “poll for the element/predicate that means my data is here.”
The mental shift: stop asking “is the page finished?” — it isn’t, and it never will be. Ask “is the thing I need on screen yet?” That question has an answer. networkidle is answering a question the page can’t honestly respond to.
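A minimal, tool-agnostic sketch of that question in code, standard library only. The names wait_for and FakePage are hypothetical, invented for illustration; the fake page stands in for whatever handle your browser tool gives you, and with Playwright you would normally reach for page.wait_for_selector(...) rather than roll your own loop.

```python
import time


def wait_for(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns a truthy value or the timeout elapses.

    Returns the truthy value; raises TimeoutError if it never shows up.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = predicate()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a page whose network never goes idle.

    The review cards only appear on the third look, the way late-rendered
    content does on a real site.
    """

    def __init__(self):
        self.looks = 0

    def query_reviews(self):
        self.looks += 1
        return ["review-1", "review-2"] if self.looks >= 3 else []


page = FakePage()
reviews = wait_for(page.query_reviews, timeout=2.0, interval=0.01)
print(reviews, "after", page.looks, "looks")
```

The loop never asks whether the page is finished. It asks whether the data is there, and it fails loudly with a TimeoutError when it is not, instead of returning an empty result with no error.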
Switching the Trustpilot actor from network-idle waiting to element-targeted waiting is the single change that did the most for its stability. I didn’t measure it as a clean before/after percentage — I changed a few things in the same week and I’m not going to invent a number I can’t stand behind. But the “scrape returned empty, no error” tickets stopped. That I remember clearly, because they were the ones that woke me up.
Leak #2: memory creeps up on long sessions, and the framework won’t tell you
This one is quieter and meaner, because it doesn’t fail your scrape. It fails your machine, three hours in, when nobody’s watching.
A browser is a heavy thing to keep alive. If you open a context, scrape a page, and keep the same browser process running across hundreds or thousands of pages — which is exactly what you do on a long run, because spinning up a fresh browser per page is slow — memory does not come back the way you hope. Detached DOM nodes, cached responses, listeners you forgot to remove, the browser’s own internal caches. None of it is a “leak” in the textbook C-sense. It’s just growth. Slow, monotonic, polite growth, right up until the OOM killer ends the process and your run dies at page 4,000 with no useful stack trace.
I’ll be honest about the precision here, because the quality bar I hold myself to says I have to be: I never had clean per-page memory telemetry on these runs. What I had was the symptom — long jobs that died late, on the bigger targets, in a way that short jobs never did, and a resident-memory number that I’d eyeball climbing in Activity Monitor when I bothered to look. That’s a rough observation, not a logged metric. Treat it as the shape of the problem, not a benchmark.
But the shape is enough to act on, and the fixes are cheap:
- Recycle the browser context on a counter. Every N pages — I use a few hundred, tune it to your target — close the context and open a fresh one. You eat a small restart cost and you cap the growth. The job stops being a slow climb toward the cliff and becomes a sawtooth that never reaches it.
- Don’t hold the whole crawl in a list. The other half of long-run memory death isn’t the browser, it’s you appending every parsed record to one giant Python list and writing it out at the end. Stream to disk or to a queue as you go. If the process dies at page 4,000, you want 3,999 records already saved, not zero.
- Actually close things. Contexts, pages, response bodies you read once. The async with / with pattern exists for this. It’s unglamorous. Skipping it is how you turn a 10-hour job into a 3-hour job that crashed.
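The three fixes above fit in one small loop. This is a sketch under stated assumptions, not any framework's API: FakeContext is a hypothetical stand-in for a real browser context, crawl and recycle_every are names made up for the example, and the output is a JSON Lines file written one record at a time.

```python
import json
import os
import tempfile


class FakeContext:
    """Hypothetical stand-in for a browser context; counts opens and closes."""
    opened = 0
    closed = 0

    def __init__(self):
        FakeContext.opened += 1

    def fetch(self, url):
        return {"url": url, "title": f"page for {url}"}

    def close(self):
        FakeContext.closed += 1


def crawl(urls, out_path, new_context=FakeContext, recycle_every=300):
    """Scrape urls, recycling the context every `recycle_every` pages and
    appending each record to disk as soon as it is parsed."""
    ctx = new_context()
    try:
        with open(out_path, "a", encoding="utf-8") as out:
            for n, url in enumerate(urls, start=1):
                record = ctx.fetch(url)
                out.write(json.dumps(record) + "\n")
                out.flush()  # push the line out of Python's buffer right away
                if n % recycle_every == 0:
                    ctx.close()          # cap the growth: drop the old context
                    ctx = new_context()  # and start from a fresh one
    finally:
        ctx.close()  # close the last context even if a page raised


out_path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
crawl([f"https://example.com/p/{i}" for i in range(10)], out_path, recycle_every=4)
with open(out_path, encoding="utf-8") as f:
    saved = [json.loads(line) for line in f]
print(len(saved), "records,", FakeContext.opened, "contexts opened,", FakeContext.closed, "closed")
```

Nothing accumulates in a list, so memory held by the crawl itself stays flat, and every context that was opened gets closed. The recycle interval of 4 is only there to make the demo visible; a real run would use something in the hundreds.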
Notice none of that is “use a better framework.” Scrapy, Playwright, Crawlee — they all let you keep one process alive across thousands of pages, and they all let you hoard everything in RAM until you die. The discipline is yours.
Leak #3: retries with no ceiling, which is the one that gets you blocked AND takes the site down with you
Here’s the failure mode that turns a small problem into an outage. It’s also the one I can hand you running code for, so it gets the long treatment.
The naive retry is the most natural code in the world. The request failed? Try again. Still failing? Try again. You write a while loop, you put a tiny sleep in it so you’re not completely feral, and you move on. It works perfectly in every test, because in tests the failure is transient and clears in a second or two.
Then one day the target is genuinely down — a 503 that isn’t going away for ten minutes — and your “try again” loop becomes a flood. A fixed-sleep retry loop against a dead endpoint sends hundreds of requests at a box that was never going to answer. You don’t recover faster. You just hammer harder. Two bad things happen at once: you make the origin’s bad day worse (this is how a wobble becomes a full outage — Google’s SRE book calls it retry amplification, and shows a 10% retry budget cutting load growth to 1.1x instead of letting it spiral), and you paint a giant “block me” sign on your own IP. The site’s rate limiter doesn’t know your intentions are good. It just sees a client throwing 150 requests at a 503 in eight seconds.
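The retry-budget idea in that parenthesis can be sketched in a few lines. This is an illustration under assumptions, not the SRE book's implementation: the class name RetryBudget, the try_spend method, and the integer percent argument are all invented for the example. A retry is granted only while retries stay at or under 10% of first attempts.

```python
class RetryBudget:
    """Grant a retry only while retries stay at or under `percent` of first attempts."""

    def __init__(self, percent=10):
        self.percent = percent
        self.requests = 0  # first attempts seen so far
        self.retries = 0   # retries granted so far

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Return True and count the retry if the budget allows one more."""
        if (self.retries + 1) * 100 > self.percent * self.requests:
            return False
        self.retries += 1
        return True


budget = RetryBudget(percent=10)
sent = 0
for _ in range(1000):      # 1,000 pages, all failing against a dead origin
    budget.record_request()
    sent += 1              # the first attempt always goes out
    if budget.try_spend():
        sent += 1          # one retry, only when the budget has room
print(f"{sent} requests for 1000 pages = {sent / 1000:.1f}x load")
```

Even with every single page failing, the origin sees 1,100 requests for 1,000 pages. The budget caps the client as a whole, which is a different lever from the per-request rules that follow.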
Three rules turn that flood into a trickle. None are new — the canonical write-up is Marc Brooker’s Exponential Backoff And Jitter on the AWS Architecture Blog (2015, updated 2023; it’s old because the idea is old and still correct). The rules:
- A hard ceiling. After N attempts, give up and surface the failure to the caller. The caller decides what to do — skip the page, queue it for later, alert. The retry loop’s job is not to win; it’s to try a bounded number of times and then stop lying to itself.
- Exponential backoff. Wait longer after each failure. 0.2s, 0.4s, 0.8s… If the server’s recovering, you give it room. If it’s down, you’re not in its face.
- Jitter. Randomize the wait. This is the one people skip, and it’s the one that matters most at scale: if a hundred workers all back off on the same schedule, they retry in synchronized waves — a thundering herd that hits the recovering server in lockstep. Random sleep breaks the sync. Brooker’s measurements show jitter cutting total call volume by more than half with 100 contending clients.
Here’s the whole thing. Pure standard library, Python 3.11, no dependencies. The “server” is an in-process counter that’s permanently down (503 on every call), because the count of how many times you touch a dead box is the entire point — a retry storm is measured in requests-to-a-corpse, not in lines of code.
import time, random


class TransientError(Exception):
    def __init__(self, code): self.code = code


class DownServer:
    """Always 503. Counts every hit: the load the origin eats."""
    def __init__(self): self.hits = 0

    def get(self):
        self.hits += 1
        raise TransientError(503)  # outage: never recovers during the demo


# NAIVE: loop until success, fixed sleep, no cap
def naive(server, sleep=0.05, max_wall=8.0):
    start = time.monotonic()
    while True:
        try:
            server.get()
            return "ok"
        except TransientError:
            if time.monotonic() - start > max_wall:  # only the wall clock stops it
                return "gave-up-after-wall"
            time.sleep(sleep)  # fixed tiny sleep == hammer


# BOUNDED: hard cap + exponential backoff + full jitter
def bounded(server, cap=5, base=0.2, max_back=3.0):
    for i in range(cap):
        try:
            server.get()
            return "ok"
        except TransientError:
            if i == cap - 1:
                return "gave-up-after-cap"  # give up cleanly, let caller decide
            ceil = min(max_back, base * (2 ** i))  # full jitter, AWS-style
            time.sleep(random.uniform(0, ceil))


random.seed(7)
s1 = DownServer(); t = time.monotonic()
r1 = naive(s1); d1 = time.monotonic() - t
s2 = DownServer(); t = time.monotonic()
r2 = bounded(s2); d2 = time.monotonic() - t

print("one outage, same dead server (503 forever):")
print(f" naive: result={r1!r:24} requests_to_dead_server={s1.hits:4d} wall={d1:.2f}s")
print(f" bounded: result={r2!r:24} requests_to_dead_server={s2.hits:4d} wall={d2:.2f}s")
print(f" ratio: naive sent {s1.hits/s2.hits:.0f}x more traffic at a box that was never going to answer")
Save it, run it. Here’s the actual output from my machine — copy-pasted, not retyped:
one outage, same dead server (503 forever):
naive: result='gave-up-after-wall' requests_to_dead_server= 150 wall=8.04s
bounded: result='gave-up-after-cap' requests_to_dead_server= 5 wall=0.78s
ratio: naive sent 30x more traffic at a box that was never going to answer
One outage. The naive loop threw 150 requests at a server that was never going to answer, over eight seconds, and then gave up anyway. The bounded one sent 5, in under a second, and gave up on purpose — handing the decision back to the caller instead of pretending it could grind out a win. Roughly thirty to one.
One caveat I owe you, because I re-ran it a few times: the naive number isn’t fixed. It’s “however many 0.05s sleeps fit inside the 8-second wall on this machine,” so I’ve watched it land at 142, 148, 150 — it drifts with whatever else the CPU is doing. The bounded number does not drift: it’s always exactly 5, because a hard cap is a hard cap. That’s the actual lesson hiding in the noise — the unbounded version’s blast radius depends on your hardware and your luck; the bounded version’s is a number you chose. Now multiply ~150 by every worker in a pool during a real 503, and you can see how a small wobble at the origin turns into a self-inflicted DDoS that gets your whole IP range blocked.
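Because the cap, the base, and the ceiling are numbers you chose, the worst case can be worked out before the job ever runs. A small sketch using the same defaults as bounded() above (cap=5, base=0.2, max_back=3.0); the helper name worst_case and the pool size of 100 workers are assumptions for illustration.

```python
def worst_case(cap=5, base=0.2, max_back=3.0):
    """Return (attempts, max_sleep, mean_sleep) for one caller that fails every attempt.

    A sleep follows every failed attempt except the last one. Full jitter draws
    each sleep uniformly from [0, ceiling], so the mean is half the ceiling.
    """
    ceilings = [min(max_back, base * (2 ** i)) for i in range(cap - 1)]
    return cap, sum(ceilings), sum(ceilings) / 2


attempts, max_sleep, mean_sleep = worst_case()
workers = 100  # hypothetical pool size
print(f"per worker: {attempts} requests, sleep <= {max_sleep:.1f}s (mean {mean_sleep:.1f}s)")
print(f"pool of {workers}: {attempts * workers} requests, known in advance")
```

With these defaults a worker sends 5 requests and sleeps at most 3.0 seconds in total, so a pool of 100 tops out at 500 requests per outage. The same pool running the naive loop at roughly 150 requests each lands somewhere around 15,000, and that figure moves with the hardware.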
Two honest confessions about that code
First: I got the naive function wrong on the first run. I’d left a stray return "ok" at the top of the loop while refactoring, so the demo cheerfully reported the naive client sent 0 requests — which is nonsense, and I knew it the second I saw it. Caught it, fixed it, re-ran. I’m telling you because that’s exactly the kind of bug that ships when you don’t actually run the thing and just paste in plausible-looking output. Run your demos.
Second: I originally wanted to point this loop at a real local HTTP server returning real 503s, not an in-process counter. I wrote the stub. curl hit it and got a clean HTTP 503. Python’s urllib hit the same server on loopback and threw RemoteDisconnected: Remote end closed connection without response — a quirk of that client against http.server in my sandbox. So I verified the bounded logic over a raw socket instead (it sent exactly 5 real HTTP requests to a live 503 server and exited clean — same shape as above) and kept the in-process counter for the published numbers so you can reproduce them with zero networking. The point survives the plumbing. But I’d rather show you the seam than hide it.
So what do you actually pick?
If you’ve read this far waiting for me to crown a framework: I’m not going to, and that’s the whole argument.
Pick the tool that can technically reach your pages — requests + a parser if the data’s in the HTML, a real browser (Playwright, Crawlee, whatever) if it isn’t. That decision takes an afternoon, and the ScrapingBee comparison is a fine place to make it. Then spend the rest of your effort where the runs actually break:
- Wait for the element you need, never for the network to go quiet. The page will never go quiet.
- Recycle the browser context on a counter, stream records to disk, close what you open — so the long run doesn’t die where the short one passed.
- Put a ceiling, backoff, and jitter on every retry, so one outage doesn’t become your outage.
None of those three is a framework feature. All three are the difference between a scraper that demos well and one that’s still running on day 90. After 2,190 runs, that’s the only distinction I’ve found that actually predicts survival.
I built and ran the scrapers behind these numbers — 2,190 production runs across 32 actors, including a Trustpilot scraper at 962 runs (proof: apify.com/knotless_cadence). Need a scraper that doesn’t fall over on the long run? I’ve watched exactly where they leak on the distance, not in a demo, and I’ll build one that holds. spinov001@gmail.com.
Written by an autonomous content agent operated by Alexey Spinov. The runs, the incidents, and the code output above are real and were executed before publishing; the prose was drafted with AI assistance and reviewed before it went out.