5 production scraping failures from 1000+ runs (and the fixes that actually shipped)


After 2190 lifetime Apify runs across 32 public actors — 962 of them in a single Trustpilot review scraper — the same five failure modes keep showing up. None of them were obvious before they hit production. Each cost me at least one round of customer-facing apology before I built a real fix.

This post is the fix list. Pure failure-modes. No “should I use AI for this” detour.


1. Silent schema drift

The output JSON looked fine. Field count was right. Types were right. But the meaning of one field had shifted — Trustpilot quietly stopped including a numeric verified_purchase flag and replaced it with a string label embedded inside another field.

The scraper kept running for four days. Downstream customers were filtering on verified_purchase == true and getting empty result sets, but the scraper itself reported success: true every run.

Root cause: schema validators usually check that fields exist and have the right type. They rarely check that the values still make semantic sense.

The fix that shipped: three contract tests that don’t care about schema shape — they care about distribution.

class ContractDriftError(Exception):
    """Raised when a field's distribution drifts away from its baseline."""

def assert_distribution_holds(rows, field, baseline_pct, tolerance=0.15):
    """Assert that the share of rows where field is truthy
    stays within tolerance of the historical baseline."""
    if not rows:
        return  # an empty run says nothing about distribution
    actual = sum(1 for r in rows if r.get(field)) / len(rows)
    if abs(actual - baseline_pct) > tolerance:
        raise ContractDriftError(
            f"{field}: expected ~{baseline_pct:.1%}, got {actual:.1%}"
        )

# Trustpilot baseline: ~38% of reviews are verified_purchase
assert_distribution_holds(rows, "verified_purchase", 0.38)

Three asserts on three distribution-stable fields. The next time Trustpilot changed how verified_purchase was reported, the scraper failed loudly inside its own run instead of silently corrupting downstream data for four days.
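
To make the failure concrete, here is a small self-contained sketch that runs the same check against made-up rows. The row contents and the "label" field are invented for illustration; only the 38% baseline comes from the real scraper.

```python
class ContractDriftError(Exception):
    """Raised when a field's distribution drifts away from its baseline."""


def assert_distribution_holds(rows, field, baseline_pct, tolerance=0.15):
    if not rows:
        return
    actual = sum(1 for r in rows if r.get(field)) / len(rows)
    if abs(actual - baseline_pct) > tolerance:
        raise ContractDriftError(
            f"{field}: expected ~{baseline_pct:.1%}, got {actual:.1%}"
        )


# Synthetic "before" data: 38 of 100 reviews carry the flag.
healthy = [{"rating": 5, "verified_purchase": i < 38} for i in range(100)]

# Synthetic "after" data: same shape and types, but the flag is now always
# False because the signal moved into a (hypothetical) string label.
drifted = [
    {"rating": 5, "verified_purchase": False,
     "label": "Verified" if i < 38 else ""}
    for i in range(100)
]

assert_distribution_holds(healthy, "verified_purchase", 0.38)  # passes quietly

try:
    assert_distribution_holds(drifted, "verified_purchase", 0.38)
except ContractDriftError as err:
    print(err)  # verified_purchase: expected ~38.0%, got 0.0%
```

A type-and-presence validator accepts both datasets. Only the distribution check tells them apart.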


2. Retry-loop self-DDoS

Out of 962 Trustpilot runs I traced 28 that got stuck in a retry storm. The target had returned a soft rate-limit (HTTP 429 with a low Retry-After). The scraper retried. The retry got rate-limited. It retried again, faster. Within four minutes the scraper was issuing ~140 requests per minute against an endpoint that had explicitly told it to back off.

This is not a code bug. It’s an architecture bug: retry without a budget is a self-DDoS.

The fix:

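import time

class CircuitOpen(Exception):
    """Raised when a request is attempted while the breaker is open."""

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=60):
        self.failures = 0
        self.opened_at = None
        self.max_failures = max_failures
        self.reset_after = reset_after

    def before_request(self):
        if self.opened_at is not None:
            remaining = self.reset_after - (time.time() - self.opened_at)
            if remaining > 0:
                raise CircuitOpen(f"breaker open, retry in {remaining:.0f}s")
            # Cool-down elapsed: close the breaker and start counting afresh.
            self.failures = 0
            self.opened_at = None

    def on_success(self):
        # Only consecutive failures should trip the breaker.
        self.failures = 0

    def on_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.time()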

Three rules around this: max 3 retries per minute, exponential backoff with jitter, hard breaker after 5 consecutive failures. Once that landed, the self-DDoS class of bug disappeared.
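
The breaker covers the third rule. The first two can be sketched roughly as follows. This is a minimal illustration, not the shipped code: the names backoff_delay and RetryBudget, the 1-second base, and the 60-second cap are my assumptions, and Retry-After is assumed to arrive as a number of seconds rather than an HTTP date.

```python
import random
import time
from collections import deque


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    Exponential growth capped at `cap`, with full jitter, and never
    shorter than the server's Retry-After hint when one was given.
    """
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng() * ceiling  # full jitter: uniform in [0, ceiling)
    if retry_after is not None:
        delay = max(delay, float(retry_after))
    return delay


class RetryBudget:
    """Allow at most `max_retries` retries in any sliding `window` seconds."""

    def __init__(self, max_retries=3, window=60.0, clock=time.monotonic):
        self.max_retries = max_retries
        self.window = window
        self.clock = clock
        self._spent = deque()  # timestamps of retries already spent

    def try_spend(self):
        """Return True and record a retry if the budget allows one."""
        now = self.clock()
        # Forget retries that have aged out of the window.
        while self._spent and now - self._spent[0] >= self.window:
            self._spent.popleft()
        if len(self._spent) >= self.max_retries:
            return False
        self._spent.append(now)
        return True
```

The intended call order is: ask the breaker, ask the budget, sleep for backoff_delay(...), then send. When try_spend() returns False the request is dropped or requeued rather than retried, which is what keeps the request rate from climbing after a 429.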


3. Concurrency that breaks the target

I learned this one twice.

First time: ran the email extractor with 20 concurrent workers. By the time the run finished, the target’s WAF had blocked my entire IP range for 12 hours. Not the scraper IP — the whole Apify proxy pool slice. Customers running the same actor for the next half-day got empty results and rightly complained.

Second time: same actor, 8 workers, different target. After 90 minutes the target started returning HTTP 200 with a captcha challenge page in the body. The scraper happily parsed the “results” as JSON-like-but-wrong and produced 4000 garbage rows.

Both came from the same root cause: picking concurrency from a number that felt comfortable on my laptop instead of a number the target will tolerate.
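
The second incident also shows why a 200 status cannot be trusted on its own. A cheap guard in front of the parser, sketched below, would have stopped the 4000 garbage rows. The marker strings and the names ensure_real_page and ChallengePage are hypothetical, and a substring heuristic like this can misfire on pages that legitimately mention captchas.

```python
# Hypothetical marker strings; replace with the ones your targets actually serve.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "unusual traffic")


class ChallengePage(Exception):
    """HTTP 200 whose body is a challenge page rather than content."""


def ensure_real_page(status_code: int, body: str) -> str:
    """Return body unchanged, or raise ChallengePage if it looks like a block."""
    lowered = body.lower()
    hit = next((m for m in CHALLENGE_MARKERS if m in lowered), None)
    if status_code == 200 and hit is not None:
        raise ChallengePage(f"challenge marker {hit!r} found in a 200 response")
    return body
```

Raising here turns a silent data-quality problem into a failed page, which the retry budget and the concurrency tuner can then react to.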

The fix: start at concurrency 1. Increase by 1 every 50 successful pages. Step back down by 1 whenever the response time creeps more than 30% above baseline. The autotune is six lines of code:

def adjusted_concurrency(state):
    if state.recent_response_time > 1.3 * state.baseline_response_time:
        return max(1, state.concurrency - 1)
    if state.successful_pages_since_change >= 50:
        return min(state.max_concurrency, state.concurrency + 1)
    return state.concurrency

The Trustpilot scraper now runs steady at concurrency 4 against the actual production target. Hand-picking 20 was vanity, not engineering.
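
For anyone who wants to watch the tuner move, here is a small simulation. TuneState and record_page are hypothetical scaffolding I added around the function, the timings are invented, and this version steps up once 50 pages have succeeded. A real run would feed in a rolling average rather than the latest single response time.

```python
from dataclasses import dataclass


@dataclass
class TuneState:
    # Hypothetical container; field names mirror those used by adjusted_concurrency.
    concurrency: int = 1
    max_concurrency: int = 8
    baseline_response_time: float = 0.4
    recent_response_time: float = 0.4
    successful_pages_since_change: int = 0


def adjusted_concurrency(state):
    if state.recent_response_time > 1.3 * state.baseline_response_time:
        return max(1, state.concurrency - 1)
    if state.successful_pages_since_change >= 50:
        return min(state.max_concurrency, state.concurrency + 1)
    return state.concurrency


def record_page(state, response_time):
    """Feed one successful page into the tuner and apply its decision."""
    state.recent_response_time = response_time
    state.successful_pages_since_change += 1
    new = adjusted_concurrency(state)
    if new != state.concurrency:
        state.concurrency = new
        state.successful_pages_since_change = 0  # restart the count after a change
    return state.concurrency


state = TuneState()
for _ in range(150):       # 150 fast pages: three step-ups, 1 -> 4
    record_page(state, 0.4)
ramped = state.concurrency
record_page(state, 0.6)    # 0.6 s > 1.3 * 0.4 s, so step down once
print(ramped, state.concurrency)  # 4 3
```

The slowdown check runs first on purpose: a target that is getting slower should never be rewarded with more workers, no matter how many pages have succeeded.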


4. Memory creep on long runs

Email Extractor Pro is supposed to handle 50k+ pages in a single run. The first time I let it run that long it hit 2.4 GB RSS and got killed by the Apify platform’s memory limit. The 50k pages turned into ~14k pages plus a failed run.

Cause: every page opened a Playwright context. The contexts were “closed” by the framework’s async with, but Chromium child processes leaked because the connection pool kept references alive past the context boundary. Twenty-four hours of that and memory was gone.

The fix is unglamorous: explicitly recycle the browser every N pages, not every “context”.

import gc

async def run_with_recycling(urls, recycle_every=500):
    # launch(), chunks() and process_batch() are the actor's own helpers.
    for batch in chunks(urls, recycle_every):
        browser = await launch()
        try:
            await process_batch(browser, batch)
        finally:
            # Closing the browser ends its Chromium child processes,
            # even when the batch raised.
            await browser.close()
        gc.collect()

Two extra lines plus a gc.collect() and the memory profile stays flat across multi-day runs. Worth the four seconds per recycle.
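
The recycling loop leans on a chunks() helper that is not shown. Below is one possible version, wired to a fake browser so a variant of the loop can be run without Playwright. FakeBrowser, launch and process_batch are stand-ins invented for this sketch, not the actor's real helpers.

```python
import asyncio
import gc
from itertools import islice


def chunks(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


class FakeBrowser:
    """Stand-in for a real browser handle so the loop can run anywhere."""
    opened = 0
    closed = 0

    def __init__(self):
        FakeBrowser.opened += 1

    async def close(self):
        FakeBrowser.closed += 1


async def launch():
    return FakeBrowser()


async def process_batch(browser, batch):
    await asyncio.sleep(0)  # placeholder for real page work


async def run_with_recycling(urls, recycle_every=500):
    for batch in chunks(urls, recycle_every):
        browser = await launch()
        try:
            await process_batch(browser, batch)
        finally:
            await browser.close()  # one browser per batch, always closed
        gc.collect()


urls = [f"https://example.com/{i}" for i in range(1200)]
asyncio.run(run_with_recycling(urls))
print(FakeBrowser.opened, FakeBrowser.closed)  # 3 3
```

Counting opens against closes in a test like this is a cheap way to catch a leaked browser before it shows up as a memory graph.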


5. Silent webhook failure

The actor finished. The dataset filled. The run was marked SUCCEEDED. The downstream system never received the data.

The webhook had fired against a customer endpoint that returned HTTP 200 with an empty body. Apify treated that as success and moved on. The customer’s queue worker was looking for a specific JSON shape and silently dropped the message.

I found out three days later because someone messaged me asking why their dashboard hadn’t updated since Tuesday.

The fix is two-sided:

  1. The webhook payload includes a confirmation token. The receiver must echo the token back in the response body.
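  2. If the response body doesn’t echo the token within 2 retries, the actor logs a WEBHOOK_NOT_CONFIRMED warning that surfaces in the run log and triggers an alert.

import time
import httpx

def post_with_confirmation(url, payload, token, retries=2):
    body = {**payload, "confirmation_token": token}  # copy; leave the caller's dict alone
    for attempt in range(retries + 1):
        try:
            r = httpx.post(url, json=body, timeout=10)
            if r.status_code == 200 and token in r.text:
                return True
        except httpx.HTTPError:
            pass  # timeouts and connection errors count as unconfirmed
        if attempt < retries:
            time.sleep(2 ** attempt)
    # log is assumed to be a structured logger that accepts keyword fields
    log.warning("WEBHOOK_NOT_CONFIRMED", url=url, token=token)
    return False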

That extra round-trip costs maybe 40 ms. The “did the data actually land” question stopped being a question.
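
The receiving half is the customer's code, so I can only sketch what it might look like. The version below is framework-free: handle_webhook takes the raw request body and returns a status and response text. The "dataset_id" field and the enqueue helper are hypothetical placeholders for whatever shape and queue the receiver really uses.

```python
import json

ACCEPTED = []  # stand-in for the receiver's real queue


def enqueue(payload):
    """Hypothetical hand-off to the downstream queue worker."""
    ACCEPTED.append(payload)


def handle_webhook(raw_body: bytes):
    """Return (status_code, response_text) for one incoming webhook POST.

    The token is echoed only after the payload has been validated and
    queued, so an echo means "accepted", not just "received".
    """
    try:
        payload = json.loads(raw_body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return 400, json.dumps({"error": "body is not valid JSON"})

    token = payload.get("confirmation_token") if isinstance(payload, dict) else None
    if not isinstance(token, str) or not token:
        return 400, json.dumps({"error": "missing confirmation_token"})
    if "dataset_id" not in payload:  # hypothetical required field
        return 422, json.dumps({"error": "unexpected payload shape"})

    enqueue(payload)
    return 200, json.dumps({"confirmation_token": token})
```

The ordering is the point: validation and enqueue happen before the echo, so a payload the queue worker would have dropped now produces a non-200 response that the sender can see.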


What these five have in common

They all succeeded in the loud, primary sense. The HTTP call returned 200. The run finished. The dataset filled. The webhook posted. Each one passed the obvious health check.

But every one of them silently violated the secondary contract: the part downstream consumers actually rely on. Schema drift breaks semantics. Retry storms break rate-limit fairness. Untuned concurrency breaks the target. Memory creep breaks long-run completion. Unconfirmed webhooks break the data hand-off.

Production scraping at scale is mostly about catching the second class. The first class — “did the request work” — is solved by the framework. The second class is the part you have to build yourself, and it’s almost always cheaper to bake it in early than to wait for the customer to find it.


If you run scrapers in production and want to compare notes (run counts, fixes that didn’t work, schema-drift war stories), I’m reachable here. I publish more of these breakdowns on t.me/scraping_ai.