Three operational rules I added after my Trustpilot scraper crossed 100 runs


When my Trustpilot reviews scraper crossed its first 100 production runs, I thought I’d built something stable. By the time it crossed 500, I had a small list of things I wished I’d done from the start. By the time it crossed 950, that list had hardened into three rules I now apply to every actor before I let it run unattended in production.

This post is about those three rules — the operational layers I missed on day one and added the hard way. None of them are clever. All of them would have saved me incidents.

Rule 1: A schema-drift detector that fails the run

The most painful incident I’ve had wasn’t a 500 error. It was a Trustpilot DOM redesign where the review-text selector silently shifted from [data-service-review-text-typography] to a slightly different attribute. The scraper kept running and every row “succeeded”, but the review body came back as text="". I noticed three days later, after the dataset had gone out to the customer with empty review bodies in roughly 60% of rows.

What I added afterward:

from pydantic import BaseModel, ValidationError

class TrustpilotReview(BaseModel):
    stars: int
    headline: str
    text: str           # required; emptiness is checked in validate_dataset
    author: str
    date: str

REQUIRED_FILL_RATE = 0.85   # share of rows that must validate with non-empty text

def validate_dataset(rows: list[dict]) -> None:
    """Fail the run if too few rows pass the schema with a non-empty text."""
    valid = 0
    for r in rows:
        try:
            m = TrustpilotReview(**r)
        except ValidationError:
            continue            # missing or mistyped field: row does not count
        if m.text.strip():
            valid += 1
    # an empty dataset counts as 0%, so it fails the run as well
    rate = valid / len(rows) if rows else 0.0
    if rate < REQUIRED_FILL_RATE:
        raise RuntimeError(
            f"Schema drift: {rate:.0%} fill rate, below {REQUIRED_FILL_RATE:.0%}"
        )

This runs at the end of every scrape. If fewer than 85% of rows pass the schema, the actor fails. The customer gets a clear error instead of a quiet bag of empty strings, and I get an alert before the dataset is consumed downstream.

The threshold matters. 100% would be too brittle, because Trustpilot legitimately has reviews without text (just a star rating). 85% is the line I arrived at empirically, from where healthy scrapes settled once I had enough samples. You’ll need to tune this per source.
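One way to do that tuning, sketched under assumptions rather than lifted from my actors: collect the fill rates of a handful of runs you have checked by hand, then put the line a small margin below the worst of them. The function name suggest_threshold, the margin default, and the sample numbers are all hypothetical.

```python
def suggest_threshold(healthy_fill_rates: list[float], margin: float = 0.03) -> float:
    """Suggest a fill-rate threshold just below the worst known-good run.

    healthy_fill_rates: fill rates (0.0 to 1.0) from runs verified by hand.
    margin: hypothetical safety gap so normal variance does not fail a run.
    """
    if not healthy_fill_rates:
        raise ValueError("need at least one known-good run to tune against")
    if any(not 0.0 <= r <= 1.0 for r in healthy_fill_rates):
        raise ValueError("fill rates must be between 0 and 1")
    return max(0.0, round(min(healthy_fill_rates) - margin, 2))


# Made-up fill rates from five hand-checked runs of one source.
samples = [0.91, 0.89, 0.93, 0.88, 0.90]
print(suggest_threshold(samples))
```

The margin is a judgment call: too small and ordinary variance fails healthy runs, too large and a partial selector break slips under the line.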

Rule 2: An IP-budget that the scraper enforces on itself

The second rule I learned the slow way. Trustpilot’s frontend would 403 me intermittently when I hammered too fast — but only sometimes, and only on certain regional pages. I’d rotate proxies, the 403 would clear, and I’d push harder. Eventually I got a soft block that lasted six hours.

What I added: a per-IP request budget that the scraper itself watches.

from collections import defaultdict
from time import time, sleep

BUDGET_PER_IP_PER_MINUTE = 8
ip_log: dict[str, list[float]] = defaultdict(list)

def request_under_budget(ip: str) -> None:
    now = time()
    # drop entries older than 60s
    ip_log[ip] = [t for t in ip_log[ip] if now - t < 60]
    if len(ip_log[ip]) >= BUDGET_PER_IP_PER_MINUTE:
        # sleep until oldest entry rolls out of the window
        wait = 60 - (now - ip_log[ip][0]) + 0.5
        sleep(max(0, wait))
    ip_log[ip].append(time())

Eight requests per IP per minute is conservative. For Trustpilot specifically, I’d seen blocks start above ~12/minute, so 8 leaves headroom. The crucial part is that the scraper enforces this on itself: you can’t trust upstream rate limits to be consistent across regions, and one region soft-blocking an IP is enough to ruin a whole run.

I don’t trust per-process counters either. If the actor restarts mid-run (Apify will resume), the in-memory counter resets. For runs that span more than ~30 minutes, persist the IP log to the actor’s key-value store and rehydrate on start.
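A minimal sketch of that persist-and-rehydrate step, assuming a store that exposes plain get/set on string values. The KeyValueStore interface, the InMemoryStore stand-in, and the "ip-log" key are hypothetical names for illustration; swap in whatever client your platform's key-value store actually provides.

```python
import json
from collections import defaultdict
from time import time
from typing import Optional, Protocol

WINDOW_SECONDS = 60
STATE_KEY = "ip-log"  # hypothetical key name


class KeyValueStore(Protocol):
    """Assumed minimal store interface; adapt it to your platform's client."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in for local testing only; it does not survive a restart."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def _prune(log: dict[str, list[float]], now: float) -> dict[str, list[float]]:
    """Keep only timestamps inside the window and drop IPs with none left."""
    pruned: dict[str, list[float]] = defaultdict(list)
    for ip, stamps in log.items():
        fresh = [t for t in stamps if now - t < WINDOW_SECONDS]
        if fresh:
            pruned[ip] = fresh
    return pruned


def save_ip_log(store: KeyValueStore, ip_log: dict[str, list[float]],
                now: Optional[float] = None) -> None:
    """Serialize the still-relevant part of the log as JSON."""
    now = time() if now is None else now
    store.set(STATE_KEY, json.dumps(_prune(ip_log, now)))


def load_ip_log(store: KeyValueStore,
                now: Optional[float] = None) -> dict[str, list[float]]:
    """Rehydrate on start; entries that aged out while down are discarded."""
    raw = store.get(STATE_KEY)
    if raw is None:
        return defaultdict(list)
    now = time() if now is None else now
    return _prune(json.loads(raw), now)
```

This relies on the timestamps being wall-clock time() values, as in the budget function above. A monotonic clock would not be comparable across a process restart.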

Rule 3: A “did anything change?” snapshot that runs before I commit a code change

The third rule isn’t operational so much as it’s a habit. I keep a folder called golden/ with one CSV per actor — the last known-good output of a 50-row scrape, captured manually after a clean run. Whenever I change the scraper’s selector logic, parsing, or output schema, I run a fresh 50-row scrape and diff column-by-column against golden/<actor>.csv.

CANDIDATE="runs/trustpilot-$(date +%s).csv"
# write the fresh 50-row scrape to "$CANDIDATE" here, then diff it
python3 tools/golden_diff.py \
    --baseline golden/trustpilot.csv \
    --candidate "$CANDIDATE" \
    --columns stars,headline,text,author,date

The diff isn’t fancy. It compares fill rate, value distribution, and average string length per column. If any column drifts by more than 15%, I have to look at it before merging.
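To make the idea concrete, here is a rough sketch of the core comparison. It is not the contents of tools/golden_diff.py: the function names are hypothetical, it covers only fill rate and average string length, and it leaves out the value-distribution check.

```python
import csv
from statistics import mean

DRIFT_LIMIT = 0.15  # relative change that needs a human look


def load_rows(path: str) -> list[dict]:
    """Read a CSV file into a list of row dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def column_stats(rows: list[dict], column: str) -> dict[str, float]:
    """Fill rate and average length of non-empty values for one column."""
    values = [(r.get(column) or "").strip() for r in rows]
    filled = [v for v in values if v]
    return {
        "fill_rate": len(filled) / len(values) if values else 0.0,
        "avg_len": mean(len(v) for v in filled) if filled else 0.0,
    }


def relative_drift(baseline: float, candidate: float) -> float:
    """Relative change against the baseline; inf if the baseline was zero."""
    if baseline == 0:
        return 0.0 if candidate == 0 else float("inf")
    return abs(candidate - baseline) / baseline


def golden_diff(baseline_rows: list[dict], candidate_rows: list[dict],
                columns: list[str]) -> list[str]:
    """Return one message per column metric that drifted past the limit."""
    problems = []
    for col in columns:
        base = column_stats(baseline_rows, col)
        cand = column_stats(candidate_rows, col)
        for metric, base_value in base.items():
            if relative_drift(base_value, cand[metric]) > DRIFT_LIMIT:
                problems.append(
                    f"{col}.{metric}: {base_value:.2f} -> {cand[metric]:.2f}"
                )
    return problems
```

A wrapper script would call load_rows on both files, print whatever golden_diff returns, and exit non-zero when the list is not empty, so the check can block a merge instead of only reporting.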

The first time this caught a real bug was a refactor where I reordered field assignment in a dict comprehension — silently swapped headline and text for one of the locales. All my unit tests passed. The golden diff caught it in 4 seconds because the average length of text had collapsed from 280 chars to 35 chars (a headline length).

Why these three and not others

The pattern across all three: they catch silent failures, the kind where the scraper returns a value, the system pipes it downstream, and nobody notices for a day, a week, or a month, until the customer reads the dataset and finds out.

Loud failures are easy. The scraper crashes, you get an alert, you fix it. The runs that hurt are the ones where everything looks fine and the data is wrong.

Schema-drift, IP-budget, and golden-diff are the three I’ve personally been bitten by. Yours will be different per source. The meta-rule is the same: build a layer that can fail the run, not just one that can log a warning.


If you’re running scrapers in production and want a second pair of eyes on your gate logic, my Apify Store has 32 production actors with these three rules layered in (Trustpilot 950 runs, Reddit 81, Walmart-reviews shipped this week): https://apify.com/knotless_cadence
