How my Trustpilot scraper survived 949 production runs (and the 3 things that almost killed it)

I shipped a Trustpilot reviews scraper to the Apify Store about a year ago. It now sits at 949 production runs as of late April 2026. It’s not the most-run scraper on the platform, but it’s mine, it works, and it’s outlasted at least two upstream redesigns of the Trustpilot frontend.

This post is the post-mortem I never wrote. Three production failures, the actual code that broke, and the cost-math behind why I didn’t migrate to a residential proxy provider even when the temptation was strongest.

If you’re building scrapers for the Apify Store (or for any “we run this on a schedule for paying customers” workflow), the patterns below cost me 4 weekends to learn. They’re free here.

The shape of the actor

The actor takes a Trustpilot business URL and a maxPages integer. It returns a flat array of review objects: reviewId, rating, title, body, author, country, dateOfExperience, replyFromBusiness. Output formats: JSON, CSV, Excel.
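
For reference, the dict the actor pushes per review looks roughly like this (the field names are the real schema; the values here are illustrative, not actual review data):

{
    "reviewId": "a1b2c3",
    "rating": 4,
    "title": "Fast shipping, slow support",
    "body": "Order arrived in two days, the refund took three emails.",
    "author": "Jane D.",
    "country": "GB",
    "dateOfExperience": "2024-11-13T00:00:00+00:00",
    "replyFromBusiness": None,
}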

The initial version was a straight httpx + selectolax fetch-and-parse against the public review pages:

import httpx
from selectolax.parser import HTMLParser

async def fetch_reviews_page(url: str, page: int, client: httpx.AsyncClient):
    resp = await client.get(f"{url}?page={page}", timeout=20.0)
    resp.raise_for_status()
    tree = HTMLParser(resp.text)
    cards = tree.css("article[data-service-review-card-paper]")
    return [parse_card(c) for c in cards]  # parse_card() maps one card node to a review dict
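
The driver around it is just a loop over maxPages. A minimal sketch (the real actor also pushes each batch to the Apify dataset as it goes):

import asyncio

async def run(business_url: str, max_pages: int = 5) -> list[dict]:
    reviews: list[dict] = []
    async with httpx.AsyncClient(follow_redirects=True) as client:
        for page in range(1, max_pages + 1):
            batch = await fetch_reviews_page(business_url, page, client)
            if not batch:  # past the last page (or a silent failure, see below)
                break
            reviews.extend(batch)
    return reviews

# asyncio.run(run("https://www.trustpilot.com/review/example.com", max_pages=5))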

Worked fine at 5–10 runs/day. Then I crossed about 200 runs and the failures started.

Failure #1 — Trustpilot rebuilt the React tree, my CSS selectors died silently

In late summer the actor started returning empty arrays for ~30% of runs. No error, no exception — just an empty list. The CSS selector article[data-service-review-card-paper] was returning zero matches because the frontend team had renamed the attribute to article[data-review-card]. The old attribute was still emitted on some pages (served from a server-side cache) but not all.

The lesson: never rely on a single CSS selector when scraping a SPA. The fix was a fallback chain plus a sanity-check on output count:

SELECTORS = [
    "article[data-service-review-card-paper]",
    "article[data-review-card]",
    "[data-service-review-typography]",
    "section[data-business-unit-id] article",
]

def parse_review_cards(tree: HTMLParser):
    for sel in SELECTORS:
        cards = tree.css(sel)
        if cards:
            return cards, sel
    return [], None

cards, used_selector = parse_review_cards(tree)
if not cards:
    raise RuntimeError(
        f"No review cards on {url}. Page size={len(tree.html)}. "
        f"Tried selectors: {SELECTORS}"
    )

Two extra signals that catch silent failures (a short logging sketch follows the list):

  1. len(tree.html) — if the page is suspiciously small (< 30 KB), it’s probably a soft-blocked stub, not the real listing.
  2. used_selector — log which selector matched. When 100% of yesterday’s runs used selector #1 and today 100% use selector #3, that’s your warning that the next rebuild is coming.
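
Both signals are cheap to emit on every page. A minimal sketch of what the actor logs (the helper name and logger setup are mine, not part of the published code):

import logging

logger = logging.getLogger("trustpilot-scraper")

def extract_cards(tree: HTMLParser, url: str) -> list:
    cards, used_selector = parse_review_cards(tree)
    page_bytes = len(tree.html or "")
    # Signal 1: a tiny page is almost always a soft-block stub, not the real listing
    if page_bytes < 30_000:
        logger.warning("Suspiciously small page (%d bytes) for %s", page_bytes, url)
    # Signal 2: which selector matched tells you how close the markup is to drifting away
    logger.info("url=%s bytes=%d selector=%r cards=%d", url, page_bytes, used_selector, len(cards))
    if not cards:
        raise RuntimeError(f"No review cards on {url} (page size={page_bytes} bytes)")
    return cards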

Failure #2 — JSON payload schema drift on dateOfExperience

A few months later, Trustpilot started returning two formats for dateOfExperience:

"dateOfExperience": "2024-11-13T00:00:00.000Z"   # old, ISO 8601
"dateOfExperience": "Nov 13, 2024"               # new, locale string for some markets

datetime.fromisoformat blew up on the second format. Customers in Excel saw #VALUE! for half their rows. I got 4 angry emails in one day.

The fix is boring but matters: a defensive parser with a fallback to dateutil.parser.parse, and a sentinel value (None) instead of crashing the whole run:

import logging

from dateutil import parser as dt_parser

logger = logging.getLogger(__name__)

def parse_date(raw: str | None) -> str | None:
    if not raw:
        return None
    try:
        return dt_parser.parse(raw).isoformat()
    except (ValueError, TypeError):
        logger.warning("Unparseable dateOfExperience: %r", raw)
        return None  # log and continue — never poison the whole output
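
A quick check with the two raw formats from above, plus one garbage value:

print(parse_date("2024-11-13T00:00:00.000Z"))  # 2024-11-13T00:00:00+00:00
print(parse_date("Nov 13, 2024"))              # 2024-11-13T00:00:00
print(parse_date("not a date"))                # None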

Production rule: a scraper that returns 95% of the data with 5% nulls is infinitely better than one that returns 0% and an exception. Customers can filter nulls; they can’t run an Excel macro on a stack trace.

Failure #3 — Apify free-tier proxy, the 429 cliff

The actor ran on Apify’s auto-rotating shared datacenter proxy. For the first 600 or so runs this was fine. Then sometime in early 2026 Trustpilot started returning HTTP 429 + a Cloudflare interstitial after about page 7 on any single business URL. Datacenter IPs got pattern-flagged.

I priced the obvious migration: residential proxies via Bright Data or Oxylabs at roughly $10–14 per GB. A typical run scrapes 8 pages × ~120 KB ≈ 1 MB per business, so ~$0.012 per run. With 949 runs and a current run rate of ~3/day, residential would have cost roughly $11 over the actor’s full lifetime, or about $1/month going forward. Affordable. But it would have made it hard to keep the actor published at $0/run on the Free Plan, which is where most of my Apify Store traffic comes from.
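
The back-of-the-envelope math, written out (numbers taken from the prose above; $12/GB is the midpoint of the quoted range):

pages_per_run = 8
kb_per_page = 120
usd_per_gb = 12

mb_per_run = pages_per_run * kb_per_page / 1024   # ≈ 0.94 MB
usd_per_run = mb_per_run / 1024 * usd_per_gb      # ≈ $0.011
lifetime_cost = 949 * usd_per_run                 # ≈ $10.4
monthly_cost = 3 * 30 * usd_per_run               # ≈ $0.99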

What actually fixed it without paying for residential (a combined sketch follows the list):

  1. Cap maxPages default at 5 (down from 20). Most users only want recent reviews.
  2. Add a 1.2–2.5s randomized delay between page requests (await asyncio.sleep(random.uniform(1.2, 2.5))).
  3. On 429, exponential backoff with Retry-After honored: await asyncio.sleep(min(int(resp.headers.get("retry-after", 30)), 120)).
  4. Rotate user-agents from a pool of 12 modern Chrome/Firefox strings on every request.
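
Items 2–4 combined into one request helper, as a minimal sketch (the user-agent pool is truncated here, the retry loop is capped at three attempts, and the published actor is structured slightly differently):

import asyncio
import random

import httpx

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0",
    # ...10 more modern Chrome/Firefox strings in the real pool
]

async def polite_get(client: httpx.AsyncClient, url: str, max_attempts: int = 3) -> httpx.Response:
    for attempt in range(max_attempts):
        # Fix 2: randomized delay between page requests
        await asyncio.sleep(random.uniform(1.2, 2.5))
        # Fix 4: rotate user-agents on every request
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        resp = await client.get(url, headers=headers, timeout=20.0)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        # Fix 3: back off, honoring Retry-After capped at 120 s (the real actor also scales this per attempt)
        await asyncio.sleep(min(int(resp.headers.get("retry-after", 30)), 120))
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts: {url}")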

After this, the 429 rate dropped from ~18% of pages to ~2%, and the pages that still got a 429 resolved within one retry. No residential proxy migration needed at current scale.

What 949 runs actually look like

  • Mean run duration: ~38 seconds (5-page default).
  • Median data returned: 87 reviews.
  • 7-day retention rate (run twice within 7 days): ~22% — meaning roughly 1 in 5 users come back, which I treat as the leading indicator that this is a real workflow tool, not a one-off curiosity scrape.
  • Failure rate post-fix: < 3% of runs end in error.
  • Active users: small but steady — 3–5/day.

It’s not a viral hit. It’s a quiet, durable tool that compounds.

Three rules I now apply to every Apify actor I ship

  1. Sentinel logging beats validation. If you can’t tell from logs which selector matched, which date format won, and how many bytes the page returned, you can’t debug a silent failure.
  2. Default conservative inputs. maxPages=5 not maxPages=50. Users who need more will configure it. Users who don’t will not 429 themselves into a refund.
  3. Resist the proxy-upgrade temptation until the math forces you. Residential proxies are great. They’re also the fastest way to turn a $0/run actor into a $0.05/run actor that nobody runs.

Want a custom scraper that doesn’t break this way?

I take pilot work at $100/article or $150 for a 3-article series (real production-grade content with code that runs and case studies with verifiable numbers — like this post). I also build custom Apify actors on commission.

The Trustpilot scraper above is a public actor — you can run it for free on Apify’s Free Plan today.