How my Trustpilot scraper survived 949 production runs (and the 3 things that almost killed it)
I shipped a Trustpilot reviews scraper to the Apify Store about a year ago. It now sits at 949 production runs as of late April 2026. It’s not the most-run scraper on the platform, but it’s mine, it works, and it’s outlasted at least two upstream redesigns of the Trustpilot frontend.
This post is the post-mortem I never wrote. Three production failures, the actual code that broke, and the cost-math behind why I didn’t migrate to a residential proxy provider even when the temptation was strongest.
If you’re building scrapers for the Apify Store (or for any “we run this on a schedule for paying customers” workflow), the patterns below cost me 4 weekends to learn. They’re free here.
The shape of the actor
The actor takes a Trustpilot business URL and a maxPages integer. It returns a flat array of review objects: reviewId, rating, title, body, author, country, dateOfExperience, replyFromBusiness. Output formats: JSON, CSV, Excel.
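For concreteness, here is what one output record looks like. The field names are the actor's real schema from above; the values are made up for illustration:

```python
# One illustrative output record (field names from the actor's schema; values invented).
example_review = {
    "reviewId": "rev_0001",
    "rating": 4,
    "title": "Fast delivery",
    "body": "Order arrived two days early.",
    "author": "Jane D.",
    "country": "GB",
    "dateOfExperience": "2024-11-13T00:00:00.000Z",
    "replyFromBusiness": None,  # null when the business never answered
}
```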
The initial version was a straight httpx + selectolax pass against the public review pages:
import httpx
from selectolax.parser import HTMLParser

async def fetch_reviews_page(url: str, page: int, client: httpx.AsyncClient):
    resp = await client.get(f"{url}?page={page}", timeout=20.0)
    resp.raise_for_status()
    tree = HTMLParser(resp.text)
    cards = tree.css("article[data-service-review-card-paper]")
    return [parse_card(c) for c in cards]
Worked fine at 5–10 runs/day. Then I crossed about 200 runs and the failures started.
Failure #1 — Trustpilot rebuilt the React tree, my CSS selectors died silently
In late summer the actor started returning empty arrays for ~30% of runs. No error, no exception — just an empty list. CSS selector article[data-service-review-card-paper] was returning zero matches because the frontend team had renamed it to article[data-review-card]. The old attribute was still emitted on some pages (server-side cached) but not all.
The lesson: never rely on a single CSS selector when scraping a SPA. The fix was a fallback chain plus a sanity-check on output count:
SELECTORS = [
    "article[data-service-review-card-paper]",  # original attribute
    "article[data-review-card]",                # post-redesign attribute
    "[data-service-review-typography]",
    "section[data-business-unit-id] article",
]

def parse_review_cards(tree: HTMLParser):
    for sel in SELECTORS:
        cards = tree.css(sel)
        if cards:
            return cards, sel
    return [], None
cards, used_selector = parse_review_cards(tree)
if not cards:
    raise RuntimeError(
        f"No review cards on {url}. Page size={len(tree.html)}. "
        f"Tried selectors: {SELECTORS}"
    )
Two extra signals that catch silent failures:
- len(tree.html) — if the page is suspiciously small (< 30 KB), it’s probably a soft-blocked stub, not the real listing.
- used_selector — log which selector matched. When 100% of yesterday’s runs used selector #1 and today 100% use selector #3, that’s your warning that the next rebuild is coming.
Failure #2 — JSON payload schema drift on dateOfExperience
A few months later, Trustpilot started returning two formats for dateOfExperience:
"dateOfExperience": "2024-11-13T00:00:00.000Z" # old, ISO 8601
"dateOfExperience": "Nov 13, 2024" # new, locale string for some markets
datetime.fromisoformat blew up on the second format. Customers in Excel saw #VALUE! for half their rows. I got 4 angry emails in one day.
The fix is boring but matters: a defensive parser with a fallback to dateutil.parser.parse, and a sentinel value (None) instead of crashing the whole run:
from dateutil import parser as dt_parser

def parse_date(raw: str | None) -> str | None:
    if not raw:
        return None
    try:
        return dt_parser.parse(raw).isoformat()
    except (ValueError, TypeError):
        return None  # log and continue — never poison the output
Production rule: a scraper that returns 95% of the data with 5% nulls is infinitely better than one that returns 0% and an exception. Customers can filter nulls; they can’t run an Excel macro on a stack trace.
Failure #3 — Apify free-tier proxy, the 429 cliff
The actor ran on Apify’s auto-rotating shared datacenter proxy. For the first 600 or so runs this was fine. Then sometime in early 2026 Trustpilot started returning HTTP 429 + a Cloudflare interstitial after about page 7 on any single business URL. Datacenter IPs got pattern-flagged.
I priced the obvious migration: residential proxies via Bright Data or Oxylabs at roughly $10–14 per GB. A typical run scrapes 8 pages × ~120 KB = ~1 MB per business → ~$0.012 per run. With 949 runs and current run-rate of ~3/day, residential would have cost roughly $11 over the actor’s full lifetime, or about $1/month going forward. Affordable. But it makes the actor more expensive to publish at $0/run on the Free Plan, which is what most of my Apify Store traffic uses.
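The arithmetic above is simple enough to keep in a scratch script. A sketch using the post’s own figures ($12/GB is my midpoint of the quoted $10–14 range):

```python
# Back-of-envelope residential proxy cost, using the figures from the post.
PAGES_PER_RUN = 8
KB_PER_PAGE = 120
USD_PER_GB = 12.0  # midpoint of the quoted $10-14/GB range (assumption)

mb_per_run = PAGES_PER_RUN * KB_PER_PAGE / 1024   # ~0.94 MB per business
cost_per_run = mb_per_run * USD_PER_GB / 1024     # ~$0.011 per run
lifetime = cost_per_run * 949                     # ~$10 over all runs so far
monthly = cost_per_run * 3 * 30                   # ~$1/month at 3 runs/day
```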
What actually fixed it without paying for residential:
- Cap the maxPages default at 5 (down from 20). Most users only want recent reviews.
- Add a 1.2–2.5 s randomized delay between page requests: asyncio.sleep(random.uniform(1.2, 2.5)).
- On 429, back off, honoring Retry-After and capping it: await asyncio.sleep(min(int(resp.headers.get("retry-after", 30)), 120)).
- Rotate user-agents from a pool of 12 modern Chrome/Firefox strings on every request.
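The delay, backoff, and rotation pieces compose into a few small helpers. A sketch; the two user-agent strings stand in for the full pool of 12:

```python
import random

# Illustrative stand-ins for the real 12-string pool.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def backoff_delay(headers: dict, cap: float = 120.0, default: float = 30.0) -> float:
    """Honor Retry-After when present and parseable, else use a default; never exceed cap."""
    try:
        retry_after = float(headers.get("retry-after", default))
    except (TypeError, ValueError):
        retry_after = default
    return min(retry_after, cap)

def polite_delay() -> float:
    """Randomized inter-page pause so the request cadence isn't machine-regular."""
    return random.uniform(1.2, 2.5)

def pick_user_agent() -> str:
    return random.choice(USER_AGENTS)
```

In the fetch loop, you sleep polite_delay() between pages, and on a 429 you sleep backoff_delay(resp.headers) before retrying with a fresh pick_user_agent().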
After this, the 429 rate dropped from ~18% of pages to ~2%, and the remaining 2% resolved within one retry. No residential proxy migration needed at current scale.
What 949 runs actually look like
- Mean run duration: ~38 seconds (5-page default).
- Median data returned: 87 reviews.
- 7-day retention rate (run twice within 7 days): ~22% — meaning roughly 1 in 5 users come back, which I treat as the leading indicator that this is a real workflow tool, not a one-off curiosity scrape.
- Failure rate post-fix: < 3% of runs end in error.
- Active users: small but steady — 3–5/day.
It’s not a viral hit. It’s a quiet, durable tool that compounds.
Three rules I now apply to every Apify actor I ship
- Sentinel logging beats validation. If you can’t tell from logs which selector matched, which date format won, and how many bytes the page returned, you can’t debug a silent failure.
- Default conservative inputs. maxPages=5, not maxPages=50. Users who need more will configure it. Users who don’t will not 429 themselves into a refund.
- Resist the proxy-upgrade temptation until the math forces you. Residential proxies are great. They’re also the fastest way to turn a $0/run actor into a $0.05/run actor that nobody runs.
Want a custom scraper that doesn’t break this way?
I take pilot work at $100/article or $150 for a 3-article series (real production-grade content with code that runs and case studies with verifiable numbers — like this post). I also build custom Apify actors on commission.
- Browse my actors: apify.com/knotless_cadence
- More like this: t.me/scraping_ai
- Email: spinov001@gmail.com
The Trustpilot scraper above is a public actor — you can run it for free on Apify’s Free Plan today.