Token Economics of Agent-Driven Scraping: When LLM Agents Cost 50× More Than a Cron Job
A lot of “agent-driven scraping” blog posts pitch the idea as inevitable progress — “let the LLM see the page, decide what to click, handle the schema drift.” Six months running production scrapers (970 lifetime runs on a single Trustpilot actor as of last week) taught me a different lesson: agent-driven scraping is 30–80× more expensive than a deterministic crawler for any workload above ~50 pages per run, and the cost grows linearly with every page you visit, not just with schema changes.
This post walks through the actual token math on a real workload. Then I’ll show the two narrow cases where an LLM agent does pay off — and the broader pattern of using LLMs as a fallback, not a primary loop.
Setup: the workload we’re costing
Three real workloads I run weekly:
- Trustpilot reviews scraper — 970 production runs over ~14 months. Average run hits 12 pages, extracts ~250 reviews with 9 fields each. Pure HTTP + BeautifulSoup, no JS.
- Reddit subreddit threads scraper — 92 lifetime runs. Average run: 5 pages, ~150 posts with comment trees up to depth 3.
- Email extractor pro — 138 lifetime runs. Crawls a domain up to 50 pages deep, extracts mailto links and inline-text emails with simple regex.
Total: 2,190 runs across 32 public actors. None of them use an LLM in the critical path. Two of them (Trustpilot and email-extractor) use a 20-line GPT-4o-mini post-processor that runs on the result, not on each page. That’s the cost-line we’re going to defend.
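For a sense of how small the deterministic side can be, here is a minimal sketch of a regex-based email pass like the one described for the email extractor. It is not the actor's actual code: the patterns and the extract_emails name are illustrative assumptions, and the pattern is deliberately simple, so it will miss some valid but unusual addresses.

```python
import re
from urllib.parse import unquote

# Deliberately simple patterns (illustrative, not RFC-complete).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MAILTO_RE = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)


def extract_emails(html: str) -> set[str]:
    """Collect addresses from mailto: links and from inline text."""
    # mailto targets may be percent-encoded, so decode them first.
    found = {unquote(m).strip().lower() for m in MAILTO_RE.findall(html)}
    # Inline addresses anywhere in the markup or text.
    found.update(m.lower() for m in EMAIL_RE.findall(html))
    return found


sample = (
    '<a href="mailto:Sales@example.com?subject=Hi">Mail us</a> '
    "or write to support@example.org."
)
print(sorted(extract_emails(sample)))
```

Lowercasing and collecting into a set deduplicates addresses that appear both as a link and as visible text.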
The naive agent loop and what it actually costs
Standard “agent-driven scraper” loop, simplified:
    import json

    reviews = []
    # fetch() and llm.complete() are placeholders for your HTTP client and LLM SDK.
    for page_url in target_urls:
        html = fetch(page_url)
        prompt = f"""
    You are scraping reviews from Trustpilot. Given this HTML, extract a JSON array of
    reviews with fields: author, rating, title, body, date, helpful_count, country,
    review_id, response_from_company.
    HTML:
    {html}
    """
        # One full-page LLM call per URL: this is the line that drives the cost.
        response = llm.complete(prompt, model="gpt-4o-mini")
        reviews.extend(json.loads(response))
The math, using OpenAI gpt-4o-mini list prices as of writing ($0.15 / 1M input tokens, $0.60 / 1M output):
- A typical Trustpilot review page (decompressed, formatted HTML): ~85,000 input tokens, even after the boilerplate stripping you can reasonably do without breaking the layout.
- Expected output (20 reviews × ~80 tokens of JSON): ~1,600 output tokens.
- Cost per page: 85,000 × $0.15/1M + 1,600 × $0.60/1M ≈ $0.01275 + $0.00096 ≈ $0.0137.
- A single 12-page Trustpilot run: ~$0.165.
- If all 970 runs were repeated every week (a hypothetical cadence): 970 runs × $0.165 ≈ $160/week, or ~$8,300/year.
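The arithmetic in the list above can be reproduced in a few lines. This is a sketch that uses only the prices and token counts quoted here; the function and variable names are illustrative, and the token counts are this post's estimates, not measured tokenizer output.

```python
INPUT_PRICE_PER_M = 0.15   # USD per 1M input tokens (gpt-4o-mini list price quoted above)
OUTPUT_PRICE_PER_M = 0.60  # USD per 1M output tokens


def agent_cost_per_page(input_tokens: int, output_tokens: int) -> float:
    """LLM cost in USD for one page fed to the model in full."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


per_page = agent_cost_per_page(85_000, 1_600)
per_run = per_page * 12       # 12 pages per average run
per_970_runs = per_run * 970  # the hypothetical weekly batch
per_year = per_970_runs * 52

print(f"per page:   ${per_page:.4f}")
print(f"per run:    ${per_run:.3f}")
print(f"970 runs:   ${per_970_runs:.0f}")
print(f"per year:   ${per_year:,.0f}")
```

Input tokens account for over 90% of the per-page figure, which is why trimming the output schema barely moves the total.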
Now compare against the deterministic crawler I actually run:
- Apify compute unit for the Trustpilot actor: ~0.003 CU per page × 12 pages = 0.036 CU.
- At Apify’s $0.40/CU price: $0.0144 per run.
- 970 lifetime runs cost roughly $14 total compute plus proxy bandwidth (residential proxies for ~14 GB ≈ $42 over 14 months).
The LLM-agent version would cost ~$160 for the same 970 runs that the cron job handled for ~$14 of compute, or ~$0.165 versus ~$0.014 per run. That’s a ~11× cost multiplier on per-run compute before you factor in proxies (which are similar for both approaches). Per useful scraped row, by my estimate, the LLM agent runs 30–80× more expensive.
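To redo this comparison for a different page count or price, the two per-run costs can be put side by side. A minimal sketch using only the figures quoted in this section: it covers compute only, leaves proxies out, and the function names and default parameters are assumptions to replace with your own numbers.

```python
def agent_run_cost(pages: int, in_tok: int = 85_000, out_tok: int = 1_600,
                   in_price: float = 0.15, out_price: float = 0.60) -> float:
    """USD per run when every page goes through the LLM (prices per 1M tokens)."""
    return pages * (in_tok * in_price + out_tok * out_price) / 1_000_000


def crawler_run_cost(pages: int, cu_per_page: float = 0.003,
                     cu_price: float = 0.40) -> float:
    """USD per run for the deterministic crawler, billed in compute units."""
    return pages * cu_per_page * cu_price


pages = 12
agent = agent_run_cost(pages)
crawler = crawler_run_cost(pages)
ratio = agent / crawler

print(f"agent:   ${agent:.4f} per run")
print(f"crawler: ${crawler:.4f} per run")
print(f"ratio:   {ratio:.1f}x on compute alone")
```

Both costs scale linearly with the page count, so the compute ratio stays the same at any run size; only the absolute gap grows.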
“But schema drift!” What actually happens
The strongest argument for agent-driven scraping is resilience to schema changes. The pitch: “When Trustpilot redesigns its review card, you don’t rewrite a selector, the LLM just adapts.”
I tracked schema changes on Trustpilot over 14 months across 970 runs:
- 3 visible HTML changes that broke selectors.
- 2 of the 3 were detected by my monitor (a sanity check that warns when the percentage of reviews with an empty rating rises above 5% across a window of 50 runs).
- Mean time to fix: 8 minutes for the first incident, 4 minutes for the second once I had the pattern (it’s almost always a class-name change on the rating element).
So: 12 engineering minutes total over 14 months, vs. $8,300/year for the LLM-agent fallback that would have absorbed those failures silently. Not a great trade.
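A monitor of the kind described above (empty-rating rate over a rolling window of 50 runs, alert above 5%) fits in a small class. This is a minimal sketch of the idea, not the production monitor: the class name, method names, and the choice to pool rows across the window are all assumptions.

```python
from collections import deque


class EmptyFieldMonitor:
    """Track the share of rows with an empty field across the last N runs."""

    def __init__(self, window_runs: int = 50, threshold: float = 0.05):
        # Each entry is (empty_rows, total_rows) for one run; old runs fall off.
        self.runs = deque(maxlen=window_runs)
        self.threshold = threshold

    def record_run(self, rows: list[dict], field: str = "rating") -> bool:
        """Record one run and return True if the window now warrants an alert."""
        empty = sum(1 for row in rows if row.get(field) in (None, ""))
        self.runs.append((empty, len(rows)))
        return self.should_alert()

    def empty_rate(self) -> float:
        total = sum(t for _, t in self.runs)
        return sum(e for e, _ in self.runs) / total if total else 0.0

    def should_alert(self) -> bool:
        return self.empty_rate() > self.threshold


monitor = EmptyFieldMonitor()
healthy_run = [{"rating": 5}] * 250
for _ in range(50):
    monitor.record_run(healthy_run)
print(monitor.empty_rate())
```

Pooling over a window smooths out single noisy runs, at the price of reacting a few runs late when a selector breaks completely. A per-run check catches that case sooner.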
What’s worse: an LLM agent also fails on schema drift, just less visibly. When Trustpilot moved the country flag from a <span class="country"> to an SVG <title> element, the agent kept extracting reviews, but it started returning null for country on 40% of rows because the model didn’t pick up the country signal in the new markup. A deterministic crawler would have raised an empty-field rate alarm on the next run. The agent silently degraded for two weeks before I noticed.
Schema drift on agents isn’t free — it just becomes a quality drift you can’t easily monitor.
When LLM agents are worth their tokens
There are two cases where I’ve seen agent-driven scraping pay off:
1. Long-tail one-shot extractions
You need to extract a specific value from 50 different vendor PDFs, each with a different layout. Building 50 deterministic parsers is a week of work; one prompt over pdfplumber text is 30 minutes and ~$0.20.
The math flips when N < ~200 pages, the layouts are unique, and the task runs once or twice. Anything that runs weekly with >50 pages should be deterministic.
2. Schema-flexible post-processing
This is the pattern I run in production. The crawler is deterministic — it extracts everything it can identify by selector. Then a downstream gpt-4o-mini pass on the extracted text fields normalizes things like:
- “5 stars / 5”, “5/5”, “★★★★★” → integer rating: 5
- “Great service, fast shipping!” → sentiment label
- Free-text country → ISO-3166 code
This pattern uses LLM tokens at well under 5% of the “agent-on-each-page” cost because the inputs are short structured strings, not full HTML pages. On the Trustpilot actor: 970 runs × 250 reviews × ~30 input tokens per normalization ≈ 7.3M tokens total ≈ $1.10 in LLM input cost over 14 months.
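The post-processing figure can be checked the same way as the per-page cost. A sketch under the assumptions stated above: input tokens only, at the gpt-4o-mini input price quoted earlier. Output tokens and any fixed prompt overhead are left out, and the function name is illustrative.

```python
def normalization_cost(runs: int, reviews_per_run: int, tokens_per_review: int,
                       price_per_m: float = 0.15) -> tuple[int, float]:
    """Return (total input tokens, USD cost) for the downstream LLM pass."""
    tokens = runs * reviews_per_run * tokens_per_review
    return tokens, tokens * price_per_m / 1_000_000


tokens, cost = normalization_cost(runs=970, reviews_per_run=250, tokens_per_review=30)
print(f"{tokens / 1e6:.1f}M input tokens, ${cost:.2f}")
```

The token count here depends on the number of extracted rows, not on page weight, so a heavier page layout does not change this bill.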
That’s the ratio that works. Deterministic extraction, LLM normalization downstream of the parser.
The cost model in one paragraph
The single rule that’s saved me thousands: token cost scales with raw HTML bytes you feed the model, deterministic-crawler cost scales with logical fields you extract. A modern Trustpilot review page has ~250 KB of HTML and produces ~20 useful reviews. The crawler reads it once and outputs 20 × 9 = 180 fields. The agent reads it once and pays for every byte of layout, navigation, ads, and footers in its input tokens — but produces the same 180 fields. You’re paying for the chrome of the page, not the data.
This is also why “just strip the HTML before sending” doesn’t save you as much as you think. The selector logic that knows what to strip is the deterministic parser. Once you’ve written it, you’ve already done 90% of the scraping work; calling an LLM after that is paying for the same job twice.
What I do instead (the boring stack)
For workloads above ~50 pages/run:
- Deterministic crawler with BeautifulSoup or Playwright selectors, written once.
- Quality sentinel: a check that runs after each scrape — “what percentage of rows have empty critical fields?” If it crosses 5%, alert.
- LLM post-processor on extracted text fields only (rating normalization, sentiment, country code).
- Manual fix window: when the sentinel fires, I check the diff against the last working selector. Median fix: 5 minutes.
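The quality sentinel in the list above can be as small as this. A minimal sketch: the field list, function names, and the decision to treat an empty result set as a failure are assumptions, and the alerting itself (email, webhook, chat message) is left out.

```python
CRITICAL_FIELDS = ("author", "rating", "body", "date")  # illustrative choice


def empty_field_rates(rows: list[dict], fields=CRITICAL_FIELDS) -> dict[str, float]:
    """Share of rows where each critical field is missing or empty."""
    if not rows:
        # Zero rows usually means the listing selector itself broke.
        return {field: 1.0 for field in fields}
    return {
        field: sum(1 for row in rows if row.get(field) in (None, "")) / len(rows)
        for field in fields
    }


def failing_fields(rows: list[dict], threshold: float = 0.05) -> list[str]:
    """Names of critical fields whose empty rate crosses the threshold."""
    return sorted(
        field for field, rate in empty_field_rates(rows).items() if rate > threshold
    )


good = {"author": "a", "rating": 5, "body": "text", "date": "2024-01-01"}
rows = [good] * 94 + [{**good, "date": None}] * 6
print(failing_fields(rows))
```

Returning the field names rather than a bare boolean points the manual fix at the selector that most likely changed.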
For ad-hoc extractions under ~200 pages with unique layouts: a single LLM call per document, no fancy looping.
I haven’t paid more than $4/month in LLM costs across all 32 production actors. The same workload run through an agent-loop would be ~$700/month. The reliability difference, in my experience, has been zero or slightly negative for the deterministic stack.
If you’re sizing up “agent-driven scraping” for a real production workload, do the per-page token math first. Most of the time the answer is: a cron job, a selector, and a five-line sentinel script will get you 99% of the value at 1% of the cost — and the cost difference compounds every week.
Originally published at blog.spinov.online. More production scraping tips: t.me/scraping_ai
This article was drafted with AI assistance and edited by a human author.