Apify vs. self-hosted: the three numbers I use to decide
When I get a “should we use Apify or just write our own scraper?” question from a client, my honest answer is “it depends on three numbers.” Most teams ask the question wrong and then spend two engineering weeks building the wrong thing. This post is the decision framework I actually use, backed by run data from 31 public Apify actors I maintain, including one (trustpilot-review-scraper) at 949 production runs across six months.
The framework is short. The cases where it breaks are interesting.
The three numbers that matter
Before you write a line of code, get rough answers to:
- Run frequency — how often does the scrape need to fire? Once a month? Hourly? Per HTTP request from a product?
- Volume per run — 50 records? 50 thousand? 5 million?
- Failure tolerance — if the scrape silently returns garbage for 24 hours, does someone get paged, lose money, or just shrug?
The mistake I see most often: teams optimize for the second number, volume, when their actual constraints are the first and third, frequency and failure tolerance. They build a beautiful Scrapy + Redis + Kubernetes pipeline for a cron that fires twice a week, and then nobody on the team can debug it when the target site adds a CAPTCHA.
The decision matrix
Here’s the rough cut I use after building 78 actors and watching where my own clients hit the wall:
Use case         │ Run freq    │ Volume/run │ Failure tol │ Default pick
─────────────────┼─────────────┼────────────┼─────────────┼─────────────
Marketing-ops    │ weekly      │ <10K       │ high        │ Apify
Price monitoring │ hourly      │ <50K       │ medium      │ Apify
Investor data    │ on-demand   │ <100K      │ high        │ Apify
ETL pipeline     │ daily cron  │ >500K      │ low         │ Self-hosted
LLM training set │ one-shot    │ >5M        │ low         │ Self-hosted
Product feature  │ per request │ any        │ low         │ Self-hosted
The rows above are the boring cases. The interesting ones are the mixed profiles that fall between those rows, which is where teams pick wrong.
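The matrix can be read as a small lookup rule. The sketch below is one way to encode it, not a definitive policy: the `ScrapeProfile` class, the `default_pick` function, and the exact cut-offs (100K and 500K records per run) are assumptions taken from the table rows. In the table, run frequency only changes the pick in the per-request case, so it is modelled as a single flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeProfile:
    """Hypothetical container for the three numbers."""
    records_per_run: int         # volume per run
    failure_tolerance: str       # "high", "medium", or "low"
    per_request: bool = False    # True if a user-facing request waits on the scrape


def default_pick(p: ScrapeProfile) -> str:
    """Return the matrix's default pick, or "unclear" for profiles it does not cover."""
    if p.per_request:
        # Product-feature row: any volume, low tolerance, on the request path.
        return "self-hosted"
    if p.failure_tolerance == "low" and p.records_per_run > 500_000:
        # ETL and training-set rows.
        return "self-hosted"
    if p.failure_tolerance in ("high", "medium") and p.records_per_run <= 100_000:
        # Marketing-ops, price-monitoring, and investor-data rows.
        return "apify"
    # Anything else falls between the rows and needs a closer look.
    return "unclear"


if __name__ == "__main__":
    print(default_pick(ScrapeProfile(8_000, "high")))       # weekly marketing pull
    print(default_pick(ScrapeProfile(200_000, "medium")))   # falls between the rows
```

Returning "unclear" instead of forcing a pick is deliberate: profiles between the rows are the ones that deserve an actual conversation.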
Where Apify wins (and the real cost)
Apify wins when failure tolerance is high. Translation: a human will look at the data before it goes anywhere consequential. A marketer pulling competitor pricing once a week. A growth team enriching a list before a campaign. An investor pulling Trustpilot reviews before a portfolio call.
In those cases, the platform’s value isn’t speed or scale — it’s the boring stuff:
- The scraper still works tomorrow. When Trustpilot rotated their DOM in March, my actor needed a 14-line patch. The 8 paying users didn’t notice the rotation; they just got their CSV the next day. Self-hosted scrapers tend to fail silently — the cron returns “0 records” and nobody notices for a week.
- Proxies are managed. I don’t think about residential IP rotation, I don’t pay Bright Data $500/mo, I don’t run a proxy pool. The platform amortizes that cost across thousands of users.
- The output schema is stable. Customers can build downstream pipelines on dataset.json and not have it break when I refactor internals.
- The cost line is predictable. A run that pulls 1,000 reviews costs the customer roughly the same today as it did in November. They can put it in a spreadsheet and approve it.
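The silent-failure mode from the first bullet (a cron that returns "0 records" and nobody notices) is cheap to guard against wherever the scraper runs. A minimal sketch, assuming you keep the record counts of recent runs somewhere; the function name `looks_unhealthy` and the 50% threshold are illustrative assumptions, not a platform feature:

```python
from statistics import median


def looks_unhealthy(recent_counts: list[int], latest: int, min_ratio: float = 0.5) -> bool:
    """Flag a run whose record count is zero or far below the recent median."""
    if latest == 0:
        return True  # "0 records" is always suspicious
    if not recent_counts:
        return False  # no history yet, nothing to compare against
    baseline = median(recent_counts)
    return latest < baseline * min_ratio


if __name__ == "__main__":
    history = [1000, 980, 1020]
    for count in (0, 400, 900):
        status = "ALERT" if looks_unhealthy(history, count) else "ok"
        print(count, status)
```

Comparing against a median rather than a fixed number means the check keeps working as the target site's volume drifts over time.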
The real cost of Apify is per-run overhead. You pay for compute time you didn’t use, you pay for memory you didn’t need, you pay platform margin. For my Trustpilot actor, the customer-facing price is about $0.012 per review extracted. If a team is pulling 10M reviews a year, that’s $120K/year, at which point self-hosting absolutely wins, even after you account for proxies and engineering time.
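The arithmetic behind that figure is plain multiplication: at $0.012 per review, 10 million reviews cost $120,000. A small sketch for plugging in your own volume; the function name is hypothetical, and the price is expressed per 1,000 reviews only to keep the arithmetic exact:

```python
PRICE_PER_1000_REVIEWS_USD = 12  # equals the quoted $0.012 per review


def managed_cost_usd(reviews: int) -> float:
    """Platform bill for a given number of extracted reviews."""
    return reviews * PRICE_PER_1000_REVIEWS_USD / 1000


if __name__ == "__main__":
    for volume in (1_000, 100_000, 10_000_000):
        print(f"{volume:>12,} reviews -> ${managed_cost_usd(volume):,.0f}")
```

The result covers whatever period the volume covers: pass a yearly count to get a yearly bill.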
Where self-hosted wins (and the real cost)
Self-hosted wins when the scrape is on the critical path of a product, not a batch job. If your application makes an HTTP request and waits for scraped data to render a page, you cannot tolerate Apify’s cold-start latency (1.5–4 seconds depending on the actor). You need a long-running worker pool that’s already warm.
Self-hosted also wins for massive one-shot training sets. Pulling 5M arXiv abstracts to fine-tune a domain model is a one-week job for a single Scrapy worker on a $20 VPS. You don’t need a platform; you need a screen session and patience.
Here’s the real cost of self-hosting, line by line. Teams chronically underestimate it:
# Hidden costs of "rolling your own", honest tally (USD):
HIDDEN_COSTS = {
    "Initial build (one engineer, ~2-3 weeks)": 12_000,          # one-time
    "Proxy provider (Bright Data residential)": 6_000,           # /year
    "Maintenance (4-8 hrs/month, target site changes)": 7_200,   # /year
    "Failed-run alerting infrastructure": 1_200,                 # /year
    "DOM-rotation patches (3-5x/year)": 2_400,                   # /year
    "On-call when production scrape breaks at 2am": "priceless", # not summed
}
# Sum only the numeric line items; the on-call entry has no dollar figure.
TOTAL_YEAR_ONE = sum(v for v in HIDDEN_COSTS.values() if isinstance(v, int))
print(TOTAL_YEAR_ONE)  # 28800 USD in year one
If your team is running scrapes that gross less than ~$30K/year of business value, self-hosting is almost always the wrong financial choice. You’re paying engineers $150/hour to babysit a cron that costs $50/month on Apify.
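To see where the math flips, divide the self-hosted year-one total by the managed per-record price. The sketch below reuses two figures from this post, the $28,800 tally and the $0.012 per record my Trustpilot actor charges. Applying that one actor's price to scraping in general is an assumption, as is ignoring the marginal compute cost of self-hosted runs; the function names are hypothetical.

```python
SELF_HOSTED_YEAR_ONE_USD = 28_800   # year-one tally from the list above
PRICE_PER_1000_RECORDS_USD = 12     # $0.012 per record, assumed representative


def breakeven_records_per_year(
    self_hosted_usd: int = SELF_HOSTED_YEAR_ONE_USD,
    price_per_1000_usd: int = PRICE_PER_1000_RECORDS_USD,
) -> float:
    """Yearly volume at which the managed bill equals the self-hosted total."""
    return self_hosted_usd * 1000 / price_per_1000_usd


def managed_is_cheaper(records_per_year: int) -> bool:
    """True while the yearly volume stays below the break-even point."""
    return records_per_year < breakeven_records_per_year()


if __name__ == "__main__":
    per_year = breakeven_records_per_year()
    print(f"break-even: {per_year:,.0f} records/year ({per_year / 12:,.0f}/month)")
```

Under these assumptions the break-even sits at 2.4 million records a year, about 200,000 a month. Below that, the managed actor is the cheaper line item.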
The case I see go wrong most often
A startup builds a self-hosted scraper for a product feature that “needs to be fast and we can’t afford platform fees.” Six months later: the scraper breaks every Tuesday because the target adds JS rendering. Customer support is fielding angry tickets. The senior engineer who built it has left. The replacement engineer takes three weeks to get up to speed on the codebase. The total cost of ownership is now ~$80K/year for what could have been a $200/month managed actor with a 99.4% reliability number.
The right question isn’t “Apify vs self-hosted.” It’s: what’s the cheapest way to deliver this data with the failure tolerance our customers actually need?
For most teams below $50M ARR, the answer is “managed first, self-host the bottlenecks.” You start on Apify or a similar platform, you instrument run-time and failure rate, and you peel off the 1–2 highest-volume scrapes onto self-hosted workers when the math flips. You don’t roll your own from day one.
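Once the instrumentation exists, picking what to peel off is a sort. A minimal sketch, assuming you can put a yearly managed bill and a yearly self-hosted estimate next to each scrape; the `ScrapeStats` fields, the function name, and the sample figures are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class ScrapeStats:
    name: str
    managed_usd_per_year: float      # what the platform bills for this scrape
    self_hosted_usd_per_year: float  # your estimate, incl. proxies and engineering


def peel_off_candidates(stats: list[ScrapeStats], limit: int = 2) -> list[str]:
    """Names of scrapes where self-hosting is cheaper, biggest saving first."""
    flipped = [s for s in stats if s.managed_usd_per_year > s.self_hosted_usd_per_year]
    flipped.sort(
        key=lambda s: s.managed_usd_per_year - s.self_hosted_usd_per_year,
        reverse=True,
    )
    return [s.name for s in flipped[:limit]]


if __name__ == "__main__":
    portfolio = [
        ScrapeStats("weekly-competitor-prices", 600, 28_800),
        ScrapeStats("review-firehose", 120_000, 40_000),
        ScrapeStats("catalog-sync", 45_000, 30_000),
    ]
    print(peel_off_candidates(portfolio))
```

The `limit` of 2 mirrors the "1–2 highest-volume scrapes" advice: everything else stays managed until its own numbers flip.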
Honest caveat on this post
I run an Apify portfolio. I have a financial interest in people picking Apify. I’ve tried to write the trade-offs honestly — the volumes, the costs, the cases where self-hosting is correct. If you want to push back on any of the numbers, my email is at the bottom.
About the author: I maintain 31 public Apify actors (78 total in portfolio), including trustpilot-review-scraper at 949 production runs. Articles cross-posted from blog.spinov.online.
If you have a specific scraping problem and want a no-nonsense “Apify vs self-host” read on your case, email me. Or browse my Apify Store profile for the actor library.
More writing like this: t.me/scraping_ai — Python, scraping, OCR pipeline tips.
Disclosure: I maintain Apify actors related to this topic; links may direct to my Apify Store profile.
This article was drafted with AI assistance and edited by a human author.