When NOT to scrape: 3 patterns where I now reach for an API instead


After shipping 79 Apify actors over the last six months, I’ve learned something embarrassing: maybe a third of them shouldn’t have been scrapers at all. The site I targeted had a public JSON endpoint, or an Algolia index, or a documented REST API I could have called instead — usually for free.

This post lays out the rule I now apply before writing a single line of selector code. There are three patterns where you should stop, close your scraper, and reach for an API. Each one has cost me real time and proxy spend. Each one comes with a specific signal you can check for in 60 seconds.

Pattern 1: the site already returns JSON if you ask nicely

The first place to look isn’t the HTML. It’s the network tab.

Open the page, hit Cmd+Opt+I (or F12), filter by XHR, and reload. Most modern SPAs render from a JSON payload that’s a single fetch() call away. You don’t need to render anything. You don’t need a headless browser. You don’t need to bypass Cloudflare.

Reddit is the canonical example. People scrape reddit.com/r/python with Selenium for hours. Meanwhile:

curl -s "https://www.reddit.com/r/python/top.json?limit=25&t=week" \
  -H "User-Agent: my-agent/1.0" | jq '.data.children[].data.title'

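That returns 25 titles in 200 ms. No login. It’s documented. It’s been there since 2012. It will likely still be there in 2032. Two caveats: Reddit does rate-limit unauthenticated clients, and requests with a generic or missing User-Agent tend to get throttled or blocked, so send a descriptive one and stay at human speeds.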
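
The same request can be sketched in Python with only the standard library. This is a minimal sketch, not the code of any particular actor: the function names are mine, the User-Agent string is the placeholder from the curl call above, and the payload shape is assumed to be the `.data.children[].data.title` path that the jq filter walks.

```python
import json
import urllib.request

# Placeholder User-Agent taken from the curl example; use something descriptive.
USER_AGENT = "my-agent/1.0"


def extract_titles(listing: dict) -> list[str]:
    """Pull post titles out of a listing payload (.data.children[].data.title)."""
    return [child["data"]["title"] for child in listing["data"]["children"]]


def fetch_top_titles(subreddit: str, limit: int = 25, period: str = "week") -> list[str]:
    """Fetch a subreddit's top posts through the JSON endpoint (needs network access)."""
    url = f"https://www.reddit.com/r/{subreddit}/top.json?limit={limit}&t={period}"
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_titles(json.load(response))
```

Calling `fetch_top_titles("python")` should give the same titles as the curl pipeline. Keeping the parsing in its own function means it can be tested against a saved payload without touching the network.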

Hacker News has a similar pattern through Algolia:

curl -s "https://hn.algolia.com/api/v1/search?query=python&tags=story" | jq '.hits[].title'

Free. No key. The default page is 20 hits; add hitsPerPage to raise that, up to 1000 hits per query.
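
A hedged Python sketch of the same search, with paging. The helper names and the `max_pages` cap are my own choices for illustration; `hitsPerPage`, `page`, `hits` and `nbPages` are the standard Algolia search parameters and response fields as I understand them.

```python
import json
import urllib.parse
import urllib.request

HN_SEARCH = "https://hn.algolia.com/api/v1/search"


def build_search_url(query: str, tags: str = "story",
                     hits_per_page: int = 100, page: int = 0) -> str:
    """Build the search URL; hitsPerPage and page control pagination."""
    params = {"query": query, "tags": tags, "hitsPerPage": hits_per_page, "page": page}
    return f"{HN_SEARCH}?{urllib.parse.urlencode(params)}"


def extract_hit_titles(payload: dict) -> list[str]:
    """Return the non-empty titles from one page of results."""
    return [hit["title"] for hit in payload["hits"] if hit.get("title")]


def search_titles(query: str, max_pages: int = 3) -> list[str]:
    """Walk up to max_pages result pages (needs network access)."""
    titles: list[str] = []
    for page in range(max_pages):
        with urllib.request.urlopen(build_search_url(query, page=page), timeout=10) as response:
            payload = json.load(response)
        titles.extend(extract_hit_titles(payload))
        if page + 1 >= payload.get("nbPages", 0):
            break  # no further pages reported by the API
    return titles
```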

Signal to look for: open DevTools → Network → XHR. If you see one or two requests returning the data already structured as JSON, the article is over. Stop scraping. Use the JSON endpoint directly. Even if it’s “internal” and undocumented, sites this large rarely change them without a deprecation window — and if they do, a bare fetch() call is far easier to fix than an HTML selector forest.

The actor in my portfolio that does Reddit doesn’t render anything. It hits the JSON endpoint, parses it, returns CSV. It’s run 80+ times in the last 90 days and has never broken. Compare that to my early Selenium-based version, which broke roughly every two weeks.

Pattern 2: there’s a documented API with a generous free tier

Some categories of data have effectively unlimited free APIs that you only learn about by accident.

GitHub is the obvious one. People scrape repository listings. Meanwhile, the REST API gives you 60 requests per hour unauthenticated and 5,000 per hour with a free token. That’s enough to enumerate every repository in a 50-org space twice a day.

curl -s -H "Authorization: Bearer $GH_TOKEN" \
  "https://api.github.com/orgs/anthropics/repos?per_page=100" \
  | jq '.[].full_name'
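
The curl call above returns at most 100 repositories. For larger orgs, GitHub paginates through the `Link` response header. Below is a minimal sketch of following it, assuming the token is passed in by the caller; the function names and the User-Agent string are illustrative, not part of any SDK.

```python
import json
import re
import urllib.request

API_ROOT = "https://api.github.com"


def next_page_url(link_header):
    """Return the rel="next" URL from a Link header, or None on the last page."""
    if not link_header:
        return None
    for match in re.finditer(r'<([^>]+)>;\s*rel="([^"]+)"', link_header):
        if match.group(2) == "next":
            return match.group(1)
    return None


def list_org_repos(org, token=None):
    """Collect full_name for every repo in an org (needs network access)."""
    headers = {"Accept": "application/vnd.github+json", "User-Agent": "repo-lister-sketch"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    url = f"{API_ROOT}/orgs/{org}/repos?per_page=100"
    names = []
    while url:
        request = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(request, timeout=10) as response:
            names.extend(repo["full_name"] for repo in json.load(response))
            url = next_page_url(response.headers.get("Link"))
    return names
```

Each page costs one request against the hourly quota, so a 50-org sweep at 100 repos per page stays well inside the authenticated limit.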

The npm registry needs no key and enforces no per-token quota for read traffic, though heavy automated traffic can still be throttled. https://registry.npmjs.org/<package> returns the full metadata as JSON.

crates.io exposes everything through https://crates.io/api/v1/crates.

The Open Library API serves metadata for millions of books for free.

The CIA World Factbook is on GitHub as JSON.

OpenStreetMap’s Overpass API will run a query for you across the entire planet’s geometry for nothing.
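
Taking the npm registry from the list above as an example, here is a small sketch of reading package metadata. The helper names are hypothetical; the `dist-tags.latest` field is where the registry document records the current release, and scoped names are assumed to need their slash percent-encoded.

```python
import json
import urllib.parse
import urllib.request

REGISTRY = "https://registry.npmjs.org"


def package_url(name: str) -> str:
    """Build the metadata URL; scoped names like @types/node keep '@' but encode '/'."""
    return f"{REGISTRY}/{urllib.parse.quote(name, safe='@')}"


def latest_version(metadata: dict) -> str:
    """Read the version the 'latest' dist-tag points at."""
    return metadata["dist-tags"]["latest"]


def fetch_metadata(name: str) -> dict:
    """Download the full registry document for a package (needs network access)."""
    with urllib.request.urlopen(package_url(name), timeout=10) as response:
        return json.load(response)
```

`latest_version(fetch_metadata("express"))` is the whole "scraper": one GET and one dictionary lookup.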

Signal to look for: before you start writing selectors, search <site name> API documentation. If they have a developer portal, read three things in this order: the rate limits, the auth method, and whether they require a credit card on the free tier. If all three are sane, stop. Use the API. You’ll spend an hour learning their auth and save weeks of selector maintenance.

I have a country-info-scraper in my Apify portfolio and an npm-package-scraper. Both call APIs internally. Neither one parses HTML. The “scraper” framing is just because that’s the search term users type — under the hood they’re doing what the SDK would do, just packaged for non-coders.

Pattern 3: the data is already on a public dataset host

The third pattern is the one I missed for the longest time. Sometimes the data already exists, cleaned, on Hugging Face, Kaggle, the Internet Archive, or as an Awesome list on GitHub.

Wikipedia publishes complete database dumps at dumps.wikimedia.org: every article, every infobox, every redirect, in a compressed XML file of roughly 20 GB for English Wikipedia, regenerated about twice a month. People scrape Wikipedia. Don’t. Download the dump.
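
The dump is too large to load whole, so it has to be streamed. A minimal sketch with the standard library follows, assuming the usual MediaWiki export layout of `<page>` elements containing `<title>` and a `<text>` revision body; the function names are mine.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the XML namespace: '{ns}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export, one page at a time."""
    title, text = None, None
    for _event, element in ET.iterparse(stream, events=("end",)):
        name = local_name(element.tag)
        if name == "title":
            title = element.text
        elif name == "text":
            text = element.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            element.clear()  # drop the page's children so article text is not retained


def iter_dump(path: str):
    """Stream pages straight from a .xml.bz2 dump without unpacking it to disk."""
    with bz2.open(path, "rb") as dump:
        yield from iter_pages(dump)
```

Pointing `iter_dump` at a downloaded file such as `enwiki-latest-pages-articles.xml.bz2` (the local path is an assumption) walks every article with no HTTP requests at all.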

The Common Crawl dataset holds billions of public web pages, indexed, on AWS S3, free to download and query. If your scraping target is “the open web” and you don’t need this week’s freshness, Common Crawl probably already has it.
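
Lookups go through the Common Crawl URL index. The sketch below only builds the index query and the byte-range needed to pull one capture, under these assumptions: the crawl ID is an example (current IDs are listed at index.commoncrawl.org/collinfo.json), the index returns one JSON object per line with `filename`, `offset` and `length` fields, and the helper names are hypothetical.

```python
import json
import urllib.parse

INDEX_ROOT = "https://index.commoncrawl.org"
CRAWL_ID = "CC-MAIN-2024-10"  # example crawl ID (assumption); pick a current one


def index_query_url(url_pattern: str, crawl_id: str = CRAWL_ID) -> str:
    """Build an index lookup such as url=example.com/* with JSON-lines output."""
    params = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_ROOT}/{crawl_id}-index?{params}"


def parse_index_lines(body: str) -> list[dict]:
    """The index answers with one JSON object per line, not a JSON array."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def record_location(record: dict) -> tuple[str, str]:
    """Return (archive URL, Range header value) for fetching a single capture."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return f"https://data.commoncrawl.org/{record['filename']}", f"bytes={start}-{end}"
```

A ranged GET with that header returns one gzip-compressed WARC record, so a single page can be retrieved without downloading the whole archive file.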

For e-commerce price history, Keepa has Amazon product history going back years. For job listings, several aggregators publish CSV dumps under permissive licenses.

Signal to look for: before scraping, search "<dataset name>" site:huggingface.co and "<dataset name>" site:kaggle.com. If someone already extracted, cleaned, and re-shared the data, you don’t need to do it again. You need to use it.

When scraping really is the right answer

Scraping is the correct choice when one of three things is true:

  1. The site has no API, no JSON endpoint, and no public dump — and the data is small enough that a polite scraper at 2-second delays is feasible. (My Trustpilot scraper falls here. Trustpilot has no public review API for non-paying clients, and competitors all charge $200/month for access.)
  2. You need real-time freshness on a small surface — for example, monitoring 20 specific competitor pages every hour for price changes. APIs may not be granular enough; scraping a tiny surface at low frequency is fine.
  3. You’re aggregating across heterogeneous sources that have no common schema — for example, “every job board’s Python listings, in one CSV.” No API will give you that. A patient scraper will.

Outside those three, almost every scraping project I’ve seen would have been faster, cheaper, and more reliable as an API integration.

A 60-second decision rule

Before you write a single selector, run this checklist:

  1. Open DevTools → Network → XHR. Reload the page. Is the data returned as JSON in one or two requests?
  2. Search <site> API documentation. Is there a free tier with reasonable rate limits?
  3. Search "<dataset>" site:huggingface.co and site:kaggle.com. Has someone already published this?

If any answer is yes, close the scraper file. Use the API or the dump. Save yourself the proxy bill, the Cloudflare arms race, and the maintenance debt.

If all three are no, then by all means — scrape carefully, respect robots.txt, throttle politely, and ship.
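
For that last case, "respect robots.txt" and "throttle politely" can be sketched with the standard library. The 2-second default mirrors the delay mentioned earlier; the class and function names are illustrative, and the injectable clock and sleep exist only so the throttle can be tested without waiting.

```python
import time
import urllib.robotparser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float = 2.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

In a crawl loop, call `throttle.wait()` before each request and skip any URL for which `allowed_by_robots` returns False.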


If you’re choosing between building a scraper or using an API and want a second opinion, email me at spinov001@gmail.com — I’ll spend 10 minutes telling you which way to go for free. If you want me to actually build it, the pilot rate is $100 for a one-off scraper or $150 for a three-actor series.

Code samples I’ve shipped live at apify.com/knotless_cadence. More articles like this on blog.spinov.online and at @scraping_ai on Telegram.