I've Run 2,190 Production Scrapes — "Ethical" Isn't a robots.txt Question, It's a Rate-Limit One


There’s a good post going around this week — Federico Trotta on The Web Scraping Club, “How to Scrape Open-Source Datasets Ethically,” published May 24, 2026. It’s the kind of post the field needs more of: check the license, respect the infrastructure, prefer official APIs and bulk downloads, handle PII carefully, don’t take what isn’t offered.

I agree with almost all of it. One line in there stuck with me — Trotta’s point that “a scraper that would barely register as noise on Amazon’s servers could genuinely degrade performance for a public data portal.” That’s exactly the thing the robots.txt-vs-ToS debate keeps skipping over. So I want to add the part you only learn after the theory survives contact with production.

Here’s my claim, and it’s a slightly annoying one: on a real schedule, over months, the thing people call “ethics” and the thing people call “not getting banned” stop being two questions. They become one question. And that one question is almost never answered by robots.txt. It’s answered by how many requests you send, how often, and whether you bother to ask the server “did this even change since last time?”

I’ve now run 2,190 production scrapes across 32 published scrapers on my Apify account. The busiest single one — a Trustpilot review scraper — has 962 runs on its own. Where those numbers come from, since you should always ask: my own Apify dashboard (public profile: apify.com/knotless_cadence), as of May 2026. The 2,190 is the total run count summed across my 32 published actors; the 962 is the Trustpilot review scraper alone, read straight off its run history. No sampling, no extrapolation — it’s the raw lifetime counter the platform keeps. That’s not a benchmark I set up to prove a point. It’s just what the dashboard says, and it’s the cleanest evidence I have for an argument that’s otherwise easy to wave away as “be nice to servers.” So let me be specific about what “be nice” actually looks like in code.

The theory is fine. The theory is also where people stop.

Quick recap of the standard ethical-scraping checklist, because it’s correct and I’m not here to dunk on it:

  • Check robots.txt. If a path is Disallow-ed, don’t crawl it.
  • Read the Terms of Service. If scraping is forbidden, that’s a legal and reputational risk you’re choosing.
  • Prefer open datasets, public APIs, or data the owner has explicitly published for reuse.
  • Identify yourself with a real User-Agent and a contact.

All true. All worth doing. Here’s the gap: none of those four things tells you what to do on run number 200, or 600, or 900. They’re decisions you make once, at the start, before a single request goes out. They’re the ethics of whether. They say nothing about the ethics of how often — which is the part that actually lands you on a blocklist, and the part that actually loads someone else’s server.

robots.txt is a sign on the door. It does not tell you how hard to knock.

And here’s the uncomfortable bit. You can be perfectly “ethical” by the checklist — public page, no Disallow, no API to use instead — and still behave like a small DDoS because you re-fetch the same 40,000 review pages every night, in full, with twenty concurrent workers, on a site that updates maybe 2% of those pages per day. Robots.txt is green. The server admin still hates you. And eventually a WAF rule or a Cloudflare challenge decides the question for you.
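To put rough numbers on that scenario, here is a back-of-the-envelope sketch. The 40,000 pages and the 2% change rate come from the paragraph above; the 150 KB average page size is a made-up figure used only for illustration.

```python
# Back-of-the-envelope cost of re-fetching everything every night.
PAGES = 40_000        # from the scenario above
CHANGE_RATE = 0.02    # "maybe 2% of those pages per day"
AVG_PAGE_KB = 150     # ASSUMPTION: hypothetical average page size

changed = round(PAGES * CHANGE_RATE)           # pages that carry new data
unchanged = PAGES - changed                    # pages downloaded for nothing
wasted_mb = unchanged * AVG_PAGE_KB / 1000     # transfer spent on those pages

print(f"pages with new data:       {changed}")
print(f"pages fetched for nothing: {unchanged}")
print(f"wasted transfer per night: ~{wasted_mb:,.0f} MB")
```

That is 800 pages with something new against 39,200 that were already sitting in your store, and under the assumed page size the source ships several gigabytes a night that nobody needed.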

So the contrarian version of the headline: stop arguing about robots.txt as if it’s the hard part. The hard part is the request you didn’t have to send.

What 962 runs of the same site actually taught me

The Trustpilot scraper is my best teacher here precisely because it ran so many times against the same targets. When you scrape something once, you learn nothing about politeness — you just took the data and left. When you scrape it 962 times, the cost of being sloppy compounds, and it compounds on someone else’s infrastructure as much as on yours.

A few things I’m confident about from that volume, and a couple I’m still not sure about. (I’ll flag which is which, because anyone who tells you they’re 100% sure about scraping at scale hasn’t done it long enough.)

Confident: the single biggest source of waste in a recurring scrape is re-downloading content that didn’t change. Review pages, company profiles, listing pages — the overwhelming majority of them are identical between two consecutive runs. Pulling them again in full is pure load on the source with zero new data for you. I don’t have a clean per-source uptime table to wave around — I’m not going to invent one — but the qualitative pattern across hundreds of runs was blunt: the runs that re-fetched everything were the ones that got throttled, challenged, or quietly slowed. The runs that asked “changed since last time?” mostly sailed.

Confession, because I didn’t always know this. The first version of one of my recurring scrapers had no conditional-GET layer at all. I skipped it on purpose: I remember thinking “it’s just a couple thousand pages, who cares, I’ll add caching later if it matters.” It mattered. Somewhere around run 200 (and I want to be honest that ~200 is my rough memory of when it started, not a number I logged precisely) that scraper started catching throttling it hadn’t caught before. Slower responses, the occasional challenge. My first instinct was to blame the site. Then I added the ETag / If-None-Match layer so I’d stop re-pulling pages that hadn’t changed. The per-run volume of full-page downloads dropped hard (conditional GET doesn’t cut the number of requests, it shrinks most of the responses to bodyless 304s), and the throttling quietly stopped. I’d been the problem. “I’ll add caching later” cost me a week of pretending it was someone else’s bug.

Confident: the sources that “stay up” for you — the ones where your scraper keeps working month after month without a fight — are the ones where you were boring. Low concurrency. A real delay between hits. A User-Agent that says who you are. The flashy “I parallelized 50 workers and got 10x throughput” runs are also the runs that showed up in someone’s rate-limit dashboard as an anomaly. Anomalies get rules written about them.

Not sure: I can’t cleanly separate “this source banned me because I was rude” from “this source tightened its bot defenses for everyone that week.” I want to be careful here, because this is exactly the kind of claim people inflate into a fake statistic. A fair number of my “the source changed” incidents probably had nothing to do with my behavior at all, and I don’t have the data to dress that up as an industry trend with a percentage attached. So I won’t claim my politeness prevented every block. I’ll only claim the impolite runs reliably made things worse, faster.

The throughline: the polite version and the un-bannable version are the same version. That’s the whole argument. Ethics here isn’t a tax you pay on top of a working scraper. It’s the thing that keeps the scraper working.

The one mechanism that does most of the work: conditional GET

If you take one thing from this post, take this. Most HTTP servers will tell you whether a resource changed, for free and before they send you the body, if you ask correctly. The mechanism is conditional GET. It’s not a trick or a hack; it’s part of the HTTP standard, written down in RFC 9110 §13 (Conditional Requests, inside HTTP Semantics), which replaced the older, more focused RFC 7232 (HTTP/1.1: Conditional Requests). If-Modified-Since goes back to HTTP/1.0 and ETags to HTTP/1.1, so none of this is new. Almost nobody scraping uses it.

It works through two pairs of headers (in spec terms, a validator field on the response and a matching precondition field on the request):

  • The server sends ETag (an opaque version tag) and/or Last-Modified (a timestamp) on the response.
  • On your next request you send those values back as If-None-Match (for the ETag) and/or If-Modified-Since (for the timestamp).
  • If nothing changed, the server replies 304 Not Modified with an empty body. No payload. You skip parsing entirely. The source barely did any work.

A 304 is the most ethical response you can possibly get: you confirmed there’s no new data without making the server render, serialize, and ship a page you already have. Everybody wins. The source saves bandwidth and CPU; you save parse time and you don’t add a row of duplicate garbage to your pipeline.

Here’s a fetcher that does it. It’s plain httpx, it persists the cache to disk so it survives across runs, and it throttles itself so you don’t hammer one host. I ran exactly this pattern against a public echo server while writing this — the run output is below the code so you can check I’m not making it up.

import time
import json
import os
import hashlib
import httpx


class PoliteFetcher:
    """Conditional-GET fetcher.

    Stores each URL's ETag / Last-Modified, sends them back as
    If-None-Match / If-Modified-Since on the next fetch, and sleeps
    `min_interval` seconds between hits to keep load on the source low.

    A 304 response means: nothing changed, no body sent, skip parsing.
    """

    def __init__(self, cache_path="cache.json", min_interval=1.0,
                 user_agent="polite-scraper/1.0 (+you@example.com)"):
        self.cache_path = cache_path
        self.min_interval = min_interval
        self.user_agent = user_agent
        self._last_hit = 0.0
        self.cache = {}
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                self.cache = json.load(f)

    def _throttle(self):
        wait = self.min_interval - (time.monotonic() - self._last_hit)
        if wait > 0:
            time.sleep(wait)
        self._last_hit = time.monotonic()

    def get(self, url):
        meta = self.cache.get(url, {})
        headers = {"User-Agent": self.user_agent}
        if meta.get("etag"):
            headers["If-None-Match"] = meta["etag"]
        if meta.get("last_modified"):
            headers["If-Modified-Since"] = meta["last_modified"]

        self._throttle()
        r = httpx.get(url, headers=headers, timeout=20)

        if r.status_code == 304:
            # No new data. The server did almost no work. Reuse what we have.
            return {"status": 304, "changed": False,
                    "body_hash": meta.get("body_hash")}

        if r.status_code == 200:
            body_hash = hashlib.sha256(r.content).hexdigest()
            self.cache[url] = {
                "etag": r.headers.get("etag"),
                "last_modified": r.headers.get("last-modified"),
                "body_hash": body_hash,
            }
            with open(self.cache_path, "w") as f:
                json.dump(self.cache, f)
            return {"status": 200, "changed": True,
                    "body_hash": body_hash, "content": r.content}

        # Anything else (4xx / 5xx, or a 3xx, since httpx.get does not follow
        # redirects by default): let the caller decide on retry/backoff.
        return {"status": r.status_code, "changed": None, "body_hash": None}

Test it yourself in under five minutes. httpbingo.org has an /etag/{tag} endpoint that hands back an ETag and honors If-None-Match:

f = PoliteFetcher(min_interval=0.5)
url = "https://httpbingo.org/etag/demo123"

print(f.get(url)["status"])   # 200  -> first time, full download
print(f.get(url)["status"])   # 304  -> server says "you already have it"
print(f.get(url)["status"])   # 304  -> still nothing new

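When I ran it just now from an empty cache (delete cache.json before re-running, or the first call will already come back 304), printing each full result minus the content bytes: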

run 1: {'status': 200, 'changed': True,  'body_hash': '<your-hash>'}
run 2: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}
run 3: {'status': 304, 'changed': False, 'body_hash': '<your-hash>'}

Your body_hash will differ — httpbingo echoes your request headers (User-Agent, timestamps) into the body, so the exact hex is yours, not mine. What’s reproducible is the status sequence 200 → 304 → 304, not the hash.

requests works identically — swap httpx.get for requests.get, same header names, same 304. There’s nothing exotic here. That’s the point. The “ethical scraping” upgrade most pipelines are missing is about fifteen lines of caching.
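One gap worth noting: not every server emits ETag or Last-Modified, and in that case every fetch comes back 200. The body_hash the fetcher above already records can still tell you whether anything changed. A small sketch, assuming the result-dict shape returned by PoliteFetcher.get; needs_parsing is a hypothetical helper, and the caller has to read the previous hash out of the cache before calling get, because get overwrites it on a 200.

```python
def needs_parsing(previous_hash, result):
    """Decide whether a PoliteFetcher-style result carries new data.

    previous_hash: the body_hash stored for this URL before the fetch
                   (None if the URL has never been fetched).
    result:        dict with at least "status" and "body_hash" keys.
    """
    if result["status"] == 304:
        return False                 # server confirmed nothing changed
    if result["status"] != 200:
        return False                 # redirect or error: nothing to parse
    # A 200 with identical bytes is still "unchanged" for the pipeline.
    return result["body_hash"] != previous_hash
```

This only saves you parse time and duplicate rows; the source still paid for the full response. Pages that embed timestamps or per-request tokens will also hash differently every time, so treat it as a fallback, not a replacement for validators.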

The rate limit is the other half — and it’s a courtesy, not a config value

The _throttle() method above is deliberately dumb: one fixed delay between consecutive hits, whichever host they go to. People want a clever adaptive algorithm. You usually don’t need one. You need a delay that’s large enough that a human looking at the server’s access log would not flinch.

What “large enough” means depends on the site, and I’m not going to hand you a magic number because there isn’t one. But a couple of rules I actually follow:

  • One host at a time, or close to it. Concurrency across different domains is fine. Twenty parallel workers hammering one domain is the behavior that gets a rule written about you. My calmest, longest-surviving runs were low-concurrency-per-host. Boring wins.
  • Back off when the server asks. A 429 Too Many Requests or a Retry-After header is the source literally telling you the polite interval. Honor it. Ignoring Retry-After and retrying immediately isn’t “robust,” it’s the thing that escalates a soft throttle into a hard ban.
  • Slow down at night for the source, not for you. If you’re scraping on a schedule, spreading the run out costs you nothing (it’s a cron job, it’s not waiting on you) and it flattens the load spike on their side. A request budget spread over an hour is gentler than the same budget fired in ninety seconds.
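A minimal sketch of the first two rules, using only the standard library. HostThrottle, parse_retry_after, and the injectable clock/sleep parameters are names invented here for illustration; they are not part of the fetcher above or of any library.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit


def parse_retry_after(value, now=None):
    """Return seconds to wait for a Retry-After value, or None if unusable.

    The header carries either delay-seconds ("120") or an HTTP-date.
    """
    if not value:
        return None
    value = value.strip()
    if value.isdecimal():
        return float(int(value))
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None                       # not a date we can read
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


class HostThrottle:
    """Keeps at least `min_interval` seconds between hits to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_ok = {}                # host -> earliest time for next hit

    def wait(self, url):
        """Block until this URL's host may be hit again, then reserve a slot."""
        host = urlsplit(url).netloc
        delay = self._next_ok.get(host, 0.0) - self._clock()
        if delay > 0:
            self._sleep(delay)
        self._next_ok[host] = self._clock() + self.min_interval

    def penalize(self, url, seconds):
        """Push the host's next allowed hit out, e.g. after a 429."""
        host = urlsplit(url).netloc
        self._next_ok[host] = max(self._next_ok.get(host, 0.0),
                                  self._clock() + seconds)
```

In a fetch loop you’d call wait(url) before each request and, on a 429, call penalize(url, ...) with the parsed Retry-After value (or a fallback of your choosing when the header is missing or unreadable). For the third rule the interval is just the window divided by the budget: 1,200 requests spread over an hour means min_interval = 3600 / 1200 = 3 seconds.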

Notice what none of these are: they’re not robots.txt rules. robots.txt can say Crawl-delay, sure, and you should honor it when present. But it’s a non-standard extension (RFC 9309, the robots.txt spec, doesn’t define it), most sites don’t set one, and the absence of a Crawl-delay is not permission to go as fast as your bandwidth allows. The ethical rate limit lives in your code, not in their file.
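If you want to honor a Crawl-delay when one exists without treating its absence as a green light, something like the following works as a sketch. It uses the standard library’s urllib.robotparser; polite_delay and the 2-second DEFAULT_DELAY are hypothetical names and values chosen for the example, not recommendations.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 2.0   # ASSUMPTION: placeholder floor, pick your own per site


def polite_delay(robots_txt, user_agent, default=DEFAULT_DELAY):
    """Return the delay to use between hits, in seconds.

    robots_txt is the already-downloaded file content as a string.
    The site's Crawl-delay only ever raises our own floor, never lowers it.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    declared = rp.crawl_delay(user_agent)   # None if no Crawl-delay applies
    if declared is None:
        return default
    return max(float(declared), default)
```

The same parser object also answers the Disallow question through can_fetch(user_agent, url), so one download of robots.txt per run covers both checks.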

So, “which sources stay up”?

I led with that promise, so let me be honest about what I can and can’t tell you.

I can’t give you a ranked uptime table of named sites. I don’t have clean enough per-source numbers to publish one without making things up, and making things up is the fastest way to lose the only thing that makes a scraping post worth reading. What I can tell you, from 2,190 runs, is the shape of it:

The sources that kept working for me, run after run, were not the “easy” sites or the “hard” sites in any technical sense. They were the ones where my own scraper behaved like a considerate guest — conditional GET so I wasn’t re-downloading static content, a delay that kept me off the anomaly radar, a User-Agent that didn’t pretend to be a browser when it didn’t need to. The sources where I lost access were, more often than not, the ones where I’d gotten greedy with concurrency or skipped the conditional-GET layer because “it’s just a few thousand pages.”

That correlation isn’t a controlled experiment, and I won’t pretend it is. Some of those lost-access incidents were probably nothing to do with me — a site changing its defenses on its own schedule — and I can’t separate those out cleanly. But across this many runs, the direction is not subtle: politeness and persistence track together. The scraper that’s kind to the source is the scraper that’s still running next quarter.

Trotta’s checklist gets you in the door ethically — license, infrastructure, APIs-first, PII. Conditional GET and a real rate limit are what keep you a welcome guest once you’re inside. Both halves matter. Most posts only ship the first one.

What to actually change on Monday

If your recurring scraper doesn’t already do this, here’s the short list, in priority order:

  1. Add conditional GET. Store ETag/Last-Modified, send If-None-Match/If-Modified-Since, treat 304 as “skip, reuse cached.” Fifteen lines. Biggest single win, for you and for the source.
  2. Cap per-host concurrency and add a real delay. Boring beats fast. Anomalies get blocked.
  3. Honor 429 and Retry-After. The server is telling you the polite interval. Don’t argue with it.
  4. Send a real User-Agent with a contact. If an admin wants to reach you instead of banning you, let them.

That’s it. None of it is clever. All of it is the difference between a scraper you babysit through monthly bans and one that just keeps quietly working — which, after 962 runs against the same site, is the only kind I want to maintain.


I’ve run 2,190 production scrapes across 32 scrapers (the Trustpilot one alone has 962 — profile: apify.com/knotless_cadence). If you need a recurring scraper that stays up instead of getting throttled on run 200 — conditional GET, sane rate limits, the whole considerate-guest setup — I build those. Tell me the source and the schedule: spinov001@gmail.com.

Drafted with AI assistance, edited and fact-checked by me. Run counts are from my own Apify dashboard, not estimates.


More production scraping tips: t.me/scraping_ai