Conditional GET in production scrapers: what I learned wiring it into 3 actors


After 2,190 lifetime runs across 32 public Apify actors I started looking at where bandwidth was being burned for no reason. The Trustpilot scraper alone has processed ~150,000 page fetches over 968 runs. A re-scrape of the same company page returns identical HTML ~70% of the time, because the company hasn’t received new reviews since the previous fetch. We were paying for the bytes anyway.

The fix is older than most of us writing scrapers in 2026: conditional GET. HTTP has supported it since the late 1990s through the If-Modified-Since and If-None-Match request headers; RFC 9110 is the current specification. Done right, the server returns 304 Not Modified with an empty body, and we move on. Done wrong, you cache stale data and miss updates.

This post is what I learned wiring conditional GET into 3 production actors. Real numbers, code that ships, and the failure modes that bit me.

Why bother

For a single page request, conditional GET saves roughly:

  • ~99% of response bytes (a 304 has no body, just headers)
  • 60-90% of parser time (no HTML to parse on 304)
  • Some upstream cost if the origin’s CDN serves the 304 itself

The math gets interesting at volume. Trustpilot scraper’s average company page is ~85 KB gzipped. At 150k page fetches we transferred ~12 GB. If 70% of those were truly unchanged repeats, we wasted ~8.4 GB on re-fetching identical bodies. That’s ~$0.40-0.80 in egress alone for a single scraper, not counting parsing CPU on Apify.

Multiply across 32 actors and the savings stop being a rounding error.
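
The arithmetic behind those figures can be checked in a few lines. This is a rough sketch using only the numbers quoted above and decimal units (1 GB = 1,000,000 KB); the constant and function names are made up for illustration. The exact results come out slightly higher than the rounded ~12 GB and ~8.4 GB in the text.

```python
# Back-of-the-envelope check of the bandwidth figures quoted in the post.
PAGE_FETCHES = 150_000     # lifetime page fetches for the Trustpilot scraper
AVG_PAGE_KB = 85           # average gzipped company page size
UNCHANGED_SHARE = 0.70     # share of re-fetches that return identical HTML


def gigabytes(kilobytes: float) -> float:
    """Convert KB to GB using decimal units (1 GB = 1,000,000 KB)."""
    return kilobytes / 1_000_000


total_gb = gigabytes(PAGE_FETCHES * AVG_PAGE_KB)
wasted_gb = total_gb * UNCHANGED_SHARE

print(f"total transferred:    {total_gb:.2f} GB")
print(f"re-fetched unchanged: {wasted_gb:.2f} GB")
```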

The minimum viable implementation

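import asyncio

import requests

async def fetch_with_validator(client, url, validators):
    """Fetch url, revalidating against previously stored validators.

    client: a requests.Session
    validators: {'etag': str|None, 'last_modified': str|None}
    """
    headers = {}
    if validators.get('etag'):
        headers['If-None-Match'] = validators['etag']
    if validators.get('last_modified'):
        headers['If-Modified-Since'] = validators['last_modified']

    # requests is blocking, so run it in a worker thread to keep the event loop free
    resp = await asyncio.to_thread(client.get, url, headers=headers, timeout=30)

    if resp.status_code == 304:
        # Unchanged: keep the validators we already have
        return {'status': 'unchanged', 'body': None, 'validators': validators}

    if resp.status_code == 200:
        new_validators = {
            'etag': resp.headers.get('ETag'),
            'last_modified': resp.headers.get('Last-Modified'),
        }
        return {'status': 'changed', 'body': resp.text, 'validators': new_validators}

    # Raises on 4xx/5xx; any other status is unexpected here, so fail loudly
    resp.raise_for_status()
    raise RuntimeError(f'unexpected status {resp.status_code} for {url}')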

Validators get stored alongside scraped data. On the next run we look them up by URL and pass them in. The KV store on Apify works for this — one record per URL, ~80 bytes per entry.
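
A minimal sketch of that lookup, assuming validators are serialized as JSON under a key derived from the URL. The key scheme, the class, and its method names are hypothetical, and an in-memory dict stands in for the real key-value store so the example runs anywhere. Hashing the URL is a design choice to keep keys short and free of characters a store might reject.

```python
import hashlib
import json


def validator_key(url: str) -> str:
    """Derive a short, storage-safe key from a URL (hypothetical scheme)."""
    return "validators-" + hashlib.sha256(url.encode("utf-8")).hexdigest()[:32]


class InMemoryValidatorStore:
    """Stand-in for a key-value store; swap in the real store in production."""

    def __init__(self) -> None:
        self._records: dict[str, str] = {}

    def save(self, url: str, validators: dict) -> None:
        # Serialize to JSON so the record resembles what a KV store would hold.
        self._records[validator_key(url)] = json.dumps(validators)

    def load(self, url: str) -> dict:
        raw = self._records.get(validator_key(url))
        if raw is None:
            # First visit: no validators, so the fetch will be unconditional.
            return {"etag": None, "last_modified": None}
        return json.loads(raw)


store = InMemoryValidatorStore()
store.save("https://example.com/review/acme", {"etag": 'W/"abc123"', "last_modified": None})
print(store.load("https://example.com/review/acme"))
```

The dict returned by load has the same shape that fetch_with_validator expects, so it can be passed straight in on the next run.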

The first failure: weak vs strong ETags

This is where I lost 4 hours the first time. Trustpilot returns ETags like W/"abc123def456". The W/ prefix means weak — the validator only guarantees “semantically equivalent,” not byte-identical.

RFC 9110 specifies weak comparison for If-None-Match, so on a compliant server W/"abc123" and "abc123" count as the same validator and a weak ETag works fine for conditional GET. Real servers are less forgiving: some compare the header as an opaque string, so an ETag with the W/ prefix removed matches nothing and you get a full 200. Some CDNs also strip the W/ prefix on the way back to you, and then your If-None-Match sends a value that doesn’t match anything.

The fix:

def normalize_etag(raw_etag):
    """Return the ETag exactly as received, or None if the header was missing."""
    if not raw_etag:
        return None
    # Origins send either W/"..." (weak) or "..." (strong).
    # Either way, send it back byte-for-byte: no stripping, no case changes.
    return raw_etag

The mistake I made: stripping W/ to “clean up” the value. That broke conditional GET on Trustpilot for a week. Logs showed 200 responses with full bodies every time — defeats the entire purpose.

Rule: store the ETag exactly as the server sent it. Don’t trim, don’t transform, don’t lowercase.
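
To make the rule concrete, here is a small sketch of the two comparison functions RFC 9110 defines for entity tags, next to the byte-exact comparison a less careful server might do. The function names are mine, not from any library. Under weak comparison a stripped ETag would still match; under a plain string comparison it would not, which is consistent with the week of 200s described above.

```python
def _opaque_tag(etag: str) -> str:
    """Return the quoted opaque part of an ETag, dropping a leading W/."""
    return etag[2:] if etag.startswith("W/") else etag


def strong_match(a: str, b: str) -> bool:
    """Strong comparison: both tags must be strong and identical."""
    return not a.startswith("W/") and not b.startswith("W/") and a == b


def weak_match(a: str, b: str) -> bool:
    """Weak comparison: opaque parts must match; the W/ flag is ignored."""
    return _opaque_tag(a) == _opaque_tag(b)


def byte_exact_match(a: str, b: str) -> bool:
    """What a server does if it treats the header as an opaque string."""
    return a == b


stored = 'W/"abc123def456"'       # as the server sent it
stripped = '"abc123def456"'       # after "cleaning up" the W/ prefix

print("weak comparison:      ", weak_match(stored, stripped))
print("byte-exact comparison:", byte_exact_match(stored, stripped))
```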

The second failure: clock skew on Last-Modified

Two of our actors hit sites that don’t return ETags at all. We fell back to Last-Modified + If-Modified-Since. Initially it didn’t save anything — every fetch returned 200.

The cause: we parsed the site’s Last-Modified value and sent it back in If-Modified-Since in a slightly different format. The HTTP date spec (RFC 9110 §5.6.7, previously RFC 7231 §7.1.1.1) requires exactly this form, always in GMT:

Sun, 06 Nov 1994 08:49:37 GMT

Python’s requests library doesn’t enforce this when you build headers manually. If you store Last-Modified as a datetime object and call .isoformat(), you’ll send 2026-05-17T04:38:00+00:00. The server can’t parse that as an HTTP date, so it ignores the header and returns 200 every time.

The fix:

from email.utils import parsedate_to_datetime

def store_last_modified(header_value):
    # Store as datetime for sanity, but ALSO keep the raw string
    return {
        'raw': header_value,  # this is what we send back
        'parsed': parsedate_to_datetime(header_value),  # for logging only
    }

def build_conditional_headers(stored):
    if stored.get('raw'):
        return {'If-Modified-Since': stored['raw']}
    return {}

Rule: don’t reformat HTTP dates. Echo back the exact byte sequence the server sent.
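
If you only have a datetime and no raw header (for example, a timestamp taken from your own database), the standard library can produce the required format. A minimal sketch, assuming the datetime is timezone-aware; to_http_date is a hypothetical helper name. Echoing the raw string remains the safer default when you have it.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def to_http_date(dt: datetime) -> str:
    """Format an aware datetime as an HTTP date (always expressed in GMT)."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: the timezone is ambiguous")
    # usegmt=True requires a UTC datetime and emits the literal 'GMT' suffix.
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


stamp = datetime(1994, 11, 6, 8, 49, 37, tzinfo=timezone.utc)
http_date = to_http_date(stamp)
iso_date = stamp.isoformat()

print(http_date)   # valid in If-Modified-Since
print(iso_date)    # not a valid HTTP date
```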

The third failure: per-URL validators across IP rotation

When we rotate proxies (common for ~30% of our actors), some sites tie ETags to the requesting IP. The previous run got ETag "abc" via proxy A. This run hits via proxy B and the server says “I’ve never seen you, here’s a fresh 200 with ETag "xyz".”

Solutions in order of how often we use them:

  1. Disable conditional GET on this URL if the savings don’t justify the complexity. Often the right call for sites with aggressive per-IP caching.
  2. Stick to a session-IP pair when possible. Some proxy providers offer sticky sessions of 5-30 minutes.
  3. Track validators per-(URL, proxy-cluster) instead of just per-URL. More state, but works.

For Trustpilot specifically: we use residential proxies with sticky sessions of 10 minutes, and the validators behave consistently within a session. Across sessions we get a ~40% hit rate on 304s instead of the theoretical 70%. Still a win.
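
Option 3 can be sketched as a cache keyed by a (URL, proxy group) pair. The class name and the proxy_group label are hypothetical; how you group proxies (per provider, per region, per sticky session) depends on your setup.

```python
from typing import Optional


class ScopedValidatorCache:
    """Validators keyed by (url, proxy_group) instead of by url alone."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], dict] = {}

    def get(self, url: str, proxy_group: str) -> Optional[dict]:
        # A miss means this proxy group has not seen the URL yet,
        # so the caller should do an unconditional fetch.
        return self._entries.get((url, proxy_group))

    def put(self, url: str, proxy_group: str, validators: dict) -> None:
        self._entries[(url, proxy_group)] = dict(validators)


cache = ScopedValidatorCache()
cache.put("https://example.com/page", "residential-a", {"etag": '"abc"'})

print(cache.get("https://example.com/page", "residential-a"))
print(cache.get("https://example.com/page", "residential-b"))
```

The cost is one record per URL per group, so the state grows with the number of groups you track.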

Where it doesn’t help

Conditional GET is useless when:

  • The page is heavily JS-rendered and you’re using Playwright. The “304” semantics live in the HTTP layer; the rendered DOM is downstream.
  • The site sets Cache-Control: no-store aggressively (some auth-walled pages do). The server won’t return a 304.
  • You’re already doing differential scraping (only fetch URLs from a sitemap delta). The wins compound but the marginal benefit shrinks.
  • You change User-Agent between runs. Some CDNs vary on UA and won’t 304.

For ~12 of our 32 actors, conditional GET genuinely doesn’t help. For the other 20 it saves 30-70% of bandwidth. We turn it on per-actor with a config flag rather than globally.
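
A sketch of how the per-actor flag and two of the checks above could be combined into one decision after a 200 response. The function name and the enabled parameter are assumptions for illustration, and treating no-store as a disqualifier follows the heuristic in this post rather than anything the HTTP spec requires.

```python
def can_revalidate(response_headers: dict, enabled: bool = True) -> bool:
    """Decide whether it is worth storing validators for a URL."""
    if not enabled:
        # Per-actor config flag: conditional GET is switched off entirely.
        return False
    headers = {name.lower(): value for name, value in response_headers.items()}
    has_validator = bool(headers.get("etag") or headers.get("last-modified"))
    directives = {
        part.strip().lower()
        for part in headers.get("cache-control", "").split(",")
    }
    return has_validator and "no-store" not in directives


print(can_revalidate({"ETag": 'W/"abc"', "Cache-Control": "max-age=60"}))
print(can_revalidate({"ETag": '"abc"', "Cache-Control": "private, no-store"}))
```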

The audit step nobody mentions

After wiring this in, I added a tiny audit log that every conditional fetch writes:

import json

from apify import Actor

async def log_conditional_outcome(actor, url, status, validators_sent, validators_received):
    # One JSON object per line keeps the log easy to grep and parse later
    Actor.log.info(json.dumps({
        'event': 'conditional_get',
        'url': url,
        'status': status,
        'sent_etag': validators_sent.get('etag'),
        'sent_last_mod': validators_sent.get('last_modified'),
        'received_etag': validators_received.get('etag'),
        'received_last_mod': validators_received.get('last_modified'),
    }))

Once a week I grep through Apify run logs (apify-cli run-logs) and count 304 vs 200. The ratio is the real KPI. If 304-rate suddenly drops from 60% to 5%, something changed upstream — they switched CDN, rotated their caching layer, or started fingerprinting our requests. That signal has caught 3 silent regressions for me already.
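
The weekly count can be scripted. This sketch assumes each audit event is written as one JSON object per line and that the status field holds the HTTP status code as an integer; both are assumptions about the log format, as are the function names and the 0.25 drop threshold.

```python
import json


def revalidation_rate(log_lines):
    """Return the share of conditional fetches answered with 304, or None."""
    hits = 0
    total = 0
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON log lines
        if not isinstance(event, dict) or event.get("event") != "conditional_get":
            continue
        status = event.get("status")
        if status in (200, 304):
            total += 1
            if status == 304:
                hits += 1
    return hits / total if total else None


def rate_dropped(previous, current, max_drop=0.25):
    """Flag a sudden fall in the 304 rate between two periods."""
    if previous is None or current is None:
        return False
    return previous - current > max_drop


sample = [
    json.dumps({"event": "conditional_get", "url": "https://example.com/a", "status": 304}),
    json.dumps({"event": "conditional_get", "url": "https://example.com/b", "status": 200}),
    "plain text line from another logger",
]
print(revalidation_rate(sample))
```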

Numbers from 3 actors after 30 days

Actor                        Pages/run (avg)   304-rate before   304-rate after   Bandwidth saved
trustpilot-review-scraper    155               0%                38%              ~32 MB/run
exchange-rate-scraper        14                0%                71%              ~1.2 MB/run
npm-package-scraper          22                0%                56%              ~3.4 MB/run

The exchange-rate scraper’s number is the highest because most currency pairs don’t change between hourly polls. The 71% number matches my expectation; the cost savings aren’t huge in absolute terms (small pages) but the parser time saved is real.

When to add it

Skip conditional GET on your first version. Get the scraper to ship, get real users, gather a week of run logs. Once you know which URLs get re-fetched (grep "url=" run.log | sort | uniq -c | sort -rn | head -20), you’ll see whether the long tail is dominated by repeat reads or one-shot fetches. If repeat reads > 30% of total fetches, conditional GET pays back in a few hours of dev work. Otherwise skip it and invest the time in parsing reliability instead.
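
The 30% check can also be computed directly instead of eyeballed from the uniq -c output. A minimal sketch, assuming you have already extracted the list of fetched URLs from the logs; the function name is hypothetical. A fetch counts as a repeat read if the same URL was fetched at least once before it.

```python
from collections import Counter


def repeat_read_share(urls) -> float:
    """Share of fetches that hit a URL already fetched earlier."""
    counts = Counter(urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every fetch beyond the first one per URL is a repeat read.
    repeats = total - len(counts)
    return repeats / total


fetched = [
    "https://example.com/a",
    "https://example.com/a",
    "https://example.com/a",
    "https://example.com/b",
]
share = repeat_read_share(fetched)
print(f"repeat reads: {share:.0%}, worth adding conditional GET: {share > 0.30}")
```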

The pattern is older than half the people writing scrapers today. But the failure modes — weak ETags, date formatting, IP-tied validators — are exactly the kind of thing nobody documents until they’ve burned 4 hours on it.


Production scraping & data engineering: blog.spinov.online
32 Apify actors, 2,190 lifetime runs: apify.com/knotless_cadence
Recent paid sponsored work: dev.to/0012303/how-to-scale-web-scraping-to-100k-pages-without-getting-blocked-1ll8
More production scraping tips: t.me/scraping_ai