Why your retry logic is broken (and the 30-line fix)


Almost every Python service I have seen in the last year retries network calls wrong. Not “could be better” wrong — wrong in a way that quietly multiplies failures during exactly the moments you do not want them multiplied. Backups, deploys, partial outages, that one Tuesday afternoon when a third-party API is sluggish. The retry logic kicks in, makes things worse, and the postmortem ends up blaming the third party.

I run a few production scrapers — the largest one has 950 runs on Apify and writes to a downstream Postgres + a webhook fan-out. Two months ago I had a four-minute incident where a single upstream blip turned into 47 duplicated webhook posts and a Postgres lock that lasted long enough to time out two unrelated workers. The cause was not the upstream. It was my retry logic.

This post is the 30-line fix. It is not a library. It is a pattern you can paste into any service today.

The three things every broken retry has in common

Before the code, the diagnosis. Almost every retry bug I have debugged in the last 18 months has at least two of these three properties:

  1. Fixed delay. The code waits 1 second between attempts. When the upstream is overloaded, every client retries in lockstep, the upstream gets a wave of identical traffic every second, and recovery takes far longer than it should.
  2. Retry on the wrong errors. The code retries any exception. So a 400 Bad Request — which is a permanent client error — gets retried 5 times, burning quota, before failing the same way.
  3. No upper bound on total time. The code says “retry up to 5 times with 30 seconds between” but does not say “give up after 60 seconds total.” So a slow path stacks 5 × 30s = 150 seconds of latency under load, while the service that called you has already given up at 30s.
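The arithmetic in item 3 is easy to check for any policy. Here is a minimal sketch, assuming a fixed delay and a loop that sleeps after every failed attempt; worst_case_seconds is a hypothetical helper for illustration, not part of any library:

```python
def worst_case_seconds(attempts: int, delay: float, call_timeout: float = 0.0) -> float:
    """Upper bound on wall time for a fixed-delay retry loop with no deadline.

    Assumes the loop sleeps after every failed attempt, including the last,
    which is what a naive "try, except, sleep" loop does.
    """
    return attempts * (call_timeout + delay)


# The example from item 3: 5 attempts, 30 s between them.
print(worst_case_seconds(5, 30.0))        # 150.0
# Add a 10 s per-call timeout and the bound grows further.
print(worst_case_seconds(5, 30.0, 10.0))  # 200.0
```

If the number this prints is larger than your caller's timeout, the later attempts are wasted work.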

The combination of these three is what produced my 4-minute incident. The retry loop was holding a Postgres connection while exponential-but-uncapped backoff stretched the call to 90 seconds. Two such retries simultaneously held connections long enough to exhaust the pool. Once the pool was exhausted, unrelated background jobs started failing with timeouts, which got retried, which made everything worse.

The 30-line pattern

Here is the version I now use everywhere. It is deliberately small — about 30 lines including imports — and it solves all three failure modes above:

import random
import time
from typing import Callable, TypeVar

import httpx

T = TypeVar("T")

RETRYABLE_STATUS = {408, 425, 429, 500, 502, 503, 504}


def retry(
    fn: Callable[[], T],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    deadline: float = 30.0,
) -> T:
    started = time.monotonic()
    last_exc: Exception | None = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in RETRYABLE_STATUS:
                raise
            last_exc = e
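        except (httpx.ConnectError, httpx.ConnectTimeout, httpx.ReadTimeout) as e:
            last_exc = e  # transient transport failure
        if attempt == max_attempts - 1:
            break  # no attempts left, so do not sleep before raising
        # Full jitter: uniform between 0 and the capped exponential delay.
        sleep = random.uniform(0, min(max_delay, base_delay * (2**attempt)))
        if time.monotonic() - started + sleep >= deadline:
            break  # the next attempt would start past the deadline
        time.sleep(sleep)
    raise last_exc  # type: ignore[misc]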

Three things to notice.

The RETRYABLE_STATUS set is small and explicit. 4xx codes are almost never retryable; the only exceptions are 408 (Request Timeout), 425 (Too Early), and 429 (Too Many Requests). 5xx codes are usually retryable but not always; 501 (Not Implemented) and 505 (HTTP Version Not Supported) are not. Having the explicit set means you do not retry a 401 four times before noticing your token expired.
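To see what the explicit set buys you, compare it with the common shortcut of retrying everything at or above 500. This is a small sketch; is_retryable and naive_rule are hypothetical names used only for the comparison:

```python
RETRYABLE_STATUS = {408, 425, 429, 500, 502, 503, 504}


def is_retryable(status_code: int) -> bool:
    """True only for statuses in the explicit allow-list."""
    return status_code in RETRYABLE_STATUS


def naive_rule(status_code: int) -> bool:
    """The shortcut many codebases use: retry any 5xx, never a 4xx."""
    return status_code >= 500


sample = [400, 401, 404, 408, 425, 429, 500, 501, 502, 503, 504, 505]
disagreements = [s for s in sample if naive_rule(s) != is_retryable(s)]
print(disagreements)  # [408, 425, 429, 501, 505]
```

The shortcut misses the three retryable 4xx codes and wastes attempts on 501 and 505.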

Backoff uses full jitter, not deterministic 2^attempt. The AWS Architecture Blog published the analysis years ago (“Exponential Backoff And Jitter”), and it is worth re-reading: full jitter, where each sleep is drawn uniformly from zero up to the capped exponential delay (random() * cap), spreads retries meaningfully better than equal jitter or no jitter. With many concurrent clients all retrying, deterministic backoff produces synchronized waves; jitter breaks the wave.
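The synchronized-wave effect is easy to reproduce offline. The sketch below assumes 1000 clients that all failed at t=0 and counts how many distinct instants (rounded to 10 ms) they retry at; the function names and the seed are illustrative choices, not taken from any library:

```python
import random


def deterministic_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Capped exponential delay with no randomness."""
    return min(cap, base * 2**attempt)


def full_jitter_delay(attempt: int, rng: random.Random,
                      base: float = 0.5, cap: float = 8.0) -> float:
    """Full jitter: uniform between 0 and the capped exponential delay."""
    return rng.uniform(0, deterministic_delay(attempt, base, cap))


def distinct_retry_times(n_clients: int, attempt: int, jitter: bool, seed: int = 0) -> int:
    """Count distinct retry instants (10 ms resolution) across all clients."""
    rng = random.Random(seed)
    times = {
        round(full_jitter_delay(attempt, rng) if jitter else deterministic_delay(attempt), 2)
        for _ in range(n_clients)
    }
    return len(times)


# Without jitter every client lands on the same instant (2.0 s for attempt 2).
print(distinct_retry_times(1000, attempt=2, jitter=False))
# With full jitter the same clients spread across the whole [0, 2.0] s window.
print(distinct_retry_times(1000, attempt=2, jitter=True))
```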

The deadline parameter is the most important line. It is the upper bound on how long the helper keeps retrying, measured across all attempts. If you have a caller that times out at 30 seconds, your retry must respect that: there is no point completing the retry sequence after the caller has already given up and started its own retry. The helper cannot interrupt a call that is already in flight, so keep the per-request timeout well below the deadline. The deadline is the contract you have with whoever called you.
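One way to keep a single slow request from overrunning the contract is to derive each attempt's timeout from what is left of the budget. This is a sketch under assumptions: Budget is a hypothetical helper, and the injectable clock exists only to make it testable:

```python
import time
from typing import Callable


class Budget:
    """Tracks how much of a total time budget is left (hypothetical helper)."""

    def __init__(self, total: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires = clock() + total

    def remaining(self) -> float:
        """Seconds left before the budget runs out, never negative."""
        return max(0.0, self._expires - self._clock())

    def attempt_timeout(self, per_attempt: float) -> float:
        """Never let a single attempt run longer than what is left overall."""
        return min(per_attempt, self.remaining())


budget = Budget(total=30.0)
print(budget.attempt_timeout(2.0))  # 2.0 while most of the budget is left
```

With httpx you could pass the result as the per-request timeout, for example client.get(url, timeout=budget.attempt_timeout(2.0)), after checking that budget.remaining() is still above zero.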

What this code does not do

It does not implement a circuit breaker. For most services, full jitter + deadlines is enough; circuit breakers add complexity and the failure modes are usually covered by setting a sane deadline.

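It does not treat DNS failures or connection-pool exhaustion specially. DNS failures surface as httpx.ConnectError, so the standard retry path already catches them; if you see them repeatedly, the fix is your DNS resolver config, not retry logic. Exhaustion of httpx's own connection pool surfaces as httpx.PoolTimeout, which the helper does not catch; add it to the except clause only if you are sure a retry will not make the pile-up worse.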

It does not log. I deliberately leave logging out of the helper because every team wants different log shapes. Wrap the call site, not the helper.
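Wrapping the call site can be as small as a decorator-style function around the callable you hand to the helper. This is a minimal sketch; with_attempt_logging and the log message shape are illustrative assumptions, so swap in whatever your team logs:

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")


def with_attempt_logging(fn: Callable[[], T], log: logging.Logger, name: str) -> Callable[[], T]:
    """Wrap fn so each failed attempt is logged before the exception propagates."""
    attempts = 0

    def wrapped() -> T:
        nonlocal attempts
        attempts += 1
        try:
            return fn()
        except Exception as exc:
            log.warning("%s attempt %d failed: %s", name, attempts, type(exc).__name__)
            raise  # the retry helper still decides whether to retry

    return wrapped
```

Usage would look like retry(with_attempt_logging(call, log, "fetch_status")), which keeps the helper itself free of logging.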

A reproducible test in five minutes

If you want to convince yourself the deadline matters, here is a test you can run in 5 minutes against httpbin.org/status/503:

import time

import httpx

# Assumes retry() from the helper above is defined or imported in this module.

client = httpx.Client(timeout=2.0)

def call():
    r = client.get("https://httpbin.org/status/503")
    r.raise_for_status()
    return r

started = time.monotonic()
try:
    retry(call, max_attempts=10, base_delay=0.5, max_delay=8.0, deadline=10.0)
except Exception as e:
    elapsed = time.monotonic() - started
    print(f"Failed after {elapsed:.1f}s: {type(e).__name__}")

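Run it twice: once with deadline=10.0, once with deadline=120.0. The first gives up at around 10 seconds. The second is bounded only by max_attempts, so it keeps sleeping and retrying until all 10 attempts are spent, which takes several times longer. In production, the first behavior is almost always what you want; the second is what you get from a retry configuration with no total-time cap (tenacity, for example, retries forever unless you pass a stop condition), and it is what produced my 4-minute incident.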
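If you would rather not hit the network, the same comparison can be simulated with a fake clock. This is a stand-in model, not the helper itself: it assumes every call fails, that each call costs a fixed 0.3 s, and that the loop stops when the next attempt would start past the deadline. FakeClock, simulate, call_cost, and the seed are all hypothetical names and values:

```python
import random


class FakeClock:
    """Deterministic stand-in for time.monotonic and time.sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def monotonic(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate(deadline: float, max_attempts: int = 10, base_delay: float = 0.5,
             max_delay: float = 8.0, call_cost: float = 0.3,
             seed: int = 1) -> tuple[int, float]:
    """Return (attempts made, simulated seconds elapsed) when every call fails."""
    clock, rng = FakeClock(), random.Random(seed)
    attempts = 0
    for attempt in range(max_attempts):
        clock.sleep(call_cost)  # pretend the failing request took this long
        attempts += 1
        if attempt == max_attempts - 1:
            break  # out of attempts
        delay = rng.uniform(0, min(max_delay, base_delay * 2**attempt))  # full jitter
        if clock.monotonic() + delay >= deadline:
            break  # the next attempt would start past the deadline
        clock.sleep(delay)
    return attempts, clock.monotonic()


print(simulate(deadline=10.0))   # stops early, close to the deadline
print(simulate(deadline=120.0))  # uses all 10 attempts
```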

Where this fits in the stack

Retries belong on the call to a single dependency. Wrap one HTTP call. Do not wrap a whole business operation that does five things — if step 3 fails, you do not want to redo steps 1 and 2.

If your call mutates state on the receiver (POST, PUT, DELETE), you also need an idempotency key on the request, or your retries will produce duplicates on the receiver side. This matters most for POST, which HTTP does not define as idempotent. That is the other half of this problem; I wrote about the receiver-side fix in “Idempotent webhook receivers in 50 lines of Python”. Together, the two patterns close the loop.
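On the sending side, the important detail is that the key is generated once, outside the retry loop, so every attempt carries the same value. This is a sketch under assumptions: make_idempotent_sender is a hypothetical helper, and the Idempotency-Key header name is a common convention that your receiver may or may not use:

```python
import uuid
from typing import Callable


def make_idempotent_sender(send: Callable[[dict, dict], object],
                           payload: dict) -> Callable[[], object]:
    """Build a zero-argument callable whose every invocation reuses one key."""
    key = str(uuid.uuid4())  # generated once, outside the retry loop
    headers = {"Idempotency-Key": key}  # header name is an assumed convention

    def call() -> object:
        # Pass a copy so the transport cannot mutate the shared headers.
        return send(payload, dict(headers))

    return call
```

With httpx, send could be something like lambda payload, headers: client.post(url, json=payload, headers=headers), and the returned callable is what you hand to the retry helper.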

What to change in your code today

Three concrete actions you can do in under an hour:

  1. Grep your codebase for tenacity.retry, backoff.on_exception, and any while-loop with time.sleep inside. For each result, check whether there is a total-time cap: stop_after_delay in tenacity, max_time in backoff, or an equivalent deadline check in a hand-rolled loop. If not, add one.
  2. For each retry block that catches Exception or requests.RequestException, narrow the catch to the specific transient errors you actually want to retry on. Most code retries far too many things.
  3. If you do not already have full jitter, add it. The diff is one line: replace time.sleep(delay) with time.sleep(random.uniform(0, delay)).
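The grep in step 1 can be roughed out in a few lines of Python if you want the cap check done at the same time. This is a crude heuristic sketch: audit and both patterns are illustrative assumptions, and "nearby" just means within five lines, so treat the hits as a reading list rather than a verdict:

```python
import re

RETRY_PATTERNS = re.compile(r"tenacity\.retry|@retry\b|backoff\.on_exception|time\.sleep\(")
TIME_CAP_PATTERNS = re.compile(r"stop_after_delay|max_time|deadline")


def audit(source: str) -> list[tuple[int, str, bool]]:
    """Return (line number, stripped line, has_cap_nearby) for retry-looking lines."""
    lines = source.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if RETRY_PATTERNS.search(line):
            # Look five lines either side for anything resembling a total-time cap.
            window = "\n".join(lines[max(0, i - 5): i + 6])
            hits.append((i + 1, line.strip(), bool(TIME_CAP_PATTERNS.search(window))))
    return hits


sample = "import time\nwhile True:\n    time.sleep(1)\n"
print(audit(sample))  # [(3, 'time.sleep(1)', False)]
```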

These three changes alone would have prevented my 4-minute incident, and based on the production scrapers and webhook handlers I have looked at this year, they would prevent most of the retry-related incidents I see.

