Your AI Agent Will Double-Charge on a Lost Response


If your agent calls a tool that charges a card, and the transport drops the response, your agent didn’t fail safely. It double-charged the customer, and it has no idea.

That’s the whole bug. The money already moved. The agent never heard “ok,” so it did what every well-behaved retry loop does: it tried again. Same prompt, same tool, same arguments. A second charge.

TL;DR

  • A retry is not a network event. It’s a semantic decision about a side effect that may have already happened.
  • Backoff, jitter, and a retry ceiling make retries polite. They do nothing to stop a double-charge.
  • A tiny idempotency ledger (a dedup key mapped to a recorded result) gives you at-most-once tool calls deterministically, for effects you own. For a third-party charge you don’t own (Stripe), it’s the provider’s idempotency key that stops the double-charge, not a ledger in front of your own process. The article keeps those two cases apart.
  • In the demo below, the naive runtime fires 120 side-effect calls for 100 orders and overcharges customers by $399.80. The ledger fires exactly 100 and overcharges by $0.00, on the same number of retries.
  • This is at-most-once, not exactly-once. I’ll be honest about where it breaks.

The retry your agent framework gives you is the wrong retry

Open any agent framework and you’ll find retry logic. Exponential backoff. Jitter. A max-attempts ceiling. All of it built for one failure mode: the request didn’t arrive.

That’s a fine default. It’s also the wrong default the moment a tool has a side effect.

That logic is correct for reads. If GET /reviews?page=4 times out, retrying is free and obviously right. Read it again, no harm.

It is quietly wrong for writes. There are two different ways a tool call can fail, and they look identical to the caller:

  1. The request was lost. The side effect never happened. Retrying is safe and necessary.
  2. The response was lost. The side effect already happened. Retrying does it again.

From the agent’s seat, both look like the same thing: a tool call with no result. A timeout. A dropped socket. A 502 from a proxy that already forwarded your POST upstream. The agent cannot tell case 1 from case 2 by looking at the failure. The information it needs is on the other side of the wire, and that’s exactly the side it couldn’t reach.

So backoff doesn’t help you here. Backoff decides when to retry. It never decides whether the side effect already fired. That second question is the only one that matters for a charge, a send_email, a create_refund, a POST /orders. The contrarian bit, said plainly: a write retry is a question about semantics, not about the network. Tuning the network knobs harder just makes you double-charge on a slower, more polite schedule.
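
To make that concrete, here’s a minimal sketch. FlakyChargeTool and retry_with_backoff are hypothetical names of mine, not from any framework. Both failure modes surface as the same TimeoutError, and the same polite backoff loop handles both; only one of them ends in a second charge.

```python
import random
import time


class FlakyChargeTool:
    """Hypothetical write tool whose transport fails once, in one of two ways."""

    def __init__(self, failure):
        self.failure = failure   # "request_lost" or "response_lost"
        self.charges = 0         # the real side effect

    def charge(self, amount_cents):
        if self.failure == "request_lost":
            self.failure = None
            raise TimeoutError("no result")   # case 1: side effect never ran
        self.charges += 1                     # side effect runs
        if self.failure == "response_lost":
            self.failure = None
            raise TimeoutError("no result")   # case 2: it ran, nobody heard
        return {"status": "ok", "charged_cents": amount_cents}


def retry_with_backoff(tool, amount_cents, max_attempts=3, base_delay=0.0):
    """Exponential backoff with jitter and a ceiling. Decides WHEN, never WHETHER."""
    for attempt in range(max_attempts):
        try:
            return tool.charge(amount_cents)
        except TimeoutError:
            # Same exception in both cases: there is nothing here to branch on.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise TimeoutError("gave up after max_attempts")


request_lost = FlakyChargeTool("request_lost")
response_lost = FlakyChargeTool("response_lost")
retry_with_backoff(request_lost, 1999)
retry_with_backoff(response_lost, 1999)
print(request_lost.charges, response_lost.charges)
```

The request-lost tool ends with one charge. The response-lost tool ends with two, and the retry loop ran the same code path for both.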

What “at-most-once” actually means

Distributed systems people have three delivery guarantees, and the names are worth getting right because agent docs use them loosely.

  • At-least-once: the action runs one or more times. You never lose it, but you might repeat it. This is what a naive retry loop gives you. Fine for idempotent reads, dangerous for charges.
  • At-most-once: the action runs zero or one time. You might lose it (rare), but you will never repeat it. This is what you want for money.
  • Exactly-once: the action runs precisely once. Everybody wants this. In a system that can lose messages it’s the hard one: you don’t get it for free, you approximate it by combining at-least-once retries with dedup, so the attempts can repeat but the effect can’t. The ledger below is the dedup half.

An idempotency ledger buys you the middle one cleanly: at-most-once for the side effect, on top of at-least-once attempts, at the boundary where the ledger sits. The attempts can fire as often as the network forces them to. The side effect fires once, because the second attempt finds a recorded result and replays it instead of re-running. The catch, which I unpack later: if the side effect lives on the other side of a wire you don’t own, the boundary that matters is the provider’s, not yours.

Reads stay at-least-once. Writes with a real side effect move to at-most-once. That’s the whole design decision.

This is not my invention. It’s the same mechanism Stripe ships in its public API. Their words:

“Stripe’s idempotency works by saving the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeds or fails. Subsequent requests with the same key return the same result, including 500 errors.”

Source: Stripe API docs, idempotent requests.

Read that twice. They save the result, status and body, and replay it. They don’t re-run the charge. An AI agent’s write tool needs the exact same contract, and most of them don’t have it yet.
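
Here’s a toy version of that contract, to show the shape rather than Stripe’s actual implementation, which I haven’t seen. IdempotentEndpoint, its post method, and the simulate_outage flag are all hypothetical. It saves the status code and body of the first request per key and replays them, failures included.

```python
class IdempotentEndpoint:
    """Toy server-side wrapper: first result per key is saved, then replayed."""

    def __init__(self, handler):
        self.handler = handler
        self.saved = {}   # idempotency key -> (status_code, body)

    def post(self, idempotency_key, params):
        if idempotency_key in self.saved:
            return self.saved[idempotency_key]     # replay, handler not re-run
        try:
            response = (200, self.handler(params))
        except Exception as exc:
            response = (500, {"error": str(exc)})  # failures are saved too
        self.saved[idempotency_key] = response
        return response


handler_calls = []


def charge_handler(params):
    """Stand-in for the real charge; counts how often it actually runs."""
    handler_calls.append(params)
    if params.get("simulate_outage"):
        raise RuntimeError("upstream unavailable")
    return {"charged_cents": params["amount_cents"]}


endpoint = IdempotentEndpoint(charge_handler)
first = endpoint.post("key-A", {"amount_cents": 1999})
retry = endpoint.post("key-A", {"amount_cents": 1999})
failed = endpoint.post("key-B", {"amount_cents": 1999, "simulate_outage": True})
failed_retry = endpoint.post("key-B", {"amount_cents": 1999, "simulate_outage": True})
```

Four requests, two handler runs. The retry on key-A gets the saved 200, and the retry on key-B gets the saved 500 without touching the handler again.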

A boundary, because this looks like three things it isn’t

I’ve written before about resuming a scraper that died at row 12,000 without re-writing rows, about conditional GET to skip re-downloading unchanged pages, and about an agent re-reading every page it already saw. Those are all about reading or rewriting your own data: making a resume clean, making a read cheap.

This is a different animal. This is about a tool call with an external side effect you do not own and cannot undo by truncating a file: a payment, an email, a refund, an order placed in someone else’s system. You can’t “resume” a charge by checking which rows you already wrote. The money is gone the instant the side effect fires. The fix isn’t a file offset. It’s a key that recognizes “I already did this exact action” and hands back the original answer.

And here’s the part that decides where the key goes. If the side effect is external and you don’t own it (Stripe charging a card), the dedup has to happen on the callee’s side. The provider has to see your key, recognize the repeat, and refuse to charge again. A ledger sitting in front of your own process can’t help with the lost-response case: the remote charge already fired, your ledger recorded nothing, and the retry walks right past it into a second charge. That’s exactly the opening bug. For your own side effects (a row you write, a job you enqueue, a service you control end to end), a ledger you own is the whole fix, because you control the boundary the key is checked at. Keep those two cases apart; the demo below collapses them on purpose, into one process, to make the mechanism visible. Same family as retry hygiene, completely different failure and completely different fix.
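
A small sketch of that difference, under stated assumptions: Provider is a toy stand-in for a third-party API, and call_over_lossy_wire and charge_with_retry are hypothetical helpers. The client keeps its own ledger in both runs. The only thing that changes is whether the key crosses the wire.

```python
class Provider:
    """Toy third-party API. It dedups only when the caller sends a key."""

    def __init__(self):
        self.charges = 0   # real side effects on the provider's side
        self.seen = {}     # idempotency key -> saved result

    def charge(self, amount_cents, idempotency_key=None):
        if idempotency_key is not None and idempotency_key in self.seen:
            return self.seen[idempotency_key]
        self.charges += 1
        result = {"status": "ok", "charged_cents": amount_cents}
        if idempotency_key is not None:
            self.seen[idempotency_key] = result
        return result


def call_over_lossy_wire(provider, drop_response, **kwargs):
    result = provider.charge(**kwargs)        # the charge fires remotely...
    if drop_response:
        raise TimeoutError("response lost")   # ...but the caller never hears
    return result


def charge_with_retry(provider, key, send_key_to_provider):
    """Client-side ledger in both modes; only one mode sends the key across."""
    ledger = {}   # client-side: key -> recorded result
    extra = {"idempotency_key": key} if send_key_to_provider else {}
    for drop_response in (True, False):   # first response lost, retry succeeds
        if key in ledger:
            return ledger[key]
        try:
            ledger[key] = call_over_lossy_wire(
                provider, drop_response, amount_cents=1999, **extra)
        except TimeoutError:
            pass   # nothing recorded locally, so the retry misses the ledger
    return ledger[key]


ledger_only = Provider()
charge_with_retry(ledger_only, "order-001:charge", send_key_to_provider=False)

with_provider_key = Provider()
charge_with_retry(with_provider_key, "order-001:charge", send_key_to_provider=True)
```

With the ledger alone the provider records two charges. With the key passed down it records one, because the repeat is recognized where the charge happens.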

The demo: naive retry vs. ledger, same retries, different bill

Here’s a self-contained simulation. No network, no dependencies, just hashlib and json from the standard library, so you can run it in five seconds and watch the numbers. A toy PaymentAPI has a real side effect (a balance and a call counter). We run 100 orders at $19.99. The transport “loses the response” on every 5th call, so 20 of the 100 calls get retried.

The naive runtime retries by just calling charge again. The ledger runtime keys each logical action and replays the recorded result on a retry.

"""
at-most-once tool calls for AI agents: naive retry vs idempotency ledger.

Deterministic, stdlib-only (hashlib, json). No network, no external deps.
Run:  python3 idempotency_ledger_demo.py

Scenario: an agent calls a write tool (charge a card) 100 times. The transport
loses the RESPONSE on every 5th call -- the side effect already happened, the
agent just never heard back. The naive runtime retries the action; the ledger
runtime replays the recorded result instead of re-running the side effect.
"""

import hashlib
import json


class PaymentAPI:
    """Toy external service with a REAL side effect (balance + call counter)."""

    def __init__(self):
        self.balance_cents = 0
        self.side_effect_calls = 0  # every real charge increments this

    def charge(self, order_id, amount_cents):
        # This is the side effect. It runs on EVERY call -- that's the danger.
        self.side_effect_calls += 1
        self.balance_cents += amount_cents
        return {"order_id": order_id, "charged_cents": amount_cents, "status": "ok"}


def idem_key(workflow_id, step, args):
    """Stable key for one logical action. Same inputs -> same key, always."""
    payload = json.dumps([workflow_id, step, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def run_naive(orders, lost_every):
    """Naive runtime: on a lost response, just retry the call. Double-spend."""
    api = PaymentAPI()
    duplicate_charges = 0
    for i, (order_id, amount) in enumerate(orders, start=1):
        api.charge(order_id, amount)          # first attempt: side effect fires
        if i % lost_every == 0:
            # response was lost -> retry -> side effect fires AGAIN
            api.charge(order_id, amount)
            duplicate_charges += 1
    return api, duplicate_charges


def run_ledger(orders, lost_every):
    """Ledger runtime: key the action; replay recorded result on retry."""
    api = PaymentAPI()
    ledger = {}  # idem_key -> recorded result (in-memory; see caveat in article)
    duplicate_charges = 0

    def call_once(workflow_id, step, order_id, amount):
        key = idem_key(workflow_id, step, [order_id, amount])
        if key in ledger:
            return ledger[key], True   # replay: no side effect
        result = api.charge(order_id, amount)
        ledger[key] = result           # record BEFORE the response can be lost
        return result, False

    for i, (order_id, amount) in enumerate(orders, start=1):
        call_once("wf-checkout", "charge", order_id, amount)
        if i % lost_every == 0:
            _, replayed = call_once("wf-checkout", "charge", order_id, amount)
            if not replayed:
                duplicate_charges += 1  # would mean a real double-spend
    return api, duplicate_charges


def main():
    N = 100
    PRICE_CENTS = 1999      # $19.99
    LOST = 5                # response lost on every 5th call -> 20 retries
    orders = [(f"order-{i:03d}", PRICE_CENTS) for i in range(N)]
    expected_cents = N * PRICE_CENTS
    retries = N // LOST

    naive_api, naive_dup = run_naive(orders, LOST)
    ledger_api, ledger_dup = run_ledger(orders, LOST)

    def dollars(c):
        return c / 100

    print(f"Scenario: {N} orders @ $19.99, response lost on every {LOST}th "
          f"(so {retries} retries)\n")
    print(f"NAIVE    orders={N}  side_effect_calls={naive_api.side_effect_calls:>4}  "
          f"balance=${dollars(naive_api.balance_cents):>8.2f}  "
          f"expected=${dollars(expected_cents):>8.2f}  duplicate_charges={naive_dup}")
    print(f"LEDGER   orders={N}  side_effect_calls={ledger_api.side_effect_calls:>4}  "
          f"balance=${dollars(ledger_api.balance_cents):>8.2f}  "
          f"expected=${dollars(expected_cents):>8.2f}  duplicate_charges={ledger_dup}")

    overcharge = naive_api.balance_cents - expected_cents
    print(f"\nNAIVE overcharged customers by ${dollars(overcharge):.2f} "
          f"({naive_dup} duplicate charges).")
    print(f"LEDGER overcharge: $0.00 ({ledger_dup} duplicate charges).  "
          f"Same retries, zero double-spend.")

    # Willison gate: assert the numbers, so the demo can't silently drift.
    assert naive_dup == retries
    assert naive_api.side_effect_calls == N + retries
    assert naive_api.balance_cents == expected_cents + retries * PRICE_CENTS
    assert ledger_dup == 0
    assert ledger_api.side_effect_calls == N
    assert ledger_api.balance_cents == expected_cents
    print("\nAll asserts passed (deterministic).")


if __name__ == "__main__":
    main()

Run it. This is the exact output on my machine, copied straight from stdout, not retyped:

Scenario: 100 orders @ $19.99, response lost on every 5th (so 20 retries)

NAIVE    orders=100  side_effect_calls= 120  balance=$ 2398.80  expected=$ 1999.00  duplicate_charges=20
LEDGER   orders=100  side_effect_calls= 100  balance=$ 1999.00  expected=$ 1999.00  duplicate_charges=0

NAIVE overcharged customers by $399.80 (20 duplicate charges).
LEDGER overcharge: $0.00 (0 duplicate charges).  Same retries, zero double-spend.

All asserts passed (deterministic).

Same 20 retries in both runs. The ledger didn’t retry less. It retried just as much, and still landed on the correct $1,999.00. The naive runtime sailed past it to $2,398.80 and never threw an error, because nothing errored. Every charge “succeeded.” That’s the part that makes this bug nasty: it’s invisible until a customer emails you.

How the ledger actually works, in three moves

The whole mechanism is in call_once. It’s three lines of logic with one ordering rule that’s easy to get wrong.

1. Derive a stable key for the logical action. idem_key("wf-checkout", "charge", [order_id, amount]) hashes the workflow, the step, and the arguments. The word that matters is stable: the same logical action must produce the same key on the retry. If your key includes a timestamp, a random UUID generated per-attempt, or a re-sampled LLM token, the retry gets a different key, the ledger misses, and you double-charge anyway. The ledger is only as good as the determinism of its key. Keep reading; this is the #1 thing people get wrong.

2. Look before you leap. If the key is already in the ledger, return the recorded result and do not touch the side effect. That’s the replay. This is the same contract as Stripe returning the saved status and body.

3. Record before the response can be lost. Look at the order in call_once: we call api.charge(...), then immediately write ledger[key] = result, without waiting for the response to make it back to the agent, because the whole point is that the response doesn’t make it back. If you only record on a successful round-trip, you’ve recorded nothing in exactly the case you built this for. One honest caveat about the toy: charge-then-write is still two steps, so a crash in that gap re-charges on retry. In production the record has to commit atomically with the side effect (or you reserve a pending row first); I come back to that in the limits section. The ordering here gets you the replay; atomicity gets you the guarantee.

That’s it. No backoff change, no new framework. A dict in the demo; in production, a row with a unique constraint.
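
Move 1 is the one that bites, so here’s a quick check of what “stable” means in practice. idem_key is the same helper as in the demo; fragile_key is a hypothetical anti-pattern I wrote for contrast, and passing the arguments as a dict instead of a list is my variation.

```python
import hashlib
import json
import uuid


def idem_key(workflow_id, step, args):
    """Same helper as the demo: canonical JSON, then a truncated SHA-256."""
    payload = json.dumps([workflow_id, step, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def fragile_key(workflow_id, step, args):
    """Anti-pattern: a per-attempt UUID makes every retry look brand new."""
    attempt_args = dict(args, request_id=str(uuid.uuid4()))
    return idem_key(workflow_id, step, attempt_args)


args_first_attempt = {"order_id": "order-001", "amount_cents": 1999}
args_on_retry = {"amount_cents": 1999, "order_id": "order-001"}   # reordered

# sort_keys=True absorbs the reordering, so the retry maps to the same key.
stable_first = idem_key("wf-checkout", "charge", args_first_attempt)
stable_retry = idem_key("wf-checkout", "charge", args_on_retry)

# The fresh request_id changes the payload, so the retry misses the ledger.
fragile_first = fragile_key("wf-checkout", "charge", args_first_attempt)
fragile_retry = fragile_key("wf-checkout", "charge", args_on_retry)
```

Note that sort_keys only canonicalizes key order. If the model re-emits 1999 as "1999" or as 19.99, the payload changes and so does the key, which is why the arguments need normalizing before they reach the hash.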

Why I bother: 2,190 runs of things-not-coming-back

I run scrapers and data tools in production: 32 published actors, 2,190 lifetime runs as of June 2026 (raw lifetime counter on my Apify profile, apify.com/knotless_cadence; the Trustpilot one alone is past 962 runs). None of those charge a card. So why am I writing about payments?

Because at that volume you stop believing the happy path. Over thousands of runs, “the request finished but the acknowledgement got lost” stops being a textbook edge case and becomes a Tuesday. Proxies hang after forwarding. A worker gets OOM-killed between doing the work and writing “done.” A 200 arrives for a body that never got read. The operational lesson that 2,190 runs beat into me isn’t “add retries”; every framework has retries. It’s “a retry without a notion of identity is a bet that nothing irreversible happened on the last attempt,” and on a long enough timeline that bet loses. For reads I lose nothing. The day an agent points that same naive retry at a charge, the bet costs real money.

That’s the bridge to agents. We’re now wiring LLMs directly to write tools (charge, refund, send, book) and handing them the same naive retry loop that was always lurking under the reads. The blast radius just changed from “re-downloaded a page” to “billed a human twice.”

Where this breaks (read this before you ship it)

I’d be lying if I sold you exactly-once. This is at-most-once, and it has sharp edges. Here’s the honest list.

  • It’s at-most-once, not exactly-once, and only when the record is atomic with the effect. Look at call_once: it calls api.charge(...), then writes ledger[key] = result. Those are two steps. If the process crashes in that gap, after the charge fired but before the ledger write lands, the retry finds no key and charges again. So the toy actually demonstrates the replay mechanism (key hit, recorded result, no re-run), not a crash-proof atomic commit. The at-most-once guarantee holds only if the record commits atomically with the side effect; the charge-then-write window is the exact hole where double-charge still lives, and it’s where at-most-once quietly degrades back to at-least-once. In production you close it with a two-phase write (reserve a pending row before the call, finalize it after, and treat a row that is still pending on retry as “outcome unknown” rather than re-running it) or by pushing the key down to a callee that dedups for you. At-most-once means you accept “maybe zero” to guarantee “never two.” That’s the right trade for money. It is not free, and it is not automatic.

  • The key must be deterministic and stable, or none of this works. I said it above; it’s worth its own bullet because it’s the #1 way people break this in practice. An LLM that regenerates its tool arguments on retry (re-sampling, re-formatting, adding a fresh request_id) produces a new key and walks right past the ledger. Pin the key upstream, before the model can wobble it.

  • Concurrency needs an atomic check-and-record. My demo is single-threaded, so the if key in ledger / ledger[key] = result gap is safe. In real life two retries can race into that gap simultaneously and both miss. You need an atomic operation: a unique constraint in Postgres, a conditional put, INSERT ... ON CONFLICT DO NOTHING. Stripe is candid about this exact corner, and it’s worth quoting because it’s the failure people forget: “If incoming parameters fail validation, or the request conflicts with another request that’s executing concurrently, we don’t save the idempotent result… You can retry these requests.” The race is real; handle it at the storage layer.

  • The ledger must be persisted and pruned. An in-memory dict dies with the process, and your dedup history dies with it, so a retry after a restart double-charges. Persist it. It also grows forever, so prune it on a TTL. Stripe prunes keys after 24 hours: “You can remove keys from the system automatically after they’re at least 24 hours old. We generate a new request if a key is reused after the original is pruned.” Pick a TTL longer than your worst retry window. Too short and a late retry sails past a pruned key.

  • A too-coarse key fails the other direction: false dedup. The determinism bullet warns about a key that’s too fragile (different every attempt -> miss -> double-charge). The mirror image is just as real: a key that’s too coarse. If you build it from (amount, SKU) instead of the logical action, two genuinely different charges (the same customer buying the same item twice on purpose) collide on one key. The second one hits the ledger, looks like a duplicate, and gets silently swallowed. Now you’ve lost a legitimate charge. The key has to be unique per logical action (a stable order or checkout id), not per value. Stability and uniqueness both matter, and they cut in opposite directions.

  • Don’t record a transient failure as a final result. Stripe saves the first result “regardless of whether it succeeds or fails” and replays it on every later request with that key, “including 500 errors.” That’s right for their boundary, where the key maps to one HTTP exchange. In your own ledger it’s a trap if you’re not deliberate. If the first attempt hit a timeout or a 500 that didn’t actually charge, and you record that failure as the result, every retry until the TTL expires replays the frozen error instead of trying again. You’ve turned a transient failure into a permanent one. Decide per error class what counts as “final”: a terminal result (charged, or hard-declined) gets recorded; a retryable transient one does not, so a fresh attempt can still run.
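
The reserve-then-finalize idea and the atomic check-and-record from the list above can be sketched with the standard library’s sqlite3. This is an illustration under assumptions, not a production design: the table layout, the open_ledger and call_once names, and the “raise on a pending row” policy are mine, and INSERT OR IGNORE is SQLite’s spelling of the ON CONFLICT DO NOTHING idea.

```python
import json
import sqlite3


def open_ledger(path=":memory:"):
    """Pass a file path instead of :memory: to survive a restart."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        " key TEXT PRIMARY KEY,"      # the unique constraint does the dedup
        " status TEXT NOT NULL,"      # 'pending' or 'done'
        " result TEXT)"
    )
    return db


def call_once(db, key, side_effect):
    """Reserve a pending row, run the effect, finalize. Returns (result, replayed)."""
    with db:   # commits the reservation before the side effect runs
        cur = db.execute(
            "INSERT OR IGNORE INTO ledger (key, status) VALUES (?, 'pending')",
            (key,),
        )
    if cur.rowcount == 0:   # the key was already reserved by an earlier attempt
        status, result = db.execute(
            "SELECT status, result FROM ledger WHERE key = ?", (key,)
        ).fetchone()
        if status == "done":
            return json.loads(result), True
        raise RuntimeError(f"{key} is pending: outcome unknown, reconcile first")
    result = side_effect()
    with db:
        db.execute(
            "UPDATE ledger SET status = 'done', result = ? WHERE key = ?",
            (json.dumps(result), key),
        )
    return result, False


charges = []


def charge():
    charges.append(1999)
    return {"status": "ok", "charged_cents": 1999}


def charge_then_crash():
    charges.append(1999)                                   # the charge fires...
    raise ConnectionError("process died before finalize")  # ...then we crash


db = open_ledger()
first, first_replayed = call_once(db, "order-001:charge", charge)
again, again_replayed = call_once(db, "order-001:charge", charge)

try:
    call_once(db, "order-002:charge", charge_then_crash)
except ConnectionError:
    pass   # the pending row survives the crash
try:
    call_once(db, "order-002:charge", charge)
    retried_after_crash = True
except RuntimeError:
    retried_after_crash = False   # blocked: no second charge
```

The retry on order-001 replays the stored result. The retry on order-002 is refused, because the ledger can’t tell whether the charge landed. That refusal is the “maybe zero” half of the trade: something else (the provider, or a human) has to resolve the pending row.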

None of these are reasons to skip the ledger. They’re reasons to build the minimum version correctly: a key that’s both stable and unique per action, an atomic check-and-record, only terminal results recorded, persistence, a TTL.
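
Two of those requirements, “only terminal results recorded” and “a TTL”, fit in a few lines. This is an in-memory sketch with hypothetical names (TTLLedger, TERMINAL_STATUSES); the status strings are an assumed error taxonomy, the 24-hour TTL is borrowed from the Stripe quote above, and the clock is injected only to keep the example deterministic.

```python
TERMINAL_STATUSES = {"charged", "declined"}   # assumption: your own taxonomy


class TTLLedger:
    """Records only terminal results and forgets them after a TTL."""

    def __init__(self, ttl_seconds, clock):
        self.ttl_seconds = ttl_seconds
        self.clock = clock       # callable returning the current time in seconds
        self.entries = {}        # key -> (recorded_at, result)

    def record(self, key, result):
        if result["status"] not in TERMINAL_STATUSES:
            return False         # transient: leave the key free for a fresh attempt
        self.entries[key] = (self.clock(), result)
        return True

    def lookup(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        recorded_at, result = entry
        if self.clock() - recorded_at >= self.ttl_seconds:
            del self.entries[key]   # pruned: a late retry is no longer deduped
            return None
        return result


now = [0.0]   # fake clock, advanced by hand below
ledger = TTLLedger(ttl_seconds=24 * 3600, clock=lambda: now[0])

recorded_timeout = ledger.record("order-001:charge", {"status": "timeout"})
after_timeout = ledger.lookup("order-001:charge")

recorded_charge = ledger.record("order-001:charge", {"status": "charged"})
now[0] = 3600.0          # one hour later: still deduped
within_ttl = ledger.lookup("order-001:charge")
now[0] = 25 * 3600.0     # past the TTL: the key is gone
after_ttl = ledger.lookup("order-001:charge")
```

The last lookup is the one to worry about. Once the key is pruned, a retry that arrives late is treated as a new action, which is why the TTL has to outlast your worst retry window.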

What to do Monday

If your agent has any tool with an irreversible side effect (payment, email, refund, an external POST), do three things this week.

  1. Tag those tools. Decide which tool calls are at-most-once. Reads aren’t; writes with external side effects are.
  2. Pin a deterministic idempotency key at the call site, before the model can re-roll the arguments. Workflow id + step + canonicalized args, hashed. This is the load-bearing step. Then pass that key down to the provider if it supports an Idempotency-Key header (Stripe and many others do, so use theirs, don’t reinvent it). For an external side effect this is what actually stops the double-charge, because the dedup happens where the charge happens.
  3. For the side effects you own, put an atomic, persisted, TTL’d ledger in front of them. A single Postgres table with a unique constraint covers it. This gives at-most-once for your effects and stops your agent from re-issuing the same call. It does not make a provider’s charge idempotent; only the provider’s own key does that. If a tool has an external side effect and offers no idempotency key, treat auto-retrying it as unsafe and route it through a human or a manual confirm.
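
Steps 1 and 2 can be wired together at the call site. A minimal sketch, assuming a runtime that plans each tool call before sending it: the tool names, prepare_call, pin_key, and the shape of the returned plan are all hypothetical. The one real-world detail is the Idempotency-Key header name.

```python
import hashlib
import json

# Hypothetical tool names; tag whatever has an irreversible side effect.
AT_MOST_ONCE_TOOLS = {"charge_card", "send_email", "create_refund"}


def pin_key(workflow_id, step, args):
    """Workflow id + step + canonicalized args, hashed."""
    payload = json.dumps([workflow_id, step, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def prepare_call(tool, workflow_id, step, args, provider_accepts_key):
    """Decide retry policy and headers BEFORE the call leaves the runtime."""
    if tool not in AT_MOST_ONCE_TOOLS:
        return {"headers": {}, "auto_retry": True}    # reads: retry freely
    key = pin_key(workflow_id, step, args)
    if provider_accepts_key:
        # The provider dedups on this header, so a retry is safe.
        return {"headers": {"Idempotency-Key": key}, "auto_retry": True}
    return {"headers": {}, "auto_retry": False}       # needs a manual confirm


read_plan = prepare_call("get_reviews", "wf-1", "fetch", {"page": 4}, False)
charge_args = {"order_id": "order-001", "amount_cents": 1999}
charge_plan = prepare_call("charge_card", "wf-checkout", "charge", charge_args, True)
retry_plan = prepare_call("charge_card", "wf-checkout", "charge", charge_args, True)
unsafe_plan = prepare_call("create_refund", "wf-refund", "refund", charge_args, False)
```

The key is computed from the workflow’s own inputs, not from whatever the model emits on the retry, so charge_plan and retry_plan carry the same header. A tagged tool whose provider offers no key comes back with auto_retry off.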

Backoff and jitter stay. They were never the problem. They just can’t see the side effect, and the side effect is the whole game.


Here’s my open question, and I don’t have a clean answer: where should the idempotency key actually live for an LLM agent? At the model layer (the agent commits to a key before it ever calls the tool), at the runtime layer (the framework derives it from the tool name + args), or only at the API boundary (let Stripe-style services own it and treat your own tools as unsafe)? I’ve shipped the runtime-layer version. I suspect the model-layer version is more correct and more fragile. If you’ve wired at-most-once tool calls into a real agent, I want to hear where you put the key, and what broke.

Follow for the next teardown from production, and if you’ve watched an agent double-fire a write tool, tell me what the side effect was and how you caught it. I read every comment.

AI-disclosure: drafted with an AI writing assistant, edited by a human. The Python above was run on my machine before publishing (Python 3, stdlib only); the output block is copied verbatim from stdout and the asserts pass deterministically.

