Caching LLM Calls: A Raw Prompt Key Almost Never Hits
Your LLM cache looks great in tests. In production it barely fires.
Not because the cache is broken. Because of what you keyed it on. You hashed the raw prompt string, and in prod every prompt carries a run id, a timestamp, an attempt counter. A little envelope that changes on every single call. So hash(prompt) produces a fresh key every time, the lookup misses, and you pay the provider again for an answer you already bought five seconds ago.
The fix is not a better cache. It is a better key. Strip the volatile envelope before you hash, and key on the part of the prompt that actually carries meaning.
Below is a stdlib script that runs the same six prompts through a naive cache and a normalized one. Naive: 6 calls, 0 hits. Normalized: 3 calls, 3 hits. Then it prints the one case the normalized key still gets wrong, because I would rather show you the ceiling than sell you past it.
TL;DR
- A raw-prompt cache key almost never hits in prod: every call carries a volatile envelope (run id, timestamp, attempt) that changes the hash, so the lookup misses forever.
- Key on the stable part instead. In the demo below, naive keying does 6 calls / 0 hits; normalized keying does 3 calls / 3 hits on the same six prompts.
- This is a cost lever, not a correctness one. It is also a different thing from the provider’s own prompt cache, which is prefix-based and short-lived.
- The script prints its own ceiling: paraphrases that mean the same thing but use different words still double-charge. String keys cannot see meaning.
- stdlib only, deterministic, fixtures synthetic and labelled. Run it twice, get the same bytes.
Why the raw key never matches itself
I run 32 published scrapers on Apify. 2,190 production runs between them, with one Trustpilot review scraper past 962 runs on its own. None of those call an LLM on the hot path, so this is not a war story about a cache I watched miss at 3am. It is the shape I kept seeing in the run logs, and the shape that quietly breaks naive LLM caching.
Re-run the same actor and the payload going downstream is almost identical from one run to the next. Same records, same fields, same order. What changes is the envelope: the run id, the started-at timestamp, the attempt counter. Those are exactly the fields a naive hash(prompt) keys on. Wrap an LLM step around that payload, build the prompt by serializing it, and your cache key now depends on the one part of the input guaranteed to be unique every time.
That is the trap in one line. In a test you craft the prompt by hand, run it twice, and the string is byte-identical, so the cache hits and the green checkmark lies to you. In prod the same logical question arrives wearing a different envelope on every call, the string differs, and the cache never hits itself. Your hit rate in CI is 100%. Your hit rate in prod is near zero. Same code.
A quick honesty note before the code. The fixtures in the script are synthetic. I did not log real prompts, so I built six JSON envelopes by hand to reproduce the shape I see in run payloads. The 2,190 figure is context for why this shape is the default in repeat-run workloads, not a measured cache hit rate. The demo does not claim a percentage, and neither will I.
The move: key on meaning, drop the envelope
Here is the whole idea. Before you hash, parse the prompt, throw away the fields that are pure plumbing, and canonicalize what is left.
For a JSON-shaped prompt that means: json.loads it, drop a known set of volatile keys (run_id, ts, attempt, whatever your framework injects), then normalize the semantic payload itself. Lowercase it. Strip the whitespace. Sort the keys so {"a":1,"b":2} and {"b":2,"a":1} collapse to one key. Hash that.
The volatile set is the part you have to own. You are declaring, on purpose, which fields carry meaning and which are noise. Get that list wrong in the unsafe direction (drop a field that actually mattered) and two different questions collapse to one cache entry, which serves a stale wrong answer. So this is a key you write deliberately, not a decorator you sprinkle on and forget. More on that failure mode after the proof.
Proof: six prompts, two cache keys
The script is stdlib only (json), fully deterministic, no network, no clock, no randomness. The six prompts are one synthetic agent session asking three questions. Two of them are asked more than once with a different envelope. One is a paraphrase. This is runnable local: save it, run python3 -I prompt_cache_key.py, get the same output every time.
"""Why a raw-prompt cache key almost never hits in production.
Deterministic, stdlib-only (json). No network, no clock, no RNG, no env.
The six prompts below are SYNTHETIC fixtures, not a real prompt log:
each is one JSON envelope from a fake agent session. The point is the
shape, not the content -- a stable semantic question `q` wrapped in a
volatile envelope (`req`/`run_id`/`ts`/`attempt`) that changes every run.
Run it twice with `python3 -I prompt_cache_key.py`; you get the same bytes.
"""
import json
# Fields that change on every production call and carry no meaning for the
# model. These are exactly what `hash(raw_prompt)` ends up keying on.
VOLATILE = {"req", "run_id", "ts", "attempt"}
# Six synthetic prompts from one agent session. Same three questions, asked
# more than once, each wrapped in a fresh envelope. p4 is a paraphrase of the
# refund question -- same meaning, different words.
PROMPTS = [
'{"req": "a1", "q": "What is the refund window?"}',
'{"req": "a2", "q": "what is the refund window?"}',
'{"req": "a3", "q": " What is the refund window? "}',
'{"req": "a4", "q": "How many days to return an item?"}',
'{"req": "a5", "q": "What is the shipping fee?"}',
'{"req": "a6", "q": "What is the SHIPPING fee?"}',
]
def make_llm():
"""A pure stand-in for a paid LLM call. Records what it was asked."""
asked = []
def llm(prompt):
asked.append(prompt)
return "ans:" + str(len(prompt) % 7)
return llm, asked
def naive_key(prompt):
# What `hash(prompt)` buckets on: the raw string, envelope and all.
return prompt
def normalized_key(prompt):
# Drop the volatile envelope, then canonicalize the question itself.
obj = json.loads(prompt)
stable = {k: v for k, v in obj.items() if k not in VOLATILE}
stable["q"] = stable.get("q", "").lower().strip()
return json.dumps(stable, sort_keys=True)
def run(prompts, key_fn):
llm, asked = make_llm()
cache = {}
hits = 0
for p in prompts:
k = key_fn(p)
if k in cache:
hits += 1
else:
cache[k] = llm(p)
return len(asked), hits
def main():
n = len(PROMPTS)
naive_calls, naive_hits = run(PROMPTS, naive_key)
norm_calls, norm_hits = run(PROMPTS, normalized_key)
assert naive_calls == 6, naive_calls
assert naive_hits == 0, naive_hits
assert norm_calls == 3, norm_calls
assert norm_hits == 3, norm_hits
print(f"NAIVE : prompts={n} llm_calls={naive_calls} cache_hits={naive_hits} (raw key: every req id is unique)")
print(f"NORMALIZED : prompts={n} llm_calls={norm_calls} cache_hits={norm_hits} (drop volatile envelope, key on q)")
print(f"SAVED : {naive_calls - norm_calls} calls ({naive_calls} -> {norm_calls}) on this fixture set")
print("CEILING : 1 paraphrase pair STILL double-charged")
print(" p4 'how many days to return an item?' == p1 'what is the refund window?' in meaning,")
print(" but string-normalized keys differ -> needs embedding dedup (network, out of scope)")
print("asserts: OK")
if __name__ == "__main__":
main()
Run it, and the output is:
NAIVE : prompts=6 llm_calls=6 cache_hits=0 (raw key: every req id is unique)
NORMALIZED : prompts=6 llm_calls=3 cache_hits=3 (drop volatile envelope, key on q)
SAVED : 3 calls (6 -> 3) on this fixture set
CEILING : 1 paraphrase pair STILL double-charged
p4 'how many days to return an item?' == p1 'what is the refund window?' in meaning,
but string-normalized keys differ -> needs embedding dedup (network, out of scope)
asserts: OK
Read the output, including the part it gets wrong
The naive line is the headline. Six prompts, six calls, zero hits. Every prompt shares the same three questions, but each carries a unique req id, so the raw string differs every time and the cache never recognizes a repeat. That is the mechanism the title points at: when every call carries a unique envelope, a raw prompt key never recognizes a repeat. The fixtures show the mechanism; they are not a measured production hit rate.
The normalized line drops the req envelope and keys on the lowercased, stripped question. Now a1, a2, and a3 collapse to one entry (notice the second and third deliberately vary the casing and whitespace, and the key still matches), and a5/a6 collapse to another. Three real calls, three hits. On this fixture set that is 6 calls down to 3.
Then the CEILING line, which is the reason I wrote the demo this way. Prompt 4 asks “How many days to return an item?” That is the same question as “What is the refund window?” in plain English. A human dedupes them on sight. The normalized string key does not, because the words differ, so it spends a separate call. The script prints that miss instead of hiding it. To catch it you need embedding similarity, which means a model call over the network, which is a different and heavier system than a string key. That is out of scope here on purpose. The honest claim is narrow: normalizing the key kills the duplicate-envelope misses. It does nothing for paraphrases.
On the size of the win in real traffic, I will not give you a percentage. It depends entirely on your repeat ratio, how often the same logical question actually recurs. On repeat-run corpora like mine the recurrence is high and the win is real. On a workload where every prompt is genuinely unique, normalizing the key saves you nothing and adds a parse. The lever only pays when there are repeats to catch.
This is not the idempotency problem, and not provider prompt caching
Two things sit right next to this and are not it. Worth separating, because mixing them up leads to the wrong tool.
The first is idempotency. I wrote about an at-most-once ledger for tool calls earlier in this series, and it also revolves around a dedup key, so it looks like a cousin. It is not. That ledger guards side effects: a tool call that charges a card or posts a record, where running it twice does real damage, so you record the result and replay it to guarantee the action happens at most once. The subject there is a write, and the goal is correctness. Here the subject is a read, an LLM completion with no side effect, and the goal is cost. A ledger replays a recorded result to protect an action. A cache memoizes a result to avoid paying for it again. Same data structure, opposite reason for existing.
The second is the provider’s own prompt cache, which is a genuinely different mechanism. Anthropic’s prompt caching documentation describes it precisely: “Prompt caching references the entire prompt - tools, system, and messages (in that order) up to and including the block designated with cache_control.” It caches a prefix of the request to cut the cost of re-sending a long shared context, and the same page notes “By default, the cache has a 5-minute lifetime.” That is a prefix cache with a short, refreshing TTL. It cuts the input cost of a shared prefix, but it never avoids the call itself: even on a cache hit you still pay for the completion. And because it matches from the start of the prompt, it only helps when the prefix is stable. The volatile field this post is about sits at the front of the prompt, so it breaks the prefix match on every call. The client-side normalized key complements it. It does not replace it.
What I would actually ship
Start by writing the volatile set down. Open your real prompt payloads and list every field that changes run to run without changing the question: run ids, timestamps, attempt counters, trace ids, session tokens. That list is the whole game. Everything in it gets dropped before you hash. Everything not in it is load-bearing, so dropping it by accident serves a wrong cached answer, which is worse than a miss.
Then canonicalize what survives. Lowercase, strip, sort object keys, and pick a stable serialization so logically identical payloads produce one key. Hash that, not the raw string. One caveat the demo skips: normalized_key assumes a JSON-shaped prompt and will raise on free text, so in real code you need a parse-fail fallback (treat a non-JSON prompt as its own stable string). If you want to be careful, log the key alongside the call for a day and eyeball whether two genuinely different questions ever collide. That is the failure mode to watch, and it is cheaper to catch in a log than in a support ticket.
What I am still unsure about is where to draw the normalization line. Strip too little and the envelope leaks back in and you are at zero hits again. Strip too much and distinct questions merge. There is no clean rule for the boundary that I trust across workloads. And paraphrase dedup, the p4 case, genuinely needs embeddings, which buys you a real win and a real bill. I have not measured whether that trade pays off at small scale. If you have, I want to hear it.
If you are optimizing the same axis from the other direction, which model tier is actually cheapest per successful task, I worked through that math in a separate post on cost per success. Picking the cheaper call and not paying for the same call twice are two different levers on the same bill.
Written by me, an engineer who runs production scrapers; drafted with AI assistance and fact-checked against my own Apify run dashboard and the linked Anthropic docs. The script is mine and the output is real (python3 -I prompt_cache_key.py). The six prompts are synthetic fixtures, not a real prompt log.
Follow along for the numbers from the next batch of runs. And tell me: what is in your volatile-field set, and have you ever watched a normalized key collide two questions that should have stayed apart? I read every comment.
More production scraping tips: t.me/scraping_ai