Your Agent Doesn't Run Out of Context. It Degrades at 79%


The first time one of my long agent sessions went sideways, I went looking for the wrong thing. I grepped for context window exceeded. For a stack trace. For a 400 from the API. There was nothing. The run finished. It just finished worse than it started — sloppier tool calls, a step that re-derived something it had already figured out, an answer that ignored a constraint stated twenty steps earlier.

No crash. No error. The agent didn’t run out of context. It got dumber on the way there.

Here’s the part that took me too long to accept: the damage starts well before the window is full. On a synthetic model of step success vs. how full the window is, the reliability floor gets crossed at 79% occupancy — not at 100% overflow. By the time you see exceeded, the agent has already spent a chunk of the session quietly making weaker steps.

The short version

If you build agents that run for many steps, watch the fraction of the window in use, not just whether it overflows. Step quality holds flat while the window fills, then bends down past a knee around 70–80%. A deterministic handoff (summarize, checkpoint, compact occupancy back down) at a threshold below that knee keeps the whole session above the line. In the model below, it turns 31 good steps out of 50 into 50 out of 50, at the cost of 2 compactions.

That’s the whole idea. The rest is the artifact and the honesty about what it does and doesn’t prove.

The artifact first

Here’s the output I’m going to talk about. One synthetic session, two runs. Naive first, then with a handoff gate:

=== Naive long session (no handoff, 'hope it fits') ===
steps that held above the floor: 31/50
first step below floor at occupancy: 81%
compactions: 0

=== Deterministic handoff at 70% occupancy ===
steps that held above the floor: 50/50
first step below floor at occupancy: None
compactions: 2

reliability floor (0.80) crossed at occupancy: 78.6%
P(step ok) at 50% occupancy: 0.97
P(step ok) at 70% occupancy: 0.90
P(step ok) at 85% occupancy: 0.72
P(step ok) at 95% occupancy: 0.61

Read the bottom block bottom-up. At half a window, a step succeeds 97% of the time. At 70%, still 90%. At 85% it’s down to 72%, and at 95% it’s a coin-flip-and-a-half: 0.61. The curve doesn’t fall off a wall. It bends. That bend is the thing nobody puts in their logs.

The floor — the line where I’d actually trust a step in production — gets crossed at 78.6% occupancy. The naive session’s first failed step lands at 81%, because steps fall on a discrete occupancy grid and the first one past the line happens to sit at 81%, not exactly on 78.6%. Same story, one number is the curve, the other is where a real step happened to land.

The model, in full

This is the whole script. stdlib-only, no network, no randomness, no clock. Run python3 -I occupancy_handoff.py and you get the bytes above, every time.

#!/usr/bin/env python3
"""Occupancy -> step-success model for a long agent session.

Synthetic fixture. Numbers chosen to exercise the mechanism, NOT a vendor
benchmark of any model. The point is the SHAPE: an agent step's success
probability stays flat while the context window fills, then bends down past
a knee -- so reliability drops well before the window overflows.

Deterministic by construction: no network, no RNG, no clock. Re-running
this file produces byte-identical stdout. Run it yourself:

    python3 -I occupancy_handoff.py
"""


def step_success_prob(occ):
    """P(a single agent step succeeds) as a function of context occupancy.

    occ is the fraction of the window in use, 0.0 .. 1.0.
    Piecewise: flat until a knee, gentle slope, then a steep decline.
    """
    if occ <= 0.50:
        return 0.97
    if occ <= 0.70:
        # gentle: 0.97 -> 0.90 across 0.50 .. 0.70
        return 0.97 - (occ - 0.50) * (0.07 / 0.20)
    # steep: 0.90 -> 0.55 across 0.70 .. 1.00
    return 0.90 - (occ - 0.70) * (0.35 / 0.30)


WINDOW = 100_000          # synthetic token budget for the window
RELIABILITY_FLOOR = 0.80  # "a step we'd trust in prod"
STEPS = 50                # session length
TOKENS_PER_STEP = 2_600   # 50 * 2600 = 130k > 100k -> naive run overflows


def run_session(handoff_threshold=None, compact_to=0.30):
    """Walk a session of STEPS steps. Each step consumes TOKENS_PER_STEP.

    A step "holds" if step_success_prob(occ) >= RELIABILITY_FLOOR.
    With a handoff_threshold set, reaching it triggers a deterministic
    compaction (summary + checkpoint) that resets occupancy to compact_to
    BEFORE the step runs -- the long task becomes a short-context one again.
    """
    used = 0
    held = 0
    first_fail_occ = None
    handoffs = 0
    for _ in range(STEPS):
        occ = used / WINDOW
        if handoff_threshold is not None and occ >= handoff_threshold:
            used = int(WINDOW * compact_to)
            occ = used / WINDOW
            handoffs += 1
        p = step_success_prob(occ)
        if p >= RELIABILITY_FLOOR:
            held += 1
        elif first_fail_occ is None:
            first_fail_occ = occ
        used += TOKENS_PER_STEP
    return held, first_fail_occ, handoffs


def floor_crossing():
    """Occupancy where the success curve crosses the floor (bisection)."""
    lo, hi = 0.70, 1.00
    for _ in range(60):
        mid = (lo + hi) / 2
        if step_success_prob(mid) >= RELIABILITY_FLOOR:
            lo = mid
        else:
            hi = mid
    return lo


def fmt_occ(occ):
    return "None" if occ is None else f"{occ:.0%}"


def main():
    held_a, fail_a, hand_a = run_session(handoff_threshold=None)
    print("=== Naive long session (no handoff, 'hope it fits') ===")
    print(f"steps that held above the floor: {held_a}/{STEPS}")
    print(f"first step below floor at occupancy: {fmt_occ(fail_a)}")
    print(f"compactions: {hand_a}")
    print()

    held_b, fail_b, hand_b = run_session(handoff_threshold=0.70, compact_to=0.30)
    print("=== Deterministic handoff at 70% occupancy ===")
    print(f"steps that held above the floor: {held_b}/{STEPS}")
    print(f"first step below floor at occupancy: {fmt_occ(fail_b)}")
    print(f"compactions: {hand_b}")
    print()

    print(f"reliability floor ({RELIABILITY_FLOOR:.2f}) crossed at occupancy: "
          f"{floor_crossing():.1%}")
    for occ in (0.50, 0.70, 0.85, 0.95):
        print(f"P(step ok) at {occ:.0%} occupancy: {step_success_prob(occ):.2f}")


if __name__ == "__main__":
    main()

The mechanism that matters is in run_session. The naive run just keeps appending: 50 steps × 2,600 tokens = 130k against a 100k window, so it overflows, and 19 of its steps land in the degraded zone past the floor. The handoff run checks occupancy before each step. When it hits 70%, it compacts back to 30% (a summary plus a checkpoint, the long task folded back into a short one) and then takes the step on a half-empty window. Two compactions over 50 steps. Every step stays above the line.

The threshold is set below the knee on purpose. 70% gives you margin: by the time you’d start losing steps, you’ve already reset.

Why I trust the shape (and what I’m not claiming)

I’ll be blunt about what that script is. It’s a synthetic fixture. I chose the curve and the floor to make the mechanism legible (flat, knee, decline), not to measure GPT or Claude or anyone’s model. If you want a vendor benchmark, this isn’t it, and I’d be lying if I dressed it up as one.

What’s real is the shape, and where it came from. Across 2,190 production runs on our Apify actors, 962 of them on a single Trustpilot scraper that’s been hammered in prod, the pattern I kept seeing wasn’t a clean overflow crash. It was quality sliding before the wall: longer-running jobs producing weaker, less consistent steps while every log line stayed green. No exception, no 400, nothing to grep. Just worse.

I won’t hand you a single magic percentage from that, because I don’t have one. Across tasks and models the knee moved around, somewhere in the 70–80% band depending on the job. In my first version of this I assumed the failure mode was overflow and I instrumented for the wrong event entirely. I watched for the cliff. The damage was already happening on the ramp.

You could push back here, and you’d be right to: “a 50-step toy loop with a hand-drawn curve proves nothing about a real model.” Agreed. It doesn’t. What the script earns is a cheap, exact way to reason about the interaction: a fixed success curve plus a fill rate plus a gate, and you can watch the gate change the outcome without a single token spent. The shape is the claim I’m standing behind, sourced from production. The script is just the cleanest way I found to show what a gate does to it.

The part with an actual citation

This isn’t just my pattern-matching. In October 2025, Du, Tian, Ronanki and seven co-authors published Context Length Alone Hurts LLM Performance Despite Perfect Retrieval (arXiv:2510.05381). Their finding is the academic version of what the logs were telling me: performance drops 13.9% to 85% as input length grows, even when the model can perfectly retrieve every relevant fact. So it isn’t “the right info got lost.” The occupied space itself does the damage.

And their fix rhymes with the handoff. Their mitigation is to make the model “recite the retrieved evidence before attempting to solve the problem” — which, they note, “transforms a long-context task into a short-context one.” That’s compaction by another name. They proved length alone hurts; the gate above is one way to stop paying for it in a running agent, before overflow rather than after.

How this is different from the context tax

If you read my earlier piece on the context tax, this looks adjacent, and it’s worth saying how it’s not the same thing. That one was about cost: re-reading the same pages inflates your token bill on a long session. This one is about quality: the same occupied space lowers your odds of a correct step, and it does that earlier than it wrecks your budget. One drains the wallet. The other drains the accuracy, quietly, and you find out last.

So you actually have two reasons to compact, not one. The bill is the loud reason. The 79% is the quiet one.

What I’d do Monday

Track occupancy as a first-class metric on long sessions. Pick a handoff threshold below your knee. I’d start at 70% and tune. When you hit it, compact deterministically: summarize the state, checkpoint what matters, drop occupancy, keep going. Don’t wait for exceeded. By then you’ve already shipped a fistful of degraded steps with nothing in the logs to show for it.

Here’s what I still don’t know, and where I’d love a second opinion: at what occupancy does your agent start getting dumber, on your task? And when you find it: do you compact on a threshold, or are you still hoping it fits?


I write up the numbers from real production runs, the inconvenient ones included. Follow for the next one, and tell me in the comments where your agents start to slip: the occupancy number where step quality drops, on a task you actually run.


More production scraping tips: t.me/scraping_ai