Drop-in middleware for context window management in AI SDK agents

I’ve been building multi-turn agents (50-200+ turn sessions) with AI SDK and kept running into the same context window headaches — history blowing past the token budget, tool results eating up half the window, manually tracking token counts on every turn.

I tried handling these one at a time, but they’re all interconnected, so I ended up building a middleware that wraps your model and handles them transparently. No changes to your existing code — generateText, streamText, and agent tool loops all work the same:


```ts
import { withContextChef } from '@context-chef/ai-sdk-middleware';
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

const model = withContextChef(openai('gpt-4o'), {
  contextWindow: 128_000,
  compress: { model: openai('gpt-4o-mini') },
  truncate: { threshold: 5000, headChars: 500, tailChars: 1000 },
});

// everything below stays exactly the same
const result = await generateText({
  model,
  messages: conversationHistory,
  tools: myTools,
});
```

What happens under the hood

History compression — when conversation exceeds the token budget, older messages get summarized by a cheap model (gpt-4o-mini). Recent messages are preserved. There’s a circuit breaker built in — if the compression model fails 3 times in a row, it stops trying and passes history through unchanged instead of crashing your agent mid-session.
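The fail-soft behavior can be sketched roughly like this (class and method names are mine, not the library’s internals):

```ts
// Hypothetical sketch of the pass-through circuit breaker: after three
// consecutive summarization failures, stop calling the compression model
// and hand history through unchanged.
type Message = { role: string; content: string };

class CompressionBreaker {
  private failures = 0;
  constructor(private readonly maxFailures = 3) {}

  async compress(
    history: Message[],
    summarize: (msgs: Message[]) => Promise<Message[]>,
  ): Promise<Message[]> {
    // Breaker open: skip compression entirely.
    if (this.failures >= this.maxFailures) return history;
    try {
      const compressed = await summarize(history);
      this.failures = 0; // success resets the counter
      return compressed;
    } catch {
      this.failures += 1;
      return history; // fail soft: never crash the agent mid-session
    }
  }
}
```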

Tool result truncation — large outputs (terminal logs, API responses) are automatically truncated with head + tail preservation. The first 500 chars (command + initial output) and last 1000 chars (errors + final result) are kept, snapped to line boundaries. Optionally you can persist the full output to a storage adapter so the LLM can retrieve it later via a context://vfs/ URI.


```ts
truncate: {
  threshold: 5000,
  headChars: 500,
  tailChars: 1000,
  storage: new FileSystemAdapter('.context_vfs'), // optional
},
```
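The head + tail strategy with line-boundary snapping looks roughly like this (a sketch; the function name and exact marker format are mine):

```ts
// Keep the first ~headChars and last ~tailChars of an oversized tool result,
// snapping both cuts to newline boundaries so no line is split mid-way.
function truncateHeadTail(
  text: string,
  threshold = 5000,
  headChars = 500,
  tailChars = 1000,
): string {
  if (text.length <= threshold) return text;
  // Snap the head to the last newline within the head budget.
  let headEnd = text.lastIndexOf('\n', headChars);
  if (headEnd === -1) headEnd = headChars;
  // Snap the tail to the first newline at or after the tail cut point.
  let tailStart = text.indexOf('\n', text.length - tailChars);
  if (tailStart === -1) tailStart = text.length - tailChars;
  const omitted = tailStart - headEnd;
  return `${text.slice(0, headEnd)}\n…[${omitted} chars truncated]…\n${text.slice(tailStart + 1)}`;
}
```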

Automatic token tracking — extracts token usage from generateText/streamText responses and feeds it back to the compression engine. No manual reportTokenUsage() calls.
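Conceptually the feedback loop looks like this (a minimal sketch; the class, names, and usage shape here are my assumptions, not the library’s internals — the 80% preserve ratio matches the default mentioned below):

```ts
// Track the latest reported token usage and decide when history has grown
// close enough to the context window to warrant compression.
type Usage = { totalTokens?: number };

class TokenBudget {
  private used = 0;
  constructor(private readonly contextWindow: number) {}

  record(usage: Usage) {
    // Fed automatically from generateText/streamText responses.
    if (usage.totalTokens !== undefined) this.used = usage.totalTokens;
  }

  needsCompression(preserveRatio = 0.8): boolean {
    return this.used > this.contextWindow * preserveRatio;
  }
}
```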

Compact — zero-LLM-cost pruning via AI SDK’s pruneMessages. Strips reasoning blocks and old tool calls mechanically:


```ts
compact: {
  reasoning: 'all',
  toolCalls: 'before-last-message',
},
```

The middleware is stateful — it tracks usage across calls to know when compression is needed, so create one wrapped model per conversation/session.
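One way to enforce the one-model-per-session rule, sketched generically (the `perSession` helper is mine, not part of the library):

```ts
// Cache one stateful instance per session id, creating lazily on first use.
function perSession<T>(create: () => T) {
  const cache = new Map<string, T>();
  return (id: string): T => {
    let instance = cache.get(id);
    if (instance === undefined) {
      instance = create();
      cache.set(id, instance);
    }
    return instance;
  };
}
```

You’d pass a factory that calls `withContextChef(...)`, so each conversation gets its own usage-tracking state.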

Beyond the middleware

If you need more control, the underlying @context-chef/core library also handles:

  • Dynamic state injection (Zod-validated task state at optimal message position to prevent state drift)

  • Tool namespaces + lazy loading (two-layer architecture for 30+ tools without hallucination)

  • Cross-session memory with TTL

  • Snapshot & restore for branching/error recovery

  • Multi-provider compilation (same code → OpenAI / Anthropic / Gemini payloads)

Curious how others here are managing context in long-running AI SDK agents. Are you doing your own compression? Just truncating? Using pruneMessages directly? Would love to hear what’s working for you and if there are middleware hooks you’d want that this doesn’t cover yet.

This is really interesting! I was wondering how it handles prompt caching (minimizing misses)?

Great question! Prompt caching was a key design consideration.

The message structure is layered: [System Prompt] → [Memory] → [Compressed History] → [Dynamic State]. Only the history layer is subject to compression — the system prompt is frozen and never touched, so your KV-cache prefix stays stable.

A few other mechanisms that help minimize misses:

  • Compression suppression — after compression fires, the next check is automatically skipped, preventing back-to-back compressions from thrashing the cache
  • Turn-boundary splitting — compression only splits on complete turns (user / assistant+tool groups), never breaks tool_call + tool_result pairs
  • Deterministic serialization — message keys are lexicographically sorted before serialization, so identical content always produces identical bytes
  • Mechanical compaction — thinking blocks and stale tool results are cleared at zero LLM cost before the heavier summarization step, freeing tokens without changing message structure
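The deterministic-serialization idea above can be sketched as recursive sorted-key stringification (the helper name is mine):

```ts
// JSON.stringify preserves insertion order of object keys, so two objects
// with identical content can serialize differently. Sorting keys at every
// level makes identical content produce identical bytes.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort() // lexicographic key order
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value); // primitives and null
}
```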

The honest tradeoff: when compression does fire, everything after the system prompt changes, so that turn will cache-miss. But with a high preserveRatio (default 80%) + compression suppression, compression is a rare event — not a per-turn one.

Thanks for sharing! 🙂
