Drop-in middleware for context window management in AI SDK agents

I’ve been building multi-turn agents (50-200+ turn sessions) with AI SDK and kept running into the same context window headaches — history blowing past the token budget, tool results eating up half the window, manually tracking token counts on every turn.

I tried handling these one at a time, but they’re all interconnected, so I ended up building a middleware that wraps your model and handles them transparently. No changes to your existing code — generateText, streamText, and agent tool loops all work the same:


```ts
import { withContextChef } from '@context-chef/ai-sdk-middleware';
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

const model = withContextChef(openai('gpt-4o'), {
  contextWindow: 128_000,
  compress: { model: openai('gpt-4o-mini') },
  truncate: { threshold: 5000, headChars: 500, tailChars: 1000 },
});

// everything below stays exactly the same
const result = await generateText({
  model,
  messages: conversationHistory,
  tools: myTools,
});
```

What happens under the hood

History compression — when conversation exceeds the token budget, older messages get summarized by a cheap model (gpt-4o-mini). Recent messages are preserved. There’s a circuit breaker built in — if the compression model fails 3 times in a row, it stops trying and passes history through unchanged instead of crashing your agent mid-session.
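The fail-soft behavior can be sketched roughly like this (class and method names are mine, not the library’s internals):

```ts
// Hypothetical sketch of the pass-through circuit breaker: after three
// consecutive summarization failures, stop calling the compression model
// and hand history through unchanged.
type Message = { role: string; content: string };

class CompressionBreaker {
  private failures = 0;
  constructor(private readonly maxFailures = 3) {}

  async compress(
    history: Message[],
    summarize: (msgs: Message[]) => Promise<Message[]>,
  ): Promise<Message[]> {
    // Breaker open: skip compression entirely.
    if (this.failures >= this.maxFailures) return history;
    try {
      const compressed = await summarize(history);
      this.failures = 0; // success resets the counter
      return compressed;
    } catch {
      this.failures += 1;
      return history; // fail soft: never crash the agent mid-session
    }
  }
}
```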

Tool result truncation — large outputs (terminal logs, API responses) are automatically truncated with head + tail preservation. The first 500 chars (command + initial output) and last 1000 chars (errors + final result) are kept, snapped to line boundaries. Optionally you can persist the full output to a storage adapter so the LLM can retrieve it later via a context://vfs/ URI.


```ts
truncate: {
  threshold: 5000,
  headChars: 500,
  tailChars: 1000,
  storage: new FileSystemAdapter('.context_vfs'), // optional
},
```
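The head + tail strategy with line-boundary snapping looks roughly like this (a sketch; the function name and exact marker format are mine):

```ts
// Keep the first ~headChars and last ~tailChars of an oversized tool result,
// snapping both cuts to newline boundaries so no line is split mid-way.
function truncateHeadTail(
  text: string,
  threshold = 5000,
  headChars = 500,
  tailChars = 1000,
): string {
  if (text.length <= threshold) return text;
  // Snap the head to the last newline within the head budget.
  let headEnd = text.lastIndexOf('\n', headChars);
  if (headEnd === -1) headEnd = headChars;
  // Snap the tail to the first newline at or after the tail cut point.
  let tailStart = text.indexOf('\n', text.length - tailChars);
  if (tailStart === -1) tailStart = text.length - tailChars;
  const omitted = tailStart - headEnd;
  return `${text.slice(0, headEnd)}\n…[${omitted} chars truncated]…\n${text.slice(tailStart + 1)}`;
}
```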

Automatic token tracking — extracts token usage from generateText/streamText responses and feeds it back to the compression engine. No manual reportTokenUsage() calls.
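Conceptually the feedback loop looks like this (a minimal sketch; the class, names, and usage shape here are my assumptions, not the library’s internals — the 80% preserve ratio matches the default mentioned below):

```ts
// Track the latest reported token usage and decide when history has grown
// close enough to the context window to warrant compression.
type Usage = { totalTokens?: number };

class TokenBudget {
  private used = 0;
  constructor(private readonly contextWindow: number) {}

  record(usage: Usage) {
    // Fed automatically from generateText/streamText responses.
    if (usage.totalTokens !== undefined) this.used = usage.totalTokens;
  }

  needsCompression(preserveRatio = 0.8): boolean {
    return this.used > this.contextWindow * preserveRatio;
  }
}
```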

Compact — zero-LLM-cost pruning via AI SDK’s pruneMessages. Strips reasoning blocks and old tool calls mechanically:


```ts
compact: {
  reasoning: 'all',
  toolCalls: 'before-last-message',
},
```

The middleware is stateful — it tracks usage across calls to know when compression is needed, so create one wrapped model per conversation/session.
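One way to enforce the one-model-per-session rule, sketched generically (the `perSession` helper is mine, not part of the library):

```ts
// Cache one stateful instance per session id, creating lazily on first use.
function perSession<T>(create: () => T) {
  const cache = new Map<string, T>();
  return (id: string): T => {
    let instance = cache.get(id);
    if (instance === undefined) {
      instance = create();
      cache.set(id, instance);
    }
    return instance;
  };
}
```

You’d pass a factory that calls `withContextChef(...)`, so each conversation gets its own usage-tracking state.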

Beyond the middleware

If you need more control, the underlying @context-chef/core library also handles:

  • Dynamic state injection (Zod-validated task state at optimal message position to prevent state drift)

  • Tool namespaces + lazy loading (two-layer architecture for 30+ tools without hallucination)

  • Cross-session memory with TTL

  • Snapshot & restore for branching/error recovery

  • Multi-provider compilation (same code → OpenAI / Anthropic / Gemini payloads)

Curious how others here are managing context in long-running AI SDK agents. Are you doing your own compression? Just truncating? Using pruneMessages directly? Would love to hear what’s working for you and if there are middleware hooks you’d want that this doesn’t cover yet.

This is really interesting! I was wondering how it handles prompt caching (minimizing misses)?

Great question! Prompt caching was a key design consideration.

The message structure is layered: [System Prompt] → [Memory] → [Compressed History] → [Dynamic State]. Only the history layer is subject to compression — the system prompt is frozen and never touched, so your KV-cache prefix stays stable.

A few other mechanisms that help minimize misses:

  • Compression suppression — after compression fires, the next check is automatically skipped, preventing back-to-back compressions from thrashing the cache
  • Turn-boundary splitting — compression only splits on complete turns (user / assistant+tool groups), never breaks tool_call + tool_result pairs
  • Deterministic serialization — message keys are lexicographically sorted before serialization, so identical content always produces identical bytes
  • Mechanical compaction — thinking blocks and stale tool results are cleared at zero LLM cost before the heavier summarization step, freeing tokens without changing message structure
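The deterministic-serialization idea above can be sketched as recursive sorted-key stringification (the helper name is mine):

```ts
// JSON.stringify preserves insertion order of object keys, so two objects
// with identical content can serialize differently. Sorting keys at every
// level makes identical content produce identical bytes.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort() // lexicographic key order
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value); // primitives and null
}
```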

The honest tradeoff: when compression does fire, everything after the system prompt changes, so that turn will cache-miss. But with a high preserveRatio (default 80%) + compression suppression, compression is a rare event — not a per-turn one.

Thanks for sharing! 🙂
