I’ve been building multi-turn agents (50-200+ turn sessions) with AI SDK and kept running into the same context window headaches — history blowing past the token budget, tool results eating up half the window, manually tracking token counts on every turn.
I tried handling these one at a time, but they're all interconnected, so I ended up building a middleware that wraps your model and handles them transparently. No changes to your existing code: generateText, streamText, and agent tool loops all work the same.
```ts
import { withContextChef } from '@context-chef/ai-sdk-middleware';
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

const model = withContextChef(openai('gpt-4o'), {
  contextWindow: 128_000,
  compress: { model: openai('gpt-4o-mini') },
  truncate: { threshold: 5000, headChars: 500, tailChars: 1000 },
});

// everything below stays exactly the same
const result = await generateText({
  model,
  messages: conversationHistory,
  tools: myTools,
});
```
What happens under the hood
History compression — when conversation exceeds the token budget, older messages get summarized by a cheap model (gpt-4o-mini). Recent messages are preserved. There’s a circuit breaker built in — if the compression model fails 3 times in a row, it stops trying and passes history through unchanged instead of crashing your agent mid-session.
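To make the fail-open behavior concrete, here's a minimal sketch of that circuit breaker in plain TypeScript. `Summarize` stands in for the compression-model call, and all names here are illustrative, not the middleware's internals:

```typescript
type Summarize = (messages: string[]) => Promise<string[]>;

class CompressionBreaker {
  private consecutiveFailures = 0;

  constructor(private summarize: Summarize, private maxFailures = 3) {}

  async compress(messages: string[]): Promise<string[]> {
    // Once tripped, skip the LLM call entirely and pass history through.
    if (this.consecutiveFailures >= this.maxFailures) return messages;
    try {
      const summarized = await this.summarize(messages);
      this.consecutiveFailures = 0; // any success resets the counter
      return summarized;
    } catch {
      this.consecutiveFailures++;
      return messages; // fail open: unchanged history, agent keeps running
    }
  }
}
```

The point of failing open rather than throwing is that a flaky summarizer degrades you to "history is a bit long" instead of killing a 150-turn session.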
Tool result truncation — large outputs (terminal logs, API responses) are automatically truncated with head + tail preservation. The first 500 chars (command + initial output) and last 1000 chars (errors + final result) are kept, snapped to line boundaries. Optionally you can persist the full output to a storage adapter so the LLM can retrieve it later via a context://vfs/ URI.
```ts
truncate: {
  threshold: 5000,
  headChars: 500,
  tailChars: 1000,
  storage: new FileSystemAdapter('.context_vfs'), // optional
},
```
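For anyone curious what head + tail preservation with line-boundary snapping looks like, here's an illustrative standalone version (not the library's actual implementation):

```typescript
function truncateHeadTail(
  text: string,
  threshold = 5000,
  headChars = 500,
  tailChars = 1000,
): string {
  if (text.length <= threshold) return text;

  // Take the head, then snap back to the last complete line inside it.
  let head = text.slice(0, headChars);
  const headBreak = head.lastIndexOf('\n');
  if (headBreak > 0) head = head.slice(0, headBreak);

  // Take the tail, then snap forward to the first complete line inside it.
  let tail = text.slice(-tailChars);
  const tailBreak = tail.indexOf('\n');
  if (tailBreak >= 0) tail = tail.slice(tailBreak + 1);

  const omitted = text.length - head.length - tail.length;
  return `${head}\n… [${omitted} chars truncated] …\n${tail}`;
}
```

Snapping to line boundaries matters for terminal output: the model sees whole log lines rather than a line sliced mid-token.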
Automatic token tracking — extracts token usage from generateText/streamText responses and feeds it back to the compression engine. No manual reportTokenUsage() calls.
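Conceptually the tracking step is just: read the usage object each call returns and compare a running total against the budget. A rough sketch (the tracker class and the 0.8 headroom factor are my illustration, not the middleware's actual logic; usage field names also vary between AI SDK versions):

```typescript
interface Usage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
}

class TokenTracker {
  private lastTotal = 0;

  constructor(private contextWindow: number, private headroom = 0.8) {}

  // Called with result.usage after each generateText/streamText call.
  record(usage: Usage): void {
    this.lastTotal = usage.totalTokens;
  }

  // Compress once the last call crossed the headroom threshold.
  needsCompression(): boolean {
    return this.lastTotal > this.contextWindow * this.headroom;
  }
}
```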
Compact — zero-LLM-cost pruning via AI SDK’s pruneMessages. Strips reasoning blocks and old tool calls mechanically:
```ts
compact: {
  reasoning: 'all',
  toolCalls: 'before-last-message',
},
```
The middleware is stateful — it tracks usage across calls to know when compression is needed, so create one wrapped model per conversation/session.
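One pattern for the per-session requirement is a small cache keyed by session id. In this sketch `createWrappedModel` stands in for a `withContextChef(...)` call; the cache itself is my suggestion, not part of the package:

```typescript
function sessionModelCache<M>(createWrappedModel: () => M) {
  const models = new Map<string, M>();
  return (sessionId: string): M => {
    let model = models.get(sessionId);
    if (model === undefined) {
      model = createWrappedModel(); // fresh stateful wrapper per session
      models.set(sessionId, model);
    }
    return model;
  };
}
```

Note this never evicts; in a long-lived server you'd want to drop entries when a session ends.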
Beyond the middleware
If you need more control, the underlying @context-chef/core library also handles:
- Dynamic state injection (Zod-validated task state at optimal message position to prevent state drift)
- Tool namespaces + lazy loading (two-layer architecture for 30+ tools without hallucination)
- Cross-session memory with TTL
- Snapshot & restore for branching/error recovery
- Multi-provider compilation (same code → OpenAI / Anthropic / Gemini payloads)
Links
Curious how others here are managing context in long-running AI SDK agents. Are you doing your own compression? Just truncating? Using pruneMessages directly? Would love to hear what’s working for you and if there are middleware hooks you’d want that this doesn’t cover yet.