[AI SDK](/c/ai-sdk/62)

# Drop-in middleware for context window management in AI SDK agents

m13t (@daoquqiexing-7496) · 2026-04-09

I've been building multi-turn agents (50–200+ turn sessions) with the AI SDK and kept running into the same context window headaches: history blowing past the token budget, tool results eating up half the window, manually tracking token counts on every turn. I tried handling these one at a time, but they're all interconnected, so I ended up building a middleware that wraps your model and handles all of them transparently.

No changes to your existing code — `generateText`, `streamText`, and agent tool loops all work the same:

```typescript
import { withContextChef } from '@context-chef/ai-sdk-middleware';
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

const model = withContextChef(openai('gpt-4o'), {
  contextWindow: 128_000,
  compress: { model: openai('gpt-4o-mini') },
  truncate: { threshold: 5000, headChars: 500, tailChars: 1000 },
});

// everything below stays exactly the same
const result = await generateText({
  model,
  messages: conversationHistory,
  tools: myTools,
});
```

## What happens under the hood

**History compression** — when the conversation exceeds the token budget, older messages get summarized by a cheap model (gpt-4o-mini) while recent messages are preserved. There's a circuit breaker built in: if the compression model fails 3 times in a row, it stops trying and passes history through unchanged instead of crashing your agent mid-session.

**Tool result truncation** — large outputs (terminal logs, API responses) are automatically truncated with head + tail preservation. The first 500 chars (command + initial output) and last 1000 chars (errors + final result) are kept, snapped to line boundaries.
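For anyone curious what "head + tail preservation snapped to line boundaries" means concretely, here's a rough sketch of that kind of truncation. This is my own illustration, not the library's actual implementation, and `truncateHeadTail` is a hypothetical name:

```typescript
// Illustrative sketch only — not @context-chef's real code.
// Keeps the start and end of a large tool output, snapping both
// cut points to line boundaries so no line is split mid-way.
function truncateHeadTail(
  text: string,
  threshold: number,
  headChars: number,
  tailChars: number,
): string {
  if (text.length <= threshold) return text; // small outputs pass through

  // Head: take the first headChars, then snap back to the last full line.
  let head = text.slice(0, headChars);
  const headBreak = head.lastIndexOf('\n');
  if (headBreak > 0) head = head.slice(0, headBreak);

  // Tail: take the last tailChars, then snap forward to the next full line.
  let tail = text.slice(-tailChars);
  const tailBreak = tail.indexOf('\n');
  if (tailBreak >= 0) tail = tail.slice(tailBreak + 1);

  const omitted = text.length - head.length - tail.length;
  return `${head}\n...[${omitted} chars omitted]...\n${tail}`;
}
```

The snapping is what keeps the preserved head readable as complete log lines (command + first output) and the tail as complete lines too (stack trace + exit status), rather than cutting mid-line.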
Optionally, you can persist the full output to a storage adapter so the LLM can retrieve it later via a `context://vfs/` URI:

```typescript
truncate: {
  threshold: 5000,
  headChars: 500,
  tailChars: 1000,
  storage: new FileSystemAdapter('.context_vfs'), // optional
},
```

**Automatic token tracking** — extracts token usage from `generateText`/`streamText` responses and feeds it back to the compression engine. No manual `reportTokenUsage()` calls.

**Compact** — zero-LLM-cost pruning via the AI SDK's `pruneMessages`. Strips reasoning blocks and old tool calls mechanically:

```typescript
compact: {
  reasoning: 'all',
  toolCalls: 'before-last-message',
},
```

The middleware is stateful — it tracks usage across calls to know when compression is needed — so create one wrapped model per conversation/session.

## Beyond the middleware

If you need more control, the underlying [`@context-chef/core`](https://github.com/MyPrototypeWhat/context-chef) library also handles:

- Dynamic state injection (Zod-validated task state injected at an optimal message position to prevent state drift)
- Tool namespaces + lazy loading (a two-layer architecture for 30+ tools without hallucination)
- Cross-session memory with TTL
- Snapshot & restore for branching/error recovery
- Multi-provider compilation (the same code → OpenAI / Anthropic / Gemini payloads)

## Links

- npm: [`@context-chef/ai-sdk-middleware`](https://www.npmjs.com/package/@context-chef/ai-sdk-middleware)
- GitHub: [MyPrototypeWhat/context-chef](https://github.com/MyPrototypeWhat/context-chef)

---

Curious how others here are managing context in long-running AI SDK agents. Are you doing your own compression? Just truncating? Using `pruneMessages` directly? I'd love to hear what's working for you, and whether there are middleware hooks you'd want that this doesn't cover yet.

---

Urjit (@urjit) · 2026-04-11

This is really interesting! I was wondering: how does this handle prompt caching (minimizing cache misses)?