
# Drop-in middleware for context window management in AI SDK agents



m13t (@daoquqiexing-7496) · 2026-04-09 · ♥ 1

I've been building multi-turn agents (50-200+ turn sessions) with AI SDK and kept running into the same context window headaches — history blowing past the token budget, tool results eating up half the window, manually tracking token counts on every turn.

I tried handling these one at a time but they're all interconnected, so I ended up building a middleware that wraps your model and handles it transparently. No changes to your existing code — `generateText`, `streamText`, agent tool loops all work the same:

```typescript
import { withContextChef } from '@context-chef/ai-sdk-middleware';
import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

const model = withContextChef(openai('gpt-4o'), {
  contextWindow: 128_000,
  compress: { model: openai('gpt-4o-mini') },
  truncate: { threshold: 5000, headChars: 500, tailChars: 1000 },
});

// everything below stays exactly the same
const result = await generateText({
  model,
  messages: conversationHistory,
  tools: myTools,
});
```

## What happens under the hood

**History compression** — when conversation exceeds the token budget, older messages get summarized by a cheap model (gpt-4o-mini). Recent messages are preserved. There's a circuit breaker built in — if the compression model fails 3 times in a row, it stops trying and passes history through unchanged instead of crashing your agent mid-session.
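The circuit-breaker behavior can be sketched as a small state machine. This is an illustrative sketch, not the library's actual implementation; the names `CompressionBreaker` and `tryCompress` are hypothetical:

```typescript
// Hypothetical sketch: after `maxFailures` consecutive compression errors,
// stop calling the compression model and pass history through unchanged.
type Messages = string[];

class CompressionBreaker {
  private failures = 0;
  constructor(private maxFailures = 3) {}

  async tryCompress(
    history: Messages,
    compress: (h: Messages) => Promise<Messages>,
  ): Promise<Messages> {
    if (this.failures >= this.maxFailures) {
      return history; // breaker open: skip compression entirely
    }
    try {
      const compressed = await compress(history);
      this.failures = 0; // a success resets the counter
      return compressed;
    } catch {
      this.failures += 1;
      return history; // fail soft instead of crashing the agent mid-session
    }
  }
}
```

The key property is that a failure is never fatal: the agent keeps running with uncompressed history, at worst hitting the provider's own context limit later.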

**Tool result truncation** — large outputs (terminal logs, API responses) are automatically truncated with head + tail preservation. The first 500 chars (command + initial output) and last 1000 chars (errors + final result) are kept, snapped to line boundaries. Optionally you can persist the full output to a storage adapter so the LLM can retrieve it later via a `context://vfs/` URI.

```typescript
truncate: {
  threshold: 5000,
  headChars: 500,
  tailChars: 1000,
  storage: new FileSystemAdapter('.context_vfs'), // optional
},
```
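For intuition, the head + tail strategy with line-boundary snapping looks roughly like the following. This is a behavioral sketch under stated assumptions; the library's actual truncation marker and edge-case handling may differ:

```typescript
// Sketch: keep the first `headChars` and last `tailChars` of an oversized
// tool result, snapping both cuts to line boundaries so no line is split.
function truncateHeadTail(
  text: string,
  threshold: number,
  headChars: number,
  tailChars: number,
): string {
  if (text.length <= threshold) return text;

  // Head: take the budget, then back up to the last complete line.
  let head = text.slice(0, headChars);
  const headBreak = head.lastIndexOf('\n');
  if (headBreak > 0) head = head.slice(0, headBreak);

  // Tail: take the budget, then skip forward past the first partial line.
  let tail = text.slice(-tailChars);
  const tailBreak = tail.indexOf('\n');
  if (tailBreak >= 0) tail = tail.slice(tailBreak + 1);

  const omitted = text.length - head.length - tail.length;
  return `${head}\n... [${omitted} chars truncated] ...\n${tail}`;
}
```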

**Automatic token tracking** — extracts token usage from `generateText`/`streamText` responses and feeds it back to the compression engine. No manual `reportTokenUsage()` calls.
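Conceptually this amounts to a small budget tracker fed from each response. A minimal sketch, assuming usage is reported as a per-turn total (recent AI SDK versions expose usage on the result, but exact field names vary by version, so treat these as assumptions):

```typescript
// Sketch: record the latest total token count and decide when the
// conversation has crossed the compression threshold.
interface Usage {
  totalTokens?: number;
}

class TokenBudget {
  private used = 0;
  constructor(private contextWindow: number) {}

  // Called after each generateText/streamText response. The count is
  // absolute (the whole prompt + completion), not cumulative, since the
  // prompt already contains the full history.
  record(usage: Usage): void {
    if (usage.totalTokens !== undefined) this.used = usage.totalTokens;
  }

  // Compress once usage exceeds a fraction of the window (0.8 here is
  // an illustrative default, not the library's).
  needsCompression(ratio = 0.8): boolean {
    return this.used > this.contextWindow * ratio;
  }
}
```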

**Compact** — zero-LLM-cost pruning via AI SDK's `pruneMessages`. Strips reasoning blocks and old tool calls mechanically:

```typescript
compact: {
  reasoning: 'all',
  toolCalls: 'before-last-message',
},
```

The middleware is stateful — it tracks usage across calls to know when compression is needed, so create one wrapped model per conversation/session.

## Beyond the middleware

If you need more control, the underlying [`@context-chef/core`](https://github.com/MyPrototypeWhat/context-chef) library also handles:

- Dynamic state injection (Zod-validated task state at optimal message position to prevent state drift)

- Tool namespaces + lazy loading (two-layer architecture for 30+ tools without hallucination)

- Cross-session memory with TTL

- Snapshot & restore for branching/error recovery

- Multi-provider compilation (same code → OpenAI / Anthropic / Gemini payloads)

## Links

- npm: [`@context-chef/ai-sdk-middleware`](https://www.npmjs.com/package/@context-chef/ai-sdk-middleware)

- GitHub: [MyPrototypeWhat/context-chef](https://github.com/MyPrototypeWhat/context-chef)

---

Curious how others here are managing context in long-running AI SDK agents. Are you doing your own compression? Just truncating? Using `pruneMessages` directly? Would love to hear what's working for you and if there are middleware hooks you'd want that this doesn't cover yet.


Urjit (@urjit) · 2026-04-11

This is really interesting. I was wondering how this handles prompt caching — does compression avoid rewriting the cached prefix and causing cache misses?