Minimizing token usage and caching tool schemas with Vercel AI

Hey everyone

I’m building a chat interface using Vercel AI SDK, where users can select from different models (Anthropic, OpenAI, etc.).

I’ve also added multiple custom tools to extend the chat’s functionality — these tools can be invoked by the LLM to perform manual operations based on the user’s prompt.

The issue I’m facing is with the Anthropic (Claude) API:
When a large or complex prompt (with tool schemas included) is sent, I often get a 429 error (rate limit / token limit), and the SDK stops processing abruptly — no partial or fallback response is returned.

I have a few questions around this:

  1. Minimizing input tokens:

    • Is there a recommended way to reduce token usage when tool schemas are large?

    • Can I cache or reuse tool definitions/schemas between messages so they don’t have to be re-sent on every request?

    • Does the Vercel AI SDK support any internal caching or compression mechanism for such scenarios?

  2. Error handling:

    • When Claude returns a 429 or other API errors, I see the errors in the console, but not in my code.

    • How can I capture these API errors programmatically in the SDK so I can show a proper UI error state or retry logic?

I’ve tried wrapping the stream call in a try/catch, but it doesn’t seem to give the raw error message that appears in the console.

Any suggestions, patterns, or examples on:

  • Handling Anthropic rate limits / token overflows

  • Reusing tool schemas efficiently

  • Catching API errors at runtime

would be super helpful

Thanks in advance!

There’s still a lot of experimentation going on in the prompt optimization space, but the general idea is to assign a “budget” that’s under your token limits, and selectively add things to context based on available room in the budget. For example, if your chat history is too long, you may run it through an LLM initially to summarize, and then only pass the summary instead of the full history.
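As a rough sketch of that budgeting idea (all names here are hypothetical, and the "4 chars ≈ 1 token" estimate is a crude heuristic — a real implementation would use the provider's tokenizer):

```typescript
// Hypothetical sketch: keep only the most recent messages that fit a token budget.
type Msg = { role: string; content: string };

// Crude heuristic: ~4 characters per token. Swap in a real tokenizer in practice.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitToBudget(history: Msg[], budget: number): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  // Walk from newest to oldest so recent turns win the budget.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].content);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}
```

You'd run `fitToBudget` over the chat history before each request, and optionally summarize whatever got dropped instead of discarding it outright.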

The company behind Cursor has created this package, and I’d recommend reading its README regardless of whether you use the library, since its feature set outlines the options available: GitHub - anysphere/priompt (Prompt design using JSX).

Prompt caching is handled by the AI provider, so you can check your response metadata to see whether it’s reporting cached tokens (e.g. usage.prompt_tokens_details.cached_tokens).
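Note that providers report cached tokens under different keys — OpenAI-style responses use `usage.prompt_tokens_details.cached_tokens`, while Anthropic's Messages API reports `usage.cache_read_input_tokens`. A small (hypothetical) normalizing helper:

```typescript
// Hypothetical helper: normalize cached-token counts across provider usage shapes.
// Field names assumed from OpenAI and Anthropic response formats; verify against
// the actual metadata your provider returns.
function cachedTokens(usage: any): number {
  return (
    usage?.prompt_tokens_details?.cached_tokens ?? // OpenAI-style
    usage?.cache_read_input_tokens ??              // Anthropic-style
    0
  );
}
```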

If you’re not getting it from Anthropic directly, you can consider going through AI Gateway.

For error handling, you’ll need to use the onError callback:

import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

// streamText returns immediately (no await); errors surface via onError
const result = streamText({
  model: anthropic('claude-3-sonnet-20240229'),
  messages,
  tools,
  onError({ error }) {
    console.error('Stream error:', error)
    // Handle 429 errors specifically
    if (error instanceof Error && error.message.includes('429')) {
      // Implement retry logic or show rate limit message
    }
  },
  onFinish(result) {
    // Handle completion
  },
})
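For the retry side, one common pattern is wrapping the call in exponential backoff on 429s. A minimal sketch (the `statusCode` check and the delay values are assumptions — adjust to whatever error shape your SDK version actually throws):

```typescript
// Hypothetical retry wrapper: exponential backoff on rate-limit (429) errors.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isRateLimit =
        err?.statusCode === 429 || String(err?.message).includes('429');
      if (!isRateLimit || attempt >= maxRetries) throw err;
      // Back off: baseDelayMs, 2x, 4x, ... before the next attempt.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```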
