Minimizing token usage and caching tool schemas with Vercel AI

Hey everyone

I’m building a chat interface using Vercel AI SDK, where users can select from different models (Anthropic, OpenAI, etc.).

I’ve also added multiple custom tools to extend the chat’s functionality — these tools can be invoked by the LLM to perform manual operations based on the user’s prompt.

The issue I’m facing is with the Anthropic (Claude) API:
When a large or complex prompt (with tool schemas included) is sent, I often get a 429 error (rate limit / token limit), and the SDK stops processing abruptly — no partial or fallback response is returned.

I have a few questions around this:

  1. Minimizing input tokens:

    • Is there a recommended way to reduce token usage when tool schemas are large?

    • Can I cache or reuse tool definitions/schemas between messages so they don’t have to be re-sent on every request?

    • Does the Vercel AI SDK support any internal caching or compression mechanism for such scenarios?

  2. Error handling:

    • When Claude returns a 429 or other API errors, I see the errors in the console, but not in my code.

    • How can I capture these API errors programmatically in the SDK so I can show a proper UI error state or retry logic?

I’ve tried wrapping the stream call in a try/catch, but it doesn’t seem to give the raw error message that appears in the console.

Any suggestions, patterns, or examples on:

  • Handling Anthropic rate limits / token overflows

  • Reusing tool schemas efficiently

  • Catching API errors at runtime

would be super helpful

Thanks in advance!

There’s still a lot of experimentation going on in the prompt optimization space, but the general idea is to assign a “budget” that’s under your token limits, and selectively add things to context based on available room in the budget. For example, if your chat history is too long, you may run it through an LLM initially to summarize, and then only pass the summary instead of the full history.
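As a rough sketch of that budgeting idea (all names here are hypothetical, and the "4 chars ≈ 1 token" estimate is a crude heuristic — a real implementation would use the provider's tokenizer):

```typescript
// Hypothetical sketch: keep only the most recent messages that fit a token budget.
type Msg = { role: string; content: string };

// Crude heuristic: ~4 characters per token. Swap in a real tokenizer in practice.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitToBudget(history: Msg[], budget: number): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  // Walk from newest to oldest so recent turns win the budget.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i].content);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}
```

You'd run `fitToBudget` over the chat history before each request, and optionally summarize whatever got dropped instead of discarding it outright.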

The company behind Cursor has created this package, and I’d recommend reading its README regardless of whether you use the library, since its feature set outlines the options available: GitHub - anysphere/priompt (Prompt design using JSX).

Prompt caching is handled by the AI provider, so you can check your response metadata to see whether it’s reporting cached tokens (e.g. usage.prompt_tokens_details.cached_tokens).
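Note that providers report cached tokens under different keys — OpenAI-style responses use `usage.prompt_tokens_details.cached_tokens`, while Anthropic's Messages API reports `usage.cache_read_input_tokens`. A small (hypothetical) normalizing helper:

```typescript
// Hypothetical helper: normalize cached-token counts across provider usage shapes.
// Field names assumed from OpenAI and Anthropic response formats; verify against
// the actual metadata your provider returns.
function cachedTokens(usage: any): number {
  return (
    usage?.prompt_tokens_details?.cached_tokens ?? // OpenAI-style
    usage?.cache_read_input_tokens ??              // Anthropic-style
    0
  );
}
```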

If you’re not getting it from Anthropic directly, you can consider going through AI Gateway.

For error handling, you’ll need to use the onError callback:

import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

// streamText returns immediately (no await); errors surface via onError
const result = streamText({
  model: anthropic('claude-3-sonnet-20240229'),
  messages,
  tools,
  onError({ error }) {
    console.error('Stream error:', error)
    // Handle 429 errors specifically
    if (error instanceof Error && error.message.includes('429')) {
      // Implement retry logic or show rate limit message
    }
  },
  onFinish(result) {
    // Handle completion
  },
})
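For the retry side, one common pattern is wrapping the call in exponential backoff on 429s. A minimal sketch (the `statusCode` check and the delay values are assumptions — adjust to whatever error shape your SDK version actually throws):

```typescript
// Hypothetical retry wrapper: exponential backoff on rate-limit (429) errors.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isRateLimit =
        err?.statusCode === 429 || String(err?.message).includes('429');
      if (!isRateLimit || attempt >= maxRetries) throw err;
      // Back off: baseDelayMs, 2x, 4x, ... before the next attempt.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```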
