Tool Execution Becomes Unreliable After ~5 Messages in a Conversation

Summary

When using streamText with toolChoice: 'auto', the AI model increasingly fails to execute tools after approximately 5 messages in a conversation, despite clear system prompts and explicit user requests (including slash commands that should guarantee tool usage). The model instead analyzes, plans, or describes what it would do rather than actually calling the tools.

Environment

  • AI SDK Core: ai@5.0.60 (using streamText)
  • Provider SDKs:
    • @ai-sdk/anthropic@2.0.23 (primary issue)
    • @ai-sdk/openai@2.0.43 (similar behavior)
    • @ai-sdk/google@2.0.17, @ai-sdk/xai@2.0.23 (for comparison)
  • Models Tested: Claude Sonnet 4.5, GPT-4.1, GPT-5
  • Runtime: Node.js 18+
  • Framework: Next.js 14.2.3 API routes with streaming

Expected Behavior

When a user explicitly requests tool usage (via slash commands or natural language), the AI should immediately call the appropriate tool in the first reasoning step, regardless of conversation length or message count.

Actual Behavior

After approximately 5 messages in a conversation:

  • Tools are frequently not called despite explicit requests
  • Model provides analysis/planning instead of tool execution
  • Slash commands (which should guarantee tool usage) are often ignored
  • Tool usage success rate drops dramatically compared to fresh conversations

Reproduction Steps

  1. Start a fresh conversation with tool-enabled streamText setup
  2. Make a request requiring a tool: "create a mindmap about AI"
  3. ✅ Tool is called successfully
  4. Continue conversation with 4-5 more back-and-forth messages
  5. Make another explicit tool request: "create a mindmap about blockchain"
  6. ❌ Tool is often NOT called; instead the model returns analysis/description like:
    "I'll help you create a mindmap about blockchain. Let me organize 
    the key concepts: cryptocurrencies, distributed ledgers, mining..."
    
    But no actual tool call happens.

Code Context

Configuration

const maxSteps = 8;

let systemPrompt = `
You are an AI agent provided with tools to complete user's request.
Be concise and efficient. You are only allowed a maximum of ${maxSteps} reasoning steps.
If you cannot answer within that, say so directly.

Use tools only when needed. Prioritize finishing within the allowed steps.
`;

// For slash commands, we append urgent instructions
if (detectedSlashCommand === 'mindmap') {
  systemPrompt += `
  
SLASH COMMAND DETECTED: CREATE MINDMAP
The user has requested to create a mindmap. You MUST use the generateMindmap tool 
to create a visual mindmap representation. This is a priority tool usage.
`;
}

return streamText({
  model,
  temperature: 0,
  system: systemPrompt,
  messages: conversationHistory,
  stopWhen: stepCountIs(maxSteps),
  toolChoice: 'auto',
  tools: {
    webSearch: webSearchTool,
    generateMindmap: generateMindmapTool,
    analyzeFile: analyzeFileTool,
    generateOrEditImage: generateOrEditImageTool,
    // ... other tools
  },
  // ... rest of config
});
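One workaround we are evaluating is forcing the tool via `toolChoice` when a slash command is detected, since the SDK accepts `{ type: 'tool', toolName }` to require a specific tool call instead of `'auto'`. A minimal sketch; the `slashCommandTools` map and `resolveToolChoice` helper are hypothetical names, not part of our production code:

```typescript
// Sketch: when a slash command is detected, bypass 'auto' and force the
// matching tool via the SDK's { type: 'tool', toolName } toolChoice value.
// The command-to-tool mapping below is illustrative.
type ToolChoice = 'auto' | { type: 'tool'; toolName: string };

const slashCommandTools: Record<string, string> = {
  mindmap: 'generateMindmap',
  search: 'webSearch',
};

function resolveToolChoice(detectedSlashCommand?: string): ToolChoice {
  const toolName =
    detectedSlashCommand !== undefined
      ? slashCommandTools[detectedSlashCommand]
      : undefined;
  return toolName ? { type: 'tool', toolName } : 'auto';
}

// Usage: streamText({ ..., toolChoice: resolveToolChoice(detectedSlashCommand) })
```

This guarantees the call for slash commands, but it does not help natural-language requests, which is why we still consider it a workaround rather than a fix.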

Complete System Prompt

The full system prompt includes:

const systemPrompt = 'ALWAYS answer from the KNOWLEDGE BASE first for every request and cite sources.';

let generalSystemPrompt = `${systemPrompt}

IMPORTANT: Users can upload content (videos, documents, audio, YouTube, etc.). 
You receive the complete text/transcript of that content.

When users ask "Can you see this?" or "can you read this?" - 
they are asking you to work with the text content you already have access to.
Never say you "can't see" or "can't watch" content.

You are an AI agent provided with tools to complete user's request.
Be concise and efficient. You are only allowed a maximum of ${maxSteps} reasoning steps.
If you cannot answer within that, say so directly.
Use tools only when needed. Prioritize finishing within the allowed steps.
`;

// For slash commands, we append:
if (detectedSlashCommand === 'mindmap') {
  generalSystemPrompt += `

SLASH COMMAND DETECTED: CREATE MINDMAP
The user has requested to create a mindmap. You MUST use the generateMindmap tool 
to create a visual mindmap representation. This is a priority tool usage.
`;
}

Mindmap Tool Definition

The generateMindmap tool is defined as:

import { z } from 'zod';
import { tool } from 'ai';

const baseMindmapNodeSchema = z.object({
  id: z.string(),
  topic: z.string(),
  root: z.boolean().optional(),
});

// Nodes nest recursively through optional `children` arrays.
type MindmapNode = z.infer<typeof baseMindmapNodeSchema> & {
  children?: MindmapNode[];
};

const mindmapNodeSchema: z.ZodType<MindmapNode> = baseMindmapNodeSchema.extend({
  children: z.lazy(() => z.array(mindmapNodeSchema)).optional(),
});

// The tool input is the root node of the hierarchy.
const MindMapResponseSchema = mindmapNodeSchema;

const generateMindmapTool = tool({
  description:
    "Generate a structured mindmap from the user's request. Use this when the user " +
    "explicitly asks for a mindmap to be created. The user must mention " +
    "either 'mindmap' or 'mind map' (accommodate spelling mistakes). Use the " +
    "knowledge base if it enhances the mindmap with relevant information.",
  inputSchema: MindMapResponseSchema,
  execute: async (state) => state,
});

The tool expects a hierarchical structure with nodes containing id, topic, and optional children arrays.
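For reference, a valid input under that shape (values are illustrative only) would look like:

```typescript
// Illustrative payload: a root node with a nested children array, matching
// the id/topic/children shape the tool expects.
const exampleMindmapInput = {
  id: 'root',
  topic: 'Blockchain',
  root: true,
  children: [
    { id: 'n1', topic: 'Distributed ledgers' },
    { id: 'n2', topic: 'Consensus mechanisms' },
  ],
};
```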

Message History Management

// Standard message filtering
const nonEmptyMessages = messages.filter(msg => {
  return msg.parts.some(part =>
    part.type === 'text' ? (part.text?.trim()?.length ?? 0) > 0 : true
  );
});

// Anthropic prompt caching for efficiency
const cachedMessages = addChunkCaching(nonEmptyMessages);

// Knowledge base content prepended if available
const filteredMessageList = filteredMessages([
  ...(attachments.length > 0 ? [userAttachments] : []),
  ...knowledgeBaseMessages,
  ...cachedMessages,
]);

const modelMessages = convertToModelMessages(filteredMessageList);
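Another mitigation we are evaluating (not shown in the snippet above) is capping how much history reaches the model, on the theory that long histories dilute the tool-usage instructions. A sketch with a hypothetical `trimHistory` helper over a simplified message type:

```typescript
// Hypothetical helper: keep only the most recent `limit` messages, since
// tool reliability appears to degrade as the conversation grows.
interface SimpleMessage {
  role: 'user' | 'assistant' | 'tool';
  content: string;
}

function trimHistory(messages: SimpleMessage[], limit = 8): SimpleMessage[] {
  return messages.length <= limit ? messages : messages.slice(-limit);
}
```

In practice the cutoff would need to respect tool-call/tool-result pairing and any cached knowledge-base messages, so this is a starting point rather than a drop-in fix.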

Observations

Behavior Examples

Early in conversation (works well):

User: "create a mindmap about AI"
Model: [immediately calls generateMindmap tool]
Model: "I've created a mindmap covering machine learning, neural networks..."

Later in same conversation (fails):

User: "create a mindmap about machine learning" 
Model: "I'll create a comprehensive mindmap about machine learning. 
       The main branches would include supervised learning, unsupervised 
       learning, reinforcement learning..."
[NO TOOL CALL - just describes what the mindmap would contain]

Even with explicit slash commands:

User: "/mindmap create a diagram about blockchain"
Model: "Let me help you understand blockchain for a mindmap. Key concepts 
       include distributed ledgers, consensus mechanisms..."
[NO TOOL CALL despite slash command]

What We’ve Tried

1. Strengthening System Prompt Language

Added “MUST”, “CRITICAL”, “IMMEDIATELY” language:

⚠️ CRITICAL: When tools are needed, call them IMMEDIATELY in your FIRST step.
Do NOT analyze first - call the tool FIRST.

Result: Marginal improvement, but degradation still occurs after ~5 messages

2. Using stepCountIs() to Limit Reasoning

stopWhen: stepCountIs(8)

Result: Model uses steps for analysis instead of tool calls

3. Setting temperature: 0

Result: More deterministic but doesn’t solve the issue

4. Keyword Detection & Additional Hints

Added pattern matching to detect tool-requiring requests:

if (/mindmap|diagram/.test(userPrompt)) {
  hints.push('Use generateMindmap tool immediately');
}

Result: Helps initially but reliability still degrades
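The pattern above also misses common spelling variants. A slightly more tolerant check (illustrative only) would be:

```typescript
// Accepts 'mindmap', 'mind map', 'mind-map', and 'diagram', case-insensitively.
const mindmapPattern = /\bmind[\s-]?map\b|\bdiagram\b/i;

function needsMindmapTool(prompt: string): boolean {
  return mindmapPattern.test(prompt);
}
```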

Questions for Vercel AI SDK Team

  1. Is this a known limitation of toolChoice: 'auto' in multi-turn conversations?

  2. Are there SDK mechanisms to maintain tool execution priority across conversation length?

  3. Best practices for maintaining tool reliability in long conversations?

Desired Solution

Tools should execute reliably when prompted, regardless of conversation length.

The ideal behavior would eliminate the need for workarounds such as:

  • Adding emphatic language (“CRITICAL”, “URGENT”, “MUST”) to system prompts
  • Inserting tool reminders before user messages
  • Manually detecting commands to force tool usage
  • Implementing conversation-length-aware logic

When a user requests “create a mindmap” after 5 messages, it should execute with the same reliability as the first message. Tool execution should remain consistent throughout the conversation lifecycle.

Impact on Production

This significantly degrades user experience:

  • 🔴 Users must repeat requests 2-3 times after several messages; sometimes the tool is never called
  • 🔴 Explicit slash commands (guaranteed tool triggers) become unreliable
  • 🔴 User frustration when the AI describes actions instead of performing them
  • 🔴 Wasted tokens and API costs from unnecessary reasoning steps
  • 🔴 Loss of trust in the tool execution system

Our users expect tools to work consistently regardless of conversation length. This reliability degradation is one of our top production issues.

Any guidance would be greatly appreciated! 🙏