Summary
When using streamText with toolChoice: 'auto', the AI model increasingly fails to execute tools after approximately 5 messages in a conversation, despite clear system prompts and explicit user requests (including slash commands that should guarantee tool usage). The model instead analyzes, plans, or describes what it would do rather than actually calling the tools.
Environment
- AI SDK Core: ai@5.0.60 (using streamText)
- Provider SDKs: @ai-sdk/anthropic@2.0.23 (primary issue), @ai-sdk/openai@2.0.43 (similar behavior), @ai-sdk/google@2.0.17 and @ai-sdk/xai@2.0.23 (for comparison)
- Models Tested: Claude Sonnet 4.5, GPT-4.1, GPT-5
- Runtime: Node.js 18+
- Framework: Next.js 14.2.3 API routes with streaming
Expected Behavior
When a user explicitly requests tool usage (via slash commands or natural language), the AI should immediately call the appropriate tool in the first reasoning step, regardless of conversation length or message count.
Actual Behavior
After approximately 5 messages in a conversation:
- Tools are frequently not called despite explicit requests
- Model provides analysis/planning instead of tool execution
- Slash commands (which should guarantee tool usage) are often ignored
- Tool usage success rate drops dramatically compared to fresh conversations
Reproduction Steps
- Start a fresh conversation with a tool-enabled streamText setup
- Make a request requiring a tool ("create a mindmap about AI"); the tool is called successfully
- Continue the conversation with 4-5 more back-and-forth messages
- Make another explicit tool request ("create a mindmap about blockchain"); the tool is often NOT called. Instead, the model returns analysis/description like: "I'll help you create a mindmap about blockchain. Let me organize the key concepts: cryptocurrencies, distributed ledgers, mining..." But no actual tool call happens.
Code Context
Configuration
const maxSteps = 8;

// Must be `let`, since we append to it below for slash commands
let systemPrompt = `
You are an AI agent provided with tools to complete user's request.
Be concise and efficient. You are only allowed a maximum of ${maxSteps} reasoning steps.
If you cannot answer within that, say so directly.
Use tools only when needed. Prioritize finishing within the allowed steps.
`;

// For slash commands, we append urgent instructions
if (detectedSlashCommand === 'mindmap') {
  systemPrompt += `
SLASH COMMAND DETECTED: CREATE MINDMAP
The user has requested to create a mindmap. You MUST use the generateMindmap tool
to create a visual mindmap representation. This is a priority tool usage.
`;
}
return streamText({
  model,
  temperature: 0,
  system: systemPrompt,
  messages: conversationHistory,
  stopWhen: stepCountIs(maxSteps),
  toolChoice: 'auto',
  tools: {
    webSearch: webSearchTool,
    generateMindmap: generateMindmapTool,
    analyzeFile: analyzeFileTool,
    generateOrEditImage: generateOrEditImageTool,
    // ... other tools
  },
  // ... rest of config
});
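For context, the SDK's toolChoice also accepts an object form that forces a specific tool call instead of leaving the decision to the model. A minimal sketch of wiring that to slash-command detection (the helper name and command-to-tool map here are our own, not SDK API):

```typescript
// Hypothetical helper: force a specific tool when a slash command was
// detected, otherwise leave the model free to choose ('auto').
type ToolChoice = 'auto' | { type: 'tool'; toolName: string };

const SLASH_COMMAND_TOOLS: Record<string, string> = {
  mindmap: 'generateMindmap',
  image: 'generateOrEditImage',
};

function resolveToolChoice(detectedSlashCommand?: string): ToolChoice {
  const toolName = detectedSlashCommand
    ? SLASH_COMMAND_TOOLS[detectedSlashCommand]
    : undefined;
  return toolName ? { type: 'tool', toolName } : 'auto';
}
```

Forcing the tool this way makes /mindmap deterministic, but it is exactly the kind of manual command detection we would prefer not to maintain.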
Complete System Prompt
The full system prompt includes:
const systemPrompt = 'ALWAYS answer from the KNOWLEDGE BASE first for every request and cite sources.';
let generalSystemPrompt = `${systemPrompt}
IMPORTANT: Users can upload content (videos, documents, audio, YouTube, etc.).
You receive the complete text/transcript of that content.
When users ask "Can you see this?" or "can you read this?" -
they are asking you to work with the text content you already have access to.
Never say you "can't see" or "can't watch" content.
You are an AI agent provided with tools to complete user's request.
Be concise and efficient. You are only allowed a maximum of ${maxSteps} reasoning steps.
If you cannot answer within that, say so directly.
Use tools only when needed. Prioritize finishing within the allowed steps.
`;
// For slash commands, we append:
if (detectedSlashCommand === 'mindmap') {
  generalSystemPrompt += `
SLASH COMMAND DETECTED: CREATE MINDMAP
The user has requested to create a mindmap. You MUST use the generateMindmap tool
to create a visual mindmap representation. This is a priority tool usage.
`;
}
Mindmap Tool Definition
The generateMindmap tool is defined as:
import { z } from 'zod';
import { tool, zodSchema } from 'ai';

const baseMindmapNodeSchema = z.object({
  id: z.string(),
  topic: z.string(),
  root: z.boolean().optional(),
});

// MindMapResponseSchema (definition omitted) nests baseMindmapNodeSchema
// recursively via optional `children` arrays.

const generateMindmapTool = tool({
  description:
    "Generate a structured mindmap from the user's request. Use this when the user " +
    "explicitly asks for a mindmap to be created. The user must mention " +
    "either 'mindmap' or 'mind map' (accommodate spelling mistakes). Use the " +
    "knowledge base if it enhances the mindmap with relevant information.",
  inputSchema: MindMapResponseSchema,
  execute: async (state: any) => state,
});
The tool expects a hierarchical structure with nodes containing id, topic, and optional children arrays.
Message History Management
// Standard message filtering
const nonEmptyMessages = messages.filter(msg =>
  msg.parts.some(part =>
    // Parentheses matter: `len ?? 0 > 0` would parse as `len ?? (0 > 0)`
    part.type === 'text' ? (part.text?.trim()?.length ?? 0) > 0 : true
  )
);
// Anthropic prompt caching for efficiency
const cachedMessages = addChunkCaching(nonEmptyMessages);
// Knowledge base content prepended if available
const filteredMessageList = filteredMessages([
  ...(attachments.length > 0 ? [userAttachments] : []),
  ...knowledgeBaseMessages,
  ...cachedMessages,
]);
const modelMessages = convertToModelMessages(filteredMessageList);
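One subtlety in the text-part filter above: in JavaScript, `>` binds tighter than `??`, so `len ?? 0 > 0` parses as `len ?? (0 > 0)`, which yields a number or `false` rather than the intended comparison. The parentheses in `(len ?? 0) > 0` are load-bearing; a standalone demo:

```typescript
// Mimics the filter's length lookup: undefined when there is no text.
function textLen(text?: string): number | undefined {
  return text?.trim().length;
}

const fullLen = textLen('hello');     // 5
const emptyLen = textLen(undefined);  // undefined

const withoutParens = fullLen ?? 0 > 0;       // 5 (a number, not a boolean)
const withParens = (fullLen ?? 0) > 0;        // true
const emptyWithoutParens = emptyLen ?? 0 > 0; // false (the result of 0 > 0)
const emptyWithParens = (emptyLen ?? 0) > 0;  // false
```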
Observations
Behavior Examples
Early in conversation (works well):
User: "create a mindmap about AI"
Model: [immediately calls generateMindmap tool]
Model: "I've created a mindmap covering machine learning, neural networks..."
Later in same conversation (fails):
User: "create a mindmap about machine learning"
Model: "I'll create a comprehensive mindmap about machine learning.
The main branches would include supervised learning, unsupervised
learning, reinforcement learning..."
[NO TOOL CALL - just describes what the mindmap would contain]
Even with explicit slash commands:
User: "/mindmap create a diagram about blockchain"
Model: "Let me help you understand blockchain for a mindmap. Key concepts
include distributed ledgers, consensus mechanisms..."
[NO TOOL CALL despite slash command]
What We’ve Tried
1. Strengthening System Prompt Language
Added “MUST”, “CRITICAL”, “IMMEDIATELY” language:
⚠️ CRITICAL: When tools are needed, call them IMMEDIATELY in your FIRST step.
Do NOT analyze first - call the tool FIRST.
Result: Marginal improvement, but degradation still occurs after ~5 messages
2. Using stepCountIs() to Limit Reasoning
stopWhen: stepCountIs(8)
Result: Model uses steps for analysis instead of tool calls
3. Setting temperature: 0
Result: More deterministic but doesn’t solve the issue
4. Keyword Detection & Additional Hints
Added pattern matching to detect tool-requiring requests:
if (/mindmap|diagram/.test(userPrompt)) {
  hints.push('Use generateMindmap tool immediately');
}
Result: Helps initially but reliability still degrades
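A sketch of the related reminder-injection variant of this workaround (the helper is hypothetical and the message shape simplified):

```typescript
// Hypothetical workaround: when the latest user message matches a
// tool-requiring pattern, inject a system reminder just before it.
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

function withToolReminder(
  messages: ChatMessage[],
  pattern: RegExp,
  reminder: string,
): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (last?.role !== 'user' || !pattern.test(last.content)) return messages;
  return [
    ...messages.slice(0, -1),
    { role: 'system', content: reminder },
    last,
  ];
}
```

Like the keyword hints, this helps early in a conversation but does not stop the degradation.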
Questions for Vercel AI SDK Team
- Is this a known limitation of toolChoice: 'auto' in multi-turn conversations?
- Are there SDK mechanisms to maintain tool execution priority across conversation length?
- What are the best practices for maintaining tool reliability in long conversations?
Desired Solution
Tools should execute reliably when prompted, regardless of conversation length.
The ideal behavior would eliminate the need for workarounds such as:
- Adding emphatic language (“CRITICAL”, “URGENT”, “MUST”) to system prompts
- Inserting tool reminders before user messages
- Manually detecting commands to force tool usage
- Implementing conversation-length-aware logic
When a user requests “create a mindmap” after 5 messages, it should execute with the same reliability as the first message. Tool execution should remain consistent throughout the conversation lifecycle.
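As one example, the conversation-length-aware logic listed above ends up looking like this (hypothetical helper, used only to probe whether raw history length drives the degradation):

```typescript
// Hypothetical length-aware trimming: keep the first message (which carries
// knowledge-base context) plus the most recent N messages.
function capHistory<T>(messages: T[], maxRecent: number): T[] {
  if (messages.length <= maxRecent + 1) return messages;
  return [messages[0], ...messages.slice(-maxRecent)];
}
```

Workarounds like this trade away conversation context just to keep tools firing, which is why we would rather see consistent behavior from toolChoice: 'auto' itself.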
Impact on Production
This significantly degrades user experience:
- Users must repeat requests 2-3 times after several messages; sometimes the tool is never called
- Explicit slash commands (guaranteed tool triggers) become unreliable
- Users grow frustrated when the AI describes actions instead of performing them
- Tokens and API costs are wasted on unnecessary reasoning steps
- Trust in the tool execution system is lost
Our users expect tools to work consistently regardless of conversation length. This reliability degradation is one of our top production issues.
Any guidance would be greatly appreciated!