I’m experiencing two related issues with zai/glm-5.1 through AI Gateway:
Reasoning cannot be disabled. The documented controls (reasoning: { "effort": "none" }, reasoning: { "enabled": false }) are accepted but not honored by 5 of the 6 providers serving this model. Only baseten actually disables thinking.
Context: we’re benchmarking GLM-5.1 for a latency-critical voice agent, so ~100–250 hidden reasoning tokens before the first content token (2.5–10s of added TTFT/TTFAT) make the model unusable.
Issue 1: reasoning disable not forwarded/honored
| provider | reasoning option | total time | out_tokens | reasoning_tokens |
|---|---|---|---|---|
| baseten | {"enabled": false} | 449 ms | 1 | 0 |
| baseten | {"effort": "none"} | 423 ms | 1 | 0 |
| togetherai | {"enabled": false} | 3995 ms | 176 | 158 |
| togetherai | {"effort": "none"} | 5711 ms | 102 | 85 |
| fireworks | {"enabled": false} | 4444 ms | 182 | 180 |
| fireworks | {"effort": "none"} | 2841 ms | 189 | 178 |
| deepinfra | {"enabled": false} | 3863 ms | 274 | 265 |
| deepinfra | {"effort": "none"} | 2552 ms | 124 | 108 |
| novita | {"enabled": false} | 6290 ms | 156 | 153 |
| novita | {"effort": "none"} | 9876 ms | 188 | 185 |
| zai | {"enabled": false} | 4951 ms | 159 | 156 |
| zai | {"effort": "none"} | 4731 ms | 164 | 161 |
Notes:
Per the advanced configuration docs, effort: "none" “disables reasoning”. It doesn’t for this model on 5/6 providers.
{"enabled": false} is worse than a no-op: it hides the reasoning text from the stream but the tokens are still generated and billed (reasoning_chars=0 while reasoning_tokens=150+). That’s easy to misread as “thinking disabled” while you still pay the full latency and cost.
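To make the "hidden but still generated" case harder to misread in benchmark output, a check along these lines could label each run. This is only a sketch: the `RunUsage` shape and the `classifyReasoning` helper are hypothetical names that mirror the `reasoning_chars` and `reasoning_tokens` counters quoted above, not an official Gateway type.

```typescript
// Hypothetical shape: mirrors the reasoning_chars / reasoning_tokens counters
// from the notes above. It is not an official AI Gateway or AI SDK type.
interface RunUsage {
  reasoningChars: number;  // characters of reasoning text seen in the stream
  reasoningTokens: number; // reasoning tokens reported in usage
}

type ReasoningState = "disabled" | "hidden-but-generated" | "visible";

// Classify a run by comparing what was streamed with what was counted.
function classifyReasoning(run: RunUsage): ReasoningState {
  if (run.reasoningTokens === 0) {
    return "disabled"; // nothing was generated, so nothing adds latency
  }
  // Tokens were generated; the only question is whether the text was shown.
  return run.reasoningChars === 0 ? "hidden-but-generated" : "visible";
}
```

A run with `reasoningChars: 0` and `reasoningTokens: 156` comes back as `"hidden-but-generated"`, which is the case that looks like "thinking disabled" in the stream while the latency is still paid.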
Hi,
I haven’t tried this with GLM specifically, but for most of the models I’ve used, setting the reasoning config through providerOptions has worked better.
You can find the reasoning config for each model in its official documentation (for GLM: “Thinking Mode” in the Z.AI developer documentation). You may also need to check whether different providers (e.g. Bedrock) use a different config.
Your table makes this look less like a client-side syntax issue and more like provider-specific behavior behind the Gateway.
One thing I’d do for the benchmark is pin the provider instead of letting Gateway route across all providers, since you already found that Baseten is the only one honoring the disable option. Then record the following for every run:
- first token latency
- total output tokens
- reasoning_tokens
- provider metadata / selected provider
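A pinned run and its per-run record could be sketched as below. It assumes the Gateway reads routing hints from `providerOptions.gateway` and that an `only` list restricts the request to the named providers; please verify that against the current AI Gateway docs. `pinnedGatewayOptions` and `RunRecord` are hypothetical names for illustration.

```typescript
// Assumption: routing hints live under providerOptions.gateway, and `only`
// restricts the request to the listed providers. Check the Gateway docs.
function pinnedGatewayOptions(provider: string) {
  return { gateway: { only: [provider] } };
}

// Hypothetical record: one entry per benchmark run, covering the four
// metrics listed above.
interface RunRecord {
  provider: string;        // provider that actually served the request
  firstTokenMs: number;    // first token latency
  outTokens: number;       // total output tokens
  reasoningTokens: number; // reasoning tokens reported in usage
}

// Example: the options object to pass as `providerOptions` for a Baseten-only run.
const basetenOnly = pinnedGatewayOptions("baseten");
```

Comparing the pinned provider against the provider reported in the response metadata confirms that no fallback happened during the run.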
If Baseten consistently returns reasoning_tokens: 0 and the others do not, then I’d avoid fallback routing for this latency-critical path unless each fallback provider has its own verified native “thinking disabled” option.
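As a sketch of that rule, the measurements from the original post can be reduced to an allow-list: a provider qualifies only if every run against it reported zero reasoning tokens. The `BenchRow` type and `providersHonoringDisable` function are hypothetical names; the numbers are copied from the table in the question.

```typescript
// Hypothetical row type; values below are taken from the table in the question.
interface BenchRow {
  provider: string;
  reasoningTokens: number;
}

const rows: BenchRow[] = [
  { provider: "baseten", reasoningTokens: 0 },
  { provider: "baseten", reasoningTokens: 0 },
  { provider: "togetherai", reasoningTokens: 158 },
  { provider: "togetherai", reasoningTokens: 85 },
  { provider: "fireworks", reasoningTokens: 180 },
  { provider: "fireworks", reasoningTokens: 178 },
  { provider: "deepinfra", reasoningTokens: 265 },
  { provider: "deepinfra", reasoningTokens: 108 },
  { provider: "novita", reasoningTokens: 153 },
  { provider: "novita", reasoningTokens: 185 },
  { provider: "zai", reasoningTokens: 156 },
  { provider: "zai", reasoningTokens: 161 },
];

// Keep a provider only if all of its runs produced zero reasoning tokens.
function providersHonoringDisable(data: BenchRow[]): string[] {
  const ok: Record<string, boolean> = {};
  for (const r of data) {
    ok[r.provider] = (ok[r.provider] ?? true) && r.reasoningTokens === 0;
  }
  return Object.keys(ok).filter((p) => ok[p]);
}

const allowList = providersHonoringDisable(rows); // ["baseten"]
```

With the posted numbers the allow-list contains only `baseten`, so any fallback provider would reintroduce the reasoning latency.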
I’d also be careful with reasoning: { effort: "none" } as a portable assumption here. In the AI SDK docs, none is only supported for specific OpenAI GPT-5.1 models, so for GLM via multiple providers I’d expect provider-specific options to be more reliable than a single normalized field.
For this specific case, the strongest support report would be exactly what you already collected: same model, same prompt, same Gateway request shape, pinned provider, and reasoning_tokens still generated despite a disable option.