What if the Vercel AI SDK version of google/gemini-2.0-flash exposes the multimodal API?
The Vercel AI SDK does expose multimodal capabilities for Gemini models. You can send images, audio, and files as input to Gemini models:
import { generateText } from "ai"

// imageBuffer is assumed to hold image bytes loaded elsewhere (e.g. read from disk)
const result = await generateText({
  model: "google/gemini-2.0-flash",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What's in this image?" },
        { type: "image", image: imageBuffer }, // a Buffer of image bytes, or a URL
      ],
    },
  ],
})
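To make the shape of the `content` array concrete, here is a minimal sketch of a helper that assembles a text-plus-images user message. The `TextPart`, `ImagePart`, and `UserMessage` types and the `buildImageQuestion` function are hypothetical names declared locally for illustration; they only mirror the part shapes shown above and are not imported from the SDK.

```typescript
// Local types mirroring the content-part shapes used in the example above.
// Declared here for illustration only; a real project would rely on the SDK's own types.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image"; image: Uint8Array | URL };
type UserMessage = { role: "user"; content: Array<TextPart | ImagePart> };

// Hypothetical helper: builds one user message from a prompt and zero or more images.
function buildImageQuestion(
  prompt: string,
  images: Array<Uint8Array | string>,
): UserMessage {
  const imageParts = images.map(
    (img): ImagePart => ({
      type: "image",
      // Strings are treated as URLs; raw bytes pass through unchanged.
      image: typeof img === "string" ? new URL(img) : img,
    }),
  );
  // The text part comes first, followed by the images in the order given.
  return { role: "user", content: [{ type: "text", text: prompt }, ...imageParts] };
}

const message = buildImageQuestion("What's in this image?", [
  "https://example.com/cat.png", // placeholder URL
  new Uint8Array([0x89, 0x50, 0x4e, 0x47]), // first bytes of a PNG, for illustration only
]);

// Logs the part types in order: text, image, image
console.log(message.content.map((part) => part.type));
```

The resulting object could then be passed as one entry of `messages`, assuming the installed SDK version accepts these part shapes.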
Multimodal Output (Image Generation)
For models that support image generation (like Gemini 3.1 Flash Image Preview / “Nano Banana 2”), you can use generateImage:
import { generateImage } from "ai"

const { image } = await generateImage({
  model: "google/gemini-3.1-flash-image-preview",
  prompt: "A futuristic city at sunset",
})
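Once an image comes back, you usually want to save it. The sketch below assumes the returned `image` object carries raw bytes and a media type (named `uint8Array` and `mediaType` here; check the field names against the SDK version you have installed). The `GeneratedImage` type, the `EXTENSIONS` table, and `fileNameFor` are hypothetical helpers, not SDK exports.

```typescript
// Assumed shape of a generated image: raw bytes plus an IANA media type.
type GeneratedImage = { uint8Array: Uint8Array; mediaType: string };

// Small lookup table from media type to file extension (illustrative, not exhaustive).
const EXTENSIONS: Record<string, string> = {
  "image/png": "png",
  "image/jpeg": "jpg",
  "image/webp": "webp",
};

// Hypothetical helper: derives a file name from the media type, falling back to ".bin".
function fileNameFor(image: GeneratedImage, baseName: string): string {
  const extension = EXTENSIONS[image.mediaType] ?? "bin";
  return `${baseName}.${extension}`;
}

// Stand-in value so the sketch runs without calling a model.
const sample: GeneratedImage = {
  uint8Array: new Uint8Array([0x89, 0x50, 0x4e, 0x47]),
  mediaType: "image/png",
};

console.log(fileNameFor(sample, "city-at-sunset")); // city-at-sunset.png
```

In Node.js, the bytes could then be written with `fs.promises.writeFile(fileNameFor(image, "city-at-sunset"), image.uint8Array)`.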
Interleaved Text + Images
For models that generate interleaved text and images, you’d use the streaming response with multimodal parts:
import { streamText } from "ai"

const result = streamText({
  model: "google/gemini-3.1-flash-image-preview",
  prompt: "Create a step-by-step recipe with images",
})

for await (const part of result.fullStream) {
  if (part.type === "text-delta") {
    // Handle text
  } else if (part.type === "file") {
    // Handle generated image
  }
}
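The two placeholder branches in that loop can be filled in roughly as follows. This is a sketch under assumptions: the `StreamPart` type is declared locally and guesses at the part fields (`text` on text deltas, `file.mediaType` and `file.uint8Array` on file parts), which may differ between SDK versions. `collectParts` and `fakeStream` are hypothetical names, and the fake stream stands in for `result.fullStream` so the example runs without a model call.

```typescript
// Assumed part shapes, declared locally; verify against the SDK version in use.
type GeneratedFile = { mediaType: string; uint8Array: Uint8Array };
type StreamPart =
  | { type: "text-delta"; text: string }
  | { type: "file"; file: GeneratedFile }
  | { type: "finish" };

// Hypothetical helper: accumulates streamed text and gathers any image files.
async function collectParts(
  stream: AsyncIterable<StreamPart>,
): Promise<{ text: string; images: GeneratedFile[] }> {
  let text = "";
  const images: GeneratedFile[] = [];
  for await (const part of stream) {
    if (part.type === "text-delta") {
      text += part.text;
    } else if (part.type === "file" && part.file.mediaType.startsWith("image/")) {
      images.push(part.file);
    }
    // Other part types (such as "finish") are ignored in this sketch.
  }
  return { text, images };
}

// Stand-in for result.fullStream so the sketch runs offline.
async function* fakeStream(): AsyncGenerator<StreamPart> {
  yield { type: "text-delta", text: "Step 1: " };
  yield { type: "text-delta", text: "whisk the eggs." };
  yield { type: "file", file: { mediaType: "image/png", uint8Array: new Uint8Array([1, 2, 3]) } };
  yield { type: "finish" };
}

collectParts(fakeStream()).then(({ text, images }) => {
  console.log(text); // Step 1: whisk the eggs.
  console.log(images.length); // 1
});
```

Collecting everything before rendering is the simplest option; a UI would more likely render each part as it arrives inside the loop.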
- Gemini 2.0 Flash: Multimodal input (images, files, audio)

- Gemini 3.1 Flash Image Preview: Multimodal output (generates images inline)

- The AI SDK abstracts provider differences, so you use the same patterns across models