Multimodal RAG in 30 minutes: Gemini Embedding 2 + Claude Code

Google releases its first natively multimodal embedding model. Text, images, videos, audio in the same vector database. Here's how to build a complete RAG pipeline with Claude Code.

The first natively multimodal embedding

Google just released Gemini Embedding 2. It’s their first embedding model that natively understands text, images, video and audio. Not a text model with an image wrapper on top. One model, one vector space, all modalities.

Concretely, this means you can store a 68-page PDF, photos, videos and audio files in the same vector database. And when you ask a question, the model understands the semantic relationships between these different media types.

This isn’t just a technical increment. It’s the kind of change that radically simplifies pipelines that used to take days to build.

RAG in 30 seconds

For those just catching up: RAG stands for Retrieval-Augmented Generation. The concept is simple.

Your AI model has knowledge limited to its training data. If you ask it about your private data (internal documents, client history, knowledge base), it doesn’t know. RAG solves this: before answering, the model fetches relevant information from your database, then integrates it into its response.

The classic workflow:

  1. Your documents are split into chunks
  2. Each chunk goes through an embedding model that transforms it into a vector
  3. Vectors are stored in a vector database (Pinecone, Weaviate, etc.)
  4. When you ask a question, it’s also converted to a vector
  5. The nearest vectors are retrieved (= the most relevant content)
  6. The model generates a response based on that content

Until now, each media type needed its own pipeline. One for text, one for images (with an intermediate text description), one for audio (with transcription). Gemini Embedding 2 unifies all of that.
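
To make this concrete, here's a minimal sketch of the unified flow in Python, assuming the google-genai and pinecone SDKs. The model ID and the index name are placeholders to check against Google's docs and your own setup.

  from google import genai
  from pinecone import Pinecone

  gem = genai.Client(api_key="YOUR_GOOGLE_KEY")
  pc = Pinecone(api_key="YOUR_PINECONE_KEY")
  index = pc.Index("multimodal-rag")  # hypothetical index name

  # One model, one vector space: the same call embeds any chunk
  chunk = "To clean the filter, remove the dust container first."
  vec = gem.models.embed_content(
      model="gemini-embedding-2",  # placeholder model ID
      contents=chunk,
  )

  index.upsert(vectors=[{
      "id": "manual-p12-text-0",
      "values": vec.embeddings[0].values,
      "metadata": {"type": "text", "page": 12, "source": "manual.pdf"},
  }])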

What I built in 30 minutes

I wanted to test the promise. Result: two working demos, each built in under 15 minutes with Claude Code.

Demo 1: chatbot on a technical PDF

I took a 68-page PDF (a vacuum cleaner manual, to keep it practical). Dense text, technical diagrams, assembly images, multiple languages.

The prompt to Claude Code:

Here’s this PDF. I want to chat with it using Google’s new embedding model. Build me the complete pipeline.

Claude Code:

  • Analyzed the PDF
  • Split content into smart chunks (text + images separately)
  • Created the Pinecone index with correct dimensions
  • Ingested all content with Gemini embeddings
  • Built a local chat web app
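
The chunking step from that list might look roughly like this. A sketch using PyMuPDF, one library among several Claude Code could reach for; the file names and the images/ folder are illustrative.

  import fitz  # PyMuPDF: pip install pymupdf

  doc = fitz.open("manual.pdf")
  text_chunks, image_records = [], []

  for page_num, page in enumerate(doc, start=1):
      # Text and images are chunked separately, each tagged with its page
      text = page.get_text().strip()
      if text:
          text_chunks.append({"page": page_num, "text": text})
      for i, img in enumerate(page.get_images(full=True)):
          pix = fitz.Pixmap(doc, img[0])
          if pix.n - pix.alpha >= 4:  # convert CMYK to RGB before saving
              pix = fitz.Pixmap(fitz.csRGB, pix)
          path = f"images/p{page_num}_{i}.png"
          pix.save(path)
          image_records.append({"page": page_num, "path": path})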

Test question: “How do I clean the filter?”

Answer: step-by-step instructions + the corresponding technical diagrams, pulled directly from the PDF. Not a description of the diagram. The actual diagram, displayed in chat.

Test question: “What are the parts?”

Answer: main components (page 6), box contents (page 7), available accessories. Three different images, each with its confidence score.
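
Those scores come straight from the vector search. The query side, with the same placeholder model ID as above:

  question = "What are the parts?"
  qvec = gem.models.embed_content(
      model="gemini-embedding-2",  # placeholder model ID
      contents=question,
  ).embeddings[0].values

  res = index.query(vector=qvec, top_k=3, include_metadata=True)
  for m in res.matches:
      # m.score is the similarity shown as "confidence" in the chat
      print(m.score, m.metadata["page"], m.metadata.get("path"))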

Demo 2: visual search on a photo archive

Use case: a roofing company with a photographed project history. You receive a photo of a new roof to repair and want to find similar past projects in your database.

The dataset: 13 roof images with metadata (cost, duration, team size). Claude Code ingested everything, built the app, and within seconds I could:

  • Upload a roof photo
  • Get the 5 most similar projects with similarity scores
  • See metadata for each project (price range, roof type, etc.)
  • Ask follow-up questions (“Tell me about the Richmond project”)
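
The photo search works the same way, except the query itself is an image. A sketch, assuming the new model accepts image parts through embed_content the same way generate_content does; verify against the current docs.

  from google.genai import types

  photo = types.Part.from_bytes(
      data=open("new_roof.jpg", "rb").read(),  # assumption: image input via Part
      mime_type="image/jpeg",
  )
  qvec = gem.models.embed_content(
      model="gemini-embedding-2",  # placeholder model ID
      contents=[photo],
  ).embeddings[0].values

  # Top 5 similar projects, metadata included (cost, duration, team size)
  res = index.query(vector=qvec, top_k=5, include_metadata=True)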

The embedding model visually understands what looks similar. A roof with water damage gets matched with other water damage, not just “roof photos”.

How to reproduce this

Prerequisites

Three API keys:

  • Pinecone: vector database (free starter plan is enough)
  • Google AI Studio: access to Gemini Embedding 2
  • OpenRouter (or Anthropic/OpenAI): for the chat model

The build with Claude Code

Open Claude Code in an empty folder. Switch to plan mode:

I want to use Gemini Embedding 2 to create a multimodal Pinecone
vector database. The pipeline must support text, images and videos.
Create a .env with placeholders and an implementation plan.

Claude Code generates the project structure, dependencies, and a step-by-step plan. Fill in the .env with your keys, approve the plan, and it builds everything.
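
The generated .env looks something like this; the variable names will match whatever Claude Code scaffolds, these are only illustrative:

  PINECONE_API_KEY=your-pinecone-key
  GOOGLE_API_KEY=your-ai-studio-key
  OPENROUTER_API_KEY=your-openrouter-key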

Then drop your files in the data/ folder:

Media is in data/. Ingest everything into Pinecone then
build me a chat app on localhost to test.

Claude Code creates the index, embeds all content, and builds the interface. You haven’t touched any Pinecone configuration, embedding code, or chunking logic.

What Claude Code does under the hood

This is where it gets interesting. Claude Code handles:

  • Smart PDF chunking (page splitting, image extraction)
  • Gemini Embedding 2 API calls for each chunk
  • Pinecone index creation and configuration
  • Vector upsert with metadata
  • Web app construction (frontend + backend + RAG logic)
  • Different media type handling (text vs image vs video)
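
The RAG logic at the heart of the backend fits in a few lines. A sketch using OpenRouter's OpenAI-compatible API; the model slug is one example among many, and gem and index are the clients from the earlier sketches.

  import os
  from openai import OpenAI

  chat = OpenAI(
      base_url="https://openrouter.ai/api/v1",
      api_key=os.environ["OPENROUTER_API_KEY"],
  )

  def answer(question: str) -> str:
      # Embed the question, retrieve context, hand both to the chat model
      qvec = gem.models.embed_content(
          model="gemini-embedding-2",  # placeholder model ID
          contents=question,
      ).embeddings[0].values
      matches = index.query(vector=qvec, top_k=5, include_metadata=True).matches
      context = "\n---\n".join(str(m.metadata) for m in matches)
      resp = chat.chat.completions.create(
          model="anthropic/claude-3.5-sonnet",  # any chat model works here
          messages=[
              {"role": "system", "content": f"Answer using this context:\n{context}"},
              {"role": "user", "content": question},
          ],
      )
      return resp.choices[0].message.content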

In n8n, this same pipeline would have taken me several hours, possibly days. You need to configure each node, manage intermediate formats, debug connectors. The pipeline is fragile: one change in the input format breaks everything.

With Claude Code, you describe the objective and it adapts. If the PDF has a weird format, it adjusts its parsing. If an image is too low resolution, it detects it and warns you.

Current limitations

Let’s be honest about what doesn’t work perfectly yet.

Videos: max 120 seconds, mp4 and mov only. Sufficient for short clips but not for long videos. The text description accompanying the video is crucial for search quality.

Images: 6 per request max, png and jpeg only. For a massive product catalog, you’ll need to batch.
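
Batching is a plain loop. A sketch, assuming the 6-image cap applies per embed_content call and that a list of images returns one embedding each:

  BATCH = 6  # current per-request image limit

  for start in range(0, len(image_parts), BATCH):
      batch = image_parts[start:start + BATCH]
      result = gem.models.embed_content(
          model="gemini-embedding-2",  # placeholder model ID
          contents=batch,
      )
      # assumption: result.embeddings holds one vector per image, in order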

Descriptions: the embedding model is powerful, but search quality depends heavily on the metadata you associate with each piece of media. A domain expert who precisely describes their images will get much better results than an engineer who leaves default descriptions.
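
Concretely, here's the difference as it shows up in the vector metadata; both records are made up for illustration:

  # What an engineer leaves by default:
  {"description": "IMG_4032.jpg"}

  # What a roofing expert writes, and what retrieval actually matches on:
  {"description": "Slate roof, ~15 years old, water damage along the chimney "
                  "flashing, partial replacement of the north slope",
   "cost_range": "8-12k", "duration_days": 4, "team_size": 3}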

Cost: multimodal embedding consumes more tokens than text alone. For a large database (thousands of documents), the initial ingestion cost can be significant.

Why this is a game changer

What strikes me isn’t the model itself. It’s the combination of Claude Code + multimodal embedding.

Before: building a multimodal RAG pipeline required ML expertise, solid understanding of vector store architectures, and lots of glue code. You could spend a week on it.

Now: you describe your use case in plain language, provide your files, and in 30 minutes you have a working prototype. The barrier to entry has dropped dramatically.

Concrete use cases:

  • Technical support: chatbot on your product documentation (text + diagrams)
  • Real estate: similar property search by photo
  • Medical: searching radiology/scan archives
  • E-commerce: “find me products that look like this”
  • Training: chatbot on your course videos
  • Legal: searching scanned contract and document archives

The skill that matters now isn’t knowing how to code an embedding pipeline. It’s understanding your domain deeply enough to structure the right descriptions and metadata.


Gemini Embedding 2 benchmarks are available on the Google AI documentation. The demo code is reproducible by following the steps described above with Claude Code.

Pierre Rondeau

Developer and indie builder. I build products and automations with AI. Creator of Claude Hub.