Standard RAG extracts text and misses charts, figures, and tables. Multimodal RAG indexes both text and images, then passes relevant visuals to a vision LLM at query time.
PDF β†’ Reducto Parse β†’ S3 (images) β†’ Pinecone (embeddings) β†’ Vision LLM
This cookbook builds a pipeline that answers questions using both text context and document images.

Create Reducto API Key

1. Open Studio

Go to studio.reducto.ai and sign in. From the home page, click API Keys in the left sidebar.
2. View API Keys

The API Keys page shows your existing keys. Click + Create new API key in the top right corner.
3. Configure Key

In the modal, enter a name for your key and set an expiration policy (or select “Never” for no expiration). Click Create.
4. Copy Your Key

Copy your new API key and store it securely. You won’t be able to see it again after closing this dialog.
Set the key as an environment variable:
export REDUCTO_API_KEY="your-api-key-here"

Prerequisites

You’ll also need accounts and API keys from these services:
  • AWS S3: permanent image storage (aws.amazon.com)
  • Pinecone: vector database for semantic search (pinecone.io)
  • VoyageAI: text embeddings (voyageai.com)
VoyageAI’s free tier has a rate limit of 3 requests per minute. For production use, add delays between embedding calls or upgrade to a paid plan.
You’ll also need a vision-capable LLM (Claude, GPT-4V, Gemini, etc.) for the final generation step. Install the required packages:
pip install reductoai pinecone voyageai requests boto3

Setup: AWS S3 bucket

We need an S3 bucket to store extracted images permanently. Reducto’s image URLs expire after 1 hour for security reasons. If we stored those URLs directly in our vector database, they’d be broken by the time someone queries the system tomorrow. S3 gives us permanent, publicly accessible URLs that work indefinitely.
1. Create an IAM user

Never use your AWS root account for applications. Instead, create a dedicated IAM user with limited permissions. Go to IAM Console → Users → Create user.
2. Attach S3 permissions

Select Attach policies directly, search for AmazonS3FullAccess, check it, and click Create user. This gives the user permission to read and write to S3 buckets, but nothing else in your AWS account.
3. Create access keys

Click on your new user → Security credentials → Create access key. Select “Command Line Interface (CLI)”, confirm, and create. Save both keys now - you won’t see the secret again. You’ll need these for the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
4. Create an S3 bucket

Go to S3 Console → Create bucket. Choose a unique name (e.g., my-multimodal-rag-images) and select a region close to you for lower latency.
5. Enable public access

By default, S3 buckets are private. We need to make objects publicly readable so your application can fetch images at query time. In your bucket → Permissions tab:

First, disable the block public access settings:
  • Click Block public access → Edit → Uncheck all 4 boxes → Save
Then, add a bucket policy that allows anyone to read objects:
  • Click Bucket policy → Edit → Paste this policy:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::YOUR-BUCKET-NAME/*"
  }]
}
Replace YOUR-BUCKET-NAME with your actual bucket name.
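If you prefer to script these two changes instead of clicking through the console, here is a rough boto3 sketch; the bucket name is a placeholder, and put_public_access_block and put_bucket_policy are the relevant S3 API calls:
import json
import boto3

s3 = boto3.client("s3")
bucket = "YOUR-BUCKET-NAME"  # placeholder: replace with your actual bucket name

# Disable the four "Block public access" settings
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": False,
        "IgnorePublicAcls": False,
        "BlockPublicPolicy": False,
        "RestrictPublicBuckets": False,
    },
)

# Attach the public-read bucket policy shown above
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))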

Setup: Pinecone index

We need a vector index to store text embeddings. Each vector will also include metadata with the S3 URL, so we can fetch the corresponding image at query time.
1. Create index

Go to Pinecone Console → Create Index.
  • Name: multimodal-rag
  • Dimensions: 1024 (this must match VoyageAI’s voyage-3 model)
  • Metric: cosine
The dimension setting is critical. VoyageAI’s voyage-3 model outputs 1024-dimensional vectors. If you use a different embedding model, check its output dimensions.
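You can also create the index from code. A minimal sketch with the Pinecone Python client, assuming a serverless index on AWS us-east-1 (adjust the cloud and region to whatever you use):
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Dimension must match the embedding model: voyage-3 outputs 1024-dimensional vectors
pc.create_index(
    name="multimodal-rag",
    dimension=1024,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)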

Set environment variables

Before running any code, set these environment variables in your terminal:
export REDUCTO_API_KEY="your-reducto-key"
export PINECONE_API_KEY="your-pinecone-key"
export VOYAGEAI_API_KEY="your-voyageai-key"
export AWS_ACCESS_KEY_ID="your-aws-access-key"
export AWS_SECRET_ACCESS_KEY="your-aws-secret-key"
export S3_BUCKET_NAME="your-bucket-name"

Sample document

For this cookbook, we’ll use an open-access research paper from PubMed Central about hepatitis B treatment. This paper has 16 pages with multiple figures, charts, and data tables, making it ideal for demonstrating multimodal RAG. Download the sample PDF:
https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/14/14/WJG-30-1911.PMC11036500.pdf
Save it as research_paper.pdf in your working directory.
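If you’d rather fetch it from Python, a quick sketch using the URL above:
import requests

url = "https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/14/14/WJG-30-1911.PMC11036500.pdf"

# Download the sample paper and save it under the name used in the rest of this cookbook
response = requests.get(url)
response.raise_for_status()
with open("research_paper.pdf", "wb") as f:
    f.write(response.content)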
You can use any PDF with charts or figures. Annual reports, scientific papers, and technical documentation work well for multimodal RAG.

Step 1: Parse the document with image extraction

Initialize the client

First, we create a Reducto client using our API key:
import os
from reducto import Reducto

client = Reducto(api_key=os.environ["REDUCTO_API_KEY"])

Upload the document

Before parsing, we need to upload the file to Reducto’s servers. The upload() method returns a file ID that we’ll use in the parse call:
with open("research_paper.pdf", "rb") as f:
    upload = client.upload(file=f)

print(f"File ID: {upload.file_id}")
File ID: reducto://c0584170-17a0-44f5-baf4-727467b71b84.pdf

Parse with image extraction

Now we parse the document. The key settings here are:
  • return_images: Tells Reducto to crop and return images for figures and tables
  • chunk_mode: Controls how text is grouped into chunks
result = client.parse.run(
    input=upload.file_id,
    settings={
        "return_images": ["figure", "table"]
    },
    retrieval={
        "chunking": {
            "chunk_mode": "section"
        }
    }
)

print(f"Parsed {result.usage.num_pages} pages")
print(f"Got {len(result.result.chunks)} chunks")
Parsed 16 pages
Got 39 chunks
Why chunk_mode: "section"? We use section-based chunking because it keeps figures together with their surrounding explanatory text. If a figure appears in the “Results” section, the chunk will include both the figure and the text that explains it. This improves retrieval quality because the embedding captures the full context.

Other options:
  • page: One chunk per page. Simpler but may split related content.
  • variable: Adaptive chunking based on content density.

Step 2: Understand the response structure

Before we start uploading images, let’s look at what Reducto actually returns. This helps us understand what data we’re working with.

Exploring chunks and blocks

Each chunk contains multiple blocks. A block can be text, a figure, a table, or other content types:
# Look at the first chunk
chunk = result.result.chunks[0]
print(f"Chunk has {len(chunk.blocks)} blocks")
print(f"Embed text preview: {chunk.embed[:200]}...")
Chunk has 5 blocks
Embed text preview: # Inhibition of hepatitis B virus via selective apoptosis
modulation by Chinese patent medicine Liuweiwuling Tablet

## Core Tip
Liuweiwuling Tablet (LWWL) exhibits a selective pro-apoptotic effect...

Finding figures and tables

Let’s find all figures and tables in the document:
figure_count = 0
table_count = 0

for chunk in result.result.chunks:
    for block in chunk.blocks:
        if block.type == "Figure":
            figure_count += 1
        elif block.type == "Table":
            table_count += 1

print(f"Found {figure_count} figures and {table_count} tables")
Found 39 figures and 1 table

Examining a figure block

Each figure block has several important fields:
# Find the first figure
for chunk in result.result.chunks:
    for block in chunk.blocks:
        if block.type == "Figure":
            print(f"Type: {block.type}")
            print(f"Page: {block.bbox.page}")
            print(f"Image URL: {block.image_url[:80]}...")
            print(f"Content: {block.content[:150]}...")
            break
    else:
        continue
    break
Type: Figure
Page: 1
Image URL: https://prod-storage20241010144745140900000001.s3.amazonaws.com/6eb2fb19...
Content: - Title: World Journal of Gastroenterology
- Logo/abbreviation: "W J G" — three white script letters each inside a black square...
Key fields explained:
  • block.type: “Figure”, “Table”, “Text”, etc.
  • block.bbox.page: Page number (1-indexed)
  • block.image_url: Temporary URL to the cropped image. Expires in 1 hour.
  • block.content: AI-generated description of the figure
  • chunk.embed: Full text optimized for embedding, includes figure descriptions
The content field contains Reducto’s AI-generated description of the figure. This is incredibly useful, as it means your vector search can find figures based on what they show, not just the surrounding text.

Step 3: Upload images to S3

Now we need to save these images permanently. Reducto’s URLs expire in 1 hour, so we’ll upload each image to S3 immediately.

Initialize the S3 client

import boto3

s3 = boto3.client("s3")
bucket_name = os.environ["S3_BUCKET_NAME"]

Determine the bucket region

S3 URLs include the region. If we get this wrong, the URLs won’t work:
region = s3.get_bucket_location(Bucket=bucket_name).get("LocationConstraint")
if region is None:
    region = "us-east-1"  # us-east-1 returns None for LocationConstraint

print(f"Bucket region: {region}")
Bucket region: ap-south-1

Create the upload function

This function downloads an image from Reducto and uploads it to S3:
import requests

def upload_image_to_s3(image_url, s3_key):
    """Download image from Reducto and upload to S3."""
    # Download from Reducto
    response = requests.get(image_url)
    response.raise_for_status()

    # Upload to S3
    s3.put_object(
        Bucket=bucket_name,
        Key=s3_key,
        Body=response.content,
        ContentType="image/png"
    )

    # Return the permanent S3 URL
    return f"https://{bucket_name}.s3.{region}.amazonaws.com/{s3_key}"
Why this URL format? S3 URLs follow the pattern https://{bucket}.s3.{region}.amazonaws.com/{key}. Including the region is important; otherwise S3 may redirect requests, which can cause issues with some clients.

Upload all images

Now we loop through all chunks. For each chunk, we check if it contains any figures or tables. If it does, we upload those images to S3 and store the URLs. If it doesn’t, we still index the chunk but without an image URL. This is the key difference from image-only indexing: we index everything, both text-only chunks and chunks with figures.
all_items = []

for chunk_idx, chunk in enumerate(result.result.chunks):
    # Find any figures/tables in this chunk
    images_in_chunk = []
    for block in chunk.blocks:
        if block.type in ["Figure", "Table"] and block.image_url:
            image_id = f"chunk-{chunk_idx}-{block.type.lower()}-page{block.bbox.page}"
            s3_key = f"multimodal-rag/{image_id}.png"
            s3_url = upload_image_to_s3(block.image_url, s3_key)
            images_in_chunk.append({
                "s3_url": s3_url,
                "block_type": block.type,
                "page": block.bbox.page
            })

    # Get the page number from the first block
    page = chunk.blocks[0].bbox.page if chunk.blocks else 1

    # Index this chunk (with or without images)
    item = {
        "id": f"chunk-{chunk_idx}",
        "text": chunk.embed,
        "page": page,
        "has_images": len(images_in_chunk) > 0
    }

    # If this chunk has images, include the first one
    # (for chunks with multiple figures, you could store all URLs)
    if images_in_chunk:
        item["image_url"] = images_in_chunk[0]["s3_url"]
        item["block_type"] = images_in_chunk[0]["block_type"]

    all_items.append(item)

# Count what we have
chunks_with_images = sum(1 for item in all_items if item.get("has_images"))
chunks_text_only = len(all_items) - chunks_with_images

print(f"Total chunks: {len(all_items)}")
print(f"  - With images: {chunks_with_images}")
print(f"  - Text only: {chunks_text_only}")
Total chunks: 39
  - With images: 33
  - Text only: 6
Now our index will contain the entire document. Text-only queries will find relevant text chunks, while queries about figures will find chunks that have associated images.

Verify an image is accessible

Let’s make sure our S3 URLs work:
# Find a chunk that has an image
test_item = next(item for item in all_items if item.get("image_url"))
test_url = test_item["image_url"]
print(f"Testing: {test_url}")

response = requests.head(test_url)
print(f"Status: {response.status_code}")
Testing: https://reducto-multimodal-rag-demo.s3.ap-south-1.amazonaws.com/multimodal-rag/chunk-0-figure-page1.png
Status: 200
If you get a 403 Forbidden error, your bucket policy isn’t set correctly. Go back to the S3 setup and make sure you’ve enabled public access and added the bucket policy.

Step 4: Index into Pinecone

With images safely stored in S3, we can now create vector embeddings and store them in Pinecone.

Initialize the clients

from pinecone import Pinecone
import voyageai

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("multimodal-rag")

vo = voyageai.Client(api_key=os.environ["VOYAGEAI_API_KEY"])

Understanding what we’re indexing

For each chunk, we store:
  • Vector: Embedding of the text (from chunk.embed)
  • Metadata: Text preview, page number, and optionally an S3 image URL
The vector enables semantic search across the entire document. When a chunk has an associated image, the metadata includes the S3 URL so we can retrieve it at query time.

Create embeddings and upsert

for item in all_items:
    # Create embedding from the chunk's embed text
    embedding_response = vo.embed(
        [item["text"][:8000]],  # VoyageAI has input limits
        model="voyage-3"
    )
    embedding = embedding_response.embeddings[0]

    # Build metadata (image_url only included if chunk has images)
    metadata = {
        "text": item["text"][:1000],  # Preview for display
        "page": item["page"],
        "has_images": item["has_images"]
    }

    if item.get("image_url"):
        metadata["image_url"] = item["image_url"]
        metadata["block_type"] = item.get("block_type", "Figure")

    # Upsert to Pinecone
    index.upsert(vectors=[{
        "id": item["id"],
        "values": embedding,
        "metadata": metadata
    }])

print(f"Indexed {len(all_items)} items")
Indexed 39 items
Why do we truncate the text?
  • For embeddings ([:8000]): VoyageAI has input token limits
  • For metadata ([:1000]): Pinecone has metadata size limits (~40KB per vector)
We store enough text in metadata to display a preview, but the full context is captured in the embedding.
VoyageAI’s free tier has rate limits (3 requests per minute). For production use with many documents, add rate limiting or upgrade your plan.

Step 5: Query and retrieve

Now we can search our index. When someone asks a question, we:
  1. Embed their query using the same model (voyage-3)
  2. Find the most similar vectors in Pinecone
  3. Return the matches with their S3 image URLs

Create the search function

def search(query: str, top_k: int = 3):
    """Search for relevant chunks with images."""
    # Embed the query
    query_embedding = vo.embed(
        [query],
        model="voyage-3"
    ).embeddings[0]

    # Search Pinecone
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    return results.matches
matches = search("cell viability and treatment effects")

for match in matches:
    print(f"Score: {match.score:.3f}")
    print(f"Page: {match.metadata['page']}")
    print(f"Has image: {match.metadata.get('has_images', False)}")
    if match.metadata.get("image_url"):
        print(f"Image: {match.metadata['image_url']}")
    print(f"Text preview: {match.metadata['text'][:100]}...")
    print("---")
Score: 0.354
Page: 1
Has image: True
Image: https://reducto-multimodal-rag-demo.s3.ap-south-1.amazonaws.com/multimodal-rag/chunk-8-figure-page1.png
Text preview: # Inhibition of hepatitis B virus via selective apoptosis modulation by Chinese patent medicine Liuwe...
---
Score: 0.227
Page: 1
Has image: True
Image: https://reducto-multimodal-rag-demo.s3.ap-south-1.amazonaws.com/multimodal-rag/chunk-3-figure-page1.png
Text preview: # Inhibition of hepatitis B virus via selective apoptosis modulation by Chinese patent medicine Liuwe...
---
Score: 0.133
Page: 1
Has image: False
Text preview: The study investigated the effects of LWWL on hepatocyte apoptosis using flow cytometry analysis...
---
The search returns a mix of results. Some chunks have associated images, others are text-only. Both contribute to answering the question.

Step 6: Send to your LLM

You now have everything needed for multimodal generation:
  • Text context: match.metadata["text"]
  • Image URLs: match.metadata["image_url"] (for chunks with images)
Pass these to any vision-capable LLM (Claude, GPT-4V, Gemini). Most vision APIs accept either image URLs directly or base64-encoded images.
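As a sketch of this final step, here is one way to do it with the Anthropic SDK, reusing the search function from Step 5 and assuming the anthropic package is installed; the model name is a placeholder, and the same pattern works with GPT-4V or Gemini through their own SDKs. Images are downloaded from S3 and base64-encoded, which every vision API accepts:
import base64
import anthropic
import requests

llm = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def answer_with_images(query, matches, max_images=3):
    """Build a multimodal prompt from retrieved chunks and ask a vision LLM."""
    content = []

    # Attach up to max_images retrieved figures as base64-encoded image blocks
    for match in matches:
        image_url = match.metadata.get("image_url")
        if not image_url or len(content) >= max_images:
            continue
        image_bytes = requests.get(image_url).content
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(image_bytes).decode(),
            },
        })

    # Add the retrieved text context and the question
    text_context = "\n\n".join(m.metadata["text"] for m in matches)
    content.append({
        "type": "text",
        "text": f"Context:\n{text_context}\n\nQuestion: {query}",
    })

    response = llm.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: use any vision-capable model
        max_tokens=1024,
        messages=[{"role": "user", "content": content}],
    )
    return response.content[0].text

matches = search("cell viability and treatment effects")
print(answer_with_images("What do the figures show about cell viability?", matches))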

Reducto features for better results

These Reducto settings can improve your multimodal RAG pipeline:
Reducto automatically generates AI descriptions of figures and includes them in the embed field. This is why queries like “show me the revenue chart” can find relevant figures even if “revenue” doesn’t appear in surrounding text. This is controlled by summarize_figures (default: True). Keep it enabled for multimodal RAG.
For documents with complex charts, enable agentic figure extraction for higher accuracy:
result = client.parse.run(
    input=upload.file_id,
    settings={"return_images": ["figure", "table"]},
    enhance={
        "agentic": [{"scope": "figure"}]
    }
)
This uses vision LLMs to better understand chart content. It adds latency and cost but improves extraction quality.
Enable embedding_optimized for output specifically tuned for vector embeddings:
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {"chunk_mode": "section"},
        "embedding_optimized": True
    }
)
If you only want certain content types, use filter_blocks to exclude others from the embed field:
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "filter_blocks": ["Header", "Footer", "Page Number"]
    }
)

Best practices

The chunk_mode setting affects how figures relate to surrounding text:
  • section: Groups content by document sections. Best for structured documents like research papers.
  • page: One chunk per page. Simple and predictable.
  • variable: Adaptive chunking based on content density.
For multimodal RAG, section usually works best because it keeps figures with their explanatory text.
Sending images to your LLM adds latency and cost. Consider routing:
  • Simple factual questions → Text-only RAG
  • Questions about trends, comparisons, or visuals → Multimodal RAG
You can implement this by checking has_images in retrieved chunks before deciding which path to take.
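A minimal routing sketch along those lines, reusing the search function from Step 5 (how you prompt each path downstream is up to you):
def route_query(query, top_k=3):
    """Retrieve chunks, then pick the text-only or multimodal path."""
    matches = search(query, top_k=top_k)
    text_context = "\n\n".join(m.metadata["text"] for m in matches)

    # If any retrieved chunk has an associated figure or table, go multimodal
    if any(m.metadata.get("has_images") for m in matches):
        image_urls = [m.metadata["image_url"] for m in matches if m.metadata.get("image_url")]
        return {"mode": "multimodal", "context": text_context, "images": image_urls}

    return {"mode": "text_only", "context": text_context, "images": []}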
VoyageAI’s free tier has a 3 requests per minute limit. For production:
  • Batch multiple texts in a single embedding call
  • Add delays between calls
  • Upgrade to a paid plan for higher limits
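For example, a rough sketch of batched embedding with a pause between calls (the batch size and delay are arbitrary here, and vo is the VoyageAI client from Step 4):
import time

def embed_in_batches(texts, batch_size=32, delay_seconds=21):
    """Embed texts in batches, pausing between calls to respect rate limits."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # One API call per batch instead of one call per chunk
        embeddings.extend(vo.embed(batch, model="voyage-3").embeddings)
        if start + batch_size < len(texts):
            time.sleep(delay_seconds)  # stays around 3 requests per minute on the free tier
    return embeddings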
More images means higher LLM costs and slower responses. For most questions, 2-3 images is sufficient. Use top_k=3 in your search function.

Complete example

Here’s the full pipeline in a single script:
import os
import requests
import boto3
from pinecone import Pinecone
import voyageai
from reducto import Reducto

# Initialize clients
reducto = Reducto()
s3 = boto3.client("s3")
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
voyage = voyageai.Client(api_key=os.environ["VOYAGEAI_API_KEY"])

bucket_name = os.environ["S3_BUCKET_NAME"]
index_name = "multimodal-rag"

# Get bucket region for S3 URLs
region = s3.get_bucket_location(Bucket=bucket_name).get("LocationConstraint") or "us-east-1"

def upload_image_to_s3(image_url, s3_key):
    """Download image from Reducto and upload to S3."""
    response = requests.get(image_url)
    response.raise_for_status()
    s3.put_object(Bucket=bucket_name, Key=s3_key, Body=response.content, ContentType="image/png")
    return f"https://{bucket_name}.s3.{region}.amazonaws.com/{s3_key}"

def index_document(file_path):
    """Parse document, upload images to S3, and index in Pinecone."""
    # Parse with image extraction
    with open(file_path, "rb") as f:
        upload = reducto.upload(file=f)

    result = reducto.parse.run(
        input=upload.file_id,
        settings={"return_images": ["figure", "table"]},
        retrieval={"chunking": {"chunk_mode": "section"}}
    )

    # Process chunks and upload images
    records = []
    for i, chunk in enumerate(result.result.chunks):
        image_url = None
        for block in chunk.blocks:
            if block.type in ["Figure", "Table"] and block.image_url:
                s3_key = f"{result.job_id}/chunk_{i}.png"
                image_url = upload_image_to_s3(block.image_url, s3_key)
                break

        # Embed and prepare record
        embedding = voyage.embed([chunk.embed[:8000]], model="voyage-3").embeddings[0]  # truncate to stay under input limits
        records.append({
            "id": f"{result.job_id}_{i}",
            "values": embedding,
            "metadata": {
                "text": chunk.embed,
                "page": chunk.blocks[0].bbox.page if chunk.blocks else 0,
                "image_url": image_url or "",
                "has_image": image_url is not None
            }
        })

    # Upsert to Pinecone
    index = pc.Index(index_name)
    index.upsert(vectors=records)
    return len(records)

def search(query, top_k=3):
    """Search for relevant chunks."""
    embedding = voyage.embed([query], model="voyage-3").embeddings[0]
    index = pc.Index(index_name)
    results = index.query(vector=embedding, top_k=top_k, include_metadata=True)
    return results.matches

def query_with_images(query):
    """Search and prepare context for LLM."""
    matches = search(query, top_k=3)
    context = "\n\n".join([m.metadata["text"] for m in matches])
    image_urls = [m.metadata["image_url"] for m in matches if m.metadata.get("image_url")]
    return {"context": context, "images": image_urls, "query": query}

# Usage
indexed = index_document("research_paper.pdf")
print(f"Indexed {indexed} chunks")

result = query_with_images("What do the experimental results show?")
print(f"Context: {len(result['context'])} chars, Images: {len(result['images'])}")
# Pass result to your vision LLM

Next steps