Chunking Methods

When Reducto parses a document, it extracts individual blocks: paragraphs, headers, tables, figures, list items. Chunking controls how these blocks are grouped together when returned in the API response. For a complete overview of the response structure, see Parse Response Format.

This matters for RAG pipelines: most embedding models and vector databases work best with text segments of a specific size. Too small, and you lose context. Too large, and retrieval becomes imprecise. Chunking lets you control this tradeoff without post-processing the response yourself.

Basic Usage

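# Assumes `client` is an initialized Reducto client and `upload` is the result of a prior file upload
result = client.parse.run(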
    input=upload.file_id,
    retrieval={
        "chunking": {
            "chunk_mode": "variable",
            "chunk_size": 1000
        }
    }
)

# Response contains grouped blocks
for chunk in result.result.chunks:
    print(chunk.content)  # Combined content of all blocks in this chunk
    print(chunk.blocks)   # Individual blocks with metadata
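
Before wiring chunks into a pipeline, it can help to check how large each one is. The sketch below is a hypothetical helper, not part of the Reducto SDK: summarize_chunks and the sample objects are illustrative names, and the only assumption about the response is that each chunk exposes content and blocks as shown above.

```python
from types import SimpleNamespace


def summarize_chunks(chunks):
    """Return one summary dict per chunk: index, character count, block count."""
    return [
        {"index": i, "chars": len(chunk.content), "blocks": len(chunk.blocks)}
        for i, chunk in enumerate(chunks)
    ]


# Stand-in objects shaped like result.result.chunks (hypothetical sample data).
sample_chunks = [
    SimpleNamespace(content="Intro paragraph.", blocks=[object(), object()]),
    SimpleNamespace(content="A table.", blocks=[object()]),
]

summary = summarize_chunks(sample_chunks)
for row in summary:
    print(row)
```

With a real response, pass result.result.chunks instead of sample_chunks.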

Chunking Modes

Variable

Best for: RAG, semantic search

Groups blocks to target a specific character count while keeping semantically related content together. This is the recommended mode for most RAG applications.

retrieval={"chunking": {"chunk_mode": "variable", "chunk_size": 1000}}

The algorithm:

1. Groups blocks by document structure (new group at each title/section header)
2. Splits oversized groups at natural boundaries (points where blocks are physically separated on the page)
3. Applies adjacency rules: keeps headers with content, figures with captions, list items together
4. Merges undersized groups to reach the target range
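
To make the grouping and merging steps concrete, here is a simplified sketch. It is not Reducto's implementation: it only imitates the first step (grouping at headers) and the last step (merging undersized groups), skips the boundary-splitting and adjacency rules entirely, and assumes blocks are plain dicts with type and content keys. All function names are hypothetical.

```python
HEADER_TYPES = {"Title", "Section Header"}


def group_by_headers(blocks):
    """Start a new group at each title or section header."""
    groups = []
    for block in blocks:
        if block["type"] in HEADER_TYPES or not groups:
            groups.append([])
        groups[-1].append(block)
    return groups


def group_len(group):
    """Total character count of all block contents in a group."""
    return sum(len(block["content"]) for block in group)


def merge_small_groups(groups, min_chars):
    """Extend an undersized group with the group that follows it."""
    merged = []
    for group in groups:
        if merged and group_len(merged[-1]) < min_chars:
            merged[-1] = merged[-1] + group
        else:
            merged.append(list(group))
    return merged


# Hypothetical sample blocks.
blocks = [
    {"type": "Title", "content": "Report"},
    {"type": "Text", "content": "x" * 50},
    {"type": "Section Header", "content": "Methods"},
    {"type": "Text", "content": "y" * 400},
    {"type": "Section Header", "content": "Results"},
    {"type": "Text", "content": "z" * 400},
]

groups = group_by_headers(blocks)
chunks = merge_small_groups(groups, min_chars=300)
print([group_len(g) for g in chunks])
```

In this sample the short title group is folded into the Methods section, while the Results section stays on its own because the preceding group already meets the minimum.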

Size behavior: When you specify chunk_size: 1000, chunks will range from 750 to 1250 characters (±25%). Without a size specified, the default range is 750-1250.
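
The ±25% rule can be written out as a small calculation. This is a sketch of the arithmetic only: chunk_size_range is a hypothetical helper, and how the service rounds or enforces the bounds internally is an assumption.

```python
def chunk_size_range(chunk_size=1000, tolerance=0.25):
    """Return the (low, high) character range implied by a target chunk size."""
    low = round(chunk_size * (1 - tolerance))
    high = round(chunk_size * (1 + tolerance))
    return low, high


default_range = chunk_size_range()      # matches the documented 750-1250 default
larger_range = chunk_size_range(2000)   # range for a 2000-character target
print(default_range, larger_range)
```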

Section

Best for: Hierarchical documents, manuals, legal docs

Each chunk starts at a title or section header and contains everything until the next header. No size limits.

retrieval={"chunking": {"chunk_mode": "section"}}

Use when document structure is meaningful and you want to preserve it. Chunks can be large if sections are long.

Page

Best for: Presentations, page-specific analysis, spreadsheets

One chunk per page. For spreadsheets, one chunk per sheet.

retrieval={"chunking": {"chunk_mode": "page"}}

Use when page boundaries matter or when each page/sheet is a self-contained unit.

Page Sections

Best for: Documents where both page and section context matter

Splits by page first, then by sections within each page.

retrieval={"chunking": {"chunk_mode": "page_sections"}}

Useful when you need to know which page content came from while also preserving section structure.
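
One way to recover the source page of a chunk is to read the page field from each block's bounding box, as documented under Response Structure. The sketch below assumes chunks are available as plain dicts in that shape; chunk_pages is a hypothetical helper.

```python
def chunk_pages(chunk):
    """Return the sorted list of distinct page numbers a chunk's blocks come from."""
    return sorted({block["bbox"]["page"] for block in chunk["blocks"]})


# Hypothetical chunk in the documented response shape.
sample_chunk = {
    "content": "Section text",
    "blocks": [
        {"type": "Section Header", "content": "Scope", "bbox": {"page": 3}},
        {"type": "Text", "content": "Body text", "bbox": {"page": 3}},
    ],
}

pages = chunk_pages(sample_chunk)
print(pages)
```

In page_sections mode every chunk should report a single page. In other modes the list may contain several pages.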

Block

Best for: Maximum granularity, custom chunking logic

Each block becomes its own chunk. Gives you the finest granularity to implement your own chunking logic downstream.

retrieval={"chunking": {"chunk_mode": "block"}}
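
As an illustration of downstream chunking on block-mode output, the sketch below builds overlapping windows of consecutive blocks. This is one possible strategy, not a Reducto feature: window_blocks and its parameters are hypothetical, and the input is assumed to be a list of block texts, for example the content of each chunk returned in block mode.

```python
def window_blocks(block_texts, window=3, overlap=1):
    """Join consecutive blocks into windows that share `overlap` blocks."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    windows = []
    for start in range(0, len(block_texts), step):
        windows.append("\n\n".join(block_texts[start:start + window]))
        if start + window >= len(block_texts):
            break  # the last window already reaches the final block
    return windows


# Hypothetical block texts.
texts = ["a", "b", "c", "d", "e"]
windows = window_blocks(texts, window=3, overlap=1)
print(windows)
```

Overlap trades some storage for context: a sentence near a window boundary is retrievable from both neighbouring windows.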

Disabled

Best for: Small documents, no chunking needed

Returns all blocks as a single chunk.

retrieval={"chunking": {"chunk_mode": "disabled"}}

Optimizing for Embeddings

Tables often embed poorly because their structure doesn’t translate well to vector representations. Enable embedding_optimized to generate natural language summaries of tables:

retrieval={
    "chunking": {"chunk_mode": "variable"},
    "embedding_optimized": True
}

With this enabled, each chunk has two content fields:

content: Original format (tables as HTML/markdown)
embed: Optimized for embeddings (tables converted to summaries like “This table shows quarterly revenue by region…”)

Use embed for vector search, content for display.
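
A minimal sketch of that split, assuming chunks are available as dicts with content and an optional embed field. The record layout and the to_vector_records name are hypothetical. Adapt them to whatever your vector store expects.

```python
def to_vector_records(chunks):
    """Pair the text to embed with the text to display for each chunk."""
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "id": f"chunk-{i}",
            # Fall back to content if embed is missing or empty.
            "text_to_embed": chunk.get("embed") or chunk["content"],
            "text_to_display": chunk["content"],
        })
    return records


# Hypothetical sample chunks.
sample = [
    {"content": "<table>...</table>", "embed": "This table shows quarterly revenue by region"},
    {"content": "Plain paragraph."},
]

records = to_vector_records(sample)
print(records[0]["text_to_embed"])
```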

Filtering Block Types

Some content types (headers, footers, page numbers) add noise to search results. Filter them out:

retrieval={"filter_blocks": ["Header", "Footer", "Page Number"]}

Filtered blocks still appear in chunks[].blocks with full metadata, but they’re excluded from the content and embed text fields.

Available types: Header, Footer, Title, Section Header, Page Number, List Item, Figure, Table, Key Value, Text, Comment, Signature.
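
Because the type names are case- and space-sensitive strings, a typo is easy to miss. The sketch below validates a filter list locally against the types listed above before sending a request. build_filter is a hypothetical helper, and how the API itself reacts to an unknown type is not covered here.

```python
AVAILABLE_BLOCK_TYPES = {
    "Header", "Footer", "Title", "Section Header", "Page Number", "List Item",
    "Figure", "Table", "Key Value", "Text", "Comment", "Signature",
}


def build_filter(types):
    """Return a retrieval config fragment, rejecting unknown block types."""
    unknown = sorted(set(types) - AVAILABLE_BLOCK_TYPES)
    if unknown:
        raise ValueError(f"unknown block types: {unknown}")
    return {"filter_blocks": list(types)}


retrieval = build_filter(["Header", "Footer", "Page Number"])
print(retrieval)
```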

Response Structure

Each chunk in the response contains:

{
  "content": "Combined markdown content of all blocks in this chunk",
  "embed": "Embedding-optimized version (tables summarized if embedding_optimized=True)",
  "blocks": [
    {
      "type": "Text",
      "content": "Individual block content",
      "bbox": {"left": 0.1, "top": 0.2, "width": 0.8, "height": 0.05, "page": 1},
      "confidence": "high"
    }
  ]
}

The blocks array gives you access to individual elements with their bounding boxes and types, useful for citations or highlighting source locations. See Parse Response Format for complete field documentation.
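
For highlighting, a bounding box usually has to be mapped onto a rendered page image. The sketch below assumes the bbox values are fractions of the page width and height, which the sample values suggest but this page does not state. bbox_to_pixels is a hypothetical helper.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a fractional bbox to (left, top, right, bottom) in pixels.

    Assumes left/top/width/height are fractions of the page dimensions.
    """
    left = round(bbox["left"] * page_width)
    top = round(bbox["top"] * page_height)
    right = round((bbox["left"] + bbox["width"]) * page_width)
    bottom = round((bbox["top"] + bbox["height"]) * page_height)
    return left, top, right, bottom


# The bbox from the example response, mapped onto a 1000 x 2000 pixel rendering.
bbox = {"left": 0.1, "top": 0.2, "width": 0.8, "height": 0.05, "page": 1}
rect = bbox_to_pixels(bbox, page_width=1000, page_height=2000)
print(rect)
```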

Troubleshooting

Chunks are too large

Reduce chunk_size or switch to block mode and implement your own chunking logic on the individual blocks.

Chunks lack sufficient context

Increase chunk_size or use section mode if your document has well-defined sections.
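
To decide which direction to adjust chunk_size, it can help to count how many chunks fall outside the range you expect. This is a hypothetical diagnostic, not an SDK feature. The default bounds mirror the 750-1250 range described earlier.

```python
def size_report(chunk_texts, low=750, high=1250):
    """Count chunks below, within, and above the expected character range."""
    report = {"too_small": 0, "in_range": 0, "too_large": 0}
    for text in chunk_texts:
        n = len(text)
        if n < low:
            report["too_small"] += 1
        elif n > high:
            report["too_large"] += 1
        else:
            report["in_range"] += 1
    return report


# Hypothetical chunk texts of 100, 1000, and 3000 characters.
report = size_report(["a" * 100, "b" * 1000, "c" * 3000])
print(report)
```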

Tables are split across chunks

Increase chunk_size to accommodate full tables. Alternatively, enable formatting.merge_tables to combine consecutive tables with the same column structure before chunking.
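
A sketch of combining both adjustments in one set of options. It assumes formatting is passed alongside retrieval as a top-level parse argument, which the dotted name formatting.merge_tables suggests but this page does not show. build_parse_options is a hypothetical helper.

```python
def build_parse_options(chunk_size=None, merge_tables=False):
    """Build keyword arguments for a parse call (option placement is an assumption)."""
    chunking = {"chunk_mode": "variable"}
    if chunk_size is not None:
        chunking["chunk_size"] = chunk_size
    options = {"retrieval": {"chunking": chunking}}
    if merge_tables:
        # Assumed location of the merge_tables flag.
        options["formatting"] = {"merge_tables": True}
    return options


options = build_parse_options(chunk_size=2000, merge_tables=True)
print(options)
# Possible usage, if the assumption holds:
# result = client.parse.run(input=upload.file_id, **options)
```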

Retrieval quality is poor for tables

Enable embedding_optimized: True to generate natural language summaries of tables for the embed field.
