When Reducto parses a document, it extracts individual blocks: paragraphs, headers, tables, figures, list items. Chunking controls how these blocks are grouped together when returned in the API response. For a complete overview of the response structure, see Parse Response Format.
This matters for RAG pipelines: most embedding models and vector databases work best with text segments of a specific size. Too small, and you lose context. Too large, and retrieval becomes imprecise. Chunking lets you control this tradeoff without post-processing the response yourself.

Basic Usage

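# Assumes `client` is an initialized Reducto client and `upload` is the
# result of an earlier upload call (neither is shown in this snippet).
result = client.parse.run(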
    input=upload.file_id,
    retrieval={
        "chunking": {
            "chunk_mode": "variable",
            "chunk_size": 1000
        }
    }
)

# Response contains grouped blocks
for chunk in result.result.chunks:
    print(chunk.content)  # Combined content of all blocks in this chunk
    print(chunk.blocks)   # Individual blocks with metadata

Chunking Modes

Best for: RAG, semantic search
The variable mode groups blocks to target a specific character count while keeping semantically related content together. This is the recommended mode for most RAG applications.
retrieval={"chunking": {"chunk_mode": "variable", "chunk_size": 1000}}
The algorithm:
  1. Groups blocks by document structure (new group at each title/section header)
  2. Splits oversized groups at natural boundaries (points where blocks are physically separated on the page)
  3. Applies adjacency rules: keeps headers with content, figures with captions, list items together
  4. Merges undersized groups to reach the target range
Size behavior: When you specify chunk_size: 1000, chunks will range from 750 to 1250 characters (±25%). Without a size specified, the default range is 750-1250.
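The size arithmetic and the grouping steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not Reducto's implementation: `size_range` and `group_blocks` are hypothetical helper names, blocks are simplified to `(type, text)` tuples, and only steps 1 and 4 are modelled (page layout and adjacency rules are ignored).

```python
def size_range(chunk_size=1000, tolerance=0.25):
    """Return the (min, max) character range implied by a +/-25% tolerance."""
    low = int(chunk_size * (1 - tolerance))
    high = int(chunk_size * (1 + tolerance))
    return low, high


def group_blocks(blocks, chunk_size=1000):
    """Simplified illustration of steps 1 and 4 only.

    `blocks` is a list of (type, text) tuples. Hypothetical helper:
    it ignores physical page layout and the adjacency rules.
    """
    low, _high = size_range(chunk_size)

    # Step 1: start a new group at each title or section header.
    groups = []
    for block_type, text in blocks:
        if block_type in ("Title", "Section Header") or not groups:
            groups.append([])
        groups[-1].append(text)

    # Step 4: fold a group into the previous one while the previous
    # one is still below the lower bound of the target range.
    merged = []
    for group in groups:
        if merged and sum(len(t) for t in merged[-1]) < low:
            merged[-1].extend(group)
        else:
            merged.append(group)

    return ["\n".join(group) for group in merged]


print(size_range(1000))  # (750, 1250)
```

A real implementation would also need step 2 (splitting oversized groups) and step 3 (adjacency rules), which depend on layout information this sketch does not have.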

Optimizing for Embeddings

Tables often embed poorly because their structure doesn’t translate well to vector representations. Enable embedding_optimized to generate natural language summaries of tables:
retrieval={
    "chunking": {"chunk_mode": "variable"},
    "embedding_optimized": True
}
With this enabled, each chunk has two content fields:
  • content: Original format (tables as HTML/markdown)
  • embed: Optimized for embeddings (tables converted to summaries like “This table shows quarterly revenue by region…”)
Use embed for vector search, content for display.
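As a rough sketch of that split, the snippet below builds index records that keep both fields side by side. It assumes each chunk is available as a plain dict (for example, after loading the JSON response); `text_for_embedding` and `build_index_records` are hypothetical helper names, and the fallback to content when embed is absent is a design choice of this sketch rather than documented behavior.

```python
def text_for_embedding(chunk):
    """Prefer `embed`; fall back to `content` if `embed` is missing or empty."""
    return chunk.get("embed") or chunk["content"]


def build_index_records(chunks):
    """Pair the text to embed with the text to show for each chunk."""
    records = []
    for i, chunk in enumerate(chunks):
        records.append({
            "id": i,
            "vector_text": text_for_embedding(chunk),  # send this to the embedding model
            "display_text": chunk["content"],          # show this to the user
        })
    return records
```

Keeping both strings in one record means a search hit on the summary can still render the original table.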

Filtering Block Types

Some content types (headers, footers, page numbers) add noise to search results. Filter them out:
retrieval={"filter_blocks": ["Header", "Footer", "Page Number"]}
Filtered blocks still appear in chunks[].blocks with full metadata, but they’re excluded from the content and embed text fields.
Available types: Header, Footer, Title, Section Header, Page Number, List Item, Figure, Table, Key Value, Text, Comment, Signature.
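Because filtered blocks remain in the blocks array, you can still inspect what was left out of the text fields. The sketch below separates the two sets on the client side. It assumes a chunk is a plain dict with a blocks list whose items carry a type key; `partition_blocks` is a hypothetical helper, not part of the API.

```python
FILTERED_TYPES = {"Header", "Footer", "Page Number"}


def partition_blocks(chunk, filtered_types=FILTERED_TYPES):
    """Split a chunk's blocks into (kept, dropped) by block type."""
    kept, dropped = [], []
    for block in chunk["blocks"]:
        if block["type"] in filtered_types:
            dropped.append(block)  # present in blocks, absent from content/embed
        else:
            kept.append(block)
    return kept, dropped
```

This can be handy for auditing a filter list before relying on it across a whole corpus.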

Response Structure

Each chunk in the response contains:
{
  "content": "Combined markdown content of all blocks in this chunk",
  "embed": "Embedding-optimized version (tables summarized if embedding_optimized=True)",
  "blocks": [
    {
      "type": "Text",
      "content": "Individual block content",
      "bbox": {"left": 0.1, "top": 0.2, "width": 0.8, "height": 0.05, "page": 1},
      "confidence": "high"
    }
  ]
}
The blocks array gives you access to individual elements with their bounding boxes and types, useful for citations or highlighting source locations. See Parse Response Format for complete field documentation.
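A minimal sketch of turning blocks into citation records follows. It assumes a chunk is a plain dict shaped like the example above and that bbox values are fractions of the page dimensions (the sample values suggest this, but confirm against Parse Response Format). `block_citations` and `to_pixels` are hypothetical helper names.

```python
def block_citations(chunk, snippet_len=80):
    """Build one citation record per block: page, type, snippet, and box."""
    citations = []
    for block in chunk["blocks"]:
        bbox = block["bbox"]
        citations.append({
            "page": bbox["page"],
            "type": block["type"],
            "snippet": block["content"][:snippet_len],
            "box": (bbox["left"], bbox["top"], bbox["width"], bbox["height"]),
        })
    return citations


def to_pixels(bbox, page_width, page_height):
    """Convert a bbox to pixel units, ASSUMING values are page fractions."""
    return (
        round(bbox["left"] * page_width),
        round(bbox["top"] * page_height),
        round(bbox["width"] * page_width),
        round(bbox["height"] * page_height),
    )
```

The pixel conversion is what a viewer would need to draw a highlight rectangle over a rendered page image.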

Troubleshooting

If chunks are too large: reduce chunk_size or switch to block mode and implement your own chunking logic on the individual blocks.
If chunks are too small: increase chunk_size, or use section mode if your document has well-defined sections.
If tables are split across chunks: increase chunk_size to accommodate full tables. Alternatively, enable formatting.merge_tables to combine consecutive tables with the same column structure before chunking.
If tables retrieve poorly in vector search: enable embedding_optimized: True to generate natural language summaries of tables for the embed field.
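Before changing settings, it can help to measure how the returned chunks compare with the target range. The sketch below is a hypothetical diagnostic, assuming chunks are plain dicts with a content string and that the ±25% range described earlier applies; `chunk_size_report` is not part of the API.

```python
def chunk_size_report(chunks, chunk_size=1000, tolerance=0.25):
    """Count chunks below, within, and above the expected character range."""
    low = chunk_size * (1 - tolerance)
    high = chunk_size * (1 + tolerance)
    lengths = [len(chunk["content"]) for chunk in chunks]
    return {
        "count": len(lengths),
        "too_small": sum(1 for n in lengths if n < low),
        "in_range": sum(1 for n in lengths if low <= n <= high),
        "too_large": sum(1 for n in lengths if n > high),
    }
```

A high too_large count may point to large tables or other content kept together, while a high too_small count may point to many short sections.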