
# Chunking Methods

> Control how parsed content is grouped in API responses

When Reducto parses a document, it extracts individual **blocks**: paragraphs, headers, tables, figures, list items. Chunking controls how these blocks are **grouped together** when returned in the API response. For a complete overview of the response structure, see [Parse Response Format](/parse/response-format).

This matters for RAG pipelines: most embedding models and vector databases work best with text segments of a specific size. Too small, and you lose context. Too large, and retrieval becomes imprecise. Chunking lets you control this tradeoff without post-processing the response yourself.

## Basic Usage

<CodeGroup>
  ```python Python theme={null}
  result = client.parse.run(
      input=upload.file_id,
      retrieval={
          "chunking": {
              "chunk_mode": "variable",
              "chunk_size": 1000
          }
      }
  )

  # Response contains grouped blocks
  for chunk in result.result.chunks:
      print(chunk.content)  # Combined content of all blocks in this chunk
      print(chunk.blocks)   # Individual blocks with metadata
  ```

  ```javascript Node.js theme={null}
  const result = await client.parse.run({
    input: upload.file_id,
    retrieval: {
      chunking: {
        chunk_mode: 'variable',
        chunk_size: 1000
      }
    }
  });

  // Response contains grouped blocks
  for (const chunk of result.result.chunks) {
    console.log(chunk.content);  // Combined content of all blocks in this chunk
    console.log(chunk.blocks);   // Individual blocks with metadata
  }
  ```

  ```go Go theme={null}
  result, _ := client.Parse.Run(context.Background(), reducto.ParseRunParams{
      ParseConfig: reducto.ParseConfigParam{
          DocumentURL: reducto.F[reducto.ParseConfigDocumentURLUnionParam](
              shared.UnionString(upload.FileID),
          ),
          Options: reducto.F(shared.BaseProcessingOptionsParam{
              Chunking: reducto.F(shared.BaseProcessingOptionsChunkingParam{
                  ChunkMode: reducto.F(shared.BaseProcessingOptionsChunkingChunkModeVariable),
              }),
          }),
      },
  })
  ```

  ```bash cURL theme={null}
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": "reducto://your-file-id",
      "retrieval": {
        "chunking": {
          "chunk_mode": "variable",
          "chunk_size": 1000
        }
      }
    }'
  ```
</CodeGroup>

## Chunking Modes

<Tabs>
  <Tab title="variable">
    **Best for:** RAG, semantic search

    Groups blocks to target a specific character count while keeping semantically related content together. This is the recommended mode for most RAG applications.

    <CodeGroup>
      ```python Python theme={null}
      retrieval={"chunking": {"chunk_mode": "variable", "chunk_size": 1000}}
      ```

      ```javascript Node.js theme={null}
      retrieval: { chunking: { chunk_mode: 'variable', chunk_size: 1000 } }
      ```

      ```go Go theme={null}
      Options: reducto.F(shared.BaseProcessingOptionsParam{
          Chunking: reducto.F(shared.BaseProcessingOptionsChunkingParam{
              ChunkMode: reducto.F(shared.BaseProcessingOptionsChunkingChunkModeVariable),
          }),
      })
      ```

      ```bash cURL theme={null}
      "retrieval": {"chunking": {"chunk_mode": "variable", "chunk_size": 1000}}
      ```
    </CodeGroup>

    The algorithm:

    1. Groups blocks by document structure (new group at each title/section header)
    2. Splits oversized groups at natural boundaries (points where blocks are physically separated on the page)
    3. Applies adjacency rules: keeps headers with content, figures with captions, list items together
    4. Merges undersized groups to reach the target range

    **Size behavior:** With `chunk_size: 1000`, chunks range from 750 to 1250 characters (±25% of the target). If you omit `chunk_size`, the same 750-1250 range is used by default.
  </Tab>

  <Tab title="section">
    **Best for:** Hierarchical documents, manuals, legal docs

    Each chunk starts at a title or section header and contains everything until the next header. No size limits.

    <CodeGroup>
      ```python Python theme={null}
      retrieval={"chunking": {"chunk_mode": "section"}}
      ```

      ```javascript Node.js theme={null}
      retrieval: { chunking: { chunk_mode: 'section' } }
      ```

      ```go Go theme={null}
      Options: reducto.F(shared.BaseProcessingOptionsParam{
          Chunking: reducto.F(shared.BaseProcessingOptionsChunkingParam{
              ChunkMode: reducto.F(shared.BaseProcessingOptionsChunkingChunkModeSection),
          }),
      })
      ```

      ```bash cURL theme={null}
      "retrieval": {"chunking": {"chunk_mode": "section"}}
      ```
    </CodeGroup>

    Use when document structure is meaningful and you want to preserve it. Chunks can be large if sections are long.
  </Tab>

  <Tab title="page">
    **Best for:** Presentations, page-specific analysis, spreadsheets

    One chunk per page. For spreadsheets, one chunk per sheet.

    <CodeGroup>
      ```python Python theme={null}
      retrieval={"chunking": {"chunk_mode": "page"}}
      ```

      ```javascript Node.js theme={null}
      retrieval: { chunking: { chunk_mode: 'page' } }
      ```

      ```bash cURL theme={null}
      "retrieval": {"chunking": {"chunk_mode": "page"}}
      ```
    </CodeGroup>

    Use when page boundaries matter or when each page/sheet is a self-contained unit.
  </Tab>

  <Tab title="page_sections">
    **Best for:** Documents where both page and section context matter

    Splits by page first, then by sections within each page.

    <CodeGroup>
      ```python Python theme={null}
      retrieval={"chunking": {"chunk_mode": "page_sections"}}
      ```

      ```javascript Node.js theme={null}
      retrieval: { chunking: { chunk_mode: 'page_sections' } }
      ```

      ```bash cURL theme={null}
      "retrieval": {"chunking": {"chunk_mode": "page_sections"}}
      ```
    </CodeGroup>

    Useful when you need to know which page content came from while also preserving section structure.
  </Tab>

  <Tab title="block">
    **Best for:** Maximum granularity, custom chunking logic

    Each block becomes its own chunk. Gives you the finest granularity to implement your own chunking logic downstream.

    <CodeGroup>
      ```python Python theme={null}
      retrieval={"chunking": {"chunk_mode": "block"}}
      ```

      ```javascript Node.js theme={null}
      retrieval: { chunking: { chunk_mode: 'block' } }
      ```

      ```go Go theme={null}
      Options: reducto.F(shared.BaseProcessingOptionsParam{
          Chunking: reducto.F(shared.BaseProcessingOptionsChunkingParam{
              ChunkMode: reducto.F(shared.BaseProcessingOptionsChunkingChunkModeBlock),
          }),
      })
      ```

      ```bash cURL theme={null}
      "retrieval": {"chunking": {"chunk_mode": "block"}}
      ```
    </CodeGroup>
  </Tab>

  <Tab title="disabled">
    **Best for:** Small documents, no chunking needed

    Returns all blocks as a single chunk.

    <CodeGroup>
      ```python Python theme={null}
      retrieval={"chunking": {"chunk_mode": "disabled"}}
      ```

      ```javascript Node.js theme={null}
      retrieval: { chunking: { chunk_mode: 'disabled' } }
      ```

      ```go Go theme={null}
      Options: reducto.F(shared.BaseProcessingOptionsParam{
          Chunking: reducto.F(shared.BaseProcessingOptionsChunkingParam{
              ChunkMode: reducto.F(shared.BaseProcessingOptionsChunkingChunkModeDisabled),
          }),
      })
      ```

      ```bash cURL theme={null}
      "retrieval": {"chunking": {"chunk_mode": "disabled"}}
      ```
    </CodeGroup>
  </Tab>
</Tabs>
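
To see how closely `variable` chunks track the target, the ±25% window described under "Size behavior" can be computed and compared against chunk lengths. This is a minimal sketch, not Reducto's implementation: `size_window` and `out_of_window` are hypothetical helper names, and chunks are represented as plain dicts with a `content` key rather than SDK objects.

```python
def size_window(chunk_size=1000, tolerance=0.25):
    """Return the (min, max) character range for a given chunk_size."""
    low = int(chunk_size * (1 - tolerance))
    high = int(chunk_size * (1 + tolerance))
    return low, high


def out_of_window(chunks, chunk_size=1000):
    """Return indexes of chunks whose content length falls outside the window."""
    low, high = size_window(chunk_size)
    return [
        i for i, chunk in enumerate(chunks)
        if not low <= len(chunk["content"]) <= high
    ]


# Illustrative data only; real chunks come from the parse response.
chunks = [{"content": "a" * 800}, {"content": "b" * 300}, {"content": "c" * 1250}]
print(size_window(1000))      # (750, 1250)
print(out_of_window(chunks))  # [1]
```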

## Optimizing for Embeddings

Tables often embed poorly because their structure doesn't translate well to vector representations. Enable `embedding_optimized` to generate natural language summaries of tables:

<CodeGroup>
  ```python Python theme={null}
  retrieval={
      "chunking": {"chunk_mode": "variable"},
      "embedding_optimized": True
  }
  ```

  ```javascript Node.js theme={null}
  retrieval: {
    chunking: { chunk_mode: 'variable' },
    embedding_optimized: true
  }
  ```

  ```bash cURL theme={null}
  "retrieval": {
    "chunking": {"chunk_mode": "variable"},
    "embedding_optimized": true
  }
  ```
</CodeGroup>

With this enabled, each chunk has two content fields:

* `content`: Original format (tables as HTML/markdown)
* `embed`: Optimized for embeddings (tables converted to summaries like "This table shows quarterly revenue by region...")

Use `embed` for vector search, `content` for display.
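
One way to keep the two fields apart downstream is to build one record per chunk that carries both texts. The sketch below assumes chunks are available as plain dicts. `to_index_records` is a hypothetical helper, and the fallback to `content` when `embed` is absent is an assumption, not documented behavior.

```python
def to_index_records(chunks):
    """Pair each chunk's embedding text with its display text."""
    records = []
    for i, chunk in enumerate(chunks):
        display_text = chunk["content"]
        # Assumption: fall back to `content` when `embed` is missing or empty,
        # e.g. for responses parsed without embedding_optimized.
        embed_text = chunk.get("embed") or display_text
        records.append(
            {"id": i, "embed_text": embed_text, "display_text": display_text}
        )
    return records


# Illustrative data only.
chunks = [
    {
        "content": "<table>...</table>",
        "embed": "This table shows quarterly revenue by region...",
    },
    {"content": "Plain paragraph."},
]
records = to_index_records(chunks)
for record in records:
    print(record["id"], "->", record["embed_text"])
```

Pass `embed_text` to your embedding model and store `display_text` alongside the vector so that search results can show the original formatting.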

## Filtering Block Types

Some content types (headers, footers, page numbers) add noise to search results. Filter them out:

<CodeGroup>
  ```python Python theme={null}
  retrieval={"filter_blocks": ["Header", "Footer", "Page Number"]}
  ```

  ```javascript Node.js theme={null}
  retrieval: { filter_blocks: ['Header', 'Footer', 'Page Number'] }
  ```

  ```go Go theme={null}
  Options: reducto.F(shared.BaseProcessingOptionsParam{
      FilterBlocks: reducto.F([]shared.BaseProcessingOptionsFilterBlock{
          shared.BaseProcessingOptionsFilterBlockHeader,
          shared.BaseProcessingOptionsFilterBlockFooter,
          shared.BaseProcessingOptionsFilterBlockPageNumber,
      }),
  })
  ```

  ```bash cURL theme={null}
  "retrieval": {"filter_blocks": ["Header", "Footer", "Page Number"]}
  ```
</CodeGroup>

Filtered blocks still appear in `chunks[].blocks` with full metadata, but they're excluded from the `content` and `embed` text fields.

**Available types:** `Header`, `Footer`, `Title`, `Section Header`, `Page Number`, `List Item`, `Figure`, `Table`, `Key Value`, `Text`, `Comment`, `Signature`
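
Before deciding what to filter, it can help to tally which block types a parsed document actually contains. The following is a small sketch over chunks represented as plain dicts. `block_type_counts` is a hypothetical helper, and the sample data is made up for illustration.

```python
from collections import Counter


def block_type_counts(chunks):
    """Count how often each block type appears across all chunks."""
    counts = Counter()
    for chunk in chunks:
        for block in chunk.get("blocks", []):
            counts[block["type"]] += 1
    return counts


# Illustrative data only.
chunks = [
    {"blocks": [{"type": "Header"}, {"type": "Text"}, {"type": "Footer"}]},
    {"blocks": [{"type": "Text"}, {"type": "Page Number"}]},
]
counts = block_type_counts(chunks)
for block_type, n in counts.most_common():
    print(f"{block_type}: {n}")
```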

## Response Structure

Each chunk in the response contains:

```json theme={null}
{
  "content": "Combined markdown content of all blocks in this chunk",
  "embed": "Embedding-optimized version (tables summarized when embedding_optimized is enabled)",
  "blocks": [
    {
      "type": "Text",
      "content": "Individual block content",
      "bbox": {"left": 0.1, "top": 0.2, "width": 0.8, "height": 0.05, "page": 1},
      "confidence": "high"
    }
  ]
}
```

The `blocks` array gives you access to individual elements with their bounding boxes and types, useful for citations or highlighting source locations. See [Parse Response Format](/parse/response-format) for complete field documentation.
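
For highlighting, a block's `bbox` can be mapped onto a rendered page image. The sketch below assumes the bbox values are fractions of the page dimensions, as the 0-1 values in the example above suggest; confirm this against the response format documentation. `bbox_to_pixels` is a hypothetical helper.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a bbox to (x0, y0, x1, y1) pixel coordinates.

    Assumes left/top/width/height are fractions (0-1) of the page size.
    """
    x0 = bbox["left"] * page_width
    y0 = bbox["top"] * page_height
    x1 = (bbox["left"] + bbox["width"]) * page_width
    y1 = (bbox["top"] + bbox["height"]) * page_height
    return tuple(round(v) for v in (x0, y0, x1, y1))


block = {
    "type": "Text",
    "content": "Individual block content",
    "bbox": {"left": 0.1, "top": 0.2, "width": 0.8, "height": 0.05, "page": 1},
}
rect = bbox_to_pixels(block["bbox"], page_width=1000, page_height=2000)
print(block["bbox"]["page"], rect)  # 1 (100, 400, 900, 500)
```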

## Troubleshooting

<AccordionGroup>
  <Accordion title="Chunks are too large">
    Reduce `chunk_size`, or switch to `block` mode and implement your own chunking logic on the individual blocks.
  </Accordion>

  <Accordion title="Chunks lack sufficient context">
    Increase `chunk_size` or use `section` mode if your document has well-defined sections.
  </Accordion>

  <Accordion title="Tables are split across chunks">
    Increase `chunk_size` to accommodate full tables. Alternatively, enable `formatting.merge_tables` to combine consecutive tables with the same column structure before chunking.
  </Accordion>

  <Accordion title="Retrieval quality is poor for tables">
    Enable `embedding_optimized` to generate natural language summaries of tables for the `embed` field.
  </Accordion>
</AccordionGroup>
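
If you go the `block` route, custom chunking can be as simple as greedily packing block texts up to a character budget. This is a minimal sketch of one such strategy, not Reducto's `variable` algorithm. `pack_blocks`, `max_chars`, and `separator` are hypothetical names, and the input is a plain list of block strings.

```python
def pack_blocks(blocks, max_chars=1000, separator="\n\n"):
    """Greedily pack block texts into chunks of at most max_chars.

    A single block longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for text in blocks:
        # Account for the separator only when joining onto existing text.
        added = len(text) + (len(separator) if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(separator.join(current))
            current, current_len = [], 0
            added = len(text)
        current.append(text)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


# Illustrative data only.
blocks = ["a" * 400, "b" * 400, "c" * 400, "d" * 1500]
packed = pack_blocks(blocks, max_chars=1000)
print([len(chunk) for chunk in packed])  # [802, 400, 1500]
```

With `chunk_mode: "block"`, each returned chunk holds a single block, so the input list could be built from each chunk's `content`.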
