Chunking methods

Understanding blocks

Blocks are single units of information in a document.
Each distinct element (paragraphs, headers, images, titles, tables) is typically a separate block.
Chunks are composed of one or more blocks.
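
To make the block/chunk relationship concrete, here is a minimal sketch of the two concepts as plain data structures. The `Block` and `Chunk` classes and their field names are hypothetical, chosen for illustration only; they are not the API's response types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical model of one unit of information in a document."""
    type: str          # e.g. "Title", "Text", "Table"
    content: str
    page: int = 1


@dataclass
class Chunk:
    """Hypothetical model of a chunk: an ordered group of one or more blocks."""
    blocks: List[Block] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Join block texts in reading order to form the chunk text.
        return "\n\n".join(b.content for b in self.blocks)


chunk = Chunk(blocks=[
    Block("Section Header", "Results"),
    Block("Text", "Revenue grew in Q3."),
])
print(chunk.content)
```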

Chunking strategies

1. Variable chunking

Aims for a target chunk size that you can set; the default target is 1000 characters.
Approximates the target size dynamically, using layout, spatial, and semantic information to preserve the document's structure.
Can break long sections or merge short ones.
Useful for documents with varying section lengths or when consistent chunk sizes are needed.
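
The idea of approximating a target size can be sketched with a simple greedy packer. This is only an illustration of size-targeted merging: it assumes blocks are plain strings, ignores the layout, spatial, and semantic signals described above, and does not split an oversized block. The function name `variable_chunks` is hypothetical.

```python
def variable_chunks(blocks, target=1000):
    """Greedily pack block texts into chunks of roughly `target` characters.

    Illustrative only: merges short blocks, never splits a long one.
    """
    chunks, current, size = [], [], 0
    for text in blocks:
        # Close the current chunk if adding this block would overshoot the target.
        if current and size + len(text) > target:
            chunks.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append(current)
    return chunks


blocks = ["a" * 400, "b" * 500, "c" * 300, "d" * 1200, "e" * 100]
sizes = [sum(len(t) for t in c) for c in variable_chunks(blocks)]
print(sizes)  # [900, 300, 1200, 100]
```

Note that the 1200-character block stays whole here, whereas the variable strategy described above can also break long sections.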

2. Section-based chunking

Identifies “title” and “section header” blocks.
Groups blocks until a new section header/title is encountered.
Breaks documents into coherent subsections.
Ideal for documents with clear hierarchical structures.
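
The grouping rule above can be sketched in a few lines. This assumes each block is a `(block_type, text)` tuple in reading order; the tuple shape and the `section_chunks` name are hypothetical, and the real implementation may handle edge cases (such as a title followed directly by a header) differently.

```python
SECTION_STARTS = {"Title", "Section Header"}


def section_chunks(blocks):
    """Group (block_type, text) tuples; a new chunk starts at each title or header."""
    chunks = []
    for block_type, text in blocks:
        # Start a new chunk at a section boundary, or for the very first block.
        if block_type in SECTION_STARTS or not chunks:
            chunks.append([])
        chunks[-1].append((block_type, text))
    return chunks


blocks = [
    ("Title", "Annual Report"),
    ("Text", "Introduction."),
    ("Section Header", "Revenue"),
    ("Text", "Up 10%."),
    ("Table", "Q1 | Q2"),
    ("Section Header", "Costs"),
    ("Text", "Flat."),
]
print([len(c) for c in section_chunks(blocks)])  # [2, 3, 2]
```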

3. Page-based chunking

Returns one chunk for each page in a document or sheet in a spreadsheet.
Maintains the original page/sheet structure of the source material.
Useful for page-specific analysis or when page boundaries are significant.
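
As a rough sketch of the same idea, page-based grouping amounts to collecting consecutive blocks that share a page number. The `(page, text)` tuple shape and the `page_chunks` name are assumptions made for this example.

```python
from itertools import groupby


def page_chunks(blocks):
    """Group (page, text) tuples, given in reading order, into one chunk per page."""
    return [
        [text for _, text in group]
        for _, group in groupby(blocks, key=lambda b: b[0])
    ]


blocks = [(1, "Cover"), (1, "Abstract"), (2, "Methods"), (3, "Results"), (3, "Table")]
print(page_chunks(blocks))
# [['Cover', 'Abstract'], ['Methods'], ['Results', 'Table']]
```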

4. Page sections chunking

Combines page and section-based chunking.
First splits the document by pages, then within each page splits by sections.
Maintains both page and section boundaries.
Ideal for documents where both page and section organization are important.
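
Combining the two boundaries can be sketched as follows: a new chunk starts whenever the page changes or a title or section header appears. The `(page, block_type, text)` tuple shape and the `page_section_chunks` name are hypothetical.

```python
SECTION_STARTS = {"Title", "Section Header"}


def page_section_chunks(blocks):
    """Split (page, block_type, text) tuples on page changes and section starts."""
    chunks = []
    last_page = None
    for page, block_type, text in blocks:
        # A chunk never crosses a page boundary or a section boundary.
        if page != last_page or block_type in SECTION_STARTS:
            chunks.append([])
        chunks[-1].append(text)
        last_page = page
    return chunks


blocks = [
    (1, "Section Header", "Intro"),
    (1, "Text", "a"),
    (2, "Text", "b"),  # same section, but a new page starts a new chunk
    (2, "Section Header", "Methods"),
    (2, "Text", "c"),
]
print(page_section_chunks(blocks))
# [['Intro', 'a'], ['b'], ['Methods', 'c']]
```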

5. Block chunking

Each block becomes a separate chunk.
Essentially disables advanced chunking.
Useful when you want to implement your own chunking strategy with the layout information.
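
As one example of a custom strategy built on per-block output, the sketch below keeps every table as its own chunk and merges runs of other blocks. It assumes you have already extracted `(block_type, text)` tuples from the block chunks; that shape and the `custom_chunks` name are hypothetical.

```python
def custom_chunks(blocks):
    """Keep each table as a standalone chunk; merge consecutive non-table blocks."""
    chunks, run = [], []
    for block_type, text in blocks:
        if block_type == "Table":
            # Flush any pending text before emitting the table on its own.
            if run:
                chunks.append("\n".join(run))
                run = []
            chunks.append(text)
        else:
            run.append(text)
    if run:
        chunks.append("\n".join(run))
    return chunks


blocks = [
    ("Text", "Intro."),
    ("Text", "More detail."),
    ("Table", "Q1 | Q2"),
    ("Text", "Closing."),
]
print(custom_chunks(blocks))
# ['Intro.\nMore detail.', 'Q1 | Q2', 'Closing.']
```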

6. Disabled (no chunking)

Returns one large markdown representation of the entire document.
No splitting occurs; the document is treated as a single unit.
Useful when you need to process the entire document as a whole or perform your own custom chunking.
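
If you take the single-document output and chunk it yourself, one common approach is fixed-size windows with overlap. The sketch below is a generic technique, not part of the API; the `sliding_windows` name and its defaults are assumptions for this example.

```python
def sliding_windows(text, size=1000, overlap=200):
    """Split text into windows of up to `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    # Stop once the previous window has already reached the end of the text.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


markdown = "x" * 2500  # stand-in for the full-document markdown
windows = sliding_windows(markdown)
print([len(w) for w in windows])  # [1000, 1000, 900]
```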

Choosing a strategy

Variable Chunking: When consistent chunk sizes are needed or for documents with inconsistent section lengths.
Section-based Chunking: For documents with clear hierarchical organization.
Page-based Chunking: When maintaining page/sheet boundaries is important.
Page Sections Chunking: When both page and section organization need to be preserved.
Block Chunking: When exact structure preservation is necessary.
Disabled: When you need the entire document as a single unit or plan to implement custom chunking.

A chunking strategy that matches your document's structure can noticeably improve results on tasks such as information retrieval, summarization, and content analysis. If you have questions about which strategy fits your use case, we’d be happy to help!

Embedding Optimization

The embedding_optimized option enhances the embed field in each chunk for better performance with embedding models and vector search.

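# Assumes `client` and `upload` were created in an earlier setup step.
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {"chunk_mode": "variable"},
        # Optimize the embed field of each chunk for embedding models.
        "embedding_optimized": True
    }
)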

When enabled:

Tables in the embed field get natural language summaries instead of raw markup
Figure descriptions are optimized for semantic search
Content is restructured for better embedding quality

This option affects only the embed field. The content field retains the full structured output with tables in their original format.

Filter Blocks

Remove specific content types from the content and embed fields while keeping them in the blocks metadata. This is useful for RAG when headers, footers, or page numbers would pollute search results.

result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "filter_blocks": ["Header", "Footer", "Page Number"]
    }
)

Available Block Types

Block Type	Description
`Header`	Document headers (letterhead, running titles)
`Footer`	Document footers
`Page Number`	Page number indicators
`Title`	Document titles
`Section Header`	Section headings
`List Item`	Bulleted or numbered list items
`Figure`	Images and diagrams
`Table`	Data tables
`Key Value`	Label-value pairs
`Text`	Body text paragraphs
`Comment`	Document comments
`Signature`	Signature blocks

Filtered blocks still appear in chunks[].blocks with full metadata, but their content is excluded from the text fields used for embeddings and retrieval.
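
The filtering behavior described above can be modeled in a few lines: filtered block types are left out of the text, while the block list stays complete. The dictionary shape and the `build_chunk` name are hypothetical and only mirror the description; they are not the actual response format.

```python
def build_chunk(blocks, filter_blocks):
    """Model a chunk whose text omits filtered block types but whose metadata keeps them."""
    excluded = set(filter_blocks)
    kept = [text for block_type, text in blocks if block_type not in excluded]
    return {
        "content": "\n".join(kept),  # text used for retrieval
        "blocks": list(blocks),      # all blocks, including filtered ones
    }


blocks = [
    ("Header", "ACME Corp - Confidential"),
    ("Text", "Revenue grew 10% year over year."),
    ("Page Number", "3"),
]
chunk = build_chunk(blocks, ["Header", "Footer", "Page Number"])
print(chunk["content"])      # Revenue grew 10% year over year.
print(len(chunk["blocks"]))  # 3
```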

Configuration Example

Combine chunking with retrieval optimizations:

result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {
            "chunk_mode": "variable",
            "chunk_size": 1000
        },
        "embedding_optimized": True,
        "filter_blocks": ["Header", "Footer", "Page Number"]
    }
)

This configuration:

Splits the document into variable chunks of roughly 1000 characters
Optimizes chunk content for embedding models
Excludes headers, footers, and page numbers from searchable content
