Understanding blocks

  • Blocks are single units of information in a document.
  • Each distinct element (paragraphs, headers, images, titles, tables) is typically a separate block.
  • Chunks are composed of one or more blocks.

Chunking strategies

1. Variable chunking

  • Aims for a target chunk size that you can set; the default target is 1000 characters.
  • Approximates the target size dynamically while preserving the document's structure, using layout, spatial, and semantic information.
  • Can break long sections or merge short ones.
  • Useful for documents with varying section lengths or when consistent chunk sizes are needed.
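
The merge-or-break behavior described above can be sketched with character counts alone. This is a rough illustration, not the service's algorithm: the real strategy also uses layout, spatial, and semantic information, which is not modeled here. The function name `variable_chunks` and the plain-string blocks are hypothetical.

```python
def variable_chunks(blocks, target=1000):
    """Greedily pack block texts into chunks of at most `target` characters.

    Hypothetical illustration: it relies on character counts only and ignores
    the layout, spatial, and semantic signals the real strategy uses.
    """
    chunks, current = [], ""
    for text in blocks:
        # Break a block that is longer than the target into target-sized pieces.
        pieces = [text[i:i + target] for i in range(0, len(text), target)]
        for piece in pieces:
            candidate = f"{current}\n{piece}" if current else piece
            if current and len(candidate) > target:
                chunks.append(current)  # adding this piece would overshoot
                current = piece
            else:
                current = candidate     # merge short blocks together
    if current:
        chunks.append(current)
    return chunks


blocks = ["Intro " * 20, "Short note.", "Body " * 300]
chunks = variable_chunks(blocks, target=1000)
print([len(c) for c in chunks])
```

The two short blocks are merged into one chunk, while the long block is broken into pieces no larger than the target.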

2. Section-based chunking

  • Identifies "title" and "section header" blocks.
  • Groups blocks until a new section header/title is encountered.
  • Breaks documents into coherent subsections.
  • Ideal for documents with clear hierarchical structures.
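
A minimal sketch of the grouping rule above, assuming blocks arrive in reading order and carry a type name. The `Block` class and `section_chunks` function are hypothetical stand-ins for illustration; they are not the service's data model.

```python
from dataclasses import dataclass


@dataclass
class Block:
    type: str  # block type name, e.g. "Title", "Section Header", "Text"
    text: str


SECTION_STARTS = {"Title", "Section Header"}


def section_chunks(blocks):
    """Group blocks so each chunk starts at a title or section header."""
    chunks = []
    for block in blocks:
        # Open a new chunk at every header, and for any leading non-header blocks.
        if block.type in SECTION_STARTS or not chunks:
            chunks.append([])
        chunks[-1].append(block)
    return chunks


doc = [
    Block("Title", "Annual Report"),
    Block("Text", "Summary of the year."),
    Block("Section Header", "Methods"),
    Block("Text", "How the data was collected."),
    Block("Section Header", "Results"),
    Block("Table", "Revenue by quarter"),
]
sections = section_chunks(doc)
for section in sections:
    print([b.text for b in section])
```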

3. Page-based chunking

  • Returns one chunk for each page in a document or sheet in a spreadsheet.
  • Maintains the original page/sheet structure of the source material.
  • Useful for page-specific analysis or when page boundaries are significant.
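
As a rough sketch, page-based chunking amounts to grouping blocks by their page (or sheet) number. The `(page_number, text)` pair format and the `page_chunks` helper are assumptions made for this example.

```python
from collections import defaultdict


def page_chunks(blocks):
    """Group (page_number, text) pairs into one chunk per page.

    Hypothetical helper; the pair format is an assumption for illustration.
    """
    pages = defaultdict(list)
    for page_number, text in blocks:
        pages[page_number].append(text)
    # Emit one chunk per page, in ascending page order.
    return [{"page": p, "content": "\n".join(pages[p])} for p in sorted(pages)]


blocks = [(1, "Quarterly report"), (1, "Revenue grew."), (2, "Appendix A")]
for chunk in page_chunks(blocks):
    print(chunk["page"], repr(chunk["content"]))
```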

4. Page sections chunking

  • Combines page and section-based chunking.
  • First splits the document by pages, then within each page splits by sections.
  • Maintains both page and section boundaries.
  • Ideal for documents where both page and section organization are important.
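
The two-level split can be sketched as follows: group blocks by page first, then start a new chunk at each title or section header within the page. The `Block` class and `page_section_chunks` function are hypothetical, and the sketch assumes blocks are already in reading order.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Block:
    page: int
    type: str  # e.g. "Title", "Section Header", "Text"
    text: str


def page_section_chunks(blocks):
    """Split by page, then by section header within each page."""
    chunks = []
    # groupby only merges consecutive items, so blocks must be in page order.
    for _, page_blocks in groupby(blocks, key=lambda b: b.page):
        current = []
        for block in page_blocks:
            if block.type in {"Title", "Section Header"} and current:
                chunks.append(current)  # a header closes the previous section
                current = []
            current.append(block)
        if current:
            chunks.append(current)      # a page boundary also closes the chunk
    return chunks


doc = [
    Block(1, "Section Header", "A"),
    Block(1, "Text", "a1"),
    Block(2, "Text", "a2 (section A continues)"),
    Block(2, "Section Header", "B"),
    Block(2, "Text", "b1"),
]
result_chunks = page_section_chunks(doc)
print([[b.text for b in chunk] for chunk in result_chunks])
```

In this sketch, section A spans two pages and is therefore split into two chunks, which is the trade-off of keeping both kinds of boundary.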

5. Block chunking

  • Each block becomes a separate chunk.
  • Essentially disables advanced chunking.
  • Useful when you want to implement your own chunking strategy with the layout information.

6. Disabled (no chunking)

  • Returns one large markdown representation of the entire document.
  • No splitting occurs; the document is treated as a single unit.
  • Useful when you need to process the entire document as a whole or perform your own custom chunking.

Choosing a strategy

  • Variable Chunking: When consistent chunk sizes are needed or for documents with inconsistent section lengths.
  • Section-based Chunking: For documents with clear hierarchical organization.
  • Page-based Chunking: When maintaining page/sheet boundaries is important.
  • Page Sections Chunking: When both page and section organization need to be preserved.
  • Block Chunking: When exact structure preservation is necessary.
  • Disabled: When you need the entire document as a single unit or plan to implement custom chunking.

The right chunking strategy can significantly improve performance across tasks like information retrieval, summarization, or content analysis. If you have any questions about our recommendations for your use case, we'd be happy to help!

Embedding Optimization

The embedding_optimized option enhances the embed field in each chunk for better performance with embedding models and vector search.
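# Assumes `client` (the SDK client) and `upload` (an uploaded file) already exist.
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {"chunk_mode": "variable"},
        # Rewrite each chunk's embed field for embedding models and vector search.
        "embedding_optimized": True,
    },
)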
When enabled:
  • Tables in the embed field get natural language summaries instead of raw markup.
  • Figure descriptions are optimized for semantic search.
  • Content is restructured for better embedding quality.
This option affects only the embed field. The content field retains the full structured output with tables in their original format.
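
One way to use the two fields downstream is to embed the embed text and show the content text to readers. The sketch below assumes each chunk exposes `embed` and `content` values as dictionary keys; the sample chunk is hand-written for illustration and is not real API output, and `to_index_records` is a hypothetical helper.

```python
def to_index_records(chunks):
    """Pair the text to embed with the text to display for each chunk."""
    return [
        {"id": i, "embed_text": c["embed"], "display_text": c["content"]}
        for i, c in enumerate(chunks)
    ]


# Hand-written stand-in for parsed chunks (assumed shape, not real output).
sample_chunks = [
    {
        "content": "| Quarter | Revenue |\n|---|---|\n| Q1 | 10 |",
        "embed": "Table of quarterly revenue. Q1 revenue was 10.",
    }
]
records = to_index_records(sample_chunks)
print(records[0]["embed_text"])
```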

Filter Blocks

Remove specific content types from the content and embed fields while keeping them in the blocks metadata. This is useful for RAG when headers, footers, or page numbers would pollute search results.
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        # Block types to drop from the content and embed fields.
        "filter_blocks": ["Header", "Footer", "Page Number"]
    }
)

Available Block Types

Block Type       Description
Header           Document headers (letterhead, running titles)
Footer           Document footers
Page Number      Page number indicators
Title            Document titles
Section Header   Section headings
List Item        Bulleted or numbered list items
Figure           Images and diagrams
Table            Data tables
Key Value        Label-value pairs
Text             Body text paragraphs
Comment          Document comments
Signature        Signature blocks

Filtered blocks still appear in chunks[].blocks with full metadata, but their content is excluded from the text fields used for embeddings and retrieval.
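
The effect of filtering can be imitated locally to make the behavior concrete. This is a sketch under assumed data shapes: the dictionaries and the `apply_filter` helper are hypothetical, and the real filtering happens server-side.

```python
def apply_filter(blocks, filter_blocks):
    """Build chunk text without the filtered block types, keeping all blocks."""
    excluded = set(filter_blocks)
    content = "\n".join(b["content"] for b in blocks if b["type"] not in excluded)
    # Every block, filtered or not, stays available as metadata.
    return {"content": content, "blocks": blocks}


page = [
    {"type": "Header", "content": "ACME Corp Confidential"},
    {"type": "Text", "content": "Revenue grew 12% year over year."},
    {"type": "Page Number", "content": "3"},
]
chunk = apply_filter(page, ["Header", "Footer", "Page Number"])
print(repr(chunk["content"]))
print([b["type"] for b in chunk["blocks"]])
```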

Configuration Example

Combine chunking with retrieval optimizations:
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {
            "chunk_mode": "variable",
            "chunk_size": 1000  # target size in characters
        },
        "embedding_optimized": True,
        "filter_blocks": ["Header", "Footer", "Page Number"]
    }
)
This configuration:
  1. Splits the document into variable-size chunks of roughly 1000 characters.
  2. Optimizes each chunk's embed field for embedding models.
  3. Excludes headers, footers, and page numbers from searchable content.