Understanding blocks
- Blocks are single units of information in a document.
- Each distinct element (paragraphs, headers, images, titles, tables) is typically a separate block.
- Chunks are composed of one or more blocks.
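The relationship between blocks and chunks can be pictured with a small data model. This is a minimal sketch, not the service's actual schema: the `Block` and `Chunk` classes and their field names are hypothetical, and only the block type names come from this page.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One unit of document content (hypothetical shape, for illustration)."""
    type: str      # e.g. "Title", "Section Header", "Text", "Table"
    content: str
    page: int = 1


@dataclass
class Chunk:
    """A chunk is an ordered group of one or more blocks."""
    blocks: list[Block] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Join block text with blank lines to form the chunk text.
        return "\n\n".join(b.content for b in self.blocks)


chunk = Chunk(blocks=[
    Block("Section Header", "Results"),
    Block("Text", "Accuracy improved on all benchmarks."),
])
print(chunk.content)
```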
Chunking strategies
1. Variable chunking
- Aims for a target chunk size that you provide; the default target is 1000 characters.
- Dynamically approximates target size while maintaining layout structure using layout, spatial, and semantic information.
- Can break long sections or merge short ones.
- Useful for documents with varying section lengths or when consistent chunk sizes are needed.
2. Section-based chunking
- Identifies "title" and "section header" blocks.
- Groups blocks until a new section header/title is encountered.
- Breaks documents into coherent subsections.
- Ideal for documents with clear hierarchical structures.
3. Page-based chunking
- Returns one chunk for each page in a document or sheet in a spreadsheet.
- Maintains the original page/sheet structure of the source material.
- Useful for page-specific analysis or when page boundaries are significant.
4. Page sections chunking
- Combines page and section-based chunking.
- First splits the document by pages, then within each page splits by sections.
- Maintains both page and section boundaries.
- Ideal for documents where both page and section organization are important.
5. Block chunking
- Each block becomes a separate chunk.
- Essentially disables advanced chunking.
- Useful when you want to implement your own chunking strategy with the layout information.
6. Disabled (no chunking)
- Returns one large markdown representation of the entire document.
- No splitting occurs; the document is treated as a single unit.
- Useful when you need to process the entire document as a whole or perform your own custom chunking.
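To make the differences between the strategies concrete, here is a rough sketch of how section-based, page-based, page sections, and variable chunking could be approximated over a list of blocks. It is an illustration under stated assumptions, not the service's algorithm: the `Block` class and all function names are hypothetical, and the variable strategy here is a plain greedy merge on character counts. It never splits a single block and ignores the layout, spatial, and semantic signals described above.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Block:
    type: str
    content: str
    page: int


HEADINGS = {"Title", "Section Header"}


def chunk_by_section(blocks):
    """Start a new chunk whenever a title or section header appears."""
    chunks = []
    for block in blocks:
        if block.type in HEADINGS or not chunks:
            chunks.append([])
        chunks[-1].append(block)
    return chunks


def chunk_by_page(blocks):
    """One chunk per page, assuming blocks are already in page order."""
    return [list(group) for _, group in groupby(blocks, key=lambda b: b.page)]


def chunk_by_page_sections(blocks):
    """Split by page first, then by section within each page."""
    return [section
            for page in chunk_by_page(blocks)
            for section in chunk_by_section(page)]


def chunk_variable(blocks, target=1000):
    """Greedy merge: close the chunk once adding a block would pass the target."""
    chunks, current, size = [], [], 0
    for block in blocks:
        if current and size + len(block.content) > target:
            chunks.append(current)
            current, size = [], 0
        current.append(block)
        size += len(block.content)
    if current:
        chunks.append(current)
    return chunks


# Made-up sample document spanning two pages.
blocks = [
    Block("Title", "Annual Report", 1),
    Block("Text", "a" * 400, 1),
    Block("Section Header", "Revenue", 1),
    Block("Text", "b" * 700, 2),
    Block("Section Header", "Outlook", 2),
    Block("Text", "c" * 300, 2),
]

print(len(chunk_by_section(blocks)))        # section-based
print(len(chunk_by_page(blocks)))           # page-based
print(len(chunk_by_page_sections(blocks)))  # page sections
print([len(c) for c in chunk_variable(blocks)])
```

In this sample the "Revenue" section runs across the page break, so page sections chunking yields one more chunk than section-based chunking. Block chunking would correspond to `[[b] for b in blocks]`, and disabled chunking to a single chunk holding every block.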
Choosing a strategy
- Variable Chunking: When consistent chunk sizes are needed or for documents with inconsistent section lengths.
- Section-based Chunking: For documents with clear hierarchical organization.
- Page-based Chunking: When maintaining page/sheet boundaries is important.
- Page Sections Chunking: When both page and section organization need to be preserved.
- Block Chunking: When exact structure preservation is necessary.
- Disabled: When you need the entire document as a single unit or plan to implement custom chunking.
Embedding Optimization
The embedding_optimized option enhances the embed field in each chunk for better performance with embedding models and vector search.
- Tables in the embed field get natural language summaries instead of raw markup
- Figure descriptions are optimized for semantic search
- Content is restructured for better embedding quality
This option affects only the embed field. The content field retains the full structured output with tables in their original format.
Filter Blocks
Remove specific content types from the content and embed fields while keeping them in the blocks metadata. This is useful for RAG when headers, footers, or page numbers would pollute search results.
Available Block Types
| Block Type | Description |
|---|---|
| Header | Document headers (letterhead, running titles) |
| Footer | Document footers |
| Page Number | Page number indicators |
| Title | Document titles |
| Section Header | Section headings |
| List Item | Bulleted or numbered list items |
| Figure | Images and diagrams |
| Table | Data tables |
| Key Value | Label-value pairs |
| Text | Body text paragraphs |
| Comment | Document comments |
| Signature | Signature blocks |
Filtered blocks remain in chunks[].blocks with full metadata, but their content is excluded from the text fields used for embeddings and retrieval.
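The filtering behavior can be sketched in a few lines. This is an illustrative approximation, not the service's implementation: `render_chunk` and the dictionary layout are hypothetical, the sample blocks are made up, and `embed` is simply set equal to `content` here, although in practice the two fields can differ.

```python
def render_chunk(blocks, filter_blocks=()):
    """Build text fields without the filtered types; keep every block in metadata."""
    excluded = set(filter_blocks)
    text = "\n\n".join(b["content"] for b in blocks if b["type"] not in excluded)
    # Hypothetical output shape: text fields are filtered, blocks are not.
    return {"content": text, "embed": text, "blocks": list(blocks)}


# Made-up sample blocks for one page.
blocks = [
    {"type": "Header", "content": "ACME Corp Confidential"},
    {"type": "Text", "content": "Revenue grew 12%."},
    {"type": "Page Number", "content": "3"},
]

chunk = render_chunk(blocks, filter_blocks=["Header", "Footer", "Page Number"])
print(chunk["content"])
print(len(chunk["blocks"]))
```

The header and page number no longer appear in the searchable text, but all three blocks are still present in the metadata for downstream use.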
Configuration Example
Combine chunking with retrieval optimizations in a single configuration that:
- Splits the document into ~1000-character variable chunks
- Optimizes chunk content for embedding models
- Excludes headers, footers, and page numbers from searchable content
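A configuration with these three properties might be expressed roughly as follows. Only `embedding_optimized` is named on this page; the `chunking`, `chunk_mode`, `chunk_size`, and `filter_blocks` keys and their nesting are assumptions made for illustration, so check the API reference for the real option names before using them.

```python
import json

# Hypothetical option layout: only "embedding_optimized" appears in the text above.
config = {
    "chunking": {
        "chunk_mode": "variable",  # assumed name for the variable strategy
        "chunk_size": 1000,        # target size in characters
    },
    "embedding_optimized": True,   # optimize the embed field for vector search
    "filter_blocks": ["Header", "Footer", "Page Number"],  # assumed key name
}

print(json.dumps(config, indent=2))
```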