Chunking Methods
Understanding Blocks
- Blocks are single units of information in a document.
- Each distinct element (paragraphs, headers, images, titles, tables) is typically a separate block.
- Chunks are composed of one or more blocks.
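The block/chunk relationship can be pictured with a small data model. This is a minimal sketch only: the class names, field names, and block type strings below are assumptions made for illustration, not the actual response schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """One unit of information in a document (hypothetical schema)."""
    type: str      # e.g. "title", "section header", "text", "table", "image"
    content: str
    page: int = 1


@dataclass
class Chunk:
    """A chunk is composed of one or more blocks."""
    blocks: List[Block] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Join the text of the member blocks, separated by blank lines.
        return "\n\n".join(b.content for b in self.blocks)


doc_blocks = [
    Block("title", "Annual Report"),
    Block("text", "Revenue grew this year."),
    Block("table", "| Q1 | Q2 |"),
]

# Group the first two blocks into one chunk.
chunk = Chunk(blocks=doc_blocks[:2])
print(chunk.content)
```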
Chunking Strategies
1. Variable Chunking
- Aims for a target chunk size that you can specify; the default target is 1000 characters.
- Approximates the target size dynamically while preserving document structure, using layout, spatial, and semantic information.
- Can break long sections or merge short ones.
- Useful for documents with varying section lengths or when consistent chunk sizes are needed.
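The merge-and-break behavior can be approximated with a greedy packer. This is a deliberately simplified sketch: it looks only at character counts and ignores the layout, spatial, and semantic signals described above, and the function name `variable_chunks` is hypothetical.

```python
from typing import List


def variable_chunks(blocks: List[str], target: int = 1000) -> List[str]:
    """Greedily pack block texts into chunks of at most `target` characters."""
    chunks: List[str] = []
    current = ""
    for text in blocks:
        if not text:
            continue
        # Break a block that is longer than the target into target-sized pieces.
        pieces = [text[i:i + target] for i in range(0, len(text), target)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if current and len(candidate) > target:
                # Adding this piece would overshoot: close the current chunk.
                chunks.append(current)
                current = piece
            else:
                # Still within the target: merge into the current chunk.
                current = candidate
    if current:
        chunks.append(current)
    return chunks


# Two short blocks get merged; one long block gets broken up.
sample = ["a" * 20, "b" * 20, "c" * 120]
sizes = [len(c) for c in variable_chunks(sample, target=50)]
print(sizes)
```

A real implementation would prefer to break at sentence or layout boundaries rather than at a fixed character offset.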
2. Section-based Chunking
- Identifies “title” and “section header” blocks.
- Groups blocks until a new section header/title is encountered.
- Breaks documents into coherent subsections.
- Ideal for documents with clear hierarchical structures.
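The grouping rule above can be sketched in a few lines. The `(block_type, content)` tuple shape and the lowercase type strings are assumptions for illustration; the real block representation may differ.

```python
from typing import List, Tuple

Block = Tuple[str, str]  # (block_type, content); a hypothetical shape

HEADER_TYPES = {"title", "section header"}


def section_chunks(blocks: List[Block]) -> List[List[Block]]:
    """Group blocks so that each title or section header opens a new chunk."""
    chunks: List[List[Block]] = []
    for block in blocks:
        block_type, _ = block
        # A header starts a new chunk; so does the very first block.
        if block_type in HEADER_TYPES or not chunks:
            chunks.append([])
        chunks[-1].append(block)
    return chunks


blocks = [
    ("title", "User Guide"),
    ("text", "Welcome."),
    ("section header", "Installation"),
    ("text", "Run the installer."),
    ("table", "OS | Version"),
    ("section header", "Usage"),
    ("text", "Open the app."),
]
sections = section_chunks(blocks)
for section in sections:
    print([content for _, content in section])
```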
3. Page-based Chunking
- Returns one chunk per page of a document, or one per sheet of a spreadsheet.
- Maintains the original page/sheet structure of the source material.
- Useful for page-specific analysis or when page boundaries are significant.
4. Page Sections Chunking
- Combines page and section-based chunking.
- First splits the document by pages, then within each page splits by sections.
- Maintains both page and section boundaries.
- Ideal for documents where both page and section organization are important.
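The two-step split can be sketched as follows: group by page first, then apply the section rule inside each page. The `Block` fields and type strings are hypothetical, and the blocks are assumed to arrive in reading order.

```python
from itertools import groupby
from typing import List, NamedTuple


class Block(NamedTuple):
    page: int
    type: str
    content: str


HEADER_TYPES = {"title", "section header"}


def page_section_chunks(blocks: List[Block]) -> List[List[Block]]:
    """Split by page, then by section header within each page."""
    chunks: List[List[Block]] = []
    # Step 1: split by page (consecutive blocks sharing a page number).
    for _, page_blocks in groupby(blocks, key=lambda b: b.page):
        # Step 2: within the page, start a new chunk at each header.
        page_chunks: List[List[Block]] = []
        for block in page_blocks:
            if block.type in HEADER_TYPES or not page_chunks:
                page_chunks.append([])
            page_chunks[-1].append(block)
        chunks.extend(page_chunks)
    return chunks


blocks = [
    Block(1, "section header", "Intro"),
    Block(1, "text", "First paragraph."),
    Block(2, "text", "Intro, continued."),
    Block(2, "section header", "Methods"),
    Block(2, "text", "We measured."),
]
result = page_section_chunks(blocks)
for chunk in result:
    print([(b.page, b.content) for b in chunk])
```

Note that the section spilling from page 1 onto page 2 ends up in two chunks, which is the trade-off of keeping page boundaries intact.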
5. Block Chunking
- Each block becomes a separate chunk.
- Essentially disables advanced chunking.
- Useful when you want to implement your own chunking strategy with the layout information.
6. Disabled (No Chunking)
- Returns one large markdown representation of the entire document.
- No splitting occurs; the document is treated as a single unit.
- Useful when you need to process the entire document as a whole or perform your own custom chunking.
Choosing a Strategy
- Variable Chunking: When consistent chunk sizes are needed or for documents with inconsistent section lengths.
- Section-based Chunking: For documents with clear hierarchical organization.
- Page-based Chunking: When maintaining page/sheet boundaries is important.
- Page Sections Chunking: When both page and section organization need to be preserved.
- Block Chunking: When you need block-level granularity, for example to build your own chunking on top of the layout information.
- Disabled: When you need the entire document as a single unit or plan to implement custom chunking.
The right chunking strategy can significantly improve performance across tasks like information retrieval, summarization, or content analysis. If you have any questions about our recommendations for your use case, we’d be happy to help!