Chunking Methods
Understanding Blocks
- Blocks are single units of information in a document.
- Each distinct element (paragraphs, headers, images, titles, tables) is typically a separate block.
- Chunks are composed of one or more blocks.
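To make the block/chunk relationship concrete, here is a minimal sketch of the two concepts as data structures. The class names and fields (`Block`, `Chunk`, `block_type`, `content`, `page`) are assumptions made for illustration and are not the actual response schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """One unit of document content (hypothetical schema, for illustration)."""
    block_type: str  # e.g. "title", "section header", "paragraph", "table", "image"
    content: str
    page: int = 1


@dataclass
class Chunk:
    """An ordered group of one or more blocks."""
    blocks: List[Block] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Join block contents with blank lines, in document order.
        return "\n\n".join(b.content for b in self.blocks)


chunk = Chunk([
    Block("title", "Quarterly Report"),
    Block("paragraph", "Revenue grew 12%."),
])
```

Every strategy described below can be read as a different rule for deciding which consecutive blocks end up in the same `Chunk`.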
Chunking Strategies
1. Variable Chunking
- Aims for a target chunk size that you provide; the default target is 1000 characters.
- Dynamically approximates the target size while preserving the document's structure, using layout, spatial, and semantic information.
- Can break long sections or merge short ones.
- Useful for documents with varying section lengths or when consistent chunk sizes are needed.
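The break-and-merge behavior can be approximated with a simple greedy pass. The sketch below is an assumption-laden illustration and not the actual algorithm: it looks only at character counts, treats blocks as plain strings, and ignores the layout, spatial, and semantic signals mentioned above. The function name `variable_chunks` is hypothetical.

```python
from typing import List


def variable_chunks(blocks: List[str], target: int = 1000) -> List[str]:
    """Greedy, length-only approximation of variable chunking (illustrative)."""
    # Step 1: break any block longer than the target on whitespace.
    # Note: this normalizes whitespace inside a block, and a single word
    # longer than the target is kept whole.
    pieces: List[str] = []
    for block in blocks:
        current = ""
        for word in block.split():
            candidate = f"{current} {word}".strip()
            if current and len(candidate) > target:
                pieces.append(current)
                current = word
            else:
                current = candidate
        if current:
            pieces.append(current)

    # Step 2: merge neighbouring pieces while the result stays within the target.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current}\n\n{piece}" if current else piece
        if current and len(candidate) > target:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


example = variable_chunks(["a" * 10, "b" * 10, "word " * 50], target=30)
```

In this example the two short blocks are merged into one chunk, while the long block is split into several chunks that each stay within the 30-character target.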
2. Section-based Chunking
- Identifies "title" and "section header" blocks.
- Groups blocks until a new section header/title is encountered.
- Breaks documents into coherent subsections.
- Ideal for documents with clear hierarchical structures.
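The grouping rule above can be sketched in a few lines. This is a minimal illustration that assumes blocks arrive as `(block_type, content)` tuples in reading order; that shape and the name `section_chunks` are hypothetical.

```python
from typing import List, Tuple

Block = Tuple[str, str]  # (block_type, content); hypothetical shape

HEADER_TYPES = {"title", "section header"}


def section_chunks(blocks: List[Block]) -> List[List[Block]]:
    """Group blocks into sections, starting a new one at each header or title."""
    chunks: List[List[Block]] = []
    current: List[Block] = []
    for block in blocks:
        block_type, _ = block
        # A header or title closes the running section (if any) and opens a new one.
        if block_type in HEADER_TYPES and current:
            chunks.append(current)
            current = []
        current.append(block)
    if current:
        chunks.append(current)
    return chunks


sections = section_chunks([
    ("title", "Doc"),
    ("paragraph", "Intro"),
    ("section header", "A"),
    ("paragraph", "a1"),
    ("paragraph", "a2"),
])
```

Here the title and its introduction form the first chunk, and section "A" with its two paragraphs forms the second.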
3. Page-based Chunking
- Returns one chunk for each page in a document or sheet in a spreadsheet.
- Maintains the original page/sheet structure of the source material.
- Useful for page-specific analysis or when page boundaries are significant.
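Conceptually, page-based chunking is a group-by on the page (or sheet) a block came from. The sketch below assumes each block carries a page number as `(page, content)`; that shape and the name `page_chunks` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def page_chunks(blocks: List[Tuple[int, str]]) -> List[str]:
    """Return one chunk per page, ordered by page number (illustrative)."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for page, content in blocks:
        pages[page].append(content)
    # Blocks keep their input order within a page; pages are sorted numerically.
    return ["\n\n".join(pages[p]) for p in sorted(pages)]


per_page = page_chunks([(2, "c"), (1, "a"), (1, "b")])
```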
4. Block Chunking
- Each block becomes a separate chunk.
- Essentially disables advanced chunking.
- Useful when you want to implement your own chunking strategy with the layout information.
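As one example of a custom strategy built on block-level output, the sketch below keeps every table together with the block that precedes it, so a table is never separated from the sentence that introduces it. The input shape `(block_type, content)` and the function name are assumptions for illustration only.

```python
from typing import List, Tuple

Block = Tuple[str, str]  # (block_type, content); hypothetical shape


def attach_tables_to_context(blocks: List[Block]) -> List[str]:
    """Custom rule: merge each table into the chunk of the block before it."""
    chunks: List[str] = []
    for block_type, content in blocks:
        if block_type == "table" and chunks:
            # Append the table to the previous chunk instead of starting a new one.
            chunks[-1] = f"{chunks[-1]}\n\n{content}"
        else:
            chunks.append(content)
    return chunks


merged = attach_tables_to_context([
    ("paragraph", "Table 1 shows sales."),
    ("table", "| q | sales |"),
    ("paragraph", "Next."),
])
```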
5. Disabled (No Chunking)
- Returns one large markdown representation of the entire document.
- No splitting occurs; the document is treated as a single unit.
- Useful when you need to process the entire document as a whole or perform your own custom chunking.
Choosing a Strategy
- Variable Chunking: When consistent chunk sizes are needed or for documents with inconsistent section lengths.
- Section-based Chunking: For documents with clear hierarchical organization.
- Page-based Chunking: When maintaining page/sheet boundaries is important.
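- Block Chunking: When you need every block as its own chunk, for example to preserve the exact structure or to build your own chunking strategy on the layout information.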
- Disabled: When you need the entire document as a single unit or plan to implement custom chunking.
The right chunking strategy can significantly improve performance across tasks like information retrieval, summarization, or content analysis. If you have any questions about our recommendations for your use case, we'd be happy to help!