Understanding blocks
- Blocks are single units of information in a document.
- Each distinct element (paragraphs, headers, images, titles, tables) is typically a separate block.
- Chunks are composed of one or more blocks.
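The relationship between blocks and chunks can be pictured with a small data model. This is only an illustrative sketch: the names `Block`, `Chunk`, and their fields are hypothetical and are not taken from any actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One unit of information in a document (hypothetical model)."""
    type: str      # e.g. "title", "section header", "paragraph", "table", "image"
    text: str
    page: int = 1  # page (or sheet) the block came from


@dataclass
class Chunk:
    """A chunk is composed of one or more blocks (hypothetical model)."""
    blocks: list[Block] = field(default_factory=list)

    @property
    def content(self) -> str:
        # Join the text of every block in reading order.
        return "\n\n".join(b.text for b in self.blocks)


chunk = Chunk([
    Block("title", "Quarterly Report"),
    Block("paragraph", "Revenue grew in Q3."),
])
print(chunk.content)
```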
Chunking strategies
1. Variable chunking
- Aims for a target chunk size that you can specify; the default target is 1000 characters.
- Approximates the target size dynamically while preserving document structure, using layout, spatial, and semantic information.
- Can break long sections or merge short ones.
- Useful for documents with varying section lengths or when consistent chunk sizes are needed.
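The merge-and-split behavior can be approximated with a greedy pass over block texts. This is a deliberately simplified sketch, not the actual algorithm: it looks only at character counts and ignores the layout, spatial, and semantic signals described above. The function name `variable_chunks` is hypothetical.

```python
def variable_chunks(blocks: list[str], target: int = 1000) -> list[str]:
    """Greedy size-based grouping: merge short blocks, split long ones."""
    chunks: list[str] = []
    current = ""
    for text in blocks:
        # A block longer than the target is cut into target-sized pieces.
        pieces = [text[i:i + target] for i in range(0, len(text), target)]
        for piece in pieces:
            candidate = f"{current}\n{piece}" if current else piece
            if current and len(candidate) > target:
                # Adding this piece would overshoot: close the current chunk.
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks


blocks = ["a" * 400, "b" * 400, "c" * 400, "d" * 2500]
chunks = variable_chunks(blocks, target=1000)
print([len(c) for c in chunks])
```

In this sketch the first two short blocks are merged, while the 2500-character block is broken into several pieces so that no chunk exceeds the target.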
2. Section-based chunking
- Identifies “title” and “section header” blocks.
- Groups blocks until a new section header/title is encountered.
- Breaks documents into coherent subsections.
- Ideal for documents with clear hierarchical structures.
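The grouping rule above can be sketched in a few lines. This is an illustrative sketch under assumptions: blocks are represented as `(type, text)` tuples, the type strings `"title"` and `"section header"` are assumed labels, and `section_chunks` is a hypothetical name.

```python
HEADER_TYPES = {"title", "section header"}


def section_chunks(blocks: list[tuple[str, str]]) -> list[list[tuple[str, str]]]:
    """Group blocks until a new title or section header is encountered."""
    chunks: list[list[tuple[str, str]]] = []
    current: list[tuple[str, str]] = []
    for block_type, text in blocks:
        # A header starts a new chunk, unless nothing has been collected yet.
        if block_type in HEADER_TYPES and current:
            chunks.append(current)
            current = []
        current.append((block_type, text))
    if current:
        chunks.append(current)
    return chunks


blocks = [
    ("title", "Report"),
    ("paragraph", "Intro"),
    ("section header", "Methods"),
    ("paragraph", "We did X"),
    ("table", "| a | b |"),
]
sections = section_chunks(blocks)
print(len(sections))
```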
3. Page-based chunking
- Returns one chunk per page of a document, or per sheet of a spreadsheet.
- Maintains the original page/sheet structure of the source material.
- Useful for page-specific analysis or when page boundaries are significant.
4. Page sections chunking
- Combines page and section-based chunking.
- First splits the document by pages, then within each page splits by sections.
- Maintains both page and section boundaries.
- Ideal for documents where both page and section organization are important.
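The two-step split (pages first, then sections within each page) might look like the following. This is a sketch under assumptions: blocks are `(page, type, text)` tuples in reading order, the header type labels are assumed, and the function names are hypothetical. Dropping the inner section split would give plain page-based chunking.

```python
from itertools import groupby

HEADER_TYPES = {"title", "section header"}


def split_by_section(blocks):
    """Split a list of (page, type, text) blocks at each header block."""
    chunks, current = [], []
    for block in blocks:
        if block[1] in HEADER_TYPES and current:
            chunks.append(current)
            current = []
        current.append(block)
    if current:
        chunks.append(current)
    return chunks


def page_section_chunks(blocks):
    """First split by page, then split each page by section."""
    chunks = []
    # groupby relies on blocks being in reading order, so pages are contiguous.
    for _, page_blocks in groupby(blocks, key=lambda b: b[0]):
        chunks.extend(split_by_section(list(page_blocks)))
    return chunks


blocks = [
    (1, "section header", "A"),
    (1, "paragraph", "a1"),
    (2, "paragraph", "a2"),   # section A continues on page 2
    (2, "section header", "B"),
    (2, "paragraph", "b1"),
]
result = page_section_chunks(blocks)
print([[text for _, _, text in chunk] for chunk in result])
```

Note how the continuation of section A on page 2 becomes its own chunk, because the page boundary is respected before the section boundary.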
5. Block chunking
- Each block becomes a separate chunk.
- Essentially disables advanced chunking.
- Useful when you want to implement your own chunking strategy with the layout information.
6. Disabled (no chunking)
- Returns one large markdown representation of the entire document.
- No splitting occurs; the document is treated as a single unit.
- Useful when you need to process the entire document as a whole or perform your own custom chunking.
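As one example of custom chunking on top of the single markdown output, the document could be split at its headings. This is a minimal sketch, assuming the markdown uses ATX-style `#` headings; `split_markdown_by_heading` is a hypothetical helper, not part of any API.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[str]:
    """Split a markdown string into pieces, one per heading."""
    # Split just before every line that starts with 1-6 '#' and a space.
    parts = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


doc = "# Title\nIntro text.\n\n## Methods\nDetails.\n\n## Results\nNumbers.\n"
pieces = split_markdown_by_heading(doc)
print(len(pieces))
```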
Choosing a strategy
- Variable Chunking: When consistent chunk sizes are needed or for documents with inconsistent section lengths.
- Section-based Chunking: For documents with clear hierarchical organization.
- Page-based Chunking: When maintaining page/sheet boundaries is important.
- Page Sections Chunking: When both page and section organization need to be preserved.
- Block Chunking: When you need the block-level structure preserved exactly, for example to build your own chunking strategy on the layout information.
- Disabled: When you need the entire document as a single unit or plan to implement custom chunking.