Basic Usage
Chunking Modes
- variable
- section
- page
- page_sections
- block
- disabled
Best for: RAG, semantic search
Groups blocks to target a specific character count while keeping semantically related content together. This is the recommended mode for most RAG applications.
The algorithm:
- Groups blocks by document structure (new group at each title/section header)
- Splits oversized groups at natural boundaries (points where blocks are physically separated on the page)
- Applies adjacency rules: keeps headers with content, figures with captions, list items together
- Merges undersized groups to reach the target range
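The grouping and merging steps can be illustrated with a rough sketch. This is not the service's implementation: the Block class, the function names, and the merge rule are hypothetical. The splitting and adjacency steps are omitted, and a ±25% tolerance around a 1000-character target is assumed.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block (not the real response type)."""
    type: str  # e.g. "Section Header", "Text", "Table"
    text: str


def size_range(chunk_size: int = 1000, tolerance: float = 0.25) -> tuple[int, int]:
    """Return the (min, max) character range around the target size."""
    return int(chunk_size * (1 - tolerance)), int(chunk_size * (1 + tolerance))


def group_by_structure(blocks: list[Block]) -> list[list[Block]]:
    """Start a new group at each title or section header."""
    groups: list[list[Block]] = []
    for block in blocks:
        if not groups or block.type in {"Title", "Section Header"}:
            groups.append([])
        groups[-1].append(block)
    return groups


def group_length(group: list[Block]) -> int:
    """Total number of characters in a group."""
    return sum(len(block.text) for block in group)


def merge_undersized(groups: list[list[Block]], chunk_size: int = 1000) -> list[list[Block]]:
    """Fold a group into the previous one while the previous one is below
    the minimum and the combined size stays within the maximum."""
    low, high = size_range(chunk_size)
    merged: list[list[Block]] = []
    for group in groups:
        if (
            merged
            and group_length(merged[-1]) < low
            and group_length(merged[-1]) + group_length(group) <= high
        ):
            merged[-1].extend(group)
        else:
            merged.append(list(group))
    return merged
```

Calling merge_undersized(group_by_structure(blocks)) on a list of Block objects yields groups that are closer to the target range. A real implementation would also split oversized groups and keep figures with their captions.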
With chunk_size: 1000, chunks will range from 750 to 1250 characters (±25%). Without a size specified, the default range is 750-1250.
Optimizing for Embeddings
Tables often embed poorly because their structure doesn’t translate well to vector representations. Enable embedding_optimized to generate natural language summaries of tables. Each chunk then has two text fields:
- content: Original format (tables as HTML/markdown)
- embed: Optimized for embeddings (tables converted to summaries like “This table shows quarterly revenue by region…”)
Use embed for vector search, content for display.
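As a minimal sketch of that split, the helper below prepares one record per chunk, using embed for the text sent to the embedding model and content for what the user sees. It assumes chunks arrive as plain dictionaries with content and embed keys. The function name and the record layout are hypothetical, and the fallback to content when embed is missing is an assumption rather than documented behavior.

```python
def texts_for_index(chunks: list[dict]) -> list[dict]:
    """Build index records: embed text for vectors, content for display."""
    records = []
    for i, chunk in enumerate(chunks):
        content = chunk.get("content", "")
        records.append(
            {
                "id": i,
                # Assumption: fall back to content if no embed text is present.
                "embed_text": chunk.get("embed") or content,
                "display_text": content,
            }
        )
    return records
```

Each record's embed_text would go to the embedding model, while display_text is stored alongside the vector and returned to the user at query time.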
Filtering Block Types
Some content types (headers, footers, page numbers) add noise to search results and can be filtered out.
Filtered blocks still appear in chunks[].blocks with full metadata, but they’re excluded from the content and embed text fields.
Available types: Header, Footer, Title, Section Header, Page Number, List Item, Figure, Table, Key Value, Text, Comment, Signature
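To make the effect of filtering concrete, here is a client-side sketch that rebuilds a chunk's text while skipping noisy block types. It only illustrates the behavior described above and is not the service's filtering option. The block keys type and content, and the function name, are assumptions.

```python
# Block types treated as noise in this illustration.
NOISE_TYPES = frozenset({"Header", "Footer", "Page Number"})


def text_without_noise(blocks: list[dict], excluded: frozenset = NOISE_TYPES) -> str:
    """Join block text, skipping blocks whose type is excluded.

    The blocks list itself is left untouched, mirroring how filtered
    blocks stay available with their metadata.
    """
    return "\n".join(
        block["content"] for block in blocks if block["type"] not in excluded
    )
```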
Response Structure
Each chunk in the response contains the content and embed text fields along with a blocks array. The blocks array gives you access to individual elements with their bounding boxes and types, useful for citations or highlighting source locations. See Parse Response Format for complete field documentation.
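A small sketch of collecting source locations for citations follows. The key names type and bbox are assumptions made for illustration, and the bounding box is passed through as an opaque value; check the Parse Response Format documentation for the actual field names.

```python
def source_locations(chunk: dict) -> list[dict]:
    """Collect the type and bounding box of every block in a chunk.

    Assumes each block is a dict with hypothetical "type" and "bbox" keys;
    blocks without a bounding box are skipped.
    """
    locations = []
    for block in chunk.get("blocks", []):
        bbox = block.get("bbox")
        if bbox is not None:
            locations.append({"type": block.get("type"), "bbox": bbox})
    return locations
```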
Troubleshooting
Chunks are too large
Reduce chunk_size or switch to block mode and implement your own chunking logic on the individual blocks.
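If you go the block-mode route, custom chunking can be as simple as a greedy packer with a hard size cap. The sketch below works on plain block texts. The function name and the 800-character default are hypothetical, and a single block longer than the cap is kept as its own chunk rather than split.

```python
def pack_blocks(texts: list[str], max_chars: int = 800) -> list[str]:
    """Greedily pack consecutive block texts into chunks of at most max_chars.

    A block that alone exceeds max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for text in texts:
        candidate = f"{current}\n{text}" if current else text
        if current and len(candidate) > max_chars:
            # Adding this block would exceed the cap: close the current chunk.
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```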
Chunks lack sufficient context
Increase chunk_size or use section mode if your document has well-defined sections.
Tables are split across chunks
Increase chunk_size to accommodate full tables. Alternatively, enable formatting.merge_tables to combine consecutive tables with the same column structure before chunking.
Retrieval quality is poor for tables
Enable embedding_optimized: True to generate natural language summaries of tables for the embed field.