Setting Table Format
Available Formats
- dynamic
- html
- md
- json
- jsonbbox
- csv
dynamic is the default. It automatically chooses between Markdown and HTML based on table complexity. It is best for RAG pipelines where you want clean, readable output for simple tables while preserving structure for complex ones.
- Uses markdown for simple tables (30 cells or fewer AND 4 merged cells or fewer)
- Uses HTML for complex tables (more than 30 cells OR more than 4 merged cells)
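The selection rule above can be written as a small predicate. This is a minimal sketch of the stated thresholds only; the function name and its parameters are illustrative and are not part of Reducto's API.

```python
def choose_table_format(cell_count: int, merged_cell_count: int) -> str:
    """Return "md" for simple tables and "html" for complex ones.

    Thresholds follow the rule described above: a table is simple only
    when it has 30 cells or fewer AND 4 merged cells or fewer.
    """
    is_simple = cell_count <= 30 and merged_cell_count <= 4
    return "md" if is_simple else "html"


# A 5x6 table with no merged cells sits exactly on the cell threshold.
boundary_case = choose_table_format(cell_count=30, merged_cell_count=0)
# One extra cell, or a fifth merged cell, tips the table over to HTML.
too_many_cells = choose_table_format(cell_count=31, merged_cell_count=0)
too_many_merges = choose_table_format(cell_count=12, merged_cell_count=5)
```

Note that the two conditions are combined with AND for markdown, so exceeding either threshold alone is enough to switch the output to HTML.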
Additional Options
Merge Tables
When a logical table spans multiple pages, Reducto may detect it as separate tables. Enable merge_tables to combine consecutive tables with the same column count. When enabled, Reducto:
- Identifies consecutive tables with identical column counts
- Uses a language model to determine if the second table is a continuation of the first
- Combines them into a single table, removing duplicate headers
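To make the three steps concrete, here is a rough sketch of the same idea in plain Python. It is not Reducto's implementation: the table representation (rows of strings with the header first), the function name, and the is_continuation callback (which stands in for the language model check) are all assumptions made for illustration.

```python
from typing import Callable, List

# Assumed representation: a table is a list of rows, first row is the header.
Table = List[List[str]]


def merge_consecutive_tables(
    tables: List[Table],
    is_continuation: Callable[[Table, Table], bool],
) -> List[Table]:
    """Merge consecutive tables that share a column count.

    is_continuation is a hypothetical stand-in for the language model
    decision on whether the second table continues the first.
    """
    merged: List[Table] = []
    for table in tables:
        if not table:
            continue  # nothing to merge for an empty table
        previous = merged[-1] if merged else None
        same_columns = previous is not None and len(previous[0]) == len(table[0])
        if same_columns and is_continuation(previous, table):
            # Drop the header row when the continuation repeats it.
            rows = table[1:] if table[0] == previous[0] else table
            previous.extend(rows)
        else:
            # Copy rows so the caller's input tables are not modified.
            merged.append([row[:] for row in table])
    return merged


page_1 = [["Item", "Qty"], ["Bolts", "40"]]
page_2 = [["Item", "Qty"], ["Nuts", "55"]]
combined = merge_consecutive_tables([page_1, page_2], lambda first, second: True)
```

The column count check runs first because it is cheap; the more expensive continuation check only runs for candidates that pass it.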
Add Page Markers
Inserts page boundary indicators into the content.
Include Additional Metadata
The formatting group also supports extracting comments, highlights, change tracking, hyperlinks, and signatures. See Additional Document Data for details.
Choosing the Right Format
For LLM context (RAG, summarization):
- Use dynamic (default). It balances readability with structure preservation.
- Markdown is easier for LLMs to parse than HTML for simple tables.
- Complex tables benefit from HTML to preserve relationships between cells.

For programmatic processing:
- Use json when you need to iterate over rows and cells.
- Use jsonbbox when cell positions matter (highlighting, overlays).
- Use csv for direct import into pandas, spreadsheets, or data pipelines.

For tables with merged cells:
- Use html. It's the only format that preserves merged cells.
- Financial statements, regulatory filings, and complex reports need HTML to maintain correct structure.
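As an illustration of the csv case, the sketch below parses csv table text with Python's standard library. The sample table content is made up for the example, and how you obtain the csv string from a parse result is not shown here.

```python
import csv
import io
from typing import Dict, List

# Hypothetical csv table content, invented for this example.
table_csv = "Region,Revenue\nNorth,1200\nSouth,950\n"


def csv_table_to_records(text: str) -> List[Dict[str, str]]:
    """Parse csv table text into one dict per row, keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


records = csv_table_to_records(table_csv)
# Values arrive as strings, so convert before doing arithmetic.
total_revenue = sum(int(record["Revenue"]) for record in records)
```

With pandas installed, `pandas.read_csv(io.StringIO(table_csv))` loads the same text into a DataFrame.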
Troubleshooting
Merged cells not appearing correctly
Use the html format. Markdown and JSON formats cannot represent merged cells and will flatten them.
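HTML can carry merged cells because the rowspan and colspan attributes survive in the markup. The sketch below reads them back with Python's standard html.parser module. The sample table is invented, and the exact markup Reducto emits may differ.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect (text, rowspan, colspan) for every td and th cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._current = None  # [text, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            attr = dict(attrs)
            # Missing span attributes default to 1 (an unmerged cell).
            rowspan = int(attr.get("rowspan") or 1)
            colspan = int(attr.get("colspan") or 1)
            self._current = ["", rowspan, colspan]

    def handle_data(self, data):
        if self._current is not None:
            self._current[0] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            text, rowspan, colspan = self._current
            self.cells.append((text.strip(), rowspan, colspan))
            self._current = None


# Hypothetical html table with one header cell spanning two columns.
html_table = (
    '<table><tr><th colspan="2">Q1</th></tr>'
    "<tr><td>Jan</td><td>Feb</td></tr></table>"
)
collector = SpanCollector()
collector.feed(html_table)
merged_cells = [cell for cell in collector.cells if cell[1] > 1 or cell[2] > 1]
```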
Tables split across pages
Enable merge_tables: True to combine consecutive tables with the same column structure.
Output is too verbose
Use csv for minimal output. If you need structure but want fewer tokens, use json instead of html.
Need to know cell positions
Use the jsonbbox format. Each cell includes normalized (x, y, width, height) coordinates.
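For highlighting or overlays you usually need pixel coordinates. The sketch below converts one normalized box to pixels. It assumes the values are fractions of the page width and height in the range 0 to 1, with the origin at the top-left corner; check this against your actual output before relying on it. The function name is illustrative.

```python
from typing import Tuple


def bbox_to_pixels(
    bbox: Tuple[float, float, float, float],
    page_width_px: int,
    page_height_px: int,
) -> Tuple[int, int, int, int]:
    """Convert a normalized (x, y, width, height) box to pixel edges.

    Assumes values are fractions of the page size with a top-left origin.
    Returns (left, top, right, bottom) in pixels.
    """
    x, y, width, height = bbox
    left = round(x * page_width_px)
    top = round(y * page_height_px)
    right = round((x + width) * page_width_px)
    bottom = round((y + height) * page_height_px)
    return left, top, right, bottom


# A cell covering the middle half of the page width, on a 1000x800 px render.
pixel_box = bbox_to_pixels((0.25, 0.5, 0.5, 0.25), page_width_px=1000, page_height_px=800)
```

Returning edges rather than width and height matches what most drawing APIs expect for rectangles.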