Reducto extracts tables from documents and can return them in several formats. The format you choose affects how merged cells, headers, and structure are represented.

Setting Table Format

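# Assumes `client` is an initialized Reducto client and `upload` is the
# result of an earlier upload call; neither is defined in this snippet.
result = client.parse.run(
    input=upload.file_id,
    formatting={"table_output_format": "html"}  # return tables as HTML
)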

Available Formats

dynamic (default). Automatically chooses between markdown and HTML based on table complexity:
  • Uses markdown for simple tables (30 cells or fewer AND 4 merged cells or fewer)
  • Uses HTML for complex tables (more than 30 cells OR more than 4 merged cells)
formatting={"table_output_format": "dynamic"}
Best for RAG pipelines where you want clean, readable output for simple tables while preserving structure for complex ones.
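The thresholds above can be written out as a small predicate. This is only a sketch of the documented rule, not Reducto's implementation: `choose_table_format` and its parameters are hypothetical names, and the returned strings are descriptive labels rather than API values.

```python
def choose_table_format(cell_count: int, merged_cell_count: int) -> str:
    """Mirror the documented dynamic-format thresholds (illustrative only)."""
    # Simple table: 30 cells or fewer AND 4 merged cells or fewer.
    if cell_count <= 30 and merged_cell_count <= 4:
        return "markdown"
    # Anything else counts as complex: more than 30 cells OR more than 4 merged cells.
    return "html"
```

Under this rule a 5x6 table with no merged cells stays in markdown, while a small table with five merged cells switches to HTML.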

Additional Options

Merge Tables

When a logical table spans multiple pages, Reducto may detect it as separate tables. Enable merge_tables to combine consecutive tables with the same column count:
formatting={
    "table_output_format": "html",
    "merge_tables": True
}
The algorithm:
  1. Identifies consecutive tables with identical column counts
  2. Uses a language model to determine if the second table is a continuation of the first
  3. Combines them into a single table, removing duplicate headers
Merging relies on column count and semantic analysis, so tables with the same number of columns but different structures may be merged incorrectly. Review the output when using this option on complex documents.
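The three steps can be sketched as follows. This is a minimal illustration under stated assumptions, not Reducto's code: tables are modeled as lists of rows with the header in row 0, the column count is taken from that first row, and `is_continuation` is a hypothetical callback standing in for the language-model check.

```python
from typing import Callable, List

Table = List[List[str]]  # rows of cell strings; row 0 is assumed to be the header


def merge_consecutive_tables(
    tables: List[Table],
    is_continuation: Callable[[Table, Table], bool],
) -> List[Table]:
    """Merge neighbouring tables that share a column count and pass the check."""
    merged: List[Table] = []
    for table in tables:
        if not table:
            continue  # nothing to merge from an empty table
        prev = merged[-1] if merged else None
        # Step 1: the candidate must directly follow a table with the same column count.
        same_columns = prev is not None and len(prev[0]) == len(table[0])
        # Step 2: the callback decides whether it continues the previous table.
        if same_columns and is_continuation(prev, table):
            # Step 3: append the rows, dropping a header that repeats the first one.
            rows = table[1:] if table[0] == prev[0] else table
            prev.extend(list(row) for row in rows)
        else:
            merged.append([list(row) for row in table])
    return merged
```

A usage example with a callback that always answers yes, which makes the column-count risk described above easy to reproduce:

```python
page1 = [["Item", "Qty"], ["Bolt", "4"]]
page2 = [["Item", "Qty"], ["Nut", "9"]]
other = [["A", "B", "C"], ["1", "2", "3"]]
result = merge_consecutive_tables([page1, page2, other], lambda a, b: True)
# page1 and page2 become one table; `other` has three columns and stays separate.
```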

Add Page Markers

Set add_page_markers to insert page boundary indicators into the content:
formatting={"add_page_markers": True}
Output includes markers like:
[[START OF PAGE 1]]

# Document Title

Content from page 1...

[[END OF PAGE 1]]
[[START OF PAGE 2]]
Useful for page-specific extraction or citation tracking.
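For page-specific extraction, the markers can be used to split the content back into pages. The sketch below assumes the marker text is exactly as shown above and that every page has both a start and an end marker; `split_by_page` is a hypothetical helper, not part of the Reducto SDK.

```python
import re
from typing import Dict

# Matches one page: a start marker, the page body, and the end marker with the same number.
PAGE_BLOCK = re.compile(
    r"\[\[START OF PAGE (\d+)\]\](.*?)\[\[END OF PAGE \1\]\]",
    re.DOTALL,
)


def split_by_page(content: str) -> Dict[int, str]:
    """Return a mapping of page number to that page's text, without the markers."""
    return {
        int(match.group(1)): match.group(2).strip()
        for match in PAGE_BLOCK.finditer(content)
    }
```

Text that falls outside a matched start/end pair is ignored by this pattern, so it suits citation lookups better than lossless round-tripping.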

Include Additional Metadata

The formatting group also supports extracting comments, highlights, change tracking, hyperlinks, and signatures. See Additional Document Data for details.

Choosing the Right Format

For LLM context (RAG, summarization):
  • Use dynamic (default). It balances readability with structure preservation.
  • Markdown is easier for LLMs to parse than HTML for simple tables.
  • Complex tables benefit from HTML to preserve relationships between cells.
For programmatic data extraction:
  • Use json when you need to iterate over rows and cells.
  • Use jsonbbox when cell positions matter (highlighting, overlays).
  • Use csv for direct import into pandas, spreadsheets, or data pipelines.
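As an illustration of the csv case, a table returned as CSV text can be turned into row dictionaries with the standard library alone. The sketch assumes the first CSV row is the header; `csv_table_to_records` is a hypothetical helper, and how the CSV string is read out of the parse result is not shown here.

```python
import csv
import io
from typing import Dict, List


def csv_table_to_records(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into one dict per data row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]
```

All values come back as strings, so numeric columns still need explicit conversion before analysis.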
For accuracy-critical applications:
  • Use html. It’s the only format that preserves merged cells.
  • Financial statements, regulatory filings, and complex reports need HTML to maintain correct structure.

Troubleshooting

Merged cells are flattened: use html format. Markdown and JSON formats cannot represent merged cells and will flatten them.
A table is split across pages: enable merge_tables: True to combine consecutive tables with the same column structure.
Table output is too large: use csv for minimal output. If you need structure but want fewer tokens, use json instead of html.
You need cell positions: use jsonbbox format. Each cell includes normalized (x, y, width, height) coordinates.
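To draw highlights or overlays, normalized coordinates have to be scaled to the rendered page size. The sketch below assumes values in the range 0 to 1 with the origin at the top-left corner of the page, which is an assumption about the coordinate convention rather than something stated here; `bbox_to_pixels` is a hypothetical helper.

```python
from typing import Tuple


def bbox_to_pixels(
    x: float,
    y: float,
    width: float,
    height: float,
    page_width_px: int,
    page_height_px: int,
) -> Tuple[int, int, int, int]:
    """Convert a normalized (x, y, width, height) box to pixel (left, top, right, bottom)."""
    left = round(x * page_width_px)
    top = round(y * page_height_px)
    right = round((x + width) * page_width_px)
    bottom = round((y + height) * page_height_px)
    return left, top, right, bottom
```

For example, a box of (0.25, 0.5, 0.5, 0.25) on an 800 by 1000 pixel page image maps to (200, 500, 600, 750).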