Table Output Formats

Reducto extracts tables from documents and can return them in several formats. The format you choose affects how merged cells, headers, and structure are represented.

Setting Table Format

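# Assumes `client` is an initialized Reducto client and `upload` is the result of an earlier upload call.
result = client.parse.run(
    input=upload.file_id,
    formatting={"table_output_format": "html"}
)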

Available Formats

dynamic (default). Automatically chooses between markdown and HTML based on table complexity:

Uses markdown for simple tables (30 cells or fewer AND 4 merged cells or fewer)
Uses HTML for complex tables (more than 30 cells OR more than 4 merged cells)

formatting={"table_output_format": "dynamic"}

Best for RAG pipelines where you want clean, readable output for simple tables while preserving structure for complex ones.
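The selection rule described above can be sketched as a small predicate. This is only an illustration of the documented thresholds, not Reducto's implementation; the function name and its parameters are hypothetical.

```python
def choose_table_format(cell_count: int, merged_cell_count: int) -> str:
    """Mirror the documented dynamic rule: markdown for simple tables, HTML otherwise."""
    # Simple means BOTH conditions hold: at most 30 cells AND at most 4 merged cells.
    is_simple = cell_count <= 30 and merged_cell_count <= 4
    return "md" if is_simple else "html"
```

A table that exceeds either threshold falls through to HTML, so a 6-cell table with 5 merged cells is still treated as complex.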

html. Full HTML table structure with proper support for merged cells.

formatting={"table_output_format": "html"}

<table>
  <tr><th colspan="2">Q1 Results</th></tr>
  <tr><th>Product</th><th>Revenue</th></tr>
  <tr><td>Widget A</td><td>$10,000</td></tr>
</table>

Merged cells are encoded using rowspan and colspan attributes. Use for financial statements, regulatory filings, or any tables where cell merging carries meaning.
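If downstream code needs a rectangular grid, the rowspan and colspan attributes can be expanded by repeating each merged value across the slots it covers. The following is a minimal sketch using only the Python standard library; the class and function names are hypothetical, and it assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect <th>/<td> cells, repeating merged cells across the slots they span."""

    def __init__(self):
        super().__init__()
        self.grid = {}    # (row, col) -> cell text
        self.row = -1
        self.col = 0
        self.span = None  # (rowspan, colspan) of the cell currently being read
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            attrs = dict(attrs)
            self.span = (int(attrs.get("rowspan") or 1), int(attrs.get("colspan") or 1))
            self.text = []

    def handle_data(self, data):
        if self.span is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.span is not None:
            # Skip slots already filled by a rowspan from an earlier row.
            while (self.row, self.col) in self.grid:
                self.col += 1
            rowspan, colspan = self.span
            value = "".join(self.text).strip()
            for r in range(self.row, self.row + rowspan):
                for c in range(self.col, self.col + colspan):
                    self.grid[(r, c)] = value
            self.col += colspan
            self.span = None


def html_table_to_grid(html: str) -> list[list[str]]:
    """Return the table as a list of equal-length rows."""
    parser = TableGridParser()
    parser.feed(html)
    parser.close()
    if not parser.grid:
        return []
    n_rows = max(r for r, _ in parser.grid) + 1
    n_cols = max(c for _, c in parser.grid) + 1
    return [[parser.grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

Repeating the merged value is one design choice among several; leaving the covered slots empty would preserve the distinction between a merged cell and a repeated value.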

md. GitHub-flavored markdown tables.

formatting={"table_output_format": "md"}

| Header 1 | Header 2 |
| - | - |
| Data 1 | Data 2 |

Markdown cannot represent merged cells. If your table has merged cells, they will be flattened. Use for simple tables where human readability matters.

json. Nested arrays for programmatic access.

formatting={"table_output_format": "json"}

[
  ["Header 1", "Header 2"],
  ["Data 1", "Data 2"]
]

First row contains headers. All cell values are strings. Use when you need to process table data programmatically.
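Because the first row holds the headers, the nested arrays convert naturally into one dictionary per data row. A minimal sketch, assuming the table content has already been obtained as a JSON string; the helper name is hypothetical.

```python
import json


def table_to_records(table: list[list[str]]) -> list[dict[str, str]]:
    """Pair each data row with the header row to build one dict per row."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


raw = '[["Header 1", "Header 2"], ["Data 1", "Data 2"]]'
records = table_to_records(json.loads(raw))
```

Duplicate header names would collapse into a single key, so tables with repeated column titles need a different keying scheme.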

jsonbbox. JSON with normalized bounding box coordinates for each cell.

formatting={"table_output_format": "jsonbbox"}

[
  [
    {"text": "Header 1", "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.04}},
    {"text": "Header 2", "bbox": {"x": 0.4, "y": 0.2, "width": 0.3, "height": 0.04}}
  ]
]

Coordinates are normalized to [0, 1] relative to page dimensions. Use when you need to know where each cell is located on the page.
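To draw a highlight or overlay, the normalized values need to be scaled to the rendered page size. The sketch below assumes that (x, y) is the top-left corner of the cell and that the origin is the top-left of the page; neither is stated above, so verify against real output. The function name and the page dimensions are hypothetical.

```python
def bbox_to_pixels(bbox: dict, page_width: int, page_height: int) -> tuple[int, int, int, int]:
    """Scale a normalized bbox to (left, top, right, bottom) pixel coordinates."""
    # Assumption: x/y mark the top-left corner, measured from the page's top-left.
    left = round(bbox["x"] * page_width)
    top = round(bbox["y"] * page_height)
    right = round((bbox["x"] + bbox["width"]) * page_width)
    bottom = round((bbox["y"] + bbox["height"]) * page_height)
    return left, top, right, bottom


cell = {"text": "Header 1", "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.04}}
box = bbox_to_pixels(cell["bbox"], page_width=1000, page_height=2000)
```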

Agentic table enhancement is not compatible with jsonbbox and will be automatically disabled when this format is used.

csv. Comma-separated values.

formatting={"table_output_format": "csv"}

Header 1,Header 2
Data 1,Data 2

Minimal output that is easy to import into spreadsheet software. This is the most token-efficient format.
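When reading CSV output in code, a real CSV parser is safer than splitting on commas, since values such as "$10,000" contain commas themselves. A minimal sketch with the standard library; it assumes such values are quoted in the usual CSV way, which is not confirmed above, and the helper name is hypothetical.

```python
import csv
import io


def parse_csv_table(text: str) -> list[list[str]]:
    """Parse CSV text into rows, honoring quoted fields that contain commas."""
    return list(csv.reader(io.StringIO(text)))


rows = parse_csv_table("Header 1,Header 2\nData 1,Data 2\n")
```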

Additional Options

Merge Tables

When a logical table spans multiple pages, Reducto may detect it as separate tables. Enable merge_tables to combine consecutive tables with the same column count:

formatting={
    "table_output_format": "html",
    "merge_tables": True
}

The algorithm:

Identifies consecutive tables with identical column counts
Uses a language model to determine if the second table is a continuation of the first
Combines them into a single table, removing duplicate headers

Tables are merged based on column count and semantic analysis. Tables with the same number of columns but different structures may be incorrectly merged. Review output when using this option on complex documents.
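The steps above can be approximated locally to see why same-width tables are at risk of being merged incorrectly. This sketch is not Reducto's implementation: the language-model check is replaced by a caller-supplied `is_continuation` callback, tables are assumed to be non-empty lists of rows, and all names are hypothetical.

```python
from typing import Callable

Table = list[list[str]]


def merge_consecutive_tables(
    tables: list[Table],
    is_continuation: Callable[[Table, Table], bool],
) -> list[Table]:
    """Merge each table into the previous one when widths match and the check passes."""
    merged: list[Table] = []
    for table in tables:
        prev = merged[-1] if merged else None
        same_width = prev is not None and len(prev[0]) == len(table[0])
        if same_width and is_continuation(prev, table):
            # Drop the first row of the continuation if it repeats the header.
            rows = table[1:] if table[0] == prev[0] else table
            merged[-1] = prev + rows
        else:
            merged.append(list(table))
    return merged


page_1 = [["Name", "Qty"], ["A", "1"]]
page_2 = [["Name", "Qty"], ["B", "2"]]
other = [["X", "Y", "Z"], ["1", "2", "3"]]
result = merge_consecutive_tables([page_1, page_2, other], lambda a, b: True)
```

With a callback that always returns True, column count is the only guard, which is exactly the situation where unrelated tables of equal width get combined.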

Add Page Markers

Enable add_page_markers to insert page boundary indicators into the content:

formatting={"add_page_markers": True}

Output includes markers like:

[[START OF PAGE 1]]

# Document Title

Content from page 1...

[[END OF PAGE 1]]
[[START OF PAGE 2]]

Useful for page-specific extraction or citation tracking.
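For page-specific extraction, the markers can be used to split the content back into per-page text. A minimal sketch, assuming the markers appear exactly in the form shown above; the function name is hypothetical.

```python
import re

START = re.compile(r"\[\[START OF PAGE (\d+)\]\]")
END = re.compile(r"\[\[END OF PAGE \d+\]\]")


def split_pages(content: str) -> dict[int, str]:
    """Map each page number to the text between its START and END markers."""
    text = END.sub("", content)
    # With one capture group, split() yields [preamble, "1", page 1 text, "2", page 2 text, ...].
    parts = START.split(text)
    return {int(number): body.strip() for number, body in zip(parts[1::2], parts[2::2])}


sample = (
    "[[START OF PAGE 1]]\n\n# Document Title\n\nContent from page 1...\n\n"
    "[[END OF PAGE 1]]\n[[START OF PAGE 2]]\n\nContent from page 2...\n\n[[END OF PAGE 2]]"
)
pages = split_pages(sample)
```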

Include Additional Metadata

The formatting group also supports extracting comments, highlights, change tracking, hyperlinks, and signatures. See Additional Document Data for details.

Choosing the Right Format

For LLM context (RAG, summarization):

Use dynamic (default). It balances readability with structure preservation.
Markdown is easier for LLMs to parse than HTML for simple tables.
Complex tables benefit from HTML to preserve relationships between cells.

For programmatic data extraction:

Use json when you need to iterate over rows and cells.
Use jsonbbox when cell positions matter (highlighting, overlays).
Use csv for direct import into pandas, spreadsheets, or data pipelines.

For accuracy-critical applications:

Use html. It’s the only format that preserves merged cells.
Financial statements, regulatory filings, and complex reports need HTML to maintain correct structure.

Troubleshooting

Merged cells not appearing correctly

Use html format. Markdown and JSON formats cannot represent merged cells and will flatten them.

Tables split across pages

Set merge_tables to True to combine consecutive tables with the same column count.

Output is too verbose

Use csv for minimal output. If you need structure but want fewer tokens, use json instead of html.
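To get a feel for the size difference, the same rows can be rendered in each format and compared by character count. Character count is only a rough proxy for tokens, and the renderers below are simplified stand-ins written for this comparison, not Reducto's output; all names are hypothetical.

```python
import csv
import html
import io
import json


def render_sizes(rows: list[list[str]]) -> dict[str, int]:
    """Return the character length of the same table rendered as csv, json, and html."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)

    html_rows = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return {
        "csv": len(buffer.getvalue()),
        "json": len(json.dumps(rows)),
        "html": len(f"<table>{html_rows}</table>"),
    }


sizes = render_sizes([["Header 1", "Header 2"], ["Data 1", "Data 2"]])
```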

Need to know cell positions

Use jsonbbox format. Each cell includes normalized (x, y, width, height) coordinates.
