Reducto provides several options for controlling how tables are formatted in the API response. You can specify the table output format using the formatting.table_output_format parameter in your configuration.

Usage

result = client.parse.run(
    input="https://example.com/document.pdf",
    formatting={
        "table_output_format": "html"  # or "md", "json", "jsonbbox", "dynamic", "csv", "ai_json"
    }
)

Available formats

Dynamic format

The dynamic format (dynamic) automatically chooses between markdown and HTML based on table complexity:
  • Uses markdown for simple tables (≤ 30 cells and ≤ 4 merged cells)
  • Uses HTML for complex tables
This is the format we recommend overall, including for RAG use cases.
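The selection rule above can be sketched in a few lines. This is only an illustration of the stated thresholds, not Reducto's implementation; the function name `choose_table_format` and its two count arguments are hypothetical.

```python
def choose_table_format(cell_count: int, merged_cell_count: int) -> str:
    """Return "md" for simple tables and "html" for complex ones.

    Hypothetical helper: mirrors the documented thresholds
    (at most 30 cells and at most 4 merged cells counts as simple).
    """
    is_simple = cell_count <= 30 and merged_cell_count <= 4
    return "md" if is_simple else "html"


print(choose_table_format(12, 0))  # md
print(choose_table_format(48, 0))  # html
print(choose_table_format(12, 6))  # html
```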

HTML format

<table>
  <tr>
    <th>Header 1</th>
    <th>Header 2</th>
  </tr>
  <tr>
    <td>Data 1</td>
    <td>Data 2</td>
  </tr>
</table>
The HTML format (html) returns tables as HTML strings with proper support for:
  • Table headers (<th> tags)
  • Merged cells (using rowspan and colspan attributes)
  • Complex table structures
  • Cell formatting
This is the default format. It is recommended for accuracy-sensitive use cases because it preserves all table information.
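If you need the cell text from an HTML table string, the standard library's `html.parser` is enough for simple cases. The sketch below is a minimal example, assuming the table string looks like the sample above; the class and function names are hypothetical, and it deliberately ignores `rowspan` and `colspan`, so merged cells are not expanded.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. indentation between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html: str) -> list[list[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """
<table>
  <tr><th>Header 1</th><th>Header 2</th></tr>
  <tr><td>Data 1</td><td>Data 2</td></tr>
</table>
"""
print(html_table_to_rows(sample))
# [['Header 1', 'Header 2'], ['Data 1', 'Data 2']]
```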

Markdown format

| Header 1 | Header 2 |
|----------|----------|
| Data 1   | Data 2   |
The Markdown format (md) returns tables in GitHub-flavored markdown format. This is useful when:
  • You need a human-readable format
  • You’re displaying the content in markdown viewers
  • You want simpler table representation
  • The table doesn’t have complex merged cells

JSON format

[
  ["Header 1", "Header 2"],
  ["Data 1", "Data 2"]
]
The JSON format (json) returns tables as nested arrays where:
  • The outer array represents rows
  • Each inner array represents cells in that row
  • First row typically contains headers
  • All cell values are strings
This format is useful for programmatic processing of table data.
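For example, the nested arrays can be turned into one dictionary per data row. This is a minimal sketch that assumes the first row holds the headers (which the format only "typically" guarantees) and that the table arrives as a JSON string; `rows_to_records` is a hypothetical helper name.

```python
import json


def rows_to_records(table: list[list[str]]) -> list[dict[str, str]]:
    """Map each data row to {header: value}, using row 0 as the headers."""
    if not table:
        return []
    headers, *body = table
    return [dict(zip(headers, row)) for row in body]


raw = '[["Header 1", "Header 2"], ["Data 1", "Data 2"]]'
records = rows_to_records(json.loads(raw))
print(records)
# [{'Header 1': 'Data 1', 'Header 2': 'Data 2'}]
```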

JSON with bounding boxes

[
  [
    {
      "text": "Header 1",
      "bbox": {
        "x": 0.1,
        "y": 0.2,
        "width": 0.3,
        "height": 0.4
      }
    }
  ]
]
The JSON with bounding boxes format (jsonbbox) extends the JSON format by including positional information for each cell. The coordinates are normalized to the [0, 1] range, where:
  • x: Distance from the left edge of the page
  • y: Distance from the top edge of the page
  • width: Cell width as a fraction of page width
  • height: Cell height as a fraction of page height
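To draw or crop a cell on a rendered page image, the normalized values need to be scaled by the page's pixel dimensions. The sketch below assumes you already know the rendered page size; `bbox_to_pixels` and the 1000 × 2000 page size are hypothetical.

```python
def bbox_to_pixels(bbox: dict, page_width_px: int, page_height_px: int) -> tuple[int, int, int, int]:
    """Convert a normalized bbox to (left, top, right, bottom) in pixels."""
    left = bbox["x"] * page_width_px
    top = bbox["y"] * page_height_px
    right = left + bbox["width"] * page_width_px
    bottom = top + bbox["height"] * page_height_px
    return (round(left), round(top), round(right), round(bottom))


cell = {"text": "Header 1", "bbox": {"x": 0.1, "y": 0.2, "width": 0.3, "height": 0.4}}
print(bbox_to_pixels(cell["bbox"], page_width_px=1000, page_height_px=2000))
# (100, 400, 400, 1200)
```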

CSV format

Header 1,Header 2
Data 1,Data 2
The CSV format (csv) returns tables in comma-separated values format. This is useful when:
  • You need to import the data into spreadsheet software
  • You want a simple, widely-supported format
  • The table structure is relatively simple
  • You want to save on output tokens
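A CSV table string can be read back into rows with Python's built-in `csv` module. This is a minimal sketch; `csv_table_to_rows` is a hypothetical helper, and whether cells containing commas are quoted in the output is an assumption here (`csv.reader` handles standard quoting if it is present).

```python
import csv
import io


def csv_table_to_rows(text: str) -> list[list[str]]:
    """Parse a CSV table string into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


table_csv = "Header 1,Header 2\nData 1,Data 2\n"
print(csv_table_to_rows(table_csv))
# [['Header 1', 'Header 2'], ['Data 1', 'Data 2']]
```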

AI JSON format

The AI JSON format (ai_json) uses a custom LVM to parse the table structure and return the underlying JSON data. This mode performs best when the underlying table structure is very complex, is not strictly tabular, or contains many artifacts.

Merge Tables

The merge_tables option combines consecutive tables that have the same number of columns. This is useful when a single logical table spans multiple pages or is split by page breaks.
result = client.parse.run(
    input="https://example.com/document.pdf",
    formatting={
        "table_output_format": "html",
        "merge_tables": True
    }
)

When to use

Enable merge_tables when:
  • Tables span multiple pages with repeated headers
  • A logical table is split by page breaks or other content
  • You’re processing documents where tables continue across sections

How it works

When enabled, Reducto:
  1. Identifies consecutive tables with identical column counts
  2. Combines them into a single table block
  3. Preserves the first header row and removes duplicate header rows
Merging only considers column count. Tables with the same number of columns but different structures (e.g., different headers) may be incorrectly merged. Review output when using this option on complex documents.
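The behavior described above can be approximated on tables represented as lists of rows. This is a rough sketch for reasoning about the output, not Reducto's implementation: the function name is hypothetical, and treating "removes duplicates" as "drops a repeated header row" is an assumption.

```python
def merge_consecutive_tables(tables: list[list[list[str]]]) -> list[list[list[str]]]:
    """Merge neighbouring tables that have the same number of columns."""
    merged = []
    for table in tables:
        if not table:
            continue  # skip empty tables
        previous = merged[-1] if merged else None
        if previous and len(previous[0]) == len(table[0]):
            # Same column count: append rows, dropping a repeated header row.
            rows = table[1:] if table[0] == previous[0] else table
            previous.extend(list(row) for row in rows)
        else:
            merged.append([list(row) for row in table])
    return merged


page_1 = [["A", "B"], ["1", "2"]]
page_2 = [["A", "B"], ["3", "4"]]          # same header repeated on the next page
unrelated = [["C", "D"], ["5", "6"]]       # same column count, different header

print(merge_consecutive_tables([page_1, page_2]))
# [[['A', 'B'], ['1', '2'], ['3', '4']]]
print(merge_consecutive_tables([page_1, unrelated]))
# [[['A', 'B'], ['1', '2'], ['C', 'D'], ['5', '6']]]
```

The second call shows the caveat in practice: because only the column count is compared, an unrelated two-column table is absorbed and its header ends up as a data row.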

Add Page Markers

The add_page_markers option inserts page boundary indicators into the output. This helps when you need to track which page a piece of content came from.
result = client.parse.run(
    input="https://example.com/document.pdf",
    formatting={
        "add_page_markers": True
    }
)
When enabled, page markers appear as blocks in the content:
[[PAGE 1 BEGINS HERE]]

# Document Title

Content from page 1...

[[PAGE 1 ENDS HERE]]
[[PAGE 2 BEGINS HERE]]

Content from page 2...

Use cases

  • Page-specific extraction: When your schema needs to know which page data came from
  • Citation tracking: Maintaining page references for extracted information
  • Document reconstruction: Preserving original pagination for review
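For these use cases, the marked-up content can be split back into per-page text with a regular expression. The sketch below assumes the markers appear exactly as in the sample output above; `split_by_page` is a hypothetical helper, not part of the Reducto SDK.

```python
import re

PAGE_BEGIN = re.compile(r"\[\[PAGE (\d+) BEGINS HERE\]\]")
PAGE_END = re.compile(r"\[\[PAGE \d+ ENDS HERE\]\]")


def split_by_page(content: str) -> dict[int, str]:
    """Return {page_number: text} for content containing page markers."""
    # With one capture group, split() yields:
    # [text_before_first_marker, page_no, text, page_no, text, ...]
    parts = PAGE_BEGIN.split(content)
    pages = {}
    for number, text in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = PAGE_END.sub("", text).strip()
    return pages


content = (
    "[[PAGE 1 BEGINS HERE]]\n\n# Document Title\n\nContent from page 1...\n\n"
    "[[PAGE 1 ENDS HERE]]\n[[PAGE 2 BEGINS HERE]]\n\nContent from page 2...\n"
)
pages = split_by_page(content)
print(pages[2])
# Content from page 2...
```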