Response Format

The parsing response format is optimized for flexibility with retrieval-augmented generation (RAG). If you only need a markdown representation of a document, you can disable chunking altogether, so that the whole document comes back as a single chunk, and read response['result']['chunks'][0]['content'].

{
    "result": {
        "type": "full",
        "chunks": [
            {
                "content": "Chunk content optimized for passing to an LLM.",
                "embed": "Chunk content optimized for passing to an embedding model.",
                "blocks": [
                    {
                        "type": "Text", // Block type (Text, Table, Figure, etc.)
                        "bbox": {
                            // All bbox values normalized to [0,1] range
                            "left": 0.1,   // Distance from left edge (10%)
                            "top": 0.2,    // Distance from top edge (20%) 
                            "width": 0.3,  // Width as % of page width (30%)
                            "height": 0.4, // Height as % of page height (40%)
                            "page": 1,     // Current page number (1-indexed)
                            "original_page": 10  // Original doc page number
                        },
                        "content": "Text content",
                        "image_url": null // Presigned URL for downloading the figure/table image
                    }
                    // ...
                ]
            }
            // ...
        ]
    }
}
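
As a rough illustration of working with this structure, the sketch below reads the markdown out of a response and scales a normalized bbox to pixels. It assumes the response has already been parsed into a Python dict with the shape shown above and that the chunks are returned inline. The helper names document_markdown and bbox_to_pixels, and the page_width / page_height parameters, are hypothetical and not part of the API; the pixel size of the rendered page has to come from your own rendering step.

```python
from __future__ import annotations

from typing import Any


def document_markdown(response: dict[str, Any]) -> str:
    """Join the content of every chunk into one markdown string.

    With chunking disabled there is only one chunk, so this is
    equivalent to response['result']['chunks'][0]['content'].
    """
    chunks = response["result"]["chunks"]
    return "\n\n".join(chunk["content"] for chunk in chunks)


def bbox_to_pixels(
    bbox: dict[str, Any], page_width: int, page_height: int
) -> tuple[int, int, int, int]:
    """Scale a normalized [0,1] bbox to (left, top, right, bottom) pixels.

    page_width and page_height are assumptions: the pixel size of the
    page image you rendered yourself.
    """
    left = bbox["left"] * page_width
    top = bbox["top"] * page_height
    right = (bbox["left"] + bbox["width"]) * page_width
    bottom = (bbox["top"] + bbox["height"]) * page_height
    return round(left), round(top), round(right), round(bottom)
```

The (left, top, right, bottom) order was chosen because it is what most image-cropping routines expect; this conversion applies only to formats that use normalized coordinates, not to spreadsheets.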

FAQ

What is the difference between page and original_page?

How do embed and content differ for the blocks?

What is the result type?

How can I get images for figures and tables?

Can Reducto handle checkboxes?

Does Reducto handle skewed / rotated pages?

Can you return equations and subscripts/superscripts?

How do Excel/spreadsheet citations work differently?

Excel and other spreadsheet formats handle citations differently from PDFs and images:

Coordinate System:

Excel: Uses actual row/column positions (1-indexed). For example, cell A1 would have coordinates left: 1, top: 1, width: 1, height: 1
Other formats: Use normalized coordinates in [0,1] range relative to page dimensions

Page Field:

Excel: The page field represents the sheet index (1-indexed). Sheet 1 = page 1, Sheet 2 = page 2, etc.
Other formats: The page field represents the actual page number in the document

Example Excel Citation:

{
  "bbox": {
    "left": 2,      // Column B (1-indexed)
    "top": 5,       // Row 5 (1-indexed) 
    "width": 1,     // 1 column wide
    "height": 1,    // 1 row tall
    "page": 1,      // First sheet
    "original_page": 1
  }
}

This allows for precise cell-level citations that correspond directly to Excel’s native coordinate system.
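
To make the mapping concrete, here is a minimal sketch that turns a spreadsheet bbox into an A1-style reference. The helper names column_letters and bbox_to_a1 are hypothetical, not part of the API. The sketch assumes integer bbox values and that width and height count the cells covered (consistent with cell A1 having width 1 and height 1); how multi-cell citations are actually reported is an assumption here.

```python
def column_letters(index: int) -> str:
    """Convert a 1-indexed column number to Excel letters (1 -> A, 27 -> AA)."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = ""
    while index > 0:
        # Shift to 0-based before dividing, since there is no zero digit.
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def bbox_to_a1(bbox: dict) -> str:
    """Turn a spreadsheet bbox into an A1-style reference.

    Assumes width/height are cell counts, so the last column is
    left + width - 1 and the last row is top + height - 1.
    """
    first_col, first_row = bbox["left"], bbox["top"]
    last_col = first_col + bbox["width"] - 1
    last_row = first_row + bbox["height"] - 1
    start = f"{column_letters(first_col)}{first_row}"
    end = f"{column_letters(last_col)}{last_row}"
    return start if start == end else f"{start}:{end}"
```

Applied to the example citation above (left 2, top 5, width 1, height 1), bbox_to_a1 yields the single cell B5; the page field then tells you which sheet to look in.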
