Parse

Parse converts your documents into structured JSON. It runs OCR, detects document layout (headers, paragraphs, tables, figures), and returns content organized into chunks for LLM and RAG workflows. Each element includes its type, page position, and confidence score. It can handle multi-column text, nested tables, forms with handwriting, rotated pages, and documents mixing text with charts and images.

Try it live: See Parse in action with a sample bank statement in Reducto Studio.

File size limits: Upload files up to 100MB directly via the Upload endpoint, or up to 5GB via presigned URL. You can also pass public URLs or presigned S3/GCS/Azure URLs directly.

Quick Start

from pathlib import Path
from reducto import Reducto

client = Reducto()

upload = client.upload(file=Path("invoice.pdf"))
result = client.parse.run(input=upload.file_id)

for chunk in result.result.chunks:
    print(chunk.content)

What You Get Back

{
  "job_id": "7600c8c5-a52f-49d2-8a7d-d75d1b51e141",
  "duration": 3.89,
  "result": {
    "type": "full",
    "chunks": [
      {
        "content": "# Invoice\n\nBill To: Acme Corp\n123 Main St...",
        "embed": "# Invoice\n\nBill To: Acme Corp...",
        "blocks": [
          {
            "type": "Title",
            "content": "Invoice",
            "bbox": { "left": 0.1, "top": 0.05, "width": 0.3, "height": 0.04, "page": 1 },
            "confidence": "high"
          }
        ]
      }
    ]
  },
  "usage": { "num_pages": 1, "credits": 2.0 },
  "studio_link": "https://studio.reducto.ai/job/7600c8c5-..."
}

Key fields:

Field	What it is
`chunks[].content`	The extracted content, formatted as Markdown (headers become `#`, tables become Markdown/HTML tables). Ready to pass to an LLM.
`chunks[].embed`	Same content but optimized for embeddings. When figure/table summaries are enabled, this field contains natural language descriptions instead of raw table markup.
`chunks[].blocks`	The individual elements (paragraphs, tables, figures) with their positions and types. Useful for highlighting or linking back to source.
`result.type`	Either `"full"` (content inline) or `"url"` (content at a URL). Large documents return `"url"` to avoid HTTP size limits.
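
The `bbox` values in the sample response (0.1, 0.05, and so on) look like fractions of the page rather than pixels. Assuming they are normalized to the range 0 to 1 relative to page width and height, a minimal sketch for turning a block's box into pixel coordinates for highlighting might look like this. The helper name `bbox_to_pixels` and the page dimensions are illustrative and not part of the Reducto SDK.

```python
def bbox_to_pixels(bbox: dict, page_width: int, page_height: int) -> tuple[int, int, int, int]:
    """Convert a bbox into (x0, y0, x1, y1) pixel coordinates.

    Assumes left/top/width/height are fractions of the page size. That is an
    assumption based on the sample response, not a documented guarantee.
    """
    x0 = round(bbox["left"] * page_width)
    y0 = round(bbox["top"] * page_height)
    x1 = round((bbox["left"] + bbox["width"]) * page_width)
    y1 = round((bbox["top"] + bbox["height"]) * page_height)
    return x0, y0, x1, y1


# Block copied from the sample response.
block = {
    "type": "Title",
    "content": "Invoice",
    "bbox": {"left": 0.1, "top": 0.05, "width": 0.3, "height": 0.04, "page": 1},
    "confidence": "high",
}

# Hypothetical page image rendered at 1000 x 2000 pixels.
print(bbox_to_pixels(block["bbox"], page_width=1000, page_height=2000))
```

The `page` value tells you which page image to draw the rectangle on.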

Response Format Details

Full breakdown of chunks, blocks, bounding boxes, and confidence scores.

Input Options

The input field accepts four formats:

Upload response (reducto://...): After uploading via /upload, use the returned file_id. This is the most common method for local files.
Public URL: Any publicly accessible URL. Reducto fetches the file directly.
Presigned URL: S3, GCS, or Azure Blob presigned URLs work. Useful when files are in your cloud storage.
Previous job ID (jobid://...): Reprocess a document from a previous parse job without re-uploading. Useful for testing different configurations.

# From upload
result = client.parse.run(input=upload.file_id)

# Public URL
result = client.parse.run(input="https://example.com/doc.pdf")

# Presigned S3 URL  
result = client.parse.run(input="https://bucket.s3.amazonaws.com/doc.pdf?X-Amz-...")

# Reprocess previous job
result = client.parse.run(input="jobid://7600c8c5-a52f-49d2-8a7d-d75d1b51e141")
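
If your pipeline receives documents from several sources, it can help to log which of the four input formats a string falls into before sending it. The sketch below is a hypothetical client-side helper (`classify_input` is not part of the SDK), and the presigned-URL check is only a heuristic based on common S3, GCS, and Azure query parameters.

```python
from urllib.parse import urlparse

# Query-string fragments that commonly appear in presigned URLs (heuristic, not exhaustive).
PRESIGNED_MARKERS = ("x-amz-", "x-goog-", "sig=")


def classify_input(value: str) -> str:
    """Return which of the four documented input formats a string looks like."""
    parsed = urlparse(value)
    if parsed.scheme == "reducto":
        return "upload"
    if parsed.scheme == "jobid":
        return "previous_job"
    if parsed.scheme in ("http", "https"):
        query = parsed.query.lower()
        if any(marker in query for marker in PRESIGNED_MARKERS):
            return "presigned_url"
        return "public_url"
    raise ValueError(f"Unrecognized input format: {value!r}")


print(classify_input("jobid://7600c8c5-a52f-49d2-8a7d-d75d1b51e141"))
```

A bare local path such as `invoice.pdf` raises `ValueError`, which is a reminder to upload the file first and pass the returned `file_id`.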

Sync vs Async

Parse has both synchronous (/parse) and asynchronous (/parse_async) endpoints. Use async for large documents or when you need webhook delivery.

Sync vs Async Guide

When to use each, how priority works, webhook setup.

Configuration

Parse has several configuration groups. Here are the most commonly changed options:

Chunking

By default, Parse returns the entire document as one chunk. For RAG applications, you want smaller chunks that can be embedded and retrieved independently.

result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {"chunk_mode": "variable"}
    }
)

Mode	Behavior
`disabled`	One chunk for the whole document (default)
`variable`	Splits at semantic boundaries (sections, tables, figures stay intact). Best for RAG.
`page`	One chunk per page
`section`	Splits at section headers

Full chunking options →
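
Once chunking is enabled, each chunk can become one record in a vector store. The following is a rough sketch of that step, working on the response as a plain dict in the shape shown under "What You Get Back" with `result.type` equal to `"full"`. The function name, the record keys, and the second sample chunk are made up for illustration.

```python
def to_embedding_records(response: dict) -> list[dict]:
    """Flatten a parsed response into one record per chunk."""
    records = []
    for index, chunk in enumerate(response["result"]["chunks"]):
        blocks = chunk.get("blocks", [])
        records.append(
            {
                "id": f"{response['job_id']}:{index}",
                # Prefer `embed` for the vector; fall back to `content` if it is missing.
                "text_to_embed": chunk.get("embed") or chunk["content"],
                # Keep the Markdown `content` to hand to the LLM at answer time.
                "text_for_llm": chunk["content"],
                # Pages the chunk touches, useful for linking back to the source.
                "pages": sorted({block["bbox"]["page"] for block in blocks}),
            }
        )
    return records


# Illustrative response with two chunks (sample data, not real output).
sample = {
    "job_id": "7600c8c5-a52f-49d2-8a7d-d75d1b51e141",
    "result": {
        "type": "full",
        "chunks": [
            {
                "content": "# Invoice\n\nBill To: Acme Corp",
                "embed": "# Invoice\n\nBill To: Acme Corp",
                "blocks": [{"type": "Title", "content": "Invoice", "bbox": {"page": 1}}],
            },
            {
                "content": "| Item | Total |",
                "embed": "Table listing items and totals.",
                "blocks": [{"type": "Table", "content": "| Item | Total |", "bbox": {"page": 2}}],
            },
        ],
    },
}

records = to_embedding_records(sample)
for record in records:
    print(record["id"], record["pages"])
```

Embedding `embed` while storing `content` alongside it keeps retrieval text and LLM context separate, which matters when summaries replace raw table markup in `embed`.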

Table Output Format

Controls how tables appear in the output.

result = client.parse.run(
    input=upload.file_id,
    formatting={
        "table_output_format": "html"
    }
)

Format	When to use
`dynamic`	Auto-selects HTML or Markdown based on table complexity (default)
`html`	Complex tables with merged cells, nested headers
`md`	Simple tables, Markdown-based workflows
`json`	Programmatic processing, need cell-level access
`csv`	Export to spreadsheets

Full table format options →

Figure Summaries

By default, Parse uses a vision model to generate descriptions for figures and images. This helps with RAG (the embed field contains the description) but adds latency.

result = client.parse.run(
    input=upload.file_id,
    enhance={
        "summarize_figures": True
    }
)

Agentic Mode

Uses an LLM to review and correct parsing output, at the cost of added latency and additional credits. Enable it when:

scope: "text": Handwritten text, faded scans, documents with unusual fonts, or when you see garbled characters in the output.
scope: "table": Tables with misaligned columns, merged cells that didn’t parse correctly, or numbers that appear in wrong columns.
scope: "figure": Charts and graphs that need data extraction, including advanced chart extraction with structured data output.

result = client.parse.run(
    input=upload.file_id,
    enhance={
        "agentic": [
            {"scope": "text"},
            {"scope": "table"},
            {"scope": "figure", "advanced_chart_agent": True}
        ]
    }
)

Don’t enable for clean digital PDFs (native text, not scanned). They parse correctly without it and you’ll just add latency.
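
Because agentic scopes cost latency and credits, one option is to build the `enhance` value per document instead of hard-coding all three scopes. The helper below is a hypothetical sketch: `build_agentic_config` is not an SDK function, and returning an empty dict for clean PDFs assumes the API accepts an empty `enhance` value.

```python
VALID_SCOPES = ("text", "table", "figure")


def build_agentic_config(scopes: list[str], advanced_charts: bool = False) -> dict:
    """Build an `enhance` dict containing only the requested agentic scopes."""
    entries = []
    for scope in scopes:
        if scope not in VALID_SCOPES:
            raise ValueError(f"Unknown agentic scope: {scope!r}")
        entry = {"scope": scope}
        # The chart agent flag only applies to the figure scope.
        if scope == "figure" and advanced_charts:
            entry["advanced_chart_agent"] = True
        entries.append(entry)
    # No scopes requested (for example, a clean digital PDF): leave agentic mode off.
    return {"agentic": entries} if entries else {}


print(build_agentic_config(["table"]))
print(build_agentic_config(["text", "figure"], advanced_charts=True))
```

The result can be passed as `enhance=build_agentic_config([...])` in the call shown earlier in this section.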

Filter Blocks

Remove specific content types from the output. The blocks still appear in blocks metadata but are excluded from content and embed.

result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "filter_blocks": ["Header", "Footer", "Page Number"]
    }
)

Useful for RAG when headers/footers would pollute search results.
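
Since filtered blocks remain in the `blocks` metadata, you can audit what was removed from `content` and `embed`. Here is a small sketch of that check on a chunk represented as a plain dict. The helper name and the sample blocks are illustrative, including the `Text` block type used for the body paragraph.

```python
def split_blocks(chunk: dict, filtered_types: set[str]) -> tuple[list[dict], list[dict]]:
    """Separate a chunk's blocks into (kept, filtered) by block type."""
    kept, filtered = [], []
    for block in chunk["blocks"]:
        if block["type"] in filtered_types:
            filtered.append(block)
        else:
            kept.append(block)
    return kept, filtered


# Illustrative chunk (sample data, not real output).
chunk = {
    "content": "Quarterly revenue grew 12%.",
    "blocks": [
        {"type": "Header", "content": "ACME Corp Confidential"},
        {"type": "Text", "content": "Quarterly revenue grew 12%."},
        {"type": "Page Number", "content": "3"},
    ],
}

kept, filtered = split_blocks(chunk, {"Header", "Footer", "Page Number"})
print([block["type"] for block in filtered])
```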

Page Range

Process only specific pages.

result = client.parse.run(
    input=upload.file_id,
    settings={
        "page_range": {"start": 1, "end": 10}
    }
)

Return Images

Get image URLs for figures and tables in the document.

result = client.parse.run(
    input=upload.file_id,
    settings={
        "return_images": ["figure", "table"]
    }
)

# Access images from blocks
for chunk in result.result.chunks:
    for block in chunk.blocks:
        if block.image_url:
            print(f"{block.type}: {block.image_url}")

Options: ["figure"], ["table"], or ["figure", "table"]. By default, no images are returned.

Additional Settings

Setting	Default	Description
`persist_results`	`false`	Keep results indefinitely instead of expiring after 24 hours
`timeout`	`null`	Custom timeout in seconds for processing
`force_url_result`	`false`	Always return results as a URL (useful for consistent handling)
`embed_pdf_metadata`	`false`	Embed OCR metadata into returned PDF

result = client.parse.run(
    input=upload.file_id,
    settings={
        "persist_results": True,
        "timeout": 120,
        "force_url_result": True
    }
)

For complete configuration reference including OCR settings, spreadsheet options, and more, see the Configuration section.

Troubleshooting

Tables look wrong

Try formatting.table_output_format: "html". HTML handles merged cells and complex headers better than Markdown.

Still broken? Enable enhance.agentic: [{"scope": "table"}] to use an LLM for alignment fixes.

Response is slow

Main causes:

enhance.agentic trades added latency for higher accuracy
enhance.summarize_figures adds latency on documents that contain figures
Processing time grows roughly linearly with document size
async_priority is not set to True, so the job runs without priority processing

For fastest processing, disable what you don’t need. See Best Practices.

Password-protected PDF

result = client.parse.run(
    input=upload.file_id,
    settings={"document_password": "your-password"}
)

Response is a URL instead of content

Large documents return result.type: "url" instead of inline content to avoid HTTP size limits. Fetch the content:

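import requests

if result.result.type == "url":
    # Large results are stored remotely: download and decode the JSON payload
    response = requests.get(result.result.url)
    response.raise_for_status()  # fail loudly rather than parse an error body
    chunks = response.json()
else:
    # Smaller results arrive inline
    chunks = result.result.chunks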

To always get a URL, so that your code handles every response the same way, set settings.force_url_result: true

Next Steps

Response Format

Full breakdown of chunks, blocks, and bounding boxes.

Best Practices

Optimization by document type, latency tips.
