Parse converts your documents into structured JSON. It runs OCR, detects document layout (headers, paragraphs, tables, figures), and returns content organized into chunks for LLM and RAG workflows. Each element includes its type, page position, and confidence score. It can handle multi-column text, nested tables, forms with handwriting, rotated pages, and documents mixing text with charts and images.
Try it live: See Parse in action with a sample bank statement in Reducto Studio.
File size limits: Upload files up to 100MB directly via the Upload endpoint, or up to 5GB via presigned URL. You can also pass public URLs or presigned S3/GCS/Azure URLs directly.

Quick Start

from pathlib import Path
from reducto import Reducto

client = Reducto()

upload = client.upload(file=Path("invoice.pdf"))
result = client.parse.run(input=upload.file_id)

for chunk in result.result.chunks:
    print(chunk.content)

What You Get Back

{
  "job_id": "7600c8c5-a52f-49d2-8a7d-d75d1b51e141",
  "duration": 3.89,
  "result": {
    "type": "full",
    "chunks": [
      {
        "content": "# Invoice\n\nBill To: Acme Corp\n123 Main St...",
        "embed": "# Invoice\n\nBill To: Acme Corp...",
        "blocks": [
          {
            "type": "Title",
            "content": "Invoice",
            "bbox": { "left": 0.1, "top": 0.05, "width": 0.3, "height": 0.04, "page": 1 },
            "confidence": "high"
          }
        ]
      }
    ]
  },
  "usage": { "num_pages": 1, "credits": 2.0 },
  "studio_link": "https://studio.reducto.ai/job/7600c8c5-..."
}
Key fields:
| Field | What it is |
| --- | --- |
| chunks[].content | The extracted content, formatted as Markdown (headers become #, tables become Markdown/HTML tables). Ready to pass to an LLM. |
| chunks[].embed | Same content but optimized for embeddings. When figure/table summaries are enabled, this field contains natural language descriptions instead of raw table markup. |
| chunks[].blocks | The individual elements (paragraphs, tables, figures) with their positions and types. Useful for highlighting or linking back to source. |
| result.type | Either "full" (content inline) or "url" (content at a URL). Large documents return "url" to avoid HTTP size limits. |
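As a minimal sketch of how the blocks metadata can support highlighting, the helper below walks a response shaped like the sample above and scales each bounding box to pixel coordinates. It assumes the bbox values are fractions of the page width and height (the sample values suggest this, but it is not stated here). `iter_block_boxes`, `page_width_px`, and `page_height_px` are hypothetical names, not part of the Reducto SDK.

```python
from typing import Iterator


def iter_block_boxes(response: dict, page_width_px: int, page_height_px: int) -> Iterator[dict]:
    """Yield one record per block with its bbox scaled to pixels.

    Assumption: bbox values are normalized to the 0-1 range.
    """
    for chunk in response["result"]["chunks"]:
        for block in chunk.get("blocks", []):
            bbox = block["bbox"]
            yield {
                "type": block["type"],
                "page": bbox["page"],
                "left": round(bbox["left"] * page_width_px),
                "top": round(bbox["top"] * page_height_px),
                "width": round(bbox["width"] * page_width_px),
                "height": round(bbox["height"] * page_height_px),
            }


# Trimmed copy of the sample response shown above
sample_response = {
    "result": {
        "type": "full",
        "chunks": [
            {
                "content": "# Invoice",
                "embed": "# Invoice",
                "blocks": [
                    {
                        "type": "Title",
                        "content": "Invoice",
                        "bbox": {"left": 0.1, "top": 0.05, "width": 0.3, "height": 0.04, "page": 1},
                        "confidence": "high",
                    }
                ],
            }
        ],
    }
}

boxes = list(iter_block_boxes(sample_response, page_width_px=1000, page_height_px=2000))
```

The page dimensions are supplied by the caller because the response sample does not include them.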

Response Format Details

Full breakdown of chunks, blocks, bounding boxes, and confidence scores.

Input Options

The input field accepts four formats:
  1. Upload response (reducto://...): After uploading via /upload, use the returned file_id. This is the most common method for local files.
  2. Public URL: Any publicly accessible URL. Reducto fetches the file directly.
  3. Presigned URL: S3, GCS, or Azure Blob presigned URLs work. Useful when files are in your cloud storage.
  4. Previous job ID (jobid://...): Reprocess a document from a previous parse job without re-uploading. Useful for testing different configurations.
# From upload
result = client.parse.run(input=upload.file_id)

# Public URL
result = client.parse.run(input="https://example.com/doc.pdf")

# Presigned S3 URL  
result = client.parse.run(input="https://bucket.s3.amazonaws.com/doc.pdf?X-Amz-...")

# Reprocess previous job
result = client.parse.run(input="jobid://7600c8c5-a52f-49d2-8a7d-d75d1b51e141")
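If your application accepts input strings from several sources, it can help to check which of the formats above a value uses before sending it. The following is a sketch only: `classify_input` is a hypothetical client-side helper, not an SDK function, and it relies solely on the `reducto://` and `jobid://` prefixes listed above. Public and presigned URLs are grouped together because they cannot be told apart reliably from the string alone.

```python
def classify_input(value: str) -> str:
    """Return a label for the kind of Parse input a string looks like."""
    if value.startswith("reducto://"):
        return "upload"
    if value.startswith("jobid://"):
        return "previous_job"
    if value.startswith(("http://", "https://")):
        return "url"  # public or presigned; not distinguished here
    raise ValueError(f"Unrecognized input format: {value!r}")


kinds = [
    classify_input("reducto://example-file-id"),  # made-up ID for illustration
    classify_input("https://example.com/doc.pdf"),
    classify_input("jobid://7600c8c5-a52f-49d2-8a7d-d75d1b51e141"),
]
```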

Sync vs Async

Parse has both synchronous (/parse) and asynchronous (/parse_async) endpoints. Use async for large documents or when you need webhook delivery.

Sync vs Async Guide

When to use each, how priority works, webhook setup.
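The rule of thumb above can be written down as a small decision helper. This is an illustrative sketch: `choose_endpoint` and the 50-page cutoff are assumptions made for the example, since this page does not define what counts as a large document.

```python
LARGE_DOCUMENT_PAGES = 50  # hypothetical cutoff; tune it for your workload


def choose_endpoint(num_pages: int, needs_webhook: bool = False) -> str:
    """Pick /parse_async for large documents or webhook delivery, else /parse."""
    if needs_webhook or num_pages >= LARGE_DOCUMENT_PAGES:
        return "/parse_async"
    return "/parse"


small_doc = choose_endpoint(num_pages=3)
large_doc = choose_endpoint(num_pages=400)
webhook_doc = choose_endpoint(num_pages=3, needs_webhook=True)
```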

Configuration

Parse has several configuration groups. Here are the most commonly changed options:

Chunking

By default, Parse returns the entire document as one chunk. For RAG applications, you want smaller chunks that can be embedded and retrieved independently.
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {"chunk_mode": "variable"}
    }
)
| Mode | Behavior |
| --- | --- |
| disabled | One chunk for the whole document (default) |
| variable | Splits at semantic boundaries (sections, tables, figures stay intact). Best for RAG. |
| page | One chunk per page |
| section | Splits at section headers |
Full chunking options →
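Once a document is split into chunks, each chunk can become one record in a vector store. The sketch below shows one way to build those records from a response in the shape documented under What You Get Back. `embedding_records` and its output keys are hypothetical names chosen for this example. It uses the `embed` field for the embedding text and keeps `content` for passing to an LLM.

```python
def embedding_records(response: dict) -> list[dict]:
    """Build one record per chunk, ready to embed and index."""
    records = []
    for index, chunk in enumerate(response["result"]["chunks"]):
        # Collect the pages this chunk touches from its block positions
        pages = sorted({block["bbox"]["page"] for block in chunk.get("blocks", [])})
        records.append(
            {
                "chunk_index": index,
                "text_for_embedding": chunk.get("embed") or chunk["content"],
                "text_for_llm": chunk["content"],
                "pages": pages,
            }
        )
    return records


# Hand-written two-chunk response used only to exercise the helper
example_response = {
    "result": {
        "type": "full",
        "chunks": [
            {
                "content": "# Invoice\n\nBill To: Acme Corp",
                "embed": "# Invoice\n\nBill To: Acme Corp",
                "blocks": [{"type": "Title", "bbox": {"page": 1}}],
            },
            {
                "content": "Terms: net 30",
                "embed": "",
                "blocks": [{"type": "Text", "bbox": {"page": 2}}, {"type": "Text", "bbox": {"page": 2}}],
            },
        ],
    }
}

records = embedding_records(example_response)
```

Falling back to `content` when `embed` is empty is a defensive choice for the example, not documented behavior.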

Table Output Format

Controls how tables appear in the output.
result = client.parse.run(
    input=upload.file_id,
    formatting={
        "table_output_format": "html"
    }
)
| Format | When to use |
| --- | --- |
| dynamic | Auto-selects HTML or Markdown based on table complexity (default) |
| html | Complex tables with merged cells, nested headers |
| md | Simple tables, Markdown-based workflows |
| json | Programmatic processing, need cell-level access |
| csv | Export to spreadsheets |
Full table format options →
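When the format comes from user input or a config file, a small guard can catch typos before the request is sent. This is a sketch under the assumption that the five values in the table above are the complete set. `formatting_options` is a hypothetical helper, not part of the SDK.

```python
TABLE_FORMATS = {"dynamic", "html", "md", "json", "csv"}  # values from the table above


def formatting_options(table_output_format: str = "dynamic") -> dict:
    """Return a formatting dict after checking the table format name."""
    if table_output_format not in TABLE_FORMATS:
        allowed = ", ".join(sorted(TABLE_FORMATS))
        raise ValueError(f"Unknown table format {table_output_format!r}; expected one of: {allowed}")
    return {"table_output_format": table_output_format}


html_formatting = formatting_options("html")
```

The returned dict can then be passed as the `formatting` argument shown earlier, for example `client.parse.run(input=upload.file_id, formatting=html_formatting)`.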

Figure Summaries

By default, Parse uses a vision model to generate descriptions for figures and images. This helps with RAG (the embed field contains the description) but adds latency.
result = client.parse.run(
    input=upload.file_id,
    enhance={
        "summarize_figures": True
    }
)

Agentic Mode

Uses an LLM to review and correct parsing output. This adds latency and uses additional credits. Enable the scope that matches the problem you see:
  • scope: "text": Handwritten text, faded scans, documents with unusual fonts, or when you see garbled characters in the output.
  • scope: "table": Tables with misaligned columns, merged cells that didn’t parse correctly, or numbers that appear in wrong columns.
  • scope: "figure": Charts and graphs that need data extraction, including advanced chart extraction with structured data output.
result = client.parse.run(
    input=upload.file_id,
    enhance={
        "agentic": [
            {"scope": "text"},
            {"scope": "table"},
            {"scope": "figure", "advanced_chart_agent": True}
        ]
    }
)
Don’t enable for clean digital PDFs (native text, not scanned). They parse correctly without it and you’ll just add latency.
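Because each scope adds latency and credits, it can be useful to build the `enhance` value from only the scopes a document actually needs. The helper below is a sketch: `agentic_options` is a hypothetical name, and it assumes `advanced_chart_agent` applies only to the figure scope, as in the example above.

```python
AGENTIC_SCOPES = ("text", "table", "figure")  # scopes listed above


def agentic_options(scopes: list[str], advanced_chart_agent: bool = False) -> dict:
    """Build an enhance dict that enables agentic mode for the given scopes."""
    entries = []
    for scope in scopes:
        if scope not in AGENTIC_SCOPES:
            raise ValueError(f"Unknown agentic scope: {scope!r}")
        entry = {"scope": scope}
        # Assumption: the chart agent flag only makes sense on the figure scope
        if scope == "figure" and advanced_chart_agent:
            entry["advanced_chart_agent"] = True
        entries.append(entry)
    return {"agentic": entries}


table_only = agentic_options(["table"])
with_charts = agentic_options(["text", "figure"], advanced_chart_agent=True)
```

For a scanned document with broken tables, `client.parse.run(input=upload.file_id, enhance=table_only)` would enable only the table scope.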

Filter Blocks

Remove specific content types from the output. The blocks still appear in blocks metadata but are excluded from content and embed.
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "filter_blocks": ["Header", "Footer", "Page Number"]
    }
)
Useful for RAG when headers/footers would pollute search results.
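Since filtered blocks remain in the blocks metadata, you can still inspect what was left out of `content` and `embed`. The following sketch separates a chunk's blocks into kept and filtered groups. `split_filtered_blocks` is a hypothetical helper, and the sample blocks are made up for illustration.

```python
def split_filtered_blocks(blocks: list[dict], filtered_types: set[str]) -> tuple[list[dict], list[dict]]:
    """Partition blocks into (kept, filtered) by block type."""
    kept = [block for block in blocks if block["type"] not in filtered_types]
    filtered = [block for block in blocks if block["type"] in filtered_types]
    return kept, filtered


example_blocks = [
    {"type": "Header", "content": "Acme Corp - Confidential"},
    {"type": "Title", "content": "Invoice"},
    {"type": "Text", "content": "Bill To: Acme Corp"},
    {"type": "Page Number", "content": "1"},
]

kept, filtered = split_filtered_blocks(example_blocks, {"Header", "Footer", "Page Number"})
```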

Page Range

Process only specific pages.
result = client.parse.run(
    input=upload.file_id,
    settings={
        "page_range": {"start": 1, "end": 10}
    }
)
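Page ranges can also be used to work through a long document in batches. The sketch below generates `page_range` values for consecutive batches. It assumes that `start` and `end` are 1-indexed and inclusive, which the example above suggests but does not state. `page_batches` is a hypothetical helper.

```python
def page_batches(total_pages: int, batch_size: int) -> list[dict]:
    """Split pages 1..total_pages into consecutive page_range dicts.

    Assumption: start and end are 1-indexed and inclusive.
    """
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    return [
        {"start": start, "end": min(start + batch_size - 1, total_pages)}
        for start in range(1, total_pages + 1, batch_size)
    ]


batches = page_batches(total_pages=25, batch_size=10)
```

Each dict can be passed as `settings={"page_range": batch}` in a separate `client.parse.run` call.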

Return Images

Get image URLs for figures and tables in the document.
result = client.parse.run(
    input=upload.file_id,
    settings={
        "return_images": ["figure", "table"]
    }
)

# Access images from blocks
for chunk in result.result.chunks:
    for block in chunk.blocks:
        if block.image_url:
            print(f"{block.type}: {block.image_url}")
Options: ["figure"], ["table"], or ["figure", "table"]. By default, no images are returned.

Additional Settings

| Setting | Default | Description |
| --- | --- | --- |
| persist_results | false | Keep results indefinitely instead of expiring after 24 hours |
| timeout | null | Custom timeout in seconds for processing |
| force_url_result | false | Always return results as a URL (useful for consistent handling) |
| embed_pdf_metadata | false | Embed OCR metadata into returned PDF |
result = client.parse.run(
    input=upload.file_id,
    settings={
        "persist_results": True,
        "timeout": 120,
        "force_url_result": True
    }
)
For complete configuration reference including OCR settings, spreadsheet options, and more, see the Configuration section.

Troubleshooting

Tables parse incorrectly: Try formatting.table_output_format: "html". HTML handles merged cells and complex headers better than Markdown. Still broken? Enable enhance.agentic: [{"scope": "table"}] to use an LLM for alignment fixes.
Slow processing: the main causes are:
  • enhance.agentic adds latency in exchange for higher accuracy
  • enhance.summarize_figures adds latency for documents that contain figures
  • Processing time grows roughly linearly with document size
  • async_priority left unset; set it to True for faster, prioritized processing
For the fastest processing, disable what you don’t need. See Best Practices.
Password-protected documents: pass the password in settings.document_password.
result = client.parse.run(
    input=upload.file_id,
    settings={"document_password": "your-password"}
)
Large documents return result.type: "url" instead of inline content to avoid HTTP size limits. Fetch the content:
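import requests

if result.result.type == "url":
    # Large result: download the JSON payload from the returned URL
    response = requests.get(result.result.url, timeout=60)
    response.raise_for_status()  # fail loudly on an expired or invalid URL
    chunks = response.json()
else:
    # Small result: chunks are returned inline
    chunks = result.result.chunks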
To always get a URL (consistent handling): settings.force_url_result: true

Next Steps