This page is a dense, structured reference designed for AI coding agents. It contains everything needed to integrate Reducto without navigating multiple pages.

Product Summary

Reducto converts documents (PDFs, images, spreadsheets, DOCX, and 30+ other formats) into structured data via a REST API.
  • Base URL: https://platform.reducto.ai
  • Auth: Authorization: Bearer $REDUCTO_API_KEY
  • SDKs: Python (pip install reductoai), Node.js (npm install reductoai), Go (go get github.com/reductoai/reducto-go-sdk)
  • Input: Upload a file via /upload to get a file_id, then pass it to any endpoint. You can also pass public URLs or presigned S3/GCS/Azure URLs directly.

Authentication

  1. Create a free account at studio.reducto.ai
  2. In the Studio sidebar, click API Keys, then Create new API key
  3. Set the key as an environment variable:
# macOS / Linux
export REDUCTO_API_KEY="your_api_key_here"

# Windows (PowerShell)
$env:REDUCTO_API_KEY="your_api_key_here"
The Python and Node.js SDKs automatically read REDUCTO_API_KEY from the environment. For the Go SDK, pass it explicitly:
client := reducto.NewClient(option.WithAPIKey(os.Getenv("REDUCTO_API_KEY")))
For direct REST calls, pass it as a Bearer token:
curl -X POST https://platform.reducto.ai/parse \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "https://example.com/doc.pdf"}'

Supported File Types

| Category | Formats |
| --- | --- |
| PDF | .pdf |
| Documents | .docx, .doc, .dotx, .rtf, .txt, .wpd |
| Spreadsheets | .xlsx, .xlsm, .xls, .xltx, .xltm, .csv, .qpw |
| Presentations | .pptx, .ppt |
| Images | .png, .jpg/.jpeg, .gif, .bmp, .tiff, .heic, .psd |
Upload limit: 100MB direct, 5GB via presigned URL. Multi-page TIFFs are processed as multi-page documents.

Which Endpoint Should I Use?

| I want to… | Endpoint | Method | Key config |
| --- | --- | --- | --- |
| Get all text, tables, and figures from a document, separated into chunks with bounding box coordinates | /parse | POST | enhance.agentic for a stronger model pass to correct mistakes |
| Extract specific fields into JSON using a specific schema | /extract | POST | instructions.schema (JSON Schema) |
| Divide a document into named sections by page range | /split | POST | split_description (section definitions) |
| Fill PDF or DOCX forms | /edit | POST | edit_instructions (natural language) |
| Classify a document's type before processing | /classify | POST | classification_schema (categories + criteria) |
| Upload a local file for processing | /upload | POST (multipart) | file field |
| Process asynchronously with webhooks | /parse_async, /extract_async, /split_async, /edit_async | POST | webhook URL |
| Check job status or retrieve results | /job/{job_id} | GET | - |

Quick Start (Python)

from reducto import Reducto

client = Reducto()  # reads REDUCTO_API_KEY from env

# --- Option A: Pass a URL directly (no upload needed) ---
parse_result = client.parse.run(input="https://example.com/document.pdf")

# --- Option B: Upload a local file first ---
upload = client.upload(file="document.pdf")
parse_result = client.parse.run(input=upload.file_id)

# --- Handle the response (important: check result.type for large docs) ---
import requests

if parse_result.result.type == "url":
    # Large documents return a URL instead of inline content
    chunks = requests.get(parse_result.result.url).json()
else:
    chunks = parse_result.result.chunks

for chunk in chunks:
    # Use dict access for URL results, attribute access for inline results
    content = chunk["content"] if isinstance(chunk, dict) else chunk.content
    print(content)

# --- Extract: pull specific fields ---
extract_result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string", "description": "The invoice number"},
                "total": {"type": "number", "description": "Total amount due"}
            }
        }
    }
)
# result is a list; take the first item for single-document extraction
data = extract_result.result[0]
print(data["invoice_number"], data["total"])

# --- Split: find section boundaries ---
split_result = client.split.run(
    input=upload.file_id,
    split_description=[
        {"name": "Summary", "description": "Executive summary section"},
        {"name": "Financials", "description": "Financial statements and tables"}
    ]
)
for split in split_result.result.splits:
    print(f"{split.name}: pages {split.pages}")

# --- Classify: identify document type ---
classify_result = client.classify.run(
    input=upload.file_id,
    classification_schema=[
        {"category": "invoice", "criteria": ["billing info", "itemized charges"]},
        {"category": "contract", "criteria": ["legal terms", "signatures"]}
    ]
)
print(classify_result.result)

# --- Edit: fill a form ---
# NOTE: Edit uses "document_url" instead of "input" (unlike other endpoints)
edit_result = client.edit.run(
    document_url=upload.file_id,
    edit_instructions="Fill Name: John Doe, Date: 2024-01-15, Check 'Yes' for US Citizen"
)
print(edit_result.document_url)  # URL to download filled document

SDK Naming Conventions

| SDK | Property names | Install | Notes |
| --- | --- | --- | --- |
| Python | snake_case (array_extract, file_id) | pip install reductoai | Client auto-reads REDUCTO_API_KEY env var |
| Node.js | snake_case (array_extract, file_id) | npm install reductoai | All methods return promises; use await |
| Go | PascalCase (ArrayExtract, FileID) | go get github.com/reductoai/reducto-go-sdk | Wrap values with reducto.F(); use shared.UnionString() for document URLs |
| REST | snake_case in JSON body | - | Authorization: Bearer $REDUCTO_API_KEY header |

Parse Parameters

POST /parse. Convert documents into structured JSON with text, tables, and figures.

Core Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input | string | required | File ID (reducto://...), public URL, presigned URL, or jobid://... to reprocess a previous job |

enhance group

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enhance.agentic | array | [] | List of agentic scopes. Each item has a scope field. |
| enhance.agentic[].scope | "text" \| "table" \| "figure" | - | AI correction scope. text: OCR cleanup for scanned docs. table: fix misaligned columns. figure: chart data extraction. Adds latency and cost. |
| enhance.agentic[].prompt | string \| null | null | Custom prompt for agentic processing |
| enhance.agentic[].advanced_chart_agent | bool | false | Structured chart data extraction (figure scope only) |
| enhance.summarize_figures | bool | true | Generate natural-language descriptions of figures for RAG |
| enhance.intelligent_ordering | bool | false | Use a vision model to improve reading-order accuracy |

retrieval group

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| retrieval.chunking.chunk_mode | "disabled" \| "variable" \| "section" \| "page" \| "block" \| "page_sections" | "disabled" | disabled: one chunk for the entire doc. variable: semantic boundaries (best for RAG). section: split at headers. page: one chunk per page. page_sections: sections within each page. |
| retrieval.chunking.chunk_size | int \| null | null | Target chunk size in characters. Defaults to the 250-1500 range in variable mode. |
| retrieval.chunking.chunk_overlap | int | 0 | Characters of overlap between adjacent chunks |
| retrieval.filter_blocks | string[] | [] | Block types to exclude from content/embed. Options: "Header", "Footer", "Title", "Section Header", "Page Number", "List Item", "Figure", "Table", "Key Value", "Text", "Comment", "Signature" |
| retrieval.embedding_optimized | bool | false | Optimize output for embedding models |

formatting group

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| formatting.table_output_format | "dynamic" \| "html" \| "md" \| "json" \| "csv" \| "jsonbbox" | "dynamic" | dynamic: auto-selects md or html based on complexity. html: best for complex/merged cells. md: simple tables. json: programmatic cell access. |
| formatting.add_page_markers | bool | false | Add page markers to output |
| formatting.merge_tables | bool | false | Merge consecutive tables with the same column count |
| formatting.include | string[] | [] | Include: "change_tracking", "highlight", "comments", "hyperlinks", "signatures", "ignore_watermarks" |

spreadsheet group

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| spreadsheet.split_large_tables.enabled | bool | true | Split large tables into smaller tables |
| spreadsheet.split_large_tables.size | int | 50 | Rows per chunk for split tables |
| spreadsheet.clustering | "accurate" \| "fast" \| "disabled" | "accurate" | Algorithm for splitting sheets into tables. accurate uses more powerful models (5x cost). |
| spreadsheet.include | string[] | [] | Include: "cell_colors", "formula", "dropdowns" |

settings group

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| settings.page_range | object \| null | null | {"start": 1, "end": 10} (1-indexed). Process specific pages only. |
| settings.return_images | string[] | [] | Return image URLs for block types: "figure", "table", "page" |
| settings.ocr_system | "standard" \| "legacy" | "standard" | standard: best multilingual OCR. legacy: Germanic languages only. |
| settings.extraction_mode | "ocr" \| "hybrid" | "hybrid" | hybrid: combines OCR with embedded PDF text (recommended). ocr: OCR only. |
| settings.persist_results | bool | false | Keep results indefinitely (by default they expire after 24h) |
| settings.force_url_result | bool | false | Always return results as a URL |
| settings.timeout | float \| null | null | Custom timeout in seconds |
| settings.document_password | string \| null | null | Password for encrypted documents |
| settings.embed_pdf_metadata | bool | false | Embed OCR metadata into the returned PDF |
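
The parameter groups above compose into a single JSON request body. As a sketch (the document URL and chosen values below are illustrative, not defaults), a RAG-oriented /parse body might look like:

```python
import json

# Illustrative /parse request body combining the enhance, retrieval,
# formatting, and settings groups documented above. The input URL is a
# placeholder; swap in a file_id from /upload or your own document URL.
parse_body = {
    "input": "https://example.com/scanned-report.pdf",
    "enhance": {
        "agentic": [{"scope": "table"}],   # correct misaligned table columns
        "summarize_figures": True,         # figure descriptions for RAG
    },
    "retrieval": {
        "chunking": {"chunk_mode": "variable", "chunk_size": 800},
        "filter_blocks": ["Header", "Footer", "Page Number"],
    },
    "formatting": {"table_output_format": "html"},
    "settings": {"page_range": {"start": 1, "end": 10}},
}

payload = json.dumps(parse_body)  # send as the POST body
```

Send `payload` exactly as in the curl example under Authentication.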

Parse Response Shape

{
  "job_id": "uuid",
  "duration": 3.89,
  "result": {
    "type": "full",           // "full" (inline) or "url" (fetch from URL)
    "chunks": [
      {
        "content": "# Heading\n\nText content...",   // Markdown-formatted
        "embed": "Heading. Text content...",          // Embedding-optimized
        "blocks": [
          {
            "type": "Title",       // Title, Section Header, Text, Table, Figure, Key Value, etc.
            "content": "Heading",
            "bbox": {"left": 0.1, "top": 0.05, "width": 0.3, "height": 0.04, "page": 1},
            "confidence": "high"   // "high" or "low"
          }
        ]
      }
    ]
  },
  "usage": {"num_pages": 3, "credits": 4.0},
  "studio_link": "https://studio.reducto.ai/job/..."
}
When result.type is "url", chunks are not inline. Fetch them from the URL:
import requests

if parse_result.result.type == "url":
    chunks = requests.get(parse_result.result.url).json()
    # chunks are plain dicts when fetched via URL
else:
    chunks = parse_result.result.chunks
    # chunks are SDK objects with attribute access
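
The confidence field on each block is useful for triage. A small helper, sketched here against the plain-dict chunk shape shown above (what you get when fetching URL-style results), can surface low-confidence blocks for manual review:

```python
def low_confidence_blocks(chunks):
    """Collect blocks marked "low" confidence for manual review.

    A sketch that assumes the plain-dict chunk shape documented above.
    """
    flagged = []
    for chunk in chunks:
        for block in chunk.get("blocks", []):
            if block.get("confidence") == "low":
                flagged.append((block["bbox"]["page"], block["type"], block["content"]))
    return flagged

# Sample data in the documented response shape
sample_chunks = [{
    "content": "# Invoice",
    "embed": "Invoice.",
    "blocks": [
        {"type": "Title", "content": "Invoice",
         "bbox": {"left": 0.1, "top": 0.05, "width": 0.3, "height": 0.04, "page": 1},
         "confidence": "high"},
        {"type": "Text", "content": "Tota1 due: $1,2SO.00",
         "bbox": {"left": 0.1, "top": 0.2, "width": 0.8, "height": 0.1, "page": 1},
         "confidence": "low"},
    ],
}]
```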

Extract Parameters

POST /extract. Pull specific fields from documents into structured JSON using a schema. Extract runs Parse internally. If a value doesn’t appear in the Parse output, Extract cannot extract it.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input | string \| string[] | required | File ID, URL, jobid://..., or an array of job IDs to combine |
| instructions.schema | object | {} | JSON Schema defining the fields to extract. Field names and descriptions directly influence accuracy. |
| instructions.system_prompt | string | "Be precise and thorough." | Document-level context for the LLM |
| settings.array_extract | bool | false | Segment the document for long arrays. Required when the schema has array fields in long documents. The schema must have at least one top-level array property. |
| settings.deep_extract | bool | false | Agentic mode that iteratively refines output for near-perfect accuracy. Higher cost/latency. |
| settings.citations.enabled | bool | false | Return source page, bbox, and text for each value. Mutually exclusive with chunking. |
| settings.citations.numerical_confidence | bool | true | Include 0-1 confidence scores (vs "high"/"low") |
| settings.include_images | bool | false | Include page images in the extraction context |
| settings.optimize_for_latency | bool | false | Higher-priority processing at 2x cost |
| parsing | object | {} | All Parse parameters (see above). Ignored if input is jobid://... |
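
The array_extract constraint is easy to get wrong, so here is a sketch of an /extract body whose schema satisfies it. The job ID and field names are placeholders:

```python
# Illustrative /extract request body using settings.array_extract. The key
# point: array_extract requires at least one top-level array property in
# the schema, as documented above.
extract_body = {
    "input": "jobid://placeholder-parse-job",  # reuse a prior Parse job
    "instructions": {
        "schema": {
            "type": "object",
            "properties": {
                "line_items": {  # the required top-level array property
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string", "description": "Line item description"},
                            "amount": {"type": "number", "description": "Line item amount"},
                        },
                    },
                },
            },
        },
    },
    "settings": {"array_extract": True},
}
```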

Extract Response Shape

{
  "result": [
    {
      "invoice_number": "INV-2024-001",
      "total": 1250.00,
      "line_items": [{"description": "Widget", "amount": 500.00}]
    }
  ],
  "job_id": "uuid",
  "usage": {"num_fields": 4, "num_pages": 2, "credits": 10.0},
  "studio_link": "https://studio.reducto.ai/job/..."
}
With citations.enabled: true, each value is wrapped:
{
  "result": {
    "total": {
      "value": 1250.00,
      "citations": [
        {
          "type": "Table",
          "content": "Total: $1,250.00",
          "bbox": {"left": 0.04, "top": 0.26, "width": 0.45, "height": 0.50, "page": 2},
          "confidence": "high"
        }
      ]
    }
  }
}
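
When citations are enabled, the wrapped values usually need unwrapping before downstream use. A sketch, assuming the wrapped shape shown above:

```python
def unwrap_citations(cited_result):
    """Split a citations-enabled Extract result into plain values and a
    field -> source-pages map. Assumes the wrapped shape documented above."""
    values, sources = {}, {}
    for field, wrapped in cited_result.items():
        values[field] = wrapped["value"]
        sources[field] = sorted({c["bbox"]["page"] for c in wrapped["citations"]})
    return values, sources

# Sample data mirroring the documented citation shape
sample = {
    "total": {
        "value": 1250.00,
        "citations": [{
            "type": "Table",
            "content": "Total: $1,250.00",
            "bbox": {"left": 0.04, "top": 0.26, "width": 0.45, "height": 0.50, "page": 2},
            "confidence": "high",
        }],
    },
}
```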

Split Parameters

POST /split. Divide documents into named sections by page number. Split runs Parse internally, then uses an LLM to classify pages against your section descriptions.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input | string | required | File ID, URL, or jobid://... |
| split_description | array | required | List of {"name": "...", "description": "..."} section definitions |
| split_rules | string | "Split the document into the applicable sections..." | Natural-language rules for splitting behavior |
| settings.table_cutoff | "truncate" \| "preserve" | "truncate" | truncate: first rows only (faster). preserve: all content. |
| parsing | object | {} | All Parse parameters. Ignored if input is jobid://... |

Split Response Shape

{
  "result": {
    "splits": [
      {"name": "Executive Summary", "pages": [1, 2]},
      {"name": "Financial Statements", "pages": [3, 4, 5, 6]},
      {"name": "Risk Factors", "pages": [7, 8, 9]}
    ]
  },
  "job_id": "uuid",
  "usage": {"num_pages": 9, "credits": 6.0}
}
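
A common follow-up is to feed each split back into Parse via settings.page_range. A sketch that assumes each section's page list is contiguous, as in the sample response above:

```python
def split_page_ranges(split_result):
    """Turn Split output into settings.page_range objects so each section
    can be re-parsed on its own. A sketch assuming contiguous page lists."""
    return {
        split["name"]: {"start": min(split["pages"]), "end": max(split["pages"])}
        for split in split_result["result"]["splits"]
    }

# Sample data in the documented response shape
sample = {"result": {"splits": [
    {"name": "Executive Summary", "pages": [1, 2]},
    {"name": "Financial Statements", "pages": [3, 4, 5, 6]},
]}}
```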

Edit Parameters

POST /edit. Fill PDF forms and modify DOCX documents. Note: Edit uses document_url as its input parameter, not input like other endpoints.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| document_url | string | required | File ID or URL of the document to edit (note: document_url, not input) |
| edit_instructions | string | required | Natural-language instructions. Be explicit: "Fill Name: John Doe, Date: 2024-01-15" |
| edit_options.color | string | "#FF0000" | Highlight color for edits (DOCX only) |
| edit_options.enable_overflow_pages | bool | false | Create appendix pages for text exceeding field capacity (PDF only) |
| form_schema | array \| null | null | Pre-defined field locations for repeatable form filling. Skips detection. |

Edit Response Shape

{
  "document_url": "https://storage.reducto.ai/filled-form.pdf?...",
  "form_schema": [
    {
      "bbox": {"left": 0.1, "top": 0.2, "width": 0.4, "height": 0.03, "page": 1},
      "description": "Name field",
      "type": "text"
    }
  ],
  "usage": {"num_pages": 2, "credits": 8}
}
The document_url is a presigned URL valid for 24 hours. Save the returned form_schema to reuse for the same form type (skips field detection).
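
Reusing a saved form_schema looks like this in practice. A sketch; the file ID is a placeholder and the schema entry mirrors the sample response shape above:

```python
# A form_schema saved from a previous /edit response for the same form type
saved_schema = [{
    "bbox": {"left": 0.1, "top": 0.2, "width": 0.4, "height": 0.03, "page": 1},
    "description": "Name field",
    "type": "text",
}]

# Illustrative /edit request body reusing the saved schema
edit_body = {
    "document_url": "reducto://placeholder-file-id",  # note: not "input"
    "edit_instructions": "Fill Name: Jane Roe, Date: 2024-02-01",
    "form_schema": saved_schema,  # skips field detection on this run
}
```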

Classify Parameters

POST /classify. Categorize a document before processing.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input | string | required | File ID or URL |
| classification_schema | array | required | List of {"category": "...", "criteria": ["...", "..."]} |
| page_range | object \| null | null | Pages to use for classification context. Defaults to the first 5 pages. Max 10 pages. |
| document_metadata | string \| null | null | Optional metadata to include in the classification prompt |

Classify Response Shape

{
  "result": {
    "category": "invoice"
  },
  "job_id": "uuid",
  "duration": 1.23
}
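
Classify-then-process pipelines typically map the returned category to a per-category Extract schema. A sketch with illustrative placeholder schemas (real ones would carry per-field descriptions, which influence accuracy as noted above):

```python
# Hypothetical per-category extraction schemas
SCHEMAS = {
    "invoice": {"type": "object", "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    }},
    "contract": {"type": "object", "properties": {
        "parties": {"type": "array", "items": {"type": "string"}},
    }},
}

def pick_schema(classify_result, default_category="invoice"):
    """Map a Classify response to an Extract schema, with a fallback."""
    category = classify_result["result"]["category"]
    return SCHEMAS.get(category, SCHEMAS[default_category])
```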

Async Processing

Most endpoints have async variants (/parse_async, /extract_async, /split_async, /edit_async). Classify is synchronous only. Async endpoints return a job_id immediately and process in the background.
# Submit async job
job = client.parse.run_job(input=upload.file_id)
print(job.job_id)

# Poll for results
import time
while True:
    result = client.job.get(job.job_id)
    if result.status in ("Completed", "Failed"):
        break
    time.sleep(2)
Configure webhooks for push-based delivery instead of polling.

Error Codes

| HTTP Status | Meaning | Common cause |
| --- | --- | --- |
| 401 | Unauthorized | Missing or invalid REDUCTO_API_KEY |
| 422 | Validation error | Invalid parameters, schema too large, or constraint violation |
| 429 | Rate limited | Too many concurrent requests. Retry with backoff. |
| 500 | Server error | Transient issue. Retry with backoff. |
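
For the 429 and 500 rows, a minimal exponential-backoff wrapper. This is a sketch: the exception class is a stand-in, and real code would instead inspect the status code on the SDK's or requests' exception:

```python
import time

class RetryableAPIError(Exception):
    """Stand-in for a 429/500 response; real code would check the HTTP
    status code on the actual exception raised by the SDK or requests."""

def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a callable with exponential backoff (1s, 2s, 4s, ...),
    as suggested for 429 and 500 responses above."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableAPIError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is injectable so the wrapper can be tested without real delays.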