## Documentation Index
Fetch the complete documentation index at: https://docs.reducto.ai/llms.txt
Use this file to discover all available pages before exploring further.
This page is a dense, structured reference designed for AI coding agents. It contains everything needed to integrate Reducto without navigating multiple pages.
## Product Summary
Reducto converts documents (PDFs, images, spreadsheets, DOCX, and 30+ other formats) into structured data via a REST API.
- Base URL: `https://platform.reducto.ai`
- Auth: `Authorization: Bearer $REDUCTO_API_KEY`
- SDKs: Python (`pip install reductoai`), Node.js (`npm install reductoai`), Go (`go get github.com/reductoai/reducto-go-sdk`)
- Input: Upload a file via `/upload` to get a `file_id`, then pass it to any endpoint. You can also pass public URLs or presigned S3/GCS/Azure URLs directly.
## Authentication
- Create a free account at studio.reducto.ai
- In the Studio sidebar, click API Keys, then Create new API key
- Set the key as an environment variable:
```sh
# macOS / Linux
export REDUCTO_API_KEY="your_api_key_here"

# Windows (PowerShell)
$env:REDUCTO_API_KEY="your_api_key_here"
```
The Python and Node.js SDKs automatically read `REDUCTO_API_KEY` from the environment. For the Go SDK, pass it explicitly:
```go
client := reducto.NewClient(option.WithAPIKey(os.Getenv("REDUCTO_API_KEY")))
```
For direct REST calls, pass it as a Bearer token:
```bash
curl -X POST https://platform.reducto.ai/parse \
  -H "Authorization: Bearer $REDUCTO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": "https://example.com/doc.pdf"}'
```
## Supported File Types
| Category | Formats |
|---|---|
| PDF | `.pdf` |
| Documents | `.docx`, `.doc`, `.dotx`, `.rtf`, `.txt`, `.wpd` |
| Spreadsheets | `.xlsx`, `.xlsm`, `.xls`, `.xltx`, `.xltm`, `.csv`, `.qpw` |
| Presentations | `.pptx`, `.ppt` |
| Images | `.png`, `.jpg`/`.jpeg`, `.gif`, `.bmp`, `.tiff`, `.heic`, `.psd`, `.pcx`, `.ppm`, `.apng`, `.cur`, `.dcx`, `.ftex`, `.pixar` |
Upload limit: 100MB direct, 5GB via presigned URL. Multi-page TIFFs are processed as multi-page documents.
## Which Endpoint Should I Use?
| I want to… | Endpoint | Method | Key config |
|---|---|---|---|
| Get all text, tables, and figures from a document, separated into chunks with bounding-box coordinates | `/parse` | POST | `enhance.agentic` for a stronger model pass that corrects mistakes |
| Extract specific fields into JSON using a schema | `/extract` | POST | `instructions.schema` (JSON Schema) |
| Divide a document into named sections by page range | `/split` | POST | `split_description` (section definitions) |
| Fill PDF or DOCX forms | `/edit` | POST | `edit_instructions` (natural language) |
| Classify a document's type before processing | `/classify` | POST | `classification_schema` (categories + criteria) |
| Upload a local file for processing | `/upload` | POST (multipart) | `file` field |
| Process asynchronously with webhooks | `/parse_async`, `/extract_async`, `/split_async`, `/edit_async` | POST | webhook URL |
| Check job status or retrieve results | `/job/{job_id}` | GET | - |
## Quick Start (Python)
```python
from pathlib import Path

import requests

from reducto import Reducto

client = Reducto()  # reads REDUCTO_API_KEY from env

# --- Option A: Pass a URL directly (no upload needed) ---
parse_result = client.parse.run(input="https://example.com/document.pdf")

# --- Option B: Upload a local file first ---
upload = client.upload(file=Path("document.pdf"))
parse_result = client.parse.run(input=upload.file_id)

# --- Handle the response (important: check result.type for large docs) ---
if parse_result.result.type == "url":
    # Large documents return a URL instead of inline content
    chunks = requests.get(parse_result.result.url).json()
else:
    chunks = parse_result.result.chunks

for chunk in chunks:
    # Use dict access for URL results, attribute access for inline results
    content = chunk["content"] if isinstance(chunk, dict) else chunk.content
    print(content)

# --- Extract: pull specific fields ---
extract_result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string", "description": "The invoice number"},
                "total": {"type": "number", "description": "Total amount due"},
            }
        }
    },
)

# The result is a list; take the first item for single-document extraction
data = extract_result.result[0]
print(data["invoice_number"], data["total"])

# --- Split: find section boundaries ---
split_result = client.split.run(
    input=upload.file_id,
    split_description=[
        {"name": "Summary", "description": "Executive summary section"},
        {"name": "Financials", "description": "Financial statements and tables"},
    ],
)
for split in split_result.result.splits:
    print(f"{split.name}: pages {split.pages}")

# --- Classify: identify document type ---
classify_result = client.classify.run(
    input=upload.file_id,
    classification_schema=[
        {"category": "invoice", "criteria": ["billing info", "itemized charges"]},
        {"category": "contract", "criteria": ["legal terms", "signatures"]},
    ],
)
print(classify_result.result)

# --- Edit: fill a form ---
# NOTE: Edit uses "document_url" instead of "input" (unlike other endpoints)
edit_result = client.edit.run(
    document_url=upload.file_id,
    edit_instructions="Fill Name: John Doe, Date: 2024-01-15, Check 'Yes' for US Citizen",
)
print(edit_result.document_url)  # URL to download the filled document
```
## SDK Naming Conventions
| SDK | Property names | Install | Notes |
|---|---|---|---|
| Python | snake_case (`array_extract`, `file_id`) | `pip install reductoai` | Client auto-reads the `REDUCTO_API_KEY` env var |
| Node.js | snake_case (`array_extract`, `file_id`) | `npm install reductoai` | All methods return promises; use `await` |
| Go | PascalCase (`ArrayExtract`, `FileID`) | `go get github.com/reductoai/reducto-go-sdk` | Wrap values with `reducto.F()`; use `shared.UnionString()` for document URLs |
| REST | snake_case in JSON body | - | `Authorization: Bearer $REDUCTO_API_KEY` header |
## Parse Parameters

`POST /parse`. Convert documents into structured JSON with text, tables, and figures.
### Core Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input` | string | required | File ID (`reducto://...`), public URL, presigned URL, or `jobid://...` to reprocess |
### enhance group

| Parameter | Type | Default | Description |
|---|---|---|---|
| `enhance.agentic` | array | `[]` | List of agentic scopes. Each item has a `scope` field. |
| `enhance.agentic[].scope` | `"text"` \| `"table"` \| `"figure"` | - | AI correction scope. `text`: OCR cleanup for scanned docs. `table`: fix misaligned columns. `figure`: chart data extraction. Adds latency and cost. |
| `enhance.agentic[].prompt` | string \| null | `null` | Custom prompt for agentic processing |
| `enhance.agentic[].advanced_chart_agent` | bool | `false` | Structured chart data extraction (`figure` scope only) |
| `enhance.summarize_figures` | bool | `true` | Generate natural-language descriptions of figures for RAG |
| `enhance.intelligent_ordering` | bool | `false` | Use a vision model to improve reading-order accuracy |
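For example, a minimal sketch of enabling agentic enhancement via the Python SDK (passing the nested `enhance` group as a plain dict is an assumption; check the SDK signature):

```python
# Sketch: agentic correction for tables and figures on a parse call.
# Assumes the SDK accepts the `enhance` group as a nested dict.
parse_result = client.parse.run(
    input=upload.file_id,
    enhance={
        "agentic": [
            {"scope": "table"},                                 # fix misaligned columns
            {"scope": "figure", "advanced_chart_agent": True},  # structured chart data
        ],
        "summarize_figures": True,
    },
)
```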
### retrieval group

| Parameter | Type | Default | Description |
|---|---|---|---|
| `retrieval.chunking.chunk_mode` | `"disabled"` \| `"variable"` \| `"section"` \| `"page"` \| `"block"` \| `"page_sections"` | `"disabled"` | `disabled`: one chunk for the entire doc. `variable`: semantic boundaries (best for RAG). `section`: split at headers. `page`: one chunk per page. `page_sections`: sections within each page. |
| `retrieval.chunking.chunk_size` | int \| null | `null` | Target chunk size in characters. Defaults to the 250-1500 range in `variable` mode. |
| `retrieval.chunking.chunk_overlap` | int | `0` | Characters of overlap between adjacent chunks |
| `retrieval.filter_blocks` | string[] | `[]` | Block types to exclude from `content`/`embed`. Options: `"Header"`, `"Footer"`, `"Title"`, `"Section Header"`, `"Page Number"`, `"List Item"`, `"Figure"`, `"Table"`, `"Key Value"`, `"Text"`, `"Comment"`, `"Signature"` |
| `retrieval.embedding_optimized` | bool | `false` | Optimize output for embedding models |
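A typical RAG-oriented configuration might look like the sketch below (again assuming nested groups are passed as dicts):

```python
# Sketch: variable (semantic) chunking with boilerplate blocks filtered out.
parse_result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {"chunk_mode": "variable", "chunk_size": 1000, "chunk_overlap": 100},
        "filter_blocks": ["Header", "Footer", "Page Number"],
    },
)
```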
### formatting group

| Parameter | Type | Default | Description |
|---|---|---|---|
| `formatting.table_output_format` | `"dynamic"` \| `"html"` \| `"md"` \| `"json"` \| `"csv"` \| `"jsonbbox"` | `"dynamic"` | `dynamic`: auto-selects `md` or `html` based on complexity. `html`: best for complex/merged cells. `md`: simple tables. `json`: programmatic cell access. |
| `formatting.add_page_markers` | bool | `false` | Add page markers to output |
| `formatting.merge_tables` | bool | `false` | Merge consecutive tables with the same column count |
| `formatting.include` | string[] | `[]` | Include: `"change_tracking"`, `"highlight"`, `"comments"`, `"hyperlinks"`, `"signatures"`, `"ignore_watermarks"` |
### spreadsheet group

| Parameter | Type | Default | Description |
|---|---|---|---|
| `spreadsheet.split_large_tables.enabled` | bool | `true` | Split large tables into smaller tables |
| `spreadsheet.split_large_tables.size` | int | `50` | Rows per chunk for split tables |
| `spreadsheet.clustering` | `"accurate"` \| `"fast"` \| `"disabled"` | `"accurate"` | Algorithm for splitting sheets into tables. `accurate` uses more powerful models (5x cost). |
| `spreadsheet.include` | string[] | `[]` | Include: `"cell_colors"`, `"formula"`, `"dropdowns"` |
### settings group

| Parameter | Type | Default | Description |
|---|---|---|---|
| `settings.page_range` | object \| null | `null` | `{"start": 1, "end": 10}` (1-indexed). Process specific pages only. |
| `settings.return_images` | string[] | `[]` | Return image URLs for block types: `"figure"`, `"table"`, `"page"` |
| `settings.ocr_system` | `"standard"` \| `"legacy"` | `"standard"` | `standard`: best multilingual OCR. `legacy`: Germanic languages only. |
| `settings.extraction_mode` | `"ocr"` \| `"hybrid"` | `"hybrid"` | `hybrid`: combines OCR with embedded PDF text (recommended). `ocr`: OCR only. |
| `settings.persist_results` | bool | `false` | Keep results indefinitely (by default they expire after 24h) |
| `settings.force_url_result` | bool | `false` | Always return results as a URL |
| `settings.timeout` | float \| null | `null` | Custom timeout in seconds |
| `settings.document_password` | string \| null | `null` | Password for encrypted documents |
| `settings.embed_pdf_metadata` | bool | `false` | Embed OCR metadata into the returned PDF |
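The groups compose on a single call. A combined sketch (dict-style parameter nesting is an assumption):

```python
# Sketch: HTML tables, larger spreadsheet splits, and image URLs for figures,
# limited to the first ten pages.
parse_result = client.parse.run(
    input=upload.file_id,
    formatting={"table_output_format": "html", "merge_tables": True},
    spreadsheet={"split_large_tables": {"enabled": True, "size": 100}},
    settings={"page_range": {"start": 1, "end": 10}, "return_images": ["figure"]},
)
```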
## Parse Response Shape
```json
{
  "job_id": "uuid",
  "duration": 3.89,
  "result": {
    "type": "full", // "full" (inline) or "url" (fetch from URL)
    "chunks": [
      {
        "content": "# Heading\n\nText content...", // Markdown-formatted
        "embed": "Heading. Text content...", // Embedding-optimized
        "blocks": [
          {
            "type": "Title", // Title, Section Header, Text, Table, Figure, Key Value, etc.
            "content": "Heading",
            "bbox": {"left": 0.1, "top": 0.05, "width": 0.3, "height": 0.04, "page": 1},
            "confidence": "high" // "high" or "low"
          }
        ]
      }
    ]
  },
  "usage": {"num_pages": 3, "credits": 4.0},
  "studio_link": "https://studio.reducto.ai/job/..."
}
```
When `result.type` is `"url"`, chunks are not inline. Fetch them from the URL:
```python
import requests

if parse_result.result.type == "url":
    # chunks are plain dicts when fetched via URL
    chunks = requests.get(parse_result.result.url).json()
else:
    # chunks are SDK objects with attribute access
    chunks = parse_result.result.chunks
```
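If you want a single code path for both shapes, a small normalizer helps. A sketch; `model_dump()` assumes the SDK's chunk objects are Pydantic models:

```python
import requests

def get_chunks(parse_result) -> list[dict]:
    """Return chunks as a uniform list of dicts for both result types."""
    if parse_result.result.type == "url":
        return requests.get(parse_result.result.url).json()
    # Assumption: inline chunks are Pydantic models exposing model_dump().
    return [chunk.model_dump() for chunk in parse_result.result.chunks]

for chunk in get_chunks(parse_result):
    print(chunk["content"])
```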
## Extract Parameters

`POST /extract`. Pull specific fields from documents into structured JSON using a schema.

Extract runs Parse internally. If a value doesn't appear in the Parse output, Extract cannot extract it.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input` | string \| string[] | required | File ID, URL, `jobid://...`, or an array of job IDs to combine |
| `instructions.schema` | object | `{}` | JSON Schema defining the fields to extract. Field names and descriptions directly influence accuracy. |
| `instructions.system_prompt` | string | `"Be precise and thorough."` | Document-level context for the LLM |
| `settings.array_extract` | bool | `false` | Segment the document for long arrays. Required when the schema has array fields in long documents. The schema must have at least one top-level array property. |
| `settings.deep_extract` | bool | `false` | Agentic mode that iteratively refines output for near-perfect accuracy. Higher cost/latency. |
| `settings.citations.enabled` | bool | `false` | Return source page, bbox, and text for each value. Mutually exclusive with chunking. |
| `settings.citations.numerical_confidence` | bool | `true` | Include 0-1 confidence scores (instead of `"high"`/`"low"`) |
| `settings.include_images` | bool | `false` | Include page images in the extraction context |
| `settings.optimize_for_latency` | bool | `false` | Higher-priority processing at 2x cost |
| `parsing` | object | `{}` | All Parse parameters (see above). Ignored if `input` is `jobid://...`. |
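For long documents with repeating rows, pair a top-level array property with `settings.array_extract`. A sketch (the dict-style `settings` argument is an assumption):

```python
# Sketch: extract an arbitrarily long list of line items.
extract_result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "line_items": {  # top-level array property, required by array_extract
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "amount": {"type": "number"},
                        },
                    },
                }
            },
        }
    },
    settings={"array_extract": True},
)
print(extract_result.result[0]["line_items"])
```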
## Extract Response Shape

```json
{
  "result": [
    {
      "invoice_number": "INV-2024-001",
      "total": 1250.00,
      "line_items": [{"description": "Widget", "amount": 500.00}]
    }
  ],
  "job_id": "uuid",
  "usage": {"num_fields": 4, "num_pages": 2, "credits": 10.0},
  "studio_link": "https://studio.reducto.ai/job/..."
}
```
With `citations.enabled: true`, each value is wrapped:

```json
{
  "result": {
    "total": {
      "value": 1250.00,
      "citations": [
        {
          "type": "Table",
          "content": "Total: $1,250.00",
          "bbox": {"left": 0.04, "top": 0.26, "width": 0.45, "height": 0.50, "page": 2},
          "confidence": "high"
        }
      ]
    }
  }
}
```
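Reading cited values then looks like the sketch below (dict access is an assumption; inline SDK results may expose attributes instead):

```python
# Sketch: pull each extracted value plus the evidence behind it.
total = extract_result.result["total"]
print(total["value"])  # 1250.00
for citation in total["citations"]:
    print(citation["content"], "on page", citation["bbox"]["page"])
```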
## Split Parameters

`POST /split`. Divide documents into named sections by page number.

Split runs Parse internally, then uses an LLM to classify pages against your section descriptions.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input` | string | required | File ID, URL, or `jobid://...` |
| `split_description` | array | required | List of `{"name": "...", "description": "..."}` section definitions |
| `split_rules` | string | `"Split the document into the applicable sections..."` | Natural-language rules for splitting behavior |
| `settings.table_cutoff` | `"truncate"` \| `"preserve"` | `"truncate"` | `truncate`: first rows only (faster). `preserve`: all content. |
| `parsing` | object | `{}` | All Parse parameters. Ignored if `input` is `jobid://...`. |
## Split Response Shape
```json
{
  "result": {
    "splits": [
      {"name": "Executive Summary", "pages": [1, 2]},
      {"name": "Financial Statements", "pages": [3, 4, 5, 6]},
      {"name": "Risk Factors", "pages": [7, 8, 9]}
    ]
  },
  "job_id": "uuid",
  "usage": {"num_pages": 9, "credits": 6.0}
}
```
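Because Split runs Parse internally, you can parse once and reuse the job to avoid paying for a second parse. A sketch, assuming `jobid://` takes the `job_id` of a prior parse as described in the `input` rows above:

```python
# Sketch: reuse an existing parse job for splitting.
parse_result = client.parse.run(input=upload.file_id)
split_result = client.split.run(
    input=f"jobid://{parse_result.job_id}",  # `parsing` options are ignored here
    split_description=[
        {"name": "Summary", "description": "Executive summary section"},
        {"name": "Financials", "description": "Financial statements and tables"},
    ],
)
```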
## Edit Parameters

`POST /edit`. Fill PDF forms and modify DOCX documents.

Note: Edit uses `document_url` as its input parameter, not `input` like the other endpoints.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `document_url` | string | required | File ID or URL of the document to edit (note: `document_url`, not `input`) |
| `edit_instructions` | string | required | Natural-language instructions. Be explicit: "Fill Name: John Doe, Date: 2024-01-15" |
| `edit_options.color` | string | `"#FF0000"` | Highlight color for edits (DOCX only) |
| `edit_options.enable_overflow_pages` | bool | `false` | Create appendix pages for text exceeding field capacity (PDF only) |
| `form_schema` | array \| null | `null` | Pre-defined field locations for repeatable form filling. Skips detection. |
## Edit Response Shape
```json
{
  "document_url": "https://storage.reducto.ai/filled-form.pdf?...",
  "form_schema": [
    {
      "bbox": {"left": 0.1, "top": 0.2, "width": 0.4, "height": 0.03, "page": 1},
      "description": "Name field",
      "type": "text"
    }
  ],
  "usage": {"num_pages": 2, "credits": 8}
}
```
The `document_url` is a presigned URL valid for 24 hours. Save the returned `form_schema` to reuse for the same form type (skips field detection).
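A sketch of that reuse pattern (passing the saved schema back via a `form_schema` argument is an assumption based on the parameter table above):

```python
# First fill: field detection runs and a form_schema comes back.
first = client.edit.run(
    document_url=blank_form.file_id,  # hypothetical upload of the blank form
    edit_instructions="Fill Name: John Doe, Date: 2024-01-15",
)
saved_schema = first.form_schema  # persist this per form type

# Later fills of the same form type skip detection.
second = client.edit.run(
    document_url=another_form.file_id,  # hypothetical second upload
    edit_instructions="Fill Name: Jane Roe, Date: 2024-02-01",
    form_schema=saved_schema,
)
```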
## Classify Parameters

`POST /classify`. Categorize a document before processing.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input` | string | required | File ID or URL |
| `classification_schema` | array | required | List of `{"category": "...", "criteria": ["...", "..."]}` |
| `page_range` | object \| null | `null` | Pages to use for classification context. Defaults to the first 5 pages. Max 10 pages. |
| `document_metadata` | string \| null | `null` | Optional metadata to include in the classification prompt |
## Classify Response Shape

```json
{
  "result": {
    "category": "invoice"
  },
  "job_id": "uuid",
  "duration": 1.23
}
```
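Classification pairs naturally with Extract for routing. A sketch; the `schemas` mapping is hypothetical, and attribute access on the classify result is an assumption:

```python
# Sketch: route to a category-specific extraction schema.
schemas = {  # hypothetical mapping you define per category
    "invoice": {"type": "object", "properties": {"total": {"type": "number"}}},
    "contract": {"type": "object", "properties": {"effective_date": {"type": "string"}}},
}
category = classify_result.result.category
extract_result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schemas[category]},
)
```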
## Async Processing
Most endpoints have async variants (`/parse_async`, `/extract_async`, `/split_async`, `/edit_async`). Classify is synchronous only. Async endpoints return a `job_id` immediately and process in the background.
```python
import time

# Submit an async job
job = client.parse.run_job(input=upload.file_id)
print(job.job_id)

# Poll for results
while True:
    result = client.job.get(job.job_id)
    if result.status in ("Completed", "Failed"):
        break
    time.sleep(2)
```
Configure webhooks for push-based delivery instead of polling.
## Error Codes
| HTTP Status | Meaning | Common cause |
|---|---|---|
| 401 | Unauthorized | Missing or invalid `REDUCTO_API_KEY` |
| 422 | Validation error | Invalid parameters, schema too large, or constraint violation |
| 429 | Rate limited | Too many concurrent requests. Retry with backoff. |
| 500 | Server error | Transient issue. Retry with backoff. |
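For 429/500 responses, a simple exponential backoff over the REST API might look like this sketch (the retry policy and timeout values are illustrative, not official guidance):

```python
import os
import time

import requests

def parse_with_retry(payload: dict, max_attempts: int = 5) -> dict:
    """POST /parse, retrying rate limits (429) and server errors (500) with backoff."""
    for attempt in range(max_attempts):
        resp = requests.post(
            "https://platform.reducto.ai/parse",
            headers={"Authorization": f"Bearer {os.environ['REDUCTO_API_KEY']}"},
            json=payload,
            timeout=300,
        )
        if resp.status_code not in (429, 500):
            resp.raise_for_status()  # surface 401/422 immediately
            return resp.json()
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("Reducto request failed after retries")

result = parse_with_retry({"input": "https://example.com/doc.pdf"})
```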
## Useful Links