Product Summary
Reducto converts documents (PDFs, images, spreadsheets, DOCX, and 30+ other formats) into structured data via a REST API.

- Base URL: https://platform.reducto.ai
- Auth: Authorization: Bearer $REDUCTO_API_KEY
- SDKs: Python (pip install reductoai), Node.js (npm install reductoai), Go (go get github.com/reductoai/reducto-go-sdk)
- Input: Upload a file via /upload to get a file_id, then pass it to any endpoint. You can also pass public URLs or presigned S3/GCS/Azure URLs directly.
Authentication
- Create a free account at studio.reducto.ai
- In the Studio sidebar, click API Keys, then Create new API key
- Set the key as an environment variable: export REDUCTO_API_KEY=your_key_here
- The Python and Node.js SDK clients read REDUCTO_API_KEY from the environment automatically. For the Go SDK, pass the key explicitly when constructing the client.
Supported File Types
| Category | Formats |
|---|---|
| PDFs | .pdf |
| Documents | .docx, .doc, .dotx, .rtf, .txt, .wpd |
| Spreadsheets | .xlsx, .xlsm, .xls, .xltx, .xltm, .csv, .qpw |
| Presentations | .pptx, .ppt |
| Images | .png, .jpg/.jpeg, .gif, .bmp, .tiff, .heic, .psd |
Which Endpoint Should I Use?
| I want to… | Endpoint | Method | Key config |
|---|---|---|---|
| Get all text, tables, and figures from a document separated into chunks with bounding box coordinates | /parse | POST | enhance.agentic for a stronger model pass to correct mistakes |
| Extract specific fields into JSON using a specific schema | /extract | POST | instructions.schema (JSON Schema) |
| Divide a document into named sections by page range | /split | POST | split_description (section definitions) |
| Fill PDF or DOCX forms | /edit | POST | edit_instructions (natural language) |
| Classify a document’s type before processing | /classify | POST | classification_schema (categories + criteria) |
| Upload a local file for processing | /upload | POST (multipart) | file field |
| Process asynchronously with webhooks | /parse_async, /extract_async, /split_async, /edit_async | POST | webhook URL |
| Check job status or retrieve results | /job/{job_id} | GET | - |
Quick Start (Python)
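A minimal end-to-end call using the REST endpoints documented on this page, stdlib only. The base URL, auth header, and the `input` parameter are taken from this document; the shape of the response beyond that is an assumption, so check the API reference.

```python
# Minimal quick start against the REST API using only the stdlib.
# Endpoint, header, and body parameters come from this page; the
# response field names are assumptions -- check the API reference.
import json
import os
import urllib.request

BASE_URL = "https://platform.reducto.ai"

def build_request(endpoint: str, body: dict, api_key: str) -> urllib.request.Request:
    """Build an authenticated JSON POST for a Reducto endpoint."""
    return urllib.request.Request(
        f"{BASE_URL}{endpoint}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    # Public URLs can be passed as input directly (no /upload needed).
    req = build_request(
        "/parse",
        {"input": "https://example.com/sample.pdf"},
        os.environ["REDUCTO_API_KEY"],
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```

For local files, first POST the file as multipart form data (field name `file`) to /upload and use the returned file_id as `input`.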
SDK Naming Conventions
| SDK | Property names | Install | Notes |
|---|---|---|---|
| Python | snake_case (array_extract, file_id) | pip install reductoai | Client auto-reads REDUCTO_API_KEY env var |
| Node.js | snake_case (array_extract, file_id) | npm install reductoai | All methods return promises, use await |
| Go | PascalCase (ArrayExtract, FileID) | go get github.com/reductoai/reducto-go-sdk | Wrap values with reducto.F(), use shared.UnionString() for document URLs |
| REST | snake_case in JSON body | - | Authorization: Bearer $REDUCTO_API_KEY header |
Parse Parameters
POST /parse. Convert documents into structured JSON with text, tables, and figures.
Core Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | required | File ID (reducto://...), public URL, presigned URL, or jobid://... to reprocess |
enhance group
| Parameter | Type | Default | Description |
|---|---|---|---|
| enhance.agentic | array | [] | List of agentic scopes. Each item has a scope field. |
| enhance.agentic[].scope | "text" \| "table" \| "figure" | - | AI correction scope. text: OCR cleanup for scanned docs. table: fix misaligned columns. figure: chart data extraction. Adds latency and cost. |
| enhance.agentic[].prompt | string \| null | null | Custom prompt for agentic processing |
| enhance.agentic[].advanced_chart_agent | bool | false | Structured chart data extraction (figure scope only) |
| enhance.summarize_figures | bool | true | Generate natural language descriptions of figures for RAG |
| enhance.intelligent_ordering | bool | false | Use vision model to improve reading order accuracy |
retrieval group
| Parameter | Type | Default | Description |
|---|---|---|---|
| retrieval.chunking.chunk_mode | "disabled" \| "variable" \| "section" \| "page" \| "block" \| "page_sections" | "disabled" | disabled: one chunk for entire doc. variable: semantic boundaries (best for RAG). section: split at headers. page: one chunk per page. page_sections: sections within each page. |
| retrieval.chunking.chunk_size | int \| null | null | Target chunk size in characters. Defaults to 250-1500 range in variable mode. |
| retrieval.chunking.chunk_overlap | int | 0 | Characters of overlap between adjacent chunks |
| retrieval.filter_blocks | string[] | [] | Block types to exclude from content/embed. Options: "Header", "Footer", "Title", "Section Header", "Page Number", "List Item", "Figure", "Table", "Key Value", "Text", "Comment", "Signature" |
| retrieval.embedding_optimized | bool | false | Optimize output for embedding models |
formatting group
| Parameter | Type | Default | Description |
|---|---|---|---|
| formatting.table_output_format | "dynamic" \| "html" \| "md" \| "json" \| "csv" \| "jsonbbox" | "dynamic" | dynamic: auto-selects md or html based on complexity. html: best for complex/merged cells. md: simple tables. json: programmatic cell access. |
| formatting.add_page_markers | bool | false | Add page markers to output |
| formatting.merge_tables | bool | false | Merge consecutive tables with same column count |
| formatting.include | string[] | [] | Options: "change_tracking", "highlight", "comments", "hyperlinks", "signatures", "ignore_watermarks" |
spreadsheet group
| Parameter | Type | Default | Description |
|---|---|---|---|
| spreadsheet.split_large_tables.enabled | bool | true | Split large tables into smaller tables |
| spreadsheet.split_large_tables.size | int | 50 | Rows per chunk for split tables |
| spreadsheet.clustering | "accurate" \| "fast" \| "disabled" | "accurate" | Algorithm for splitting sheets into tables. accurate uses more powerful models (5x cost). |
| spreadsheet.include | string[] | [] | Options: "cell_colors", "formula", "dropdowns" |
settings group
| Parameter | Type | Default | Description |
|---|---|---|---|
| settings.page_range | object \| null | null | {"start": 1, "end": 10} (1-indexed). Process specific pages only. |
| settings.return_images | string[] | [] | Return image URLs for block types: "figure", "table", "page" |
| settings.ocr_system | "standard" \| "legacy" | "standard" | standard: best multilingual OCR. legacy: Germanic languages only. |
| settings.extraction_mode | "ocr" \| "hybrid" | "hybrid" | hybrid: combines OCR with embedded PDF text (recommended). ocr: OCR only. |
| settings.persist_results | bool | false | Keep results indefinitely (default: expire after 24h) |
| settings.force_url_result | bool | false | Always return results as a URL |
| settings.timeout | float \| null | null | Custom timeout in seconds |
| settings.document_password | string \| null | null | Password for encrypted documents |
| settings.embed_pdf_metadata | bool | false | Embed OCR metadata into returned PDF |
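The parameter groups above combine into one JSON request body. As an illustration, a RAG-oriented /parse body might look like this (every key is documented above; the specific values are illustrative choices, not defaults):

```python
# A RAG-oriented /parse request body assembled from the parameter
# groups documented above. Values are illustrative.
def rag_parse_body(document_input: str) -> dict:
    return {
        "input": document_input,
        "enhance": {
            # One agentic pass to clean up OCR text; adds latency/cost.
            "agentic": [{"scope": "text"}],
            "summarize_figures": True,
        },
        "retrieval": {
            "chunking": {
                "chunk_mode": "variable",  # semantic boundaries, best for RAG
                "chunk_size": 1000,
                "chunk_overlap": 100,
            },
            # Drop repeating page furniture from chunk content.
            "filter_blocks": ["Header", "Footer", "Page Number"],
        },
        "formatting": {"table_output_format": "md"},
        "settings": {"page_range": {"start": 1, "end": 10}},
    }
```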
Parse Response Shape
When result.type is "url", chunks are not returned inline; fetch them from the URL in the result.
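A small helper for handling both cases, inline and URL-backed. The `url` and `chunks` field names are assumptions about the response shape; verify against the API reference.

```python
# Return parse chunks whether they are inline or behind a URL.
# `result["url"]` and `result["chunks"]` are assumed field names.
import json
import urllib.request

def get_chunks(result: dict) -> list:
    if result.get("type") == "url":
        # Large results come back as a URL; fetch the real payload.
        with urllib.request.urlopen(result["url"]) as resp:
            result = json.load(resp)
    return result.get("chunks", [])
```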
Extract Parameters
POST /extract. Pull specific fields from documents into structured JSON using a schema.
Extract runs Parse internally. If a value doesn’t appear in the Parse output, Extract cannot extract it.
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string \| string[] | required | File ID, URL, jobid://..., or array of job IDs to combine |
| instructions.schema | object | {} | JSON Schema defining fields to extract. Field names and descriptions directly influence accuracy. |
| instructions.system_prompt | string | "Be precise and thorough." | Document-level context for the LLM |
| settings.array_extract | bool | false | Segment document for long arrays. Required when schema has array fields in long documents. Schema must have at least one top-level array property. |
| settings.deep_extract | bool | false | Agentic mode that iteratively refines output for near-perfect accuracy. Higher cost/latency. |
| settings.citations.enabled | bool | false | Return source page, bbox, and text for each value. Mutually exclusive with chunking. |
| settings.citations.numerical_confidence | bool | true | Include 0-1 confidence scores (vs "high"/"low") |
| settings.include_images | bool | false | Include page images in extraction context |
| settings.optimize_for_latency | bool | false | Higher priority processing at 2x cost |
| parsing | object | {} | All Parse parameters (see above). Ignored if input is jobid://. |
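For instance, an /extract request for invoice fields might look like this. The parameters are those documented above; the schema fields themselves (invoice_number, total, line_items) are illustrative:

```python
# An /extract request body using instructions.schema (standard JSON
# Schema). Field names/descriptions are illustrative examples.
def invoice_extract_body(document_input: str) -> dict:
    return {
        "input": document_input,
        "instructions": {
            "schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {
                        "type": "string",
                        "description": "The invoice's unique identifier",
                    },
                    "total": {
                        "type": "number",
                        "description": "Grand total including tax",
                    },
                    "line_items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "amount": {"type": "number"},
                            },
                        },
                    },
                },
            },
            "system_prompt": "Be precise and thorough.",
        },
        # array_extract requires at least one top-level array property,
        # which line_items satisfies.
        "settings": {"array_extract": True},
    }
```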
Extract Response Shape
With citations.enabled: true, each extracted value is wrapped in a citation object that includes the source page, bounding box, and text.
Split Parameters
POST /split. Divide documents into named sections by page number.
Split runs Parse internally, then uses an LLM to classify pages against your section descriptions.
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | required | File ID, URL, or jobid://... |
| split_description | array | required | List of {"name": "...", "description": "..."} section definitions |
| split_rules | string | "Split the document into the applicable sections..." | Natural language rules for splitting behavior |
| settings.table_cutoff | "truncate" \| "preserve" | "truncate" | truncate: first rows only (faster). preserve: all content. |
| parsing | object | {} | All Parse parameters. Ignored if input is jobid://. |
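As an example, a /split request that divides a loan package into named sections using split_description as documented above (the section names and descriptions are illustrative):

```python
# A /split request body with illustrative section definitions.
def loan_split_body(document_input: str) -> dict:
    return {
        "input": document_input,
        "split_description": [
            {"name": "application",
             "description": "The borrower's loan application form"},
            {"name": "credit_report",
             "description": "Credit bureau report pages"},
            {"name": "appraisal",
             "description": "Property appraisal and valuation"},
        ],
        # Optional natural language rules for splitting behavior.
        "split_rules": "Every page belongs to exactly one section.",
    }
```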
Split Response Shape
Edit Parameters
POST /edit. Fill PDF forms and modify DOCX documents.
Note: Edit uses document_url as its input parameter, not input like other endpoints.
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_url | string | required | File ID or URL of the document to edit (this is document_url, not input) |
| edit_instructions | string | required | Natural language instructions. Be explicit: "Fill Name: John Doe, Date: 2024-01-15" |
| edit_options.color | string | "#FF0000" | Highlight color for edits (DOCX only) |
| edit_options.enable_overflow_pages | bool | false | Create appendix pages for text exceeding field capacity (PDF only) |
| form_schema | array \| null | null | Pre-defined field locations for repeatable form filling. Skips detection. |
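A sketch of an /edit request body built from the parameters above; note the document_url key, which differs from the input key used by the other endpoints. The instruction text is illustrative:

```python
# An /edit request body. The input parameter here is document_url,
# not input as on the other endpoints.
def fill_form_body(document_url: str) -> dict:
    return {
        "document_url": document_url,
        # Be explicit about field names and values.
        "edit_instructions": (
            "Fill Name: John Doe, Date: 2024-01-15, "
            "and check the 'Individual' box."
        ),
        # Overflow text goes to appendix pages (PDF only).
        "edit_options": {"enable_overflow_pages": True},
    }
```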
Edit Response Shape
The returned document_url is a presigned URL valid for 24 hours. Save the returned form_schema and pass it back for the same form type to skip field detection.
Classify Parameters
POST /classify. Categorize a document before processing.
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | required | File ID or URL |
| classification_schema | array | required | List of {"category": "...", "criteria": ["...", "..."]} |
| page_range | object \| null | null | Pages to use for classification context. Defaults to first 5 pages. Max 10 pages. |
| document_metadata | string \| null | null | Optional metadata to include in classification prompt |
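A /classify request body with two candidate categories, using classification_schema as documented above (the categories and criteria are illustrative):

```python
# A /classify request body with illustrative categories and criteria.
def classify_body(document_input: str) -> dict:
    return {
        "input": document_input,
        "classification_schema": [
            {"category": "invoice",
             "criteria": ["Contains line items and a total amount due",
                          "Has an invoice number"]},
            {"category": "contract",
             "criteria": ["Contains signature blocks",
                          "Uses legal language about parties and terms"]},
        ],
        # Classify from the first 3 pages only.
        "page_range": {"start": 1, "end": 3},
    }
```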
Classify Response Shape
Async Processing
Most endpoints have async variants (/parse_async, /extract_async, /split_async, /edit_async). Classify is synchronous only. Async endpoints return a job_id immediately and process in the background.
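A minimal polling loop against the documented GET /job/{job_id} endpoint. The job status strings are assumptions about the response shape; adjust them to the API reference (or use a webhook URL instead of polling):

```python
# Poll GET /job/{job_id} until an async job finishes. The status
# values ("Pending", "Processing") are assumed -- verify them.
import json
import time
import urllib.request

BASE_URL = "https://platform.reducto.ai"

def job_url(job_id: str) -> str:
    """URL for the documented GET /job/{job_id} endpoint."""
    return f"{BASE_URL}/job/{job_id}"

def poll_job(job_id: str, api_key: str, interval: float = 2.0) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        req = urllib.request.Request(job_url(job_id), headers=headers)
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if job.get("status") not in ("Pending", "Processing"):
            return job  # completed or failed
        time.sleep(interval)
```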
Error Codes
| HTTP Status | Meaning | Common cause |
|---|---|---|
| 401 | Unauthorized | Missing or invalid REDUCTO_API_KEY |
| 422 | Validation error | Invalid parameters, schema too large, or constraint violation |
| 429 | Rate limited | Too many concurrent requests. Retry with backoff. |
| 500 | Server error | Transient issue. Retry with backoff. |
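Since both 429 and 500 are retryable per the table above, a simple exponential-backoff wrapper (a sketch; tune the retry count and base delay for your workload):

```python
# Retry 429/500 responses with exponential backoff, per the error
# table above. Other status codes are raised immediately.
import time
import urllib.error
import urllib.request

def backoff_delay(attempt: int, base: float = 1.0) -> float:
    """Exponential backoff: 1s, 2s, 4s, ..."""
    return base * (2 ** attempt)

def fetch_with_backoff(req, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code in (429, 500) and attempt < max_retries - 1:
                time.sleep(backoff_delay(attempt))
                continue
            raise
```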
Useful Links
- API Reference: Full OpenAPI spec with request/response details
- Parse Configuration: All configuration options
- Cookbooks: End-to-end tutorials (invoice extraction, form filling, RAG)
- Error Codes: Complete error catalog
- Rate Limits: Request limits and quotas
- Credit Usage: How credits are calculated