Extract returns specific fields from documents as structured JSON. You define a schema with the fields you need, and Extract returns values matching that schema. Under the hood, Extract runs Parse to process the document (OCR, layout detection, table parsing), then uses an LLM to locate and extract the specified fields.

Parse vs Extract

Both endpoints process documents, but they answer different questions.

Parse answers: “What’s in this document?” It returns all content as structured chunks with positions and types. Use Parse for RAG pipelines, document viewers, or when you need to feed full content to an LLM.

Extract answers: “What is the value of X?” It returns only the specific fields you request. Extract runs Parse internally, then uses AI to pull out values matching your schema.

The key insight is that Extract can only return what Parse sees. If a value doesn’t appear in the Parse output (perhaps due to OCR issues or a table format problem), no amount of schema tweaking will extract it.
When debugging extraction issues, always verify the data exists in the Parse result first.
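A quick way to do that is to run Parse on its own and search its output for the value you expect Extract to find. A minimal sketch, assuming the SDK response objects are pydantic-style models that expose model_dump_json (the exact response shape may differ by SDK version):
from pathlib import Path
from reducto import Reducto

client = Reducto()
upload = client.upload(file=Path("statement.pdf"))

# Run Parse by itself before worrying about the extraction schema
parsed = client.parse.run(input=upload.file_id)

# Serialize the whole response rather than assuming a specific chunk layout,
# then search it for the value you expect Extract to return
raw = parsed.model_dump_json()
print("23,278.62" in raw)  # False means Extract cannot see this value either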

Quick Start

Given this investment statement, we’ll extract the portfolio value change, total income, and top holdings:
from pathlib import Path
from reducto import Reducto

client = Reducto()
upload = client.upload(file=Path("statement.pdf"))

result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "portfolio_increase": {
                    "type": "number",
                    "description": "Increase in total portfolio value"
                },
                "total_income_ytd": {
                    "type": "number",
                    "description": "Total income year-to-date"
                },
                "top_holdings": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Names of top holdings"
                }
            }
        },
        "system_prompt": "Extract financial data from this investment statement."
    }
)

print(result.result)
What this does:
  1. Upload the PDF to get a file_id
  2. Call /extract with the file reference and a JSON schema defining the three fields you want
  3. Get back JSON with exactly those fields populated from the document
{
  "result": [
    {
      "portfolio_increase": 21000.37,
      "total_income_ytd": 23278.62,
      "top_holdings": [
        "Johnson & Johnson (JNJ)",
        "Apple Inc (AAPL)",
        "NH Portfolio 2015 Delphi",
        "Corp Jr Sb Nt Slm Corp",
        "Spi Lkd Nt (OSM)"
      ]
    }
  ],
  "job_id": "9531166f-9725-4854-8096-459785a33972",
  "usage": {"num_fields": 7, "num_pages": 3, "credits": 10.0},
  "studio_link": "https://studio.reducto.ai/job/9531166f-..."
}
The result is an array containing objects matching your schema. When you enable citations, the response format changes to wrap each value with its source location.
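For the schema-based quick start above, that means reading fields off the first object in the list. A small sketch, assuming the SDK returns plain dicts inside result:
data = result.result[0]  # one object matching the schema

print(data["portfolio_increase"])   # 21000.37
print(data["total_income_ytd"])     # 23278.62
for holding in data["top_holdings"]:
    print(holding)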

Response Format Details

Full breakdown of result structure, citations, and usage fields.

Request Parameters

result = client.extract.run(
    input="...",                    # Required: file_id, jobid://, or URL
    instructions={
        "schema": {...},            # JSON schema defining fields to extract
        "system_prompt": "..."      # Context for the LLM about the document
    },
    settings={
        "array_extract": False,     # Segment document for long arrays
        "citations": {
            "enabled": False,       # Return source locations
            "numerical_confidence": True
        },
        "include_images": False,
        "optimize_for_latency": False
    },
    parsing={...}                   # Parse options (ignored if using jobid://)
)

input (required)

The document to process. Accepts several formats:
  • Upload response (reducto://abc123): local files uploaded via /upload
  • Public URL (https://example.com/doc.pdf): publicly accessible documents
  • Presigned URL (https://bucket.s3.../doc.pdf?X-Amz-...): files in your cloud storage
  • Job ID (jobid://7600c8c5-...): reuse a previous Parse result
  • Job ID list (["jobid://...", "jobid://..."]): combine multiple parsed documents
Using jobid:// skips the parsing step entirely, which is useful when you want to try different extraction schemas on the same document without re-parsing, or when combining data from multiple documents into a single extraction.
# Combine multiple parsed documents
result = client.extract.run(
    input=["jobid://job-1", "jobid://job-2", "jobid://job-3"],
    instructions={"schema": schema}
)
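Reusing a single Parse job works the same way: parse once, then run Extract against the job ID with as many different schemas as you like. A sketch, assuming the Parse response exposes a job_id the way the Extract response does; summary_schema and holdings_schema are placeholders for your own schemas:
# Parse once, then try different extraction schemas against the same job
parsed = client.parse.run(input=upload.file_id)
job_ref = f"jobid://{parsed.job_id}"

summary = client.extract.run(input=job_ref, instructions={"schema": summary_schema})
holdings = client.extract.run(input=job_ref, instructions={"schema": holdings_schema})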

instructions

  • schema: JSON schema defining the target fields and types. Field names and descriptions directly influence extraction quality because the LLM uses them to locate values. A field called invoice_total with the description "The total amount due, typically at the bottom of the invoice" performs better than a generic total field.
  • system_prompt: Document-level context. Describe what kind of document this is or highlight edge cases. Field-specific instructions belong in schema descriptions, not here.
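Putting that advice into a schema, a descriptive field name plus a location hint in the description gives the model more to anchor on than a bare total. A sketch; the invoice fields are illustrative:
schema = {
    "type": "object",
    "properties": {
        # Descriptive name plus a hint about where the value usually appears
        "invoice_total": {
            "type": "number",
            "description": "The total amount due, typically at the bottom of the invoice",
        },
        # A generic field like "total" with no description gives the model far less to work with
    },
}

result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "This is a vendor invoice.",  # document-level context only
    },
)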

settings

  • array_extract (default: false): For documents with repeating data (line items, transactions). Segments the document, extracts from each segment, and merges results. Required when you need complete arrays from long documents.
  • citations.enabled (default: false): Return the page number, bounding box, and source text for each extracted value. Useful for verification and debugging.
  • citations.numerical_confidence (default: true): When citations are enabled, include a 0-1 confidence score instead of just “high”/“low”.
  • include_images (default: false): Include page images in the extraction context. Can help with visually complex documents but increases cost.
  • optimize_for_latency (default: false): Prioritize speed at 2x credit cost. Jobs get higher priority in the processing queue.
Citations cannot be used with chunking. If you enable settings.citations.enabled, the parsing step automatically disables chunking. This is because citations require knowing exactly where each piece of content came from, which chunking obscures.

parsing

Since Extract runs Parse internally, you can configure how parsing works. These options are ignored if your input is a jobid:// reference. Common options:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    parsing={
        "enhance": {
            "agentic": [{"scope": "table"}]  # LLM correction for tables
        },
        "formatting": {
            "table_output_format": "html"     # Better for complex tables
        },
        "settings": {
            "page_range": {"start": 1, "end": 10},  # Process specific pages
            "document_password": "secret"            # For encrypted PDFs
        }
    }
)

Parse Configuration

All available parsing options.

Schema vs Schemaless

Extract supports two modes of operation: schema-based extraction (the default) and schemaless extraction. Schema-based extraction is what most users need. You define a JSON schema specifying exactly which fields to extract and their types. The model returns data matching your schema structure. This gives you predictable, typed output that integrates cleanly with your application code.
# Schema-based: you define the exact structure
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total": {"type": "number"}
            }
        }
    }
)
Schemaless extraction lets the model decide what to extract based on a natural language prompt. Instead of providing a schema, you describe what you want in plain English. The model analyzes the document and returns whatever it deems relevant. This is useful for exploration or when you don’t know the document structure in advance.
# Schemaless: the model decides what to extract
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "system_prompt": "Extract all the key financial information from this invoice"
    }
)
Use schema-based extraction for production workflows where you need consistent output structure. Use schemaless extraction when exploring new document types or building prototypes.

Schema Best Practices

Detailed guidance on schema design, naming conventions, and descriptions.

Array Extraction

Standard extraction works well for short documents, but for documents with many repeating items (hundreds of transactions, long invoice line items), you need array extraction. The problem: LLMs have context limits. When a document is too long, items toward the end may be truncated or missed. Array extraction solves this by segmenting the document, extracting from each segment, and merging the results.
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "transactions": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "date": {"type": "string"},
                            "description": {"type": "string"},
                            "amount": {"type": "number"}
                        }
                    }
                }
            }
        }
    },
    settings={"array_extract": True}
)
Array extraction requires at least one top-level property of type array in your schema. If your schema has no arrays, the endpoint returns an error.
For truly critical arrays where you cannot afford to miss any items, consider Agent-in-the-Loop, which uses an AI agent to iteratively verify completeness at the cost of additional latency.

Array Extraction Guide

Detailed configuration and algorithm options.

Citations

Citations link each extracted value back to its source location in the document. Enable them when you need to verify extractions or show users where values came from.
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={
        "citations": {
            "enabled": True
        }
    }
)

# With citations enabled, result is a dict with wrapped values
field = result.result["total_amount"]
print(f"Value: {field.value}")
print(f"Found on page {field.citations[0].bbox.page}")
print(f"Confidence: {field.citations[0].confidence}")
When citations are enabled, the response format changes. Each value is wrapped in an object containing value and citations:
{
  "result": {
    "total_amount": {
      "value": 23278.62,
      "citations": [
        {
          "type": "Table",
          "content": "Total: $23,278.62",
          "bbox": {"left": 0.04, "top": 0.26, "width": 0.45, "height": 0.50, "page": 3},
          "confidence": "high"
        }
      ]
    }
  }
}
Each citation includes:
  • Page number where the value was found
  • Bounding box coordinates (normalized 0-1)
  • Confidence as "high" or "low"
  • Source text: the original text the value was extracted from
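A common use of this metadata is flagging values for human review. A minimal sketch, assuming citations are enabled and result.result is the dict of wrapped values shown above:
# Flag extracted fields whose first citation is not high confidence
for name, field in result.result.items():
    citation = field.citations[0] if field.citations else None
    confidence = citation.confidence if citation else None
    if confidence != "high":
        page = citation.bbox.page if citation else "?"
        print(f"Review {name}: value={field.value} confidence={confidence} page={page}")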

Citations Guide

Working with bounding boxes and confidence scores.

Troubleshooting

Results vary between identical runs: LLM outputs are inherently non-deterministic, so small variations are normal. To reduce variance:
  1. Use enums to constrain possible values
  2. Make field descriptions more specific
  3. Add examples in your system prompt
If you need identical outputs for identical inputs, consider caching results by document hash.
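A minimal caching sketch, keyed on a hash of the document bytes plus the schema so a change to either invalidates the entry (the in-memory dict is a stand-in for whatever store you actually use):
import hashlib
import json
from pathlib import Path

extract_cache: dict[str, dict] = {}

def extract_cached(path: Path, schema: dict) -> dict:
    # Key the cache on document bytes + schema
    key_material = path.read_bytes() + json.dumps(schema, sort_keys=True).encode()
    key = hashlib.sha256(key_material).hexdigest()
    if key not in extract_cache:
        upload = client.upload(file=path)
        result = client.extract.run(input=upload.file_id, instructions={"schema": schema})
        extract_cache[key] = result.result[0]
    return extract_cache[key]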
Items are missing from long documents: this typically happens with long documents containing arrays. Enable array_extract to process the full document:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={"array_extract": True}
)
You can also add guidance in your system prompt: “Process all pages in the document, not just the beginning.”
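Combining the two, a sketch of the same call with the prompt guidance added:
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "Process all pages in the document, not just the beginning.",
    },
    settings={"array_extract": True},
)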
When expected fields come back empty:
  1. Check the Parse output first. Extract can only find what Parse sees. Run client.parse.run(input=upload.file_id) and verify the value appears in the content.
  2. If it’s in Parse output, refine your schema. Add better field descriptions that match how the value appears in the document.
  3. If it’s not in Parse output, adjust your parsing configuration. Try enabling agentic mode for tables, or changing the table output format to HTML.
For long arrays, also try enabling array_extract.
Extract returns only what’s on the document. If you request calculated fields (like “annual cost” when only a monthly figure appears), the model may fabricate values. The solution is to extract the raw values and compute derived ones in your code:
monthly_cost = result.result["monthly_cost"].value
annual_cost = monthly_cost * 12  # Compute yourself
Enable citations to verify source locations for any suspicious values.
Very large schemas may exceed LLM token limits and fail with a 422 error. Solutions:
  1. Flatten deeply nested structures
  2. Remove unnecessary fields
  3. Split into multiple extraction calls
As a rule of thumb, keep schemas under 50 fields. If you need more, consider breaking the extraction into logical groups.
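For example, an oversized invoice schema could be split into a header group and a line-item group, extracted in two calls against the same Parse job, and merged in code. A sketch; the field names are illustrative and it assumes the Parse response exposes a job_id:
# Split an oversized schema into logical groups and run two smaller extractions
header_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total": {"type": "number"},
    },
}
line_item_schema = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"},
            }},
        },
    },
}

parsed = client.parse.run(input=upload.file_id)
job_ref = f"jobid://{parsed.job_id}"  # reuse one Parse job so the document is parsed only once

header = client.extract.run(input=job_ref, instructions={"schema": header_schema})
items = client.extract.run(
    input=job_ref,
    instructions={"schema": line_item_schema},
    settings={"array_extract": True},
)

combined = {**header.result[0], **items.result[0]}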
If you see “Citations and chunking cannot be enabled at the same time”, you have conflicting options. When citations are enabled, chunking is automatically disabled in the parsing step. If you’re explicitly setting chunking options in parsing.retrieval.chunking, either remove them or disable citations.
For password-protected PDFs, pass the document password in parsing settings:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    parsing={
        "settings": {"document_password": "your-password"}
    }
)

Next Steps