Extract returns specific fields from documents as structured JSON. You define a schema with the fields you need, and Extract returns values matching that schema. Under the hood, Extract runs Parse to process the document (OCR, layout detection, table parsing), then uses an LLM to locate and extract the specified fields.

Parse vs Extract

Both endpoints process documents, but they answer different questions.

Parse answers: “What’s in this document?” It returns all content as structured chunks with positions and types. Use Parse for RAG pipelines, document viewers, or when you need to feed full content to an LLM.

Extract answers: “What is the value of X?” It returns only the specific fields you request. Extract runs Parse internally, then uses AI to pull out values matching your schema.

The key insight is that Extract can only return what Parse sees. If a value doesn’t appear in the Parse output (perhaps due to OCR issues or a table format problem), no amount of schema tweaking will extract it.
When debugging extraction issues, always verify the data exists in the Parse result first.
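A quick way to do that is to run Parse on its own and search its output for the value you expect Extract to find. A minimal sketch, assuming the SDK response objects are pydantic-style models that expose model_dump_json (the exact response shape may differ by SDK version):
from pathlib import Path
from reducto import Reducto

client = Reducto()
upload = client.upload(file=Path("statement.pdf"))

# Run Parse by itself before worrying about the extraction schema
parsed = client.parse.run(input=upload.file_id)

# Serialize the whole response rather than assuming a specific chunk layout,
# then search it for the value you expect Extract to return
raw = parsed.model_dump_json()
print("23,278.62" in raw)  # False means Extract cannot see this value either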

Quick Start

Given this investment statement, we’ll extract the portfolio value change, total income, and top holdings:
from pathlib import Path
from reducto import Reducto

client = Reducto()
upload = client.upload(file=Path("statement.pdf"))

result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "portfolio_increase": {
                    "type": "number",
                    "description": "Increase in total portfolio value"
                },
                "total_income_ytd": {
                    "type": "number",
                    "description": "Total income year-to-date"
                },
                "top_holdings": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Names of top holdings"
                }
            }
        },
        "system_prompt": "Extract financial data from this investment statement."
    }
)

print(result.result)
What this does:
  1. Upload the PDF to get a file_id
  2. Call /extract with the file reference and a JSON schema defining the three fields you want
  3. Get back JSON with exactly those fields populated from the document
{
  "result": [
    {
      "portfolio_increase": 21000.37,
      "total_income_ytd": 23278.62,
      "top_holdings": [
        "Johnson & Johnson (JNJ)",
        "Apple Inc (AAPL)",
        "NH Portfolio 2015 Delphi",
        "Corp Jr Sb Nt Slm Corp",
        "Spi Lkd Nt (OSM)"
      ]
    }
  ],
  "job_id": "9531166f-9725-4854-8096-459785a33972",
  "usage": {"num_fields": 7, "num_pages": 3, "credits": 10.0},
  "studio_link": "https://studio.reducto.ai/job/9531166f-..."
}
The result is an array containing objects matching your schema. When you enable citations, the response format changes to wrap each value with its source location.
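For the schema-based quick start above, that means reading fields off the first object in the list. A small sketch, assuming the SDK returns plain dicts inside result:
data = result.result[0]  # one object matching the schema

print(data["portfolio_increase"])   # 21000.37
print(data["total_income_ytd"])     # 23278.62
for holding in data["top_holdings"]:
    print(holding)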

Response Format Details

Full breakdown of result structure, citations, and usage fields.

Request Parameters

result = client.extract.run(
    input="...",                    # Required: file_id, jobid://, or URL
    instructions={
        "schema": {...},            # JSON schema defining fields to extract
        "system_prompt": "..."      # Context for the LLM about the document
    },
    settings={
        "array_extract": False,     # Segment document for long arrays
        "citations": {
            "enabled": False,       # Return source locations
            "numerical_confidence": True
        },
        "include_images": False,
        "optimize_for_latency": False
    },
    parsing={...}                   # Parse options (ignored if using jobid://)
)

input (required)

The document to process. Accepts several formats:
  • Upload response (reducto://abc123): local files uploaded via /upload
  • Public URL (https://example.com/doc.pdf): publicly accessible documents
  • Presigned URL (https://bucket.s3.../doc.pdf?X-Amz-...): files in your cloud storage
  • Job ID (jobid://7600c8c5-...): reuse a previous Parse result
  • Job ID list (["jobid://...", "jobid://..."]): combine multiple parsed documents
Using jobid:// skips the parsing step entirely, which is useful when you want to try different extraction schemas on the same document without re-parsing, or when combining data from multiple documents into a single extraction.
# Combine multiple parsed documents
result = client.extract.run(
    input=["jobid://job-1", "jobid://job-2", "jobid://job-3"],
    instructions={"schema": schema}
)
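Reusing a single Parse job works the same way: parse once, then run Extract against the job ID with as many different schemas as you like. A sketch, assuming the Parse response exposes a job_id the way the Extract response does; summary_schema and holdings_schema are placeholders for your own schemas:
# Parse once, then try different extraction schemas against the same job
parsed = client.parse.run(input=upload.file_id)
job_ref = f"jobid://{parsed.job_id}"

summary = client.extract.run(input=job_ref, instructions={"schema": summary_schema})
holdings = client.extract.run(input=job_ref, instructions={"schema": holdings_schema})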

instructions

  • schema: JSON schema defining the target fields and types. Field names and descriptions directly influence extraction quality because the LLM uses them to locate values. A field called invoice_total with the description "The total amount due, typically at the bottom of the invoice" performs better than a generic total field.
  • system_prompt: Document-level context. Describe what kind of document this is or highlight edge cases. Field-specific instructions belong in schema descriptions, not here.
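Putting that advice into a schema, a descriptive field name plus a location hint in the description gives the model more to anchor on than a bare total. A sketch; the invoice fields are illustrative:
schema = {
    "type": "object",
    "properties": {
        # Descriptive name plus a hint about where the value usually appears
        "invoice_total": {
            "type": "number",
            "description": "The total amount due, typically at the bottom of the invoice",
        },
        # A generic field like "total" with no description gives the model far less to work with
    },
}

result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "This is a vendor invoice.",  # document-level context only
    },
)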

settings

  • array_extract (default: false): For documents with repeating data (line items, transactions). Segments the document, extracts from each segment, and merges results. Required when you need complete arrays from long documents.
  • citations.enabled (default: false): Return the page number, bounding box, and source text for each extracted value. Useful for verification and debugging.
  • citations.numerical_confidence (default: true): When citations are enabled, include a 0-1 confidence score instead of just “high”/“low”.
  • include_images (default: false): Include page images in the extraction context. Can help with visually complex documents but increases cost.
  • optimize_for_latency (default: false): Prioritize speed at 2x credit cost. Jobs get higher priority in the processing queue.
Citations cannot be used with chunking. If you enable settings.citations.enabled, the parsing step automatically disables chunking. This is because citations require knowing exactly where each piece of content came from, which chunking obscures.

parsing

Since Extract runs Parse internally, you can configure how parsing works. These options are ignored if your input is a jobid:// reference. Common options:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    parsing={
        "enhance": {
            "agentic": [{"scope": "table"}]  # LLM correction for tables
        },
        "formatting": {
            "table_output_format": "html"     # Better for complex tables
        },
        "settings": {
            "page_range": {"start": 1, "end": 10},  # Process specific pages
            "document_password": "secret"            # For encrypted PDFs
        }
    }
)

Parse Configuration

All available parsing options.

Schema vs Schemaless

Extract supports two modes of operation: schema-based extraction (the default) and schemaless extraction. Schema-based extraction is what most users need. You define a JSON schema specifying exactly which fields to extract and their types. The model returns data matching your schema structure. This gives you predictable, typed output that integrates cleanly with your application code.
# Schema-based: you define the exact structure
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "total": {"type": "number"}
            }
        }
    }
)
Schemaless extraction lets the model decide what to extract based on a natural language prompt. Instead of providing a schema, you describe what you want in plain English. The model analyzes the document and returns whatever it deems relevant. This is useful for exploration or when you don’t know the document structure in advance.
# Schemaless: the model decides what to extract
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "system_prompt": "Extract all the key financial information from this invoice"
    }
)
Use schema-based extraction for production workflows where you need consistent output structure. Use schemaless extraction when exploring new document types or building prototypes.

Schema Best Practices

Detailed guidance on schema design, naming conventions, and descriptions.

Array Extraction

Standard extraction works well for short documents, but for documents with many repeating items (hundreds of transactions, long invoice line items), you need array extraction. The problem: LLMs have context limits. When a document is too long, items toward the end may be truncated or missed. Array extraction solves this by segmenting the document, extracting from each segment, and merging the results.
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "transactions": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "date": {"type": "string"},
                            "description": {"type": "string"},
                            "amount": {"type": "number"}
                        }
                    }
                }
            }
        }
    },
    settings={"array_extract": True}
)
Array extraction requires at least one top-level property of type array in your schema. If your schema has no arrays, the endpoint returns an error.
For truly critical arrays where you cannot afford to miss any items, consider Agent-in-the-Loop, which uses an AI agent to iteratively verify completeness at the cost of additional latency.

Array Extraction Guide

Detailed configuration and algorithm options.

Citations

Citations link each extracted value back to its source location in the document. Enable them when you need to verify extractions or show users where values came from.
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={
        "citations": {
            "enabled": True
        }
    }
)

# With citations enabled, result is a dict with wrapped values
field = result.result["total_amount"]
print(f"Value: {field.value}")
print(f"Found on page {field.citations[0].bbox.page}")
print(f"Confidence: {field.citations[0].confidence}")
When citations are enabled, the response format changes. Each value is wrapped in an object containing value and citations:
{
  "result": {
    "total_amount": {
      "value": 23278.62,
      "citations": [
        {
          "type": "Table",
          "content": "Total: $23,278.62",
          "bbox": {"left": 0.04, "top": 0.26, "width": 0.45, "height": 0.50, "page": 3},
          "confidence": "high"
        }
      ]
    }
  }
}
Each citation includes:
  • Page number where the value was found
  • Bounding box coordinates (normalized 0-1)
  • Confidence as "high" or "low"
  • Source text: the original text the value was extracted from
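A common use of this metadata is flagging values for human review. A minimal sketch, assuming citations are enabled and result.result is the dict of wrapped values shown above:
# Flag extracted fields whose first citation is not high confidence
for name, field in result.result.items():
    citation = field.citations[0] if field.citations else None
    confidence = citation.confidence if citation else None
    if confidence != "high":
        page = citation.bbox.page if citation else "?"
        print(f"Review {name}: value={field.value} confidence={confidence} page={page}")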

Citations Guide

Working with bounding boxes and confidence scores.

Troubleshooting

Results vary between identical runs: LLM outputs are inherently non-deterministic, so small variations are normal. To reduce variance:
  1. Use enums to constrain possible values
  2. Make field descriptions more specific
  3. Add examples in your system prompt
If you need identical outputs for identical inputs, consider caching results by document hash.
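A minimal caching sketch, keyed on a hash of the document bytes plus the schema so a change to either invalidates the entry (the in-memory dict is a stand-in for whatever store you actually use):
import hashlib
import json
from pathlib import Path

extract_cache: dict[str, dict] = {}

def extract_cached(path: Path, schema: dict) -> dict:
    # Key the cache on document bytes + schema
    key_material = path.read_bytes() + json.dumps(schema, sort_keys=True).encode()
    key = hashlib.sha256(key_material).hexdigest()
    if key not in extract_cache:
        upload = client.upload(file=path)
        result = client.extract.run(input=upload.file_id, instructions={"schema": schema})
        extract_cache[key] = result.result[0]
    return extract_cache[key]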
Items are missing from long documents: this typically happens with long documents containing arrays. Enable array_extract to process the full document:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={"array_extract": True}
)
You can also add guidance in your system prompt: “Process all pages in the document, not just the beginning.”
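Combining the two, a sketch of the same call with the prompt guidance added:
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "Process all pages in the document, not just the beginning.",
    },
    settings={"array_extract": True},
)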
When expected fields come back empty:
  1. Check the Parse output first. Extract can only find what Parse sees. Run client.parse.run(input=upload.file_id) and verify the value appears in the content.
  2. If it’s in Parse output, refine your schema. Add better field descriptions that match how the value appears in the document.
  3. If it’s not in Parse output, adjust your parsing configuration. Try enabling agentic mode for tables, or changing the table output format to HTML.
For long arrays, also try enabling array_extract.
Extract returns only what’s on the document. If you request calculated fields (like “annual cost” when only a monthly figure appears), the model may fabricate values. The solution is to extract the raw values and compute derived ones in your code:
monthly_cost = result.result["monthly_cost"].value
annual_cost = monthly_cost * 12  # Compute yourself
Enable citations to verify source locations for any suspicious values.
Very large schemas may exceed LLM token limits and fail with a 422 error. Solutions:
  1. Flatten deeply nested structures
  2. Remove unnecessary fields
  3. Split into multiple extraction calls
As a rule of thumb, keep schemas under 50 fields. If you need more, consider breaking the extraction into logical groups.
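For example, an oversized invoice schema could be split into a header group and a line-item group, extracted in two calls against the same Parse job, and merged in code. A sketch; the field names are illustrative and it assumes the Parse response exposes a job_id:
# Split an oversized schema into logical groups and run two smaller extractions
header_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total": {"type": "number"},
    },
}
line_item_schema = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {
                "description": {"type": "string"},
                "amount": {"type": "number"},
            }},
        },
    },
}

parsed = client.parse.run(input=upload.file_id)
job_ref = f"jobid://{parsed.job_id}"  # reuse one Parse job so the document is parsed only once

header = client.extract.run(input=job_ref, instructions={"schema": header_schema})
items = client.extract.run(
    input=job_ref,
    instructions={"schema": line_item_schema},
    settings={"array_extract": True},
)

combined = {**header.result[0], **items.result[0]}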
If you see “Citations and chunking cannot be enabled at the same time”, you have conflicting options. When citations are enabled, chunking is automatically disabled in the parsing step. If you’re explicitly setting chunking options in parsing.retrieval.chunking, either remove them or disable citations.
For password-protected PDFs, pass the document password in parsing settings:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    parsing={
        "settings": {"document_password": "your-password"}
    }
)

Next Steps