Extract

The extract.run() method pulls specific fields from documents as structured JSON. You define a JSON schema with the fields you need, and Extract returns values matching that schema.

Basic Usage

from pathlib import Path
from reducto import Reducto

client = Reducto()

# Upload
upload = client.upload(file=Path("invoice.pdf"))

# Extract with schema
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {
                    "type": "string",
                    "description": "The invoice number, typically at the top"
                },
                "total": {
                    "type": "number",
                    "description": "The total amount due"
                },
                "date": {
                    "type": "string",
                    "description": "Invoice date"
                }
            }
        }
    }
)

# Access extracted values
print(result.result[0]["invoice_number"])
print(result.result[0]["total"])

Method Signature

def extract.run(
    input: str | list[str],
    instructions: dict | None = None,
    settings: dict | None = None,
    parsing: dict | None = None
) -> ExtractResponse

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `input` | `str \| list[str]` | Yes | File ID, URL, or `jobid://` reference(s) |
| `instructions` | `dict \| None` | No | Schema and/or system prompt for extraction |
| `settings` | `dict \| None` | No | Extraction settings (citations, array extraction, images) |
| `parsing` | `dict \| None` | No | Parse configuration (used if input is not `jobid://`) |

Settings Options

| Setting | Type | Default | Description |
|---|---|---|---|
| `array_extract` | `bool` | `false` | Enable array extraction for repeating data |
| `citations.enabled` | `bool` | `false` | Include source citations in results |
| `citations.numerical_confidence` | `bool` | `true` | Use numeric confidence scores (0-1) |
| `include_images` | `bool` | `false` | Include images in the extraction context |
| `optimize_for_latency` | `bool` | `false` | Prioritize speed over cost |
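
The table lists nested settings with dotted names such as `citations.enabled`, while the examples on this page pass nested dicts. As a minimal sketch of that mapping, the hypothetical helper `nest_settings` (not part of the SDK) expands dotted keys into the nested form:

```python
def nest_settings(flat: dict) -> dict:
    """Expand dotted keys such as 'citations.enabled' into nested dicts."""
    nested: dict = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create the intermediate dict on first use, then descend into it
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


settings = nest_settings({
    "array_extract": True,
    "citations.enabled": True,
    "citations.numerical_confidence": False,
})
print(settings)
# {'array_extract': True, 'citations': {'enabled': True, 'numerical_confidence': False}}
```

The resulting dict has the same shape as the `settings` argument used in the examples on this page.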

Schema Definition

The instructions parameter takes a schema field containing a JSON schema (the schema can be omitted; see Schemaless Extraction):

schema = {
    "type": "object",
    "properties": {
        "field_name": {
            "type": "string",  # or "number", "boolean", "array", "object"
            "description": "Clear description of what to extract"
        }
    }
}

result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema}
)

Field Descriptions

Field descriptions are critical for accurate extraction. Be specific:

# Good: Specific description
{
    "invoice_total": {
        "type": "number",
        "description": "The total amount due, typically at the bottom of the invoice in a 'Total' or 'Amount Due' section"
    }
}

# Bad: Vague description
{
    "total": {
        "type": "number",
        "description": "Total"
    }
}
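
One way to catch vague descriptions before sending a request is a quick local check. The sketch below is an illustration only: `vague_fields` is a hypothetical helper, and the four-word threshold is an arbitrary heuristic, not a rule from the SDK.

```python
def vague_fields(schema: dict, min_words: int = 4) -> list[str]:
    """Return names of top-level properties whose description is missing or very short."""
    flagged = []
    for name, spec in schema.get("properties", {}).items():
        description = spec.get("description", "")
        if len(description.split()) < min_words:
            flagged.append(name)
    return flagged


schema = {
    "type": "object",
    "properties": {
        "invoice_total": {
            "type": "number",
            "description": "The total amount due, typically at the bottom of the invoice",
        },
        "total": {"type": "number", "description": "Total"},
    },
}

print(vague_fields(schema))  # ['total']
```

A word count is only a rough proxy for specificity; a long description can still be unhelpful if it does not say where or how the value appears.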

System Prompt

Add document-level context with system_prompt:

result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "This is a medical invoice. Extract billing codes and patient information."
    }
)

Input Options

Extract accepts multiple input formats:

# From upload
result = client.extract.run(input=upload.file_id, instructions={...})

# Public URL
result = client.extract.run(input="https://example.com/invoice.pdf", instructions={...})

# Reprocess previous parse job
result = client.extract.run(input="jobid://7600c8c5-...", instructions={...})

# Combine multiple parsed documents
result = client.extract.run(
    input=["jobid://job-1", "jobid://job-2", "jobid://job-3"],
    instructions={...}
)

Array Extraction

For documents with repeating data (line items, transactions), enable array extraction:

result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "price": {"type": "number"}
                        }
                    }
                }
            }
        }
    },
    settings={
        "array_extract": True
    }
)

See the Array Extraction Guide for a detailed guide to array extraction configuration.

Citations

Enable citations to get source locations for each extracted value:

result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={
        "citations": {
            "enabled": True,
            "numerical_confidence": True  # 0-1 confidence score
        }
    }
)

# With citations enabled, result.result is a dict and each value is wrapped
# (assumes the schema defines a "total_amount" field)
field = result.result["total_amount"]
print(f"Value: {field['value']}")
print(f"Page: {field['citations'][0]['bbox']['page']}")
print(f"Confidence: {field['citations'][0]['confidence']}")

Citations cannot be used with chunking. If you enable citations, chunking is automatically disabled.

Parsing Configuration

Since Extract runs Parse internally, you can configure parsing:

result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    parsing={
        "enhance": {
            "agentic": [{"scope": "table"}]  # For better table extraction
        },
        "formatting": {
            "table_output_format": "html"  # Better for complex tables
        },
        "settings": {
            "page_range": {"start": 1, "end": 10},
            "document_password": "secret"  # For encrypted PDFs
        }
    }
)

These options are ignored if your input is a jobid:// reference.
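
Because parsing options have no effect on `jobid://` inputs, it can help to drop them explicitly so the call site does not suggest otherwise. A minimal sketch, assuming a single string input; `build_extract_kwargs` is a hypothetical helper, not an SDK function:

```python
from typing import Optional


def build_extract_kwargs(input_ref: str, instructions: dict,
                         parsing: Optional[dict] = None) -> dict:
    """Assemble keyword arguments, dropping parsing options that would be ignored."""
    kwargs = {"input": input_ref, "instructions": instructions}
    # Only attach parsing when Extract will actually run Parse on the input
    if parsing and not input_ref.startswith("jobid://"):
        kwargs["parsing"] = parsing
    return kwargs


instructions = {"system_prompt": "Extract the invoice number"}
parsing = {"formatting": {"table_output_format": "html"}}

fresh = build_extract_kwargs("https://example.com/invoice.pdf", instructions, parsing)
reused = build_extract_kwargs("jobid://job-1", instructions, parsing)

print("parsing" in fresh, "parsing" in reused)  # True False
```

The returned dict can then be passed as `client.extract.run(**fresh)`.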

Response Structure

result: ExtractResponse = client.extract.run(...)

# Top-level fields
print(result.job_id)          # str: Job identifier
print(result.usage.num_pages) # int: Pages processed
print(result.usage.credits)   # float: Credits used
print(result.studio_link)     # str: Studio link

# Extracted data
extracted_data = result.result  # list[dict]: Array of extracted objects
first_result = extracted_data[0]
print(first_result["invoice_number"])

With Citations

When citations are enabled, the response format changes. Instead of a list, result.result is a dict with values wrapped in citation objects:

# Without citations - result.result is a list
result.result[0]["total"]  # 1234.56

# With citations - result.result is a dict
result.result["total"]["value"]  # 1234.56
result.result["total"]["citations"][0]["bbox"]["page"]  # 1
result.result["total"]["citations"][0]["confidence"]  # 0-1 score when numerical_confidence is true (the default); otherwise a label such as "high"
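
Code that must work with citations on or off can normalize both shapes first. The sketch below assumes a flat schema (no nested objects or arrays) and uses hand-written sample payloads that mirror the two shapes described above; `plain_values` is a hypothetical helper.

```python
def plain_values(result):
    """Return a list of plain dicts regardless of whether citations were enabled."""
    if isinstance(result, list):
        # Citations disabled: already a list of plain dicts
        return result
    # Citations enabled: each field looks like {"value": ..., "citations": [...]}
    return [{name: field["value"] for name, field in result.items()}]


without_citations = [{"total": 1234.56}]
with_citations = {
    "total": {
        "value": 1234.56,
        "citations": [{"bbox": {"page": 1}, "confidence": 0.9}],
    }
}

print(plain_values(without_citations))  # [{'total': 1234.56}]
print(plain_values(with_citations))     # [{'total': 1234.56}]
```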

Schemaless Extraction

You can also extract without a schema using only a system prompt:

result = client.extract.run(
    input=upload.file_id,
    instructions={
        "system_prompt": "Extract all key financial information from this invoice"
    }
)

# The model decides what to extract
print(result.result[0])

Error Handling

from reducto import Reducto
import reducto

try:
    result = client.extract.run(
        input=upload.file_id,
        instructions={"schema": schema}
    )
except reducto.APIConnectionError as e:
    print(f"Connection failed: {e}")
except reducto.APIStatusError as e:
    print(f"Extraction failed: {e.status_code} - {e.response}")
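
Connection errors are often transient, so one option is to retry with exponential backoff. This is a generic sketch rather than an SDK feature: `run_with_retries` is a hypothetical helper, and the demo uses Python's built-in `ConnectionError` with a stand-in function instead of a real API call. With the SDK you would presumably pass `reducto.APIConnectionError` as `retry_on`; check first whether the client already retries on its own.

```python
import time


def run_with_retries(call, retry_on, attempts=3, base_delay=1.0):
    """Run call(); on a retry_on exception, wait and try again with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for client.extract.run(...) that fails twice, then succeeds
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return {"invoice_number": "INV-001"}


outcome = run_with_retries(flaky_extract, retry_on=ConnectionError, base_delay=0)
print(outcome, "after", calls["count"], "calls")
```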

Complete Example

from pathlib import Path
from reducto import Reducto

client = Reducto()

# Upload
upload = client.upload(file=Path("financial-statement.pdf"))

# Define schema
schema = {
    "type": "object",
    "properties": {
        "portfolio_value": {
            "type": "number",
            "description": "Total portfolio value at the end of the period"
        },
        "total_income_ytd": {
            "type": "number",
            "description": "Total income year-to-date"
        },
        "top_holdings": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Names of the top 5 holdings"
        }
    }
}

# Extract with configuration
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "Extract financial data from this investment statement."
    },
    settings={
        "citations": {"enabled": True},
        "array_extract": True  # For top_holdings array
    },
    parsing={
        "enhance": {
            "agentic": [{"scope": "table"}]  # Better table extraction
        }
    }
)

# Process results
# Citations are enabled, so result.result is a dict whose values are wrapped
# in citation objects (see "With Citations")
extracted = result.result
print(f"Used {result.usage.credits} credits")

print(f"Portfolio Value: ${extracted['portfolio_value']['value']:,.2f}")
print(f"Total Income YTD: ${extracted['total_income_ytd']['value']:,.2f}")
# Assumption: each array item is wrapped the same way as a scalar field
holdings = [item["value"] for item in extracted["top_holdings"]]
print(f"Top Holdings: {', '.join(holdings)}")

Best Practices

Write Clear Descriptions

Field descriptions directly impact extraction quality. Be specific about location and format.

Use Array Extraction

Enable array_extract for documents with many repeating items (transactions, line items).

Enable Citations for Verification

Use citations to verify extracted values and show users source locations.

Debug with Parse First

If extraction fails, check the Parse output first. Extract can only find what Parse sees.

Troubleshooting

Missing values

If expected fields are empty:

1. Check the Parse output: client.parse.run(input=upload.file_id)
2. Verify the value appears in the parsed content
3. Improve field descriptions to match how values appear
4. Try enabling array_extract for long documents
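
To see which fields came back empty before working through the steps above, you can compare a result against the schema. A minimal sketch for top-level fields only, using a hand-written sample record and assuming citations are disabled; `missing_fields` is a hypothetical helper:

```python
def missing_fields(schema: dict, record: dict) -> list[str]:
    """List top-level schema properties that are absent or None in the record."""
    return [
        name
        for name in schema.get("properties", {})
        if record.get(name) is None
    ]


schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "date": {"type": "string"},
    },
}
record = {"invoice_number": "INV-001", "total": None}

print(missing_fields(schema, record))  # ['total', 'date']
```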

Hallucinated values

Extract only returns what’s on the document. If you need computed values, extract raw data and compute in your code:

monthly = result.result[0]["monthly_cost"]
annual = monthly * 12  # Compute yourself
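
The same approach applies to extracted line items. The sketch below sums a hand-written sample shaped like the `line_items` schema from the Array Extraction section, treating `price` as a unit price (an assumption) and skipping rows with missing numbers:

```python
line_items = [
    {"description": "Widget", "quantity": 2, "price": 9.50},
    {"description": "Gadget", "quantity": 1, "price": 20.00},
    {"description": "Unreadable row", "quantity": None, "price": 5.00},
]


def line_items_total(items):
    """Sum quantity * price over rows where both numbers were extracted."""
    return sum(
        item["quantity"] * item["price"]
        for item in items
        if item.get("quantity") is not None and item.get("price") is not None
    )


subtotal = line_items_total(line_items)
print(f"{subtotal:.2f}")  # 39.00
```

Comparing a computed subtotal like this against an extracted total is one simple way to spot rows that were missed.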

Next Steps

- Learn about schema design best practices
- Explore array extraction for long documents
- Check out citations for source verification
- See the async client for concurrent processing
