The extract.run() method pulls specific fields from documents as structured JSON. You define a JSON schema with the fields you need, and Extract returns values matching that schema.

Basic Usage

from pathlib import Path
from reducto import Reducto

client = Reducto()

# Upload
upload = client.upload(file=Path("invoice.pdf"))

# Extract with schema
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {
                    "type": "string",
                    "description": "The invoice number, typically at the top"
                },
                "total": {
                    "type": "number",
                    "description": "The total amount due"
                },
                "date": {
                    "type": "string",
                    "description": "Invoice date"
                }
            }
        }
    }
)

# Access extracted values
print(result.result[0]["invoice_number"])
print(result.result[0]["total"])

Method Signature

def extract.run(
    input: str | list[str],
    instructions: dict | None = None,
    settings: dict | None = None,
    parsing: dict | None = None
) -> ExtractResponse

Parameters

Parameter      Type              Required   Description
input          str | list[str]   Yes        File ID, URL, or jobid:// reference(s)
instructions   dict | None       No         Schema and/or system prompt for extraction
settings       dict | None       No         Extraction settings (citations, array extraction, images)
parsing        dict | None       No         Parse configuration (used if input is not jobid://)

Settings Options

Setting                          Type   Default   Description
array_extract                    bool   false     Enable array extraction for repeating data
citations.enabled                bool   false     Include source citations in results
citations.numerical_confidence   bool   true      Use numeric confidence scores (0-1)
include_images                   bool   false     Include images in the extraction context
optimize_for_latency             bool   false     Prioritize speed over cost

Schema Definition

The instructions parameter normally carries a schema field containing a JSON schema (for the prompt-only alternative, see Schemaless Extraction):
schema = {
    "type": "object",
    "properties": {
        "field_name": {
            "type": "string",  # or "number", "boolean", "array", "object"
            "description": "Clear description of what to extract"
        }
    }
}

result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema}
)
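
When a schema has many flat fields, writing the nested dict by hand gets repetitive. The sketch below builds the same structure from a compact mapping. The helper name build_schema and its input format are hypothetical and not part of the Reducto SDK, and it only covers top-level scalar fields.

```python
from typing import Any

# Hypothetical helper (not part of the Reducto SDK): builds an object schema
# for flat scalar fields from {field_name: (json_type, description)}.
SCALAR_TYPES = {"string", "number", "boolean"}


def build_schema(fields: dict[str, tuple[str, str]]) -> dict[str, Any]:
    properties: dict[str, Any] = {}
    for name, (json_type, description) in fields.items():
        if json_type not in SCALAR_TYPES:
            raise ValueError(f"Unsupported type for {name!r}: {json_type}")
        properties[name] = {"type": json_type, "description": description}
    return {"type": "object", "properties": properties}


schema = build_schema({
    "invoice_number": ("string", "The invoice number, typically at the top"),
    "total": ("number", "The total amount due"),
})
```

The resulting dict can be passed as instructions={"schema": schema}. Nested arrays and objects would still need to be written out explicitly.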

Field Descriptions

Field descriptions are critical for accurate extraction. Be specific:
# Good: Specific description
{
    "invoice_total": {
        "type": "number",
        "description": "The total amount due, typically at the bottom of the invoice in a 'Total' or 'Amount Due' section"
    }
}

# Bad: Vague description
{
    "total": {
        "type": "number",
        "description": "Total"
    }
}
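
One way to catch vague descriptions before sending a request is a small lint pass over the schema. This is only a rough heuristic sketch: the function name find_vague_descriptions and the four-word threshold are assumptions chosen for illustration, not a rule from Reducto.

```python
# Hypothetical lint helper: flags top-level fields whose description is
# missing or shorter than min_words words. The threshold is an arbitrary
# assumption; tune it to your own schemas.
def find_vague_descriptions(schema: dict, min_words: int = 4) -> list[str]:
    vague = []
    for name, spec in schema.get("properties", {}).items():
        description = spec.get("description", "")
        if len(description.split()) < min_words:
            vague.append(name)
    return vague


example_schema = {
    "type": "object",
    "properties": {
        "invoice_total": {
            "type": "number",
            "description": "The total amount due, typically at the bottom of the invoice",
        },
        "total": {"type": "number", "description": "Total"},
    },
}

print(find_vague_descriptions(example_schema))  # ['total']
```

Word count is a crude proxy for specificity, so treat a flagged field as a prompt to review it rather than as an error.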

System Prompt

Add document-level context with system_prompt:
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "This is a medical invoice. Extract billing codes and patient information."
    }
)

Input Options

Extract accepts multiple input formats:
# From upload
result = client.extract.run(input=upload.file_id, instructions={...})

# Public URL
result = client.extract.run(input="https://example.com/invoice.pdf", instructions={...})

# Reprocess previous parse job
result = client.extract.run(input="jobid://7600c8c5-...", instructions={...})

# Combine multiple parsed documents
result = client.extract.run(
    input=["jobid://job-1", "jobid://job-2", "jobid://job-3"],
    instructions={...}
)
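
If input references come from user data, it can help to check which of the forms above a string takes before calling Extract. The sketch below is a hypothetical helper. It assumes that anything that is neither a jobid:// reference nor an http(s) URL is an upload file ID, since the exact file ID format is not specified here.

```python
# Hypothetical helper: classifies an input reference by its prefix.
# Assumption: any string that is not a jobid:// reference or an http(s) URL
# is treated as an upload file ID.
def classify_input(ref: str) -> str:
    if ref.startswith("jobid://"):
        return "parse_job"
    if ref.startswith(("http://", "https://")):
        return "url"
    return "file_id"


for ref in ["jobid://job-1", "https://example.com/invoice.pdf"]:
    print(ref, "->", classify_input(ref))
```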

Array Extraction

For documents with repeating data (line items, transactions), enable array extraction:
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "price": {"type": "number"}
                        }
                    }
                }
            }
        }
    },
    settings={
        "array_extract": True
    }
)
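
Once line items are extracted, totals can be derived locally. The sketch below works on a plain dict shaped like the schema above. The sample record is made up for illustration, and treating price as a per-unit price is an assumption.

```python
# Minimal sketch: sums quantity * price over extracted line items.
# Assumption: "price" is a unit price. Rows with a missing value are skipped.
def line_items_total(record: dict) -> float:
    total = 0.0
    for item in record.get("line_items", []):
        quantity = item.get("quantity")
        price = item.get("price")
        if quantity is None or price is None:
            continue  # value was not extracted for this row
        total += quantity * price
    return total


# Made-up sample data in the shape of the schema above
sample_record = {
    "line_items": [
        {"description": "Widget", "quantity": 2, "price": 10.0},
        {"description": "Gadget", "quantity": 1, "price": 5.5},
        {"description": "Unreadable row", "quantity": None, "price": 3.0},
    ]
}

print(line_items_total(sample_record))  # 25.5
```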

Citations

Enable citations to get source locations for each extracted value:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={
        "citations": {
            "enabled": True,
            "numerical_confidence": True  # 0-1 confidence score
        }
    }
)

# With citations enabled, result.result is a dict and each value is wrapped
field = result.result["total"]
print(f"Value: {field['value']}")
print(f"Page: {field['citations'][0]['bbox']['page']}")
print(f"Confidence: {field['citations'][0]['confidence']}")

Note: Citations cannot be used with chunking. If you enable citations, chunking is automatically disabled.

Parsing Configuration

Since Extract runs Parse internally, you can configure parsing:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    parsing={
        "enhance": {
            "agentic": [{"scope": "table"}]  # For better table extraction
        },
        "formatting": {
            "table_output_format": "html"  # Better for complex tables
        },
        "settings": {
            "page_range": {"start": 1, "end": 10},
            "document_password": "secret"  # For encrypted PDFs
        }
    }
)
These options are ignored when the input is a jobid:// reference, because that document has already been parsed.

Response Structure

result: ExtractResponse = client.extract.run(...)

# Top-level fields
print(result.job_id)          # str: Job identifier
print(result.usage.num_pages) # int: Pages processed
print(result.usage.credits)   # float: Credits used
print(result.studio_link)     # str: Studio link

# Extracted data
extracted_data = result.result  # list[dict]: Array of extracted objects
first_result = extracted_data[0]
print(first_result["invoice_number"])

With Citations

When citations are enabled, the response format changes. Instead of a list, result.result is a dict with values wrapped in citation objects:
# Without citations - result.result is a list
result.result[0]["total"]  # 1234.56

# With citations - result.result is a dict
result.result["total"]["value"]  # 1234.56
result.result["total"]["citations"][0]["bbox"]["page"]  # 1
result.result["total"]["citations"][0]["confidence"]  # "high"
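
Downstream code often wants plain values regardless of whether citations were enabled. The sketch below flattens a citation-wrapped dict of the shape shown above. Both helper names are hypothetical, and the sample data is made up to mirror that shape. Nested or array fields may be wrapped differently and are not handled here.

```python
# Hypothetical helpers for the citation-wrapped shape shown above:
# {"field": {"value": ..., "citations": [{"bbox": {"page": ...}, ...}]}}
def unwrap_citations(wrapped: dict) -> dict:
    plain = {}
    for name, field in wrapped.items():
        if isinstance(field, dict) and "value" in field:
            plain[name] = field["value"]
        else:
            plain[name] = field  # leave anything unexpected untouched
    return plain


def cited_pages(wrapped: dict, name: str) -> list[int]:
    return [c["bbox"]["page"] for c in wrapped[name].get("citations", [])]


# Made-up sample mirroring the documented shape
sample = {
    "total": {
        "value": 1234.56,
        "citations": [{"bbox": {"page": 1}, "confidence": "high"}],
    }
}

print(unwrap_citations(sample))      # {'total': 1234.56}
print(cited_pages(sample, "total"))  # [1]
```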

Schemaless Extraction

You can also extract without a schema using only a system prompt:
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "system_prompt": "Extract all key financial information from this invoice"
    }
)

# The model decides what to extract
print(result.result[0])

Error Handling

from reducto import Reducto
import reducto

try:
    result = client.extract.run(
        input=upload.file_id,
        instructions={"schema": schema}
    )
except reducto.APIConnectionError as e:
    print(f"Connection failed: {e}")
except reducto.APIStatusError as e:
    print(f"Extraction failed: {e.status_code} - {e.response}")
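
Connection failures are often transient, so one option is to retry with exponential backoff. The wrapper below is a generic sketch that takes the exception types to retry on as a parameter, so it does not depend on the SDK. The name call_with_retry and the delay values are assumptions, and the SDK may already retry internally, so check before layering this on top.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


# Hypothetical wrapper: retries fn() on the given exception types with
# exponential backoff (base_delay, 2 * base_delay, 4 * base_delay, ...).
def call_with_retry(
    fn: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

With the names from the example above, a call might look like call_with_retry(lambda: client.extract.run(input=upload.file_id, instructions={"schema": schema}), retry_on=(reducto.APIConnectionError,)). The sleep parameter exists so tests can replace the real delay.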

Complete Example

from pathlib import Path
from reducto import Reducto

client = Reducto()

# Upload
upload = client.upload(file=Path("financial-statement.pdf"))

# Define schema
schema = {
    "type": "object",
    "properties": {
        "portfolio_value": {
            "type": "number",
            "description": "Total portfolio value at the end of the period"
        },
        "total_income_ytd": {
            "type": "number",
            "description": "Total income year-to-date"
        },
        "top_holdings": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Names of the top 5 holdings"
        }
    }
}

# Extract with configuration
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "Extract financial data from this investment statement."
    },
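    settings={
        # Citations are left off here: enabling them turns result.result into
        # a dict of wrapped values, which the processing loop below does not handle.
        "array_extract": True  # For top_holdings array
    },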
    parsing={
        "enhance": {
            "agentic": [{"scope": "table"}]  # Better table extraction
        }
    }
)

# Process results
print(f"Extracted {len(result.result)} results")
print(f"Used {result.usage.credits} credits")

for i, extracted in enumerate(result.result):
    print(f"\n=== Result {i + 1} ===")
    print(f"Portfolio Value: ${extracted['portfolio_value']:,.2f}")
    print(f"Total Income YTD: ${extracted['total_income_ytd']:,.2f}")
    print(f"Top Holdings: {', '.join(extracted['top_holdings'])}")

Best Practices

Write Clear Descriptions

Field descriptions directly impact extraction quality. Be specific about location and format.

Use Array Extraction

Enable array_extract for documents with many repeating items (transactions, line items).

Enable Citations for Verification

Use citations to verify extracted values and show users source locations.

Debug with Parse First

If extraction fails, check the Parse output first. Extract can only find what Parse sees.

Troubleshooting

If expected fields are empty:
  1. Check the Parse output: client.parse.run(input=upload.file_id)
  2. Verify the value appears in the parsed content
  3. Improve field descriptions to match how values appear
  4. Try enabling array_extract for long documents
Extract only returns values that appear in the document. If you need computed values, extract the raw data and compute them in your code:
monthly = result.result[0]["monthly_cost"]
annual = monthly * 12  # Compute yourself
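
To make the first troubleshooting step easier, it can help to list which schema fields came back empty. The function below is a hypothetical sketch that compares a schema's top-level properties against one extracted record (a plain dict). The sample data is made up.

```python
# Hypothetical helper: returns schema fields that are missing, None, or empty
# in an extracted record. Zero and False count as real values.
def find_empty_fields(schema: dict, record: dict) -> list[str]:
    empty = []
    for name in schema.get("properties", {}):
        value = record.get(name)
        if value is None or value == "" or value == []:
            empty.append(name)
    return empty


check_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "date": {"type": "string"},
    },
}
check_record = {"invoice_number": "INV-001", "total": 0, "date": ""}

print(find_empty_fields(check_schema, check_record))  # ['date']
```

Fields reported here are the ones to look for in the Parse output before adjusting their descriptions.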
