The extract.run() method pulls specific fields from documents as structured JSON. You define a JSON schema with the fields you need, and Extract returns values matching that schema.
Basic Usage
from pathlib import Path
from reducto import Reducto

client = Reducto()

# Upload
upload = client.upload(file=Path("invoice.pdf"))

# Extract with schema
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "invoice_number": {
                    "type": "string",
                    "description": "The invoice number, typically at the top"
                },
                "total": {
                    "type": "number",
                    "description": "The total amount due"
                },
                "date": {
                    "type": "string",
                    "description": "Invoice date"
                }
            }
        }
    }
)

# Access extracted values
print(result.result[0]["invoice_number"])
print(result.result[0]["total"])
Method Signature
def extract.run(
    input: str | list[str],
    instructions: dict | None = None,
    settings: dict | None = None,
    parsing: dict | None = None
) -> ExtractResponse
Parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| input | str \| list[str] | Yes | File ID, URL, or jobid:// reference(s) |
| instructions | dict \| None | No | Schema and/or system prompt for extraction |
| settings | dict \| None | No | Extraction settings (citations, array extraction, images) |
| parsing | dict \| None | No | Parse configuration (used if input is not jobid://) |
Settings Options
| Setting | Type | Default | Description |
| --- | --- | --- | --- |
| array_extract | bool | false | Enable array extraction for repeating data |
| citations.enabled | bool | false | Include source citations in results |
| citations.numerical_confidence | bool | true | Use numeric confidence scores (0-1) |
| include_images | bool | false | Include images in the extraction context |
| optimize_for_latency | bool | false | Prioritize speed over cost |
Schema Definition
The instructions parameter takes a schema field containing a JSON schema that describes the fields to extract:
schema = {
    "type": "object",
    "properties": {
        "field_name": {
            "type": "string",  # or "number", "boolean", "array", "object"
            "description": "Clear description of what to extract"
        }
    }
}

result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema}
)
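When a schema has many flat fields, writing the nested dict by hand gets repetitive. The sketch below shows one way to generate it. `build_schema` is a hypothetical helper, not part of the Reducto SDK, and it only covers scalar fields (arrays and nested objects would still need an explicit `items` or `properties` entry).

```python
# Hypothetical helper (not part of the Reducto SDK): builds a JSON schema
# dict from a mapping of field name -> (JSON type, description).
SCALAR_TYPES = {"string", "number", "boolean"}


def build_schema(fields: dict[str, tuple[str, str]]) -> dict:
    properties = {}
    for name, (json_type, description) in fields.items():
        if json_type not in SCALAR_TYPES:
            raise ValueError(f"Unsupported type for {name!r}: {json_type}")
        properties[name] = {"type": json_type, "description": description}
    return {"type": "object", "properties": properties}


schema = build_schema({
    "invoice_number": ("string", "The invoice number, typically at the top"),
    "total": ("number", "The total amount due"),
})
```

The resulting dict can be passed as instructions={"schema": schema} in the same way as a hand-written schema.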
Field Descriptions
Field descriptions are critical for accurate extraction. Be specific:
# Good: Specific description
{
    "invoice_total": {
        "type": "number",
        "description": "The total amount due, typically at the bottom of the invoice in a 'Total' or 'Amount Due' section"
    }
}

# Bad: Vague description
{
    "total": {
        "type": "number",
        "description": "Total"
    }
}
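A quick check before sending a schema can catch descriptions like the vague one above. This is a rough sketch: `find_vague_descriptions` is a hypothetical helper, and the four-word threshold is an arbitrary assumption rather than a Reducto recommendation.

```python
# Hypothetical lint (not part of the Reducto SDK): flags fields whose
# description is missing or shorter than a chosen number of words.
def find_vague_descriptions(schema: dict, min_words: int = 4) -> list[str]:
    vague = []
    for name, spec in schema.get("properties", {}).items():
        description = spec.get("description", "")
        if len(description.split()) < min_words:
            vague.append(name)
    return vague


example_schema = {
    "type": "object",
    "properties": {
        "invoice_total": {
            "type": "number",
            "description": "The total amount due, typically at the bottom of the invoice",
        },
        "total": {"type": "number", "description": "Total"},
    },
}
print(find_vague_descriptions(example_schema))  # ['total']
```

Word count is only a proxy for specificity, so treat the output as a prompt to review those fields, not as a verdict.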
System Prompt
Add document-level context with system_prompt:
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "This is a medical invoice. Extract billing codes and patient information."
    }
)
Extract accepts multiple input formats:
# From upload
result = client.extract.run(input=upload.file_id, instructions={...})

# Public URL
result = client.extract.run(input="https://example.com/invoice.pdf", instructions={...})

# Reprocess previous parse job
result = client.extract.run(input="jobid://7600c8c5-...", instructions={...})

# Combine multiple parsed documents
result = client.extract.run(
    input=["jobid://job-1", "jobid://job-2", "jobid://job-3"],
    instructions={...}
)
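If job IDs are stored without the jobid:// prefix, they need it added before being passed as input. A minimal sketch follows. `to_job_refs` is a hypothetical helper, and the IDs are placeholders.

```python
# Hypothetical helper (not part of the Reducto SDK): prefixes bare job IDs
# with jobid:// and leaves existing references untouched.
def to_job_refs(job_ids: list[str]) -> list[str]:
    prefix = "jobid://"
    return [j if j.startswith(prefix) else prefix + j for j in job_ids]


refs = to_job_refs(["job-1", "jobid://job-2", "job-3"])
print(refs)  # ['jobid://job-1', 'jobid://job-2', 'jobid://job-3']
```

The returned list can be passed directly as the input argument to combine several parsed documents.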
For documents with repeating data (line items, transactions), enable array extraction:
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": {
            "type": "object",
            "properties": {
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "price": {"type": "number"}
                        }
                    }
                }
            }
        }
    },
    settings={
        "array_extract": True
    }
)
See the Array Extraction Guide for a detailed guide to array extraction configuration.
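Once line items are extracted, derived figures such as an order total are computed in your own code. The sketch below works on a plain dict shaped like the schema above. The sample data is invented, `line_items_total` is a hypothetical helper, and it assumes that a value not found in the document comes back as None.

```python
# Hypothetical post-processing helper: sums quantity * price over extracted
# line items, skipping rows where either value is missing.
def line_items_total(extracted: dict) -> float:
    total = 0.0
    for item in extracted.get("line_items") or []:
        quantity = item.get("quantity")
        price = item.get("price")
        if quantity is None or price is None:
            continue  # assumption: missing values are returned as None
        total += quantity * price
    return total


# Invented sample shaped like one element of result.result
sample = {
    "line_items": [
        {"description": "Widget", "quantity": 2, "price": 10.0},
        {"description": "Gadget", "quantity": 1, "price": 5.5},
        {"description": "Unpriced", "quantity": 3, "price": None},
    ]
}
print(line_items_total(sample))  # 25.5
```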
Citations
Enable citations to get source locations for each extracted value:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={
        "citations": {
            "enabled": True,
            "numerical_confidence": True  # 0-1 confidence score
        }
    }
)

# With citations enabled, result.result is a dict and each value is wrapped
field = result.result["total"]
print(f"Value: {field['value']}")
print(f"Page: {field['citations'][0]['bbox']['page']}")
print(f"Confidence: {field['citations'][0]['confidence']}")
Citations cannot be used with chunking. If you enable citations, chunking is automatically disabled.
Parsing Configuration
Since Extract runs Parse internally, you can configure parsing:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    parsing={
        "enhance": {
            "agentic": [{"scope": "table"}]  # For better table extraction
        },
        "formatting": {
            "table_output_format": "html"  # Better for complex tables
        },
        "settings": {
            "page_range": {"start": 1, "end": 10},
            "document_password": "secret"  # For encrypted PDFs
        }
    }
)
These options are ignored if your input is a jobid:// reference.
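Because parsing options have no effect on jobid:// inputs, it can be clearer to leave them out of the call in that case. The sketch below assembles keyword arguments accordingly. `build_extract_kwargs` is a hypothetical helper, and treating a list as a job reference only when every entry starts with jobid:// is an assumption.

```python
# Hypothetical helper (not part of the Reducto SDK): assembles keyword
# arguments for extract.run and omits `parsing` for jobid:// inputs.
def build_extract_kwargs(input_ref, instructions, parsing=None, settings=None) -> dict:
    kwargs = {"input": input_ref, "instructions": instructions}
    if settings:
        kwargs["settings"] = settings
    refs = input_ref if isinstance(input_ref, list) else [input_ref]
    all_job_refs = bool(refs) and all(r.startswith("jobid://") for r in refs)
    if parsing and not all_job_refs:
        kwargs["parsing"] = parsing
    return kwargs


parsing = {"formatting": {"table_output_format": "html"}}
instructions = {"system_prompt": "Extract all key financial information"}

from_job = build_extract_kwargs("jobid://job-1", instructions, parsing=parsing)
from_url = build_extract_kwargs("https://example.com/invoice.pdf", instructions, parsing=parsing)
print("parsing" in from_job, "parsing" in from_url)  # False True
```

The result would be passed as client.extract.run(**kwargs).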
Response Structure
result: ExtractResponse = client.extract.run(...)

# Top-level fields
print(result.job_id)           # str: Job identifier
print(result.usage.num_pages)  # int: Pages processed
print(result.usage.credits)    # float: Credits used
print(result.studio_link)      # str: Studio link

# Extracted data
extracted_data = result.result  # list[dict]: Array of extracted objects
first_result = extracted_data[0]
print(first_result["invoice_number"])
With Citations
When citations are enabled, the response format changes. Instead of a list, result.result is a dict with values wrapped in citation objects:
# Without citations - result.result is a list
result.result[0]["total"]  # 1234.56

# With citations - result.result is a dict
result.result["total"]["value"]  # 1234.56
result.result["total"]["citations"][0]["bbox"]["page"]  # 1
result.result["total"]["citations"][0]["confidence"]  # "high"
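Code that must handle both response shapes can normalize them into a plain dict of values first. This is a minimal sketch based only on the two shapes shown above. `unwrap_citations` is a hypothetical helper, it treats the payload as plain dicts and lists, and it reads only the first element of the list form.

```python
# Hypothetical helper (not part of the Reducto SDK): returns plain
# field -> value pairs for either response shape.
def unwrap_citations(result) -> dict:
    if isinstance(result, list):
        # Without citations: a list of plain dicts; take the first one
        return dict(result[0]) if result else {}
    plain = {}
    for name, field in result.items():
        if isinstance(field, dict) and "value" in field:
            plain[name] = field["value"]  # citation-wrapped value
        else:
            plain[name] = field
    return plain


# Invented samples mirroring the two shapes
without_citations = [{"total": 1234.56}]
with_citations = {
    "total": {
        "value": 1234.56,
        "citations": [{"bbox": {"page": 1}, "confidence": "high"}],
    }
}
print(unwrap_citations(without_citations))  # {'total': 1234.56}
print(unwrap_citations(with_citations))     # {'total': 1234.56}
```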
You can also extract without a schema using only a system prompt:
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "system_prompt": "Extract all key financial information from this invoice"
    }
)

# The model decides what to extract
print(result.result[0])
Error Handling
from reducto import Reducto
import reducto

try:
    result = client.extract.run(
        input=upload.file_id,
        instructions={"schema": schema}
    )
except reducto.APIConnectionError as e:
    print(f"Connection failed: {e}")
except reducto.APIStatusError as e:
    print(f"Extraction failed: {e.status_code} - {e.response}")
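Connection failures are often transient, so a caller may want to retry them with a growing delay. The sketch below is a generic retry wrapper, not an SDK feature. `run_with_retries` and its parameters are hypothetical, and the built-in ConnectionError stands in for reducto.APIConnectionError so the example runs without the SDK.

```python
import time


# Hypothetical retry wrapper: calls `call` up to `attempts` times, doubling
# the delay after each failure, and re-raises the last error.
def run_with_retries(call, retry_on=(ConnectionError,), attempts=3,
                     base_delay=1.0, sleep=time.sleep):
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a flaky call: fails twice, then succeeds
state = {"calls": 0}


def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
outcome = run_with_retries(flaky_call, sleep=delays.append)  # record delays instead of sleeping
print(outcome, delays)  # ok [1.0, 2.0]
```

With the SDK, the call would look like run_with_retries(lambda: client.extract.run(...), retry_on=(reducto.APIConnectionError,)). Status errors are left out of the retry set here because many of them, such as an invalid schema, will not succeed on a second attempt.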
Complete Example
from pathlib import Path
from reducto import Reducto

client = Reducto()

# Upload
upload = client.upload(file=Path("financial-statement.pdf"))

# Define schema
schema = {
    "type": "object",
    "properties": {
        "portfolio_value": {
            "type": "number",
            "description": "Total portfolio value at the end of the period"
        },
        "total_income_ytd": {
            "type": "number",
            "description": "Total income year-to-date"
        },
        "top_holdings": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Names of the top 5 holdings"
        }
    }
}

# Extract with configuration
# Citations are left off so result.result stays a list of plain values
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "Extract financial data from this investment statement."
    },
    settings={
        "array_extract": True  # For top_holdings array
    },
    parsing={
        "enhance": {
            "agentic": [{"scope": "table"}]  # Better table extraction
        }
    }
)

# Process results
print(f"Extracted {len(result.result)} results")
print(f"Used {result.usage.credits} credits")

for i, extracted in enumerate(result.result):
    print(f"\n=== Result {i + 1} ===")
    print(f"Portfolio Value: ${extracted['portfolio_value']:,.2f}")
    print(f"Total Income YTD: ${extracted['total_income_ytd']:,.2f}")
    print(f"Top Holdings: {', '.join(extracted['top_holdings'])}")
Best Practices
Write Clear Descriptions: Field descriptions directly impact extraction quality. Be specific about location and format.
Use Array Extraction: Enable array_extract for documents with many repeating items (transactions, line items).
Enable Citations for Verification: Use citations to verify extracted values and show users source locations.
Debug with Parse First: If extraction fails, check the Parse output first. Extract can only find what Parse sees.
Troubleshooting
If expected fields are empty:
1. Check the Parse output: client.parse.run(input=upload.file_id)
2. Verify the value appears in the parsed content
3. Improve field descriptions to match how values appear
4. Try enabling array_extract for long documents
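To find out which fields need this attention, the extracted result can be compared against the schema. The sketch below is one way to do that. `empty_fields` is a hypothetical helper, and treating None, an empty string, and an empty list as "empty" is an assumption about how missing values are returned.

```python
# Hypothetical helper (not part of the Reducto SDK): lists schema fields
# that are absent or empty in one extracted result.
def empty_fields(schema: dict, extracted: dict) -> list[str]:
    empty = []
    for name in schema.get("properties", {}):
        value = extracted.get(name)
        if value is None or value == "" or value == []:
            empty.append(name)
    return empty


# Invented sample shaped like one element of result.result
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "date": {"type": "string"},
    },
}
extracted = {"invoice_number": "INV-001", "total": 0, "date": None}
print(empty_fields(invoice_schema, extracted))  # ['date']
```

A total of 0 is kept as a real value here, since a zero amount is different from a value that was not found.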
Extract only returns what's on the document. If you need computed values, extract raw data and compute in your code:
monthly = result.result[0]["monthly_cost"]
annual = monthly * 12  # Compute yourself
Next Steps