This feature is still in beta; configuration options and behavior are subject to change.
Overview
Agent-in-the-loop extraction is an advanced feature that uses AI agents to intelligently review and refine extraction results. This feature is specifically designed for schemas with arrays, where you need to extract all line items such as:
- Transaction lists from financial statements
- Holdings or portfolio items from investment reports
- Invoice line items from billing documents
- Product listings from catalogs or inventory reports
The agent-in-the-loop system is particularly valuable when you cannot afford to miss any items in these arrays. It works by having an AI agent methodically review each page’s extracted data against the original document, identify missing or incorrect items, and make corrections iteratively until all line items are accurately captured.
During the current beta period, agent-in-the-loop extraction is billed at the same rate as standard extraction. For detailed information on credit usage rates, please refer to our Credit Usage and Rates documentation.
Requirements and Limitations
Requirements
- PDF Documents: Currently only supports PDF input files
- Array Schema: Your schema must contain at least one top-level array field
- Full Document Processing: Processes the entire document (page range restrictions not supported)
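The requirements above can be checked locally before a request is sent. The following is a minimal sketch and not part of the API: the helper names `top_level_array_fields` and `check_requirements` are hypothetical, and the PDF check assumes the input is a file name or URL whose path ends in `.pdf`, which will not hold for every valid PDF link.

```python
from urllib.parse import urlparse


def top_level_array_fields(schema: dict) -> list[str]:
    """Return the names of top-level properties whose type is "array"."""
    properties = schema.get("properties", {})
    return [
        name
        for name, spec in properties.items()
        if isinstance(spec, dict) and spec.get("type") == "array"
    ]


def check_requirements(document: str, schema: dict) -> list[str]:
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    # Assumption: a PDF input is recognizable by a ".pdf" suffix on the path.
    path = urlparse(document).path
    if not path.lower().endswith(".pdf"):
        problems.append("input does not look like a PDF")
    if not top_level_array_fields(schema):
        problems.append("schema has no top-level array field")
    return problems
```

Calling `check_requirements(document_url, schema)` before submitting lets you surface an unsupported file type or a missing array field without spending a request.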
Limitations
- Processing Time: Significantly longer processing time due to iterative refinement
- Cost: Higher costs due to multiple AI model calls per page
- File Format: Limited to PDF documents only
- Single Array Focus: Focuses refinement on one primary top-level array field at a time
- Citations: Citations are not currently supported with agent-in-the-loop extraction
Best Practices
Schema Design
Ensure your schema has a clear top-level array that represents the items you want the agent to focus on:
{
  "type": "object",
  "properties": {
    "transactions": {  // Focus of agent refinement; fields_to_verify must be set to ["transactions"]
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "date": {"type": "string"},
          "amount": {"type": "number"},
          "description": {"type": "string"}
        }
      }
    },
    "document_summary": {  // This won't be refined by the agent
      "type": "string"
    }
  }
}
System Prompts
Provide clear, specific system prompts that help the agent understand:
- What constitutes a valid item vs. summary/header rows
- How to handle edge cases in your document type
- The level of precision required
- Details about how to identify the items
"system_prompt": "Extract individual transaction line items only. Exclude summary rows, headers, and totals. Each transaction must have a unique date, amount, and description. Transactions are in tables following a 'Transactions' header."
Fields to Verify
The entry in fields_to_verify must match the key of the top-level array in the schema. Only one field can currently be verified, so fields_to_verify must contain exactly one name:
"fields_to_verify": ["transactions"] # Clear and specific, matches the schema above
# vs
"fields_to_verify": ["items", "document_summary"] # Does not match the key of the array, and includes a non-array field name
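The rule above (exactly one name, and that name must be the key of a top-level array) is easy to enforce on the client side. This is a minimal sketch under that reading of the rule; `validate_fields_to_verify` is a hypothetical helper, not an API function.

```python
def validate_fields_to_verify(schema: dict, fields_to_verify: list[str]) -> str:
    """Return the single field name if it is valid, otherwise raise ValueError."""
    if len(fields_to_verify) != 1:
        raise ValueError(
            f"fields_to_verify must contain exactly one name, got {len(fields_to_verify)}"
        )
    name = fields_to_verify[0]
    # Only top-level properties are considered, matching the rule above.
    spec = schema.get("properties", {}).get(name)
    if spec is None:
        raise ValueError(f"{name!r} is not a top-level key in the schema")
    if spec.get("type") != "array":
        raise ValueError(f"{name!r} is not an array field")
    return name
```

With the schema from the Schema Design section, `["transactions"]` passes, while `["items", "document_summary"]` fails on the length check and `["document_summary"]` fails because the field is not an array.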
Configuration
Basic Configuration
To enable agent-in-the-loop extraction, set the enabled flag to true in your extraction request:
import os

import requests

# Read the API key from the environment (the variable name is an assumption)
REDUCTO_API_KEY = os.environ["REDUCTO_API_KEY"]
headers = {"Authorization": f"Bearer {REDUCTO_API_KEY}"}

# The top-level array below is the field the agent will verify
schema = {
    "type": "object",
    "properties": {
        "invoice_line_items": {
            "type": "array",
            "description": "List of charges in an invoice table.",
            "items": {
                "type": "object",
                "properties": {
                    "item_name": {"type": "string"},
                    "item_cost": {"type": "number"},
                    "quantity": {"type": "number"}
                }
            }
        }
    }
}

extract_response = requests.post(
    "https://platform.reducto.ai/extract",
    json={
        "input": "YOUR_DOCUMENT_URL",
        "instructions": {
            "schema": schema,
            "system_prompt": "Be precise and thorough when extracting invoice line items.",
            "agent_in_the_loop": {
                "enabled": True,
                "fields_to_verify": ["invoice_line_items"]
            }
        }
    },
    headers=headers,
)
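When several document types share the same request shape, it can help to assemble the body in one place. The sketch below mirrors the structure of the request shown above; `build_extract_payload` and its parameter names are hypothetical and only illustrate the nesting.

```python
def build_extract_payload(
    document_url: str,
    schema: dict,
    array_field: str,
    system_prompt: str,
) -> dict:
    """Assemble a request body in the shape used by the example above."""
    return {
        "input": document_url,
        "instructions": {
            "schema": schema,
            "system_prompt": system_prompt,
            "agent_in_the_loop": {
                "enabled": True,
                # Exactly one name: the key of the top-level array to refine
                "fields_to_verify": [array_field],
            },
        },
    }
```

The returned dictionary can be passed as the `json=` argument of `requests.post`. Calling `raise_for_status()` on the response before reading it is a reasonable precaution, since it turns HTTP error codes into exceptions.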
Configuration Parameters
agent_in_the_loop
| Parameter | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enables agent-in-the-loop extraction |
| fields_to_verify | list[str] | [] | List of the fields that the agent will focus on for refinement. This currently only supports a single array field. The name of the array field must match the key of the top-level array field within the schema |
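The two parameters and their documented defaults can be modeled as a small typed object on the client side. This is an illustrative sketch: `AgentInTheLoopConfig` is a hypothetical class name, and only the field names and defaults come from the table above.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class AgentInTheLoopConfig:
    """Client-side mirror of the agent_in_the_loop parameters."""

    enabled: bool = False  # default from the table above
    fields_to_verify: list[str] = field(default_factory=list)  # default: []

    def to_dict(self) -> dict:
        """Return a plain dict suitable for the "agent_in_the_loop" key."""
        return asdict(self)
```

For example, `AgentInTheLoopConfig(enabled=True, fields_to_verify=["line_items"]).to_dict()` produces the dictionary used under `agent_in_the_loop` in the request body.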
Example Use Cases
Invoice Processing
{
    "instructions": {
        "schema": {
            "type": "object",
            "properties": {
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "number"},
                            "unit_price": {"type": "number"},
                            "total": {"type": "number"}
                        }
                    }
                }
            }
        },
        "system_prompt": "Extract each invoice line item precisely. Exclude tax lines, subtotals, and summary rows.",
        "agent_in_the_loop": {
            "enabled": True,
            "fields_to_verify": ["line_items"]
        }
    }
}
Financial Statements
{
    "instructions": {
        "schema": {
            "type": "object",
            "properties": {
                "transactions": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "date": {"type": "string"},
                            "account": {"type": "string"},
                            "debit": {"type": "number"},
                            "credit": {"type": "number"}
                        }
                    }
                }
            }
        },
        "system_prompt": "Extract individual journal entries. Each entry must have a date and either a debit or credit amount.",
        "agent_in_the_loop": {
            "enabled": True,
            "fields_to_verify": ["transactions"]
        }
    }
}