Array extraction solves a fundamental problem: LLMs have context limits. When a document is too long, items toward the end of arrays may be truncated or missed entirely. Array extraction segments the document, extracts from each segment in parallel, and merges the results.

When to Use Array Extraction

Enable array_extract when your document has:
  • Long lists or tables: Invoices with 50+ line items, transaction logs, inventory reports
  • Data spanning multiple pages: Tables that continue across page breaks
  • Dense content: Documents where standard extraction misses items toward the end
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={"array_extract": True}
)
If you’re seeing incomplete arrays in your extraction results, array extraction is usually the fix.

How It Works

Array extraction breaks the document into overlapping segments, extracts from each segment independently, then merges the results while removing duplicates. The process:
  1. Segment the document into overlapping page ranges (default: 10 pages per segment with 1 page overlap)
  2. Extract from each segment using your schema
  3. Merge results by combining array items and deduplicating based on content similarity
  4. Return unified output that looks identical to a standard extraction
This architecture means array extraction scales linearly with document length. A 100-page document takes roughly 10x longer than a 10-page document, but you get complete results instead of truncated ones.
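The segment-and-merge steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's implementation: `segment_pages` and `merge_items` are hypothetical helper names, pages are assumed to be 1-indexed, and exact equality stands in for the content-similarity check used during merging.

```python
from typing import Iterator


def segment_pages(num_pages: int, size: int = 10, overlap: int = 1) -> Iterator[tuple[int, int]]:
    """Yield 1-indexed, inclusive (start, end) page ranges that overlap."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    if num_pages < 1:
        return
    step = size - overlap
    start = 1
    while True:
        end = min(start + size - 1, num_pages)
        yield (start, end)
        if end == num_pages:
            break
        start += step


def merge_items(per_segment: list[list[dict]]) -> list[dict]:
    """Concatenate per-segment arrays and drop exact repeats.

    Assumption: exact equality replaces the real similarity check,
    and item values are hashable (strings, numbers).
    """
    seen = set()
    merged = []
    for items in per_segment:
        for item in items:
            key = tuple(sorted(item.items()))
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


print(list(segment_pages(25)))  # [(1, 10), (10, 19), (19, 25)]
```

With the defaults, each segment's last page is also the next segment's first page, so a row that straddles a page break appears whole in at least one segment. The exact-match merge also shows why two genuinely identical rows are hard to tell apart from one row seen twice in an overlap.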

Schema Requirements

Array extraction requires your schema to have at least one property of type array at the top level.
# Valid: has a top-level array property
schema = {
    "type": "object",
    "properties": {
        "transactions": {
            "type": "array",
            "description": "All transaction records",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "description": {"type": "string"},
                    "amount": {"type": "number"}
                }
            }
        },
        "account_number": {"type": "string"},
        "closing_balance": {"type": "number"}
    }
}
Scalar fields (account_number, closing_balance) are extracted from the full document context. Only array fields (transactions) are extracted segment by segment.
If your schema has no top-level array property, array extraction returns an error: “The provided schema doesn’t have a top-level array.”
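To catch this before sending a request, you could check the schema locally. The helper below is a minimal sketch: `top_level_arrays` is a hypothetical name and not part of the Reducto SDK, and it only inspects the schema dictionary.

```python
def top_level_arrays(schema: dict) -> list[str]:
    """Return the names of top-level properties whose type is "array"."""
    props = schema.get("properties", {})
    return [
        name
        for name, spec in props.items()
        if isinstance(spec, dict) and spec.get("type") == "array"
    ]


schema = {
    "type": "object",
    "properties": {
        "transactions": {"type": "array", "items": {"type": "object"}},
        "account_number": {"type": "string"},
    },
}

# Only enable array extraction when the schema qualifies.
settings = {"array_extract": bool(top_level_arrays(schema))}
```

Arrays nested inside another object do not count here, which matches the top-level requirement described above.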

Complete Example

Here’s a full extraction for a bank statement with many transactions:
from reducto import Reducto

client = Reducto()
upload = client.upload(file="bank_statement.pdf")

schema = {
    "type": "object",
    "properties": {
        "account_holder": {
            "type": "string",
            "description": "Name of the account holder"
        },
        "statement_period": {
            "type": "object",
            "properties": {
                "start_date": {"type": "string"},
                "end_date": {"type": "string"}
            }
        },
        "transactions": {
            "type": "array",
            "description": "All debit and credit transactions",
            "items": {
                "type": "object",
                "properties": {
                    "date": {
                        "type": "string",
                        "description": "Transaction date"
                    },
                    "description": {
                        "type": "string",
                        "description": "Transaction description or merchant name"
                    },
                    "amount": {
                        "type": "number",
                        "description": "Transaction amount, negative for debits"
                    },
                    "balance": {
                        "type": "number",
                        "description": "Running balance after this transaction"
                    }
                }
            }
        },
        "opening_balance": {"type": "number"},
        "closing_balance": {"type": "number"}
    }
}

result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "Extract all transactions from this bank statement. Include both debits and credits. Exclude summary rows and totals."
    },
    settings={"array_extract": True}
)

# Access results
account = result.result["account_holder"].value
transactions = result.result["transactions"]

print(f"Account: {account}")
print(f"Transactions: {len(transactions)}")

for txn in transactions[:5]:
    print(f"  {txn['date'].value}: {txn['description'].value} - ${txn['amount'].value}")

Combining with Citations

Array extraction works with citations, but with some constraints. The default array extraction mode supports citations fully:
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={
        "array_extract": True,
        "citations": {"enabled": True}
    }
)

# Each array item has citations
for item in result.result["transactions"]:
    amount = item["amount"]
    if amount.citations:
        print(f"${amount.value} found on page {amount.citations[0].bbox.page}")
Citations point to where each value was found in the original document, even though the extraction happened in segments.

When Array Extraction Isn’t Enough

For truly critical arrays where you cannot afford to miss any items, consider Agent-in-the-Loop. It uses an AI agent to iteratively verify completeness. The tradeoffs:
| Approach | Speed | Completeness | Use case |
| --- | --- | --- | --- |
| Standard extraction | Fastest | May miss items in long docs | Short documents, non-critical arrays |
| Array extraction | Moderate | Good for most cases | Long documents, tables spanning pages |
| Agent-in-the-loop | Slowest | Highest | Financial data, compliance, audit trails |
Start with array extraction. If you’re still missing items after enabling it, escalate to agent-in-the-loop.

Troubleshooting

If array extraction still misses items:
  1. Check the Parse output: The items may not be visible to Extract. Run Parse separately and verify all items appear in the content.
  2. Improve field descriptions: Vague descriptions make it harder to identify items. Add specific details about what to look for.
  3. Add system prompt guidance: Tell the model to be thorough: “Extract every transaction in the document. Do not skip any items.”
  4. Consider agent-in-the-loop: For critical data, the agent approach provides the highest completeness guarantee.
Array extraction deduplicates based on content similarity. If you’re seeing duplicates:
  1. Items may be legitimately similar: Two transactions on the same day with the same amount are distinct items, not duplicates.
  2. Try adding unique identifiers: If transactions have IDs or line numbers, include them in your schema. This helps differentiation.
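To review suspected duplicates yourself, one option is to count items that share the same values. The sketch below assumes you have already unwrapped each field to a plain Python value (for example via `.value`) so that every item is an ordinary dict. `repeated_items` is a hypothetical helper, not an SDK function.

```python
from collections import Counter


def repeated_items(items: list[dict], fields: list[str]) -> dict[tuple, int]:
    """Map each value combination that occurs more than once to its count."""
    counts = Counter(tuple(item.get(f) for f in fields) for item in items)
    return {key: n for key, n in counts.items() if n > 1}


rows = [
    {"date": "2024-01-05", "description": "Coffee", "amount": -4.5},
    {"date": "2024-01-05", "description": "Coffee", "amount": -4.5},
    {"date": "2024-01-06", "description": "Salary", "amount": 2500.0},
]

for key, n in repeated_items(rows, ["date", "description", "amount"]).items():
    print(f"{n}x {key}")
```

A repeated combination is only a candidate: compare it against the source document before deciding whether it is a true duplicate or two legitimate entries.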
Array extraction requires at least one property of type array at the top level of your schema. This works:
{"properties": {"items": {"type": "array", ...}}}
This doesn’t work:
{"type": "array", "items": {...}}
The schema root must be an object with array properties, not an array itself.
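If you already have a schema whose root is an array, wrapping it in an object is a small change. The helper below is a sketch: `wrap_root_array` and the default property name `items` are illustrative choices, not part of the SDK.

```python
def wrap_root_array(schema: dict, name: str = "items") -> dict:
    """Wrap a root-level array schema in an object with one array property."""
    if schema.get("type") != "array":
        return schema  # already object-rooted, leave unchanged
    return {"type": "object", "properties": {name: schema}}


root_array = {"type": "array", "items": {"type": "object"}}
wrapped = wrap_root_array(root_array, name="transactions")
```

After wrapping, the extracted list would be found under the chosen property name (here `transactions`) rather than at the root of the result.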