Array Extraction

Array Extraction is being deprecated. We recommend using Deep Extract instead, which provides higher accuracy on complex and long extractions through an agentic loop.

Array extraction is a mode specifically designed for extracting arrays (lists of items) from documents. It exists because LLMs have context limits that cause them to truncate long lists.

The core problem: if you ask an LLM to extract 500 transactions from a bank statement, it might return only the first 50-100 before stopping. Array extraction solves this by splitting the document into segments, extracting the array items from each segment, then merging results.

This only affects array fields in your schema. Scalar fields (strings, numbers, single objects) are still extracted from the full document context normally.

When to Use

Turn array extraction on by setting array_extract to True in settings:

result = client.extract.run(
    input=upload,
    instructions={"schema": schema},
    settings={"array_extract": True}
)

Enable it when:

Your schema has array fields with many items (50+)
Extraction results look truncated or end abruptly
Tables span multiple pages

If you’re extracting a few scalar fields like “invoice_number” and “total_amount”, you don’t need this.

How It Works

Segment the document into overlapping page ranges
Extract array items from each segment independently
Merge all array items together
Deduplicate items that appeared in overlapping regions

Segments overlap at boundaries to catch items that span page breaks. If a table row starts on page 10 and continues to page 11, both segments will capture it, and deduplication removes the duplicate.
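The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Reducto's implementation: the segment size, the overlap width, the `segment_pages` and `extract_all` helpers, and the `extract_segment` callback are all hypothetical, and exact-match comparison stands in for the content-similarity check, whose details are not described here.

```python
from typing import Callable, Dict, List, Tuple

Item = Dict[str, object]


def segment_pages(num_pages: int, size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (start, end) page ranges that share `overlap` pages."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    ranges = []
    start = 1
    while start <= num_pages:
        end = min(start + size - 1, num_pages)
        ranges.append((start, end))
        if end == num_pages:
            break
        start = end - overlap + 1  # step back so consecutive ranges share pages
    return ranges


def extract_all(
    num_pages: int,
    size: int,
    overlap: int,
    extract_segment: Callable[[int, int], List[Item]],
) -> List[Item]:
    """Extract items per segment, merge them, and drop repeats from overlaps."""
    seen = set()
    merged: List[Item] = []
    for start, end in segment_pages(num_pages, size, overlap):
        for item in extract_segment(start, end):
            # Assumption: exact field equality stands in for content similarity.
            key = tuple(sorted(item.items()))
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


def fake_extract(start: int, end: int) -> List[Item]:
    """Hypothetical extractor that finds one row per page."""
    return [{"row": page} for page in range(start, end + 1)]


ranges = segment_pages(12, size=5, overlap=1)
rows = extract_all(12, size=5, overlap=1, extract_segment=fake_extract)
```

With a 12-page document, pages 5 and 9 each fall into two segments, so their rows are extracted twice and then collapsed back to one copy during the merge.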

Schema Requirements

Your schema must have at least one top-level array property:

schema = {
    "type": "object",
    "properties": {
        "transactions": {  # This array is extracted segment-by-segment
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "amount": {"type": "number"}
                }
            }
        },
        "account_number": {"type": "string"}  # This scalar is extracted normally
    }
}

The schema root must be an object, not an array:

# Wrong - will error
{"type": "array", "items": {...}}

# Correct
{"type": "object", "properties": {"items": {"type": "array", ...}}}
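Both requirements can be checked locally before sending a request. The helper below is a minimal sketch: `check_array_extract_schema` is a hypothetical name, not part of the Reducto SDK, and it only tests the two rules stated in this section.

```python
from typing import List


def check_array_extract_schema(schema: dict) -> List[str]:
    """Return a list of problems; an empty list means both rules are satisfied."""
    problems = []
    if schema.get("type") != "object":
        problems.append("schema root must have type 'object'")
        return problems
    props = schema.get("properties", {})
    has_array = any(
        isinstance(prop, dict) and prop.get("type") == "array"
        for prop in props.values()
    )
    if not has_array:
        problems.append("no top-level property has type 'array'")
    return problems


good = {
    "type": "object",
    "properties": {
        "transactions": {"type": "array", "items": {"type": "object"}},
        "account_number": {"type": "string"},
    },
}
root_is_array = {"type": "array", "items": {"type": "object"}}
scalars_only = {"type": "object", "properties": {"total": {"type": "number"}}}

print(check_array_extract_schema(good))
print(check_array_extract_schema(root_is_array))
print(check_array_extract_schema(scalars_only))
```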

Deduplication

When segments overlap, the same item may be extracted twice. Reducto deduplicates using content similarity.

Problem: If your document has legitimately identical items (two transactions with the same date and amount), deduplication might incorrectly merge them.

Solution: Add distinguishing fields like line numbers or IDs:

"items": {
    "type": "object",
    "properties": {
        "line_number": {"type": "integer", "description": "Row number if visible"},
        "date": {"type": "string"},
        "amount": {"type": "number"}
    }
}
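To see why a distinguishing field helps, consider a toy deduplicator. This is a sketch under a simplifying assumption: it treats two items as duplicates only when every field matches exactly, whereas the real similarity check is not specified here. The `dedupe` function and the sample rows are hypothetical.

```python
from typing import Dict, List

Item = Dict[str, object]


def dedupe(items: List[Item]) -> List[Item]:
    """Keep the first occurrence of each item; exact equality stands in for similarity."""
    seen = set()
    unique = []
    for item in items:
        key = tuple(sorted(item.items()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Two genuinely separate transactions that happen to look identical.
without_ids = [
    {"date": "2024-01-05", "amount": 20.0},
    {"date": "2024-01-05", "amount": 20.0},
]

# The same two transactions, each carrying its visible row number.
with_ids = [
    {"line_number": 14, "date": "2024-01-05", "amount": 20.0},
    {"line_number": 15, "date": "2024-01-05", "amount": 20.0},
]

kept_without = dedupe(without_ids)
kept_with = dedupe(with_ids)
```

Without the row number, the second transaction is dropped as a duplicate; with it, both survive, while a row captured twice in an overlapping region would still share a line number and collapse to one.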

With Citations

Array extraction works with citations. Each item retains its source location:

result = client.extract.run(
    input=upload,
    instructions={"schema": schema},
    settings={
        "array_extract": True,
        "citations": {"enabled": True}
    }
)

for item in result.result["transactions"]:
    if item["amount"].citations:
        page = item["amount"].citations[0].bbox.page
        print(f"${item['amount'].value} on page {page}")

Troubleshooting

Still missing items:

Check Parse output first (client.parse.run). Extract can only find what Parse sees.
Add system prompt: “Extract every item. Do not skip any rows.”

Duplicate items:

Add unique identifiers (line numbers, IDs) to help differentiation.

Schema error:

Ensure at least one "type": "array" property exists at the top level.
