Array extraction is a mode specifically designed for extracting arrays (lists of items) from documents. It exists because LLMs have context limits that cause them to truncate long lists.

The core problem: if you ask an LLM to extract 500 transactions from a bank statement, it might return only the first 50-100 before stopping. Array extraction solves this by splitting the document into segments, extracting the array items from each segment, then merging the results.

This only affects array fields in your schema. Scalar fields (strings, numbers, single objects) are still extracted from the full document context as usual.

When to Use

Turn it on by setting array_extract to True in settings:
result = client.extract.run(
    input=upload,
    instructions={"schema": schema},
    settings={"array_extract": True}
)
Enable it when:
  • Your schema has array fields with many items (50+)
  • Extraction results look truncated or end abruptly
  • Tables span multiple pages
If you’re extracting a few scalar fields like “invoice_number” and “total_amount”, you don’t need this.

How It Works

  1. Segment the document into overlapping page ranges
  2. Extract array items from each segment independently
  3. Merge all array items together
  4. Deduplicate items that appeared in overlapping regions
Segments overlap at boundaries to catch items that span page breaks. If a table row starts on page 10 and continues to page 11, both segments will capture it, and deduplication removes the duplicate.
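The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Reducto's implementation: the segment size, the overlap width, the helper names (make_segments, merge_segments, extract_segment), and the exact-match deduplication are all assumptions made for the example. The real service deduplicates by content similarity rather than exact equality.

```python
from typing import Callable


def make_segments(num_pages: int, size: int, overlap: int) -> list[range]:
    """Split pages 1..num_pages into ranges of `size` pages sharing `overlap` pages."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    segments = []
    start = 1
    while start <= num_pages:
        end = min(start + size - 1, num_pages)
        segments.append(range(start, end + 1))
        if end == num_pages:
            break
        # Step back by `overlap` pages so boundary rows land in both segments.
        start = end + 1 - overlap
    return segments


def merge_segments(
    num_pages: int,
    extract_segment: Callable[[range], list[dict]],
    size: int = 5,
    overlap: int = 1,
) -> list[dict]:
    """Extract items per segment, merge them in order, and drop exact repeats."""
    merged = []
    seen = set()
    for segment in make_segments(num_pages, size, overlap):
        for item in extract_segment(segment):
            # Exact-match key; a stand-in for similarity-based deduplication.
            key = tuple(sorted(item.items()))
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


if __name__ == "__main__":
    # Hypothetical extractor: pretend every page holds exactly one row.
    def fake_extract(segment: range) -> list[dict]:
        return [{"page": p, "amount": p * 10} for p in segment]

    print([(s.start, s.stop - 1) for s in make_segments(12, 5, 1)])
    print(len(merge_segments(12, fake_extract)))
```

With 12 pages, a segment size of 5, and an overlap of 1, pages 5 and 9 each fall into two segments, so their rows are extracted twice and then collapsed back to one copy during the merge.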

Schema Requirements

Your schema must have at least one top-level array property:
schema = {
    "type": "object",
    "properties": {
        "transactions": {  # This array is extracted segment-by-segment
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "amount": {"type": "number"}
                }
            }
        },
        "account_number": {"type": "string"}  # This scalar is extracted normally
    }
}
The schema root must be an object, not an array:
# Wrong - will error
{"type": "array", "items": {...}}

# Correct
{"type": "object", "properties": {"items": {"type": "array", ...}}}

Deduplication

When segments overlap, the same item may be extracted twice. Reducto deduplicates using content similarity.
Problem: If your document has legitimately identical items (two transactions with the same date and amount), deduplication might incorrectly merge them.
Solution: Add distinguishing fields like line numbers or IDs:
"items": {
    "type": "object",
    "properties": {
        "line_number": {"type": "integer", "description": "Row number if visible"},
        "date": {"type": "string"},
        "amount": {"type": "number"}
    }
}
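The toy example below shows why a distinguishing field matters. It uses a simple exact-match dedupe function as a stand-in for Reducto's similarity-based deduplication, so treat it as an illustration of the failure mode rather than the actual algorithm; the sample rows are made up.

```python
def dedupe(items: list[dict]) -> list[dict]:
    """Keep the first occurrence of each item, comparing all fields exactly."""
    seen = set()
    unique = []
    for item in items:
        key = tuple(sorted(item.items()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Two genuinely separate purchases with the same date and amount.
without_ids = [
    {"date": "2024-01-05", "amount": 4.5},
    {"date": "2024-01-05", "amount": 4.5},
]

# The same purchases with a row number, plus row 18 extracted a second time
# because it sat in the overlap between two segments.
with_ids = [
    {"line_number": 17, "date": "2024-01-05", "amount": 4.5},
    {"line_number": 18, "date": "2024-01-05", "amount": 4.5},
    {"line_number": 18, "date": "2024-01-05", "amount": 4.5},
]

if __name__ == "__main__":
    print(len(dedupe(without_ids)))  # 1: a real transaction was lost
    print(len(dedupe(with_ids)))     # 2: only the overlap duplicate was dropped
```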

With Citations

Array extraction works with citations. Each item retains its source location:
result = client.extract.run(
    input=upload,
    instructions={"schema": schema},
    settings={
        "array_extract": True,
        "citations": {"enabled": True}
    }
)

for item in result.result["transactions"]:
    if item["amount"].citations:
        page = item["amount"].citations[0].bbox.page
        print(f"${item['amount'].value} on page {page}")

Troubleshooting

Still missing items:
  1. Check Parse output first (client.parse.run). Extract can only find what Parse sees.
  2. Add system prompt: “Extract every item. Do not skip any rows.”
  3. For critical data, use Agent-in-the-Loop which iteratively verifies completeness.
Duplicate items: Add unique identifiers (line numbers, IDs) so that similar items can be told apart.
Schema error: Ensure at least one "type": "array" property exists at the top level.