# Array Extraction

> Extract complete arrays from long documents without truncation

<Warning>
  Array Extraction is being deprecated. We recommend using [Deep Extract](/configs/extract/deep-extract) instead, which provides higher accuracy on complex and long extractions through an agentic loop.
</Warning>

Array extraction is a mode specifically designed for extracting **arrays** (lists of items) from documents. It exists because LLMs have context limits that cause them to truncate long lists.

The core problem: if you ask an LLM to extract 500 transactions from a bank statement, it might return only the first 50-100 before stopping. Array extraction solves this by splitting the document into segments, extracting the array items from each segment, then merging results.

**This only affects array fields in your schema.** Scalar fields (strings, numbers, single objects) are still extracted from the full document context normally.

## When to Use

Turn on array extraction by setting `array_extract` to `True` in `settings`:

```python
result = client.extract.run(
    input=upload,
    instructions={"schema": schema},
    settings={"array_extract": True}
)
```

Enable it when:

* Your schema has array fields with many items (50+)
* Extraction results look truncated or end abruptly
* Tables span multiple pages

If you're extracting a few scalar fields like `invoice_number` and `total_amount`, you don't need this.

## How It Works

1. **Segment** the document into overlapping page ranges
2. **Extract** array items from each segment independently
3. **Merge** all array items together
4. **Deduplicate** items that appeared in overlapping regions

Segments overlap at boundaries to catch items that span page breaks. If a table row starts on page 10 and continues to page 11, both segments will capture it, and deduplication removes the duplicate.
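
The four steps above can be sketched in plain Python. This is an illustration only, not Reducto's implementation: the segment size, the overlap, and the helper names `overlapping_segments` and `merge_and_dedupe` are assumptions made for the example, and exact-match comparison stands in for whatever similarity check the service actually applies.

```python
from typing import Iterator


def overlapping_segments(num_pages: int, size: int, overlap: int) -> Iterator[tuple[int, int]]:
    """Yield inclusive, 1-based (start, end) page ranges sharing `overlap` pages."""
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    start = 1
    while True:
        end = min(start + size - 1, num_pages)
        yield (start, end)
        if end == num_pages:
            break
        # Step back by `overlap` pages so boundary rows land in both segments.
        start = end - overlap + 1


def merge_and_dedupe(per_segment_items: list[list[dict]]) -> list[dict]:
    """Concatenate items from every segment, dropping exact repeats (hypothetical rule)."""
    seen = set()
    merged = []
    for items in per_segment_items:
        for item in items:
            key = tuple(sorted(item.items()))  # order-independent fingerprint
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


# A 25-page document, 10 pages per segment, 1 page of overlap.
print(list(overlapping_segments(25, 10, 1)))  # [(1, 10), (10, 19), (19, 25)]
```

In this sketch, a row on page 10 is seen by both the first and second segment, so it is extracted twice and then collapsed to one item during the merge.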

## Schema Requirements

Your schema must have at least one top-level array property:

```python
schema = {
    "type": "object",
    "properties": {
        "transactions": {  # This array is extracted segment-by-segment
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "amount": {"type": "number"}
                }
            }
        },
        "account_number": {"type": "string"}  # This scalar is extracted normally
    }
}
```

The schema root must be an object, not an array:

```python
# Wrong - will error
{"type": "array", "items": {...}}

# Correct
{"type": "object", "properties": {"items": {"type": "array", ...}}}
```
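
Both requirements can be checked locally before sending a request. The helper below is a hypothetical convenience (`top_level_array_fields` is not part of the Reducto SDK) that inspects a schema dictionary and raises early if either rule is violated.

```python
def top_level_array_fields(schema: dict) -> list[str]:
    """Return the names of top-level array properties, or raise if the schema won't qualify."""
    if schema.get("type") != "object":
        raise ValueError("Schema root must be an object, not %r" % schema.get("type"))
    properties = schema.get("properties", {})
    arrays = [
        name
        for name, spec in properties.items()
        if isinstance(spec, dict) and spec.get("type") == "array"
    ]
    if not arrays:
        raise ValueError("Schema needs at least one top-level array property")
    return arrays


schema = {
    "type": "object",
    "properties": {
        "transactions": {"type": "array", "items": {"type": "object"}},
        "account_number": {"type": "string"},
    },
}
print(top_level_array_fields(schema))  # ['transactions']
```

Note that this only looks at the top level, matching the requirement above. Arrays nested inside other objects are not counted.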

## Deduplication

When segments overlap, the same item may be extracted twice. Reducto deduplicates using content similarity.

**Problem:** If your document has legitimately identical items (two transactions with the same date and amount), deduplication might incorrectly merge them.

**Solution:** Add distinguishing fields like line numbers or IDs:

```python
"items": {
    "type": "object",
    "properties": {
        "line_number": {"type": "integer", "description": "Row number if visible"},
        "date": {"type": "string"},
        "amount": {"type": "number"}
    }
}
```
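
To see why a distinguishing field helps, here is a small sketch that uses exact matching as a simplified stand-in for content similarity. The `dedupe_exact` function and the sample rows are made up for illustration and do not reflect how Reducto compares items internally.

```python
def dedupe_exact(items: list[dict]) -> list[dict]:
    """Drop items whose fields all match an earlier item (simplified stand-in)."""
    seen = set()
    unique = []
    for item in items:
        key = tuple(sorted(item.items()))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Two genuine purchases on the same day for the same amount.
without_ids = [
    {"date": "2024-03-01", "amount": 4.5},
    {"date": "2024-03-01", "amount": 4.5},
]
with_ids = [
    {"line_number": 1, "date": "2024-03-01", "amount": 4.5},
    {"line_number": 2, "date": "2024-03-01", "amount": 4.5},
]

print(len(dedupe_exact(without_ids)))  # 1: the second purchase is lost
print(len(dedupe_exact(with_ids)))     # 2: line_number keeps them distinct
```

A true duplicate from an overlapping segment would carry the same `line_number` as the first copy, so it would still be removed.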

## With Citations

Array extraction works with citations. Each item retains its source location:

```python
result = client.extract.run(
    input=upload,
    instructions={"schema": schema},
    settings={
        "array_extract": True,
        "citations": {"enabled": True}
    }
)

for item in result.result["transactions"]:
    if item["amount"].citations:
        page = item["amount"].citations[0].bbox.page
        print(f"${item['amount'].value} on page {page}")
```

## Troubleshooting

**Still missing items:**

1. Check Parse output first (`client.parse.run`). Extract can only find what Parse sees.
2. Add a system prompt such as: "Extract every item. Do not skip any rows."

**Duplicate items:** Add unique identifiers (line numbers, IDs) to the item schema so deduplication can tell similar items apart.

**Schema error:** Ensure at least one `"type": "array"` property exists at the top level.
