When to Use Array Extraction
Enable array_extract when your document has:
- Long lists or tables: Invoices with 50+ line items, transaction logs, inventory reports
- Data spanning multiple pages: Tables that continue across page breaks
- Dense content: Documents where standard extraction misses items toward the end
How It Works
Array extraction breaks the document into overlapping segments, extracts from each segment independently, then merges the results while removing duplicates. The process:
- Segment the document into overlapping page ranges (default: 10 pages per segment with 1 page overlap)
- Extract from each segment using your schema
- Merge results by combining array items and deduplicating based on content similarity
- Return unified output that looks identical to a standard extraction
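The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: the function names segment_pages and merge_items are hypothetical, the exact segment boundaries are an assumption based on the stated defaults, and exact-match comparison stands in for the content-similarity deduplication, whose details are not described here.

```python
import json


def segment_pages(total_pages, pages_per_segment=10, overlap=1):
    """Return inclusive (start, end) page ranges that share `overlap` pages."""
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    if pages_per_segment < 1 or not 0 <= overlap < pages_per_segment:
        raise ValueError("need pages_per_segment >= 1 and 0 <= overlap < pages_per_segment")
    step = pages_per_segment - overlap  # how far each new segment advances
    segments = []
    start = 1
    while True:
        end = min(start + pages_per_segment - 1, total_pages)
        segments.append((start, end))
        if end == total_pages:
            return segments
        start += step


def merge_items(per_segment_items):
    """Concatenate items from every segment, dropping exact repeats.

    Assumption: exact matching on the serialized item is a simple stand-in
    for similarity-based deduplication.
    """
    seen = set()
    merged = []
    for items in per_segment_items:
        for item in items:
            key = json.dumps(item, sort_keys=True)
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


if __name__ == "__main__":
    print(segment_pages(25))  # [(1, 10), (10, 19), (19, 25)]
```

With these defaults, a 25-page document yields three segments, and pages 10 and 19 are each processed twice. Items on those shared pages can be extracted by both neighbouring segments, which is why the merge step has to deduplicate.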
Schema Requirements
Array extraction requires your schema to have at least one property of type array at the top level.
Non-array fields (account_number, closing_balance) are extracted from the full document context. Only array fields (transactions) are extracted segment by segment.
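To make the split between full-document fields and segment-level fields concrete, here is a minimal sketch that assumes a JSON Schema style schema written as a Python dict. Only account_number, closing_balance, and transactions come from the text above; the item fields (date, description, amount), the descriptions, and the helper split_fields are hypothetical and shown for illustration only.

```python
# Hypothetical bank-statement schema: two non-array fields and one
# top-level array field.
BANK_STATEMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "account_number": {"type": "string", "description": "Account number on the statement"},
        "closing_balance": {"type": "number", "description": "Balance at the end of the period"},
        "transactions": {
            "type": "array",
            "description": "Every transaction listed on the statement",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}


def split_fields(schema):
    """Return (array_fields, other_fields) for the top-level properties."""
    props = schema.get("properties", {})
    array_fields = [name for name, spec in props.items() if spec.get("type") == "array"]
    other_fields = [name for name in props if name not in array_fields]
    return array_fields, other_fields


if __name__ == "__main__":
    arrays, others = split_fields(BANK_STATEMENT_SCHEMA)
    print("segment by segment:", arrays)
    print("full document:", others)
```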
Complete Example
Here’s a full extraction for a bank statement with many transactions:
Combining with Citations
Array extraction works with citations, but with some constraints. The default array extraction mode supports citations fully:
When Array Extraction Isn’t Enough
For truly critical arrays where you cannot afford to miss any items, consider Agent-in-the-Loop. It uses an AI agent to iteratively verify completeness. The tradeoffs:

| Approach | Speed | Completeness | Use case |
|---|---|---|---|
| Standard extraction | Fastest | May miss items in long docs | Short documents, non-critical arrays |
| Array extraction | Moderate | Good for most cases | Long documents, tables spanning pages |
| Agent-in-the-loop | Slowest | Highest | Financial data, compliance, audit trails |
Troubleshooting
Still missing items with array_extract enabled
If array extraction still misses items:
- Check the Parse output: The items may not be visible to Extract. Run Parse separately and verify all items appear in the content.
- Improve field descriptions: Vague descriptions make it harder to identify items. Add specific details about what to look for.
- Add system prompt guidance: Tell the model to be thorough: “Extract every transaction in the document. Do not skip any items.”
- Consider agent-in-the-loop: For critical data, the agent approach provides the highest completeness guarantee.
Duplicate items in results
Array extraction deduplicates based on content similarity. If you’re seeing duplicates:
- Items may be legitimately similar: Two transactions on the same day with the same amount are distinct items, not duplicates.
- Try adding unique identifiers: If transactions have IDs or line numbers, include them in your schema. This helps differentiation.
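The effect of unique identifiers can be illustrated with a small sketch. It assumes exact-match deduplication as a simplified stand-in for the content-similarity comparison, so the real behaviour may differ; the function dedupe_exact and the transaction_id field are hypothetical names.

```python
import json


def dedupe_exact(items):
    """Keep the first occurrence of each item, comparing all fields exactly."""
    seen = set()
    unique = []
    for item in items:
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Two genuine transactions on the same day for the same amount.
without_ids = [
    {"date": "2024-03-01", "amount": -4.50},
    {"date": "2024-03-01", "amount": -4.50},
]

# The same two transactions with a hypothetical transaction_id field, plus a
# repeat of the first one as it might appear again on an overlap page.
with_ids = [
    {"transaction_id": "T-001", "date": "2024-03-01", "amount": -4.50},
    {"transaction_id": "T-002", "date": "2024-03-01", "amount": -4.50},
    {"transaction_id": "T-001", "date": "2024-03-01", "amount": -4.50},
]

if __name__ == "__main__":
    print(len(dedupe_exact(without_ids)))  # 1
    print(len(dedupe_exact(with_ids)))     # 2
```

Without an identifier, the two genuine transactions are indistinguishable from a repeat. With one, the true repeat from the overlap page is dropped and both distinct transactions survive.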
Error: Schema doesn't have a top-level array
Array extraction requires at least one property of type array at the top level of your schema. A schema whose root is itself an array does not work: the schema root must be an object with array properties, not an array itself. A root object that contains at least one array property works.
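The failing and working shapes can be compared with a short check. This is a sketch that assumes JSON Schema style schemas written as Python dicts; the helper top_level_array_fields and both example schemas are hypothetical, and the service's own validation may differ.

```python
def top_level_array_fields(schema):
    """Return names of array properties directly under an object root.

    An empty list means the schema does not meet the requirement.
    """
    if schema.get("type") != "object":
        return []  # root is not an object, for example a bare array
    props = schema.get("properties", {})
    return [name for name, spec in props.items() if spec.get("type") == "array"]


# Root is an array: no top-level array property exists.
ROOT_ARRAY_SCHEMA = {
    "type": "array",
    "items": {"type": "object", "properties": {"amount": {"type": "number"}}},
}

# Root is an object with an array property.
ROOT_OBJECT_SCHEMA = {
    "type": "object",
    "properties": {
        "transactions": {
            "type": "array",
            "items": {"type": "object", "properties": {"amount": {"type": "number"}}},
        }
    },
}

if __name__ == "__main__":
    print(top_level_array_fields(ROOT_ARRAY_SCHEMA))   # []
    print(top_level_array_fields(ROOT_OBJECT_SCHEMA))  # ['transactions']
```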