Array Extraction
Extract large amounts of data from a document by breaking it into segments and running extraction on each segment.
Our extraction endpoint excels when it operates on a concise set of fields from a short document.
However, not all documents are structured this way. We may want to extract a large number of fields across an entire document. For example, let’s say we want to extract a list of all transactions from a receipt that is 200 pages long.
In this case we’d want to use our array_extract functionality, which breaks the document into pieces, runs extraction on each piece, and then concatenates the per-piece results when returning the final result.
We take the provided schema and break it down into two components:
- Array fields - fields that represent a list of objects (e.g. transactions, items, customers)
- Top-level fields - fields that are not part of an array (e.g. total, customer_name, date)
The top-level fields are extracted from the full document, while the array fields are extracted in parallel from multiple segments of the document. This process can be configured with a few parameters described below.
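The split described above can be sketched as follows. This is a minimal illustration, not the service's actual implementation: it assumes the schema is a JSON Schema style dictionary with a "properties" key, and the helper name split_schema is hypothetical.

```python
from typing import Any


def split_schema(schema: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Partition schema properties into array fields and top-level fields.

    Hypothetical helper: assumes a JSON Schema style layout where each
    property declares its kind under the "type" key.
    """
    array_fields: dict[str, Any] = {}
    top_level_fields: dict[str, Any] = {}
    for name, spec in schema.get("properties", {}).items():
        if spec.get("type") == "array":
            array_fields[name] = spec
        else:
            top_level_fields[name] = spec
    return array_fields, top_level_fields


# Illustrative schema using the field names mentioned above.
schema = {
    "type": "object",
    "properties": {
        "total": {"type": "number"},
        "customer_name": {"type": "string"},
        "date": {"type": "string"},
        "transactions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
}

array_fields, top_level_fields = split_schema(schema)
print(list(array_fields))      # fields extracted per segment
print(list(top_level_fields))  # fields extracted from the full document
```

Under these assumptions, transactions would be extracted segment by segment, while total, customer_name, and date would be extracted once from the whole document.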
Array Extraction Modes
- legacy: This is the default mode. It breaks the document into segments (with overlap) and runs extraction on each segment. For the regions with overlapping data, it then runs a naive algorithm to deduplicate the results. However, this deduplication can sometimes miss items and result in extra/duplicated information.
- no_overlap: This mode breaks the document into segments (without overlap) and runs extraction on each segment. This is the most reliable mode, but it can perform poorly if the data for a single array element extends across pages (e.g. a single transaction that spans two pages).
- streaming: This mode uses an agentic extraction model. It is the most unpredictable, but can be very powerful in cases where a single table is extremely long (e.g. a very long Excel spreadsheet).
Pages Per Segment
The pages_per_segment parameter is used to control the number of pages that are extracted in each segment. Start with the default, but if your pages are very dense with information, you will likely want to increase this number.
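As a rough guide to how this parameter affects the amount of work, the sketch below estimates the number of segments for the 200-page example. It assumes non-overlapping segments, and the values tried are purely illustrative (they are not the service's defaults); segment_count is a hypothetical helper.

```python
import math


def segment_count(num_pages: int, pages_per_segment: int) -> int:
    """Number of non-overlapping segments needed to cover num_pages."""
    if pages_per_segment <= 0:
        raise ValueError("pages_per_segment must be positive")
    return math.ceil(num_pages / pages_per_segment)


# Illustrative values only, for a 200-page document.
counts = {pps: segment_count(200, pps) for pps in (5, 10, 25)}
for pps, count in counts.items():
    print(f"pages_per_segment={pps}: {count} segments")
```

Larger values mean fewer segments, each carrying more pages into a single extraction pass.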