Array extraction is an extraction mode that works well for:

  • Long documents with fields across many pages
  • Dense/complex tables with many line items

In these cases, enable array_extract, which helps return all of the information you need without skipping pages.

Experimenting with your system prompt can also help with longer documents.

Under the hood, array_extract intelligently splits your long document into segments, runs extraction on each segment, and merges the results back together, preserving the original structure in edge cases (e.g. tables that span page breaks).
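The split, extract, and merge flow described above can be sketched in a few lines. This is only a conceptual illustration, not the actual array_extract implementation: the function names, the fixed pages_per_segment size, and the merge rule (concatenate lists, keep the first non-empty scalar) are all assumptions made for the example, and it does not attempt the smarter handling of tables that cross page breaks.

```python
from typing import Callable


def split_into_segments(pages: list[str], pages_per_segment: int) -> list[list[str]]:
    """Split a list of pages into consecutive fixed-size segments (hypothetical rule)."""
    return [
        pages[i:i + pages_per_segment]
        for i in range(0, len(pages), pages_per_segment)
    ]


def merge_segment_results(results: list[dict]) -> dict:
    """Merge per-segment results: concatenate arrays, keep the first non-empty scalar."""
    merged: dict = {}
    for result in results:
        for key, value in result.items():
            if isinstance(value, list):
                merged[key] = merged.get(key, []) + value
            elif value is not None and key not in merged:
                merged[key] = value
    return merged


def extract_long_document(
    pages: list[str],
    extract_fn: Callable[[list[str]], dict],
    pages_per_segment: int = 2,
) -> dict:
    """Run extract_fn on each segment and merge the partial results."""
    segments = split_into_segments(pages, pages_per_segment)
    return merge_segment_results([extract_fn(segment) for segment in segments])


def fake_extract(segment: list[str]) -> dict:
    """Stand-in extractor: one line item per page, total only on the last page."""
    return {
        "invoice_line_items": [{"item_name": page} for page in segment],
        "total_cost": 10.0 if "page-3" in segment else None,
    }


result = extract_long_document(["page-1", "page-2", "page-3"], fake_extract)
print(result)
```

The point of the sketch is that array fields grow as segments are processed, while single-value fields such as a total are taken from whichever segment contains them.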

Your schema must have an array field ([]) at the top level for array_extract to work; otherwise the call fails with an error.
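To catch this before sending a request, a small client-side check can help. The helper below is hypothetical and not part of any Reducto SDK; it assumes "top level" means either that the schema root is itself an array or that one of the root object's direct properties is an array, as in the example on this page. The server's own validation may differ.

```python
def has_top_level_array(schema: dict) -> bool:
    """Return True if the schema root, or one of its direct properties, is an array."""
    if schema.get("type") == "array":
        return True
    properties = schema.get("properties", {})
    return any(
        isinstance(prop, dict) and prop.get("type") == "array"
        for prop in properties.values()
    )


# An array directly under the root object passes the check.
valid_schema = {
    "type": "object",
    "properties": {
        "invoice_line_items": {"type": "array", "items": {"type": "object"}},
        "total_cost": {"type": "number"},
    },
}

# An array buried one level deeper does not.
nested_only_schema = {
    "type": "object",
    "properties": {
        "invoice": {
            "type": "object",
            "properties": {"line_items": {"type": "array"}},
        }
    },
}

print(has_top_level_array(valid_schema))
print(has_top_level_array(nested_only_schema))
```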

Example extract call with array_extract enabled

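import os

import requests

# Read the API key from the environment (the variable name is an assumption; use your own).
REDUCTO_API_KEY = os.environ["REDUCTO_API_KEY"]

headers = {"Authorization": f"Bearer {REDUCTO_API_KEY}"}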

schema = {
  "type": "object",
  "properties": {
    "invoice_line_items": {
      "type": "array",
      "description": "List of charges in an invoice table.",
      "items": {
        "type": "object",
        "properties": {
          "item_name": {
            "type": "string",
            "description": "Name of the item type."
          },
          "item_cost": {
            "type": "number",
            "description": "Cost per item."
          },
          "item_quantity": {
            "type": "number",
            "description": "Number of units per item type."
          }
        }
      }
    },
    "total_cost": {
      "type": "number",
      "description": "Total ending cost."
    }
  }
}

extract_response = requests.post(
    "https://platform.reducto.ai/extract",
    json={
        "document_url": "SAMPLE_LONG_PDF_URL",
        "schema": schema,
        "array_extract": {
            "enabled": True
        },
    },
    headers=headers,
)

# Raise on HTTP error responses before reading the body.
extract_response.raise_for_status()
print(extract_response.json())