Extraction Overview

Extract pulls specific data out of your documents and returns it as JSON. If you want to extract different fields from the same document, see our pipelining documentation. If your document combines many subsections, see our splitting endpoint.

Example Use cases

Extracting important numbers and statistics from a patient lab report.
Extracting the rows and line items in an invoice.
Extracting key clauses and prices from a contract.

Key Features

Under the hood, an extract call first performs a /parse and then extracts your specified fields.

schema: A JSON schema that defines the fields and structure of your output.
system_prompt: An overall system prompt that helps our models better understand your document structure.
Special configurations: array_extract and generate_citations

Read our best practices guide to learn how to structure and configure your extract calls.
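
To make the schema idea concrete, here is a minimal sketch of a schema for the invoice use case mentioned earlier. The field names (invoiceNumber, lineItems, and so on) are hypothetical, and field_paths is an illustrative helper written for this page, not part of the Reducto SDK. It only lists the fields a schema declares, which can be handy for reviewing a schema before sending it.

```python
from typing import Iterator

# Hypothetical schema for invoice line items; field names are illustrative only.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoiceNumber": {
            "type": "string",
            "description": "The invoice identifier printed on the document.",
        },
        "lineItems": {
            "type": "array",
            "description": "One entry per row in the invoice table.",
            "items": {
                "type": "object",
                "properties": {
                    "description": {
                        "type": "string",
                        "description": "Text describing the item or service.",
                    },
                    "quantity": {
                        "type": "number",
                        "description": "Number of units billed.",
                    },
                    "amount": {
                        "type": "number",
                        "description": "Line total for this row.",
                    },
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["invoiceNumber", "lineItems"],
}


def field_paths(schema: dict, prefix: str = "") -> Iterator[str]:
    """Yield a dotted path for every leaf field declared in a JSON schema."""
    if schema.get("type") == "object":
        for name, sub in schema.get("properties", {}).items():
            yield from field_paths(sub, f"{prefix}{name}.")
    elif schema.get("type") == "array":
        # Mark array levels with [] so nested fields are easy to spot.
        yield from field_paths(schema.get("items", {}), f"{prefix[:-1]}[].")
    else:
        yield prefix.rstrip(".")


for path in field_paths(invoice_schema):
    print(path)
```

Listing the paths this way is just a review aid; the schema dictionary itself is what you would pass as the schema argument.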

Debugging FAQ

Why are the outputs different when I run it multiple times with the same schema?

Why is extraction only working on the first X pages? I want the whole document.

Why are my extraction results missing values present in my schema?

How do I extract different schemas depending on the document type?

What's the best way to approach handling edge cases?

Can I manipulate data directly in the field descriptions or system prompt?

Example

Let’s say you want to extract all of a customer’s financial accounts from a statement. You can see the output in our playground example, but your schema and code might look like this:

from pathlib import Path
from reducto import Reducto

client = Reducto()

system_prompt = "Be precise and thorough."
schema = {
  "type": "object",
  "properties": {
    "customerName": {
      "type": "string",
      "description": "The full name of the customer as registered in their financial account."
    },
    "accounts": {
      "type": "array",
      "description": "A list of the customer's financial accounts.",
      "items": {
        "type": "object",
        "properties": {
          "accountType": {
            "type": "string",
            "description": "Type of financial account, such as checking, savings, investment, etc."
          },
          "accountNumber": {
            "type": "string",
            "description": "The unique identifier for the financial account."
          },
          "endingValue": {
            "type": "number",
            "description": "The closing balance or value of the account."
          }
        },
        "required": [
          "accountType",
          "accountNumber",
          "endingValue"
        ]
      }
    }
  },
  "required": [
    "customerName",
    "accounts"
  ]
}

# Upload the local file, then pass the upload reference to the extract call.
upload = client.upload(file=Path("sample.pdf"))
result = client.extract.run(
    document_url=upload,
    schema=schema,
    system_prompt=system_prompt,
    generate_citations=True,
)

print(result)
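
Once you have extracted JSON in hand, a quick sanity check against the schema's required fields can catch missing values early. The sketch below assumes you have already obtained a plain dictionary shaped like the schema above; how the SDK response wraps that data is not shown here, and sample_output is made-up data, not real extraction output. Both helper functions are hypothetical and written for illustration.

```python
from typing import List

# Made-up payload shaped like the schema above; not real extraction output.
sample_output = {
    "customerName": "Jane Doe",
    "accounts": [
        {"accountType": "checking", "accountNumber": "0001", "endingValue": 1250.50},
        {"accountType": "savings", "accountNumber": "0002", "endingValue": 8000.00},
    ],
}

# Fields the example schema marks as required on each account.
ACCOUNT_KEYS = ("accountType", "accountNumber", "endingValue")


def missing_account_fields(output: dict) -> List[str]:
    """Return paths of required account fields that are absent or None."""
    problems = []
    for i, account in enumerate(output.get("accounts", [])):
        for key in ACCOUNT_KEYS:
            if account.get(key) is None:
                problems.append(f"accounts[{i}].{key}")
    return problems


def total_ending_value(output: dict) -> float:
    """Sum endingValue across all accounts in the payload."""
    return sum(account["endingValue"] for account in output.get("accounts", []))


print(missing_account_fields(sample_output))
print(total_ending_value(sample_output))
```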
