Extract pulls specific data out of your documents and returns it as JSON. It first runs Parse, then extracts the requested fields from the Parse output.

Example use cases

  1. Extracting important numbers and statistics from a patient lab report.
  2. Extracting the rows and line items in an invoice.
  3. Extracting key clauses and prices from a contract.

Key features

Under the hood, an extract call first performs a /parse and then extracts your specified fields. You control the extraction with:
  • schema: A JSON schema that defines the fields and structure of your output.
  • system_prompt: An overall system prompt that helps our models better understand your document structure.
  • Special Configurations: array_extract and generate_citations
Read our best practices guide for how best to structure and configure your extract calls.

Example

Suppose you want to extract all of a customer's financial accounts from a statement. You can see the output in our playground example, but your schema and code might look like this:
from pathlib import Path
from reducto import Reducto

client = Reducto()

system_prompt = "Be precise and thorough."
schema = {
  "type": "object",
  "properties": {
    "customerName": {
      "type": "string",
      "description": "The full name of the customer as registered in their financial account."
    },
    "accounts": {
      "type": "array",
      "description": "A list of the customer's financial accounts.",
      "items": {
        "type": "object",
        "properties": {
          "accountType": {
            "type": "string",
            "description": "Type of financial account, such as checking, savings, investment, etc."
          },
          "accountNumber": {
            "type": "string",
            "description": "The unique identifier for the financial account."
          },
          "endingValue": {
            "type": "number",
            "description": "The closing balance or value of the account."
          }
        },
        "required": [
          "accountType",
          "accountNumber",
          "endingValue"
        ]
      }
    }
  },
  "required": [
    "customerName",
    "accounts"
  ]
}

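# Upload the local file, then pass the upload to extract as document_url.
upload = client.upload(file=Path("sample.pdf"))
result = client.extract.run(
    document_url=upload,
    schema=schema,
    system_prompt=system_prompt,
    # One of the special configurations listed under Key features.
    generate_citations=True
)

print(result)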
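
Once you have the extracted data, it can help to confirm that every required field came back before using it. The sketch below is one way to do that. It is a minimal illustration and makes several assumptions: the shape of the result object returned by the SDK is not shown here, so it works on a plain dict (sample_output) that is assumed to follow the schema above, its values are made up, and find_missing_fields and total_ending_value are hypothetical helper names, not part of Reducto.

```python
# Hypothetical extracted payload shaped like the schema above. The values are
# made up for illustration and do not come from a real document.
sample_output = {
    "customerName": "Jane Doe",
    "accounts": [
        {"accountType": "checking", "accountNumber": "0001", "endingValue": 1250.50},
        {"accountType": "savings", "accountNumber": "0002", "endingValue": 8000.00},
    ],
}

# Required fields, copied from the "required" lists in the schema.
REQUIRED_TOP_LEVEL = ["customerName", "accounts"]
REQUIRED_PER_ACCOUNT = ["accountType", "accountNumber", "endingValue"]


def find_missing_fields(payload: dict) -> list:
    """Return the paths of required fields that are absent or None."""
    missing = [key for key in REQUIRED_TOP_LEVEL if payload.get(key) is None]
    for index, account in enumerate(payload.get("accounts") or []):
        for key in REQUIRED_PER_ACCOUNT:
            if account.get(key) is None:
                missing.append(f"accounts[{index}].{key}")
    return missing


def total_ending_value(payload: dict) -> float:
    """Sum endingValue across accounts, skipping entries without a number."""
    return sum(
        account["endingValue"]
        for account in payload.get("accounts") or []
        if isinstance(account.get("endingValue"), (int, float))
    )


print("missing:", find_missing_fields(sample_output))
print("total:", total_ending_value(sample_output))
```

Calculations such as the total are done here in your own code, after extraction, rather than being requested in the schema or system prompt.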


Troubleshooting

Common problems and their solutions:
  • Outputs differ between runs
    LLM outputs are non-deterministic. Variations are normal and usually minor.
  • Only the first pages are processed
    Enable array_extract when working with long documents or large tables. You can also guide the model with a system prompt such as:
    “Make sure to process the entire document, not just the beginning.”
  • Missing values from schema
    1. Check whether the values appear in the parse step output.
    2. If present, refine the system prompt or add better field descriptions.
    3. If not present, improve the parse by adjusting configurations (e.g., OCR mode, layout settings). If you are extracting from a long list or table, try using the array type in your schema or enabling array_extract.
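
The three steps for missing values can be sketched as a small triage routine. This is only an illustration under stated assumptions: parse_text stands in for text you have copied out of a parse output, extracted stands in for the fields an extract call returned, expected holds values you can see in the document yourself, and triage_missing_values is a hypothetical helper, not a Reducto function. The simple substring check will miss values that the parse formats differently (for example, with thousands separators).

```python
def triage_missing_values(parse_text: str, extracted: dict, expected: dict) -> dict:
    """Map each expected field to the troubleshooting step it points to."""
    report = {}
    normalized_text = parse_text.lower()
    for field, expected_value in expected.items():
        if extracted.get(field) is not None:
            report[field] = "extracted"
        elif str(expected_value).lower() in normalized_text:
            # Step 2: the parse saw the value, so the extraction needs guidance.
            report[field] = "refine system prompt or field description"
        else:
            # Step 3: the value never made it out of the parse step.
            report[field] = "adjust parse configuration"
    return report


# Hypothetical inputs, made up for illustration.
parse_text = "Statement for Jane Doe. Checking account 0001. Ending value 1,250.50"
extracted = {"customerName": "Jane Doe", "accountNumber": None, "routingNumber": None}
expected = {
    "customerName": "Jane Doe",
    "accountNumber": "0001",
    "routingNumber": "021000021",
}

report = triage_missing_values(parse_text, extracted, expected)
for field, action in report.items():
    print(f"{field}: {action}")
```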

Advanced usage

  • Different schemas by document type
    Use pipelining to classify documents first, then apply the correct schema. This also avoids duplicate parsing.
  • Handling edge cases
    Use your system_prompt to highlight special cases. For example:
    “Pay close attention to clauses that mention early termination fees.”
  • Conditional transformations
    Both the system_prompt and schema field descriptions can include transformation instructions. For example:
    “Prepend [Contains Example] if keyword X appears in this field.” However, keep post-processing logic out of the prompt and schema, as Reducto can only extract what is in the document.
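
As a rough illustration of choosing a schema by document type, the sketch below classifies a document's text and looks up the matching schema. Everything in it is an assumption made for the example: the SCHEMAS registry, the abbreviated schemas, the keyword-based classify_document function, and the schema_for helper are all hypothetical, and the text is assumed to come from a parse you have already run. How the parse result is reused in the extract call is covered by pipelining and is not shown here.

```python
from typing import Optional

# Hypothetical schema registry keyed by document type. The schemas are
# abbreviated for illustration.
SCHEMAS = {
    "invoice": {
        "type": "object",
        "properties": {"lineItems": {"type": "array", "items": {"type": "object"}}},
    },
    "contract": {
        "type": "object",
        "properties": {"keyClauses": {"type": "array", "items": {"type": "string"}}},
    },
}

# Hypothetical keyword hints. A real classifier could be anything.
KEYWORDS = {
    "invoice": ("invoice", "amount due", "bill to"),
    "contract": ("agreement", "termination", "hereinafter"),
}


def classify_document(text: str) -> Optional[str]:
    """Pick the document type whose keywords appear most often, if any."""
    lowered = text.lower()
    scores = {
        doc_type: sum(lowered.count(word) for word in words)
        for doc_type, words in KEYWORDS.items()
    }
    best_type = max(scores, key=scores.get)
    return best_type if scores[best_type] > 0 else None


def schema_for(text: str) -> Optional[dict]:
    """Return the schema for the detected document type, or None if unknown."""
    doc_type = classify_document(text)
    return SCHEMAS.get(doc_type) if doc_type else None


print(classify_document("Invoice #12. Amount due: $40. Bill to: ACME"))
print(schema_for("This Agreement covers early termination fees."))
```

Returning None for unrecognized documents lets the caller decide whether to skip the document or fall back to a general-purpose schema.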