Extraction Overview
Understanding Reducto’s Extract endpoint
Extract pulls the specific fields you want out of your documents and returns them as JSON. If you want to extract different fields from the same document, check out our pipelining documentation. If your document is a combination of many subsections, check out our splitting endpoint.
Example Use Cases
- Extracting important numbers and statistics from a patient lab report.
- Extracting the rows and line items in an invoice.
- Extracting key clauses and prices from a contract.
Key Features
Under the hood, an extract call first performs a /parse and then extracts your specified fields.
- schema: A JSON schema that details the specific fields and structure of your output.
- system_prompt: An overall system prompt that helps our models understand your document structure better.
- Special Configurations: array_extract and generate_citations
Read our best practices guide for how best to structure and configure your extract calls.
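As a rough illustration of how these pieces fit together, the sketch below assembles a schema, a system prompt, and the two special configurations into one request body. The schema's field names (invoice_number, line_items), the build_extract_payload helper, and the payload layout (including the top-level options key) are assumptions made for illustration, not Reducto's documented request format; check the API reference for the exact shape.

```python
import json

# Hypothetical JSON schema for invoice line items; field names are illustrative.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Invoice identifier as printed on the document",
        },
        "line_items": {
            "type": "array",
            "description": "One entry per row of the line-item table",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description"],
            },
        },
    },
    "required": ["invoice_number", "line_items"],
}


def build_extract_payload(document_url, schema, system_prompt,
                          array_extract=False, generate_citations=False):
    """Assemble a request body. The key layout is an assumption, not the official API."""
    return {
        "document_url": document_url,
        "schema": schema,
        "system_prompt": system_prompt,
        "options": {
            "array_extract": array_extract,
            "generate_citations": generate_citations,
        },
    }


payload = build_extract_payload(
    document_url="https://example.com/invoice.pdf",  # placeholder URL
    schema=invoice_schema,
    system_prompt="This is a vendor invoice. Line items appear in a table that may span pages.",
    array_extract=True,
)
print(json.dumps(payload, indent=2))
```

Descriptive field names and per-field descriptions do much of the work here; the system prompt carries document-level context that no single field could.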
Debugging FAQ
Why are the outputs different when I run it multiple times with the same schema?
LLMs are non-deterministic by nature, so minor variations are expected.
Why is extraction only working on the first X pages? I want the whole document.
If you are extracting from a long document or table, try enabling array_extract. Otherwise, experiment with the system prompt to add phrases like “make sure you process the entirety of the document…”.
Why are my extraction results missing values present in my schema?
First check whether the values are present after the parse step of extraction. If they are, edit the system prompt or add clearer field names and descriptions. If they aren’t, focus on improving the parse step by adjusting its configuration.
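One way to run this check is to search the parsed text for values you can see in the source document. The sketch below assumes you have already saved the parse output as a plain string; find_missing_values and normalize are hypothetical helpers written for this example, not part of Reducto's SDK.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so layout differences don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_missing_values(parsed_text, expected):
    """Return the field names whose expected value does not appear in the parsed text.

    parsed_text: text produced by the parse step (assumed to be a plain string).
    expected: mapping of field name -> value you can see in the source document.
    """
    haystack = normalize(parsed_text)
    return [field for field, value in expected.items()
            if normalize(str(value)) not in haystack]


# Illustrative data only.
parsed_text = "Patient: Jane Doe\nHemoglobin   13.5 g/dL\nGlucose 92 mg/dL"
expected = {"hemoglobin": "13.5 g/dL", "platelets": "250 x10^3/uL"}

missing = find_missing_values(parsed_text, expected)
print(missing)  # fields whose values never made it through the parse step
```

Fields reported here point to a parse problem. Fields that are found in the parsed text but still absent from the extraction result point to the schema or system prompt instead.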
How do I extract different schemas depending on the document type?
Use pipelining to first parse and classify the documents, then apply the correct schema. This also saves you from duplicate parsing.
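The routing step can be sketched as a lookup from document type to schema. Here a simple keyword check stands in for whatever classifier you use, and SCHEMAS, KEYWORDS, classify, and pick_schema are hypothetical names rather than Reducto's pipelining API. The point is only that the document is parsed once and the parsed text drives both classification and schema selection.

```python
# Hypothetical schemas keyed by document type.
SCHEMAS = {
    "invoice": {
        "type": "object",
        "properties": {"total": {"type": "number"}},
    },
    "lab_report": {
        "type": "object",
        "properties": {"patient_name": {"type": "string"}},
    },
}

# Keywords per type; a stand-in for a real classifier.
KEYWORDS = {
    "invoice": ("invoice", "amount due"),
    "lab_report": ("specimen", "reference range"),
}


def classify(parsed_text):
    """Return the first document type whose keywords appear in the text, else None."""
    text = parsed_text.lower()
    for doc_type, words in KEYWORDS.items():
        if any(word in text for word in words):
            return doc_type
    return None


def pick_schema(parsed_text):
    """Classify already-parsed text and return (doc_type, schema)."""
    doc_type = classify(parsed_text)
    if doc_type is None:
        raise ValueError("Unrecognized document type")
    return doc_type, SCHEMAS[doc_type]


doc_type, schema = pick_schema("INVOICE #1042\nAmount due: $310.00")
print(doc_type, list(schema["properties"]))
```

Raising on an unrecognized type, rather than falling back to a default schema, keeps misrouted documents from silently producing empty extractions.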
What's the best way to approach handling edge cases?
Your system prompt is the best place to call out special cases to pay close attention to.
Can I manipulate data directly in the field descriptions or system prompt?
Yes. You can include special instructions in the system prompt, such as how to treat nuanced structure or conditionally transform data (e.g., “prepend [Contains Example] if keyword X appears”).
Example
Let’s say you want to extract all of a customer’s financial accounts from a statement. You can see the output in our playground example, but your schema and code might look like this: