Extract answers the question “What is the value of X?” You define the fields you need, and Extract returns those specific values as structured JSON. Unlike Parse, which returns all content, Extract returns only what you ask for.

Under the hood, Extract runs Parse first (handling OCR, layout detection, and table parsing), then uses an LLM to locate and extract the fields you specified. This means Extract can only return what Parse sees. If a value doesn’t appear in the Parse output due to OCR issues or table formatting problems, no amount of schema tweaking will extract it.

Start with a document

Let’s walk through extracting data from this Fidelity investment statement. We want to pull out specific values (portfolio value, holdings, and account details), not the entire document content.

Create an Extract pipeline in Studio and upload the document. The Configurations tab shows three panels: Build System Prompt for document-level context, Schema Builder for defining fields, and Settings for extraction options.
Figure: Extract pipeline interface showing document preview and configuration panels

Define your schema

The Schema Builder is where you specify what to extract. You have two options:

Option 1: Generate from natural language

Click Generate Schema to describe what you need in plain English. For example:
Extract the total portfolio value and a list of all holdings with their names, 
quantities, and current values.
Studio generates a schema from your description. Review the fields, adjust types or descriptions as needed, then run.

Option 2: Build manually

Click Add Field to define each field yourself. Each field needs:
  • Name — The key in your JSON output (e.g., portfolioValue)
  • Type — The data type (text, number, date, array, object, etc.)
  • Description — Instructions for the LLM on where to find this value
Figure: Schema Builder with portfolioValue and holdings array defined

In this example, we define:
  • portfolioValue (text) — “Total portfolio value”
  • holdings (array) — “list of all my holdings”
The holdings array expands to show nested fields. Click on an array or object field to add nested structure—each holding could have name, quantity, currentValue, etc.
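
As a rough sketch, the example schema could be written down as a JSON-Schema-style structure. This page does not show the format Studio or the API actually uses, so the layout below (the `type`, `description`, `properties`, and `items` keys, and the types chosen for the nested fields) is an assumption for illustration only. The `field_names` helper is hypothetical and simply lists every field path.

```python
import json

# Hypothetical JSON-Schema-style representation of the example schema.
# Key names and nested field types are assumptions, not Studio's documented format.
schema = {
    "type": "object",
    "properties": {
        "portfolioValue": {
            "type": "string",  # defined as "text" in the Schema Builder
            "description": "Total portfolio value",
        },
        "holdings": {
            "type": "array",
            "description": "list of all my holdings",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "quantity": {"type": "number"},
                    "currentValue": {"type": "number"},
                },
            },
        },
    },
}


def field_names(node, prefix=""):
    """Return dotted paths for every field, descending into objects and arrays."""
    paths = []
    for name, spec in node.get("properties", {}).items():
        path = f"{prefix}{name}"
        paths.append(path)
        if spec.get("type") == "object":
            paths.extend(field_names(spec, prefix=f"{path}."))
        elif spec.get("type") == "array":
            paths.extend(field_names(spec.get("items", {}), prefix=f"{path}[]."))
    return paths


print(json.dumps(field_names(schema), indent=2))
```

Listing the paths this way is a quick check that nested fields sit under the array you intended.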

Field types

| Type | Use for | Example |
|---|---|---|
| text | Names, IDs, addresses | Account number, customer name |
| number | Amounts, quantities | Total due, item count |
| boolean | Yes/no values | Is paid, has signature |
| date | Dates in any format | Invoice date, due date |
| enum | Fixed set of options | Status (pending/approved/rejected) |
| array | Lists of items | Holdings, transactions, line items |
| object | Grouped fields | Address (street, city, zip) |
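
If you post-process results in code, a small type check can catch a field that came back in an unexpected shape. The sketch below maps the field types listed above to Python checks. The function name `matches_type` is hypothetical, and the assumption that dates arrive as strings is a guess, since this page does not specify how each type is serialized in the JSON output.

```python
def matches_type(value, field_type, options=None):
    """Return True if `value` has the Python shape expected for `field_type`."""
    if field_type == "text":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "date":
        return isinstance(value, str)  # assumption: dates arrive as strings
    if field_type == "enum":
        return value in (options or [])
    if field_type == "array":
        return isinstance(value, list)
    if field_type == "object":
        return isinstance(value, dict)
    raise ValueError(f"unknown field type: {field_type}")


print(matches_type("pending", "enum", ["pending", "approved", "rejected"]))
print(matches_type("12", "number"))
```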

Writing good descriptions

Field descriptions directly influence extraction accuracy. The LLM uses them to locate values. Specific descriptions work better:
  • ✓ “Total portfolio value in USD, displayed prominently at the top right of page 1”
  • ✓ “Account holder’s full name as shown in the mailing address”
  • ✗ “The total”
  • ✗ “Name”
Include location hints when the document has multiple similar values. See Schema Best Practices for more guidance.

Build System Prompt

Use the Build System Prompt panel to provide document-level context:
This is a Fidelity investment statement. Extract values from the portfolio summary 
on page 1. Amounts are in USD. Account numbers follow the format XXX-XXXXXX.
Field-specific guidance belongs in schema descriptions. The system prompt sets overall context that applies to all fields.

Settings

The Settings section (below Schema Builder) controls extraction behavior. Configure these before running:
  • Array extraction: Enable for documents with long lists (hundreds of transactions, extensive line items). Without this, items toward the end may be missed due to LLM context limits. See Array Extraction for details.
  • Citations: Returns page number, bounding box, and source text for each value. Useful for verification or showing users where values came from. See Citations for working with this data.
  • Include images: Sends page images to the LLM alongside text. Helps with visually complex documents but increases cost.

Run and view results

Click Run to execute the extraction. The Results tab shows your extracted data:
Figure: Extraction results showing portfolio value and holdings array

Each field from your schema appears with its extracted value. For arrays like holdings, you see each item expanded with its nested fields. The toolbar provides:
  • Copy — Copy JSON to clipboard
  • Download — Save as file
  • JSON — Toggle raw API response view

Compare against ground truth

The Compare tab lets you validate extraction accuracy against known correct values. This is essential for testing schema changes before deploying to production.

Studio shows a side-by-side comparison: your current result on the left, ground truth on the right. Matching values appear in green; mismatches appear in red. The header shows your match percentage (e.g., “17% Match (1/6 fields)”).

Setting up ground truth:
  • Import — Upload a JSON file with expected values for this document
  • Generate — Create ground truth from the current result, then manually correct any errors
  • Edit — Adjust values directly in the UI
  • Copy — Export ground truth as JSON
  • Delete — Remove ground truth for this document
Use ground truth to track extraction quality as you iterate. If a schema change drops your match rate, you know something regressed before deploying.
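
The match percentage can be approximated outside Studio with a field-by-field comparison. The sketch below compares top-level fields by exact equality. How Studio counts nested or array fields, and whether it normalizes values first, is not described here, so treat this as an assumption. The function name `match_report` and the sample field names and values are hypothetical.

```python
def match_report(result, ground_truth):
    """Compare top-level fields by equality; return (matched, total, percent)."""
    total = len(ground_truth)
    matched = sum(
        1 for key, expected in ground_truth.items() if result.get(key) == expected
    )
    percent = round(100 * matched / total) if total else 0
    return matched, total, percent


# Placeholder data: six expected fields, only one of which matches.
ground_truth = {
    "portfolioValue": "1000.00",
    "accountNumber": "XXX-XXXXXX",
    "accountHolder": "A. Example",
    "statementDate": "2024-01-31",
    "currency": "USD",
    "holdingsCount": 2,
}
result = {
    "portfolioValue": "1000.00",
    "accountNumber": None,
    "accountHolder": "Example",
    "statementDate": "01/31/2024",
    "currency": "US Dollars",
    "holdingsCount": 1,
}

matched, total, percent = match_report(result, ground_truth)
print(f"{percent}% Match ({matched}/{total} fields)")  # prints "17% Match (1/6 fields)"
```

Running a check like this after each schema change gives you a number to compare between iterations.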

Troubleshooting

If a field comes back empty or wrong, first check whether the value appears in the Parse output. Create a Parse pipeline with the same document. If the value isn’t there, Extract can’t find it either, so fix the parsing first (enable agentic mode, adjust OCR settings, or change table format to HTML).

If the value is in the Parse output, refine your schema: make the description more specific, add location hints like “found in the header on page 1”, or try a different field name.
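
The Parse check can also be done programmatically. The sketch below assumes you already have the Parse output as a plain-text string. The helper names `normalize` and `appears_in_parse` are hypothetical. Normalization strips punctuation and case so that “$1,000.00” still matches “1,000.00”, at the cost of occasional false positives.

```python
import re


def normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def appears_in_parse(expected_value, parse_text):
    """Return True if the expected value occurs in the Parse text after normalization."""
    needle = normalize(expected_value)
    if not needle:
        return False  # nothing meaningful to search for
    return needle in normalize(parse_text)


# Placeholder Parse text, not taken from a real document.
parse_text = "Total Portfolio Value: 1,000.00 USD"
print(appears_in_parse("$1,000.00", parse_text))
print(appears_in_parse("2,500.00", parse_text))
```

If the check fails for a value you can see in the document, the problem is in parsing, not in the schema.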
For long documents with many items (hundreds of transactions, extensive line items), enable array extraction. Without it, the LLM may only process the beginning of the document due to context limits.
LLM outputs are non-deterministic. Small variations are normal. If you need consistent results, use ground truth comparison to track accuracy over time and catch regressions.
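
One way to quantify run-to-run variation is to extract the same document several times and measure how often each field agrees. The sketch below takes a list of result dictionaries that you have already collected. The function name `field_agreement` and the sample values are hypothetical.

```python
from collections import Counter


def field_agreement(runs):
    """For each field, return the share of runs that agree with the most common value."""
    keys = {key for run in runs for key in run}
    agreement = {}
    for key in sorted(keys):
        # repr() makes unhashable values (lists, dicts) countable
        values = Counter(repr(run.get(key)) for run in runs)
        agreement[key] = values.most_common(1)[0][1] / len(runs)
    return agreement


# Placeholder results from three runs on the same document.
runs = [
    {"portfolioValue": "1000.00", "accountHolder": "A. Example"},
    {"portfolioValue": "1000.00", "accountHolder": "A Example"},
    {"portfolioValue": "1000.00", "accountHolder": "A. Example"},
]
agreement = field_agreement(runs)
for key, share in agreement.items():
    print(f"{key}: {share:.0%} of runs agree")
```

Fields with low agreement are good candidates for a more specific description or a location hint.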
If extraction is slow, keep in mind that Extract runs Parse first, then the extraction LLM, so large documents take longer. Enabling “Include images” adds further processing time and cost. Disable it unless you need visual context for extraction.