Start with a document
Let's walk through extracting data from this Fidelity investment statement. We want to pull out specific values (portfolio value, holdings, and account details), not the entire document content. Create an Extract pipeline in Studio and upload the document. The Configurations tab shows three panels: Build System Prompt for document-level context, Schema Builder for defining fields, and Settings for extraction options.
Extract pipeline interface showing document preview and configuration panels
Define your schema
The Schema Builder is where you specify what to extract. You have two options:
Option 1: Generate from natural language
Click Generate Schema to describe what you need in plain English.
Option 2: Build manually
Click Add Field to define each field yourself. Each field needs:
- Name: The key in your JSON output (e.g., portfolioValue)
- Type: The data type (text, number, date, array, object, etc.)
- Description: Instructions for the LLM on where to find this value

Schema Builder with portfolioValue and holdings array defined
- portfolioValue (text): "Total portfolio value"
- holdings (array): "list of all my holdings"

The holdings array expands to show nested fields. Click an array or object field to add nested structure: each holding could have name, quantity, currentValue, etc.
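To make the nested structure concrete, here is a minimal sketch of how the two fields above could be written down as data. The JSON-Schema-like layout, the nested field types, and the nested descriptions are assumptions for illustration; the format Studio uses internally is not shown on this page.

```python
# Hypothetical, JSON-Schema-like representation of the fields defined above.
# The layout and the nested types/descriptions are assumptions, not Studio's format.
schema = {
    "type": "object",
    "properties": {
        "portfolioValue": {"type": "text", "description": "Total portfolio value"},
        "holdings": {
            "type": "array",
            "description": "list of all my holdings",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "text", "description": "Name of the holding"},
                    "quantity": {"type": "number", "description": "Units held"},
                    "currentValue": {"type": "number", "description": "Current value of the holding"},
                },
            },
        },
    },
}


def field_paths(node, prefix=""):
    """Yield (dotted path, type) for every field, descending into objects and arrays."""
    for name, spec in node.get("properties", {}).items():
        path = f"{prefix}{name}"
        yield path, spec["type"]
        if spec["type"] == "object":
            yield from field_paths(spec, prefix=f"{path}.")
        elif spec["type"] == "array":
            yield from field_paths(spec.get("items", {}), prefix=f"{path}[].")


for path, field_type in field_paths(schema):
    print(f"{path}: {field_type}")
```

Listing the paths this way is a quick check that every nested field you expect under holdings is actually defined.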
Field types
| Type | Use for | Example |
|---|---|---|
| text | Names, IDs, addresses | Account number, customer name |
| number | Amounts, quantities | Total due, item count |
| boolean | Yes/no values | Is paid, has signature |
| date | Dates in any format | Invoice date, due date |
| enum | Fixed set of options | Status (pending/approved/rejected) |
| array | Lists of items | Holdings, transactions, line items |
| object | Grouped fields | Address (street, city, zip) |
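As a rough illustration of what each type implies for the JSON you get back, the sketch below maps the field types in the table to Python checks. The mapping is an assumption made for this example (in particular, treating a date as an ISO 8601 string); it does not describe how Extract actually serializes values.

```python
from datetime import date


def matches_type(value, field_type, options=None):
    """Return True if `value` is plausible for `field_type` under the assumed mapping."""
    if field_type == "text":
        return isinstance(value, str)
    if field_type == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "boolean":
        return isinstance(value, bool)
    if field_type == "date":
        # Assumption: dates arrive as ISO 8601 strings such as "2024-03-31".
        if not isinstance(value, str):
            return False
        try:
            date.fromisoformat(value)
            return True
        except ValueError:
            return False
    if field_type == "enum":
        # `options` is the fixed set of allowed values for the field.
        return value in (options or [])
    if field_type == "array":
        return isinstance(value, list)
    if field_type == "object":
        return isinstance(value, dict)
    raise ValueError(f"unknown field type: {field_type}")


print(matches_type("pending", "enum", ["pending", "approved", "rejected"]))
print(matches_type({"street": "1 Main St", "city": "Boston", "zip": "02101"}, "object"))
```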
Writing good descriptions
Field descriptions directly influence extraction accuracy because the LLM uses them to locate values. Specific descriptions work better than vague ones:
- Good: "Total portfolio value in USD, displayed prominently at the top right of page 1"
- Good: "Account holder's full name as shown in the mailing address"
- Too vague: "The total"
- Too vague: "Name"
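If you maintain many fields, a crude check can catch the vaguest descriptions before you run anything. The sketch below is only a heuristic: the `is_vague` helper, its four-word threshold, and the field names are all hypothetical, and a long description can still be unhelpful.

```python
def is_vague(description, min_words=4):
    """Heuristic: flag descriptions too short to say where the value is found."""
    return len(description.split()) < min_words


# Field names here are hypothetical; the descriptions are the examples above.
descriptions = {
    "portfolioValue": "Total portfolio value in USD, displayed prominently at the top right of page 1",
    "accountHolder": "Account holder's full name as shown in the mailing address",
    "total": "The total",
    "name": "Name",
}

vague = [field for field, text in descriptions.items() if is_vague(text)]
print(vague)  # ['total', 'name']
```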
Build System Prompt
Use the Build System Prompt panel to provide document-level context.
Settings

Run and view results
Click Run to execute the extraction. The Results tab shows your extracted data:
Extraction results showing portfolio value and holdings array
For array fields such as holdings, you see each item expanded with its nested fields.
The toolbar provides:
- Copy: Copy JSON to clipboard
- Download: Save as file
- JSON: Toggle raw API response view
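Once you have copied or downloaded the JSON, you can work with it like any other JSON. The sketch below assumes the output mirrors the schema (a text portfolioValue and a holdings array with name, quantity, and currentValue); the payload is a made-up placeholder, not real output from the statement.

```python
import json
from decimal import Decimal

# Placeholder payload standing in for the downloaded file. The values are made up
# and the exact shape of Studio's output is an assumption.
raw = """
{
  "portfolioValue": "$12,345.67",
  "holdings": [
    {"name": "Example Fund A", "quantity": 10, "currentValue": 1000.5},
    {"name": "Example Fund B", "quantity": 4, "currentValue": 250.25}
  ]
}
"""

# parse_float=Decimal avoids binary floating-point rounding on money amounts.
result = json.loads(raw, parse_float=Decimal)


def money_to_decimal(text):
    """Convert a text amount such as "$12,345.67" into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


portfolio_value = money_to_decimal(result["portfolioValue"])
holdings_total = sum((Decimal(h["currentValue"]) for h in result["holdings"]), Decimal("0"))

print(portfolio_value)
for holding in result["holdings"]:
    print(holding["name"], holding["quantity"], holding["currentValue"])
print(holdings_total)
```

Because portfolioValue is defined as text in this walkthrough, it needs the conversion step before you can do arithmetic on it. Defining the field as number instead may avoid that step.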
Compare against ground truth
The Compare tab lets you validate extraction accuracy against known correct values. This is essential for testing schema changes before deploying to production. Studio shows a side-by-side comparison: your current result on the left, ground truth on the right. Matching values appear in green; mismatches appear in red. The header shows your match percentage (e.g., "17% Match (1/6 fields)").
- Import: Upload a JSON file with expected values for this document
- Generate: Create ground truth from the current result, then manually correct any errors
- Edit: Adjust values directly in the UI
- Copy: Export ground truth as JSON
- Delete: Remove ground truth for this document
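If you export the ground truth as JSON, you can reproduce a similar match figure outside Studio. The sketch below assumes a match means exact equality on top-level fields; how Studio itself counts fields and compares nested values is not described here, and the field names and values are placeholders.

```python
def match_report(result, ground_truth):
    """Compare top-level fields by exact equality; return (matched, total, percent)."""
    total = len(ground_truth)
    matched = sum(
        1 for key, expected in ground_truth.items() if result.get(key) == expected
    )
    percent = round(100 * matched / total) if total else 0
    return matched, total, percent


# Placeholder values; every field name except portfolioValue is hypothetical.
ground_truth = {
    "accountNumber": "X00-000000",
    "accountHolder": "Jane Doe",
    "portfolioValue": "$12,345.67",
    "statementStart": "2024-01-01",
    "statementEnd": "2024-01-31",
    "holdingsCount": 2,
}
result = {
    "accountNumber": "X00-000000",
    "accountHolder": "JANE DOE",
    "portfolioValue": "12345.67",
    "statementStart": "January 1, 2024",
    "statementEnd": None,
    "holdingsCount": 3,
}

matched, total, percent = match_report(result, ground_truth)
print(f"{percent}% Match ({matched}/{total} fields)")  # 17% Match (1/6 fields)
```

Exact equality is deliberately strict: "JANE DOE" and "Jane Doe" count as a mismatch, which is often what you want when checking whether a schema change altered the output format.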
Troubleshooting
Why is a field returning null or incorrect values?
First, check whether the value appears in the Parse output. Create a Parse pipeline with the same document; if the value isn't there, Extract can't find it either. Fix the parsing first (enable agentic mode, adjust OCR settings, or change table format to HTML).
If the value is in the Parse output, refine your schema: make the description more specific, add location hints like "found in the header on page 1", or try a different field name.
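The first check can be partly automated if you copy the Parse output as text. The sketch below is a hypothetical helper, not part of the product: it normalizes both strings so that spacing and punctuation differences do not hide a value that is actually present. The sample text is a placeholder.

```python
import re


def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def appears_in_parse_output(expected_value, parse_text):
    """Return True if the expected value occurs in the parsed text after normalization."""
    needle = normalize(expected_value)
    if not needle:
        # An empty needle would match everything, so treat it as not found.
        return False
    return needle in normalize(parse_text)


# Placeholder text standing in for the Parse output of the statement.
parse_text = "Total Portfolio Value\n$ 12,345.67\nAccount Holder: Jane Doe"

print(appears_in_parse_output("$12,345.67", parse_text))
print(appears_in_parse_output("$99,999.00", parse_text))
```

The normalization is aggressive, so short values can produce false positives. Treat a match as a hint that the problem is in the schema, and confirm by reading the Parse output.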
Why are items missing from my array?
For long documents with many items (hundreds of transactions, extensive line items), enable array extraction. Without it, the LLM may only process the beginning of the document due to context limits.
Why do results vary between runs?
LLM outputs are non-deterministic. Small variations are normal. If you need consistent results, use ground truth comparison to track accuracy over time and catch regressions.
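To see which fields are affected, you can save the JSON from several runs and compare them. The sketch below is a hypothetical helper that assumes each run is a dict of top-level fields; the sample values are placeholders.

```python
import json


def unstable_fields(runs):
    """Return {field: number of distinct values} for fields that differ between runs."""
    fields = set().union(*(run.keys() for run in runs))
    report = {}
    for field in sorted(fields):
        # Serialize so nested lists and dicts can be compared as plain strings.
        seen = {json.dumps(run.get(field), sort_keys=True) for run in runs}
        if len(seen) > 1:
            report[field] = len(seen)
    return report


# Placeholder results from three runs of the same pipeline on the same document.
runs = [
    {"portfolioValue": "$12,345.67", "accountHolder": "Jane Doe"},
    {"portfolioValue": "$12,345.67", "accountHolder": "JANE DOE"},
    {"portfolioValue": "$12,345.67", "accountHolder": "Jane Doe"},
]

print(unstable_fields(runs))  # {'accountHolder': 2}
```

Fields that show up in this report are good candidates for a more specific description.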
Why is extraction slow?
Extract runs Parse first, then the extraction LLM. Large documents take longer. If you enabled "Include images", that adds processing time and cost. Disable it unless you need visual context for extraction.