Start with a document
Let’s walk through extracting data from this Fidelity investment statement. We want to pull out specific values—portfolio value, holdings, and account details—not the entire document content. Create an Extract pipeline in Studio and upload the document. The Configurations tab shows three panels: Build System Prompt for document-level context, Schema Builder for defining fields, and Settings for extraction options.
Define your schema
The Schema Builder is where you specify what to extract. You have two options:
Option 1: Generate from natural language
Click Generate Schema to describe what you need in plain English.
Option 2: Build manually
Click Add Field to define each field yourself. Each field needs:
- Name: The key in your JSON output (e.g., portfolioValue)
- Type: The data type (text, number, date, array, object, etc.)
- Description: Instructions for the LLM on where to find this value

For the statement above, you might define:
- portfolioValue (text): “Total portfolio value”
- holdings (array): “List of all my holdings”

The holdings array expands to show nested fields. Click an array or object field to add nested structure; each holding could have name, quantity, currentValue, etc.
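To make the nested structure concrete, here is a minimal sketch of the schema above written as a Python dictionary. The JSON-Schema-like layout (`type`, `description`, `items`, `properties`) is an assumption for illustration only; this guide does not show the format Studio stores or exports, and the types and descriptions of the nested fields are guesses.

```python
import json

# Hypothetical schema layout: Studio's real format may differ.
schema = {
    "portfolioValue": {
        "type": "text",
        "description": "Total portfolio value",
    },
    "holdings": {
        "type": "array",
        "description": "List of all my holdings",
        # Each array item is an object with its own nested fields.
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "text", "description": "Name of the holding"},
                "quantity": {"type": "number", "description": "Number of shares or units held"},
                "currentValue": {"type": "number", "description": "Current market value of the holding"},
            },
        },
    },
}


def nested_field_names(field):
    """Return the nested field names of an array or object field."""
    if field["type"] == "array":
        return nested_field_names(field["items"])
    if field["type"] == "object":
        return list(field["properties"])
    return []  # scalar fields have no nested structure


print(json.dumps(schema, indent=2))
print(nested_field_names(schema["holdings"]))
```

Writing the schema out this way can help when reviewing it with others, since the array and its nested fields are visible in one place.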
Field types
| Type | Use for | Example |
|---|---|---|
| text | Names, IDs, addresses | Account number, customer name |
| number | Amounts, quantities | Total due, item count |
| boolean | Yes/no values | Is paid, has signature |
| date | Dates in any format | Invoice date, due date |
| enum | Fixed set of options | Status (pending/approved/rejected) |
| array | Lists of items | Holdings, transactions, line items |
| object | Grouped fields | Address (street, city, zip) |
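If you post-process extracted JSON in your own code, a small type check can flag values that do not match the field type you chose. The sketch below is illustrative: the mapping from field types to Python types is an assumption (for instance, that dates and enums arrive as strings), and `matches_type` is a hypothetical helper, not part of any Studio API.

```python
# Assumed mapping from schema field types to the Python types produced by
# loading the JSON output; not confirmed by this guide.
PYTHON_TYPES = {
    "text": str,
    "number": (int, float),
    "boolean": bool,
    "date": str,    # assumed to arrive as a string
    "enum": str,    # assumed to arrive as one of the allowed strings
    "array": list,
    "object": dict,
}


def matches_type(value, field_type):
    """Check whether an extracted value has the shape expected for its field type."""
    if value is None:
        return False  # null means the field was not extracted
    # bool is a subclass of int in Python, so exclude it from "number".
    if field_type == "number" and isinstance(value, bool):
        return False
    return isinstance(value, PYTHON_TYPES[field_type])


print(matches_type(1250.75, "number"))    # True
print(matches_type("1250.75", "number"))  # False: a string, not a number
print(matches_type([], "array"))          # True
```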
Writing good descriptions
Field descriptions directly influence extraction accuracy. The LLM uses them to locate values. Specific descriptions work better:
- ✓ “Total portfolio value in USD, displayed prominently at the top right of page 1”
- ✓ “Account holder’s full name as shown in the mailing address”
- ✗ “The total”
- ✗ “Name”
Build System Prompt
Use the Build System Prompt panel to provide document-level context.
Settings

Run and view results
Click Run to execute the extraction. The Results tab shows your extracted data. For array fields such as holdings, you see each item expanded with its nested fields.
The toolbar provides:
- Copy — Copy JSON to clipboard
- Download — Save as file
- JSON — Toggle raw API response view
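Once you have copied or downloaded the JSON, you can work with it like any other structured data. The sketch below assumes the output has the shape implied by the schema in this guide (a `portfolioValue` string and a `holdings` array with `name`, `quantity`, and `currentValue`); the values are made-up placeholders, not data from the statement.

```python
import json

# Made-up sample output; field names follow the schema in this guide,
# values are placeholders.
raw = """
{
  "portfolioValue": "$100.00",
  "holdings": [
    {"name": "Fund A", "quantity": 2, "currentValue": 60.0},
    {"name": "Fund B", "quantity": 1, "currentValue": 40.0}
  ]
}
"""
result = json.loads(raw)

# Walk the array and total the nested currentValue field.
holdings_total = sum(item["currentValue"] for item in result["holdings"])
for item in result["holdings"]:
    print(f'{item["name"]}: {item["quantity"]} units, value {item["currentValue"]}')
print("Sum of holdings:", holdings_total)
```

Summing the nested values and comparing the total with the reported portfolio value is one quick sanity check that no holdings were dropped.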
Compare against ground truth
The Compare tab lets you validate extraction accuracy against known correct values, which is essential for testing schema changes before deploying to production. Studio shows a side-by-side comparison: your current result on the left, ground truth on the right. Matching values appear in green; mismatches appear in red. The header shows your match percentage (e.g., “17% Match (1/6 fields)”).
Ground truth can be managed with the following actions:
- Import — Upload a JSON file with expected values for this document
- Generate — Create ground truth from the current result, then manually correct any errors
- Edit — Adjust values directly in the UI
- Copy — Export ground truth as JSON
- Delete — Remove ground truth for this document
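The match percentage can be understood as the share of ground-truth fields whose extracted value is identical. Here is a minimal sketch of that arithmetic, assuming exact equality on top-level fields; Studio's actual matching rules (for example, how nested arrays or formatting differences are handled) are not described in this guide, and the field names and values below are hypothetical.

```python
def compare_to_ground_truth(result, ground_truth):
    """Compare top-level fields and return (matches, total, percent)."""
    total = len(ground_truth)
    matches = sum(
        1 for key, expected in ground_truth.items() if result.get(key) == expected
    )
    percent = round(100 * matches / total) if total else 0
    return matches, total, percent


# Hypothetical ground truth and extraction result with six fields.
ground_truth = {
    "accountHolder": "Jane Doe",
    "accountNumber": "X00-000000",
    "portfolioValue": "$100.00",
    "statementDate": "2024-01-31",
    "holdingsCount": 2,
    "isRetirementAccount": False,
}
result = {
    "accountHolder": "Jane Doe",    # matches
    "accountNumber": None,          # missing
    "portfolioValue": "100",        # formatting differs
    "statementDate": "01/31/2024",  # formatting differs
    "holdingsCount": 3,             # wrong value
    "isRetirementAccount": None,    # missing
}

matches, total, percent = compare_to_ground_truth(result, ground_truth)
print(f"{percent}% Match ({matches}/{total} fields)")  # 17% Match (1/6 fields)
```

With one matching field out of six, 1/6 rounds to 17%, which is how a header such as “17% Match (1/6 fields)” would arise under this assumption.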
Troubleshooting
Why is a field returning null or incorrect values?
First, check if the value appears in the Parse output. Create a Parse pipeline with the same document. If the value isn’t there, Extract can’t find it either. Fix the parsing first (enable agentic mode, adjust OCR settings, or change table format to HTML).

If the value is in Parse output, refine your schema. Make the description more specific, add location hints like “found in the header on page 1”, or try a different field name.
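The first check can be done by hand, or with a small script if you have the Parse output saved as text. The sketch below assumes the Parse output is available as a plain string; `appears_in_parse_output` is a hypothetical helper, and the sample text is a placeholder.

```python
import re


def appears_in_parse_output(parse_text, expected_value):
    """Return True if expected_value occurs in the parsed text,
    ignoring case and differences in whitespace or line breaks."""
    def normalize(s):
        return re.sub(r"\s+", " ", s).strip().lower()

    return normalize(expected_value) in normalize(parse_text)


# Placeholder Parse output with a label split across lines.
parse_text = "Account  Summary\nTotal Portfolio\nValue: $100.00"

print(appears_in_parse_output(parse_text, "total portfolio value"))  # True
print(appears_in_parse_output(parse_text, "$250.00"))                # False
```

If the check returns False, the problem is upstream in parsing; if it returns True, the schema description is the place to look.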
Why are items missing from my array?
For long documents with many items (hundreds of transactions, extensive line items), enable array extraction. Without it, the LLM may only process the beginning of the document due to context limits.
Why do results vary between runs?
LLM outputs are non-deterministic. Small variations are normal. If you need consistent results, use ground truth comparison to track accuracy over time and catch regressions.
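To see which fields are affected by run-to-run variation, you can save the JSON from several runs and compare them. This is a minimal sketch under the assumption that each run is a dictionary of top-level fields; `unstable_fields` is a hypothetical helper and the run data is made up.

```python
def unstable_fields(runs):
    """Return the sorted names of fields whose value differs across runs."""
    keys = set().union(*(run.keys() for run in runs))
    return sorted(
        key for key in keys
        if any(run.get(key) != runs[0].get(key) for run in runs[1:])
    )


# Made-up results from three runs on the same document.
runs = [
    {"accountHolder": "Jane Doe", "portfolioValue": "$100.00"},
    {"accountHolder": "Jane Doe", "portfolioValue": "100.00"},
    {"accountHolder": "Jane Doe", "portfolioValue": "$100.00"},
]

print(unstable_fields(runs))  # ['portfolioValue']
```

Fields that show up here are good candidates for a more specific description or for a ground truth entry that catches regressions.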
Why is extraction slow?
Extract runs Parse first, then the extraction LLM. Large documents take longer. If you enabled “Include images”, that adds processing time and cost. Disable it unless you need visual context for extraction.