Reliable extractions come from understanding the system’s architecture. Extract uses an LLM to find and pull values from parsed content, so the quality of your results depends on two things: whether the data exists in the Parse output, and whether your schema helps the LLM locate it.
Start with Parse
When extractions return incorrect values, the root cause is often parsing, not extraction.
Extract never works directly on your original file. It only sees the structured output generated by Parse.
Think of Parse as the ground truth layer. Extract is a filter on top of that layer, shaping the parsed content into your schema. If the foundation is wrong, extraction cannot fix it.
If the value you need isn’t in the Parse output, no amount of schema tweaking will help. You’ll need to adjust your Parse configuration first. Common fixes include:
- Enabling agentic mode for tables with misaligned columns or OCR errors
- Changing table format to HTML for complex tables with merged cells
- Adding formatting detection for signatures, change tracking, and hyperlinks
- Setting a document password for password-protected PDFs
Once you confirm the data exists in Parse output, then focus on improving your extraction schema.
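A quick way to make that check programmatic is a tolerant substring test over the parsed text. This is a minimal sketch: how you obtain the parsed text depends on your Parse client, and the helper name is an assumption, not part of the API.

```python
def value_in_parse_output(parsed_text: str, expected: str) -> bool:
    """Rough check: does the value appear anywhere in the Parse output?

    Lowercases and collapses whitespace, since OCR output often differs
    from the source document in both. (Hypothetical helper for debugging;
    pass in the full text from your Parse result.)
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())

    return normalize(expected) in normalize(parsed_text)
```

If this check fails for a value your schema requests, fix the Parse configuration before touching the schema.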
Schema Design Principles
Your schema is the primary input to the extraction LLM. It determines not just the output structure, but also what the model looks for in the document.
Use descriptive field names
The LLM uses field names as search hints. A field called po_number will be matched against text like “PO Number” or “Purchase Order #” in the document. Generic names like field1 or data give the model nothing to work with.
# Effective: name matches document terminology
schema = {
    "properties": {
        "invoice_total": {"type": "number"},
        "due_date": {"type": "string"},
        "bill_to_address": {"type": "string"}
    }
}

# Problematic: generic names
schema = {
    "properties": {
        "amount": {"type": "number"},
        "date": {"type": "string"},
        "address": {"type": "string"}
    }
}
When your document has multiple dates or amounts, specific names help the model distinguish between them.
Write descriptions that locate values
Field descriptions aren’t just documentation. The LLM reads them to understand what to extract. A good description tells the model where to look and what distinguishes this field from similar ones.
schema = {
    "properties": {
        "contract_date": {
            "type": "string",
            "description": "The date the contract was signed, typically found near the signature block at the end of the document"
        },
        "effective_date": {
            "type": "string",
            "description": "The date when the contract terms take effect, usually stated in the first section"
        },
        "expiration_date": {
            "type": "string",
            "description": "The date when the contract expires, found in the termination clause"
        }
    }
}
Each date field is now distinguishable because the description explains where it appears and what it represents.
Constrain values with enums
When a field has a known set of possible values, use an enum. This prevents hallucination and ensures consistent output formatting.
schema = {
    "properties": {
        "document_type": {
            "type": "string",
            "enum": ["invoice", "receipt", "purchase_order", "credit_memo"],
            "description": "The type of financial document"
        },
        "payment_status": {
            "type": "string",
            "enum": ["paid", "unpaid", "partial", "overdue"],
            "description": "Current payment status"
        }
    }
}
Without enums, the model might return “Invoice”, “INVOICE”, “invoice document”, or other variations. Enums force a canonical format.
Keep nesting shallow
Deeply nested schemas reduce extraction accuracy. Each level of nesting adds cognitive load for the LLM, increasing the chance of structural errors.
# Avoid: deeply nested
schema = {
    "properties": {
        "parties": {
            "type": "object",
            "properties": {
                "buyer": {
                    "type": "object",
                    "properties": {
                        "contact": {
                            "type": "object",
                            "properties": {
                                "address": {"type": "string"}
                            }
                        }
                    }
                }
            }
        }
    }
}

# Better: flattened
schema = {
    "properties": {
        "buyer_name": {"type": "string"},
        "buyer_address": {"type": "string"},
        "buyer_contact_email": {"type": "string"}
    }
}
If you need nested output for your application, extract flat data and restructure it in your code.
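For example, the flat buyer fields above can be reassembled into whatever nested shape your application expects. The values here are made up; only the field names come from the flattened schema.

```python
# Hypothetical flat extraction result, keyed by the schema's field names
flat = {
    "buyer_name": "Acme Corp",
    "buyer_address": "123 Main St",
    "buyer_contact_email": "ops@acme.example",
}

# Rebuild the nested structure in application code, not in the schema
nested = {
    "buyer": {
        "name": flat["buyer_name"],
        "contact": {
            "address": flat["buyer_address"],
            "email": flat["buyer_contact_email"],
        },
    }
}
```

The LLM only ever sees the flat schema; the nesting is deterministic code you control.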
Extract can only return values that appear in the document. If you request calculated fields or inferred data, the model may hallucinate. This principle extends to any transformation: currency conversion, date formatting, string concatenation. Extract the raw data and transform it yourself.
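A minimal sketch of that pattern, assuming the model returned the date and total exactly as they appear in the document:

```python
from datetime import datetime

# Raw values as extracted from the document (example strings)
raw = {"contract_date": "March 5, 2024", "invoice_total": "1,250.00"}

# Normalize in code instead of asking the model to reformat or compute
contract_date = datetime.strptime(raw["contract_date"], "%B %d, %Y").date().isoformat()
invoice_total = float(raw["invoice_total"].replace(",", ""))
```

The transformation is now deterministic and testable, and the model is never tempted to invent a value it cannot see.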
System Prompts
The system prompt provides document-level context. It’s where you describe what kind of document this is and how to handle ambiguity.
A good system prompt covers:
- Document type context: “This is a commercial real estate lease agreement” or “These are bank statements from various institutions”
- Global extraction rules: “Extract all individual transactions. Exclude summary rows, headers, and running totals.”
- Edge case handling: “Some invoices split line items across pages. Treat these as single items.”
- Precision guidance: “Be thorough and process all pages in the document.”
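Put together, a system prompt combining these elements might read like this. The wording is illustrative, not a required template:

```python
# Example system prompt for a bank-statement extraction job
system_prompt = (
    "These are bank statements from various institutions. "
    "Extract all individual transactions. Exclude summary rows, "
    "headers, and running totals. Some statements split a single "
    "transaction across pages; treat these as one item. "
    "Be thorough and process all pages in the document."
)
```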
Field-specific instructions work better as descriptions because they're attached to the relevant field. For instance, putting "use YYYY-MM-DD format" in the system prompt would apply to every date in the output, even when different date fields call for different formats.
Citations
Citations link each extracted value back to its source location in the document. Enable them when you need to audit extractions, show users where values came from, or debug extraction accuracy.
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={"citations": {"enabled": True}}
)
When citations are enabled, chunking is automatically disabled because citations require knowing exactly where each piece of content came from.