Schema design and prompt writing for reliable extractions
Reliable extractions come from understanding the system’s architecture. Extract uses an LLM to find and pull values from parsed content, so the quality of your results depends on two things: whether the data exists in the Parse output, and whether your schema helps the LLM locate it.
When extractions return incorrect values, the root cause is often parsing, not extraction.
Extract never works directly on your original file. It only sees the structured output generated by Parse.
Think of Parse as the ground truth layer. Extract is a filter on top of that layer, shaping the parsed content into your schema. If the foundation is wrong, extraction cannot fix it. If the value you need isn’t in the Parse output, no amount of schema tweaking will help. You’ll need to adjust your Parse configuration first. Common fixes include:
Enabling agentic mode for tables with misaligned columns or OCR errors
Changing table format to HTML for complex tables with merged cells
Adding formatting detection for signatures, change tracking, and hyperlinks
Setting a document password for password-protected PDFs
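The fixes above can be sketched as a single parse configuration. The option names below are illustrative assumptions, not the literal API — check the Parse reference for the exact parameter names your version accepts.

```python
# Hypothetical Parse configuration illustrating the common fixes.
# Option names are assumptions for illustration, not the literal API.
parse_settings = {
    "agentic_mode": True,           # re-check tables with misaligned columns or OCR errors
    "table_format": "html",         # preserve merged cells in complex tables
    "formatting_detection": True,   # capture signatures, change tracking, hyperlinks
    "document_password": "s3cret",  # required for password-protected PDFs
}
```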
Once you confirm the data exists in Parse output, then focus on improving your extraction schema.
Your schema is the primary input to the extraction LLM. It determines not just the output structure, but also what the model looks for in the document.
The LLM uses field names as search hints. A field called po_number will be matched against text like “PO Number” or “Purchase Order #” in the document. Generic names like field1 or data give the model nothing to work with.
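As an illustration, compare a schema whose field names give the model search hints with one that does not (the names and fields here are examples, not required keys):

```python
# Descriptive names act as search hints against the document text.
good_schema = {
    "properties": {
        "po_number": {"type": "string"},      # matches "PO Number", "Purchase Order #"
        "invoice_total": {"type": "string"},  # matches "Total", "Amount Due"
    }
}

# Generic names give the model nothing to anchor on.
bad_schema = {
    "properties": {
        "field1": {"type": "string"},
        "data": {"type": "string"},
    }
}
```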
Field descriptions aren’t just documentation. The LLM reads them to understand what to extract. A good description tells the model where to look and what distinguishes this field from similar ones.
```python
schema = {
    "properties": {
        "contract_date": {
            "type": "string",
            "description": "The date the contract was signed, typically found near the signature block at the end of the document"
        },
        "effective_date": {
            "type": "string",
            "description": "The date when the contract terms take effect, usually stated in the first section"
        },
        "expiration_date": {
            "type": "string",
            "description": "The date when the contract expires, found in the termination clause"
        }
    }
}
```
Each date field is now distinguishable because the description explains where it appears and what it represents.
Extract can only return values that appear in the document. If you request calculated fields or inferred data, the model may hallucinate. This principle extends to any transformation: currency conversion, date formatting, string concatenation. Extract the raw data and transform it yourself.
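A minimal sketch of that division of labor: the schema requests the raw strings as they appear on the page, and client-side code handles formatting and arithmetic. The field names and sample values below are assumptions for illustration.

```python
from datetime import datetime
from decimal import Decimal

# Pretend these raw strings came back from Extract, exactly as printed in the document.
raw = {"invoice_date": "March 5, 2024", "amount_due": "$1,234.56"}

# Transform client-side instead of asking the model to reformat or calculate.
iso_date = datetime.strptime(raw["invoice_date"], "%B %d, %Y").date().isoformat()
amount = Decimal(raw["amount_due"].replace("$", "").replace(",", ""))

print(iso_date)  # 2024-03-05
print(amount)    # 1234.56
```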
The system prompt provides document-level context. It’s where you describe what kind of document this is and how to handle ambiguity. A good system prompt includes:
Document type context: “This is a commercial real estate lease agreement” or “These are bank statements from various institutions”
Global extraction rules: “Extract all individual transactions. Exclude summary rows, headers, and running totals.”
Edge case handling: “Some invoices split line items across pages. Treat these as single items.”
Precision guidance: “Be thorough and process all pages in the document.”
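Putting the four elements together, a system prompt for a bank-statement extraction might read as follows. The wording is illustrative, not a required template:

```python
system_prompt = (
    # Document type context
    "These are bank statements from various institutions. "
    # Global extraction rules
    "Extract all individual transactions. Exclude summary rows, headers, and running totals. "
    # Edge case handling
    "Some statements split a transaction's description across two lines; treat these as one transaction. "
    # Precision guidance
    "Be thorough and process all pages in the document."
)
```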
Field-specific instructions work better as descriptions because they’re attached to the relevant field. For instance, putting “use YYYY-MM-DD format” in the system prompt would apply to all dates, which might not be what you want for different date fields.
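For example, two date fields can carry different instructions in their own descriptions instead of sharing one global rule. This is an illustrative sketch; the field names are hypothetical.

```python
# Each description carries only the instruction relevant to its field.
date_schema = {
    "properties": {
        "statement_date": {
            "type": "string",
            "description": "The statement date, returned in YYYY-MM-DD format",
        },
        "check_date": {
            "type": "string",
            "description": "The handwritten date on the check, returned exactly as written",
        },
    }
}
```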
Citations link each extracted value back to its source location in the document. Enable them when you need to audit extractions, show users where values came from, or debug extraction accuracy.
```python
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={"citations": {"enabled": True}},
)
```
When citations are enabled, chunking is automatically disabled because citations require knowing exactly where each piece of content came from.