Reliable extractions come from understanding the system’s architecture. Extract uses an LLM to find and pull values from parsed content, so the quality of your results depends on two things: whether the data exists in the Parse output, and whether your schema helps the LLM locate it.
Start with Parse
When extractions return incorrect values, the root cause is often parsing, not extraction. Extract never works directly on your original file. It only sees the structured output generated by Parse.
Common fixes at the Parse stage include:
- Enabling agentic mode for tables with misaligned columns or OCR errors
- Changing table format to HTML for complex tables with merged cells
- Adding formatting detection for signatures, change tracking, and hyperlinks
- Setting a document password for password-protected PDFs
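Before adjusting the schema, it can help to confirm that the value you expect is present in the Parse output at all. The sketch below is one way to do that, assuming you have already obtained the Parse output as plain text. The helper `missing_from_parse` and the sample text are hypothetical and not part of the Reducto API.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_from_parse(parsed_text: str, expected_values: list[str]) -> list[str]:
    """Return the expected values that do not appear anywhere in the parsed text."""
    haystack = normalize(parsed_text)
    return [value for value in expected_values if normalize(value) not in haystack]


# Hypothetical Parse output for an invoice, flattened to text.
parsed_text = """
Invoice
PO Number:   4500012345
Total Due    $1,280.00
"""

# Values a reviewer can see in the original file.
expected = ["4500012345", "$1,280.00", "Net 30"]

print(missing_from_parse(parsed_text, expected))  # prints ['Net 30']
```

Any value reported as missing points to a parsing problem, so changing the schema alone is unlikely to recover it.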
Schema Design Principles
Your schema is the primary input to the extraction LLM. It determines not just the output structure, but also what the model looks for in the document.
Use descriptive field names
The LLM uses field names as search hints. A field called po_number will be matched against text like “PO Number” or “Purchase Order #” in the document. Generic names like field1 or data give the model nothing to work with.
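The contrast can be sketched as two schemas, assuming the schema is written in JSON Schema style and represented here as Python dictionaries. The field names other than po_number, and the `generic_field_names` check with its pattern, are hypothetical illustrations rather than anything Extract provides.

```python
import re

# Generic names give the extraction LLM no hint about what to look for.
vague_schema = {
    "type": "object",
    "properties": {
        "field1": {"type": "string"},
        "data": {"type": "string"},
    },
}

# Descriptive names double as search hints against the document text.
descriptive_schema = {
    "type": "object",
    "properties": {
        "po_number": {"type": "string"},
        "invoice_date": {"type": "string"},
    },
}

# Hypothetical heuristic: names that are a placeholder word plus optional digits.
GENERIC_NAME = re.compile(r"^(field|col|column|item|value|data)_?\d*$", re.IGNORECASE)


def generic_field_names(schema: dict) -> list[str]:
    """List top-level property names that look like placeholders."""
    return [name for name in schema.get("properties", {}) if GENERIC_NAME.match(name)]


print(generic_field_names(vague_schema))        # prints ['field1', 'data']
print(generic_field_names(descriptive_schema))  # prints []
```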
Write descriptions that locate values
Field descriptions aren’t just documentation. The LLM reads them to understand what to extract. A good description tells the model where to look and what distinguishes this field from similar ones.
Constrain values with enums
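As an illustration of an enum-constrained field, the sketch below assumes a JSON Schema style schema held in a Python dictionary. The field name, the allowed values, and the `out_of_enum` helper are hypothetical. The description also shows the kind of wording that tells the model where to look.

```python
schema = {
    "type": "object",
    "properties": {
        "payment_status": {
            "type": "string",
            # Only these spellings are valid outputs.
            "enum": ["paid", "unpaid", "overdue"],
            "description": (
                "Payment status printed in the invoice header, next to the "
                "invoice date. Not the status of individual line items."
            ),
        },
    },
}


def out_of_enum(schema: dict, result: dict) -> dict:
    """Return the fields in result whose value is not one of the schema's enum values."""
    properties = schema.get("properties", {})
    return {
        name: value
        for name, value in result.items()
        if "enum" in properties.get(name, {}) and value not in properties[name]["enum"]
    }


# A free-text value that the enum is meant to rule out.
print(out_of_enum(schema, {"payment_status": "Paid in full"}))
print(out_of_enum(schema, {"payment_status": "paid"}))
```

A check like this is only a safety net on your side. The enum in the schema is what steers the model toward a fixed vocabulary.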
When a field has a known set of possible values, use an enum. This prevents hallucination and ensures consistent output formatting.
Keep nesting shallow
Deeply nested schemas reduce extraction accuracy. Each level of nesting adds cognitive load for the LLM, increasing the chance of structural errors.
Extract only what exists
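Derived values such as totals or reformatted dates are safer to compute in your own code from the raw extracted values. A minimal sketch, assuming an extraction result that returns values as printed in the document. The result shape, field names, and date format are hypothetical.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical extraction result: raw values exactly as they appear in the document.
extracted = {
    "invoice_date": "03/15/2024",
    "line_items": [
        {"description": "Widget", "quantity": "4", "unit_price": "12.50"},
        {"description": "Gasket", "quantity": "10", "unit_price": "0.99"},
    ],
}

# Date formatting happens after extraction, not in the schema.
iso_date = datetime.strptime(extracted["invoice_date"], "%m/%d/%Y").date().isoformat()

# Calculated fields are derived locally, using Decimal to avoid float rounding.
line_totals = [
    Decimal(item["quantity"]) * Decimal(item["unit_price"])
    for item in extracted["line_items"]
]
subtotal = sum(line_totals, Decimal("0"))

print(iso_date, subtotal)  # prints 2024-03-15 59.90
```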
Extract can only return values that appear in the document. If you request calculated fields or inferred data, the model may hallucinate. This principle extends to any transformation: currency conversion, date formatting, string concatenation. Extract the raw data and transform it yourself.
System Prompts
The system prompt provides document-level context. It’s where you describe what kind of document this is and how to handle ambiguity. The system prompt should have the following:
- Document type context: “This is a commercial real estate lease agreement” or “These are bank statements from various institutions”
- Global extraction rules: “Extract all individual transactions. Exclude summary rows, headers, and running totals.”
- Edge case handling: “Some invoices split line items across pages. Treat these as single items.”
- Precision guidance: “Be thorough and process all pages in the document.”
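The four parts listed above can be assembled into a single prompt string. The sketch below only builds the text. The first two sentences are hypothetical examples for an invoice workflow, the last two are taken from the list above, and how the string is passed to Extract is not shown.

```python
# Each entry covers one of the four parts of a system prompt.
prompt_parts = {
    # Hypothetical document type context for an invoice workflow.
    "document_type": "These are vendor invoices from multiple suppliers.",
    # Hypothetical global extraction rule.
    "global_rules": "Extract every line item. Exclude subtotal and tax rows.",
    # Edge case handling, quoted from the list above.
    "edge_cases": "Some invoices split line items across pages. Treat these as single items.",
    # Precision guidance, quoted from the list above.
    "precision": "Be thorough and process all pages in the document.",
}

# Join in a fixed order so the document context always comes first.
order = ["document_type", "global_rules", "edge_cases", "precision"]
system_prompt = " ".join(prompt_parts[key] for key in order)

print(system_prompt)
```

Keeping the parts separate makes it easier to change one rule, rerun the extraction, and see which sentence affected the result.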
Citations
Citations link each extracted value back to its source location in the document. Enable them when you need to audit extractions, show users where values came from, or debug extraction accuracy.
Related
Array Extraction
Handle long documents with repeating data.
Citations
Link values to source locations.
Extract Overview
Endpoint basics and parameters.