Overview

Designing reliable extractions is about balancing schema design and prompt writing. A well-structured schema ensures outputs are predictable and machine-readable, while a concise, targeted prompt guides the model to interpret documents accurately. This page shares best practices for:
  • Schemas: how to define clear fields, use enums and descriptions, avoid unnecessary nesting, and focus on what matters in the source document.
  • Prompts: how to write system-level instructions that clarify edge cases without over-constraining the model.
By applying these practices, you can reduce hallucinations, improve field completeness, and make results easier to consume downstream. Treat parsing as the foundation, then layer schema and prompt design on top of it for extractions that are consistent, accurate, and scalable.
When debugging incorrect or missing outputs, first check the parse output. Parse configurations might need adjusting before extract prompts can be tuned.

JSON schema

Your extraction schema plays a big part in the quality of your outputs. You can adjust the structure, field descriptions, and typing to get the most out of your extraction.

1. Use field names that closely match the document contents, and write a good description for each. Field names and descriptions that align with how information typically appears in the source documents make it easier for the LLM to identify and extract the correct values. If you're extracting from tables, reusing the table headers is helpful.

2. Add an enum if your data has a limited set of outcomes. If a field has a predictable set of values (e.g., "Yes" / "No" or predefined categories), use an enum to constrain the output and improve consistency.
"properties": {
      "currency_type": {
         "type": "string",
          "enum": [
          "USD",
          "EUR",
          "JPY",
          "CAD",
          "AUD",
          "Other"
          ],
          "description": "International currency codes"
      }
  }
3. Don't create new data inside your schema; do that manipulation downstream. If you try to extract data that doesn't exist in the document, you're more likely to run into hallucinations. Avoid asking the LLM to compute values.
// Schema extracts the monthly cost exactly as it appears on the document
"properties": {
    "monthly_cost": {
        "type": "number",
        "description": "The total monthly cost for this service."
    }
}

// Later, compute the annual cost downstream instead of in the schema
total_annual_price = extract_result.json()["result"][0]["monthly_cost"] * 12

4. Don't nest too deeply. Keep your schema structure relatively flat to avoid confusion. Nesting deeper than 2 or 3 levels can lower extraction accuracy.

5. Use an array type for long lists. If you're extracting a long list of items (e.g., the list of orders from an invoice table), put an array at the top level. This helps the model capture every item in the list without missing entries at the end.
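As a sketch of both points, the field names below are illustrative: a top-level array for invoice line items keeps nesting shallow (two levels) while prompting the model to enumerate every row.

```json
"properties": {
    "line_items": {
        "type": "array",
        "description": "Every order row from the invoice table, in order.",
        "items": {
            "type": "object",
            "properties": {
                "item_name": { "type": "string", "description": "Item name as shown in the table." },
                "quantity": { "type": "number", "description": "Quantity ordered." }
            }
        }
    }
}
```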

System prompt

The best system prompts are fairly simple and strike a balance: general enough to cover a range of formats, while still giving enough hints and edge-case examples for the LLM to handle ambiguity effectively. Most of your field-specific requirements should live in your schema. The system prompt is better suited to holistic patterns you notice across extraction outputs, for example, if a certain type of block keeps getting skipped.
Extraction returns only what is on the document. Do not add post-processing or logic steps inside the system prompt.
Some common things to mention:
  • How you would like the extraction to behave (e.g., "Be thorough and precise, make sure to process all the pages.")
  • General information about how the document is structured or how it is used.
  • Special instructions for certain edge cases or complex layouts.
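Putting these together, a minimal system prompt (the document type here is hypothetical) might read:

```
You are extracting data from multi-page utility invoices.
Be thorough and precise, and make sure to process all the pages.
Charges appear in a table on the last page; treat each row as a separate line item.
If a field is not present on the document, leave it empty rather than guessing.
```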
Overall, the length of your system prompt does not degrade extraction performance.