JSON Schema

Your extraction schema strongly affects the quality of your outputs. Adjust the structure, the field descriptions (which act as prompts), and the typing to get the most out of your extraction.

1. Have field names that closely match document contents and good descriptions for each.

Use field names and descriptions that align with how information typically appears in the source documents. This makes it easier for the LLM to identify and extract the correct values. If you’re extracting from tables, use the table’s column headers as field names.
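
As a minimal sketch, assume a hypothetical invoice whose line-item table has the column headers "Description", "Qty", and "Unit Price". The headers, the helper name header_to_field, and the description wording are all illustrative assumptions, not taken from a real document or API. The idea is only that field names and descriptions are derived from what is printed on the page.

```python
import json

# Hypothetical column headers as printed on the document (an assumption).
table_headers = ["Description", "Qty", "Unit Price"]


def header_to_field(header: str) -> str:
    """Turn a printed header into a snake_case field name."""
    return header.strip().lower().replace(" ", "_")


# One property per header; each description names the header it comes from.
line_item_properties = {
    header_to_field(h): {
        "type": "string",
        "description": f'Value from the "{h}" column of the line-item table',
    }
    for h in table_headers
}

print(json.dumps(line_item_properties, indent=2))
```

Every field is typed as a string here to keep the sketch short; in practice a column such as "Qty" would probably be better typed as a number.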

2. Add an optional enum type if your data has limited outcomes.

If a field has a predictable set of values (e.g., “Yes” / “No” or predefined categories), use an enum to constrain the output and improve consistency.

"properties": {
    "currency_type": {
        "type": "string",
        "enum": [
            "USD",
            "EUR",
            "JPY",
            "CAD",
            "AUD",
            "Other"
        ],
        "description": "International currency codes"
    }
}
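
An enum also gives you something concrete to check downstream. The sketch below assumes the extracted value arrives as a plain string; the names CURRENCY_ENUM and check_currency are hypothetical helpers, not part of any library.

```python
# Allowed values, copied from the enum in the schema fragment above.
CURRENCY_ENUM = ["USD", "EUR", "JPY", "CAD", "AUD", "Other"]


def check_currency(value: str) -> str:
    """Return the value if the enum allows it, otherwise raise ValueError."""
    if value not in CURRENCY_ENUM:
        raise ValueError(f"unexpected currency_type: {value!r}")
    return value
```

Raising on an unexpected value is one design choice. Mapping unknown values to "Other" would be another, at the cost of hiding extraction mistakes.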

3. Don’t create new data inside your schema. Do downstream data manipulation.

If you try to extract data that doesn’t exist in the document, you’re more likely to run into hallucinations. Avoid asking the LLM to compute values; extract the raw values and compute derived ones in your own code.

// Schema extracts a monthly cost
"properties": {
    "monthly_cost": {
        "type": "number",
        "description": "The total monthly cost for this service."
    }
}
.
.
.
# Later on, calculate the annual cost downstream
total_annual_price = extract_result.json()["result"][0]["monthly_cost"] * 12

4. Don’t nest too deeply.

Keep your schema structure relatively flat. Nesting deeper than two or three levels can reduce extraction accuracy.
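
One way to keep an eye on this is to measure nesting before you submit a schema. The following is a rough sketch, assuming the schema is a Python dict that nests only through "properties" and "items"; schema_depth and the example schema are hypothetical, and other JSON Schema keywords are ignored.

```python
def schema_depth(schema: dict) -> int:
    """Return the number of property levels in the schema (a flat schema is 1)."""
    depths = [0]
    # Object fields nest through "properties"; each level of fields counts as 1.
    for child in schema.get("properties", {}).values():
        depths.append(1 + schema_depth(child))
    # Arrays nest through "items"; the array wrapper itself adds no level here.
    items = schema.get("items")
    if isinstance(items, dict):
        depths.append(schema_depth(items))
    return max(depths)


# Hypothetical schema: vendor -> address -> city is three levels deep.
nested_schema = {
    "type": "object",
    "properties": {
        "vendor": {
            "type": "object",
            "properties": {
                "address": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                }
            },
        }
    },
}

if schema_depth(nested_schema) > 2:
    print("Consider flattening, e.g. a top-level vendor_city field.")
```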

System Prompt

The best system prompts are simple and strike a balance: general enough to cover a range of formats, yet with enough hints and edge-case examples for the LLM to handle ambiguity effectively. Most of your requirements should live in your schema.

Some common things to mention:

  • How the extraction should be carried out (“Be thorough and precise, make sure to process all the pages.”)
  • General information about how the document is structured or its use case.
  • Special instructions for certain edge cases or complex layouts.
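
The three kinds of guidance listed above can be assembled into one short prompt. This is only a sketch: build_system_prompt is a hypothetical helper, and the invoice context and the edge case are invented for illustration.

```python
def build_system_prompt(style: str, document_context: str, edge_cases: list[str]) -> str:
    """Join the three kinds of guidance into one short system prompt."""
    parts = [style, document_context]
    parts.extend(f"Edge case: {case}" for case in edge_cases)
    # Drop empty pieces so the prompt stays short.
    return "\n".join(p.strip() for p in parts if p.strip())


system_prompt = build_system_prompt(
    style="Be thorough and precise, make sure to process all the pages.",
    document_context="These are vendor invoices; line items appear in a table on each page.",
    edge_cases=["If a total is handwritten over a printed one, use the handwritten value."],
)
print(system_prompt)
```

Keeping the prompt to a few lines like this leaves the field-level requirements in the schema descriptions.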