Best Practices for Extraction
Tips on how to create quality JSON schemas and prompts
JSON Schema
Your extraction schema strongly affects the quality of your outputs. Experiment with its structure, field descriptions, and types to get the most out of your extraction.
1. Use field names that closely match the document contents, and give each one a good description.
Use field names and descriptions that align with how information typically appears in the source documents. This makes it easier for the LLM to identify and extract the correct values. If you’re extracting from tables, reuse the column headers as field names.
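As a rough illustration of this tip, the sketch below writes a schema as a Python dictionary for a hypothetical invoice whose table headers read "Invoice No.", "Due Date", and "Amount Due". The document type, field names, and descriptions are assumptions made up for this example, not part of any particular product's API.

```python
import json

# Hypothetical schema: field names mirror the headers printed on the invoice,
# and each description says where and how the value usually appears.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_no": {
            "type": "string",
            "description": "Value under the 'Invoice No.' header, e.g. INV-00123.",
        },
        "due_date": {
            "type": "string",
            "description": "Value under the 'Due Date' header, exactly as printed.",
        },
        "amount_due": {
            "type": "string",
            "description": "Value under the 'Amount Due' header, including the currency symbol.",
        },
    },
}

# Serialize to JSON, the form in which a schema is typically submitted.
schema_json = json.dumps(invoice_schema, indent=2)
print(schema_json)
```

A vague name such as field_3 with no description would give the LLM nothing to match against the page, which is the situation this tip is meant to avoid.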
2. Add an optional enum type if your data has limited outcomes.
If a field has a predictable set of values (e.g., “Yes” / “No” or predefined categories), use an enum to constrain the output and improve consistency.
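A minimal sketch of an enum-constrained field, assuming a hypothetical contract document. The field names and categories are invented for illustration, and the small is_allowed helper is only a local sanity check, not a full JSON Schema validator.

```python
# Hypothetical schema with two enum-constrained fields.
contract_schema = {
    "type": "object",
    "properties": {
        "is_signed": {
            "type": "string",
            "enum": ["Yes", "No"],
            "description": "Whether the signature block on the last page is filled in.",
        },
        "contract_type": {
            "type": "string",
            "enum": ["Lease", "Employment", "NDA", "Other"],
            "description": "Category named in the document title.",
        },
        "counterparty": {
            "type": "string",
            "description": "Name of the other party, as written in the first paragraph.",
        },
    },
}


def is_allowed(schema: dict, field: str, value: str) -> bool:
    """Return True if the value fits the field's enum, or if the field has no enum."""
    allowed = schema["properties"][field].get("enum")
    return allowed is None or value in allowed


print(is_allowed(contract_schema, "is_signed", "Yes"))
print(is_allowed(contract_schema, "is_signed", "Maybe"))
```

Including a catch-all value such as "Other" is a design choice: it gives the LLM a valid answer when the document does not fit any of the listed categories.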
3. Don’t create new data inside your schema. Do data manipulation downstream.
If you try to extract data that doesn’t exist in the document, you’re more likely to run into hallucinations. Avoid asking the LLM to compute values such as totals or conversions; extract the raw values and derive the rest in your own code.
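One way to keep computation downstream is sketched below: the schema asks only for values printed on the page (quantity and unit price), and the total is computed afterwards in code. The extracted dictionary is a made-up stand-in for an extraction result, and its shape is an assumption.

```python
from decimal import Decimal

# Hypothetical extraction result containing only values that appear in the document.
extracted = {
    "line_items": [
        {"description": "Widget", "quantity": 3, "unit_price": "4.50"},
        {"description": "Gadget", "quantity": 1, "unit_price": "10.00"},
    ]
}


def line_total(item: dict) -> Decimal:
    """Multiply quantity by unit price using exact decimal arithmetic."""
    return Decimal(str(item["quantity"])) * Decimal(item["unit_price"])


def invoice_total(result: dict) -> Decimal:
    """Sum the line totals in code instead of asking the LLM for a total."""
    return sum((line_total(item) for item in result["line_items"]), Decimal("0"))


print(invoice_total(extracted))
```

If the document also prints its own total, extracting that value separately and comparing it with the computed one can serve as a consistency check.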
4. Don’t nest too deeply.
Keep your schema structure relatively flat to avoid confusion. Nesting deeper than 2 or 3 levels can lower extraction accuracy.
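To check a schema against this guideline, a small helper can count how many object levels it contains. This is only one possible way to measure depth (arrays are treated as transparent, and only their item objects are counted), and the example schemas are hypothetical.

```python
def schema_depth(schema: dict) -> int:
    """Count nested object levels; assumes 'items' is a single schema dict."""
    schema_type = schema.get("type")
    if schema_type == "object":
        children = schema.get("properties", {}).values()
        return 1 + max((schema_depth(child) for child in children), default=0)
    if schema_type == "array":
        # An array adds no level itself; its item schema may.
        return schema_depth(schema.get("items", {}))
    return 0


# Flat version: address parts are top-level fields.
flat_schema = {
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "customer_city": {"type": "string"},
    },
}

# Deeply nested version of similar information.
nested_schema = {
    "type": "object",
    "properties": {
        "customer": {
            "type": "object",
            "properties": {
                "address": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "object",
                            "properties": {"city": {"type": "string"}},
                        }
                    },
                }
            },
        }
    },
}

print(schema_depth(flat_schema), schema_depth(nested_schema))
```

A result above 3 suggests flattening, for example by promoting customer.address.location.city to a top-level customer_city field.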
System Prompt
The best system prompts are simple and strike a balance: general enough to cover a range of formats, yet specific enough, through hints and edge-case examples, for the LLM to handle ambiguity effectively. Most of your requirements should live in your schema.
Some common things to mention:
- How you would like the extraction to be performed (“Be thorough and precise, make sure to process all the pages.”)
- General information about how the document is structured or its use case.
- Special instructions for certain edge cases or complex layouts.
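The three points above can be combined into a short prompt. The sketch below assembles one from its parts; the helper name, the invoice use case, and the edge cases are all hypothetical and serve only to show the shape of such a prompt.

```python
def build_system_prompt(style: str, document_context: str, edge_cases: list[str]) -> str:
    """Join extraction style, document context, and edge-case notes into one prompt."""
    lines = [style, document_context]
    if edge_cases:
        lines.append("Special cases:")
        lines.extend(f"- {case}" for case in edge_cases)
    return "\n".join(lines)


system_prompt = build_system_prompt(
    style="Be thorough and precise, make sure to process all the pages.",
    document_context=(
        "The documents are supplier invoices used for accounts payable. "
        "Line items appear in a table that may continue across pages."
    ),
    edge_cases=[
        "If a table continues on the next page, treat it as one table.",
        "If a value is handwritten over a printed one, use the handwritten value.",
    ],
)

print(system_prompt)
```

Field-level requirements are deliberately left out of the prompt here, since they belong in the schema descriptions.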