Handling handwriting or small text
- Use Agentic OCR.
- Best for: forms, signatures, handwritten notes.
- Tradeoff: slightly higher cost and latency in exchange for higher accuracy.
Working with languages outside the default set
- Use the Multilingual OCR system.
- Needed for: languages outside English, Spanish, Italian, Portuguese, French, or German.
- Also improves parsing of Unicode and special characters.
Unexpected symbols in your output
- Cause: metadata embeddings in PDFs may contain corrupted or hidden text.
- Fix: switch to OCR extraction mode.
- Avoid: Hybrid or Metadata extraction modes unless you are sure the metadata is reliable.
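Before re-running a document in OCR extraction mode, it can help to measure how much of the output looks like debris. The following is a minimal sketch, not part of the product: the function name and the idea of comparing against a threshold are assumptions for illustration. It counts control, private-use, and unassigned code points plus the Unicode replacement character.

```python
import unicodedata


def suspicious_char_ratio(text):
    """Return the fraction of characters that look like extraction debris."""
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        # Ordinary whitespace control characters are expected in parsed text.
        if ch in "\n\r\t":
            continue
        category = unicodedata.category(ch)
        # Cc = control, Co = private use, Cn = unassigned;
        # U+FFFD is the replacement character left by failed decoding.
        if category in ("Cc", "Co", "Cn") or ch == "\ufffd":
            bad += 1
    return bad / len(text)
```

If the ratio for a parsed page is noticeably above zero, that is a hint (not proof) that the embedded text is unreliable and OCR extraction mode is worth trying. Any cutoff value is a judgment call for your own documents.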
Missing checkboxes or images
- Enable experimental options in parse configs:
  - enable_checkboxes: Detects and returns checkboxes with True/False.
  - return_figure_images: Detects and returns figures in the document.
  - return_table_images: Detects and returns tables in the document.
- URLs from return_figure_images and return_table_images expire in ~24 hours. Download immediately if you need permanent storage.
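Because the image URLs expire, one option is to copy them to your own storage as soon as a parse finishes. The sketch below uses only the Python standard library and assumes you have already collected the URLs into a list. How they are laid out in the parse response is not covered here, and the helper name and file naming scheme are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def save_image_urls(urls, dest_dir):
    """Download each URL into dest_dir and return the list of saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # Keep the file extension from the URL path if it has one.
        suffix = Path(urlparse(url).path).suffix or ".bin"
        target = dest / f"image_{index:04d}{suffix}"
        # Stream the response to disk instead of loading it into memory.
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved
```

Calling save_image_urls(figure_urls, "figures") right after parsing keeps a permanent local copy. For production use you would likely add retries and a timeout.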
Complex tables causing problems?
- Option 1: Enrich with table mode
  - Enable enrich_mode with table.
  - Add prompts to guide how rows and columns should align.
- Option 2: AI JSON format
  - Use Table Output Format → ai_json.
  - Passes the table image to a model for structural analysis.
  - Tradeoff: higher latency, sometimes higher accuracy.
Spreadsheet-related configurations
Only three configurations change outputs for spreadsheets:
- include_color_information: adds Excel cell color details with LaTeX to the parse output.
- spreadsheet_table_clustering: splits up individual tables inside of multi-table spreadsheets.
- large_table_chunking: splits very large tables into smaller, manageable chunks for downstream processing.