Parse options affect accuracy, cost, and speed. The following scenarios highlight when to use specific configurations.

Handling handwriting or small text

  • Use Agentic OCR.
  • Best for: forms, signatures, handwritten notes.
  • Tradeoff: slightly higher cost and latency in exchange for higher accuracy.

Working with languages outside the core Western European set

  • Use the Multilingual OCR system.
  • Needed for: any language other than English, Spanish, Italian, Portuguese, French, or German.
  • Also improves parsing of Unicode and special characters.
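
As a rough illustration of the rule above, the check below decides whether a document's language falls outside the six listed ones. The function name, the ISO 639-1 codes, and the returned labels "standard" and "multilingual" are assumptions made for this sketch, not confirmed option values.

```python
# Languages the list above names as not needing Multilingual OCR,
# written as ISO 639-1 codes (the codes are an assumption of this sketch).
DEFAULT_LANGUAGES = {"en", "es", "it", "pt", "fr", "de"}


def pick_ocr_system(language_code: str) -> str:
    """Return a hypothetical OCR system label for a language code."""
    normalized = language_code.strip().lower()
    # Drop a region suffix such as "pt-BR" or "en_US".
    primary = normalized.replace("_", "-").split("-")[0]
    return "standard" if primary in DEFAULT_LANGUAGES else "multilingual"
```

For example, pick_ocr_system("pt-BR") selects the standard label, while pick_ocr_system("ja") selects the multilingual one.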

Unexpected symbols in your output

  • Cause: text embedded in a PDF's metadata may be corrupted or hidden.
  • Fix: switch to OCR extraction mode.
  • Avoid: Hybrid or Metadata extraction modes unless you are sure the metadata is reliable.
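
One way to apply this guidance is to sample some extracted text and fall back to OCR when it looks damaged. The sketch below is only a heuristic under stated assumptions: the mode labels "ocr" and "hybrid" are hypothetical spellings of the modes named above, and the 2% threshold is arbitrary.

```python
import unicodedata

# Unicode general categories that rarely belong in clean document text:
# control characters, private-use code points, and unassigned code points.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}


def suspicious_ratio(text: str) -> float:
    """Fraction of characters that look like corrupted or hidden text."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace controls are fine
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            suspicious += 1
    return suspicious / len(text)


def choose_extraction_mode(sample_text: str,
                           metadata_trusted: bool = False,
                           threshold: float = 0.02) -> str:
    """Default to OCR; allow hybrid only for trusted, clean-looking metadata."""
    if metadata_trusted and suspicious_ratio(sample_text) <= threshold:
        return "hybrid"
    return "ocr"
```

The default deliberately favors OCR, so metadata is used only when the caller states it is reliable and the sample shows no sign of damage.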

Missing checkboxes or images

  • Enable experimental options in parse configs:
    • enable_checkboxes: detects checkboxes and returns each one's state as True/False.
    • return_figure_images: detects figures in the document and returns them as images.
    • return_table_images: detects tables in the document and returns them as images.

  Note: URLs from return_figure_images and return_table_images expire in ~24 hours.
  Download immediately if you need permanent storage.
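
A minimal sketch of both steps follows: switching the three options on, then saving the returned image URLs straight away. The option names come from the list above; the flat dictionary layout, the helper name, and the ".bin" file suffix (the image format is not stated here) are assumptions.

```python
import urllib.request
from pathlib import Path

# Option names are from the list above; the flat dict shape is an assumption.
EXPERIMENTAL_OPTIONS = {
    "enable_checkboxes": True,
    "return_figure_images": True,
    "return_table_images": True,
}


def save_image_urls(urls, dest_dir):
    """Download each URL right away so the files outlive the ~24 hour expiry."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # The image format is unknown in this sketch, hence the neutral suffix.
        target = dest / f"image_{index:03d}.bin"
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
        saved.append(target)
    return saved
```

Calling save_image_urls as soon as the parse result arrives keeps the stored copies independent of the temporary links.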

Complex tables causing problems

  • Option 1: Enrich with table mode
    • Enable enrich_mode with table.
    • Add prompts to guide how rows and columns should align.
  • Option 2: AI JSON format
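
For Option 1, a request fragment might look like the sketch below. Only enrich_mode and the value table come from the text above; the "prompt" key and the helper name are hypothetical.

```python
def table_enrich_options(prompt: str) -> dict:
    """Build a hypothetical config fragment for table enrichment."""
    cleaned = prompt.strip()
    if not cleaned:
        # An empty prompt gives no guidance on row and column alignment.
        raise ValueError("prompt must describe how rows and columns align")
    return {
        "enrich_mode": "table",  # named in the text above
        "prompt": cleaned,       # hypothetical key for the guidance text
    }
```

A prompt such as "Keep merged header cells with the column beneath them." is the kind of alignment guidance the option is meant to carry.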

Only three configurations change outputs for spreadsheets:
  • include_color_information: adds Excel cell color details, formatted as LaTeX, to the parse output.
  • spreadsheet_table_clustering: splits a multi-table spreadsheet into its individual tables.
  • large_table_chunking: splits very large tables into smaller, manageable chunks for downstream processing.
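
Because only these three names matter for spreadsheets, a small guard can catch typos before a request is sent. This is a sketch under assumptions: the helper name, the flat dictionary shape, and the default of False for unset flags are not taken from the text above.

```python
# The three option names listed above.
SPREADSHEET_OPTIONS = frozenset({
    "include_color_information",
    "spreadsheet_table_clustering",
    "large_table_chunking",
})


def spreadsheet_options(**flags: bool) -> dict:
    """Return a config fragment, rejecting names outside the known three."""
    unknown = set(flags) - SPREADSHEET_OPTIONS
    if unknown:
        raise ValueError(f"unknown spreadsheet options: {sorted(unknown)}")
    # Unset flags default to False here; the real defaults are an assumption.
    return {name: bool(flags.get(name, False))
            for name in sorted(SPREADSHEET_OPTIONS)}
```

For instance, spreadsheet_options(large_table_chunking=True) enables chunking and leaves the other two flags off.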