Parse options affect accuracy, cost, and speed. The following scenarios highlight when to use specific configurations.

Handling handwriting or small text

Working with non-Germanic languages

  • Use settings.ocr_system="standard" (the default in v3).
  • Needed for: languages outside English, Spanish, Italian, Portuguese, French, or German.
  • Also improves parsing of Unicode and special characters.
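
As a rough sketch, the dotted name settings.ocr_system can be read as a nested key in the parse options. The snippet below only builds that structure as a plain Python dictionary. The nesting is an assumption drawn from the dotted notation, and how the dictionary reaches the API (SDK call or request body) is intentionally left out.

```python
# Assumption: dotted option names such as settings.ocr_system map to
# nested keys in the options sent with a parse request.
parse_options = {
    "settings": {
        "ocr_system": "standard",  # described above as the v3 default
    }
}

print(parse_options)
```

Because "standard" is described as the default in v3, setting it explicitly mainly documents intent in your own code.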

Unexpected symbols in your output

  • Cause: metadata embeddings in PDFs may contain corrupted or hidden text.
  • Note: In v3, the extraction mode is automatically optimized and no longer configurable. If you continue to see unexpected symbols, please contact support.

Missing checkboxes or images

  • Checkboxes: In v3, checkbox detection is automatically enabled and no longer needs configuration.
  • Images: To return images for figures or tables, use settings.return_images:
    • settings.return_images=["figure"]: Returns figure images
    • settings.return_images=["table"]: Returns table images
    • settings.return_images=["figure", "table"]: Returns both
Image URLs expire after about 24 hours. Download the images promptly if you need permanent storage.
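
The options above, together with the expiry note, can be sketched as follows. The helper names image_options and save_image are hypothetical and not part of any SDK. The nested layout of settings.return_images is assumed from the dotted name, and the way image URLs appear in a parse response is not covered here.

```python
import shutil
import urllib.request

# The two image kinds listed in this section.
ALLOWED_IMAGE_KINDS = {"figure", "table"}


def image_options(*kinds):
    """Hypothetical helper: build the settings block for returned images."""
    unknown = set(kinds) - ALLOWED_IMAGE_KINDS
    if unknown:
        raise ValueError(f"unsupported image kinds: {sorted(unknown)}")
    # Assumption: settings.return_images is a nested key holding a list.
    return {"settings": {"return_images": list(kinds)}}


def save_image(url, dest_path):
    """Hypothetical helper: copy an image URL to local storage.

    Call this soon after parsing, since the URLs expire after about 24 hours.
    """
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


options = image_options("figure", "table")
```

Validating the kinds locally is a design choice for this sketch: it surfaces a typo such as "figures" before a request is ever sent.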

Complex tables causing problems?

  • Option 1: Use agentic table mode
    • Enable enhance.agentic with [{"scope": "table"}].
    • Add a prompt in the prompt field to guide how rows and columns should align.
  • Option 2: AI JSON format
    • Set formatting.table_output_format to "ai_json".
    • Passes the table image to a model for structural analysis.
    • Tradeoff: higher latency, sometimes higher accuracy.
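
The two table options can be sketched as plain dictionaries. This is a minimal illustration under stated assumptions: the helper agentic_table_options is hypothetical, the nesting follows the dotted names, and placing prompt next to scope inside the agentic entry is a guess that should be checked against the API reference.

```python
def agentic_table_options(prompt=None):
    """Hypothetical helper: options for Option 1 (agentic table mode)."""
    entry = {"scope": "table"}
    if prompt:
        # Assumption: the prompt field sits alongside "scope" in the entry.
        entry["prompt"] = prompt
    return {"enhance": {"agentic": [entry]}}


# Option 2: ask for the AI JSON table format instead.
AI_JSON_TABLE_OPTIONS = {"formatting": {"table_output_format": "ai_json"}}

option_1 = agentic_table_options(
    prompt="Treat merged header cells as spanning every column beneath them."
)
```

The example prompt text is illustrative only. Given the latency tradeoff noted for ai_json, it may be worth trying Option 1 first and comparing results on a few representative documents.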
The main configurations for spreadsheet outputs are:
I