Parse options affect accuracy, cost, and speed. The following scenarios highlight when to use specific configurations.
Handling handwriting or small text
Working with non-Germanic languages
- Use settings.ocr_system="standard" (the default in v3).
- Needed for: languages outside English, Spanish, Italian, Portuguese, French, or German.
- Also improves parsing of Unicode and special characters.
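As a rough illustration of the rule above, the sketch below builds a plain options dictionary that sets the OCR system explicitly when a document's language falls outside the listed set. The helper names (needs_standard_ocr, ocr_options) are hypothetical, and the nested dictionary layout is only an assumption inferred from the dotted name settings.ocr_system. It does not call any real client library.

```python
# Languages the text above lists as not requiring this setting.
COVERED_LANGUAGES = {"english", "spanish", "italian", "portuguese", "french", "german"}


def needs_standard_ocr(language: str) -> bool:
    """Return True when the language is outside the listed set (hypothetical helper)."""
    return language.strip().lower() not in COVERED_LANGUAGES


def ocr_options(language: str) -> dict:
    """Build an options dict; the nesting is assumed from the name settings.ocr_system."""
    if needs_standard_ocr(language):
        return {"settings": {"ocr_system": "standard"}}
    # "standard" is described as the v3 default, so nothing needs to be set here.
    return {}
```

Since "standard" is the default in v3, setting it explicitly mainly documents intent and guards against an inherited configuration that overrides it.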
Unexpected symbols in your output
- Cause: metadata embeddings in PDFs may contain corrupted or hidden text.
- Note: In v3, the extraction mode is automatically optimized and no longer configurable. If you continue to see unexpected symbols, please contact support.
Missing checkboxes or images
- Checkboxes: In v3, checkbox detection is automatically enabled and no longer needs configuration.
- Images: To return images for figures or tables, use settings.return_images:
  - settings.return_images=["figure"]: Returns figure images
  - settings.return_images=["table"]: Returns table images
  - settings.return_images=["figure", "table"]: Returns both
Image URLs expire in ~24 hours. Download immediately if you need permanent storage.
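Because the URLs are short-lived, it can help to copy the images to your own storage as soon as a parse result arrives. The following is a minimal sketch using only the Python standard library. It assumes you have already collected the image URLs from the response into a list (the response layout is not described here, so that step is left out), and the function names are hypothetical. The nested dictionary layout is likewise an assumption inferred from the dotted name settings.return_images.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

ALLOWED_KINDS = {"figure", "table"}


def build_image_settings(kinds):
    """Build an options dict requesting images (nesting assumed from the dotted name)."""
    unknown = [k for k in kinds if k not in ALLOWED_KINDS]
    if unknown:
        raise ValueError(f"unsupported image kinds: {unknown}")
    return {"settings": {"return_images": list(kinds)}}


def local_name(url, index):
    """Derive a local file name from the URL path, ignoring any query string."""
    name = Path(urlparse(url).path).name or "image"
    return f"{index:03d}_{name}"


def save_images(urls, out_dir, fetch=None):
    """Download each URL into out_dir and return the saved paths.

    `fetch` is an injectable callable (url -> bytes) so the function can be
    exercised without network access; by default it uses urllib.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=30) as response:
                return response.read()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        target = out / local_name(url, index)
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

Prefixing each file with its index keeps names unique even when two URLs end in the same file name.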
Complex tables causing problems?
- Option 1: Use agentic table mode
  - Enable enhance.agentic with [{"scope": "table"}].
  - Add prompts to guide how rows and columns should align using the prompt field.
- Option 2: AI JSON format
  - Use formatting.table_output_format set to "ai_json".
  - Passes the table image to a model for structural analysis.
  - Tradeoff: higher latency, sometimes higher accuracy.
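The two options can be written down as plain option dictionaries, sketched below. The function names are hypothetical, and the nesting is an assumption inferred from the dotted names enhance.agentic and formatting.table_output_format. Placing the prompt inside the same entry as the scope is also an assumption, so check the API reference for where the prompt field actually belongs.

```python
def agentic_table_options(prompt=None):
    """Option 1: enable agentic mode scoped to tables, with an optional prompt."""
    entry = {"scope": "table"}
    if prompt:
        # Assumed placement: the prompt sits alongside the scope in the same entry.
        entry["prompt"] = prompt
    return {"enhance": {"agentic": [entry]}}


def ai_json_table_options():
    """Option 2: request the AI JSON table output format."""
    return {"formatting": {"table_output_format": "ai_json"}}


# Example: guide alignment for a table with merged header cells.
options = agentic_table_options(
    prompt="Treat merged header cells as spanning every column beneath them."
)
```

Keeping each option in its own builder makes it straightforward to try one, compare the output, and switch to the other if the table is still misaligned.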
The main configurations for spreadsheet outputs are: