Handling handwriting or small text
- Use enhance.agentic=[{"scope": "text"}].
- Best for: forms, signatures, handwritten notes.
- Tradeoff: slightly higher cost and latency in exchange for higher accuracy.
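As a minimal sketch, the setting above can be written as a plain Python dictionary. The nested layout (enhance, then agentic) is an assumption inferred from the dotted path; how the options are passed to the parse request depends on the client you use and is not shown here.

```python
import json

# Assumption: the dotted path enhance.agentic maps to nested keys
# in the options object sent with a parse request.
handwriting_options = {
    "enhance": {
        "agentic": [{"scope": "text"}],
    },
}

# Print the options as JSON to inspect what would be sent.
print(json.dumps(handwriting_options, indent=2))
```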
Working with non-Germanic languages
- Use settings.ocr_system="standard" (the default in v3).
- Needed for: languages outside English, Spanish, Italian, Portuguese, French, or German.
- Also improves parsing of Unicode and special characters.
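The rule above can be sketched as a small helper that sets the OCR system explicitly only when a document's language falls outside the six listed ones. The function name, the language labels, and the nested settings layout are illustrative assumptions, not part of the documented API.

```python
# Languages the list above names as not requiring the "standard" OCR system.
COVERED_LANGUAGES = {"english", "spanish", "italian", "portuguese", "french", "german"}


def ocr_options(language):
    """Return parse options for a document written in the given language.

    Hypothetical helper: the nested "settings" layout is assumed from the
    dotted path settings.ocr_system.
    """
    if language.strip().lower() in COVERED_LANGUAGES:
        return {}  # no explicit override needed
    return {"settings": {"ocr_system": "standard"}}


print(ocr_options("Japanese"))
```

Because "standard" is already the default in v3, the explicit override mainly documents intent and guards against a different default elsewhere in your configuration.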
Unexpected symbols in your output
- Cause: metadata embeddings in PDFs may contain corrupted or hidden text.
- Note: In v3, the extraction mode is automatically optimized and no longer configurable. If you continue to see unexpected symbols, please contact support.
Missing checkboxes or images
- Checkboxes: In v3, checkbox detection is automatically enabled and no longer needs configuration.
- Images: To return images for figures or tables, use settings.return_images:
  - settings.return_images=["figure"]: Returns figure images
  - settings.return_images=["table"]: Returns table images
  - settings.return_images=["figure", "table"]: Returns both
Image URLs expire in ~24 hours. Download immediately if you need permanent storage.
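Since the URLs are short-lived, one approach is to copy the images to local storage as soon as a parse result arrives. The sketch below uses only the Python standard library. It assumes you have already collected the image URLs into a list; the response shape that holds them is not covered here, and download_images is a hypothetical helper name.

```python
import shutil
import urllib.parse
import urllib.request
from pathlib import Path

# Options requesting both figure and table images
# (the nested-key layout is an assumption based on the dotted path).
image_options = {"settings": {"return_images": ["figure", "table"]}}


def download_images(urls, dest_dir):
    """Save each URL into dest_dir and return the list of local paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # Use the last path segment as the file name, with a fallback.
        name = Path(urllib.parse.urlparse(url).path).name or "image"
        # Prefix with the index so repeated names do not overwrite each other.
        target = dest / f"{index:03d}_{name}"
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved
```

Calling download_images(urls, "parsed_images") right after parsing keeps a permanent copy regardless of when the original links expire.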
Complex tables causing problems?
- Option 1: Use agentic table mode
  - Enable enhance.agentic with [{"scope": "table"}].
  - Add prompts to guide how rows and columns should align using the prompt field.
- Option 2: AI JSON format
  - Use formatting.table_output_format set to "ai_json".
  - Passes the table image to a model for structural analysis.
  - Tradeoff: higher latency, sometimes higher accuracy.
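The two table options can be sketched as option dictionaries. The nesting is inferred from the dotted paths, the prompt text is a made-up example, and placing prompt next to scope inside the same entry is an assumption.

```python
# Option 1: agentic table mode with a guiding prompt.
agentic_table_options = {
    "enhance": {
        "agentic": [
            {
                "scope": "table",
                # Hypothetical prompt text; assumed to sit beside "scope".
                "prompt": "Treat merged header cells as spanning every column beneath them.",
            }
        ]
    }
}

# Option 2: have a model analyze the table image and return AI JSON.
ai_json_table_options = {
    "formatting": {"table_output_format": "ai_json"},
}

for label, options in [("agentic", agentic_table_options), ("ai_json", ai_json_table_options)]:
    print(label, options)
```

Trying each option on the same problem document and comparing the outputs is a reasonable way to decide whether the added latency is worth it.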
Spreadsheet-related configurations
The main configurations for spreadsheet outputs are:
- spreadsheet.include=["cell_colors"]: adds Excel cell color details with LaTeX to the parse output.
- spreadsheet.clustering: splits up individual tables inside of multi-table spreadsheets (use "accurate", "fast", or "disabled").
- spreadsheet.split_large_tables: splits very large tables into smaller, manageable chunks for downstream processing.
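A small builder can catch typos in the clustering mode before a request is sent. This is a sketch only: spreadsheet_options is a hypothetical helper, the nested layout is assumed from the dotted paths, and spreadsheet.split_large_tables is left out because its value shape is not specified above.

```python
CLUSTERING_MODES = ("accurate", "fast", "disabled")


def spreadsheet_options(clustering="accurate", include_cell_colors=False):
    """Build spreadsheet options, rejecting unknown clustering modes."""
    if clustering not in CLUSTERING_MODES:
        raise ValueError(
            f"clustering must be one of {CLUSTERING_MODES}, got {clustering!r}"
        )
    options = {"clustering": clustering}
    if include_cell_colors:
        # Ask for Excel cell color details in the parse output.
        options["include"] = ["cell_colors"]
    return {"spreadsheet": options}


print(spreadsheet_options("fast", include_cell_colors=True))
```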