Quick Start
What You Get Back
| Field | What it is |
|---|---|
| `chunks[].content` | The extracted content, formatted as Markdown (headers become `#`, tables become Markdown/HTML tables). Ready to pass to an LLM. |
| `chunks[].embed` | The same content, optimized for embeddings. When figure/table summaries are enabled, this field contains natural language descriptions instead of raw table markup. |
| `chunks[].blocks` | The individual elements (paragraphs, tables, figures) with their positions and types. Useful for highlighting or linking back to source. |
| `result.type` | Either `"full"` (content inline) or `"url"` (content at a URL). Large documents return `"url"` to avoid HTTP size limits. |
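
The fields above can be consumed with a small helper. This is a minimal sketch, assuming the inline (`"full"`) result has already been decoded into a Python dict with a `chunks` list; the `type` key on each block and the function name `summarize_chunks` are illustrative assumptions, not confirmed parts of the API.

```python
from collections import defaultdict
from typing import Any, Dict


def summarize_chunks(result: Dict[str, Any]) -> Dict[str, Any]:
    """Split an inline parse result into LLM text, embedding text, and blocks."""
    llm_parts = []
    embed_texts = []
    blocks_by_type = defaultdict(list)

    for chunk in result.get("chunks", []):
        # `content` is the Markdown intended for an LLM prompt.
        llm_parts.append(chunk["content"])
        # `embed` is the embedding-oriented variant; fall back to `content`
        # if it is absent (assumption: the field may be missing).
        embed_texts.append(chunk.get("embed", chunk["content"]))
        for block in chunk.get("blocks", []):
            # Assumption: each block carries its element type under "type".
            blocks_by_type[block.get("type", "unknown")].append(block)

    return {
        "llm_text": "\n\n".join(llm_parts),
        "embed_texts": embed_texts,
        "blocks_by_type": dict(blocks_by_type),
    }
```

Keeping `llm_text` and `embed_texts` separate mirrors the split in the table: one string for prompting, one list for the embedding index.
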
Response Format Details
Full breakdown of chunks, blocks, bounding boxes, and confidence scores.
Input Options
The `input` field accepts four formats:
- Upload response (`reducto://...`): After uploading via `/upload`, use the returned `file_id`. This is the most common method for local files.
- Public URL: Any publicly accessible URL. Reducto fetches the file directly.
- Presigned URL: S3, GCS, or Azure Blob presigned URLs work. Useful when files are in your cloud storage.
- Previous job ID (`jobid://...`): Reprocess a document from a previous parse job without re-uploading. Useful for testing different configurations.
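
The four formats can be told apart by their URL scheme. The sketch below is a hypothetical client-side check (the function `classify_input` is not part of any SDK) that is handy for logging or validating an `input` value before sending it. Public and presigned URLs are both plain HTTP(S) links, so it does not try to distinguish them.

```python
from urllib.parse import urlparse


def classify_input(value: str) -> str:
    """Return which kind of `input` a string looks like, based on its scheme."""
    scheme = urlparse(value).scheme.lower()
    if scheme == "reducto":
        return "upload"          # reference returned by the upload step
    if scheme == "jobid":
        return "previous_job"    # reuse a document from an earlier parse job
    if scheme in ("http", "https"):
        return "url"             # public or presigned URL
    raise ValueError(f"Unrecognized input format: {value!r}")
```
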
Sync vs Async
Parse has both synchronous (/parse) and asynchronous (/parse_async) endpoints. Use async for large documents or when you need webhook delivery.
Sync vs Async Guide
When to use each, how priority works, webhook setup.
Configuration
Parse has several configuration groups. Here are the most commonly changed options:
Chunking
By default, Parse returns the entire document as one chunk. For RAG applications, you want smaller chunks that can be embedded and retrieved independently.
| Mode | Behavior |
|---|---|
| `disabled` | One chunk for the whole document (default) |
| `variable` | Splits at semantic boundaries (sections, tables, figures stay intact). Best for RAG. |
| `page` | One chunk per page |
| `section` | Splits at section headers |
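
A chunking mode from the table above could be validated and placed into a request configuration as follows. This is only a sketch: the nesting `retrieval.chunking.chunk_mode` is an assumption that this page does not confirm, and `chunking_config` is a hypothetical helper. Check the configuration reference for the actual key path.

```python
CHUNK_MODES = ("disabled", "variable", "page", "section")


def chunking_config(mode: str = "disabled") -> dict:
    """Build a config fragment selecting one of the documented chunk modes."""
    if mode not in CHUNK_MODES:
        raise ValueError(f"chunk mode must be one of {CHUNK_MODES}, got {mode!r}")
    # Assumption: the option is nested as retrieval.chunking.chunk_mode.
    return {"retrieval": {"chunking": {"chunk_mode": mode}}}


# For a RAG pipeline, `variable` keeps sections, tables, and figures intact.
rag_config = chunking_config("variable")
```
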
Table Output Format
Controls how tables appear in the output.
| Format | When to use |
|---|---|
| `dynamic` | Auto-selects HTML or Markdown based on table complexity (default) |
| `html` | Complex tables with merged cells, nested headers |
| `md` | Simple tables, Markdown-based workflows |
| `json` | Programmatic processing, need cell-level access |
| `csv` | Export to spreadsheets |
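
The guidance in the table can be written down as a small decision helper. This is an illustrative sketch: `pick_table_format` and its keyword flags are hypothetical, and the priority order among the flags is a design choice made here, not something the API prescribes. The `formatting.table_output_format` key is the one used elsewhere on this page.

```python
def pick_table_format(
    *,
    cell_level_access: bool = False,
    spreadsheet_export: bool = False,
    merged_cells: bool = False,
    markdown_only: bool = False,
) -> str:
    """Map a use case to one of the documented table output formats."""
    if cell_level_access:
        return "json"     # programmatic processing
    if spreadsheet_export:
        return "csv"      # export to spreadsheets
    if merged_cells:
        return "html"     # complex tables, nested headers
    if markdown_only:
        return "md"       # simple tables, Markdown-based workflows
    return "dynamic"      # default: chosen per table by complexity


def table_format_config(fmt: str) -> dict:
    """Wrap the chosen format in the `formatting` configuration group."""
    return {"formatting": {"table_output_format": fmt}}
```

For example, `table_format_config(pick_table_format(merged_cells=True))` yields a fragment that requests HTML tables.
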
Figure Summaries
By default, Parse uses a vision model to generate descriptions for figures and images. This helps with RAG (the `embed` field contains the description) but adds latency.
Agentic Mode
Uses an LLM to review and correct parsing output. This adds latency and uses additional credits. Enable it for the scope that matches the problem:
- `scope: "text"`: Handwritten text, faded scans, documents with unusual fonts, or when you see garbled characters in the output.
- `scope: "table"`: Tables with misaligned columns, merged cells that didn't parse correctly, or numbers that appear in wrong columns.
- `scope: "figure"`: Charts and graphs that need data extraction, including advanced chart extraction with structured data output.
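
The figure-summary and agentic options both live in the `enhance` group, so they can be assembled together. A minimal sketch, assuming `enhance.summarize_figures` is a boolean and that an empty `agentic` list leaves agentic mode off; `enhance_config` is a hypothetical helper, while the `[{"scope": ...}]` shape follows the form used in the Troubleshooting section of this page.

```python
from typing import Any, Dict, Iterable

AGENTIC_SCOPES = ("text", "table", "figure")


def enhance_config(
    scopes: Iterable[str] = (),
    summarize_figures: bool = True,
) -> Dict[str, Any]:
    """Build the `enhance` group: agentic scopes plus the figure-summary flag."""
    scopes = list(scopes)
    unknown = [s for s in scopes if s not in AGENTIC_SCOPES]
    if unknown:
        raise ValueError(f"Unknown agentic scope(s): {unknown}")
    return {
        "enhance": {
            # One entry per scope, e.g. [{"scope": "table"}].
            "agentic": [{"scope": s} for s in scopes],
            # Assumption: a plain boolean; False trades descriptions for speed.
            "summarize_figures": summarize_figures,
        }
    }
```

Calling `enhance_config(["table"], summarize_figures=False)` targets table correction only and skips figure descriptions to limit added latency.
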
Filter Blocks
Remove specific content types from the output. The blocks still appear in `blocks` metadata but are excluded from `content` and `embed`.
Page Range
Process only specific pages.
Return Images
Get image URLs for figures and tables in the document. Accepted values are `["figure"]`, `["table"]`, or `["figure", "table"]`. By default, no images are returned.
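
The accepted values for returned images can be checked before a request is sent. This is a sketch only: placing the option at `settings.return_images` is an assumption this page does not confirm, and `return_images_config` is a hypothetical helper.

```python
ALLOWED_IMAGE_TYPES = ("figure", "table")


def return_images_config(types=()) -> dict:
    """Build a config fragment requesting image URLs for the given block types."""
    requested = list(types)
    bad = [t for t in requested if t not in ALLOWED_IMAGE_TYPES]
    if bad:
        raise ValueError(f"Unsupported image type(s): {bad}")
    # Deduplicate and normalize the order to ["figure", "table"].
    selected = [t for t in ALLOWED_IMAGE_TYPES if t in requested]
    # Assumption: the option lives under settings.return_images.
    return {"settings": {"return_images": selected}}
```
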
Additional Settings
| Setting | Default | Description |
|---|---|---|
| `persist_results` | `false` | Keep results indefinitely instead of expiring after 24 hours |
| `timeout` | `null` | Custom timeout in seconds for processing |
| `force_url_result` | `false` | Always return results as a URL (useful for consistent handling) |
| `embed_pdf_metadata` | `false` | Embed OCR metadata into returned PDF |
For complete configuration reference including OCR settings, spreadsheet options, and more, see the Configuration section.
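
The additional settings listed above can be collected into one fragment that only carries values differing from the documented defaults. A minimal sketch: `settings_config` is a hypothetical helper, and grouping all four keys under `settings` is an assumption extrapolated from `settings.force_url_result`, which is the only path this page spells out.

```python
from typing import Any, Dict

# Defaults as documented in the table above.
SETTING_DEFAULTS: Dict[str, Any] = {
    "persist_results": False,
    "timeout": None,
    "force_url_result": False,
    "embed_pdf_metadata": False,
}


def settings_config(**overrides: Any) -> Dict[str, Any]:
    """Return a `settings` fragment containing only non-default values."""
    unknown = sorted(set(overrides) - set(SETTING_DEFAULTS))
    if unknown:
        raise ValueError(f"Unknown setting(s): {unknown}")
    changed = {
        key: value
        for key, value in overrides.items()
        if value != SETTING_DEFAULTS[key]
    }
    return {"settings": changed}
```

Sending only the changed keys keeps request bodies short and makes it obvious in logs which defaults were overridden.
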
Troubleshooting
Tables look wrong
Try `formatting.table_output_format: "html"`. HTML handles merged cells and complex headers better than Markdown.
Still broken? Enable `enhance.agentic: [{"scope": "table"}]` to use an LLM for alignment fixes.
Response is slow
Main causes:
- `enhance.agentic` adds latency in exchange for higher accuracy
- `enhance.summarize_figures` adds latency for documents with figures
- Large documents take longer; processing time grows linearly with document size
- Set `async_priority` to `True` for faster priority processing
Password-protected PDF
Response is a URL instead of content
Large documents return `result.type: "url"` instead of inline content to avoid HTTP size limits. Fetch the content from the returned URL.
To always get a URL (consistent handling): `settings.force_url_result: true`
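
Handling both result types in one place could look like the sketch below. It assumes the response has been decoded into a dict, that the link is stored under `result["url"]`, and that the URL serves JSON; none of these details are confirmed on this page, and `resolve_result` is a hypothetical helper. The `fetch_text` parameter exists so the download step can be swapped out (for example in tests).

```python
import json
import urllib.request
from typing import Any, Callable, Dict


def _http_get_text(url: str) -> str:
    """Download a URL and decode the body as UTF-8 text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def resolve_result(
    response: Dict[str, Any],
    fetch_text: Callable[[str], str] = _http_get_text,
) -> Dict[str, Any]:
    """Return the parse result, downloading it first when it is behind a URL."""
    result = response["result"]
    if result.get("type") == "url":
        # Assumption: the link is under result["url"] and points to JSON.
        return json.loads(fetch_text(result["url"]))
    # Inline ("full") results are returned unchanged.
    return result
```

With `settings.force_url_result: true`, every response takes the download branch, so downstream code only has to handle one shape.
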