1. Use Variable Chunking for RAG

The default chunking mode (disabled) returns the entire document as one chunk. For RAG applications you need smaller chunks that can be embedded and retrieved independently. Variable chunking splits at semantic boundaries like section headers, tables, and figures, keeping related content together while creating chunks sized for embedding models.
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {"chunk_mode": "variable"},
        "embedding_optimized": True
    }
)

# Use embed field for vector database, content for display
for chunk in result.result.chunks:
    vector_db.insert(
        embedding=embed(chunk.embed),
        metadata={"content": chunk.content}
    )
The embed field contains table and figure summaries as natural language, which embeds better than raw Markdown tables. The content field preserves the original formatting for display.
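As a self-contained illustration of that split, the sketch below builds vector-store records from chunk-like objects. The `build_records` helper, the stand-in chunks, the toy embedding function, and the fallback to `content` when `embed` is empty are all assumptions made for this example, not part of the Reducto SDK.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, List, Sequence


def build_records(chunks: Iterable, embed_fn: Callable[[str], Sequence[float]]) -> List[dict]:
    """Pair each chunk's embedding text with its display text."""
    records = []
    for index, chunk in enumerate(chunks):
        # Assumption: fall back to content if a chunk has no embed text.
        text_for_embedding = getattr(chunk, "embed", None) or chunk.content
        records.append({
            "id": index,
            "embedding": list(embed_fn(text_for_embedding)),
            "metadata": {"content": chunk.content},
        })
    return records


# Stand-in chunks that mimic the embed/content fields described above.
sample_chunks = [
    SimpleNamespace(embed="Table summarizing quarterly revenue", content="| Q1 | Q2 |"),
    SimpleNamespace(embed="", content="Plain paragraph text"),
]

# Toy embedding function for demonstration only: a one-dimensional length feature.
records = build_records(sample_chunks, lambda text: [float(len(text))])
```

In a real pipeline, `embed_fn` would be your embedding model's client call and the records would be written to your vector database instead of kept in a list.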

2. Enable Agentic Mode Only When Needed

Agentic mode uses an LLM to review and correct parsing output. It adds latency and uses additional credits, so only enable it when needed.
When to enable scope: "text":
  • Handwritten documents or signatures
  • Faded or low-quality scans
  • Documents with unusual fonts
  • When you see garbled characters in output
When to enable scope: "table":
  • Tables with misaligned columns after parsing
  • Merged cells that didn’t parse correctly
  • Numbers appearing in wrong columns
  • Financial documents where accuracy is critical
When to enable scope: "figure":
  • Charts and graphs that need data extraction
  • Advanced chart extraction with structured data output
  • Diagrams requiring detailed descriptions
  • Visual elements where you need numeric data from bar charts, line graphs, or pie charts
result = client.parse.run(
    input=upload.file_id,
    enhance={
        "agentic": [
            {"scope": "text"},
            {"scope": "table"},
            {"scope": "figure", "advanced_chart_agent": True}
        ]
    }
)
Clean digital PDFs (native text, not scanned) parse correctly without agentic mode. Test your document types without it first, then enable selectively.
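One way to keep agentic mode selective is to derive the `enhance` value from the problems you actually observed in a test parse. The sketch below is only an illustration: the symptom labels and the `build_enhance_config` helper are hypothetical, while the scope names and the `advanced_chart_agent` flag are taken from the example above.

```python
from typing import Dict, Iterable, List

# Hypothetical symptom labels mapped to the scopes described above.
SYMPTOM_TO_SCOPE: Dict[str, str] = {
    "handwriting": "text",
    "low_quality_scan": "text",
    "garbled_characters": "text",
    "misaligned_columns": "table",
    "merged_cells": "table",
    "chart_data": "figure",
}


def build_enhance_config(symptoms: Iterable[str], advanced_charts: bool = False) -> dict:
    """Return an enhance dict that enables only the scopes the symptoms call for."""
    scopes: List[str] = []
    for symptom in symptoms:
        if symptom not in SYMPTOM_TO_SCOPE:
            raise ValueError(f"unknown symptom: {symptom}")
        scope = SYMPTOM_TO_SCOPE[symptom]
        if scope not in scopes:  # keep each scope once, in first-seen order
            scopes.append(scope)

    if not scopes:
        return {}  # nothing to correct: leave agentic mode off

    agentic = []
    for scope in scopes:
        entry = {"scope": scope}
        if scope == "figure" and advanced_charts:
            entry["advanced_chart_agent"] = True
        agentic.append(entry)
    return {"agentic": agentic}


table_only = build_enhance_config(["merged_cells", "misaligned_columns"])
charts = build_enhance_config(["chart_data"], advanced_charts=True)
```

When the helper returns an empty dict, omit the `enhance` argument from the parse call so the request runs without agentic mode.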

3. Set Priority for Async Requests

Parse has sync (/parse) and async (/parse_async) endpoints. Async requests without priority: true enter a queue and may experience delays during high traffic. If you’re using async for latency-sensitive requests (user-facing features, real-time processing), always set priority.
job = client.parse.run_job(
    input=upload.file_id,
    async_config={"priority": True}
)
Use async with priority or sync for documents that need speed. Use async without priority for batch processing where latency doesn’t matter.
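The rule of thumb above can be written as a small decision helper. This is a sketch under assumptions: `plan_request` is a hypothetical function, and it only returns the method name (`run` or `run_job`, as used in the examples above) together with the extra keyword arguments to pass.

```python
from typing import Any, Dict, Tuple


def plan_request(latency_sensitive: bool, prefer_async: bool) -> Tuple[str, Dict[str, Any]]:
    """Return (method_name, extra_kwargs) for a parse call."""
    if not prefer_async:
        # Sync calls return the result directly; no queue is involved.
        return "run", {}
    if latency_sensitive:
        # Async with priority, for user-facing or real-time work.
        return "run_job", {"async_config": {"priority": True}}
    # Async without priority, for batch work where delays are acceptable.
    return "run_job", {}


interactive = plan_request(latency_sensitive=True, prefer_async=True)
batch = plan_request(latency_sensitive=False, prefer_async=True)

# Possible usage with an existing client (not executed here):
# method, extra = interactive
# job = getattr(client.parse, method)(input=upload.file_id, **extra)
```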

4. Use HTML for Complex Tables

The default table format (dynamic) auto-selects HTML or Markdown based on complexity. For documents with complex tables (merged cells, nested headers, multi-row cells), explicitly request HTML.
result = client.parse.run(
    input=upload.file_id,
    formatting={
        "table_output_format": "html"
    }
)
Markdown tables can’t represent merged cells or complex structures. If your tables look broken, switching to HTML usually fixes it. For programmatic access to cell data, use json format instead.
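The choice between formats can be captured in a few lines. The helper below is a hypothetical sketch: it returns the format name to request, or `None` to keep the default. The `html` value comes from the example above, and `json` is assumed to be passed the same way.

```python
from typing import Optional


def choose_table_format(has_complex_tables: bool, needs_cell_access: bool) -> Optional[str]:
    """Pick a table_output_format value, or None to keep the default."""
    if needs_cell_access:
        return "json"  # programmatic access to cell data
    if has_complex_tables:
        return "html"  # merged cells, nested headers, multi-row cells
    return None  # let the default (dynamic) decide


fmt = choose_table_format(has_complex_tables=True, needs_cell_access=False)

# Build the formatting argument only when a format was chosen.
formatting = {"table_output_format": fmt} if fmt else None
```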

5. Filter Headers and Footers for RAG

Page headers, footers, and page numbers add noise to RAG retrieval. When a user asks about invoice totals, you don’t want to retrieve chunks containing “Page 1 of 5” or “Confidential - Do Not Distribute”.
result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "filter_blocks": ["Header", "Footer", "Page Number"]
    }
)
The filtered blocks still appear in the chunks[].blocks metadata (so you can access them if needed), but they’re excluded from the content and embed fields.
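If you do need the filtered blocks later, for example to show a document's confidentiality notice, you can pull them back out of the block metadata. The sketch below uses stand-in objects and assumes each block exposes `type` and `content` attributes; check the actual response schema before relying on those names.

```python
from types import SimpleNamespace
from typing import Iterable, List

FILTERED_TYPES = {"Header", "Footer", "Page Number"}


def filtered_blocks(chunks: Iterable, types=FILTERED_TYPES) -> List:
    """Collect blocks whose type was filtered out of content and embed."""
    found = []
    for chunk in chunks:
        for block in chunk.blocks:
            # Assumption: blocks carry a `type` attribute matching filter_blocks values.
            if block.type in types:
                found.append(block)
    return found


# Stand-in chunk that mimics a parsed invoice page.
sample_chunks = [
    SimpleNamespace(
        content="Invoice total: $500",
        blocks=[
            SimpleNamespace(type="Header", content="Confidential - Do Not Distribute"),
            SimpleNamespace(type="Text", content="Invoice total: $500"),
            SimpleNamespace(type="Footer", content="Page 1 of 5"),
        ],
    )
]

noise = filtered_blocks(sample_chunks)
```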

Common Pitfalls

Parsing is slow. Agentic mode adds latency because it runs an LLM pass over the output. Disable it and enable it only for document types that actually need correction. For async calls, make sure priority: true is set.
Chunks are too large. Check your chunking mode. With disabled (the default), the entire document is returned as one chunk. Switch to variable for semantic chunking.
Tables look broken. Switch the table format from dynamic to html. Markdown can’t represent merged cells.
Async jobs are delayed. Check that priority: true is set. Without it, jobs enter a queue.
Parsing fails on a PDF. The file may be malformed or use unsupported encryption. Try opening it in a PDF viewer to verify it’s valid. If the file opens but Reducto fails, the PDF may use non-standard formatting. Re-save it using a tool like Adobe Acrobat or a PDF printer, then retry.
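Before uploading, a quick local check can catch files that are obviously not valid PDFs. The sketch below is a heuristic only, and `check_pdf` is a hypothetical helper: it looks for the `%PDF-` header, the `%%EOF` marker, and an `/Encrypt` entry in the raw bytes. It does not fully validate the file, and a byte search for `/Encrypt` can produce false positives or miss compressed trailers.

```python
from pathlib import Path
from typing import Dict, Union


def check_pdf(path: Union[str, Path]) -> Dict[str, bool]:
    """Run cheap byte-level sanity checks on a PDF file."""
    data = Path(path).read_bytes()
    return {
        # The header is expected near the start of the file.
        "has_header": b"%PDF-" in data[:1024],
        # The end-of-file marker is expected near the end.
        "has_eof_marker": b"%%EOF" in data[-2048:],
        # An /Encrypt entry suggests the file is encrypted.
        "mentions_encryption": b"/Encrypt" in data,
    }


def looks_parseable(path: Union[str, Path]) -> bool:
    """True if the file passes the basic checks and shows no sign of encryption."""
    report = check_pdf(path)
    return report["has_header"] and report["has_eof_marker"] and not report["mentions_encryption"]
```

If `looks_parseable` returns False, open the file in a PDF viewer and re-save it as described above before retrying.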

Configuration Reference

For complete details on all options mentioned above, see the dedicated configuration pages: