Parse Best Practices

1. Use Variable Chunking for RAG

The default chunking mode (disabled) returns the entire document as one chunk. For RAG applications you need smaller chunks that can be embedded and retrieved independently. Variable chunking splits at semantic boundaries like section headers, tables, and figures, keeping related content together while creating chunks sized for embedding models.

result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "chunking": {"chunk_mode": "variable"},
        "embedding_optimized": True
    }
)

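# Embed chunk.embed for the vector database; keep chunk.content for display.
# embed() and vector_db are placeholders for your own embedding function and store.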
for chunk in result.result.chunks:
    vector_db.insert(
        embedding=embed(chunk.embed),
        metadata={"content": chunk.content}
    )

The embed field contains table and figure summaries as natural language, which embeds better than raw Markdown tables. The content field preserves the original formatting for display.
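
To make the split between the two fields concrete, here is a minimal, self-contained sketch of an index that builds vectors from embed and returns content for display. Everything in it is hypothetical: the Chunk dataclass stands in for a parsed chunk, toy_embed is a bag-of-words stand-in for a real embedding model, and build_index and search are illustrative helpers, not part of the Reducto client.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for a parsed chunk; only the two fields discussed above."""
    content: str  # original formatting, for display
    embed: str    # embedding-optimized text, for vectors


def toy_embed(text: str) -> Counter:
    """Hypothetical embedding: a bag of lowercase words (not a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    # The vector comes from chunk.embed; chunk.content rides along as metadata.
    return [(toy_embed(c.embed), {"content": c.content}) for c in chunks]


def search(index, query: str) -> str:
    """Return the display content of the best-matching chunk."""
    q = toy_embed(query)
    best = max(index, key=lambda item: cosine(q, item[0]))
    return best[1]["content"]


# Made-up chunks: the first one holds a Markdown table plus its summary.
chunks = [
    Chunk(content="| Item | Total |\n|---|---|\n| Widgets | 120 |",
          embed="Table summarizing the invoice total for widgets: 120"),
    Chunk(content="## Shipping\nOrders ship in 3 days.",
          embed="Shipping section: orders ship in 3 days"),
]
index = build_index(chunks)
hit = search(index, "invoice total")
```

The query matches the natural-language summary, while the caller gets back the original Markdown table to render.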

2. Enable Agentic Mode Only When Needed

Agentic mode uses an LLM to review and correct parsing output. It adds latency and uses additional credits, so enable it only when needed.

When to enable scope: "text":

Handwritten documents or signatures
Faded or low-quality scans
Documents with unusual fonts
When you see garbled characters in output

When to enable scope: "table":

Tables with misaligned columns after parsing
Merged cells that didn’t parse correctly
Numbers appearing in wrong columns
Financial documents where accuracy is critical

When to enable scope: "figure":

Charts and graphs that need data extraction
Advanced chart extraction with structured data output
Diagrams requiring detailed descriptions
Visual elements where you need numeric data from bar charts, line graphs, or pie charts

result = client.parse.run(
    input=upload.file_id,
    enhance={
        "agentic": [
            {"scope": "text"},
            {"scope": "table"},
            {"scope": "figure", "advanced_chart_agent": True}
        ]
    }
)

Clean digital PDFs (native text, not scanned) parse correctly without agentic mode. Test your document types without it first, then enable selectively.
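
One way to enable scopes selectively is to build the enhance value from the problems you actually observed in a test run. The helper below is a hypothetical sketch: build_enhance_config and its flag names are illustrative, and only the dict keys (agentic, scope, advanced_chart_agent) come from the example above.

```python
def build_enhance_config(garbled_text=False, broken_tables=False, chart_data=False):
    """Return an `enhance` dict with only the agentic scopes that are needed.

    Hypothetical helper: the flag names are illustrative; the dict keys
    mirror the example above. Returns None when no scope is needed.
    """
    scopes = []
    if garbled_text:
        scopes.append({"scope": "text"})
    if broken_tables:
        scopes.append({"scope": "table"})
    if chart_data:
        scopes.append({"scope": "figure", "advanced_chart_agent": True})
    # No scopes selected: signal that agentic mode should stay off.
    return {"agentic": scopes} if scopes else None


# Only tables looked wrong in the test run, so enable just that scope.
enhance = build_enhance_config(broken_tables=True)
extra = {"enhance": enhance} if enhance else {}
# result = client.parse.run(input=upload.file_id, **extra)
```

Omitting the enhance argument entirely when nothing is selected avoids relying on how an empty value would be interpreted.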

3. Set Priority for Async Requests

Parse has sync (/parse) and async (/parse_async) endpoints. Async requests without priority: true enter a queue and may experience delays during high traffic. If you’re using async for latency-sensitive requests (user-facing features, real-time processing), always set priority.

job = client.parse.run_job(
    input=upload.file_id,
    async_config={"priority": True}
)

Use async with priority or sync for documents that need speed. Use async without priority for batch processing where latency doesn’t matter.
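
The rule above can be written down as a small decision helper. This is a sketch under assumptions: plan_parse_call is a hypothetical function, and it only returns the method name and keyword arguments already shown in the examples (run, run_job, async_config), leaving the actual call to you.

```python
def plan_parse_call(latency_sensitive: bool, use_async: bool):
    """Pick the client method name and extra kwargs for a parse request.

    Hypothetical helper; method names and `async_config` follow the
    examples above.
    """
    if not use_async:
        # Sync requests are not queued behind async traffic.
        return "run", {}
    if latency_sensitive:
        return "run_job", {"async_config": {"priority": True}}
    # Batch work: plain async without priority, accepting queueing delays.
    return "run_job", {}


method, extra = plan_parse_call(latency_sensitive=True, use_async=True)
# job = getattr(client.parse, method)(input=upload.file_id, **extra)
```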

4. Use HTML for Complex Tables

The default table format (dynamic) auto-selects HTML or Markdown based on complexity. For documents with complex tables (merged cells, nested headers, multi-row cells), explicitly request HTML.

result = client.parse.run(
    input=upload.file_id,
    formatting={
        "table_output_format": "html"
    }
)

Markdown tables can’t represent merged cells or complex structures. If your tables look broken, switching to HTML usually fixes it. For programmatic access to cell data, use json format instead.
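
To see what HTML carries that Markdown cannot, the sketch below reads a merged header cell from an HTML table using only the Python standard library. The table string is made up for illustration, TableGrid is a hypothetical helper, and it handles colspan only (rowspan is left out to keep it short).

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect table rows as lists of cells, expanding colspan into repeats."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._span = 1     # colspan of the cell currently being read
        self._text = None  # text buffer; None while outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._span = int(dict(attrs).get("colspan") or 1)
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            cell = "".join(self._text).strip()
            # Repeat the value across every column the cell spans.
            self.rows[-1].extend([cell] * self._span)
            self._text = None


# Made-up table: the "Q1" header spans two columns.
html_table = (
    "<table>"
    "<tr><th colspan='2'>Q1</th><th>Q2</th></tr>"
    "<tr><td>Jan</td><td>Feb</td><td>Apr</td></tr>"
    "</table>"
)
parser = TableGrid()
parser.feed(html_table)
grid = parser.rows
```

The merged header ends up aligned with both of its columns, which is the information a Markdown table has no syntax to express.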

5. Filter Headers and Footers for RAG

Page headers, footers, and page numbers add noise to RAG retrieval. When a user asks about invoice totals, you don’t want to retrieve chunks containing “Page 1 of 5” or “Confidential - Do Not Distribute”.

result = client.parse.run(
    input=upload.file_id,
    retrieval={
        "filter_blocks": ["Header", "Footer", "Page Number"]
    }
)

The filtered blocks still appear in chunks[].blocks metadata (so you can access them if needed), but they’re excluded from content and embed fields.
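
If you still need the filtered text (for example, to show a confidentiality notice next to a result), it can be collected from the block metadata. The sketch below runs on made-up data and assumes a shape: each chunk is a dict with a "blocks" list, and each block has "type" and "content" keys. Those key names are assumptions for illustration, so check the Response Format page for the actual fields.

```python
FILTERED_TYPES = {"Header", "Footer", "Page Number"}


def filtered_blocks(chunks):
    """Yield (chunk_index, block) for blocks whose type was filtered out.

    Assumes each chunk is a dict with a "blocks" list and each block has
    "type" and "content" keys; this shape is illustrative, not confirmed.
    """
    for i, chunk in enumerate(chunks):
        for block in chunk.get("blocks", []):
            if block.get("type") in FILTERED_TYPES:
                yield i, block


# Made-up data in the assumed shape.
sample_chunks = [
    {
        "content": "Invoice total: $1,200",
        "blocks": [
            {"type": "Header", "content": "Confidential - Do Not Distribute"},
            {"type": "Text", "content": "Invoice total: $1,200"},
            {"type": "Page Number", "content": "Page 1 of 5"},
        ],
    }
]
found = [(i, b["type"]) for i, b in filtered_blocks(sample_chunks)]
```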

Common Pitfalls

Requests taking too long

Agentic mode adds latency because it runs an LLM pass over the output. Disable it by default and enable it only for document types that actually need correction. For async calls, make sure priority: true is set.

RAG retrieval quality is poor

Check your chunking. If chunk_mode is disabled (the default), the entire document is returned as one chunk. Switch to variable for semantic chunking.

Tables render incorrectly

Switch from dynamic to html format. Markdown can’t handle merged cells.

Async jobs are slow even for small docs

The job was probably submitted without priority: true. Without it, jobs enter a queue and may be delayed during high traffic.

Document fails with 'corrupted' or 'invalid' error

The PDF file may be malformed or use unsupported encryption. Try opening the file in a PDF viewer to verify it’s valid. If the file opens but Reducto fails, the PDF may use non-standard formatting. Re-save it using a tool like Adobe Acrobat or a PDF printer, then retry.
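
Before uploading, a quick local check can catch the most obvious cases. The function below is a heuristic sketch using only the standard library: quick_pdf_check is a hypothetical helper that looks for the %PDF- header, the %%EOF marker, and an /Encrypt entry in the raw bytes. It does not validate the file, and a file that passes can still fail to parse.

```python
def quick_pdf_check(data: bytes) -> list:
    """Return a list of likely problems found in raw PDF bytes (heuristic)."""
    problems = []
    # A PDF normally begins with a "%PDF-" header near the start of the file.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header")
    # The file should end with an %%EOF marker; truncated files often lack it.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker (file may be truncated)")
    # An /Encrypt entry in the trailer indicates an encrypted document.
    if b"/Encrypt" in data:
        problems.append("file is encrypted")
    return problems


# Made-up byte strings standing in for file contents.
ok_bytes = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
truncated_bytes = b"%PDF-1.7\n1 0 obj\n<< >>\n"
ok_report = quick_pdf_check(ok_bytes)
truncated_report = quick_pdf_check(truncated_bytes)
```

For a real file, pass the result of reading it in binary mode, for example Path("invoice.pdf").read_bytes() with a path of your own.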

Configuration Reference

For complete details on all options mentioned above, see the dedicated configuration pages:

Chunking Methods

All chunking modes and their use cases.

Agentic Mode

When and how to use LLM-assisted parsing.

Table Formats

HTML, Markdown, JSON, CSV options.

Configuration Overview

Full reference of all configuration options.

Related

Parse Overview

Quick start and basic usage.

Response Format

Understanding chunks, blocks, and bounding boxes.
