Parsing Overview

Parse extracts all of the content in a document and handles post-processing steps like chunking. Check out the response format structure here.

Example Use Cases

Parsing PDFs and scans into chunked context for retrieval-augmented generation (RAG), improving LLM accuracy and relevance
Building custom LLM apps such as chatbots, summarizers, and internal copilots

Getting Started

There are four main steps for getting started:

1. Choose your document upload method

You can upload directly to Reducto using the /upload endpoint. The quickstart shows how to use the endpoint effectively.
Alternatively, you can pass a URL that gives direct access to the file in the document_url field. This is typically a pre-signed URL for a storage bucket in your cloud (e.g. S3).

2. Choose your configuration options

The parse endpoint has a few different configuration options so you can optimize performance for your use case. The default configurations will work well, but customers most frequently update:

Chunk mode: We return a list of chunks in the JSON response for use cases like RAG. You can learn about each chunking mode here
OCR mode: Control how OCR is performed on tables in your documents. Learn more about OCR options here
Table summaries: Table summaries create a natural language version of tables for the embed field to improve retrieval scores. This is enabled by default and is really helpful for RAG, but can increase latency.
Figure summaries: Figure summaries use a large vision model to process figures, including summarization for images and data extraction for graphs. By default this is disabled due to latency, but you should enable this if figures in your documents include valuable info.
Custom prompting: We expose custom prompting for both table and figure summaries to change behavior. Custom prompting is uncommon, but you may find that it can improve performance for your specific use case.
Ignore blocks: You can choose to filter out specific block types (like page numbers, footers, etc.) from chunks using this parameter. By default, no blocks are filtered, but you can specify which block types to remove. The blocks will still be available in chunk metadata.
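To make the ignore-blocks behavior concrete, here is a minimal local sketch of the idea: filtered block types are left out of the chunk text but kept in the chunk metadata. The block shape (`type`, `content`), the block type names, and the `render_chunk` helper are hypothetical illustrations, not Reducto's actual response format or parameter names.

```python
from typing import Dict, Iterable, List


def render_chunk(
    blocks: List[Dict[str, str]],
    ignore_types: Iterable[str] = (),
) -> Dict[str, object]:
    """Build a chunk whose text skips ignored block types.

    Hypothetical shapes: each block is {"type": ..., "content": ...}.
    All blocks, including ignored ones, stay available under "blocks".
    """
    ignored = set(ignore_types)
    kept = [b["content"] for b in blocks if b["type"] not in ignored]
    return {"content": "\n".join(kept), "blocks": blocks}


# Hypothetical block type names, for illustration only.
example_blocks = [
    {"type": "Text", "content": "Quarterly revenue grew."},
    {"type": "Footer", "content": "Confidential"},
    {"type": "Page Number", "content": "3"},
]

default_chunk = render_chunk(example_blocks)  # nothing filtered by default
filtered_chunk = render_chunk(example_blocks, ["Footer", "Page Number"])
```

With no filter, every block contributes to the chunk text, which mirrors the default described above. With the filter, only the body text remains, while the footer and page number are still reachable through the chunk's block list.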

Explore all the customizable options at our Playground or Configurations section.

3. Making your request

Next, you should decide between synchronous and asynchronous processing. Both options offer comparable latency, but async is generally recommended for longer documents to prevent connection timeouts. If you choose to process a document with async, you can choose between polling the /job endpoint or setting up a webhook to be notified when your job is complete.

4. Receiving the API response

For sync requests the API will return the processed outputs, and for async requests the API will return a job id immediately. You can poll for the processed outputs with the job id using the /job endpoint. Due to HTTP size limits, longer documents are returned with a URL to download the output JSON, while standard documents are returned directly. You can choose to always receive a URL response with the force_url_result config option.
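The async flow described above (receive a job id, poll until the job finishes, then handle either a direct result or a download URL) could be sketched roughly as follows. This is a sketch under stated assumptions, not the official client: the status strings (`"Completed"`, `"Failed"`), the result fields (`type`, `url`), and the `fetch_status` and `download` callables are all hypothetical. In practice `fetch_status` would wrap a call to the /job endpoint, and the real field names should be taken from the response format reference.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], Dict[str, Any]],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a job until it reaches a terminal state or the timeout passes.

    fetch_status is a hypothetical callable that queries the /job
    endpoint for job_id and returns the decoded JSON body.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        status = job.get("status")  # assumed field name
        if status == "Completed":   # assumed status value
            return job
        if status == "Failed":      # assumed status value
            raise RuntimeError(f"job {job_id} failed: {job!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout}s")
        sleep(poll_interval)


def resolve_result(
    result: Dict[str, Any],
    download: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return the output JSON, downloading it if only a URL was returned.

    Assumes a hypothetical "type" field that is "url" for large outputs.
    """
    if result.get("type") == "url":
        return download(result["url"])
    return result
```

Passing the fetch and download steps in as callables keeps the polling logic independent of any particular HTTP library and easy to exercise with fake responses. A webhook removes the need for the polling loop entirely.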

Parsing FAQ

How do I parse password-protected PDFs?

To parse a password-protected PDF, include the password in the document_password field of advanced_options in your request. The API will use this password to decrypt the document before processing. If the password is incorrect or not provided for a password-protected document, you’ll receive an error response.

{
    "document_url": "https://example.com/protected.pdf",
    "advanced_options": {
        "document_password": "your-pdf-password"
    }
}
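If you assemble request bodies in code, the same payload can be built with a small helper. This is only a sketch: `build_parse_request` is a hypothetical helper name, and it covers just the `document_url` and `advanced_options.document_password` fields shown above. It does not send the request.

```python
import json
from typing import Any, Dict, Optional


def build_parse_request(
    document_url: str,
    document_password: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a parse request body, adding the password only when given."""
    body: Dict[str, Any] = {"document_url": document_url}
    if document_password is not None:
        # Field names taken from the JSON example above.
        body["advanced_options"] = {"document_password": document_password}
    return body


payload = build_parse_request(
    "https://example.com/protected.pdf",
    document_password="your-pdf-password",
)
print(json.dumps(payload, indent=4))
```

Leaving `advanced_options` out when no password is supplied keeps requests for unprotected documents identical to the default body.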

Can I just get the whole document returned in markdown format (mdx)?
