Example use cases
- Parsing PDFs and scans into chunked context for retrieval-augmented generation (RAG), improving LLM accuracy and relevance
- Building custom LLM apps such as chatbots, summarizers, and internal copilots
Getting started
There are four main steps for getting started:
1. Choose your document upload method
- You can upload directly to Reducto using the /upload endpoint. The quickstart shows how to use the endpoint effectively.
- You can pass a URL that gives direct access to the file in the document_url field. This is typically a pre-signed URL for a storage bucket in your cloud (e.g. S3).
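As a rough illustration of the URL option, the sketch below builds a request body that carries the file location in document_url. The field name comes from this guide. The helper name build_parse_request, the top-level placement of the field, and the placeholder URL are assumptions, and the endpoint and authentication are deliberately left out.

```python
import json


def build_parse_request(document_url: str) -> dict:
    """Return a request body that points the parser at a file by URL."""
    if not document_url.startswith(("http://", "https://")):
        raise ValueError("document_url must be an http(s) URL")
    # `document_url` is the field named in this guide; placing it at the
    # top level of the body is an assumption.
    return {"document_url": document_url}


if __name__ == "__main__":
    # Placeholder only: a real value would be a pre-signed URL from your bucket.
    body = build_parse_request("https://example-bucket.s3.amazonaws.com/report.pdf")
    print(json.dumps(body, indent=2))
```

Keeping the body construction in a small function makes it easy to validate the URL before any request is sent.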
- Chunk mode: We return a list of chunks in the JSON response for use cases like RAG. You can learn about each chunking mode here.
- OCR mode: Control how OCR is performed on tables in your documents. Learn more about OCR options here.
- Table summaries: Table summaries create a natural language version of tables for the embed field to improve retrieval scores. This is enabled by default and is really helpful for RAG, but can increase latency.
- Figure summaries: Figure summaries use a large vision model to process figures, including summarization for images and data extraction for graphs. By default this is disabled due to latency, but you should enable this if figures in your documents include valuable info.
- Custom prompting: We expose custom prompting for both table and figure summaries to change behavior. Custom prompting is uncommon, but you may find that it can improve performance for your specific use case.
- Ignore blocks: You can choose to filter out specific block types (like page numbers, footers, etc.) from chunks using this parameter. By default, no blocks are filtered, but you can specify which block types to remove. The blocks will still be available in chunk metadata.
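To make the ignore-blocks behavior concrete, here is a small local illustration of the idea: ignored block types are left out of the chunk content but stay available in the chunk metadata. This is not the API itself. The render_chunk helper, the block dictionaries, and the type names "Header", "Text", and "Page Number" are all hypothetical and only mimic the behavior described above.

```python
def render_chunk(blocks: list[dict], ignore_types: set[str]) -> dict:
    """Build chunk content from blocks, skipping the ignored block types."""
    kept = [b["content"] for b in blocks if b["type"] not in ignore_types]
    return {
        "content": "\n".join(kept),
        # Metadata keeps every block, including the ignored ones.
        "blocks": blocks,
    }


# Hypothetical blocks for one page; the shape is an assumption.
blocks = [
    {"type": "Header", "content": "Quarterly Report"},
    {"type": "Text", "content": "Revenue grew in Q3."},
    {"type": "Page Number", "content": "7"},
]

chunk = render_chunk(blocks, ignore_types={"Page Number"})
print(chunk["content"])
```

With an empty ignore_types set, which corresponds to the default of filtering nothing, every block ends up in the content.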
Parse FAQ
How do I parse password-protected PDFs?
To parse a password-protected PDF, include the password in the document_password field of advanced_options in your request. The API will use this password to decrypt the document before processing. If the password is incorrect or not provided for a password-protected document, you’ll receive an error response.
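A minimal sketch of such a request body is shown below. The document_password field inside advanced_options comes from the answer above. The helper name build_password_request and the top-level placement of document_url and advanced_options are assumptions, and sending the request is not shown.

```python
import json


def build_password_request(document_url: str, password: str) -> dict:
    """Return a request body for a password-protected document."""
    if not password:
        raise ValueError("a non-empty password is required")
    return {
        # Top-level placement of these keys is an assumption.
        "document_url": document_url,
        "advanced_options": {"document_password": password},
    }


if __name__ == "__main__":
    # Placeholder URL and password for illustration only.
    body = build_password_request("https://example.com/protected.pdf", "s3cret")
    print(json.dumps(body, indent=2))
```

In practice you would read the password from a secret store or environment variable rather than hard-coding it.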
Can I just get the whole document returned in markdown format (mdx)?
If you only need a Markdown representation of a document, you can disable chunking altogether and use parse_response['result']['chunks'][0]['content'].
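The lookup above can be wrapped in a small helper, sketched here. The access path is the one given in the answer. The helper name document_markdown and the hand-written sample_response are assumptions used only so the snippet runs without calling the API.

```python
def document_markdown(parse_response: dict) -> str:
    """Return the content of the first chunk of a parse response."""
    chunks = parse_response["result"]["chunks"]
    if not chunks:
        raise ValueError("parse response contains no chunks")
    # With chunking disabled, the first chunk is expected to hold the
    # whole document.
    return chunks[0]["content"]


# Hand-written stand-in for a real response; only the keys used above
# are included.
sample_response = {
    "result": {"chunks": [{"content": "# Title\n\nBody text."}]}
}

print(document_markdown(sample_response))
```

Failing loudly on an empty chunk list avoids an opaque IndexError if a document produces no content.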