Parsing Overview
Understanding Reducto’s Parse endpoint
Parse extracts all of the content in a document and handles post-processing steps such as chunking. Check out the response format structure here.
Example Use Cases
- Parsing PDFs and scans into chunked context for Retrieval-Augmented Generation (RAG), improving LLM accuracy and relevance
- Building custom LLM apps such as chatbots, summarizers, internal copilots, etc.
Getting Started
There are four main steps for getting started:
1. Choose your document upload method
- You can upload directly to Reducto using the /upload endpoint. The quickstart shows how to use the endpoint effectively.
- You can pass a URL that gives direct access to the file in the document_url field. This is typically a pre-signed URL for a storage bucket in your cloud (e.g. S3).
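As a rough illustration of the second option, the sketch below builds a request body that carries the file location in the document_url field. Only the document_url field name comes from this page; the helper name build_parse_body and the URL check are assumptions made for the example, and the real request may need additional fields.

```python
from urllib.parse import urlparse


def build_parse_body(document_url: str) -> dict:
    """Return a JSON-serializable body that points Parse at a document.

    Hypothetical helper: only the `document_url` field name is taken
    from the documentation above.
    """
    parsed = urlparse(document_url)
    # Reject relative paths early; the service needs direct access to the file.
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"not an absolute URL: {document_url!r}")
    return {"document_url": document_url}
```

Validating the URL on the client side is a design choice for the sketch, not a documented requirement; it simply surfaces an obviously bad value before a request is sent.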
2. Choose your configuration options
The parse endpoint has a few different configuration options so you can optimize performance for your use case. The default configurations will work well, but customers most frequently update:
- Chunk mode: We return a list of chunks in the JSON response for use cases like RAG. You can learn about each chunking mode here.
- OCR mode: Control how OCR is performed on tables in your documents. Learn more about OCR options here.
- Table summaries: Table summaries create a natural language version of tables for the embed field to improve retrieval scores. This is enabled by default and is really helpful for RAG, but can increase latency.
- Figure summaries: Figure summaries use a large vision model to process figures, including summarization for images and data extraction for graphs. By default this is disabled due to latency, but you should enable this if figures in your documents include valuable info.
- Custom prompting: We expose custom prompting for both table and figure summaries to change behavior. Custom prompting is uncommon, but you may find that it can improve performance for your specific use case.
- Ignore blocks: You can choose to filter out specific block types (like page numbers, footers, etc.) from chunks using this parameter. By default, no blocks are filtered, but you can specify which block types to remove. The blocks will still be available in chunk metadata.
Explore all the customizable options at our Playground or Configurations section.
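One way to keep these options organized in client code is a small settings object that mirrors the defaults described above (table summaries on, figure summaries off, no blocks ignored). This is a minimal sketch: every key name below is a placeholder chosen for illustration, not the API's actual schema, so check the Configurations section for the real field names before sending anything.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ParseOptions:
    """Hypothetical container; all field names are illustrative placeholders."""

    chunk_mode: Optional[str] = None             # None = keep the API default
    ocr_mode: Optional[str] = None               # None = keep the API default
    table_summary: bool = True                   # enabled by default per the docs
    figure_summary: bool = False                 # disabled by default per the docs
    table_summary_prompt: Optional[str] = None   # custom prompting is optional
    figure_summary_prompt: Optional[str] = None
    ignore_blocks: List[str] = field(default_factory=list)  # nothing filtered by default

    def to_dict(self) -> dict:
        """Drop unset values so the service's own defaults apply."""
        return {k: v for k, v in asdict(self).items() if v is not None and v != []}
```

For example, `ParseOptions(figure_summary=True, ignore_blocks=["Footer"]).to_dict()` keeps only the explicitly set values plus the two boolean defaults; the block type name "Footer" is likewise an assumption.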
3. Make your request
Next, you should decide between synchronous and asynchronous processing. Both options offer comparable latency, but async is generally recommended for longer documents to avoid connection timeouts.
If you choose to process a document with async, you can choose between polling the /job endpoint or setting up a webhook to be notified when your job is complete.
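The polling option can be sketched as a loop that checks the job until it finishes or a deadline passes. The HTTP call to the /job endpoint is injected as fetch_job so the sketch makes no claim about the request format; the status strings "Completed" and "Failed" and the "status" key are assumptions, not confirmed values.

```python
import time
from typing import Callable


def poll_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_job(job_id) until the job completes, fails, or times out.

    fetch_job is a caller-supplied function that queries the /job endpoint
    and returns the decoded JSON. Status names here are hypothetical.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "Completed":
            return job
        if status == "Failed":
            raise RuntimeError(f"job {job_id} failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")
        sleep(interval_s)  # wait before asking again
```

A webhook avoids this loop entirely, so polling is mainly useful for scripts and environments that cannot receive inbound requests.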
4. Receive the API response
For sync requests the API will return the processed outputs, and for async requests the API will return a job id immediately. You can poll for the processed outputs with the job id using the /job endpoint.
Due to HTTP size limits, longer documents are returned with a URL to download the output JSON while standard documents are returned directly. You can choose to always receive a URL response with the force_url_result config option.
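Client code therefore needs to handle both shapes. A minimal sketch, assuming the result carries a "type" marker with a "url" key when the output is delivered by link (both key names are assumptions for illustration), might look like this:

```python
import json
from typing import Callable
from urllib.request import urlopen


def _http_get(url: str) -> bytes:
    """Download the raw bytes behind a URL using the standard library."""
    with urlopen(url) as resp:
        return resp.read()


def resolve_result(result: dict, fetch: Callable[[str], bytes] = _http_get) -> dict:
    """Return the parsed output whether it arrived inline or as a URL.

    The "type" and "url" keys are hypothetical names for this sketch.
    """
    if result.get("type") == "url":
        # Large outputs: follow the link and decode the JSON stored there.
        return json.loads(fetch(result["url"]))
    # Standard outputs are already inline.
    return result
```

Setting force_url_result would make the URL branch the only path taken, which can simplify downstream code at the cost of one extra download per document.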