Getting Started: Reducto Parse
Parse extracts all of the content in a document and handles post-processing steps like chunking. There are four main steps for getting started:
1. Choose your document upload method
- You can upload directly to Reducto using the /upload endpoint. This recipe shows how to use the endpoint effectively.
- You can simply send us a URL with direct access to the file in the document_url field. This is typically a pre-signed URL using a storage bucket in your cloud (e.g. S3).
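Either upload method ends with a single document reference being sent to Parse. A minimal sketch of building that request body is shown here. Only the document_url field name comes from this guide; the helper name and the example URL are illustrative, and the assumption that a reference returned by /upload can be passed in the same field should be checked against the API reference.

```python
import json


def build_parse_body(document_ref: str) -> dict:
    """Build a JSON body that points Parse at one document.

    `document_ref` is a URL with direct access to the file, typically a
    pre-signed URL. Passing a reference returned by /upload in the same
    field is an assumption, not something this guide confirms.
    """
    if not document_ref:
        raise ValueError("document_ref must be a non-empty string")
    return {"document_url": document_ref}


# Illustrative pre-signed URL; the bucket and signature are made up.
body = build_parse_body(
    "https://my-bucket.s3.amazonaws.com/report.pdf?X-Amz-Signature=abc"
)
print(json.dumps(body))
```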
2. Choose your configuration options
The parse endpoint has a few different configuration options so you can optimize performance for your use case. The default configurations will work well, but customers most frequently update:
- Chunk mode: We return a list of chunks in the JSON response for use cases like RAG. Each chunking mode is described in the chunking documentation.
- Table summaries: Table summaries create a natural language version of tables for the embed field to improve retrieval scores. This is enabled by default and is really helpful for RAG, but can increase latency.
- Figure summaries: Figure summaries use a large vision model to process figures, including summarization for images and data extraction for graphs. By default this is disabled due to latency, but you should enable this if figures in your documents include valuable info.
- Custom prompting: We expose custom prompting for both table and figure summaries to change behavior. Custom prompting is uncommon, but you may find that it can improve performance for your specific use case.
- Ignore blocks: Content such as page numbers or footers can introduce noise in your extracted data, so we remove these blocks from chunks by default. You can choose the exact set of blocks that are filtered using this parameter. The filtered blocks will still be available in chunk metadata.
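One way to keep these choices manageable is to send only the options you want to override and leave everything else at the defaults described above. The sketch below assumes hypothetical key names (chunk_mode, table_summary, figure_summary, table_prompt, figure_prompt, ignore_blocks) and hypothetical values; the real parameter names and where they sit in the request are defined in the API reference.

```python
from typing import List, Optional


def build_options(
    chunk_mode: Optional[str] = None,
    table_summary: Optional[bool] = None,
    figure_summary: Optional[bool] = None,
    table_prompt: Optional[str] = None,
    figure_prompt: Optional[str] = None,
    ignore_blocks: Optional[List[str]] = None,
) -> dict:
    """Return only the options that were explicitly set.

    Every key name here is a placeholder for illustration. Leaving an
    argument as None means "use the API default".
    """
    candidates = {
        "chunk_mode": chunk_mode,
        "table_summary": table_summary,
        "figure_summary": figure_summary,
        "table_prompt": table_prompt,
        "figure_prompt": figure_prompt,
        "ignore_blocks": ignore_blocks,
    }
    return {key: value for key, value in candidates.items() if value is not None}


# Example: figures carry valuable data, so accept the extra latency.
options = build_options(figure_summary=True)
print(options)
```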
3. Making your request
Next, decide between synchronous and asynchronous processing. Both options offer comparable latency, but async is generally recommended for longer documents to avoid connection timeouts.
If you process a document asynchronously, you can either poll the /job endpoint or set up a webhook to be notified when your job is complete.
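If you poll rather than use a webhook, the loop can be sketched roughly as follows. The function takes a caller-supplied fetch_status callable (for example, a wrapper around a GET to the /job endpoint) so the loop itself makes no assumptions about URLs or authentication. The "status" key and the "Completed" and "Failed" values are assumptions for illustration, not confirmed response fields.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job finishes, fails, or the timeout elapses.

    `fetch_status` should return the parsed JSON for one job lookup.
    The status strings checked below are hypothetical.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "Completed":
            return job
        if status == "Failed":
            raise RuntimeError(f"job failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)
```

Injecting sleep and fetch_status keeps the loop easy to test without network access, and a fixed interval can be swapped for backoff if your documents are long.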
4. Receiving the API response
For sync requests, the API returns the processed outputs directly. For async requests, the API returns a job ID immediately, and you can use that ID to poll the /job endpoint for the processed outputs.
Due to HTTP size limits, longer documents are returned with a URL to download the output JSON while standard documents are returned directly. You can choose to always receive a URL response with the force_url_result config option.
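Client code therefore needs to handle both shapes. The following sketch assumes a result object that marks URL responses with a "type" of "url" and carries the link in a "url" key; both key names are assumptions for illustration, so adjust them to match the actual response.

```python
import json
from typing import Callable
from urllib.request import urlopen


def _download_json(url: str) -> dict:
    """Fetch and decode the output JSON from a download URL."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def resolve_result(
    result: dict,
    download: Callable[[str], dict] = _download_json,
) -> dict:
    """Return the parsed output whether it arrived inline or as a URL.

    The "type" and "url" keys are hypothetical field names.
    """
    if result.get("type") == "url":
        return download(result["url"])
    return result
```

If you enable force_url_result, every response takes the download branch, which can simplify client code at the cost of one extra request.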