Handling Large Chunks
Parsing Large Files with Reducto API
This guide explains how to parse large files with the Reducto API. It covers the asynchronous API, webhooks, and URL responses.
Introduction
The Reducto API provides powerful document processing capabilities, including the ability to handle large files. When dealing with substantial documents, it's recommended to use the asynchronous API to avoid timeout issues and to process files efficiently.
Using the Asynchronous API
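To parse large files, you should use the asynchronous API by setting the async parameter in your request. The Python example below submits a document to the /parse_async endpoint, which returns a job ID, and then polls the /job/{job_id} endpoint until the job finishes: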
import time

import requests

API_BASE_URL = "https://platform.reducto.ai"
API_KEY = "your_api_key_here"


def parse_large_file(document_url):
    """Submit a document for asynchronous parsing and return the job ID."""
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "document_url": document_url,
        "options": {
            "table_summary": {"enabled": True},
            "figure_summary": {"enabled": True}
        }
    }
    response = requests.post(f"{API_BASE_URL}/parse_async", headers=headers, json=payload)
    response.raise_for_status()
    job_id = response.json()["job_id"]
    return job_id


def check_job_status(job_id):
    """Fetch the current status (and result, once available) of a job."""
    headers = {
        "Authorization": f"Bearer {API_KEY}"
    }
    response = requests.get(f"{API_BASE_URL}/job/{job_id}", headers=headers)
    response.raise_for_status()
    return response.json()


# Usage
document_url = "https://example.com/large-document.pdf"
job_id = parse_large_file(document_url)
print(f"Job ID: {job_id}")

# Poll for job status
while True:
    status = check_job_status(job_id)
    if status["status"] in ["Completed", "Failed"]:
        print(f"Job {status['status']}")
        if status["status"] == "Completed":
            result = status["result"]
            # Process the result
        break
    time.sleep(10)  # Wait for 10 seconds before checking again
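The loop above polls forever if a job never reaches a terminal status. A minimal sketch of a bounded alternative follows, assuming the same "Completed" and "Failed" status strings as the example and treating every other status as still running. The helper name wait_for_job and its parameters (timeout, initial_delay, max_delay) are hypothetical and not part of the Reducto API; the status lookup is passed in as a callable so the helper makes no assumptions about the HTTP client.

```python
import time
from typing import Any, Callable, Dict

# Terminal status strings, taken from the polling example above.
TERMINAL_STATUSES = {"Completed", "Failed"}


def wait_for_job(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout: float = 600.0,
    initial_delay: float = 2.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll until the job reaches a terminal status or the timeout expires.

    The delay between polls doubles after each attempt, capped at max_delay.
    Raises TimeoutError if no terminal status is seen before the deadline.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        if status.get("status") in TERMINAL_STATUSES:
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"job did not finish within {timeout} seconds")
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay)
```

With the functions defined earlier, this could be called as wait_for_job(lambda: check_job_status(job_id)). The sleep and clock parameters exist only so the helper can be exercised in tests without real waiting.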
Working with Webhooks
Instead of polling, you can receive a webhook notification when a job finishes. Here's an example of a simple Flask server that can handle webhook notifications:
from flask import Flask, request, jsonify

app = Flask(__name__)


@app.route('/callback', methods=['POST'])
def webhook_callback():
    data = request.json
    job_id = data['job_id']
    status = data['status']
    if status == 'Completed':
        # Process the completed job
        process_completed_job(job_id)
    elif status == 'Failed':
        # Handle the failed job
        handle_failed_job(job_id)
    return jsonify({"status": "received"}), 200


def process_completed_job(job_id):
    # Retrieve the job result and process it
    # You may need to call the /job/{job_id} endpoint to get the full result
    pass


def handle_failed_job(job_id):
    # Handle the failed job, e.g., log the error, notify the user, etc.
    pass


if __name__ == '__main__':
    app.run(port=5000)
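The handler above raises an error if the request body is not JSON or lacks a job_id or status field. One way to make it more defensive is to move the validation and routing into a plain function that can be tested without running a server. This is only a sketch: dispatch_webhook is a hypothetical helper, and the payload shape (job_id and status keys) is assumed from the example above rather than from a confirmed webhook schema.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def dispatch_webhook(
    payload: Optional[Dict[str, Any]],
    on_completed: Callable[[str], None],
    on_failed: Callable[[str], None],
) -> Tuple[Dict[str, str], int]:
    """Validate a webhook payload and route it to the matching callback.

    Returns a (response body, HTTP status code) pair for the web framework.
    """
    if not isinstance(payload, dict):
        return {"error": "expected a JSON object"}, 400
    job_id = payload.get("job_id")
    status = payload.get("status")
    if not job_id or not status:
        return {"error": "missing job_id or status"}, 400
    if status == "Completed":
        on_completed(job_id)
    elif status == "Failed":
        on_failed(job_id)
    # Any other status is acknowledged without further action.
    return {"status": "received"}, 200
```

Inside the Flask route, this could be used as body, code = dispatch_webhook(request.get_json(silent=True), process_completed_job, handle_failed_job), followed by return jsonify(body), code. With silent=True, Flask returns None for a missing or malformed JSON body instead of raising, so the helper answers with a 400.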
Handling URL Responses
For very large results, the API may return a URL instead of the full result. This URL points to a JSON array of chunks. Here's how to handle URL responses:
import requests


def process_url_result(result):
    if result["type"] == "url":
        # Fetch the content from the URL
        response = requests.get(result["url"])
        response.raise_for_status()
        chunks = response.json()
        # Process the chunks
        for chunk in chunks:
            process_chunk(chunk)
    else:
        # Handle full result
        for chunk in result["chunks"]:
            process_chunk(chunk)


def process_chunk(chunk):
    # Process individual chunk
    content = chunk["content"]
    embed = chunk["embed"]
    enriched = chunk["enriched"]
    blocks = chunk["blocks"]
    # Your processing logic here
    pass


# Usage
job_result = check_job_status(job_id)["result"]
process_url_result(job_result)
When dealing with URL responses, remember:
- The URL in the JSON response points to a JSON array of chunks.
- There's no additional metadata in the URL response, just the array of chunks.
- Each chunk contains the same structure as in the full result.
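Because the chunks have the same structure in both cases, the two branches can be folded into a single loader that always returns a list of chunks. The sketch below assumes the result shape used in the example above (a type field equal to "url" with a url field, or otherwise a chunks field). The names load_chunks and fetch_json are hypothetical, and the default fetcher uses the standard library's urllib in place of requests so the block has no third-party dependencies.

```python
import json
from typing import Any, Callable, Dict, List
from urllib.request import urlopen


def fetch_json(url: str) -> Any:
    """Download and decode a JSON document using only the standard library."""
    with urlopen(url, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def load_chunks(
    result: Dict[str, Any],
    fetch: Callable[[str], Any] = fetch_json,
) -> List[Dict[str, Any]]:
    """Return the list of chunks, whether inline or behind a URL."""
    if result.get("type") == "url":
        # The URL points to a bare JSON array of chunks, with no wrapper object.
        chunks = fetch(result["url"])
    else:
        chunks = result["chunks"]
    if not isinstance(chunks, list):
        raise ValueError("expected a JSON array of chunks")
    return chunks
```

Downstream code can then iterate over load_chunks(job_result) and call process_chunk on each item without caring which form the API returned. The fetch parameter lets a different HTTP client, or a stub in tests, be swapped in.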
By following this guide, you should be able to parse large files with the Reducto API using the asynchronous API and webhooks, and to handle URL responses when necessary.