This guide will walk you through the process of parsing large files using the Reducto API. We’ll cover using the asynchronous API, working with webhooks, and handling URL responses.
To parse large files, use the asynchronous API by sending your request to the /parse_async endpoint. Here's an example of how to do this using Python:
import requests
import json
import time

API_BASE_URL = "https://platform.reducto.ai"
API_KEY = "your_api_key_here"

def parse_large_file(document_url):
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    payload = {
        "document_url": document_url,
        "options": {
            "table_summary": {"enabled": True},
            "figure_summary": {"enabled": True}
        }
    }
    response = requests.post(f"{API_BASE_URL}/parse_async", headers=headers, json=payload)
    response.raise_for_status()
    job_id = response.json()["job_id"]
    return job_id

def check_job_status(job_id):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(f"{API_BASE_URL}/job/{job_id}", headers=headers)
    response.raise_for_status()
    return response.json()

# Usage
document_url = "https://example.com/large-document.pdf"
job_id = parse_large_file(document_url)
print(f"Job ID: {job_id}")

# Poll for job status
while True:
    status = check_job_status(job_id)
    if status["status"] in ["Completed", "Failed"]:
        print(f"Job {status['status']}")
        if status["status"] == "Completed":
            result = status["result"]
            # Process the result
        break
    time.sleep(10)  # Wait for 10 seconds before checking again
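A bare while True loop works, but in practice you will usually want an upper bound on how long you wait. The sketch below wraps check_job_status from the example above in a helper with a timeout; the default interval and timeout values, and the choice to raise TimeoutError, are illustrative assumptions rather than part of the Reducto API:

import time

def wait_for_job(job_id, poll_interval=10, timeout=1800):
    # Poll until the job completes or fails, or until `timeout` seconds have passed
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = check_job_status(job_id)  # defined in the example above
        if status["status"] == "Completed":
            return status["result"]
        if status["status"] == "Failed":
            raise RuntimeError(f"Job {job_id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")

# Usage
# result = wait_for_job(job_id)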
Instead of polling, you can have Reducto notify your application when a job finishes by configuring a webhook:

import requests

response = requests.post(
    "https://platform.reducto.ai/configure_webhook",
    headers={"Authorization": "Bearer <api key>"},
)

# Go to this link to configure your webhook receiving application
print(response.text)
Here’s an example of a simple Flask server that can handle webhook notifications:
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/callback', methods=['POST'])
def webhook_callback():
    data = request.json
    job_id = data['job_id']
    status = data['status']

    if status == 'Completed':
        # Process the completed job
        process_completed_job(job_id)
    elif status == 'Failed':
        # Handle the failed job
        handle_failed_job(job_id)

    return jsonify({"status": "received"}), 200

def process_completed_job(job_id):
    # Retrieve the job result and process it
    # You may need to call the /job/{job_id} endpoint to get the full result
    pass

def handle_failed_job(job_id):
    # Handle the failed job, e.g., log the error, notify the user, etc.
    pass

if __name__ == '__main__':
    app.run(port=5000)
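The process_completed_job stub above notes that you may need to call the /job/{job_id} endpoint to get the full result. One way to fill it in, assuming the check_job_status helper from the parsing example is importable into the Flask app and using the process_url_result handler covered in the next section, is sketched below:

def process_completed_job(job_id):
    # Fetch the full job payload; check_job_status comes from the parsing example above
    job = check_job_status(job_id)
    result = job["result"]

    # Hand the result off to the URL-aware handler described in the next section
    process_url_result(result)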
For very large results, the API may return a URL instead of the full result. This URL points to a JSON array of chunks. Here’s how to handle URL responses:
import requests

def process_url_result(result):
    if result["type"] == "url":
        # Fetch the content from the URL
        response = requests.get(result["url"])
        response.raise_for_status()
        chunks = response.json()

        # Process the chunks
        for chunk in chunks:
            process_chunk(chunk)
    else:
        # Handle full result
        for chunk in result["chunks"]:
            process_chunk(chunk)

def process_chunk(chunk):
    # Process individual chunk
    content = chunk["content"]
    embed = chunk["embed"]
    enriched = chunk["enriched"]
    blocks = chunk["blocks"]
    # Your processing logic here
    pass

# Usage
job_result = check_job_status(job_id)["result"]
process_url_result(job_result)
When dealing with URL responses, remember:
The URL in the JSON response points to a JSON array of chunks.
There’s no additional metadata in the URL response, just the array of chunks.
Each chunk contains the same structure as in the full result.
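If the array of chunks is very large, you may prefer to stream the download to a local file rather than buffer the whole HTTP response in memory; this also leaves you a copy you can re-process without fetching the URL again. The following is only a sketch: the file path and the 1 MB download chunk size are arbitrary choices, not API requirements.

import json
import requests

def download_result(url, path="reducto_result.json"):
    # Stream the response body to disk instead of buffering it all in memory
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(path, "wb") as f:
            for block in response.iter_content(chunk_size=1024 * 1024):
                f.write(block)

    # Parse the saved JSON array of chunks
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Usage
# chunks = download_result(result["url"])
# for chunk in chunks:
#     process_chunk(chunk)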
By following this guide, you should be able to parse large files effectively with the Reducto API, using the asynchronous API, webhooks, and URL response handling where necessary.