This guide will walk you through the process of parsing large files using the Reducto API. We’ll cover using the asynchronous API, working with webhooks, and handling URL responses.
To parse large files, use the asynchronous API by submitting your request to the `/parse_async` endpoint. Here's an example of how to do this in Python:
```python
import requests
import time

API_BASE_URL = "https://platform.reducto.ai"
API_KEY = "your_api_key_here"

def parse_large_file(document_url):
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "document_url": document_url,
        "options": {
            "table_summary": {"enabled": True},
            "figure_summary": {"enabled": True},
        },
    }
    response = requests.post(f"{API_BASE_URL}/parse_async", headers=headers, json=payload)
    response.raise_for_status()
    return response.json()["job_id"]

def check_job_status(job_id):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(f"{API_BASE_URL}/job/{job_id}", headers=headers)
    response.raise_for_status()
    return response.json()

# Usage
document_url = "https://example.com/large-document.pdf"
job_id = parse_large_file(document_url)
print(f"Job ID: {job_id}")

# Poll for job status
while True:
    status = check_job_status(job_id)
    if status["status"] in ["Completed", "Failed"]:
        print(f"Job {status['status']}")
        if status["status"] == "Completed":
            result = status["result"]
            # Process the result
        break
    time.sleep(10)  # Wait 10 seconds before checking again
```
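The fixed 10-second polling interval above works for a quick script, but for long-running jobs you may prefer a timeout and exponential backoff. Here's a minimal sketch that reuses the `check_job_status` helper defined above; the delay and timeout values are arbitrary choices, not API requirements:

```python
import time

def wait_for_job(job_id, timeout=3600, initial_delay=5, max_delay=60):
    """Poll check_job_status with exponential backoff until the job finishes."""
    delay = initial_delay
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check_job_status(job_id)
        if status["status"] in ("Completed", "Failed"):
            return status
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # Double the wait, capped at max_delay
    raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")

# Usage
status = wait_for_job(job_id)
if status["status"] == "Completed":
    result = status["result"]
```

Backing off keeps the request count low when a large document takes several minutes to parse, while still picking up a fast completion quickly.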
Instead of polling, you can have Reducto notify your application when a job finishes. First, request your webhook configuration link:

```python
import requests

response = requests.post(
    "https://platform.reducto.ai/configure_webhook",
    headers={"Authorization": "Bearer <api key>"},
)

# Go to this link to configure your webhook receiving application
print(response.text)
```
Here’s an example of a simple Flask server that can handle webhook notifications:
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/callback', methods=['POST'])
def webhook_callback():
    data = request.json
    job_id = data['job_id']
    status = data['status']

    if status == 'Completed':
        # Process the completed job
        process_completed_job(job_id)
    elif status == 'Failed':
        # Handle the failed job
        handle_failed_job(job_id)

    return jsonify({"status": "received"}), 200

def process_completed_job(job_id):
    # Retrieve the job result and process it
    # You may need to call the /job/{job_id} endpoint to get the full result
    pass

def handle_failed_job(job_id):
    # Handle the failed job, e.g., log the error, notify the user, etc.
    pass

if __name__ == '__main__':
    app.run(port=5000)
```
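The `process_completed_job` stub above still needs to fetch the finished result. One way to fill it in, assuming the same `/job/{job_id}` endpoint and API key used earlier (a sketch, not the only approach):

```python
import requests

API_BASE_URL = "https://platform.reducto.ai"
API_KEY = "your_api_key_here"

def process_completed_job(job_id):
    # Fetch the full result from the /job/{job_id} endpoint
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(f"{API_BASE_URL}/job/{job_id}", headers=headers)
    response.raise_for_status()
    result = response.json()["result"]
    # Hand the result to your downstream pipeline, e.g. the
    # process_url_result helper shown in the next section
    print(f"Job {job_id} completed with result type: {result.get('type')}")
```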
For very large results, the API may return a URL instead of the full result. This URL points to a JSON array of chunks. Here’s how to handle URL responses:
```python
import requests

def process_url_result(result):
    if result["type"] == "url":
        # Fetch the content from the URL
        response = requests.get(result["url"])
        response.raise_for_status()
        chunks = response.json()
        # Process the chunks
        for chunk in chunks:
            process_chunk(chunk)
    else:
        # Handle the full result
        for chunk in result["chunks"]:
            process_chunk(chunk)

def process_chunk(chunk):
    # Process an individual chunk
    content = chunk["content"]
    embed = chunk["embed"]
    enriched = chunk["enriched"]
    blocks = chunk["blocks"]
    # Your processing logic here

# Usage
job_result = check_job_status(job_id)["result"]
process_url_result(job_result)
```
When dealing with URL responses, remember:
- The URL in the response points to a JSON array of chunks.
- The URL response contains no additional metadata, only the array of chunks.
- Each chunk has the same structure as a chunk in the full result.
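Because the URL response is a plain JSON array, a very large result can also be stream-parsed rather than loaded into memory all at once. The sketch below uses the third-party `ijson` library for incremental parsing; streaming is an optimization choice on the client side, not something the API requires:

```python
import ijson  # pip install ijson
import requests

def iter_chunks_from_url(url):
    """Yield chunks one at a time without buffering the whole array."""
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        response.raw.decode_content = True  # Transparently decode gzip/deflate
        # "item" is ijson's prefix for elements of a top-level JSON array
        for chunk in ijson.items(response.raw, "item"):
            yield chunk

# Usage
# for chunk in iter_chunks_from_url(result["url"]):
#     process_chunk(chunk)
```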
By following this guide, you should be able to parse large files effectively with the Reducto API: submit jobs through the asynchronous endpoint, receive completion notifications via webhooks, and handle URL responses when a result is too large to return inline.