
Parsing Large Files with Reducto API

This guide walks you through parsing large files with the Reducto API. We'll cover using the asynchronous API, working with webhooks, and handling URL responses.

Table of Contents

  1. Introduction
  2. Using the Asynchronous API
  3. Working with Webhooks
  4. Handling URL Responses

Introduction

The Reducto API provides powerful document processing capabilities, including the ability to handle large files. When dealing with substantial documents, it's recommended to use the asynchronous API to avoid timeout issues and to process files efficiently.

Using the Asynchronous API

To parse large files, use the asynchronous API by submitting your request to the /parse_async endpoint. Here's an example of how to do this in Python:

import requests
import time

API_BASE_URL = "https://platform.reducto.ai"
API_KEY = "your_api_key_here"

def parse_large_file(document_url):
    """Submit a document for asynchronous parsing and return the job ID."""
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "document_url": document_url,
        "options": {
            "table_summary": {"enabled": True},
            "figure_summary": {"enabled": True}
        }
    }
    
    response = requests.post(f"{API_BASE_URL}/parse_async", headers=headers, json=payload)
    response.raise_for_status()
    
    job_id = response.json()["job_id"]
    return job_id

def check_job_status(job_id):
    """Return the current status (and, once finished, the result) of a job."""
    headers = {
        "Authorization": f"Bearer {API_KEY}"
    }
    
    response = requests.get(f"{API_BASE_URL}/job/{job_id}", headers=headers)
    response.raise_for_status()
    
    return response.json()

# Usage
document_url = "https://example.com/large-document.pdf"
job_id = parse_large_file(document_url)
print(f"Job ID: {job_id}")

# Poll for job status
while True:
    status = check_job_status(job_id)
    if status["status"] in ["Completed", "Failed"]:
        print(f"Job {status['status']}")
        if status["status"] == "Completed":
            result = status["result"]
            # Process the result
        break
    time.sleep(10)  # Wait for 10 seconds before checking again
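
A fixed 10-second poll works, but for long-running jobs you may prefer to back off between polls and bound the total wait. Here's a minimal sketch building on check_job_status() above; the timeout and delay values are illustrative, not API requirements:

def wait_for_job(job_id, timeout=1800, initial_delay=5, max_delay=60):
    """Poll check_job_status() with exponential backoff until the job
    reaches a terminal state or the timeout (in seconds) expires."""
    delay = initial_delay
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check_job_status(job_id)
        if status["status"] in ("Completed", "Failed"):
            return status
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay
    raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")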

Working with Webhooks

Instead of polling, you can have Reducto notify your server when a job finishes. When the job reaches a terminal state, a POST request containing the job ID and final status is sent to your webhook endpoint.

Here's an example of a simple Flask server that can handle webhook notifications:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/callback', methods=['POST'])
def webhook_callback():
    data = request.json
    job_id = data['job_id']
    status = data['status']
    
    if status == 'Completed':
        # Process the completed job
        process_completed_job(job_id)
    elif status == 'Failed':
        # Handle the failed job
        handle_failed_job(job_id)
    
    return jsonify({"status": "received"}), 200

def process_completed_job(job_id):
    # Retrieve the job result and process it
    # You may need to call the /job/{job_id} endpoint to get the full result
    pass

def handle_failed_job(job_id):
    # Handle the failed job, e.g., log the error, notify the user, etc.
    pass

if __name__ == '__main__':
    app.run(port=5000)
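
For webhooks to work, your callback URL must be reachable from the public internet, and Reducto needs to know about it when you submit the job. The sketch below, reusing API_KEY and API_BASE_URL from the first example, passes a webhook URL alongside the async request; the "webhook" field name here is an assumption for illustration, so confirm the exact parameter in the Reducto API reference:

# Sketch: submitting an async parse job with a webhook callback.
# NOTE: the "webhook" field name is an assumed placeholder -- check the
# Reducto API reference for the exact parameter.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}
payload = {
    "document_url": "https://example.com/large-document.pdf",
    "webhook": "https://your-server.example.com/callback",
    "options": {
        "table_summary": {"enabled": True},
        "figure_summary": {"enabled": True}
    }
}
response = requests.post(f"{API_BASE_URL}/parse_async", headers=headers, json=payload)
response.raise_for_status()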

Handling URL Responses

For very large results, the API may return a URL instead of the full result. This URL points to a JSON array of chunks. Here's how to handle URL responses:

import requests

def process_url_result(result):
    if result["type"] == "url":
        # Fetch the content from the URL
        response = requests.get(result["url"])
        response.raise_for_status()
        chunks = response.json()
        
        # Process the chunks
        for chunk in chunks:
            process_chunk(chunk)
    else:
        # Handle full result
        for chunk in result["chunks"]:
            process_chunk(chunk)

def process_chunk(chunk):
    # Each chunk carries several views of the same content.
    content = chunk["content"]    # the chunk's primary text content
    embed = chunk["embed"]        # embedding-oriented variant of the content
    enriched = chunk["enriched"]  # enriched variant of the content
    blocks = chunk["blocks"]      # the individual blocks that make up the chunk
    
    # Your processing logic here
    pass

# Usage
job_result = check_job_status(job_id)["result"]
process_url_result(job_result)

When dealing with URL responses, remember:

  1. The URL in the JSON response points to a JSON array of chunks.
  2. There's no additional metadata in the URL response, just the array of chunks.
  3. Each chunk contains the same structure as in the full result.

By following this guide, you should be able to parse large files effectively with the Reducto API: submit jobs through the asynchronous API, track them with polling or webhooks, and handle URL responses when results are too large to return inline.