This guide will walk you through the process of parsing large files using the Reducto API. We’ll cover using the asynchronous API, working with webhooks, and handling URL responses.
To parse large files, use the asynchronous API by submitting your request to the `/parse_async` endpoint. Here's an example of how to do this in Python:
```python
import requests
import time

API_BASE_URL = "https://platform.reducto.ai"
API_KEY = "your_api_key_here"

def parse_large_file(document_url):
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "document_url": document_url,
        "options": {
            "table_summary": {"enabled": True},
            "figure_summary": {"enabled": True},
        },
    }
    response = requests.post(f"{API_BASE_URL}/parse_async", headers=headers, json=payload)
    response.raise_for_status()
    return response.json()["job_id"]

def check_job_status(job_id):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(f"{API_BASE_URL}/job/{job_id}", headers=headers)
    response.raise_for_status()
    return response.json()

# Usage
document_url = "https://example.com/large-document.pdf"
job_id = parse_large_file(document_url)
print(f"Job ID: {job_id}")

# Poll for job status
while True:
    status = check_job_status(job_id)
    if status["status"] in ["Completed", "Failed"]:
        print(f"Job {status['status']}")
        if status["status"] == "Completed":
            result = status["result"]
            # Process the result
        break
    time.sleep(10)  # Wait 10 seconds before checking again
```
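The fixed 10-second polling interval above works for a quick script, but for long-running jobs you may prefer a timeout and exponential backoff. Here's a minimal sketch that reuses the `check_job_status` helper defined above; the delay and timeout values are arbitrary choices, not API requirements:

```python
import time

def wait_for_job(job_id, timeout=3600, initial_delay=5, max_delay=60):
    """Poll check_job_status with exponential backoff until the job finishes."""
    delay = initial_delay
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = check_job_status(job_id)
        if status["status"] in ("Completed", "Failed"):
            return status
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # Double the wait, capped at max_delay
    raise TimeoutError(f"Job {job_id} did not finish within {timeout} seconds")

# Usage
status = wait_for_job(job_id)
if status["status"] == "Completed":
    result = status["result"]
```

Backing off keeps the request count low when a large document takes several minutes to parse, while still picking up a fast completion quickly.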
Instead of polling, you can have Reducto notify your application when a job finishes. First, request your webhook configuration link:

```python
import requests

response = requests.post(
    "https://platform.reducto.ai/configure_webhook",
    headers={"Authorization": "Bearer <api key>"},
)

# Go to this link to configure your webhook receiving application
print(response.text)
```
Here’s an example of a simple Flask server that can handle webhook notifications:
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/callback', methods=['POST'])
def webhook_callback():
    data = request.json
    job_id = data['job_id']
    status = data['status']

    if status == 'Completed':
        # Process the completed job
        process_completed_job(job_id)
    elif status == 'Failed':
        # Handle the failed job
        handle_failed_job(job_id)

    return jsonify({"status": "received"}), 200

def process_completed_job(job_id):
    # Retrieve the job result and process it
    # You may need to call the /job/{job_id} endpoint to get the full result
    pass

def handle_failed_job(job_id):
    # Handle the failed job, e.g., log the error, notify the user, etc.
    pass

if __name__ == '__main__':
    app.run(port=5000)
```
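The `process_completed_job` stub above still needs to fetch the finished result. One way to fill it in, assuming the same `/job/{job_id}` endpoint and API key used earlier (a sketch, not the only approach):

```python
import requests

API_BASE_URL = "https://platform.reducto.ai"
API_KEY = "your_api_key_here"

def process_completed_job(job_id):
    # Fetch the full result from the /job/{job_id} endpoint
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(f"{API_BASE_URL}/job/{job_id}", headers=headers)
    response.raise_for_status()
    result = response.json()["result"]
    # Hand the result to your downstream pipeline, e.g. the
    # process_url_result helper shown in the next section
    print(f"Job {job_id} completed with result type: {result.get('type')}")
```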
For very large results, the API may return a URL instead of the full result. This URL points to a JSON array of chunks. Here’s how to handle URL responses:
```python
import requests

def process_url_result(result):
    if result["type"] == "url":
        # Fetch the content from the URL
        response = requests.get(result["url"])
        response.raise_for_status()
        chunks = response.json()
        # Process the chunks
        for chunk in chunks:
            process_chunk(chunk)
    else:
        # Handle the full result
        for chunk in result["chunks"]:
            process_chunk(chunk)

def process_chunk(chunk):
    # Process an individual chunk
    content = chunk["content"]
    embed = chunk["embed"]
    enriched = chunk["enriched"]
    blocks = chunk["blocks"]
    # Your processing logic here

# Usage
job_result = check_job_status(job_id)["result"]
process_url_result(job_result)
```
When dealing with URL responses, remember:
- The URL in the response points to a JSON array of chunks.
- The URL response contains no additional metadata, only the array of chunks.
- Each chunk has the same structure as a chunk in the full result.
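Because the URL response is a plain JSON array, a very large result can also be stream-parsed rather than loaded into memory all at once. The sketch below uses the third-party `ijson` library for incremental parsing; streaming is an optimization choice on the client side, not something the API requires:

```python
import ijson  # pip install ijson
import requests

def iter_chunks_from_url(url):
    """Yield chunks one at a time without buffering the whole array."""
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        response.raw.decode_content = True  # Transparently decode gzip/deflate
        # "item" is ijson's prefix for elements of a top-level JSON array
        for chunk in ijson.items(response.raw, "item"):
            yield chunk

# Usage
# for chunk in iter_chunks_from_url(result["url"]):
#     process_chunk(chunk)
```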
By following this guide, you should be able to parse large files effectively with the Reducto API: submit jobs through the asynchronous endpoint, receive completion notifications via webhooks, and handle URL responses when a result is too large to return inline.