Pipelines are workflows built by chaining multiple API calls together. The /extract and /split endpoints can accept a job_id returned by a parse call, so the same document never has to be parsed more than once. This enables workflows that were previously too high-latency or too expensive to run; a minimal chaining sketch follows the list below. Common pipelines look like this:
  • Parse → Split → Extract
  • Parse → Extract
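As a minimal sketch of the chaining mechanism, the snippet below parses a document once and then passes the resulting job_id to both /split and /extract. The document URL, split categories, and title schema are illustrative placeholders, and the snippet assumes /split's split_description takes name/description pairs.

import os

from reducto import Reducto

client = Reducto(api_key=os.environ.get("REDUCTO_API_KEY"))

# Parse once; the returned job_id lets later calls reuse this parse.
parse_response = client.parse.run(document_url="https://example.com/document.pdf")
job_id = parse_response.job_id

# /split reuses the parse output instead of re-parsing the document.
# The category names and descriptions here are placeholders.
split_response = client.split.run(
    document_url=f"jobid://{job_id}",
    split_description=[
        {"name": "invoice", "description": "Pages containing an invoice"},
        {"name": "contract", "description": "Pages containing a contract"},
    ],
)

# /extract can reuse the same parse output as well.
extract_response = client.extract.run(
    document_url=f"jobid://{job_id}",
    schema={"type": "object", "properties": {"title": {"type": "string"}}},
)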
Below is an example using /extract (the same pattern works with /extract_async), where we’d like to extract two different sets of information from the same set of documents.

Problem

Let’s say we have users uploading documents of two types: W2 forms and Passports. We’d like to extract a different set of information from each type of document:
  1. For the Passport: passport number, name, and date of birth.
  2. For the W2: total wages, calendar year, and employer name.
Without pipelining, we would either have to provide one schema containing fields for both document types, or make separate extraction calls for classification and for each document type, each of which would re-parse the same document. That means a lot of extra latency and being charged multiple times for the same underlying work, as the non-pipelined sketch below illustrates.
Pipeline Diagram
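For contrast, here is a hypothetical sketch of the non-pipelined approach: every extract call receives the raw document URL instead of a jobid:// reference, so each call pays for its own parse of the same file. The abbreviated schemas are placeholders.

import os

from reducto import Reducto

client = Reducto(api_key=os.environ.get("REDUCTO_API_KEY"))
document_url = "https://support.adp.com/adp_payroll/content/hybrid/PDF/W2_Interactive.pdf"

# Two calls, two parses of the same document; the pipelined version
# below parses only once and reuses the job_id.
classification = client.extract.run(
    document_url=document_url,
    schema={
        "type": "object",
        "properties": {
            "document_type": {"type": "string", "enum": ["W2", "Passport", "Other"]}
        },
    },
)
w2_fields = client.extract.run(
    document_url=document_url,
    schema={"type": "object", "properties": {"total_wages": {"type": "number"}}},
)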

Example code snippet

The code snippet below walks through the approach.
  1. Parse the document.
  2. Use the parse job_id to run a classification extraction that determines whether the document is a W-2 or a Passport.
  3. Use the same job_id to extract the fields for that document type.
import os

from reducto import Reducto

client = Reducto(api_key=os.environ.get('REDUCTO_API_KEY'), timeout=300)
document_url = "https://support.adp.com/adp_payroll/content/hybrid/PDF/W2_Interactive.pdf"

# Step 1: Parse
parse_response = client.parse.run(document_url=document_url)
job_id = parse_response.job_id

# Step 2: Classify document type
classification = client.extract.run(
    document_url=f"jobid://{job_id}",
    schema={
        "type": "object",
        "properties": {
            "document_type": {"type": "string", "enum": ["W2", "Passport", "Other"]}
        },
        "required": ["document_type"],
    },
)
print(classification.result)

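# The extract result is returned as a list; the document type is in the first entry.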
document_type = classification.result[0]["document_type"]

# Step 3: Choose schema based on classification
if document_type == "W2":
    schema = {
        "type": "object",
        "properties": {
            "total_wages": {"type": "number"},
            "calendar_year": {"type": "integer"},
            "employer_name": {"type": "string"}
        },
        "required": ["total_wages", "calendar_year", "employer_name"]
    }
elif document_type == "Passport":
    schema = {
        "type": "object",
        "properties": {
            "passport_number": {"type": "string"},
            "name": {"type": "string"},
            "date_of_birth": {"type": "string"},
        },
        "required": ["passport_number", "name", "date_of_birth"],
    }
else:
    raise ValueError(f"Unsupported document type: {document_type}")

# Step 4: Extract structured fields
extract_response = client.extract.run(
    document_url=f"jobid://{job_id}",
    schema=schema,
)

print(extract_response.result)
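Because users keep uploading new documents, the same three steps can be wrapped in a small helper and run once per upload. The sketch below reuses the client from above and assumes the W2 and Passport schemas defined earlier are collected into a dict keyed by document type; the function name and the schemas dict are hypothetical.

# Hypothetical helper wrapping the parse -> classify -> extract pipeline.
# `schemas` maps a document type to its extraction schema, e.g.
# {"W2": <W2 schema above>, "Passport": <Passport schema above>}.
def extract_document_fields(document_url: str, schemas: dict) -> dict:
    # Parse once and reuse the job_id for both extract calls.
    job_id = client.parse.run(document_url=document_url).job_id

    # Classify using the already-parsed document.
    classification = client.extract.run(
        document_url=f"jobid://{job_id}",
        schema={
            "type": "object",
            "properties": {
                "document_type": {"type": "string", "enum": list(schemas) + ["Other"]}
            },
            "required": ["document_type"],
        },
    )
    document_type = classification.result[0]["document_type"]

    if document_type not in schemas:
        raise ValueError(f"Unsupported document type: {document_type}")

    # Extract the type-specific fields from the same parse.
    extract_response = client.extract.run(
        document_url=f"jobid://{job_id}",
        schema=schemas[document_type],
    )
    return extract_response.result[0]

With this wrapper, each uploaded document costs one parse plus two extract calls, regardless of its type.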