Extraction Pipelining

The /extract and /extract_async endpoints can be given a reference to a completed parse job. This allows repeated extractions from the same document, or parsing and extracting from a document as separate steps. It can enable workflows that were previously too slow or too expensive to manage.

Let's take one example.

Problem

Let's say we have users uploading documents of two types: W2 forms and passports. We'd like to extract a different set of information from each type of document:

  1. For the passport: passport number, name, and date of birth (DOB).
  2. For the W2: total wages, calendar year, and employer name.
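
The two field sets above could be written as separate schemas, one per document type. The sketch below assumes the extraction endpoint accepts a JSON Schema style object; the property names (passport_number, total_wages, and so on) and the types chosen for them are illustrative, not names required by the API.

```python
# Illustrative schemas only: the JSON Schema layout and every property
# name below are assumptions made for this example.
PASSPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "passport_number": {"type": "string"},
        "name": {"type": "string"},
        "date_of_birth": {"type": "string"},
    },
    "required": ["passport_number", "name", "date_of_birth"],
}

W2_SCHEMA = {
    "type": "object",
    "properties": {
        "total_wages": {"type": "number"},
        "calendar_year": {"type": "integer"},
        "employer_name": {"type": "string"},
    },
    "required": ["total_wages", "calendar_year", "employer_name"],
}
```

Keeping the schemas separate means each extraction call only asks for the fields that make sense for one document type.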

Without pipelining, we would have to either provide a single schema with fields for both types of documents, or make three separate extraction calls. That would mean a lot of extra latency and being charged 6x for the same jobs.


Pipeline Approach

The code snippet above walks through the approach. First, we parse the document. Then, we pass the returned job ID each time we make an extraction call. This way, the initial parsing work does not need to be repeated on each call, and we save 50% on each extract call.
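
As a rough illustration of the parse-once, extract-many flow, here is a minimal sketch. It is not the official client: the /parse path, the payload keys (document_url, parse_job_id, schema), the job_id response key, and the FakeApi class are all assumptions made so the example can run without network access. Only the /extract endpoint name comes from this page.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: (endpoint path, JSON payload) -> JSON response.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def parse_document(post: Post, document_url: str) -> str:
    """Parse the document once and return the parse job id."""
    response = post("/parse", {"document_url": document_url})  # path is assumed
    return response["job_id"]  # response key is assumed


def extract_with_job(post: Post, job_id: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Run one extraction against an already-completed parse job."""
    # Payload keys are assumed; the point is that the job id is sent
    # instead of the document itself.
    return post("/extract", {"parse_job_id": job_id, "schema": schema})


def run_pipeline(
    post: Post, document_url: str, schemas: Dict[str, Dict[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """Parse once, then extract once per schema, reusing the same job id."""
    job_id = parse_document(post, document_url)
    return {name: extract_with_job(post, job_id, schema) for name, schema in schemas.items()}


class FakeApi:
    """In-memory stand-in for the real service, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self.calls: Dict[str, int] = {"/parse": 0, "/extract": 0}

    def post(self, endpoint: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        self.calls[endpoint] += 1
        if endpoint == "/parse":
            return {"job_id": "job-123"}
        # Echo back which parse job the extraction reused.
        return {"parse_job_id": payload["parse_job_id"], "result": {}}


api = FakeApi()
results = run_pipeline(
    api.post,
    "https://example.com/upload.pdf",
    # Placeholder schemas; real ones would list the fields for each document type.
    {"passport": {"type": "object"}, "w2": {"type": "object"}},
)
print(api.calls)
```

With the fake transport, the call counter records one parse call and two extract calls, and both extraction results carry the same parse job id. Swapping FakeApi.post for a function that issues real HTTP requests would leave run_pipeline unchanged.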