Overview

Split is an endpoint for long documents whose sections require different treatments or extraction schemas. It uses a natural language prompt to divide a document into meaningful page ranges. It is typically run after parsing and before extraction to make the extraction more targeted and effective.

Example Use Cases

  1. Financial 10-K Report with sections like “Business Overview,” “Financial Data,” and “Corporate Governance”.
  2. Multi-page medical record book with intake packets of several patients.
  3. Employee onboarding packet with sections like policies, benefits, tax forms, and more.

Key Features:

Under the hood, split runs a parsing call followed by LLM calls to determine which pages match the description you provide.

Key Inputs:

  1. split_description

    • name: Name of the category of pages you’re trying to extract
    • description: Details describing the characteristics of the pages you want
    • partition_key (optional): The identifier used to divide a category into separate sections. In the output, the partition_key is filled in with values found in the content and appended after the split_description name
      • Examples: account number, index, name, holding type. If the partition key is account number and the split description name is account holdings, the output mappings will be account holdings 111, account holdings 112, and so on.
  2. split_rules

    • Optional field for changing the default rules
      • Example: Each page can only be classified into a single category
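The inputs above can be assembled into a request body programmatically. The following is a minimal sketch, not an official client: build_split_request is a hypothetical helper, and it assumes the body uses the same document_url, split_description, and split_rules field names that appear in the example configs in this guide.

```python
def build_split_request(document_url, descriptions, split_rules=None):
    """Assemble a split request body from the documented input fields.

    `descriptions` is a list of dicts with "name", "description", and an
    optional "partition_key". This helper is illustrative only.
    """
    body = {"document_url": document_url, "split_description": []}
    for item in descriptions:
        entry = {"name": item["name"], "description": item["description"]}
        # partition_key is optional, so include it only when provided
        if item.get("partition_key"):
            entry["partition_key"] = item["partition_key"]
        body["split_description"].append(entry)
    # split_rules is optional; omit it to keep the default rules
    if split_rules is not None:
        body["split_rules"] = split_rules
    return body


request_body = build_split_request(
    "jobid://{parse_job_id}",
    [
        {
            "name": "account holdings",
            "description": "Pages for each separate account holdings",
            "partition_key": "account number",
        }
    ],
    split_rules="Each page can only be classified into a single category",
)
```

Leaving split_rules unset keeps the field out of the body entirely, so the default rules apply.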

Output:

  1. splits
    • List of objects with name, pages, conf, and partitions fields
      • name: Taken from the name given in split_description
      • pages: Pages for this category
      • conf: Confidence for the categorization, either “low” or “high”
      • partitions: Either a list of objects with name, conf, and pages or null if partition_key is not defined
        • name: The value filled in from the content
        • pages: List of pages for this partition
        • conf: Confidence for the categorization of this partition, either “low” or “high”
{  
  "result": {  
    "splits": [
      {
        "name": "account holdings", // Category name from split_description
        "pages": [1, 2, 3, 7, 8, 9, 10, 11], // All the pages that this category is on (1-indexed)
        "conf": "high", // "low" or "high" confidence
        "partitions": [
          {
            "name": "1111", // partition value that was filled in
            "conf": "high", // "low" or "high" confidence
            "pages": [1, 2, 3] // pages of this specific partition
          },
          {
            "name": "2222",
            "conf": "high",
            "pages": [7, 8, 9, 10, 11]
          }
        ]
      }
    ]
  }  
}
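
To work with this output downstream, it can be convenient to flatten it into one page list per section. The sketch below is one possible approach; flatten_splits is a hypothetical helper, and the "<split name> <partition name>" label is simply a convention chosen here to mirror the naming described for partition_key.

```python
def flatten_splits(response):
    """Map each section label to its list of pages.

    Splits with partitions yield one entry per partition; splits whose
    "partitions" field is null (None) yield a single entry.
    """
    sections = {}
    for split in response["result"]["splits"]:
        partitions = split.get("partitions")
        if partitions:
            for part in partitions:
                label = f'{split["name"]} {part["name"]}'
                sections[label] = part["pages"]
        else:
            sections[split["name"]] = split["pages"]
    return sections


example_response = {
    "result": {
        "splits": [
            {
                "name": "account holdings",
                "pages": [1, 2, 3, 7, 8, 9, 10, 11],
                "conf": "high",
                "partitions": [
                    {"name": "1111", "conf": "high", "pages": [1, 2, 3]},
                    {"name": "2222", "conf": "high", "pages": [7, 8, 9, 10, 11]},
                ],
            }
        ]
    }
}

sections = flatten_splits(example_response)
# sections == {"account holdings 1111": [1, 2, 3],
#              "account holdings 2222": [7, 8, 9, 10, 11]}
```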

By default, split_rules allow sections to overlap only on their first and last pages.
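
Because sections may share a boundary page, it can help to check which pages were assigned to more than one category before running extraction. The following is a small sketch for that check; find_shared_pages is a hypothetical helper that operates on the splits list from the response.

```python
from collections import defaultdict


def find_shared_pages(splits):
    """Return {page: [category names]} for pages in more than one category."""
    owners = defaultdict(list)
    for split in splits:
        for page in split["pages"]:
            owners[page].append(split["name"])
    return {page: names for page, names in owners.items() if len(names) > 1}


splits = [
    {"name": "Clinical Scenarios", "pages": [1, 2, 3]},
    {"name": "Recommendation", "pages": [1]},
]
shared = find_shared_pages(splits)
# shared == {1: ["Clinical Scenarios", "Recommendation"]}
```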

Pipelining with Split

Split is most powerful when used in conjunction with other endpoints, especially extraction. When each subsection of a long document needs a different schema, splitting breaks the document into logically grouped subsets. This process is known as pipelining.

Sample Scenarios

Scenario 1:

You have a long document, and the information you plan to extract appears on only a subset of the pages.

  1. Run /parse on the file and get the parse job id.

  2. Run /split on the parse output (feed in the parse job id).

Example config:

{
    "document_url": "jobid://{parse_job_id}",
    "split_description": [
        {
            "name": "Clinical Scenarios",
            "description": "Pages of the Clinical Scenarios"
        },
        {
            "name": "Recommendation",
            "description": "Page that includes the recommendation at the top of the page"
        }
    ]
}

Example response:

{
  "usage": {
    "num_pages": 20
  },
  "result": {
     "splits": [
       {
         "name": "Clinical Scenarios",
         "pages": [
           1,
           2,
           3
         ],
         "conf": "high",
         "partitions": null
       },
       {
         "name": "Recommendation",
         "pages": [
           1
         ],
         "conf": "high",
         "partitions": null
       }
     ]
   }
}
  3. Run /extract on the parse output using the same document_url you sent to split. Set the page range to the pages returned from split:

    {
        "document_url": "jobid://{parse_job_id}",
        "advanced_options": {
            "page_range": [
                {
                    "start": 1,
                    "end": 2
                }
            ]
        },
        "schema": extract_schema
    }
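
Split returns pages as a flat list, while the extract config above takes start and end objects. A conversion could be sketched as follows. pages_to_ranges is a hypothetical helper, and it assumes that page_range accepts several ranges in one list and that end is inclusive; neither is confirmed by this guide.

```python
def pages_to_ranges(pages):
    """Collapse a list of page numbers into contiguous {"start", "end"} ranges.

    Assumes "end" is inclusive. Duplicates are dropped and input is sorted.
    """
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1]["end"] + 1:
            # Page continues the current run, so extend it
            ranges[-1]["end"] = page
        else:
            # Gap found (or first page), so start a new run
            ranges.append({"start": page, "end": page})
    return ranges


page_range = pages_to_ranges([1, 2, 3, 7, 8, 9, 10, 11])
# page_range == [{"start": 1, "end": 3}, {"start": 7, "end": 11}]
```

The result can be placed under advanced_options.page_range in the extract config.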
    

Scenario 2:

You have a long document with information that can be clearly broken down into subsections.

An example of this would be a financial statement with multiple accounts. Each account will have a unique account number. Consider the following schema:

{  
  "type": "object",  
  "properties": {  
    "acct_holdings": {  
      "type": "array",  
      "items": {  
        "type": "object",  
        "properties": {  
          "price": {  
            "type": "number",  
            "description": "The price of the holding per unit."  
          },  
          "quantity": {  
            "type": "number",  
            "description": "The quantity of the holding"  
          },  
          "ticker": {  
            "type": "string",  
            "description": "The ticker of the holding."  
          },  
          "category": {  
            "type": "string",  
            "description": "The category of the holding, such as equity, fixed income, or cash."  
          },  
          "acct_number": {  
            "type": "string",  
            "description": "The unique number for the account."  
          }  
        }  
      },  
      "description": "A list of holdings across all the accounts."  
    }  
  }  
}

The schema extracts the holdings, so we can make a split request to get the holdings pages for each account. From the schema description, it is clear that acct_number is a unique field. Setting the account number as the partition key returns the account holdings in multiple sections, each with a different account number. Without the partition key, all of the account holdings would be returned as a single section.

Example config:

  "split_description": [  
    {  
      "name": "account holdings",  
      "description": "Pages for each separate account holdings",  
      "partition_key": "account number" 
    }  
  ],  
  "split_rules": "Split the document into the applicable sections"  

Example response:

{  
  "usage": {  
    "num_pages": 58  
  },  
  "result": {  
    "splits": [
      {
        "name": "account holdings",
        "pages": [1, 2, 3, 7, 8, 9, 10, 11, 12, 19, 20, 21, 22, 23, 24],
        "conf": "high",
        "partitions": [
          {
            "name": "1111",
            "conf": "high",
            "pages": [1, 2, 3]
          },
          {
            "name": "2222",
            "conf": "high",
            "pages": [7, 8, 9, 10, 11, 12]
          },
          {
            "name": "3333",
            "conf": "high",
            "pages": [19, 20, 21, 22, 23, 24]
          }
        ]
      }
    ]
  }  
}

You can then run extract on each subset of the pages. Because “acct_holdings” is an array, its values can be concatenated across the extract calls.
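
The concatenation step could look like the sketch below. It assumes each extract call yields a dict shaped like the schema above, with an acct_holdings array; the actual response wrapper may differ, and merge_holdings is a hypothetical helper with made-up sample values.

```python
def merge_holdings(extract_results):
    """Concatenate the acct_holdings arrays from several extract results."""
    merged = []
    for result in extract_results:
        # Treat a missing key as an empty list so one sparse partition
        # does not break the merge
        merged.extend(result.get("acct_holdings", []))
    return {"acct_holdings": merged}


# Illustrative per-partition results, one per extract call
per_partition = [
    {"acct_holdings": [{"ticker": "AAA", "quantity": 10, "acct_number": "1111"}]},
    {"acct_holdings": [{"ticker": "BBB", "quantity": 5, "acct_number": "2222"}]},
    {},
]
combined = merge_holdings(per_partition)
# combined["acct_holdings"] holds two entries, in call order
```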