Overview

Split is an endpoint designed for long documents whose sections require different treatments or extraction schemas. It uses a natural language prompt to divide a document into meaningful page ranges. It is typically run after parsing and before extraction, so that each extraction call targets only the relevant pages.

Example Use Cases

  1. Financial 10-K Report with sections like “Business Overview,” “Financial Data,” and “Corporate Governance”.
  2. Multi-page medical record book with intake packets of several patients.
  3. Employee onboarding packet with sections like policies, benefits, tax forms, and more.

Key Features:

Under the hood, split runs a parsing call followed by LLM calls to determine which pages match the description you provide.

Key Inputs:

  1. split_description

    • name: Name of the category of pages you’re trying to extract
    • description: Details describing the characteristics of the pages you want
    • partition_key (optional): Field used to divide a category into separate sections. In the output, each section name is the split_description name followed by a partition_key value found in the content.
      • Examples: account number, index, name, holding type. If your partition key is account number and the split description name is account holdings, then the output sections will be named account holdings 111, account holdings 112, etc.
  2. split_rules

    • Optional field for changing the default rules
      • Example: Each page can only be classified into a single category
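
As a rough illustration of how these inputs fit together, the sketch below assembles a request body as a plain Python dictionary. The helper name build_split_request is hypothetical and not part of any SDK; only the field names (document_url, split_description, name, description, partition_key, split_rules) come from this page, and the sketch assumes split_rules can simply be omitted when the defaults are wanted.

```python
from typing import Optional


def build_split_request(
    document_url: str,
    sections: list[dict],
    split_rules: Optional[str] = None,
) -> dict:
    """Assemble a split request body (hypothetical helper, not an SDK call)."""
    split_description = []
    for section in sections:
        entry = {"name": section["name"], "description": section["description"]}
        # partition_key is optional, so include it only when provided.
        if section.get("partition_key"):
            entry["partition_key"] = section["partition_key"]
        split_description.append(entry)

    body = {"document_url": document_url, "split_description": split_description}
    # Assumption: omitting split_rules leaves the default rules in place.
    if split_rules is not None:
        body["split_rules"] = split_rules
    return body


request_body = build_split_request(
    "jobid://{parse_job_id}",  # placeholder for a real parse job id
    [
        {
            "name": "account holdings",
            "description": "Pages for each separate account holdings",
            "partition_key": "account number",
        }
    ],
    split_rules="Each page can only be classified into a single category",
)
```

The resulting dictionary can be serialized with json.dumps and sent as the request body by whatever HTTP client you already use.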

Output:

  1. section_mapping
    • Each section maps to an array of the pages that belong to it.

By default, split_rules allow sections to overlap only at their first and last pages.
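
Because sections may share pages, it can help to check a section_mapping for pages that were assigned to more than one section before passing page ranges downstream. The following is a minimal sketch; overlapping_pages is a hypothetical helper, and the mapping is sample data in the shape described above.

```python
from collections import defaultdict


def overlapping_pages(section_mapping: dict[str, list[int]]) -> dict[int, list[str]]:
    """Return each page that appears in more than one section, with those sections."""
    page_to_sections = defaultdict(list)
    for section, pages in section_mapping.items():
        # set() guards against a page being listed twice within one section.
        for page in set(pages):
            page_to_sections[page].append(section)
    return {
        page: sections
        for page, sections in sorted(page_to_sections.items())
        if len(sections) > 1
    }


# Sample mapping: page 1 belongs to both sections.
mapping = {"Clinical Scenarios": [1, 2, 3], "Recommendation": [1]}
shared = overlapping_pages(mapping)
print(shared)  # {1: ['Clinical Scenarios', 'Recommendation']}
```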

Pipelining with Split

Split is powerful when used in conjunction with other endpoints, specifically extraction. When you have a long document with each subsection needing a different schema, splitting helps you break up the document into logically grouped subsets. This process is known as pipelining.

Sample Scenarios

Scenario 1:

You have a long document and the information that you plan to extract is only on some subset of the pages.

  1. Run /parse on the file and get the parse job id.

  2. Run /split on the parse output (feed in the parse job id).

Example config:

{
    "document_url": "jobid://{parse_job_id}",
    "split_description": [
        {
            "name": "Clinical Scenarios",
            "description": "Pages of the Clinical Scenarios"
        },
        {
            "name": "Recommendation",
            "description": "Page that includes the recommendation at the top of the page"
        }
    ]
}

Example response:

{
  "usage": {
    "num_pages": 20
  },
  "result": {
    "section_mapping": {
      "Clinical Scenarios": [
        1,
        2,
        3
      ],
      "Recommendation": [
        1
      ]
    }
  }
}

  3. Run /extract on the parse output using the same document_url you sent to split. Set the page range to the pages returned by split:

    {
        "document_url": "jobid://{parse_job_id}",
        "advanced_options": {
            "page_range": [
                {
                    "start": 1,
                    "end": 2
                }
            ]
        },
        "schema": extract_schema
    }
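
Split returns individual page numbers, while the extract config above takes start/end ranges. One possible way to bridge the two is to collapse consecutive pages into ranges, as sketched below. The helper names pages_to_ranges and build_extract_request are hypothetical; the page_range, advanced_options, document_url, and schema fields follow the config above.

```python
def pages_to_ranges(pages: list[int]) -> list[dict]:
    """Collapse page numbers into contiguous {"start", "end"} ranges."""
    ranges: list[dict] = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1]["end"] + 1:
            # The page continues the current run, so extend it.
            ranges[-1]["end"] = page
        else:
            # A gap (or the first page) starts a new run.
            ranges.append({"start": page, "end": page})
    return ranges


def build_extract_request(document_url: str, pages: list[int], schema: dict) -> dict:
    """Assemble an extract request body restricted to the given pages."""
    return {
        "document_url": document_url,
        "advanced_options": {"page_range": pages_to_ranges(pages)},
        "schema": schema,
    }


section_mapping = {"Clinical Scenarios": [1, 2, 3], "Recommendation": [1]}
extract_body = build_extract_request(
    "jobid://{parse_job_id}",  # placeholder for a real parse job id
    section_mapping["Clinical Scenarios"],
    schema={"type": "object", "properties": {}},  # stand-in for your extract schema
)
print(extract_body["advanced_options"])  # {'page_range': [{'start': 1, 'end': 3}]}
```

Non-contiguous pages produce several ranges, which matches the array form of page_range.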
    

Scenario 2:

You have a long document with information that can be clearly broken down into subsections.

An example of this would be a financial statement with multiple accounts. Each account will have a unique account number. Consider the following schema:

{  
  "type": "object",  
  "properties": {  
    "acct_holdings": {  
      "type": "array",  
      "items": {  
        "type": "object",  
        "properties": {  
          "price": {  
            "type": "number",  
            "description": "The price of the holding per unit."  
          },  
          "quantity": {  
            "type": "number",  
            "description": "The quantity of the holding"  
          },  
          "ticker": {  
            "type": "string",  
            "description": "The ticker of the holding."  
          },  
          "category": {  
            "type": "string",  
            "description": "The category of the holding, such as equity, fixed income, or cash."  
          },  
          "acct_number": {  
            "type": "string",  
            "description": "The unique number for the account."  
          }  
        }  
      },  
      "description": "A list of holdings across all the accounts."  
    }  
  }  
}

The schema extracts the holdings, so we can make a split request to find the pages that contain each account's holdings. The schema description makes clear that acct_number is unique to each account. Setting the account number as the partition key returns the account holdings in multiple sections, one per account number. Without a partition key, all of the account holdings pages would be returned in a single array.

Example config:

{
  "split_description": [
    {
      "name": "account holdings",
      "description": "Pages for each separate account holdings",
      "partition_key": "account number"
    }
  ],
  "split_rules": "Split the document into the applicable sections"
}

Example response:

{  
  "usage": {  
    "num_pages": 58  
  },  
  "result": {  
    "section_mapping": {  
      "account holdings 1111": [  
        1,  
        2,  
        3  
      ],  
      "account holdings 2222": [  
        7,  
        8,  
        9,  
        10,  
        11,  
        12  
      ],
      "account holdings 3333": [  
        19,  
        20,  
        21,  
        22,  
        23,  
        24  
      ]
    }  
  }  
}
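
To work with a partitioned response like the one above, you may want the partition key value separated from the section name. The sketch below assumes the value is appended after the split_description name with a space between them, as in this example; pages_by_partition is a hypothetical helper.

```python
def pages_by_partition(
    section_mapping: dict[str, list[int]], base_name: str
) -> dict[str, list[int]]:
    """Map each partition key value to its pages.

    Assumes section names look like "<base_name> <partition value>".
    Sections that do not start with base_name are skipped.
    """
    grouped = {}
    for section, pages in section_mapping.items():
        if not section.startswith(base_name):
            continue
        # Whatever follows the base name is treated as the partition value.
        value = section[len(base_name):].strip()
        grouped[value] = pages
    return grouped


section_mapping = {
    "account holdings 1111": [1, 2, 3],
    "account holdings 2222": [7, 8, 9, 10, 11, 12],
    "account holdings 3333": [19, 20, 21, 22, 23, 24],
}
accounts = pages_by_partition(section_mapping, "account holdings")
print(sorted(accounts))  # ['1111', '2222', '3333']
```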

You can then run extract on each subset of the pages. Because “acct_holdings” is an array, its values can be concatenated across the extract calls.
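
The concatenation step could look like the sketch below. It assumes each extract call yields a dictionary shaped like the schema above, with an "acct_holdings" list; the exact response envelope of extract is not shown on this page, so adapt the lookup to the real response. merge_holdings is a hypothetical helper and the holdings values are made-up placeholders.

```python
def merge_holdings(extract_results: list[dict]) -> list[dict]:
    """Concatenate the acct_holdings arrays from several extract results."""
    merged: list[dict] = []
    for result in extract_results:
        # A result with no holdings contributes nothing.
        merged.extend(result.get("acct_holdings", []))
    return merged


# Placeholder results, one per account section returned by split.
results = [
    {"acct_holdings": [{"ticker": "AAA", "quantity": 10, "acct_number": "1111"}]},
    {"acct_holdings": [{"ticker": "BBB", "quantity": 5, "acct_number": "2222"}]},
    {},  # e.g. a section where nothing was extracted
]
all_holdings = merge_holdings(results)
print(len(all_holdings))  # 2
```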