Split
Overview
Split is an endpoint designed for long documents with multiple sections that require different treatments or extraction schemas. It uses a natural language prompt to intelligently divide a document into meaningful page ranges. This step is typically used before extraction or after parsing to make the process more targeted and effective.
Example Use Cases
- Financial 10-K Report with sections like “Business Overview,” “Financial Data,” and “Corporate Governance”.
- Multi-page medical record book containing intake packets for several patients.
- Employee onboarding packet with sections like policies, benefits, tax forms, and more.
Key Features:
Under the hood, split runs a parsing call followed by LLM calls that determine which pages match the description you provide.
Key Inputs:
- split_description
  - name: Name of the category of pages you're trying to extract
  - description: Details describing the characteristics of the pages you want
  - partition_key (optional): Identifier to split sections by. In the output, the partition_key is filled in with values from the content and appended after the split_description name.
    - Examples: account number, index, name, holding type. If your partition key is account number and the split description name is account holdings, then the output mappings will be account holdings 111, account holdings 112, etc.
- split_rules
  - Optional field for changing the default rules
    - Example: Each page can only be classified into a single category
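Put together, a split request config built from these inputs might look like the sketch below, shown as a Python dict. The split_description fields, partition_key, and split_rules are the inputs described above; the concrete names, descriptions, and the use of a list of categories are illustrative assumptions.

```python
# Illustrative split config: the split_description entries and split_rules
# follow the inputs listed above; the concrete values are made up.
split_config = {
    "split_description": [
        {
            "name": "account holdings",
            "description": "Pages listing the holdings of a single brokerage account",
            "partition_key": "account number",
        },
        {
            "name": "disclosures",
            "description": "Legal disclosures and boilerplate at the end of the statement",
        },
    ],
    "split_rules": "Each page can only be classified into a single category",
}
```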
Output:
- section_mapping
  - Each section contains an array of the pages that belong to that section.

The default split_rules only allow sections to overlap at their first and last pages.
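Under those defaults, a section_mapping might look like the sketch below: each key is a section name (with the partition key value appended when one is set) and each value is the array of pages in that section. The page numbers and section names are illustrative.

```python
# Illustrative section_mapping output: page 4 appears in two sections only
# because it is the last page of one section and the first page of the next.
section_mapping = {
    "account holdings 111": [2, 3, 4],
    "account holdings 112": [4, 5, 6],
    "disclosures": [7, 8],
}
```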
Pipelining with Split
Split is powerful when used in conjunction with other endpoints, particularly extraction. When a long document has subsections that each need a different schema, splitting breaks the document into logically grouped subsets. This process is known as pipelining.
Sample Scenarios
Scenario 1:
You have a long document and the information that you plan to extract is only on some subset of the pages.
- Run /parse on the file and get the parse job id.
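A minimal sketch of this step, assuming a JSON API reachable over HTTPS; the base URL, auth header, and the document_url and job_id field names are assumptions rather than the documented request shape.

```python
import requests

BASE_URL = "https://api.example.com"            # placeholder base URL
HEADERS = {"Authorization": "Bearer YOUR_KEY"}  # placeholder auth header

# Parse the document and keep the returned parse job id (field names assumed).
parse_resp = requests.post(
    f"{BASE_URL}/parse",
    headers=HEADERS,
    json={"document_url": "https://example.com/report.pdf"},
)
parse_job_id = parse_resp.json()["job_id"]
```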
- Run /split on the parse output (feed in the parse job id).
Example config:
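To isolate only the pages that carry the data you plan to extract, the request body might look like the sketch below, continuing the parse sketch above (the job_id wrapper field is an assumption; split_description is the documented input).

```python
# Hypothetical /split request body: target only the pages of interest.
split_request = {
    "job_id": parse_job_id,  # parse output from the previous step (field name assumed)
    "split_description": [
        {
            "name": "financial data",
            "description": "Pages containing financial statements, tables, and figures",
        }
    ],
}
split_resp = requests.post(f"{BASE_URL}/split", headers=HEADERS, json=split_request)
```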
Example response:
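The response could then contain a section_mapping along these lines (the section name and page numbers are illustrative):

```python
# Hypothetical /split response body.
split_result = {
    "section_mapping": {
        "financial data": [12, 13, 14, 15],
    }
}
```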
- Run /extract on the parse output using the same document_url you sent into split. Set the page range to the pages returned from split:
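Continuing the sketch, the extract call reuses the same document_url and restricts itself to the pages returned by split; the page_range shape and the schema field are assumptions.

```python
pages = split_result["section_mapping"]["financial data"]

extract_resp = requests.post(
    f"{BASE_URL}/extract",
    headers=HEADERS,
    json={
        "document_url": "https://example.com/report.pdf",  # same URL sent to split
        "page_range": {"start": min(pages), "end": max(pages)},  # assumed shape
        "schema": {"type": "object"},  # placeholder extraction schema
    },
)
extracted = extract_resp.json()
```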
Scenario 2:
You have a long document with information that can be clearly broken down into subsections.
An example of this would be a financial statement with multiple accounts. Each account will have a unique account number. Consider the following schema:
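A simplified, JSON-Schema-style sketch of such a schema is shown below, with an acct_number field and an acct_holdings array matching the fields discussed next; the holding sub-fields are made up for illustration.

```python
# Simplified extraction schema: one unique account number per account, plus an
# array of that account's holdings.
extraction_schema = {
    "type": "object",
    "properties": {
        "acct_number": {
            "type": "string",
            "description": "The unique account number for this account",
        },
        "acct_holdings": {
            "type": "array",
            "description": "All holdings listed for this account",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Name of the holding"},
                    "value": {"type": "number", "description": "Market value of the holding"},
                },
            },
        },
    },
}
```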
The schema extracts holdings, so we can make a split request to get the holdings for each account. From the schema description, it's clear that acct_number is a unique field. By setting the account number as the partition key, the account holdings are returned in multiple sections, each with a different account number. If the partition key is not set in this case, all of the account holdings are returned in a single array.
Example config:
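A split config for this case could look like the following sketch, with account number as the partition key (the job_id field is an assumption, as in the earlier sketches):

```python
# Hypothetical /split request body: partition the holdings pages by account number.
split_request = {
    "job_id": parse_job_id,  # parse output for the statement (field name assumed)
    "split_description": [
        {
            "name": "account holdings",
            "description": "Pages listing the holdings of a single account",
            "partition_key": "account number",
        }
    ],
}
```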
Example response:
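With the partition key set, the section_mapping could come back with one section per account number (values illustrative):

```python
# Hypothetical /split response body: one section per account number.
split_result = {
    "section_mapping": {
        "account holdings 111": [2, 3, 4],
        "account holdings 112": [5, 6, 7],
        "account holdings 113": [8],
    }
}
```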
You can then run extract on each subset of the pages. The acct_holdings can be concatenated across the extract calls since it's an array.
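A sketch of that last step, assuming the BASE_URL, HEADERS, and request shape from the earlier sketches and the extraction_schema defined above:

```python
import requests

# Run /extract once per account section and concatenate the acct_holdings
# arrays across calls.
all_holdings = []
for section_name, pages in split_result["section_mapping"].items():
    resp = requests.post(
        f"{BASE_URL}/extract",
        headers=HEADERS,
        json={
            "document_url": "https://example.com/statement.pdf",  # same URL sent to split
            "page_range": {"start": min(pages), "end": max(pages)},  # assumed shape
            "schema": extraction_schema,
        },
    )
    all_holdings.extend(resp.json()["acct_holdings"])
```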