Splitting Documents
Background
Key Features:
- Uses /parse with LLM calls to determine which pages match your description
- Can be chained with /extract: split identifies the pages that contain the type of content you want, and extract then runs on only those pages
Key Inputs:
- split_description
  - name: Name of the category of pages you’re trying to extract
  - description: Details describing the characteristics of the pages you want
  - partition_key (optional): Identifier to split sections by. In the output, the partition_key is filled in with values found in the content and appended after the split_description name
    - Examples: account number, index, name, holding type
- split_rules
  - Optional field for changing the default rules
  - Example: Each page can only be classified into a single category
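As an illustration of how these inputs fit together, here is a minimal Python sketch that assembles a split config. The field names come from the list above, but the exact payload shape (a list of entries under split_description, and split_rules as a plain string) is an assumption, and the helper names are hypothetical. Check the API reference before relying on it.

```python
from typing import Any, Dict, List, Optional


def make_split_description(
    name: str, description: str, partition_key: Optional[str] = None
) -> Dict[str, Any]:
    """Build one split_description entry (hypothetical helper)."""
    if not name or not description:
        raise ValueError("name and description are both required")
    entry: Dict[str, Any] = {"name": name, "description": description}
    # partition_key is optional, so it is left out entirely when unset.
    if partition_key is not None:
        entry["partition_key"] = partition_key
    return entry


def make_split_config(
    descriptions: List[Dict[str, Any]], split_rules: Optional[str] = None
) -> Dict[str, Any]:
    """Assemble the config; the top-level layout is an assumption."""
    config: Dict[str, Any] = {"split_description": list(descriptions)}
    # split_rules only needs to be sent when overriding the default rules.
    if split_rules is not None:
        config["split_rules"] = split_rules
    return config


config = make_split_config(
    [
        make_split_description(
            name="Account Holdings",
            description="Pages listing the holdings held in an account",
            partition_key="account number",
        )
    ],
    split_rules="Each page can only be classified into a single category",
)
```

Leaving optional keys out when they are unset keeps the request minimal and lets the service apply its defaults.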
Example Use Cases
- Scenario 1: You have a long document and the information that you plan to extract is only on some subset of the pages.
- Scenario 2: You have a long document with information that can be clearly broken down into subsections.
Scenario 1:
You have a long document and the information that you plan to extract is only on some subset of the pages.
- Run /parse on the file and get the parse job ID
- Run /split on the parse output. Example config:
Example response:
- Run /extract on the parse output, using the same document_url you sent to split, and set the page range to the pages returned by split.
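The hand-off between split and extract in the last step can be sketched in Python as follows. The response shape used here (a "splits" list whose items carry "name" and "pages") is an assumption made for illustration, not the documented format, and both helper names are hypothetical. Only the page bookkeeping is shown; the HTTP calls themselves are left out.

```python
from typing import Any, Dict, List, Tuple


def pages_for(split_response: Dict[str, Any], name: str) -> List[int]:
    """Collect the sorted, de-duplicated pages of every section called `name`."""
    pages = set()
    for section in split_response.get("splits", []):  # assumed key
        if section.get("name") == name:
            pages.update(section.get("pages", []))  # assumed key
    return sorted(pages)


def to_ranges(pages: List[int]) -> List[Tuple[int, int]]:
    """Collapse sorted page numbers into inclusive (start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for page in pages:
        if ranges and page == ranges[-1][1] + 1:
            # Page continues the current run, so extend its end.
            ranges[-1] = (ranges[-1][0], page)
        else:
            ranges.append((page, page))
    return ranges


# Hypothetical split response for a document where only some pages matter.
split_response = {
    "splits": [
        {"name": "Invoice", "pages": [3, 4, 5, 9]},
        {"name": "Other", "pages": [1, 2]},
    ]
}

invoice_pages = pages_for(split_response, "Invoice")  # [3, 4, 5, 9]
invoice_ranges = to_ranges(invoice_pages)             # [(3, 5), (9, 9)]
```

Each resulting range can then be passed as the page range of a separate /extract call, or the page list can be passed directly if the endpoint accepts one.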
Scenario 2:
You have a long document with information that can be clearly broken down into subsections.
An example of this would be a financial statement with multiple accounts. Each account will have a unique account number. Consider the following schema:
The schema extracts the holdings, so we can make a split request to get the holdings for each account. From the schema description, it’s clear that acct_number is a unique field. Setting the account number as the partition key returns the account holdings in multiple sections, one per account number. Without the partition key, all of the account holdings would be returned in a single array.
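To make the effect of the partition key concrete, the following Python sketch simulates the grouping locally. It does not call the API: the per-page account numbers are made-up sample data, the function name is hypothetical, and sections are keyed by a (name, partition value) tuple because the exact way the service appends the value to the name is not shown here.

```python
from typing import Dict, List, Optional, Tuple

SectionKey = Tuple[str, Optional[str]]


def group_pages(
    page_accounts: List[Tuple[int, str]], name: str, use_partition_key: bool
) -> Dict[SectionKey, List[int]]:
    """Group pages into sections, optionally one section per account number."""
    sections: Dict[SectionKey, List[int]] = {}
    for page, acct_number in page_accounts:
        # With a partition key, each distinct value gets its own section.
        key = (name, acct_number if use_partition_key else None)
        sections.setdefault(key, []).append(page)
    return sections


# Sample data: (page number, account number found on that page).
page_accounts = [(1, "1234"), (2, "1234"), (3, "5678")]

partitioned = group_pages(page_accounts, "Account Holdings", True)
# {("Account Holdings", "1234"): [1, 2], ("Account Holdings", "5678"): [3]}

unpartitioned = group_pages(page_accounts, "Account Holdings", False)
# {("Account Holdings", None): [1, 2, 3]}
```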
Example config:
Example response:
You can then run /extract on each subset of the pages. Because “acct_holdings” is an array, its values can be concatenated across the extract calls.
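A minimal Python sketch of that concatenation step, assuming each extract call yields a dictionary with an "acct_holdings" list. The item fields other than acct_number are invented for illustration, and the function name is hypothetical.

```python
from typing import Any, Dict, List


def merge_holdings(extract_results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Concatenate acct_holdings from several extract results, in call order."""
    merged: List[Dict[str, Any]] = []
    for result in extract_results:
        # A result with no holdings contributes nothing.
        merged.extend(result.get("acct_holdings", []))
    return merged


# One hypothetical result per account section returned by split.
extract_results = [
    {"acct_holdings": [{"acct_number": "1234", "holding": "Fund A"}]},
    {"acct_holdings": [{"acct_number": "5678", "holding": "Fund B"},
                       {"acct_number": "5678", "holding": "Fund C"}]},
]

all_holdings = merge_holdings(extract_results)
```

The merged list keeps the order of the extract calls, so holdings stay grouped by account as long as the calls are made section by section.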