Splitting Documents
Background
Key Features:
- Uses /parse with LLM calls to determine which pages match your description
- Can be chained with /extract, so that extraction runs only on the pages that match the type you want to extract from
Key Inputs:
- split_description
  - name: Name of the category of pages you’re trying to extract
  - description: Details describing the characteristics of the pages you want
  - partition_key (optional): Identifier to split sections by. In the output, the partition_key is filled in with values from the content and appended after the split_description name
    - Examples: account number, index, name, holding type
- split_rules
  - Optional field for changing the default rules
  - Example: Each page can only be classified into a single category
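The inputs above can be assembled into a request body programmatically. This is a minimal sketch, not an official client: the helper names `split_category` and `build_split_request` are hypothetical, and the `jobid://` form of `document_url` is taken from the example configs in this guide.

```python
from typing import List, Optional


def split_category(name: str, description: str,
                   partition_key: Optional[str] = None) -> dict:
    """Build one split_description entry (hypothetical helper)."""
    entry = {"name": name, "description": description}
    # partition_key is optional, so only include it when provided.
    if partition_key is not None:
        entry["partition_key"] = partition_key
    return entry


def build_split_request(parse_job_id: str, categories: List[dict],
                        split_rules: Optional[str] = None) -> dict:
    """Assemble a /split request body from a parse job id and categories."""
    body = {
        "document_url": f"jobid://{parse_job_id}",
        "split_description": categories,
    }
    # split_rules is optional; omitting it keeps the default rules.
    if split_rules is not None:
        body["split_rules"] = split_rules
    return body


request_body = build_split_request(
    "example-job-id",  # placeholder, not a real job id
    [split_category("account holdings",
                    "Pages for each separate account holdings",
                    partition_key="account number")],
)
```

Leaving out `partition_key` and `split_rules` when they are not set keeps the body limited to the fields you actually want to override.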
Example Use Cases
- Scenario 1: You have a long document and the information that you plan to extract is only on some subset of the pages.
- Scenario 2: You have a long document with information that can be clearly broken down into subsections.
Scenario 1:
You have a long document and the information that you plan to extract is only on some subset of the pages.
- Run /parse on the file and get the parse job id
- Run /split on the parse output.
Example config:
{ "document_url": "jobid://{parse_job_id}", "split_description": [ { "name": "Clinical Scenarios", "description": "Pages of the Clinical Scenarios" }, { "name": "Recommendation", "description": "Page that includes the recommendation at the top of the page" } ] }
Example response:
{ "usage": { "num_pages": 20 }, "result": { "section_mapping": { "Clinical Scenarios": [ 1, 2, 3 ], "Recommendation": [ 1 ] } } }
- Run /extract on the parse output using the same document_url you sent to /split. Set the page range to the pages returned from /split:
{ "document_url": "jobid://{parse_job_id}", "advanced_options": { "page_range": [ { "start": 1, "end": 2 } ] }, "schema": extract_schema }
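The page list returned by /split has to be turned into the `page_range` form that /extract expects. The sketch below collapses a page list into contiguous ranges. It assumes that `start` and `end` are both inclusive and that `page_range` accepts more than one range object; neither is confirmed by this guide. The function names are hypothetical.

```python
from typing import Any, Dict, List


def pages_to_ranges(pages: List[int]) -> List[Dict[str, int]]:
    """Collapse page numbers into contiguous {"start", "end"} ranges.

    Assumes inclusive bounds on both ends.
    """
    ranges: List[Dict[str, int]] = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1]["end"] + 1:
            # Page continues the current run, so extend it.
            ranges[-1]["end"] = page
        else:
            # Gap (or first page): start a new run.
            ranges.append({"start": page, "end": page})
    return ranges


def build_extract_request(parse_job_id: str, pages: List[int],
                          schema: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble an /extract request body limited to the given pages."""
    return {
        "document_url": f"jobid://{parse_job_id}",
        "advanced_options": {"page_range": pages_to_ranges(pages)},
        "schema": schema,
    }


# Pages taken from the example split response above.
clinical_ranges = pages_to_ranges([1, 2, 3])
```

Non-adjacent pages such as `[7, 8, 10]` produce two separate ranges, so no unrelated pages are pulled into the extraction.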
Scenario 2:
You have a long document with information that can be clearly broken down into subsections.
An example of this would be a financial statement with multiple accounts. Each account will have a unique account number. Consider the following schema:
{
"type": "object",
"properties": {
"acct_holdings": {
"type": "array",
"items": {
"type": "object",
"properties": {
"price": {
"type": "number",
"description": "The price of the holding per unit."
},
"quantity": {
"type": "number",
"description": "The quantity of the holding"
},
"ticker": {
"type": "string",
"description": "The ticker of the holding."
},
"category": {
"type": "string",
"description": "The category of the holding, such as equity, fixed income, or cash."
},
"acct_number": {
"type": "string",
"description": "The unique number for the account."
}
}
},
"description": "A list of holdings across all the accounts."
}
}
}
The schema extracts the holdings, so we can make a split request to get the holdings for each account. The schema description makes it clear that acct_number is a unique field. By setting the account number as the partition key, we get the account holdings returned in multiple sections, where each section has a different account number. If the partition key is not set in this case, all of the account holdings would be returned in a single array.
Example config:
{
"split_description": [
{
"name": "account holdings",
"description": "Pages for each separate account holdings",
"partition_key": "account number"
}
],
"split_rules": "Split the document into the applicable sections"
}
Example response:
{
"usage": {
"num_pages": 58
},
"result": {
"section_mapping": {
"account holdings 1111": [
1,
2,
3
],
"account holdings 2222": [
7,
8,
9,
10,
11,
12
],
"account holdings 3333": [
19,
20,
21,
22,
23,
24
]
}
}
}
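Because the partition value is appended to the category name, the section names in the response can be mapped back to account numbers. A minimal sketch, assuming the name and the value are separated by a single space as in the example response above; `group_by_partition` is a hypothetical helper.

```python
from typing import Dict, List


def group_by_partition(section_mapping: Dict[str, List[int]],
                       category_name: str) -> Dict[str, List[int]]:
    """Map each partition value (e.g. an account number) to its pages.

    Assumes section names look like "<category_name> <partition value>".
    """
    prefix = category_name + " "
    groups: Dict[str, List[int]] = {}
    for section, pages in section_mapping.items():
        if section.startswith(prefix):
            # Everything after the prefix is the partition value.
            groups[section[len(prefix):]] = pages
    return groups


# section_mapping copied from the example response above.
section_mapping = {
    "account holdings 1111": [1, 2, 3],
    "account holdings 2222": [7, 8, 9, 10, 11, 12],
    "account holdings 3333": [19, 20, 21, 22, 23, 24],
}
pages_by_account = group_by_partition(section_mapping, "account holdings")
```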
You can then run /extract on each subset of the pages. The “acct_holdings” results can be concatenated across the extract calls since it’s an array.
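The concatenation step can be sketched as follows. This assumes each per-account extract result has already been unwrapped to an object matching the schema above, that is, a dict with an `acct_holdings` list. The actual response envelope of /extract is not shown in this guide, and `merge_holdings` is a hypothetical helper.

```python
from typing import Any, Dict, List


def merge_holdings(extract_results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Concatenate the acct_holdings arrays from several extract results."""
    merged: List[Dict[str, Any]] = []
    for result in extract_results:
        # A result with no holdings contributes nothing.
        merged.extend(result.get("acct_holdings", []))
    return {"acct_holdings": merged}


# Illustrative data only, not real extract output.
per_account_results = [
    {"acct_holdings": [{"ticker": "AAA", "quantity": 10, "acct_number": "1111"}]},
    {"acct_holdings": [{"ticker": "BBB", "quantity": 5, "acct_number": "2222"}]},
]
combined = merge_holdings(per_account_results)
```

Since every holding carries its own `acct_number`, the combined list can still be traced back to the account each item came from.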