Splitting Documents

Background

Key Features:

  • Uses the /parse output together with LLM calls to determine which pages match your description
  • Can be chained with /extract so that extraction runs only on the pages that match the type you want to extract from

Key Inputs:

  1. split_description
    • name: Name of the category of pages you want to identify
    • description: Details describing the characteristics of the pages you want
    • partition_key (optional): Identifier to split sections by. In the output, the partition_key values found in the content are appended to the split_description name
      • Examples: account number, index, name, holding type
  2. split_rules
    • Optional field for changing the default rules
      • Example: Each page can only be classified into a single category
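
As a rough illustration of how these inputs fit together, the sketch below assembles a /split request body in Python. The helper name build_split_config is hypothetical (it is not part of any SDK), and the document_url field and its "jobid://" form are taken from the example configs later in this guide. It only builds the dictionary; it does not call the API.

```python
from typing import Optional


def build_split_config(document_url: str,
                       categories: list[dict],
                       split_rules: Optional[str] = None) -> dict:
    """Assemble a /split request body (hypothetical helper, not an SDK call)."""
    split_description = []
    for category in categories:
        # name and description are given for every category.
        entry = {
            "name": category["name"],
            "description": category["description"],
        }
        # partition_key is optional; include it only when provided.
        if category.get("partition_key"):
            entry["partition_key"] = category["partition_key"]
        split_description.append(entry)

    config = {
        "document_url": document_url,
        "split_description": split_description,
    }
    # split_rules is optional; leaving it out keeps the default rules.
    if split_rules is not None:
        config["split_rules"] = split_rules
    return config


config = build_split_config(
    "jobid://{parse_job_id}",  # placeholder, as written in this guide's examples
    [
        {
            "name": "account holdings",
            "description": "Pages for each separate account holdings",
            "partition_key": "account number",
        }
    ],
    split_rules="Each page can only be classified into a single category",
)
```

The resulting dictionary can be serialized with json.dumps and sent as the request body by whatever HTTP client you already use.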

Example Use Cases

  1. Scenario 1: You have a long document and the information that you plan to extract is only on some subset of the pages.
  2. Scenario 2: You have a long document with information that can be clearly broken down into subsections.

Scenario 1:

You have a long document and the information that you plan to extract is only on some subset of the pages.

  1. Run /parse on the file and get the parse job ID

  2. Run /split on the parse output.
    Example config:

    {
        "document_url": "jobid://{parse_job_id}",
        "split_description": [
            {
                "name": "Clinical Scenarios",
                "description": "Pages of the Clinical Scenarios"
            },
            {
                "name": "Recommendation",
                "description": "Page that includes the recommendation at the top of the page"
            }
        ]
    }
    

    Example response:

    {
      "usage": {
        "num_pages": 20
      },
      "result": {
        "section_mapping": {
          "Clinical Scenarios": [
            1,
            2,
            3
          ],
          "Recommendation": [
            1
          ]
        }
      }
    }
    
    
  3. Run /extract on the parse output using the same document_url you sent to /split. Set the page range to the pages returned from /split:

    {
        "document_url": "jobid://{parse_job_id}",
        "advanced_options": 
        { "page_range": [
                {
                    "start": 1,
                    "end": 2
                }
            ] },
        "schema": extract_schema
    }
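
The step from the split response to the extract config can be sketched in Python as follows. The helper names pages_to_ranges and build_extract_config are hypothetical, extract_schema is a stand-in for your own schema, and the sketch assumes that "end" in page_range is inclusive, which this guide does not state explicitly. No API call is made; the code only transforms the dictionaries.

```python
def pages_to_ranges(pages: list[int]) -> list[dict]:
    """Collapse page numbers into contiguous {"start", "end"} ranges.

    Assumption: "end" is inclusive.
    """
    ranges: list[dict] = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1]["end"] + 1:
            # Page continues the current run, so extend it.
            ranges[-1]["end"] = page
        else:
            # Gap found (or first page), so start a new run.
            ranges.append({"start": page, "end": page})
    return ranges


def build_extract_config(document_url: str, pages: list[int], schema: dict) -> dict:
    """Build an /extract request body limited to the given pages."""
    return {
        "document_url": document_url,
        "advanced_options": {"page_range": pages_to_ranges(pages)},
        "schema": schema,
    }


# The split response from step 2.
split_response = {
    "usage": {"num_pages": 20},
    "result": {
        "section_mapping": {
            "Clinical Scenarios": [1, 2, 3],
            "Recommendation": [1],
        }
    },
}

extract_schema = {"type": "object", "properties": {}}  # stand-in for your schema

section_mapping = split_response["result"]["section_mapping"]
extract_config = build_extract_config(
    "jobid://{parse_job_id}",  # same document_url that was sent to /split
    section_mapping["Clinical Scenarios"],
    extract_schema,
)
```

Collapsing into ranges keeps the request short when a section covers many consecutive pages, and still handles sections whose pages are not contiguous.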
    

Scenario 2:

You have a long document with information that can be clearly broken down into subsections.

An example of this would be a financial statement with multiple accounts. Each account will have a unique account number. Consider the following schema:

{  
  "type": "object",  
  "properties": {  
    "acct_holdings": {  
      "type": "array",  
      "items": {  
        "type": "object",  
        "properties": {  
          "price": {  
            "type": "number",  
            "description": "The price of the holding per unit."  
          },  
          "quantity": {  
            "type": "number",  
            "description": "The quantity of the holding"  
          },  
          "ticker": {  
            "type": "string",  
            "description": "The ticker of the holding."  
          },  
          "category": {  
            "type": "string",  
            "description": "The category of the holding, such as equity, fixed income, or cash."  
          },  
          "acct_number": {  
            "type": "string",  
            "description": "The unique number for the account."  
          }  
        }  
      },  
      "description": "A list of holdings across all the accounts."  
    }  
  }  
}

This schema extracts the holdings, so we can make a split request to get the pages of holdings for each account. The schema description states that acct_number is unique to each account. Setting the account number as the partition key returns the account holdings in multiple sections, one per account number. Without a partition key, all of the account holdings would be returned in a single array.

Example config:

{
  "split_description": [
    {
      "name": "account holdings",
      "description": "Pages for each separate account holdings",
      "partition_key": "account number"
    }
  ],
  "split_rules": "Split the document into the applicable sections"
}

Example response:

{  
  "usage": {  
    "num_pages": 58  
  },  
  "result": {  
    "section_mapping": {  
      "account holdings 1111": [  
        1,  
        2,  
        3  
      ],  
      "account holdings 2222": [  
        7,  
        8,  
        9,  
        10,  
        11,  
        12  
      ],
      "account holdings 3333": [  
        19,  
        20,  
        21,  
        22,  
        23,  
        24  
      ]
    }  
  }  
}
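
To work with the partitioned sections programmatically, one option is to strip the split_description name from each section key and keep the partition value. The Python sketch below does this; the function name group_by_partition is hypothetical, and it assumes the partition value follows the name after a single space, as in the response above.

```python
def group_by_partition(section_mapping: dict, name: str) -> dict:
    """Map each partition value to its pages for one split_description name.

    Assumption: section keys look like "<name> <partition value>".
    """
    prefix = name + " "
    groups = {}
    for section, pages in section_mapping.items():
        if section.startswith(prefix):
            # Whatever follows the name is the partition value.
            groups[section[len(prefix):]] = pages
    return groups


section_mapping = {
    "account holdings 1111": [1, 2, 3],
    "account holdings 2222": [7, 8, 9, 10, 11, 12],
    "account holdings 3333": [19, 20, 21, 22, 23, 24],
}

pages_by_account = group_by_partition(section_mapping, "account holdings")
for account_number, pages in pages_by_account.items():
    print(account_number, pages)
```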

You can then run /extract on each subset of pages. Because "acct_holdings" is an array, the results can be concatenated across the extract calls.
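
A minimal Python sketch of that concatenation is shown below. It assumes each extract result has already been reduced to a dictionary matching the schema above (the exact shape of the /extract response is not covered in this guide), and the function name merge_holdings and the sample values are made up purely for illustration.

```python
def merge_holdings(extract_results: list[dict]) -> list[dict]:
    """Concatenate the acct_holdings arrays from several extract results."""
    merged: list[dict] = []
    for result in extract_results:
        # A result with no holdings contributes nothing.
        merged.extend(result.get("acct_holdings", []))
    return merged


# Illustrative values only; real results come from one /extract call per section.
results = [
    {"acct_holdings": [{"ticker": "AAA", "quantity": 10, "acct_number": "1111"}]},
    {"acct_holdings": [{"ticker": "BBB", "quantity": 5, "acct_number": "2222"}]},
    {"acct_holdings": []},
]

all_holdings = merge_holdings(results)
```

Since every holding carries its own acct_number, the merged list can still be grouped back by account afterwards.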