Split identifies sections in a document based on natural language descriptions. This page covers the configuration options that control how sections are identified and returned. For basic usage and the full workflow, see the Split endpoint documentation.

split_description

The split_description array defines what sections to look for. Each entry has three fields:
split_description=[
    {
        "name": "Account Summary",
        "description": "Overview section with balances and totals at the top of the statement",
        "partition_key": "account_number"  # Optional
    }
]
  • name: The identifier returned in results. Use names that make sense for your downstream processing logic.
  • description: Natural language description of the section’s content. The LLM uses this to classify pages. Be specific about what makes this section recognizable: content type, position in document, visual characteristics.
  • partition_key: For sections that repeat with different identifiers (multiple accounts, multiple patients, multiple companies). When set, Split extracts the identifier value from the document and groups pages by that value.
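Entries can be checked locally before a request is sent. The sketch below is a hypothetical client-side helper (validate_split_description is not part of the Split API). It assumes, based on the field descriptions above, that name and description are required non-empty strings and that partition_key is an optional string.

```python
# Hypothetical helper, not part of the Split client library.
ALLOWED_FIELDS = {"name", "description", "partition_key"}


def validate_split_description(entries: list[dict]) -> list[str]:
    """Return a list of readable problems; an empty list means no issues found."""
    problems = []
    for i, entry in enumerate(entries):
        # Assumption: name and description are required, non-empty strings.
        for field in ("name", "description"):
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: '{field}' must be a non-empty string")
        # partition_key is optional, but should be a string when present.
        key = entry.get("partition_key")
        if key is not None and not isinstance(key, str):
            problems.append(f"entry {i}: 'partition_key' must be a string when set")
        # Flag misspelled or unsupported fields early.
        unknown = set(entry) - ALLOWED_FIELDS
        if unknown:
            problems.append(f"entry {i}: unexpected fields {sorted(unknown)}")
    return problems


entries = [
    {
        "name": "Account Summary",
        "description": "Overview section with balances and totals at the top of the statement",
        "partition_key": "account_number",
    }
]
print(validate_split_description(entries))
```

Returning a list of problems rather than raising on the first one lets you report every issue in a configuration at once.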

Writing Effective Descriptions

The description is passed to an LLM that classifies each page. Vague descriptions lead to ambiguous classifications.
# Vague - could match many things
{"name": "Tables", "description": "Pages with tables"}

# Specific - clear criteria for classification
{"name": "Transaction History", "description": "Table showing individual transactions with dates, descriptions, and amounts. Usually appears after the account summary section."}
Include distinguishing characteristics:
  • Content type (tables, narrative text, forms)
  • Position (beginning, end, after section X)
  • Visual elements (headers, logos, signature lines)
  • What it does NOT include (to avoid confusion with similar sections)

partition_key

Partition key handles a common scenario: the same section type repeating for different entities. A consolidated statement has holdings for multiple accounts. A medical record packet has intake forms for multiple patients. Without partition key, Split returns all matching pages as one group. You’d then need to figure out where one entity ends and the next begins. Partition key does this automatically.
split_description=[
    {
        "name": "Holdings",
        "description": "Investment holdings table for a specific account",
        "partition_key": "account number"
    }
]
The response includes partitions with extracted identifier values:
{
  "result": {
    "splits": [
      {
        "name": "Holdings",
        "pages": [1, 2, 3, 7, 8, 9, 10, 11],
        "conf": "high",
        "partitions": [
          {"name": "1234-5678", "pages": [1, 2, 3], "conf": "high"},
          {"name": "8765-4321", "pages": [7, 8, 9, 10, 11], "conf": "high"}
        ]
      }
    ]
  }
}
The name in each partition is the actual value extracted from the document. If the document shows “Account #1234-5678” on pages 1-3 and “Account #8765-4321” on pages 7-11, those become your partition names.
The partition key is semantic, not literal. If you set partition_key to “account number” but the document says “Acct #1234” or “Portfolio ID: 5678”, Split will still find it. Describe what the identifier represents, not the exact text format.
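One way to consume a partitioned response is sketched below. It treats the parsed JSON as a plain dict shaped like the example above. The helper name pages_by_partition is hypothetical and not part of the client library.

```python
from typing import Any


def pages_by_partition(response: dict[str, Any], section_name: str) -> dict[str, list[int]]:
    """Map each extracted partition value to its pages for one section."""
    for split in response["result"]["splits"]:
        if split["name"] == section_name:
            # partitions is null (None) when no partition_key was set.
            partitions = split.get("partitions") or []
            return {p["name"]: p["pages"] for p in partitions}
    return {}


# Sample data copied from the response shown above.
response = {
    "result": {
        "splits": [
            {
                "name": "Holdings",
                "pages": [1, 2, 3, 7, 8, 9, 10, 11],
                "conf": "high",
                "partitions": [
                    {"name": "1234-5678", "pages": [1, 2, 3], "conf": "high"},
                    {"name": "8765-4321", "pages": [7, 8, 9, 10, 11], "conf": "high"},
                ],
            }
        ]
    }
}

holdings = pages_by_partition(response, "Holdings")
for account, pages in holdings.items():
    print(account, pages)
```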

When partition_key values appear in tables

By default, Split truncates table content to speed up processing. If your partition key values appear deep within tables (not in headers or the first few rows), the truncation might hide them. Set table_cutoff to preserve to keep full table content:
result = client.split.run(
    input=upload.file_id,
    split_description=[
        {
            "name": "Holdings",
            "description": "Investment holdings table",
            "partition_key": "account_number"
        }
    ],
    settings={"table_cutoff": "preserve"}
)
This increases processing time but ensures partition keys aren’t missed.

split_rules

Controls how pages are assigned to sections. The default rule:
"Split the document into the applicable sections. Sections may only overlap at their first and last page if at all."
This means a page can belong to multiple sections only at boundaries. Page 5 can belong to both “Section A” and “Section B” only if it’s the last page of A and the first page of B.
Customize for your use case:
# Allow full overlap (page can belong to multiple sections anywhere)
split_rules="Pages can belong to multiple sections. A page with both summary data and transaction data should appear in both sections."

# Force exclusive classification (each page belongs to exactly one section)
split_rules="Each page must belong to exactly one section. Assign to the most relevant section."

# Document-specific logic
split_rules="The cover page (page 1) should not be assigned to any section. Start section detection from page 2."
The string is passed directly to the LLM as instructions. Write it as you would write instructions for a person doing the classification.
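Because the rule is a natural language instruction to an LLM, it can be worth verifying the result on your side. The sketch below is a hypothetical client-side check, not something Split runs. It tests whether a set of splits satisfies the default boundary rule: a shared page is accepted only if it is the last page of one section and the first page of the other.

```python
from itertools import combinations


def boundary_overlap_violations(splits: list[dict]) -> list[tuple[str, str, int]]:
    """Return (section_a, section_b, page) for every overlap that is not at a boundary."""
    violations = []
    for a, b in combinations(splits, 2):
        pages_a, pages_b = a["pages"], b["pages"]
        if not pages_a or not pages_b:
            continue  # a section that was not found cannot overlap anything
        for page in sorted(set(pages_a) & set(pages_b)):
            # Allowed: last page of one section is the first page of the other.
            a_then_b = page == max(pages_a) and page == min(pages_b)
            b_then_a = page == max(pages_b) and page == min(pages_a)
            if not (a_then_b or b_then_a):
                violations.append((a["name"], b["name"], page))
    return violations


ok = [
    {"name": "Section A", "pages": [1, 2, 3, 4, 5]},
    {"name": "Section B", "pages": [5, 6, 7, 8]},
]
bad = [
    {"name": "Section A", "pages": [1, 2, 3, 4, 5]},
    {"name": "Section C", "pages": [3, 6]},
]
print(boundary_overlap_violations(ok))
print(boundary_overlap_violations(bad))
```

If you pass a custom split_rules that allows full overlap, a check like this no longer applies.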

settings

table_cutoff

Controls how table content is processed during section detection.
settings={"table_cutoff": "truncate"}  # Default
settings={"table_cutoff": "preserve"}
  • truncate (default): Tables are shortened to the first few rows. Faster processing. Works for most cases where section identifiers appear in headers, titles, or surrounding text.
  • preserve: Full table content is retained. Required when partition_key values or section identifiers appear deep within table rows. Slower but more thorough.

parsing

Split runs Parse internally before classifying sections. The parsing parameter accepts all Parse configuration options.
result = client.split.run(
    input=upload.file_id,
    split_description=[...],
    parsing={
        "settings": {
            "page_range": {"start": 1, "end": 50},
            "ocr_system": "standard"
        },
        "enhance": {
            "agentic": [{"scope": "table"}]
        }
    }
)
If you pass jobid:// as input (reusing a previous Parse result), the parsing options are ignored since the document was already parsed.

Response Structure

{
  "result": {
    "splits": [
      {
        "name": "Section Name",
        "pages": [1, 2, 3],
        "conf": "high",
        "partitions": null
      }
    ],
    "section_mapping": {
      "Section Name": [1, 2, 3]
    }
  },
  "usage": {
    "num_pages": 10,
    "credits": 20.0
  }
}
  • splits: Array of found sections, one per entry in split_description.
  • splits[].name: The name you provided.
  • splits[].pages: Page numbers where this section appears (1-indexed).
  • splits[].conf: "high" or "low" indicating classification confidence.
  • splits[].partitions: When using partition_key, sub-sections grouped by extracted identifier values. Each partition has its own name (the extracted value), pages, and conf.
  • section_mapping: Legacy format mapping section names to page arrays. Use splits for new code.
A section not found in the document still appears in results with an empty pages array. Always check that pages has content before processing.
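The empty-pages check and the conf field can be handled in one pass. This is a minimal sketch, assuming the response has been parsed into a plain dict with the structure shown above. The function name triage_splits is hypothetical, and how low-confidence sections are treated is left to your application.

```python
def triage_splits(response: dict) -> tuple[dict[str, list[int]], list[str], list[str]]:
    """Separate found sections from missing ones and note low-confidence results."""
    found: dict[str, list[int]] = {}
    missing: list[str] = []
    low_conf: list[str] = []
    for split in response["result"]["splits"]:
        if not split["pages"]:
            # Section was requested but not found in the document.
            missing.append(split["name"])
            continue
        if split["conf"] == "low":
            # Keep the pages, but flag the section for review.
            low_conf.append(split["name"])
        found[split["name"]] = split["pages"]
    return found, missing, low_conf


# Made-up sample data for illustration only.
sample = {
    "result": {
        "splits": [
            {"name": "Account Summary", "pages": [1, 2, 3], "conf": "high", "partitions": None},
            {"name": "Disclosures", "pages": [], "conf": "high", "partitions": None},
            {"name": "Transaction History", "pages": [4], "conf": "low", "partitions": None},
        ]
    }
}

found, missing, low_conf = triage_splits(sample)
print(found, missing, low_conf)
```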