Split identifies which pages contain which sections of a document. You describe sections in natural language, and Split returns the page numbers where each section lives. Under the hood, Split runs Parse to understand the document, then uses an LLM to classify pages against your descriptions.
Split is not chunking. Split returns page numbers telling you where sections live. Chunking (configured via Parse) breaks content into smaller pieces for embeddings or retrieval. They solve different problems: Split identifies locations, chunking divides content.
(Diagram: Split endpoint workflow)

Quick Start

from pathlib import Path
from reducto import Reducto

client = Reducto()
upload = client.upload(file=Path("financial_report.pdf"))

result = client.split.run(
    input=upload.file_id,
    split_description=[
        {
            "name": "Executive Summary",
            "description": "High-level overview and key findings at the beginning of the report"
        },
        {
            "name": "Financial Statements",
            "description": "Balance sheet, income statement, and cash flow tables"
        },
        {
            "name": "Risk Factors",
            "description": "Section discussing business risks and uncertainties"
        }
    ]
)

for split in result.result.splits:
    print(f"{split.name}: pages {split.pages}")
This request asks Split to find three sections in a financial report. Split returns the page numbers where each section appears, along with a confidence score for each match.

Sample Response

{
  "result": {
    "splits": [
      {"name": "Executive Summary", "pages": [1, 2], "conf": "high", "partitions": null},
      {"name": "Financial Statements", "pages": [15, 16, 17, 18], "conf": "high", "partitions": null},
      {"name": "Risk Factors", "pages": [8, 9, 10, 11, 12], "conf": "high", "partitions": null}
    ],
    "section_mapping": {
      "Executive Summary": [1, 2],
      "Financial Statements": [15, 16, 17, 18],
      "Risk Factors": [8, 9, 10, 11, 12]
    }
  },
  "usage": {"num_pages": 25, "credits": 50.0}
}
Pages are 1-indexed, meaning the first page is page 1, not 0.
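Because the response uses 1-indexed page numbers while Python sequences are 0-indexed, it's easy to be off by one when pulling a section's content out of a parsed page list. A minimal sketch of the conversion (`all_pages` is a stand-in for whatever per-page structure your Parse result gives you):

```python
def section_pages(all_pages, split_pages):
    """Select a section's entries from a 0-indexed list of pages,
    given the 1-indexed page numbers returned by Split."""
    return [all_pages[p - 1] for p in split_pages]

pages = ["page one text", "page two text", "page three text"]
print(section_pages(pages, [1, 3]))  # → ['page one text', 'page three text']
```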

Two Ways to Split

Split handles two fundamentally different scenarios.

Scenario 1: Different Sections Need Different Treatment

Your document contains distinct sections that each need their own extraction schema or processing logic. A financial report has an executive summary (extract key metrics), financial tables (extract line items), and risk disclosures (extract risk categories). These are different types of content requiring different approaches. For this, you define multiple entries in split_description, each describing a different section:
result = client.split.run(
    input=upload.file_id,
    split_description=[
        {
            "name": "Account Summary",
            "description": "Overview section with account balances and totals"
        },
        {
            "name": "Transaction History",
            "description": "Table listing individual transactions with dates and amounts"
        },
        {
            "name": "Disclosures",
            "description": "Legal disclosures and terms at the end of the statement"
        }
    ]
)

# Route each section to appropriate processing.
# parse_job_id is assumed to come from an earlier client.parse.run call
# on the same document (see "Connecting Split to Parse and Extract" below).
for split in result.result.splits:
    if split.name == "Transaction History":
        transactions = client.extract.run(
            input=f"jobid://{parse_job_id}",
            instructions={"schema": transaction_schema},
            settings={"array_extract": True},
            parsing={"settings": {"page_range": {"start": split.pages[0], "end": split.pages[-1]}}}
        )

Scenario 2: Repeating Sections with Unknown Count

Your document contains the same type of section repeated multiple times, but you don’t know in advance how many. A consolidated financial statement might have holdings for 3 accounts or 30. A medical records packet might contain intake forms for 5 patients or 50.

This is where partition_key becomes essential. Without a partition key, Split returns all pages containing “account holdings” as a single group. You’d then need to figure out where one account ends and the next begins. The partition key tells Split to look for a specific identifier within each section and group the pages by that identifier.
result = client.split.run(
    input=upload.file_id,
    split_description=[
        {
            "name": "Account Holdings",
            "description": "Investment holdings table for a specific account",
            "partition_key": "account_number"
        }
    ]
)
The response now includes a partitions array that breaks down the section by the values Split found in the document:
{
  "result": {
    "splits": [
      {
        "name": "Account Holdings",
        "pages": [1, 2, 3, 7, 8, 9, 10, 11],
        "conf": "high",
        "partitions": [
          {"name": "1234-5678", "pages": [1, 2, 3], "conf": "high"},
          {"name": "8765-4321", "pages": [7, 8, 9, 10, 11], "conf": "high"}
        ]
      }
    ],
    "section_mapping": {
      "Account Holdings 1234-5678": [1, 2, 3],
      "Account Holdings 8765-4321": [7, 8, 9, 10, 11]
    }
  }
}
The name in each partition is the actual value Split extracted from the document. If the document shows “Account #1234-5678” on pages 1-3 and “Account #8765-4321” on pages 7-11, those become your partition names.
The partition key describes what to look for semantically, not an exact string to match. If you set partition_key to “account number” but the document says “Acct #1234”, Split will still find it.
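With partitions in hand, a natural next step is to turn each one into a page range you can feed to a per-account Extract call. A sketch over the response shape shown above, using plain dicts as stand-ins for the SDK's response objects:

```python
# Sample data mirroring the partition response above.
splits = [
    {
        "name": "Account Holdings",
        "pages": [1, 2, 3, 7, 8, 9, 10, 11],
        "conf": "high",
        "partitions": [
            {"name": "1234-5678", "pages": [1, 2, 3], "conf": "high"},
            {"name": "8765-4321", "pages": [7, 8, 9, 10, 11], "conf": "high"},
        ],
    }
]

def partition_page_ranges(splits):
    """Map each partition value to a {start, end} page range,
    suitable for parsing.settings.page_range in an Extract call."""
    ranges = {}
    for split in splits:
        for part in split.get("partitions") or []:
            ranges[part["name"]] = {
                "start": part["pages"][0],
                "end": part["pages"][-1],
            }
    return ranges

print(partition_page_ranges(splits))
# → {'1234-5678': {'start': 1, 'end': 3}, '8765-4321': {'start': 7, 'end': 11}}
```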

Connecting Split to Parse and Extract

Split is rarely used in isolation. The typical workflow is Parse → Split → Extract, where each step builds on the previous one. You can reuse a Parse result across multiple Split and Extract calls by passing the job ID. Since Parse is often the slowest step, this saves significant time and credits.
# Step 1: Parse the document once
parse_result = client.parse.run(input=upload.file_id)
job_id = parse_result.job_id

# Step 2: Split using the job ID (no re-parsing)
split_result = client.split.run(
    input=f"jobid://{job_id}",
    split_description=[
        {"name": "Summary", "description": "Account summary with balances"},
        {"name": "Transactions", "description": "Transaction history table"}
    ]
)

# Step 3: Extract from each section with the appropriate schema
summary_schema = {
    "type": "object",
    "properties": {
        "account_number": {"type": "string"},
        "current_balance": {"type": "number"},
        "available_balance": {"type": "number"}
    }
}

transaction_schema = {
    "type": "object",
    "properties": {
        "transactions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "date": {"type": "string"},
                    "description": {"type": "string"},
                    "amount": {"type": "number"}
                }
            }
        }
    }
}

for split in split_result.result.splits:
    schema = summary_schema if split.name == "Summary" else transaction_schema
    
    extract_result = client.extract.run(
        input=f"jobid://{job_id}",
        instructions={"schema": schema},
        parsing={"settings": {"page_range": {"start": split.pages[0], "end": split.pages[-1]}}}
    )
    
    print(f"{split.name}: {extract_result.result}")
When you pass jobid:// as input, the parsing step is skipped entirely. Any parsing options you include won’t re-parse the document; they only affect how the already-parsed content is filtered (like limiting which pages to consider for extraction).

Request Parameters

input (required)

The document to process. Accepts:
| Format | Example | Description |
| --- | --- | --- |
| Upload response | upload.file_id or "reducto://abc123" | File uploaded via /upload |
| Public URL | "https://example.com/doc.pdf" | Publicly accessible document |
| Presigned URL | "https://bucket.s3.../doc.pdf?X-Amz-..." | Cloud storage with temporary credentials |
| Job ID | "jobid://7600c8c5-..." | Reuse a previous Parse result |

split_description (required)

An array defining the sections to find. Each entry has:
| Field | Required | Description |
| --- | --- | --- |
| name | Yes | Identifier for this section in the response |
| description | Yes | Natural language description of what the section contains |
| partition_key | No | Identifier to look for when a section repeats (e.g., “account number”, “patient ID”) |
Write descriptions that match how the content actually appears in the document. If the section has visual characteristics (“blue header”, “signature line at bottom”), mention them.

split_rules

A prompt that controls how Split handles page classification. The default is:
"Split the document into the applicable sections. Sections may only overlap at their first and last page if at all."
This default means a page can only belong to multiple sections if it’s at the boundary between them. Page 5 can belong to both “Section A” and “Section B” only if it’s the last page of A and the first page of B. You can customize this behavior for your use case:
# Allow full overlap when content genuinely spans multiple categories
result = client.split.run(
    input=upload.file_id,
    split_description=[...],
    split_rules="Pages can belong to multiple sections. A page with both summary information and transaction data should be included in both sections."
)

# Force exclusive classification
result = client.split.run(
    input=upload.file_id,
    split_description=[...],
    split_rules="Each page must belong to exactly one section. Choose the most relevant section for each page."
)
The split_rules string is passed directly to the LLM as instructions, so write it as you would write instructions for a person doing the classification.

parsing

Configuration for how the document is parsed. These options are inherited from Parse and are ignored if your input is a jobid:// reference (since the document was already parsed).
result = client.split.run(
    input=upload.file_id,
    split_description=[...],
    parsing={
        "settings": {
            "page_range": {"start": 1, "end": 50}  # Only analyze first 50 pages
        }
    }
)

settings

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| table_cutoff | "truncate" or "preserve" | "truncate" | How to handle table content when classifying pages |
When analyzing tables, Split truncates them by default to improve speed. This works fine for most cases, but if your partition_key values appear deep within tables (row 50 of a 200-row table), you need the full content:
result = client.split.run(
    input=upload.file_id,
    split_description=[
        {
            "name": "Holdings",
            "description": "Investment holdings table",
            "partition_key": "account_number"
        }
    ],
    settings={"table_cutoff": "preserve"}
)
The tradeoff is latency. Preserving tables means more content for the LLM to process.

Sample Response

{
  "result": {
    "splits": [
      {
        "name": "Section Name",
        "pages": [1, 2, 3],
        "conf": "high",
        "partitions": null
      }
    ],
    "section_mapping": {
      "Section Name": [1, 2, 3]
    }
  },
  "usage": {
    "num_pages": 10,
    "credits": 20.0
  }
}
| Field | Type | Description |
| --- | --- | --- |
| result.splits | array | Array of found sections, one per entry in your split_description |
| result.splits[].name | string | The name you provided |
| result.splits[].pages | array | Page numbers where this section appears (1-indexed) |
| result.splits[].conf | string | Either "high" or "low" indicating match confidence |
| result.splits[].partitions | array or null | When using partition_key, sub-sections with their own names, pages, and confidence |
| result.section_mapping | object | Legacy format mapping section names to page arrays. Use splits for new code. |
| usage.num_pages | number | Total pages in the document |
| usage.credits | number | Credits consumed (2 per page, plus Parse credits if not using jobid://) |
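The credit math is simple enough to check against the sample responses above (2 credits per page; Parse credits are charged separately when the input is not a jobid:// reference):

```python
def estimate_split_credits(num_pages):
    """Split's own cost: 2 credits per page. Parse credits, if any,
    are additional and not included here."""
    return 2.0 * num_pages

print(estimate_split_credits(25))  # → 50.0, matching the first sample response
```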
A section that isn’t found still appears in the response with an empty pages array. Always check that pages has content before processing:
for split in result.result.splits:
    if not split.pages:
        print(f"Warning: {split.name} not found in document")
        continue
    # Process the section

Async Processing

For large documents or batch processing, use the async pattern to avoid timeouts:
import time

submission = client.split.run_job(
    input=upload.file_id,
    split_description=[...]
)

while True:
    job = client.job.get(submission.job_id)
    if job.status == "Completed":
        break
    if job.status == "Failed":
        raise Exception(f"Split failed: {job.reason}")
    time.sleep(2)

for split in job.result.splits:
    print(f"{split.name}: {split.pages}")
Documents over 100 pages should use async to avoid HTTP timeouts. The .run_job() method accepts the same parameters as .run().
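The polling loop above can be wrapped in a small helper that adds an overall timeout, so a stuck job doesn't poll forever. A sketch (wait_for_job is a hypothetical helper, not part of the SDK):

```python
import time

def wait_for_job(client, job_id, timeout_s=600, poll_s=2.0):
    """Poll client.job.get until the job completes, fails, or the
    timeout elapses. Mirrors the polling loop shown above."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = client.job.get(job_id)
        if job.status == "Completed":
            return job
        if job.status == "Failed":
            raise RuntimeError(f"Split failed: {job.reason}")
        time.sleep(poll_s)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout_s}s")
```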

Troubleshooting

If you’re unsure of the document structure, start with broad, generic descriptions:
result = client.split.run(
    input=upload.file_id,
    split_description=[
        {"name": "Introduction", "description": "Opening sections, executive summary, or overview"},
        {"name": "Main Content", "description": "Core content, analysis, or detailed information"},
        {"name": "Tables/Data", "description": "Tables, figures, numerical data, or structured information"},
        {"name": "Appendix", "description": "Supporting materials, references, or supplementary content"}
    ]
)
Alternatively, Parse the document first and inspect the content to understand its structure, then create targeted split descriptions.
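One lightweight way to do that inspection is to print the first line of each page and look for headings. A sketch over per-page text (the exact field names in a Parse result depend on your settings, so page_texts below is a stand-in):

```python
def page_previews(page_texts, width=60):
    """Return (page_number, first_line) pairs for a quick look at
    where sections start. Page numbers are 1-indexed to match Split."""
    previews = []
    for num, text in enumerate(page_texts, start=1):
        lines = text.strip().splitlines()
        previews.append((num, lines[0][:width] if lines else ""))
    return previews

for num, line in page_previews(["Executive Summary\n...", "Risk Factors\n..."]):
    print(f"p{num}: {line}")
```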
When a section returns with no pages:
  1. Check your description. Is it specific enough? “Transaction table” might not match if the document calls it “Activity History”. Include terms that appear in the actual document.
  2. Verify the section exists. Run Parse first and inspect the content. If the section isn’t visible to Parse, Split won’t find it either.
  3. Broaden your description. Start general (“any table with dates and amounts”) and narrow down once you confirm Split can find it.
When partitions is null despite setting partition_key:
  1. Check table_cutoff. If the partition key appears inside tables, set settings.table_cutoff to "preserve". The default truncation might be hiding the values.
  2. Verify the key exists. The partition key value must actually appear in the document. If you’re looking for “account number” but the document uses “portfolio ID”, adjust your partition key.
  3. Check for consistent structure. Partition detection works best when repeating sections have similar layouts. Inconsistent formatting can confuse the classifier.
When Split returns pages that don’t contain the expected content:
  1. Make descriptions more specific. If multiple sections have similar content, add distinguishing details: “the transaction table in the Account Activity section” rather than just “transaction table”.
  2. Check confidence scores. Low confidence suggests the match was ambiguous. The LLM made its best guess but wasn’t certain.
  3. Adjust split_rules. The default overlap rules might be affecting page assignment. If a page legitimately belongs to multiple sections, customize split_rules to allow it.
Split can timeout on documents over 100 pages:
  1. Use async processing. Replace .run() with .run_job() and poll for results.
  2. Parse first, then split. If you’re not already using jobid://, parse the document separately and pass the job ID. This isolates the slow parsing step.
  3. Limit page range. If you know the sections you need are in a specific range, set parsing.settings.page_range to process only those pages.

Next Steps