Split is not chunking. Split returns page numbers telling you where sections live. Chunking (configured via Parse) breaks content into smaller pieces for embeddings or retrieval. They solve different problems: Split identifies locations, chunking divides content.
Quick Start
Sample Response
Two Ways to Split
Split handles two fundamentally different scenarios.
Scenario 1: Different Sections Need Different Treatment
Your document contains distinct sections that each need their own extraction schema or processing logic. A financial report has an executive summary (extract key metrics), financial tables (extract line items), and risk disclosures (extract risk categories). These are different types of content requiring different approaches. For this, you define multiple entries in split_description, each describing a different section.
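As a rough sketch, the financial report example could be described with three entries. Only the name and description fields come from this guide; the section names and the description wording below are illustrative assumptions.

```python
# Illustrative split_description for the financial report example.
# The names and descriptions are made up for this sketch.
split_description = [
    {
        "name": "executive_summary",
        "description": "Executive summary with key financial metrics",
    },
    {
        "name": "financial_tables",
        "description": "Tables of financial line items",
    },
    {
        "name": "risk_disclosures",
        "description": "Risk disclosures describing categories of risk",
    },
]

# One entry per section type, so each can later get its own extraction schema.
section_names = [entry["name"] for entry in split_description]
print(section_names)
```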
Scenario 2: Repeating Sections with Unknown Count
Your document contains the same type of section repeated multiple times, but you don’t know in advance how many. A consolidated financial statement might have holdings for 3 accounts or 30. A medical records packet might contain intake forms for 5 patients or 50. This is where partition_key becomes essential.
Without a partition key, Split returns all pages containing “account holdings” as a single group. You’d then need to figure out where one account ends and the next begins. The partition key tells Split to look for a specific identifier within each section and group the pages by that identifier.
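A minimal sketch of a single repeating section with a partition key follows. The section name and description are assumptions for illustration; "account number" is one of the example partition keys mentioned in this guide.

```python
# One entry describes the repeating section; partition_key names the
# identifier Split should look for to tell the repeats apart.
split_description = [
    {
        "name": "account_holdings",  # hypothetical section name
        "description": "Holdings for an individual account",
        "partition_key": "account number",
    }
]

print(split_description[0]["partition_key"])
```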
With a partition key set, the response includes a partitions array that breaks down the section by the values Split found in the document.
The name in each partition is the actual value Split extracted from the document. If the document shows “Account #1234-5678” on pages 1-3 and “Account #8765-4321” on pages 7-11, those become your partition names.
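To make the partition structure concrete, here is a small sketch that maps each partition name to its pages. The sample data is hypothetical and mirrors the account example above; it assumes each partition carries name, pages, and conf fields as described in the response table.

```python
# Hypothetical split entry, shaped like one item of result.splits.
sample_split = {
    "name": "account_holdings",
    "pages": [1, 2, 3, 7, 8, 9, 10, 11],
    "conf": "high",
    "partitions": [
        {"name": "Account #1234-5678", "pages": [1, 2, 3], "conf": "high"},
        {"name": "Account #8765-4321", "pages": [7, 8, 9, 10, 11], "conf": "high"},
    ],
}


def pages_by_partition(split):
    """Map each partition name to its page list; empty if partitions is null."""
    partitions = split.get("partitions") or []
    return {p["name"]: p["pages"] for p in partitions}


print(pages_by_partition(sample_split))
```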
Connecting Split to Parse and Extract
Split is rarely used in isolation. The typical workflow is Parse → Split → Extract, where each step builds on the previous one. You can reuse a Parse result across multiple Split and Extract calls by passing the job ID. Since Parse is often the slowest step, this saves significant time and credits.
When you pass jobid:// as input, the parsing step is skipped entirely. Any parsing options you include won’t re-parse the document; they only affect how the already-parsed content is filtered (for example, limiting which pages to consider for extraction).
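The reuse pattern can be sketched as building one jobid:// reference and passing it to each later request. The job ID below is a placeholder, the helper name is hypothetical, and the requests are shown as plain dictionaries rather than actual client calls.

```python
def jobid_input(parse_job_id: str) -> str:
    """Build the jobid:// reference that points at an earlier Parse result."""
    return f"jobid://{parse_job_id}"


# Placeholder ID; in practice this comes from a previous Parse call.
parse_job_id = "00000000-0000-0000-0000-000000000000"
shared_input = jobid_input(parse_job_id)

# The same parsed document feeds both steps, so it is parsed only once.
split_request = {
    "input": shared_input,
    "split_description": [
        {"name": "financial_tables", "description": "Tables of financial line items"},
    ],
}
extract_request = {
    "input": shared_input,
    # Extract-specific options are omitted from this sketch.
}

print(split_request["input"])
```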
Request Parameters
input (required)
The document to process. Accepts:

| Format | Example | Description |
|---|---|---|
| Upload response | upload.file_id or "reducto://abc123" | File uploaded via /upload |
| Public URL | "https://example.com/doc.pdf" | Publicly accessible document |
| Presigned URL | "https://bucket.s3.../doc.pdf?X-Amz-..." | Cloud storage with temporary credentials |
| Job ID | "jobid://7600c8c5-..." | Reuse a previous Parse result |
split_description (required)
An array defining the sections to find. Each entry has:

| Field | Required | Description |
|---|---|---|
| name | Yes | Identifier for this section in the response |
| description | Yes | Natural language description of what the section contains |
| partition_key | No | Identifier to look for when a section repeats (e.g., “account number”, “patient ID”) |
split_rules
A prompt that controls how Split handles page classification. A default prompt applies when you omit it. The split_rules string is passed directly to the LLM as instructions, so write it as you would write instructions for a person doing the classification.
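As an illustration only, a custom split_rules value might be written as plain instructions like the ones below. The wording is an assumption made up for this sketch; it is not the default prompt.

```python
# Hypothetical custom rules, phrased as instructions to a human classifier.
custom_rules = (
    "Assign each page to every section whose description it matches. "
    "A page may belong to more than one section. "
    "Leave a page unassigned if it matches no description."
)

request = {
    "input": "https://example.com/doc.pdf",
    "split_description": [
        {"name": "summary", "description": "Executive summary"},
        {"name": "financial_tables", "description": "Tables of financial line items"},
    ],
    "split_rules": custom_rules,
}

print(request["split_rules"])
```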
parsing
Configuration for how the document is parsed. These options are inherited from Parse and are ignored if your input is a jobid:// reference (since the document was already parsed).
settings
| Field | Values | Default | Description |
|---|---|---|---|
| table_cutoff | "truncate" or "preserve" | "truncate" | How to handle table content when classifying pages |
If partition_key values appear deep within tables (for example, row 50 of a 200-row table), you need the full content, so set table_cutoff to "preserve".
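A request that keeps full table content might be sketched as follows. The settings.table_cutoff path follows this guide; the input URL, section name, and description are placeholders chosen for the example.

```python
request = {
    "input": "https://example.com/doc.pdf",
    "split_description": [
        {
            "name": "account_holdings",  # hypothetical section name
            "description": "Holdings tables for an individual account",
            "partition_key": "account number",
        }
    ],
    # Keep whole tables so partition key values in late rows stay visible.
    "settings": {"table_cutoff": "preserve"},
}

print(request["settings"])
```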
Sample Response
| Field | Type | Description |
|---|---|---|
| result.splits | array | Array of found sections, one per entry in your split_description |
| result.splits[].name | string | The name you provided |
| result.splits[].pages | array | Page numbers where this section appears (1-indexed) |
| result.splits[].conf | string | Either "high" or "low" indicating match confidence |
| result.splits[].partitions | array or null | When using partition_key, sub-sections with their own names, pages, and confidence |
| result.section_mapping | object | Legacy format mapping section names to page arrays. Use splits for new code. |
| usage.num_pages | number | Total pages in the document |
| usage.credits | number | Credits consumed (2 per page, plus Parse credits if not using jobid://) |
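The Split portion of the credit cost is simple arithmetic, sketched below with a hypothetical helper. Parse credits are left out because their rate is not stated here.

```python
SPLIT_CREDITS_PER_PAGE = 2  # rate from the usage.credits description


def split_credits(num_pages: int) -> int:
    """Credits for the Split step alone (excludes any Parse credits)."""
    return SPLIT_CREDITS_PER_PAGE * num_pages


# A 40-page document passed in via jobid:// costs 2 * 40 Split credits.
print(split_credits(40))
```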
A section that Split cannot find comes back with an empty pages array, so check that pages has content before processing it.
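One way to guard against empty sections is sketched below. The response dictionary is hypothetical sample data shaped like the fields in the table above, and the helper name is an assumption.

```python
# Hypothetical response with one matched section and one empty section.
response = {
    "result": {
        "splits": [
            {"name": "summary", "pages": [1, 2], "conf": "high", "partitions": None},
            {"name": "risk_disclosures", "pages": [], "conf": "low", "partitions": None},
        ]
    },
    "usage": {"num_pages": 12, "credits": 24},
}


def found_and_missing(response):
    """Separate sections that matched pages from sections that came back empty."""
    found, missing = {}, []
    for split in response["result"]["splits"]:
        if split["pages"]:
            found[split["name"]] = split["pages"]
        else:
            missing.append(split["name"])
    return found, missing


found, missing = found_and_missing(response)
print(found)
print(missing)
```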
Async Processing
For large documents or batch processing, use the async pattern to avoid timeouts. The .run_job() method accepts the same parameters as .run().
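The polling half of the async pattern can be sketched generically. Everything here is an assumption for illustration: the fetch_status callable stands in for whatever call retrieves a job by ID, and the status strings "Pending", "Completed", and "Failed" are placeholders rather than confirmed values.

```python
import time


def wait_for_job(fetch_status, job_id, poll_seconds=2.0,
                 timeout_seconds=600.0, sleep=time.sleep):
    """Poll fetch_status(job_id) until the job finishes or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_status(job_id)
        # Assumed terminal states; adjust to the values your client returns.
        if job["status"] in ("Completed", "Failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in time")
        sleep(poll_seconds)


# Usage with a fake status source that completes on the third poll.
fake_states = iter([
    {"status": "Pending"},
    {"status": "Pending"},
    {"status": "Completed", "result": {"splits": []}},
])
job = wait_for_job(lambda job_id: next(fake_states), "job-1",
                   sleep=lambda seconds: None)
print(job["status"])
```

Injecting fetch_status and sleep keeps the loop independent of any particular client and makes it easy to exercise without network access.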
Troubleshooting
I don't know what sections my document has
If you’re unsure of the document structure, start with broad, generic descriptions.
Alternatively, Parse the document first and inspect the content to understand its structure, then create targeted split descriptions.
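For the broad-description approach, a first pass might look like the sketch below. The three categories and their wording are assumptions chosen for illustration; adjust them once you see which pages come back.

```python
# Deliberately generic first-pass descriptions (hypothetical wording).
split_description = [
    {"name": "tables", "description": "Any pages that mostly contain tables"},
    {"name": "narrative", "description": "Any pages that mostly contain paragraphs of text"},
    {"name": "forms", "description": "Any pages that contain forms or fillable fields"},
]

for entry in split_description:
    print(entry["name"], "->", entry["description"])
```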
Section not found (empty pages array)
When a section returns with no pages:
- Check your description. Is it specific enough? “Transaction table” might not match if the document calls it “Activity History”. Include terms that appear in the actual document.
- Verify the section exists. Run Parse first and inspect the content. If the section isn’t visible to Parse, Split won’t find it either.
- Broaden your description. Start general (“any table with dates and amounts”) and narrow down once you confirm Split can find it.
Partitions not being detected
When partitions is null despite setting partition_key:
- Check table_cutoff. If the partition key appears inside tables, set settings.table_cutoff to "preserve". The default truncation might be hiding the values.
- Verify the key exists. The partition key value must actually appear in the document. If you’re looking for “account number” but the document uses “portfolio ID”, adjust your partition key.
- Check for consistent structure. Partition detection works best when repeating sections have similar layouts. Inconsistent formatting can confuse the classifier.
Wrong pages returned
When Split returns pages that don’t contain the expected content:
- Make descriptions more specific. If multiple sections have similar content, add distinguishing details: “the transaction table in the Account Activity section” rather than just “transaction table”.
- Check the confidence value. A conf of "low" suggests the match was ambiguous: the LLM made its best guess but wasn’t certain.
- Adjust split_rules. The default overlap rules might be affecting page assignment. If a page legitimately belongs to multiple sections, customize split_rules to allow it.
Request timeout on large documents
Split can time out on documents over 100 pages:
- Use async processing. Replace .run() with .run_job() and poll for results.
- Parse first, then split. If you’re not already using jobid://, parse the document separately and pass the job ID. This isolates the slow parsing step.
- Limit page range. If you know the sections you need are in a specific range, set parsing.settings.page_range to process only those pages.