Additional Document Data

When you parse a document, Reducto extracts the main text content by default. But documents often contain additional information layered on top: revision marks from Track Changes, margin comments, highlighted passages, hyperlinks, and signatures. The formatting.include option lets you extract these.

result = client.parse.run(
    input=upload.file_id,
    formatting={
        "include": ["change_tracking", "comments", "highlight", "hyperlinks", "signatures"]
    }
)

By default, none of these are extracted. Enable only what you need, since each adds processing overhead.
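Because the option names are plain strings (note the singular "highlight"), a typo is easy to miss. A minimal sketch of a helper that builds the formatting value from only the options you ask for; build_formatting and KNOWN_OPTIONS are hypothetical names that are not part of the SDK, and the list of valid names is taken solely from the example above.

```python
# Option names copied from the example above; this list is an assumption,
# not an authoritative enumeration from the SDK.
KNOWN_OPTIONS = ("change_tracking", "comments", "highlight", "hyperlinks", "signatures")


def build_formatting(*options: str) -> dict:
    """Return a formatting dict, rejecting option names not listed above.

    Hypothetical helper for illustration only.
    """
    unknown = [name for name in options if name not in KNOWN_OPTIONS]
    if unknown:
        raise ValueError(f"Unknown formatting options: {unknown}")
    # Keep the caller's order and drop duplicates.
    return {"include": list(dict.fromkeys(options))}


formatting = build_formatting("comments", "hyperlinks")
print(formatting)
```

The resulting dict can be passed as the formatting argument shown earlier.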

These formatting options are available in the Python SDK, Node.js SDK, and via cURL. The Go SDK has limited support: only enable_underlines (for change tracking) is currently available.

Change Tracking

Legal documents, contracts, and collaborative drafts often use underlines and strikethroughs to show what changed between versions. Reducto can detect these and wrap them in HTML tags so you can programmatically identify revisions.

formatting={"include": ["change_tracking"]}

When enabled, underlined and struck-through text appears with markup:

The agreement shall commence on <change><s>January 1, 2024</s> <u>February 15, 2024</u></change>.

The <s> tag marks strikethrough (typically deletions), <u> marks underlines (typically insertions), and <change> wraps the entire revision region.

How it works: For digital PDFs and Word documents, Reducto reads the embedded formatting information. For scanned documents, it uses a segmentation model to visually detect underlines and strikethroughs on the page image.

Common uses:

Contract review: automatically extract what changed between versions
Compliance: track modifications to policies and procedures
Editorial workflows: preserve editor suggestions in parsed output
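For contract review, the markup can be turned into structured revisions with a few regular expressions. This is a minimal sketch that assumes the tags appear exactly as in the example above (no attributes, no nested <change> regions); extract_revisions is a hypothetical helper, not part of the SDK.

```python
import re

# Assumes plain tags without attributes, as in the sample output above.
CHANGE_RE = re.compile(r"<change>(.*?)</change>", re.DOTALL)
DELETED_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
INSERTED_RE = re.compile(r"<u>(.*?)</u>", re.DOTALL)


def extract_revisions(text: str) -> list:
    """Return one dict per <change> region with its deleted and inserted text."""
    revisions = []
    for match in CHANGE_RE.finditer(text):
        region = match.group(1)
        revisions.append({
            "deleted": DELETED_RE.findall(region),    # struck-through text
            "inserted": INSERTED_RE.findall(region),  # underlined text
        })
    return revisions


sample = (
    "The agreement shall commence on "
    "<change><s>January 1, 2024</s> <u>February 15, 2024</u></change>."
)
print(extract_revisions(sample))
# [{'deleted': ['January 1, 2024'], 'inserted': ['February 15, 2024']}]
```

Treating <s> as a deletion and <u> as an insertion follows the "typically" wording above; documents that use underlines for emphasis would need a different interpretation.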

Comments

PDF sticky notes, Word margin comments, and Excel cell notes contain reviewer feedback, questions, and instructions that are separate from the document content itself. Reducto extracts these as distinct blocks.

formatting={"include": ["comments"]}

Each comment becomes its own block with the comment text and its position on the page:

{
  "type": "Comment",
  "content": "Verify this figure with the finance team before publishing",
  "bbox": {"left": 0.85, "top": 0.15, "width": 0.1, "height": 0.05, "page": 1}
}

The bounding box tells you where the comment annotation appeared (normalized to [0, 1] relative to page size). This lets you correlate comments with nearby content if needed.
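One way to correlate a comment with nearby content is to pick the non-comment block on the same page whose center is closest to the comment's center. The sketch below assumes that every block is a dict with the type and bbox fields shown above; nearest_block is a hypothetical helper, and the sample blocks are made up for illustration.

```python
import math
from typing import Optional


def center(bbox: dict) -> tuple:
    """Center point of a normalized bounding box."""
    return (bbox["left"] + bbox["width"] / 2, bbox["top"] + bbox["height"] / 2)


def nearest_block(comment: dict, blocks: list) -> Optional[dict]:
    """Return the closest non-comment block on the comment's page, if any."""
    page = comment["bbox"]["page"]
    candidates = [
        b for b in blocks
        if b["type"] != "Comment" and b["bbox"]["page"] == page
    ]
    if not candidates:
        return None
    origin = center(comment["bbox"])
    return min(candidates, key=lambda b: math.dist(origin, center(b["bbox"])))


comment = {
    "type": "Comment",
    "content": "Verify this figure with the finance team before publishing",
    "bbox": {"left": 0.85, "top": 0.15, "width": 0.1, "height": 0.05, "page": 1},
}
# Illustrative blocks, not real parser output.
blocks = [
    comment,
    {"type": "Text", "content": "Q3 revenue summary",
     "bbox": {"left": 0.1, "top": 0.1, "width": 0.7, "height": 0.1, "page": 1}},
    {"type": "Text", "content": "Appendix",
     "bbox": {"left": 0.1, "top": 0.6, "width": 0.7, "height": 0.1, "page": 1}},
]
print(nearest_block(comment, blocks)["content"])
```

Center-to-center distance is a simple heuristic. Margin comments sit to the side of the text, so comparing only vertical positions may work better for some layouts.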

Highlights

Highlighted text usually signals importance. Reducto can detect highlighted passages and wrap them in <mark> tags, letting you identify what reviewers or authors emphasized.

formatting={"include": ["highlight"]}

Output:

The key finding was that <mark>revenue increased 47% year-over-year</mark> despite market headwinds.

How it works: For digital documents, Reducto reads highlight annotations. For scanned documents, it uses a segmentation model to detect colored highlighting (typically yellow, but other colors work too).

Common uses:

Extract key passages from research documents
Identify what reviewers marked as significant during review
Use highlights as importance signals for summarization
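Pulling the highlighted passages out of parsed content can be sketched with a single regular expression. This assumes <mark> tags appear without attributes and are not nested, as in the output above; extract_highlights is a hypothetical helper.

```python
import re

# Assumes plain, non-nested <mark> tags as in the sample output above.
MARK_RE = re.compile(r"<mark>(.*?)</mark>", re.DOTALL)


def extract_highlights(text: str) -> list:
    """Return the text of every highlighted passage, in document order."""
    return [passage.strip() for passage in MARK_RE.findall(text)]


sample = (
    "The key finding was that <mark>revenue increased 47% year-over-year</mark> "
    "despite market headwinds."
)
print(extract_highlights(sample))
# ['revenue increased 47% year-over-year']
```

The returned list can be fed to a summarizer as a set of passages to prioritize.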

Hyperlinks

Documents contain links to external resources, internal references, and citations. Reducto extracts these and converts them to markdown link format, preserving both the display text and the URL.

formatting={"include": ["hyperlinks"]}

Output:

For more details, see [our methodology paper](https://example.com/methodology.pdf).

Common uses:

Build reference lists from academic papers
Audit documents for broken or outdated links
Extract cited sources for verification
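Building a reference list or auditing links starts with collecting the display text and URL of each markdown link. The following is a minimal sketch, assuming links follow the [text](url) form shown above and that URLs contain no spaces or closing parentheses; extract_links is a hypothetical helper.

```python
import re

# Matches [text](url) but skips image syntax ![alt](src).
# Assumption: URLs contain no whitespace or ")" characters.
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\((\S+?)\)")


def extract_links(text: str) -> list:
    """Return (display_text, url) pairs for every markdown link in text."""
    return LINK_RE.findall(text)


sample = "For more details, see [our methodology paper](https://example.com/methodology.pdf)."
for display_text, url in extract_links(sample):
    print(f"{display_text} -> {url}")
# our methodology paper -> https://example.com/methodology.pdf
```

A full markdown parser would be more robust for URLs that contain parentheses; the regular expression keeps the example dependency-free.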

Signatures

Forms and contracts often contain signature fields. Reducto can detect where signatures appear, which is useful for determining whether a document has been signed or for locating signature regions for downstream processing.

formatting={"include": ["signatures"]}

Detected signatures appear as blocks marking their location:

{
  "type": "Signature",
  "content": "<signature>",
  "bbox": {"left": 0.1, "top": 0.8, "width": 0.3, "height": 0.1, "page": 2}
}

The actual signature image is not extracted (for privacy). The block identifies where a signature was detected.

Common uses:

Verify that forms have been signed before processing
Route unsigned documents back for completion
Classify documents as signed vs. unsigned
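Classifying a document as signed or unsigned then reduces to checking for Signature blocks. The sketch below assumes you have already collected the parsed blocks into a flat list of dicts with the type and bbox fields shown above; how you reach that list from the parse result is not covered here, and is_signed and signature_pages are hypothetical helpers.

```python
def is_signed(blocks: list) -> bool:
    """True if at least one Signature block was detected."""
    return any(block.get("type") == "Signature" for block in blocks)


def signature_pages(blocks: list) -> list:
    """Sorted page numbers on which a signature was detected."""
    return sorted({
        block["bbox"]["page"]
        for block in blocks
        if block.get("type") == "Signature"
    })


# Illustrative blocks in the shape shown above, not real parser output.
blocks = [
    {"type": "Text", "content": "Signed by the undersigned:",
     "bbox": {"left": 0.1, "top": 0.7, "width": 0.5, "height": 0.05, "page": 2}},
    {"type": "Signature", "content": "<signature>",
     "bbox": {"left": 0.1, "top": 0.8, "width": 0.3, "height": 0.1, "page": 2}},
]
print(is_signed(blocks), signature_pages(blocks))
# True [2]
```

For forms that require several signatures, comparing signature_pages against the pages where you expect a signature gives a stricter check than is_signed alone.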

Format Compatibility

Not all features work with all document types:

Change tracking and highlights work best with Word documents, which store this information natively. For PDFs, Reducto uses visual detection models, which work well but may miss subtle formatting. Scanned documents rely entirely on visual detection.
Comments work with PDF annotations (sticky notes), Word margin comments, and Excel cell notes. Scanned documents don’t have extractable comments.
Hyperlinks work with PDFs and Word documents that contain embedded links. Scanned documents don’t preserve hyperlink information.
Signatures are detected visually, so they work across all document types including scans.
