When you parse a document, Reducto extracts the main text content by default. But documents often contain additional information layered on top: revision marks from Track Changes, margin comments, highlighted passages, hyperlinks, and signatures. The formatting.include option lets you extract these.
result = client.parse.run(
    input=upload.file_id,
    formatting={
        "include": ["change_tracking", "comments", "highlight", "hyperlinks", "signatures"]
    }
)
By default, none of these are extracted. Enable only what you need, since each adds processing overhead.
These formatting options are available in the Python SDK, the Node.js SDK, and via cURL. The Go SDK has limited support: only enable_underlines (for change tracking) is currently available.

Change Tracking

Legal documents, contracts, and collaborative drafts often use underlines and strikethroughs to show what changed between versions. Reducto can detect these and wrap them in HTML tags so you can programmatically identify revisions.
formatting={"include": ["change_tracking"]}
When enabled, underlined and struck-through text appears with markup:
The agreement shall commence on <change><s>January 1, 2024</s> <u>February 15, 2024</u></change>.
The <s> tag marks strikethrough (typically deletions), <u> marks underlines (typically insertions), and <change> wraps the entire revision region.
How it works: For digital PDFs and Word documents, Reducto reads the embedded formatting information. For scanned documents, it uses a segmentation model to visually detect underlines and strikethroughs on the page image.
Common uses:
  • Contract review: automatically extract what changed between versions
  • Compliance: track modifications to policies and procedures
  • Editorial workflows: preserve editor suggestions in parsed output
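As a minimal sketch of the contract-review use case, the markup above can be pulled apart with regular expressions. The helper names `extract_revisions` and `accept_changes` are hypothetical and not part of the Reducto SDK; the code assumes the parsed text uses exactly the `<change>`, `<s>`, and `<u>` tags shown in the example, without nesting.

```python
import re

# Patterns for the tags shown in the example output (assumed not to be nested).
CHANGE_RE = re.compile(r"<change>(.*?)</change>", re.DOTALL)
DELETED_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
INSERTED_RE = re.compile(r"<u>(.*?)</u>", re.DOTALL)


def extract_revisions(text):
    """Return one dict per <change> region, listing deleted and inserted runs."""
    revisions = []
    for match in CHANGE_RE.finditer(text):
        region = match.group(1)
        revisions.append({
            "deleted": DELETED_RE.findall(region),
            "inserted": INSERTED_RE.findall(region),
        })
    return revisions


def accept_changes(text):
    """Drop struck-through text and keep underlined text, as if accepting all edits."""
    def _resolve(match):
        region = DELETED_RE.sub("", match.group(1))   # remove deletions
        region = INSERTED_RE.sub(r"\1", region)       # unwrap insertions
        return region.strip()
    return CHANGE_RE.sub(_resolve, text)


sample = (
    "The agreement shall commence on "
    "<change><s>January 1, 2024</s> <u>February 15, 2024</u></change>."
)
revisions = extract_revisions(sample)
accepted = accept_changes(sample)
print(revisions)
print(accepted)
```

Treating strikethrough as a deletion and underline as an insertion follows the "typically" wording above; documents that use underlines for emphasis would need a different interpretation.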

Comments

PDF sticky notes, Word margin comments, and Excel cell notes contain reviewer feedback, questions, and instructions that are separate from the document content itself. Reducto extracts these as distinct blocks.
formatting={"include": ["comments"]}
Each comment becomes its own block with the comment text and its position on the page:
{
  "type": "Comment",
  "content": "Verify this figure with the finance team before publishing",
  "bbox": {"left": 0.85, "top": 0.15, "width": 0.1, "height": 0.05, "page": 1}
}
The bounding box tells you where the comment annotation appeared (normalized to [0, 1] relative to page size). This lets you correlate comments with nearby content if needed.
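One way to correlate a comment with nearby content is to compare bounding-box centers on the same page. The sketch below is illustrative only: `nearest_content_block` is a hypothetical helper, the blocks are assumed to be plain dicts in the shape shown above, and the "Text" blocks are made-up sample data.

```python
import math


def bbox_center(bbox):
    """Center point of a normalized bounding box."""
    return (bbox["left"] + bbox["width"] / 2, bbox["top"] + bbox["height"] / 2)


def nearest_content_block(comment, blocks):
    """Return the non-comment block on the same page whose center is closest."""
    page = comment["bbox"]["page"]
    candidates = [
        b for b in blocks
        if b["type"] != "Comment" and b["bbox"]["page"] == page
    ]
    if not candidates:
        return None
    center = bbox_center(comment["bbox"])
    return min(candidates, key=lambda b: math.dist(center, bbox_center(b["bbox"])))


comment = {
    "type": "Comment",
    "content": "Verify this figure with the finance team before publishing",
    "bbox": {"left": 0.85, "top": 0.15, "width": 0.1, "height": 0.05, "page": 1},
}
# Hypothetical neighbouring blocks, for illustration only.
blocks = [
    comment,
    {"type": "Text", "content": "Q3 revenue summary paragraph.",
     "bbox": {"left": 0.1, "top": 0.12, "width": 0.7, "height": 0.08, "page": 1}},
    {"type": "Text", "content": "Appendix notes.",
     "bbox": {"left": 0.1, "top": 0.6, "width": 0.7, "height": 0.08, "page": 1}},
]
nearest = nearest_content_block(comment, blocks)
print(nearest["content"])
```

Center distance is a simple heuristic. For margin comments, comparing only the vertical position may match the reader's intuition better.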

Highlights

Highlighted text usually signals importance. Reducto can detect highlighted passages and wrap them in <mark> tags, letting you identify what reviewers or authors emphasized.
formatting={"include": ["highlight"]}
Output:
The key finding was that <mark>revenue increased 47% year-over-year</mark> despite market headwinds.
How it works: For digital documents, Reducto reads highlight annotations. For scanned documents, it uses a segmentation model to detect colored highlighting (typically yellow, but other colors work too).
Common uses:
  • Extract key passages from research documents
  • Identify what reviewers marked as significant during review
  • Use highlights as importance signals for summarization
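To use highlights as an importance signal, the `<mark>` spans can be collected from the parsed text. This is a small sketch, assuming the output contains non-nested `<mark>` tags as in the example above; `extract_highlights` is a hypothetical helper name.

```python
import re

MARK_RE = re.compile(r"<mark>(.*?)</mark>", re.DOTALL)


def extract_highlights(text):
    """Return the highlighted passages in document order."""
    return [passage.strip() for passage in MARK_RE.findall(text)]


def strip_marks(text):
    """Remove the <mark> tags but keep the highlighted text in place."""
    return MARK_RE.sub(r"\1", text)


sample = (
    "The key finding was that <mark>revenue increased 47% year-over-year</mark> "
    "despite market headwinds."
)
highlights = extract_highlights(sample)
plain = strip_marks(sample)
print(highlights)
print(plain)
```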

Hyperlinks

Documents contain links to external resources, internal references, and citations. Reducto extracts these and converts them to markdown link format, preserving both the display text and the URL.
formatting={"include": ["hyperlinks"]}
Output:
For more details, see [our methodology paper](https://example.com/methodology.pdf).
Common uses:
  • Build reference lists from academic papers
  • Audit documents for broken or outdated links
  • Extract cited sources for verification
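For link audits or reference lists, the markdown links can be collected with a regular expression. The sketch below assumes links appear in the `[text](url)` form shown above; `extract_links` is a hypothetical helper, and the simple pattern does not handle URLs that themselves contain parentheses.

```python
import re
from urllib.parse import urlparse

# [display text](url); the lookbehind skips markdown images such as ![alt](src).
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def extract_links(text):
    """Return a list of {"text", "url", "host"} dicts for each markdown link."""
    links = []
    for label, url in LINK_RE.findall(text):
        links.append({"text": label, "url": url, "host": urlparse(url).netloc})
    return links


sample = "For more details, see [our methodology paper](https://example.com/methodology.pdf)."
links = extract_links(sample)
print(links)
```

Grouping the results by `host` is a quick way to see which external sites a document depends on before checking each URL.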

Signatures

Forms and contracts often contain signature fields. Reducto can detect where signatures appear, which is useful for determining whether a document has been signed or for locating signature regions for downstream processing.
formatting={"include": ["signatures"]}
Detected signatures appear as blocks marking their location:
{
  "type": "Signature",
  "content": "<signature>",
  "bbox": {"left": 0.1, "top": 0.8, "width": 0.3, "height": 0.1, "page": 2}
}
The actual signature image is not extracted (for privacy). The block identifies where a signature was detected.
Common uses:
  • Verify that forms have been signed before processing
  • Route unsigned documents back for completion
  • Classify documents as signed vs. unsigned
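A signed/unsigned check can be sketched as a scan for `Signature` blocks. The code assumes the blocks have already been gathered into a flat list of dicts in the shape shown above; how to reach them from the parse result is not covered here, and `signature_pages` and `is_signed` are hypothetical helper names.

```python
def signature_pages(blocks):
    """Return the sorted page numbers on which a signature was detected."""
    return sorted({
        block["bbox"]["page"]
        for block in blocks
        if block.get("type") == "Signature"
    })


def is_signed(blocks, required_pages=None):
    """True if any signature exists, or if every required page has one."""
    pages = signature_pages(blocks)
    if required_pages is None:
        return bool(pages)
    return set(required_pages).issubset(pages)


blocks = [
    {"type": "Signature", "content": "<signature>",
     "bbox": {"left": 0.1, "top": 0.8, "width": 0.3, "height": 0.1, "page": 2}},
]
pages = signature_pages(blocks)
print(pages, is_signed(blocks), is_signed(blocks, required_pages=[2, 3]))
```

The `required_pages` argument is one possible way to route partially signed documents back for completion; a detected signature block says a signature appears on the page, not who signed it.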

Format Compatibility

Not all features work with all document types:
  • Change tracking and highlights work best with Word documents, which store this information natively. For PDFs, Reducto uses visual detection models, which work well but may miss subtle formatting. Scanned documents rely entirely on visual detection.
  • Comments work with PDF annotations (sticky notes), Word margin comments, and Excel cell notes. Scanned documents don't have extractable comments.
  • Hyperlinks work with PDFs and Word documents that contain embedded links. Scanned documents don't preserve hyperlink information.
  • Signatures are detected visually, so they work across all document types, including scans.