Extract underlines, strikethroughs, and PDF text annotations. Returned text includes HTML markup that identifies insertions and deletions, and annotations include normalized location data.

When to use

  • Legal and compliance review (redlines in contracts and policies)
  • Editorial review (what changed between versions)
  • PDF review workflows that rely on sticky notes or text comments

How-to: enable change tracking

Wrap underlined and struck-through text in HTML tags so that insertions and deletions in a document can be detected.

Configuration

Requirements: Change tracking works only when the extraction mode is hybrid or metadata. It is not available in ocr mode.
{
  "document_url": "https://example.com/document.pdf",
  "options": {
    "extraction_mode": "hybrid"
  },
  "advanced_options": {
    "enable_change_tracking": true
  }
}
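
As a minimal sketch, the request body above could be assembled in Python with a guard for the extraction-mode requirement. The helper name build_change_tracking_request is hypothetical, and the literal mode strings "metadata" and "ocr" are assumed from the requirement text; only "hybrid" appears in the configuration shown here.

```python
import json

# Modes assumed to support change tracking, per the requirement above.
# The exact strings "metadata" and "ocr" are assumptions.
SUPPORTED_MODES = {"hybrid", "metadata"}


def build_change_tracking_request(document_url, extraction_mode="hybrid"):
    """Build a request body with change tracking enabled (hypothetical helper)."""
    if extraction_mode not in SUPPORTED_MODES:
        raise ValueError(
            f"change tracking requires one of {sorted(SUPPORTED_MODES)}, "
            f"got {extraction_mode!r}"
        )
    return {
        "document_url": document_url,
        "options": {"extraction_mode": extraction_mode},
        "advanced_options": {"enable_change_tracking": True},
    }


payload = build_change_tracking_request("https://example.com/document.pdf")
body = json.dumps(payload, indent=2)  # JSON text ready to send as the request body
```

Rejecting an unsupported mode before the request is sent surfaces the misconfiguration locally instead of relying on how the service responds to it.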

Output

  • <change><u>underlined text</u></change> for underlined (inserted) text
  • <change><s>deleted text</s></change> for strikethrough (deleted) text
  • <change><s>old</s> <u>new</u></change> for a deletion followed by its replacement
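
The markup above can be post-processed with a few lines of Python. This is a sketch only: it assumes that <change> blocks are not nested, that <u> and <s> appear only inside them, and that underlined text stands for an insertion. The function name extract_changes and the sample sentence are illustrative, not part of the API.

```python
import re

# One <change>...</change> block, matched non-greedily.
CHANGE_RE = re.compile(r"<change>(.*?)</change>", re.DOTALL)
# A <u>...</u> or <s>...</s> span inside a change block.
PART_RE = re.compile(r"<(u|s)>(.*?)</\1>", re.DOTALL)


def extract_changes(text):
    """Return one dict per <change> block with its deleted and inserted text."""
    changes = []
    for block in CHANGE_RE.finditer(text):
        deleted, inserted = [], []
        for tag, body in PART_RE.findall(block.group(1)):
            # <u> is treated as an insertion, <s> as a deletion.
            (inserted if tag == "u" else deleted).append(body)
        changes.append(
            {"deleted": " ".join(deleted), "inserted": " ".join(inserted)}
        )
    return changes


sample = (
    "Payment is due in <change><s>30</s> <u>45</u></change> days. "
    "<change><u>Late fees apply.</u></change>"
)
changes = extract_changes(sample)
for change in changes:
    print(change)
```

Regular expressions are enough for flat markup like this. If the returned text can contain other HTML, a real HTML parser would be the safer choice.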

How-to: detect PDF comments

Extract text annotations from PDF documents with their content and locations.

Configuration

{
  "document_url": "https://example.com/annotated.pdf",
  "advanced_options": {
    "read_comments": true
  }
}

Output

Comments include content and normalized bounding box coordinates:
{
  "content": "Review comment text",
  "bbox": [0.1, 0.2, 0.3, 0.4]
}
The bbox array contains [left, top, width, height], with each value normalized to the range [0, 1] relative to the page width and height.
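
To draw or crop a comment region, the normalized values need to be scaled back to page units. The sketch below assumes the origin is the top-left corner of the page (suggested by the "top" field, but not stated explicitly) and uses a US Letter page of 612 x 792 points as an example size. The function name bbox_to_absolute is hypothetical.

```python
def bbox_to_absolute(bbox, page_width, page_height):
    """Convert [left, top, width, height] in [0, 1] to (x0, y0, x1, y1) in page units."""
    left, top, width, height = bbox
    x0 = left * page_width
    y0 = top * page_height
    x1 = (left + width) * page_width
    y1 = (top + height) * page_height
    return (x0, y0, x1, y1)


comment = {"content": "Review comment text", "bbox": [0.1, 0.2, 0.3, 0.4]}

# Example page size in points (US Letter); substitute the real page dimensions.
x0, y0, x1, y1 = bbox_to_absolute(comment["bbox"], page_width=612, page_height=792)
print(f"{comment['content']!r} spans x {x0:.1f}-{x1:.1f}, y {y0:.1f}-{y1:.1f}")
```

Native PDF coordinates place the origin at the bottom-left, so a y-axis flip may be needed before passing these values to a PDF library.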