Extract underlined and strikethrough text as HTML markup, and extract PDF comments together with their locations.

Change Tracking

Wraps underlined and strikethrough text in HTML tags so that changes to a document can be detected.

Configuration

{
  "document_url": "https://example.com/document.pdf",
  "options": {
    "extraction_mode": "hybrid"
  },
  "advanced_options": {
    "enable_change_tracking": true
  }
}

Requirements: Change tracking only works when extraction_mode is hybrid or metadata. It is not supported with ocr.
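The extraction-mode requirement can be checked on the client before a request is sent. The following is a minimal sketch in Python that builds the configuration shown above. The function name build_change_tracking_request and the SUPPORTED_MODES constant are hypothetical helpers introduced for illustration and are not part of the API.

```python
# Extraction modes that the requirement above lists as compatible with change tracking.
SUPPORTED_MODES = {"hybrid", "metadata"}


def build_change_tracking_request(document_url, extraction_mode="hybrid"):
    """Build a request body with change tracking enabled (hypothetical helper).

    Raises ValueError if the extraction mode is not one of the supported modes.
    """
    if extraction_mode not in SUPPORTED_MODES:
        raise ValueError(
            f"change tracking requires one of {sorted(SUPPORTED_MODES)}, "
            f"got {extraction_mode!r}"
        )
    return {
        "document_url": document_url,
        "options": {"extraction_mode": extraction_mode},
        "advanced_options": {"enable_change_tracking": True},
    }


if __name__ == "__main__":
    import json

    body = build_change_tracking_request("https://example.com/document.pdf")
    print(json.dumps(body, indent=2))
```

Failing early on an unsupported mode such as ocr avoids sending a request that cannot produce change-tracking markup.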

Output

  • <u>underlined text</u> for underlined text
  • <s>deleted text</s> for strikethrough text
  • <change><s>old</s> <u>new</u></change> for change sequences
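The tags listed above can be post-processed with ordinary string tools. The sketch below uses Python's re module to list the change sequences and to produce the text as it would read with all changes accepted. It assumes that tags are not nested beyond the forms shown, that a standalone <s> marks removed text, and that a standalone <u> marks text to keep. The function names extract_changes and accept_changes are hypothetical.

```python
import re

# Matches a change sequence: a strikethrough span followed by an underlined span.
CHANGE_RE = re.compile(
    r"<change>\s*<s>(.*?)</s>\s*<u>(.*?)</u>\s*</change>", re.DOTALL
)
# Matches a standalone strikethrough span.
STRIKE_RE = re.compile(r"<s>.*?</s>", re.DOTALL)
# Matches opening and closing underline tags.
UNDERLINE_TAG_RE = re.compile(r"</?u>")


def extract_changes(text):
    """Return a list of (old, new) pairs, one per <change> sequence."""
    return CHANGE_RE.findall(text)


def accept_changes(text):
    """Return the text with every tracked change applied (assumed semantics)."""
    # Keep only the new text of each change sequence.
    text = CHANGE_RE.sub(lambda m: m.group(2), text)
    # Drop remaining strikethrough spans, treating them as deletions.
    text = STRIKE_RE.sub("", text)
    # Keep underlined text but remove the tags around it.
    return UNDERLINE_TAG_RE.sub("", text)


if __name__ == "__main__":
    sample = (
        "The fee is <change><s>$10</s> <u>$12</u></change> "
        "per <u>active</u> user<s> per month</s>."
    )
    print(extract_changes(sample))
    print(accept_changes(sample))
```

Regular expressions are enough here because the markup is limited to three tags. Output with nested or malformed tags would call for a real HTML parser instead.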

PDF Comments

Extract text annotations from PDF documents with their content and locations.

Configuration

{
  "document_url": "https://example.com/annotated.pdf",
  "advanced_options": {
    "read_comments": true
  }
}

Output

Comments include content and normalized bounding box coordinates:

{
  "content": "Review comment text",
  "bbox": [0.1, 0.2, 0.3, 0.4]
}

The bbox array contains [left, top, width, height], with each value normalized to the range [0, 1] relative to the page dimensions.
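To draw or crop a comment region, the normalized values have to be scaled by the size of the rendered page. The sketch below converts a bbox to pixel corner coordinates. It assumes the origin is the top-left corner of the page and that the caller supplies the page size in pixels. The function name bbox_to_pixels is hypothetical.

```python
def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a normalized [left, top, width, height] bbox to pixel corners.

    Returns (x0, y0, x1, y1) as integers, assuming a top-left origin.
    """
    left, top, width, height = bbox
    x0 = round(left * page_width)
    y0 = round(top * page_height)
    x1 = round((left + width) * page_width)
    y1 = round((top + height) * page_height)
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    comment = {"content": "Review comment text", "bbox": [0.1, 0.2, 0.3, 0.4]}
    # Page size in pixels is an assumed example value.
    print(bbox_to_pixels(comment["bbox"], 1000, 2000))  # (100, 400, 400, 1200)
```

Rounding to integers avoids floating-point residue such as 1200.0000000000002 when the result is used as a pixel index.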