# Additional Document Data

> Extract revision marks, comments, highlights, hyperlinks, and signatures

When you parse a document, Reducto extracts the main text content by default. But documents often contain additional information layered on top: revision marks from Track Changes, margin comments, highlighted passages, hyperlinks, and signatures. The `formatting.include` option lets you extract these.

<CodeGroup>
  ```python Python theme={null}
  result = client.parse.run(
      input=upload.file_id,
      formatting={
          "include": ["change_tracking", "comments", "highlight", "hyperlinks", "signatures"]
      }
  )
  ```

  ```javascript Node.js theme={null}
  const result = await client.parse.run({
    input: upload.file_id,
    formatting: {
      include: ['change_tracking', 'comments', 'highlight', 'hyperlinks', 'signatures']
    }
  });
  ```

  ```bash cURL theme={null}
  curl -X POST https://platform.reducto.ai/parse \
    -H "Authorization: Bearer $REDUCTO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "input": "reducto://your-file-id",
      "formatting": {
        "include": ["change_tracking", "comments", "highlight", "hyperlinks", "signatures"]
      }
    }'
  ```
</CodeGroup>

By default, none of these are extracted. Enable only what you need, since each adds processing overhead.

<Note>
  These formatting options are available in the Python SDK, Node.js SDK, and via cURL. The Go SDK has limited support: only `enable_underlines` (for change tracking) is currently available.
</Note>

## Change Tracking

Legal documents, contracts, and collaborative drafts often use underlines and strikethroughs to show what changed between versions. Reducto can detect these and wrap them in HTML tags so you can programmatically identify revisions.

<CodeGroup>
  ```python Python theme={null}
  formatting={"include": ["change_tracking"]}
  ```

  ```javascript Node.js theme={null}
  formatting: { include: ['change_tracking'] }
  ```

  ```bash cURL theme={null}
  "formatting": {"include": ["change_tracking"]}
  ```
</CodeGroup>

When enabled, underlined and struck-through text appears with markup:

```html theme={null}
The agreement shall commence on <change><s>January 1, 2024</s> <u>February 15, 2024</u></change>.
```

The `<s>` tag marks strikethrough (typically deletions), `<u>` marks underlines (typically insertions), and `<change>` wraps the entire revision region.
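
If you want to work with revisions programmatically, a minimal sketch along these lines may help. It assumes you already have the parsed text as a plain string and that `<change>` regions are not nested inside each other. The helper name `extract_revisions` is hypothetical and not part of the Reducto SDK.

```python
import re

# Patterns for the tags described above. Non-greedy matching assumes
# that tags of the same kind are not nested inside each other.
CHANGE_RE = re.compile(r"<change>(.*?)</change>", re.DOTALL)
DELETED_RE = re.compile(r"<s>(.*?)</s>", re.DOTALL)
INSERTED_RE = re.compile(r"<u>(.*?)</u>", re.DOTALL)


def extract_revisions(text):
    """Return one dict per <change> region with its deleted and inserted text.

    Hypothetical helper for illustration; not part of the Reducto SDK.
    """
    revisions = []
    for match in CHANGE_RE.finditer(text):
        region = match.group(1)
        revisions.append({
            "deleted": DELETED_RE.findall(region),    # struck-through spans
            "inserted": INSERTED_RE.findall(region),  # underlined spans
        })
    return revisions


sample = (
    "The agreement shall commence on "
    "<change><s>January 1, 2024</s> <u>February 15, 2024</u></change>."
)
for revision in extract_revisions(sample):
    print(revision)
```

Regular expressions are enough for flat markup like the example above. If your documents produce nested or malformed tags, an HTML parser is likely a safer choice.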

**How it works:** For digital PDFs and Word documents, Reducto reads the embedded formatting information. For scanned documents, it uses a segmentation model to visually detect underlines and strikethroughs on the page image.

**Common uses:**

* Contract review: automatically extract what changed between versions
* Compliance: track modifications to policies and procedures
* Editorial workflows: preserve editor suggestions in parsed output

## Comments

PDF sticky notes, Word margin comments, and Excel cell notes contain reviewer feedback, questions, and instructions that are separate from the document content itself. Reducto extracts these as distinct blocks.

<CodeGroup>
  ```python Python theme={null}
  formatting={"include": ["comments"]}
  ```

  ```javascript Node.js theme={null}
  formatting: { include: ['comments'] }
  ```

  ```bash cURL theme={null}
  "formatting": {"include": ["comments"]}
  ```
</CodeGroup>

Each comment becomes its own block with the comment text and its position on the page:

```json theme={null}
{
  "type": "Comment",
  "content": "Verify this figure with the finance team before publishing",
  "bbox": {"left": 0.85, "top": 0.15, "width": 0.1, "height": 0.05, "page": 1}
}
```

The bounding box tells you where the comment annotation appeared (normalized to \[0, 1] relative to page size). This lets you correlate comments with nearby content if needed.
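
One way to do that correlation is to pick the closest non-comment block on the same page. The sketch below assumes blocks are plain dictionaries shaped like the JSON above. The `"Text"` block type, the sample data, and the `nearest_block` helper are assumptions made for illustration, not documented SDK behavior.

```python
import math
from typing import Optional


def center(bbox):
    """Return the (x, y) center of a normalized bounding box."""
    return (bbox["left"] + bbox["width"] / 2, bbox["top"] + bbox["height"] / 2)


def nearest_block(comment, blocks) -> Optional[dict]:
    """Find the non-comment block on the same page closest to the comment.

    Hypothetical helper: distance is measured between box centers.
    """
    page = comment["bbox"]["page"]
    candidates = [
        b for b in blocks
        if b["type"] != "Comment" and b["bbox"]["page"] == page
    ]
    if not candidates:
        return None
    origin = center(comment["bbox"])
    return min(candidates, key=lambda b: math.dist(origin, center(b["bbox"])))


# Sample data for illustration only (block types other than "Comment" are assumed).
blocks = [
    {"type": "Text", "content": "Q3 revenue was $4.2M.",
     "bbox": {"left": 0.1, "top": 0.12, "width": 0.7, "height": 0.08, "page": 1}},
    {"type": "Text", "content": "Outlook for next year.",
     "bbox": {"left": 0.1, "top": 0.6, "width": 0.7, "height": 0.1, "page": 1}},
    {"type": "Comment",
     "content": "Verify this figure with the finance team before publishing",
     "bbox": {"left": 0.85, "top": 0.15, "width": 0.1, "height": 0.05, "page": 1}},
]

comment = blocks[2]
target = nearest_block(comment, blocks)
print(target["content"] if target else "no block on this page")
```

Center-to-center distance is a simple heuristic. For margin comments, comparing only the vertical position may match the intended line more reliably.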

## Highlights

Highlighted text usually signals importance. Reducto can detect highlighted passages and wrap them in `<mark>` tags, letting you identify what reviewers or authors emphasized.

<CodeGroup>
  ```python Python theme={null}
  formatting={"include": ["highlight"]}
  ```

  ```javascript Node.js theme={null}
  formatting: { include: ['highlight'] }
  ```

  ```bash cURL theme={null}
  "formatting": {"include": ["highlight"]}
  ```
</CodeGroup>

Output:

```html theme={null}
The key finding was that <mark>revenue increased 47% year-over-year</mark> despite market headwinds.
```

**How it works:** For digital documents, Reducto reads highlight annotations. For scanned documents, it uses a segmentation model to detect colored highlighting (typically yellow, but other colors work too).
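
To pull highlighted passages out of the parsed text, something like the following sketch could work. It assumes the content is available as a string and that `<mark>` tags are not nested. Both helper names are hypothetical.

```python
import re

# Non-greedy match so adjacent highlights are returned separately.
MARK_RE = re.compile(r"<mark>(.*?)</mark>", re.DOTALL)


def extract_highlights(text):
    """Return the text of each <mark>...</mark> span, in document order."""
    return [span.strip() for span in MARK_RE.findall(text)]


def strip_marks(text):
    """Remove <mark> tags but keep the highlighted text itself."""
    return re.sub(r"</?mark>", "", text)


sample = (
    "The key finding was that <mark>revenue increased 47% "
    "year-over-year</mark> despite market headwinds."
)
print(extract_highlights(sample))
print(strip_marks(sample))
```

Keeping a tag-free copy alongside the extracted highlights can be useful when the highlights feed a summarizer but the full text goes to search or storage.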

**Common uses:**

* Extract key passages from research documents
* Identify what reviewers marked as significant during review
* Use highlights as importance signals for summarization

## Hyperlinks

Documents contain links to external resources, internal references, and citations. Reducto extracts these and converts them to markdown link format, preserving both the display text and the URL.

<CodeGroup>
  ```python Python theme={null}
  formatting={"include": ["hyperlinks"]}
  ```

  ```javascript Node.js theme={null}
  formatting: { include: ['hyperlinks'] }
  ```

  ```bash cURL theme={null}
  "formatting": {"include": ["hyperlinks"]}
  ```
</CodeGroup>

Output:

```markdown theme={null}
For more details, see [our methodology paper](https://example.com/methodology.pdf).
```
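
Once links are in markdown form, you can collect them with a short script. This is a minimal sketch that assumes inline-style links like the one above and URLs without spaces or parentheses. The `extract_links` helper is hypothetical.

```python
import re

# Matches [text](url) but skips image syntax ![alt](src).
# Assumes the URL contains no whitespace or parentheses.
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def extract_links(markdown):
    """Return (display_text, url) pairs for each inline markdown link."""
    return LINK_RE.findall(markdown)


sample = (
    "For more details, see "
    "[our methodology paper](https://example.com/methodology.pdf)."
)
for text, url in extract_links(sample):
    print(f"{text} -> {url}")
```

For documents with complex URLs, a full markdown parser is likely to be more robust than a regular expression.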

**Common uses:**

* Build reference lists from academic papers
* Audit documents for broken or outdated links
* Extract cited sources for verification

## Signatures

Forms and contracts often contain signature fields. Reducto can detect where signatures appear, which is useful for determining whether a document has been signed or for locating signature regions for downstream processing.

<CodeGroup>
  ```python Python theme={null}
  formatting={"include": ["signatures"]}
  ```

  ```javascript Node.js theme={null}
  formatting: { include: ['signatures'] }
  ```

  ```bash cURL theme={null}
  "formatting": {"include": ["signatures"]}
  ```
</CodeGroup>

Detected signatures appear as blocks marking their location:

```json theme={null}
{
  "type": "Signature",
  "content": "<signature>",
  "bbox": {"left": 0.1, "top": 0.8, "width": 0.3, "height": 0.1, "page": 2}
}
```

The signature image itself is not extracted (for privacy). The block only identifies where a signature was detected.
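
A signed/unsigned check can then be a simple scan over the blocks. The sketch below assumes blocks are plain dictionaries shaped like the JSON above. The helper names, the `"Text"` block type, and the sample data are hypothetical.

```python
def signature_pages(blocks):
    """Return the sorted page numbers on which a Signature block was found."""
    return sorted({
        b["bbox"]["page"] for b in blocks if b.get("type") == "Signature"
    })


def is_signed(blocks):
    """Treat a document as signed if at least one Signature block exists."""
    return bool(signature_pages(blocks))


# Sample data for illustration only.
blocks = [
    {"type": "Text", "content": "Signed by the undersigned:",
     "bbox": {"left": 0.1, "top": 0.7, "width": 0.5, "height": 0.05, "page": 2}},
    {"type": "Signature", "content": "<signature>",
     "bbox": {"left": 0.1, "top": 0.8, "width": 0.3, "height": 0.1, "page": 2}},
]

if is_signed(blocks):
    print("Signed on pages:", signature_pages(blocks))
else:
    print("Unsigned: route back for completion")
```

This treats any detected signature as sufficient. Forms that need several signatures would need a stricter rule, such as checking for a signature on each expected page.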

**Common uses:**

* Verify that forms have been signed before processing
* Route unsigned documents back for completion
* Classify documents as signed vs. unsigned

## Format Compatibility

Not all features work with all document types:

**Change tracking and highlights** work best with Word documents, which store this information natively. For PDFs, Reducto uses visual detection models, which work well but may miss subtle formatting. Scanned documents rely entirely on visual detection.

**Comments** work with PDF annotations (sticky notes), Word margin comments, and Excel cell notes. Scanned documents don't have extractable comments.

**Hyperlinks** work with PDFs and Word documents that contain embedded links. Scanned documents don't preserve hyperlink information.

**Signatures** are detected visually, so they work across all document types including scans.
