# Contract & Legal Document Review

> Parse redlined legal documents and extract insertions, deletions, and annotations as structured data

Reducto's `change_tracking` feature extracts strikethroughs, underlines, and annotations from redlined contracts as structured HTML tags, making it easy to list changes programmatically, categorize them by type, and build approval workflows.

***

## What you'll build

By the end of this cookbook, you'll have a pipeline that parses any redlined contract and extracts every revision as structured data.

```
Redlined Contract (PDF/DOCX)
        |
        v
Reducto Parse (change_tracking enabled)
        |
        v
Structured output with <s>, <u>, <change> tags
        |
        v
Python extraction → List of all changes
```

***

## Create API Key

Go to [studio.reducto.ai](https://studio.reducto.ai) and sign in. From the home page, click **API Keys** in the left sidebar.

The API Keys page shows your existing keys. Click **+ Create new API key** in the top right corner.

In the modal, enter a name for your key and set an expiration policy (or select "Never" for no expiration). Click **Create**.

Copy your new API key and store it securely. You won't be able to see it again after closing this dialog.

Set the key as an environment variable:

```bash
export REDUCTO_API_KEY="your-api-key-here"
```

Install the SDK:

```bash Python
pip install reductoai requests
```

```bash JavaScript
npm install reductoai
```

***

## Sample document

For this cookbook, we use a 165-page union labor agreement (AFSCME Local 328 vs. Oregon Health & Science University) with extensive redlines showing proposed contract changes.
The document includes:

* Strikethroughs for deleted clauses
* Underlines for new language
* Inline annotations explaining changes

**Download the sample:**

```
https://static1.squarespace.com/static/5cee0f8eb1a76b0001ca1d78/t/5e45817fe03fd3048a4b1792/1581613498729/Redline+Contract+with+Annotation.pdf
```

Reducto works with both Word documents (which have native track changes metadata) and PDFs (where it visually detects underlines and strikethroughs). Word documents give the best results since the change metadata is embedded in the file.

***

## Step 1: Parse with change tracking

### Upload the document

First, upload your redlined contract to Reducto:

```python Python
from pathlib import Path
from reducto import Reducto

client = Reducto()
upload = client.upload(file=Path("redlined_contract.pdf"))
print(f"File ID: {upload.file_id}")
```

```javascript JavaScript
import Reducto from "reductoai";
import fs from "fs";

const client = new Reducto();
const upload = await client.upload({
  file: fs.createReadStream("redlined_contract.pdf")
});
console.log(`File ID: ${upload.file_id}`);
```

```
File ID: reducto://e1894955-d89d-42ef-a118-f08d61ee890d.pdf
```

### Parse with change tracking enabled

The key setting is `formatting.include: ["change_tracking"]`. This tells Reducto to detect underlines and strikethroughs and wrap them in HTML tags.

```python Python
result = client.parse.run(
    input=upload.file_id,
    formatting={
        "include": ["change_tracking"]
    }
)

print(f"Pages: {result.usage.num_pages}")
print(f"Credits: {result.usage.credits}")
```

```javascript JavaScript
const result = await client.parse.run({
  input: upload.file_id,
  formatting: {
    include: ["change_tracking"]
  }
});

console.log(`Pages: ${result.usage.num_pages}`);
console.log(`Credits: ${result.usage.credits}`);
```

```
Pages: 165
Credits: 188.0
```

**Why `change_tracking`?** Without this option, Reducto returns plain text.
With it enabled, revisions appear as HTML tags that you can parse programmatically:

* `<s>` wraps strikethrough text (deletions)
* `<u>` wraps underlined text (insertions)
* `<change>` groups related revisions together

***

## Step 2: Handle large documents

For large documents like our 165-page contract, Reducto returns results as a URL rather than inline data. This keeps response sizes manageable.

```python Python
import requests

# Check if result is a URL
if hasattr(result.result, 'url'):
    print(f"Result URL: {result.result.url[:80]}...")
    response = requests.get(result.result.url)
    data = response.json()
    chunks = data.get('chunks', [])
    full_content = "\n".join([c.get('content', '') for c in chunks])
else:
    full_content = "\n".join([c.content for c in result.result.chunks])

print(f"Content length: {len(full_content)} characters")
```

```javascript JavaScript
let fullContent;

// Check if result is a URL (large documents)
if (result.result.type === "url") {
  console.log(`Result URL: ${result.result.url.slice(0, 80)}...`);
  const response = await fetch(result.result.url);
  const data = await response.json();
  const chunks = data.chunks || [];
  fullContent = chunks.map(c => c.content || "").join("\n");
} else {
  fullContent = result.result.chunks.map(c => c.content).join("\n");
}

console.log(`Content length: ${fullContent.length} characters`);
```

```
Result URL: https://prod-storage20241010144745140900000001.s3.amazonaws.com/ac80631f...
Content length: 365675 characters
```

***

## Step 3: Understand the output

With change tracking enabled, revisions appear as HTML markup in the content. Here's what we found in our sample contract:

```
<change> tags found: 554
<s> (strikethrough) tags found: 289
<u> (underline) tags found: 438
```

### Real examples from the document

**Simple insertion** (new language added):

```html
<change><u>and any PEOPLE deduction</u></change>
```

**Clause deletion** (entire section removed):

```html
<change><s>a. An employee's chosen form of dues or payment in lieu of dues shall recommence upon reinstatement following a period of layoff or extended leave.</s></change>
```

**Replacement** (old language replaced with new):

```html
<change><s>forty-eight (48)</s> <u>fifty (50)</u></change>
```

### Tag meanings

| Tag        | Meaning         | Visual in document                  |
| ---------- | --------------- | ----------------------------------- |
| `<s>`      | Strikethrough   | ~~deleted text~~                    |
| `<u>`      | Underline       | <u>inserted text</u>                |
| `<change>` | Revision region | Groups related deletions/insertions |

A single `<change>` block can contain:

* Just a deletion: `<change><s>removed text</s></change>`
* Just an insertion: `<change><u>new text</u></change>`
* Both: `<change><s>old</s> <u>new</u></change>`

***

## Step 4: Extract changes programmatically

Now we parse the HTML tags to get a structured list of all changes. This function uses regex to find every `<change>` block and extract the deletions and insertions within it.

```python Python
import re

def extract_changes(content):
    """Extract all revision regions from parsed content."""
    changes = []
    pattern = r'<change>(.*?)</change>'

    for match in re.finditer(pattern, content, re.DOTALL):
        change_text = match.group(1)

        # Extract deletions (strikethrough)
        deletions = re.findall(r'<s>(.*?)</s>', change_text, re.DOTALL)

        # Extract insertions (underline)
        insertions = re.findall(r'<u>(.*?)</u>', change_text, re.DOTALL)

        changes.append({
            "deleted": deletions,
            "inserted": insertions,
        })

    return changes
```

```javascript JavaScript
function extractChanges(content) {
  const changes = [];

  // Match all <change>...</change> blocks
  const changeRegex = /<change>([\s\S]*?)<\/change>/g;
  let match;

  while ((match = changeRegex.exec(content)) !== null) {
    const changeText = match[1];

    // Extract deletions (strikethrough)
    const deletions = [];
    const delRegex = /<s>([\s\S]*?)<\/s>/g;
    let delMatch;
    while ((delMatch = delRegex.exec(changeText)) !== null) {
      deletions.push(delMatch[1]);
    }

    // Extract insertions (underline)
    const insertions = [];
    const insRegex = /<u>([\s\S]*?)<\/u>/g;
    let insMatch;
    while ((insMatch = insRegex.exec(changeText)) !== null) {
      insertions.push(insMatch[1]);
    }

    changes.push({ deleted: deletions, inserted: insertions });
  }

  return changes;
}
```

**Why regex?** The HTML tags are simple and well-structured. For basic extraction, regex is fast and sufficient. For documents with complex nested changes, consider using an HTML parser like BeautifulSoup.

### Run the extraction

```python Python
changes = extract_changes(full_content)
print(f"Found {len(changes)} revisions")

for i, change in enumerate(changes[:5]):
    print(f"\nRevision {i+1}:")
    if change["deleted"]:
        print(f"  Deleted: {change['deleted'][0][:60]}...")
    if change["inserted"]:
        print(f"  Inserted: {change['inserted'][0][:60]}...")
```

```javascript JavaScript
const changes = extractChanges(fullContent);
console.log(`Found ${changes.length} revisions`);

for (let i = 0; i < Math.min(5, changes.length); i++) {
  const change = changes[i];
  console.log(`\nRevision ${i + 1}:`);
  if (change.deleted.length > 0) {
    console.log(`  Deleted: ${change.deleted[0].slice(0, 60)}...`);
  }
  if (change.inserted.length > 0) {
    console.log(`  Inserted: ${change.inserted[0].slice(0, 60)}...`);
  }
}
```

**Output from our sample contract:**

```
Found 555 revisions

Revision 1:
  Deleted: Employees in the bargaining unit are required either to b...

Revision 2:
  Deleted: a. An employee's chosen form of dues or payment in lieu o...

Revision 3:
  Deleted: b. Dues and payments in-lieu-of dues for employees workin...

Revision 4:
  Inserted: Employees covered by this Agreement shall have the right...

Revision 5:
  Inserted: 1.2.2 Holder of Record. During the life of this Agreemen...
```

***

## Step 5: Categorize changes

Not all changes are equal. Some are pure deletions (language removed), some are pure insertions (new language added), and some are replacements (old swapped for new). Categorizing helps prioritize review.

```python Python
deletions_only = sum(1 for c in changes if c["deleted"] and not c["inserted"])
insertions_only = sum(1 for c in changes if c["inserted"] and not c["deleted"])
replacements = sum(1 for c in changes if c["deleted"] and c["inserted"])

print(f"Total revisions: {len(changes)}")
print(f"  - Deletions only: {deletions_only}")
print(f"  - Insertions only: {insertions_only}")
print(f"  - Replacements: {replacements}")
```

```javascript JavaScript
const deletionsOnly = changes.filter(c => c.deleted.length > 0 && c.inserted.length === 0).length;
const insertionsOnly = changes.filter(c => c.inserted.length > 0 && c.deleted.length === 0).length;
const replacements = changes.filter(c => c.deleted.length > 0 && c.inserted.length > 0).length;

console.log(`Total revisions: ${changes.length}`);
console.log(`  - Deletions only: ${deletionsOnly}`);
console.log(`  - Insertions only: ${insertionsOnly}`);
console.log(`  - Replacements: ${replacements}`);
```

**Output:**

```
Total revisions: 555
  - Deletions only: 191
  - Insertions only: 270
  - Replacements: 94
```

This contract has 191 sections removed entirely, 270 new sections added, and 94 places where language was swapped.

***

## Using Studio

You can also extract changes visually in Reducto Studio without writing code.

Go to [studio.reducto.ai](https://studio.reducto.ai) and upload your redlined contract.

In the **Configurations** tab, switch to **Advanced** mode.

Expand the **Formatting** section and check `change_tracking`.

Click **Run**.
The results show the parsed content with `<change>`, `<s>`, and `<u>` tags visible in the output. You can search for specific changes using Ctrl+F.

Copy the results, download as JSON, or deploy the pipeline with these settings for repeated use on similar documents.

***

## Complete example

Here's a full script that parses a redlined contract and generates a change summary:

```python Python
import re
import requests
from pathlib import Path
from reducto import Reducto

def extract_changes(content):
    """Extract revision regions from content."""
    changes = []
    pattern = r'<change>(.*?)</change>'
    for match in re.finditer(pattern, content, re.DOTALL):
        change_text = match.group(1)
        deletions = re.findall(r'<s>(.*?)</s>', change_text, re.DOTALL)
        insertions = re.findall(r'<u>(.*?)</u>', change_text, re.DOTALL)
        changes.append({"deleted": deletions, "inserted": insertions})
    return changes

# Parse the document
client = Reducto()
upload = client.upload(file=Path("redlined_contract.pdf"))
result = client.parse.run(
    input=upload.file_id,
    formatting={"include": ["change_tracking"]}
)

# Handle URL result for large documents
if hasattr(result.result, 'url'):
    response = requests.get(result.result.url)
    data = response.json()
    chunks = data.get('chunks', [])
    full_content = "\n".join([c.get('content', '') for c in chunks])
else:
    full_content = "\n".join([c.content for c in result.result.chunks])

# Extract and categorize
changes = extract_changes(full_content)
deletions_only = sum(1 for c in changes if c["deleted"] and not c["inserted"])
insertions_only = sum(1 for c in changes if c["inserted"] and not c["deleted"])
replacements = sum(1 for c in changes if c["deleted"] and c["inserted"])

# Print summary
print(f"Document: {result.usage.num_pages} pages")
print(f"Total revisions: {len(changes)}")
print(f"  - Deletions only: {deletions_only}")
print(f"  - Insertions only: {insertions_only}")
print(f"  - Replacements: {replacements}")
```

```javascript JavaScript
import Reducto from "reductoai";
import fs from "fs";

function extractChanges(content) {
  const changes = [];
  const changeRegex = /<change>([\s\S]*?)<\/change>/g;
  let match;
  while ((match = changeRegex.exec(content)) !== null) {
    const changeText = match[1];
    const deletions = [...changeText.matchAll(/<s>([\s\S]*?)<\/s>/g)].map(m => m[1]);
    const insertions = [...changeText.matchAll(/<u>([\s\S]*?)<\/u>/g)].map(m => m[1]);
    changes.push({ deleted: deletions, inserted: insertions });
  }
  return changes;
}

// Parse the document
const client = new Reducto();
const upload = await client.upload({
  file: fs.createReadStream("redlined_contract.pdf")
});
const result = await client.parse.run({
  input: upload.file_id,
  formatting: { include: ["change_tracking"] }
});

// Handle URL result for large documents
let fullContent;
if (result.result.type === "url") {
  const response = await fetch(result.result.url);
  const data = await response.json();
  fullContent = (data.chunks || []).map(c => c.content || "").join("\n");
} else {
  fullContent = result.result.chunks.map(c => c.content).join("\n");
}

// Extract and categorize
const changes = extractChanges(fullContent);
const deletionsOnly = changes.filter(c => c.deleted.length > 0 && c.inserted.length === 0).length;
const insertionsOnly = changes.filter(c => c.inserted.length > 0 && c.deleted.length === 0).length;
const replacements = changes.filter(c => c.deleted.length > 0 && c.inserted.length > 0).length;

// Print summary
console.log(`Document: ${result.usage.num_pages} pages`);
console.log(`Total revisions: ${changes.length}`);
console.log(`  - Deletions only: ${deletionsOnly}`);
console.log(`  - Insertions only: ${insertionsOnly}`);
console.log(`  - Replacements: ${replacements}`);
```

**Output from our sample contract:**

```
Document: 165 pages
Total revisions: 555
  - Deletions only: 191
  - Insertions only: 270
  - Replacements: 94
```

***

## How change tracking works

Reducto uses different detection methods depending on document type:

| Document type | Detection method                                                   |
| ------------- | ------------------------------------------------------------------ |
| Word (.docx)  | Reads native track changes metadata. Most accurate.                |
| PDF (digital) | Detects colored text and formatting via embedded character data.   |
| PDF (scanned) | Uses ML models to visually identify underlines and strikethroughs. |

For best results, use Word documents with Track Changes enabled. The metadata is preserved natively. PDFs require visual detection, which works well but depends on clear formatting.

***

## Best practices

* Word's native Track Changes stores revision metadata directly in the file. This gives Reducto exact information about what was added or removed, including author and timestamp. PDFs require visual detection.
* Replacements (where old text is swapped for new) often need careful review. Pure insertions may be less risky. Route different categories to appropriate reviewers.
* Use Parse with change tracking to get the revisions, then pipe specific clauses through Extract to pull structured fields like dates, amounts, or party names.
* Some documents have nested revisions (changes within changes). The regex patterns above handle simple cases. For complex documents, consider using an HTML parser like BeautifulSoup.

***

## Use cases

* Extract all changes from incoming redlines and route them to the appropriate reviewer based on clause type. Send indemnification changes to legal, pricing changes to finance.
* Build approval queues where each revision must be explicitly accepted or rejected before finalizing the agreement. Track who approved what.
* Monitor changes to policies and procedures. Flag modifications to critical sections for compliance review before they go into effect.
* Generate executive summaries showing what the counterparty changed. Brief stakeholders without requiring them to read a 165-page document.

***

## Next steps

* Learn about highlights, hyperlinks, and signatures.
* Pull structured data from specific clauses.
* Process multiple contracts in parallel.
* Visual walkthrough of the Parse pipeline.
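
***

## Handling nested revisions

The best practices note that the regex patterns only handle simple cases and that nested revisions call for an HTML parser. The sketch below is one possible approach using Python's standard-library `html.parser` in place of BeautifulSoup, so it runs without extra dependencies. `ChangeParser` and `extract_changes_nested` are hypothetical names introduced here. It assumes nested revisions appear as `<change>` blocks inside other `<change>` blocks, which has not been verified against real Reducto output.

```python
from html.parser import HTMLParser


class ChangeParser(HTMLParser):
    """Collect <s>/<u> text inside each outermost <change> block (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.changes = []     # finished revisions
        self._depth = 0       # how many <change> tags are currently open
        self._current = None  # revision being built
        self._stack = []      # open <s>/<u> tags as (key, index), innermost last

    def handle_starttag(self, tag, attrs):
        if tag == "change":
            if self._depth == 0:
                self._current = {"deleted": [], "inserted": []}
            self._depth += 1
        elif tag in ("s", "u") and self._depth > 0:
            key = "deleted" if tag == "s" else "inserted"
            self._current[key].append("")
            self._stack.append((key, len(self._current[key]) - 1))

    def handle_endtag(self, tag):
        if tag == "change" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Outermost block closed: emit one revision for the whole region
                self.changes.append(self._current)
                self._current = None
                self._stack.clear()
        elif tag in ("s", "u") and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text is attributed to the innermost open <s> or <u> tag
        if self._stack:
            key, idx = self._stack[-1]
            self._current[key][idx] += data


def extract_changes_nested(content):
    """Return one {"deleted": [...], "inserted": [...]} dict per outermost <change>."""
    parser = ChangeParser()
    parser.feed(content)
    parser.close()
    return parser.changes


# Assumed nesting shape, for illustration only
nested = "<change><s>old</s><change><u>inner</u></change></change>"
print(extract_changes_nested(nested))
```

A non-greedy regex would stop at the first `</change>` and cut the outer revision short, while tracking depth keeps the whole outer block together. Whether inner revisions should instead be reported separately depends on your review workflow.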