The /cite endpoint finds the exact location of text within a parsed document. Given a text string, it returns the bounding boxes where that text appears in the original document. This is useful for highlighting citations, building document viewers, and verifying extracted data against source locations.

Prerequisites

The document must be parsed with OCR data enabled:
from reducto import Reducto

client = Reducto()

# Parse with OCR data enabled
result = client.parse.run(
    document_url="https://example.com/document.pdf",
    options={"return_ocr_data": True}
)

Basic usage

Using a job ID

If you have a job ID from a previous parse operation, you can reference it directly:
import requests

response = requests.post(
    "https://your-reducto-instance/cite",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "source": "jobid://your-job-id",
        "queries": [
            {"text": "Total Revenue"}
        ]
    }
)

result = response.json()

Using a parse result directly

You can also pass the full parse result object:
import requests

# First, parse the document with OCR data
parse_response = requests.post(
    "https://your-reducto-instance/parse",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "document_url": "https://example.com/document.pdf",
        "options": {"return_ocr_data": True}
    }
)

parse_result = parse_response.json()["result"]

# Then find citations
cite_response = requests.post(
    "https://your-reducto-instance/cite",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "source": parse_result,
        "queries": [
            {"text": "Total Revenue"},
            {"text": "Net Income"}
        ]
    }
)

citations = cite_response.json()

Request format

| Field | Type | Required | Description |
|---|---|---|---|
| source | string or object | Yes | Either a jobid://<job_id> string or the full parse result object. The parse must have been run with return_ocr_data=true. |
| queries | array | Yes | List of text citations to locate. |

Query object

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to locate. Whitespace is normalized for matching. |
| bbox_filter | object | No | Optional bounding box to limit the search region. |

Bounding box filter

To restrict the search to a specific region of a page, add a bbox_filter to the query:
{
    "source": "jobid://your-job-id",
    "queries": [
        {
            "text": "Amount",
            "bbox_filter": {
                "page": 1,
                "left": 0.0,
                "top": 0.0,
                "width": 0.5,
                "height": 0.5
            }
        }
    ]
}
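
A request body like the one above can also be assembled in code. The following is a minimal sketch: `make_bbox_filter` and `build_cite_request` are hypothetical helper names (they are not part of the Reducto SDK), and the field names are copied from the tables and JSON example above.

```python
from typing import Any, Optional


def make_bbox_filter(page: int, left: float, top: float,
                     width: float, height: float) -> dict[str, Any]:
    """Build a bbox_filter object using the field names from the example above."""
    return {"page": page, "left": left, "top": top,
            "width": width, "height": height}


def build_cite_request(source: Any, texts: list[str],
                       bbox_filter: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Build a /cite request body (hypothetical helper, not an SDK function).

    The same optional bbox_filter is attached to every query.
    """
    queries = []
    for text in texts:
        query: dict[str, Any] = {"text": text}
        if bbox_filter is not None:
            query["bbox_filter"] = bbox_filter
        queries.append(query)
    return {"source": source, "queries": queries}


payload = build_cite_request(
    "jobid://your-job-id",
    ["Amount"],
    bbox_filter=make_bbox_filter(page=1, left=0.0, top=0.0, width=0.5, height=0.5),
)
```

The resulting `payload` can be passed as the `json=` argument of `requests.post`, as in the earlier examples.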

Response format

{
    "results": [
        {
            "matches": [
                {
                    "page": 1,
                    "bboxes": [
                        {
                            "page": 1,
                            "left": 0.123,
                            "top": 0.456,
                            "width": 0.089,
                            "height": 0.023
                        }
                    ]
                }
            ]
        }
    ],
    "duration": 0.045
}

| Field | Description |
|---|---|
| results | Array of results in the same order as input queries (1:1 correspondence). |
| results[].matches | All locations where the text was found. Empty array if no matches. |
| results[].matches[].page | Page number (1-indexed) where the match was found. |
| results[].matches[].bboxes | Bounding boxes for the match. Multiple boxes are returned for multi-line text. |
| duration | Processing time in seconds. |
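
To make the nesting concrete, here is a small sketch that walks a response of the shape shown above and yields every bounding box together with the index of the query it belongs to. `iter_bboxes` is a hypothetical helper name, and the sample data simply mirrors the example response.

```python
from typing import Any, Iterator


def iter_bboxes(cite_result: dict[str, Any]) -> Iterator[tuple[int, dict[str, Any]]]:
    """Yield (query_index, bbox) for every bounding box in a /cite response."""
    for query_index, result in enumerate(cite_result.get("results", [])):
        for match in result.get("matches", []):
            for bbox in match.get("bboxes", []):
                yield query_index, bbox


# Sample data copied from the example response above.
sample = {
    "results": [
        {
            "matches": [
                {
                    "page": 1,
                    "bboxes": [
                        {"page": 1, "left": 0.123, "top": 0.456,
                         "width": 0.089, "height": 0.023}
                    ],
                }
            ]
        }
    ],
    "duration": 0.045,
}

boxes = list(iter_bboxes(sample))
```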

Text matching behavior

The endpoint normalizes text for matching:
  • Converts to lowercase
  • Removes punctuation
  • Collapses whitespace
This means a query for "Total Revenue" will match "total revenue", "Total Revenue", or "TOTAL REVENUE" in the document.
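
The three normalization steps can be approximated locally, for example to predict whether two strings are likely to be treated as the same query. This is only a sketch: the endpoint's exact rules are not documented here, so treating "punctuation" as the ASCII set in `string.punctuation` is an assumption, and `normalize` is a hypothetical helper name.

```python
import string

# Assumption: "punctuation" means ASCII punctuation. The service may differ.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    without_punct = lowered.translate(_PUNCT_TABLE)
    # str.split() with no arguments splits on any whitespace run and
    # discards leading/trailing whitespace.
    return " ".join(without_punct.split())


same = normalize("TOTAL   Revenue:") == normalize("total revenue")
```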

Multiple queries

You can search for multiple text strings in a single request. Results are returned in the same order as the queries:
import requests

response = requests.post(
    "https://your-reducto-instance/cite",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "source": "jobid://your-job-id",
        "queries": [
            {"text": "Revenue"},
            {"text": "Expenses"},
            {"text": "Net Income"}
        ]
    }
)

result = response.json()
# result["results"][0] corresponds to "Revenue"
# result["results"][1] corresponds to "Expenses"
# result["results"][2] corresponds to "Net Income"
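
Because results come back in query order, they can be paired with their queries by position instead of by hard-coded index. A minimal sketch, with `pair_queries_with_results` as a hypothetical helper name:

```python
from typing import Any


def pair_queries_with_results(
    queries: list[dict[str, Any]], cite_result: dict[str, Any]
) -> list[tuple[str, list[dict[str, Any]]]]:
    """Return (query_text, matches) pairs, relying on the 1:1 ordering."""
    results = cite_result["results"]
    if len(results) != len(queries):
        raise ValueError(
            f"expected {len(queries)} results, got {len(results)}"
        )
    return [(q["text"], r["matches"]) for q, r in zip(queries, results)]


# Illustrative data only: two queries, the second with no matches.
queries = [{"text": "Revenue"}, {"text": "Expenses"}]
fake_result = {
    "results": [
        {"matches": [{"page": 1, "bboxes": []}]},
        {"matches": []},
    ]
}
pairs = pair_queries_with_results(queries, fake_result)
missing = [text for text, matches in pairs if not matches]
```

A list of pairs is used rather than a dictionary so that two queries with identical text do not overwrite each other.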

Multi-line text

When the matched text spans multiple lines, the response includes separate bounding boxes for each line:
{
    "matches": [
        {
            "page": 1,
            "bboxes": [
                {
                    "page": 1,
                    "left": 0.1,
                    "top": 0.2,
                    "width": 0.3,
                    "height": 0.02
                },
                {
                    "page": 1,
                    "left": 0.1,
                    "top": 0.22,
                    "width": 0.25,
                    "height": 0.02
                }
            ]
        }
    ]
}
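
If a viewer needs a single rectangle per match rather than one per line, the per-line boxes can be merged into their enclosing box. The sketch below assumes all boxes of a match share one page and the same coordinate system; `enclosing_bbox` is a hypothetical helper name.

```python
from typing import Any


def enclosing_bbox(bboxes: list[dict[str, Any]]) -> dict[str, Any]:
    """Return the smallest box containing every box in `bboxes`."""
    if not bboxes:
        raise ValueError("bboxes must not be empty")
    pages = {b["page"] for b in bboxes}
    if len(pages) != 1:
        raise ValueError("all boxes must be on the same page")
    left = min(b["left"] for b in bboxes)
    top = min(b["top"] for b in bboxes)
    right = max(b["left"] + b["width"] for b in bboxes)
    bottom = max(b["top"] + b["height"] for b in bboxes)
    return {"page": pages.pop(), "left": left, "top": top,
            "width": right - left, "height": bottom - top}


# The two per-line boxes from the example above.
lines = [
    {"page": 1, "left": 0.1, "top": 0.2, "width": 0.3, "height": 0.02},
    {"page": 1, "left": 0.1, "top": 0.22, "width": 0.25, "height": 0.02},
]
merged = enclosing_bbox(lines)
```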

Error handling

| Status Code | Description |
|---|---|
| 400 | Invalid source format, or OCR data not available (document was not parsed with return_ocr_data=true). |
| 401 | Invalid or missing API key. |
| 404 | Job ID not found or not accessible. |
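
One way to surface these statuses in client code is to translate them into a single exception type. This is a sketch under assumptions: `CiteError` and `check_cite_status` are hypothetical names, the messages paraphrase the table above, and the shape of the error response body is not assumed at all.

```python
class CiteError(Exception):
    """Raised when a /cite request returns an error status (hypothetical type)."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(f"{status_code}: {message}")
        self.status_code = status_code
        self.message = message


# Messages paraphrase the status table above.
_CITE_ERRORS = {
    400: "Invalid source format, or the document was parsed without return_ocr_data=true.",
    401: "Invalid or missing API key.",
    404: "Job ID not found or not accessible.",
}


def check_cite_status(status_code: int) -> None:
    """Raise CiteError for documented (and any other 4xx/5xx) statuses."""
    if status_code in _CITE_ERRORS:
        raise CiteError(status_code, _CITE_ERRORS[status_code])
    if status_code >= 400:
        raise CiteError(status_code, "Unexpected error status.")
```

With the `requests` examples on this page, it would be called as `check_cite_status(response.status_code)` before reading `response.json()`.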

Use cases

Highlighting extracted values

After extracting structured data, use /cite to highlight where each value appears in the original document:
import requests
from reducto import Reducto

client = Reducto()

# Extract data
extract_response = client.extract.run(
    document_url="https://example.com/invoice.pdf",
    schema={
        "type": "object",
        "properties": {
            "total": {"type": "string"},
            "vendor": {"type": "string"}
        }
    },
    options={"return_ocr_data": True}
)

extracted = extract_response.result.data
job_id = extract_response.job_id

# Find where each value appears
cite_response = requests.post(
    "https://your-reducto-instance/cite",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "source": f"jobid://{job_id}",
        "queries": [
            {"text": extracted["total"]},
            {"text": extracted["vendor"]}
        ]
    }
)

Building document viewers

Use the bounding boxes to draw highlights or annotations on document pages in your application.
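
For drawing, a box usually has to be converted to pixel coordinates of the rendered page. The sketch below assumes that `left`, `top`, `width`, and `height` are fractions of the page width and height measured from the top-left corner, which the sample values on this page suggest but which is not stated explicitly. `bbox_to_pixels` is a hypothetical helper name.

```python
from typing import Any


def bbox_to_pixels(
    bbox: dict[str, Any], page_width_px: int, page_height_px: int
) -> tuple[int, int, int, int]:
    """Convert a bbox to (x0, y0, x1, y1) pixel corners.

    Assumption: coordinates are fractions of the page size with the
    origin at the top-left corner.
    """
    x0 = round(bbox["left"] * page_width_px)
    y0 = round(bbox["top"] * page_height_px)
    x1 = round((bbox["left"] + bbox["width"]) * page_width_px)
    y1 = round((bbox["top"] + bbox["height"]) * page_height_px)
    return x0, y0, x1, y1


# Illustrative box on a page rendered at 1000 x 800 pixels.
rect = bbox_to_pixels(
    {"page": 1, "left": 0.25, "top": 0.5, "width": 0.5, "height": 0.25},
    page_width_px=1000,
    page_height_px=800,
)
```

The corner tuple matches the rectangle format expected by many drawing APIs, so it can be handed to whichever rendering library the viewer uses.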

Data verification

Verify that extracted values actually appear in the expected locations within the source document.
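
A verification step could check that at least one match for a value lies entirely inside a region where that value is expected. This is a minimal sketch: `bbox_within` and `value_found_in_region` are hypothetical helper names, and the region uses the same field names as `bbox_filter`.

```python
from typing import Any


def bbox_within(bbox: dict[str, Any], region: dict[str, Any],
                tol: float = 1e-6) -> bool:
    """True if `bbox` is on the region's page and fully inside the region."""
    if bbox["page"] != region["page"]:
        return False
    return (
        bbox["left"] >= region["left"] - tol
        and bbox["top"] >= region["top"] - tol
        and bbox["left"] + bbox["width"] <= region["left"] + region["width"] + tol
        and bbox["top"] + bbox["height"] <= region["top"] + region["height"] + tol
    )


def value_found_in_region(matches: list[dict[str, Any]],
                          region: dict[str, Any]) -> bool:
    """True if any match has all of its boxes inside `region`."""
    return any(
        all(bbox_within(b, region) for b in m["bboxes"])
        for m in matches
        if m["bboxes"]
    )


# Match copied from the example response; the regions are illustrative.
matches = [{"page": 1, "bboxes": [
    {"page": 1, "left": 0.123, "top": 0.456, "width": 0.089, "height": 0.023}
]}]
expected_region = {"page": 1, "left": 0.0, "top": 0.4, "width": 0.5, "height": 0.2}
ok = value_found_in_region(matches, expected_region)
```

A small tolerance is included because floating-point sums such as `left + width` can land marginally outside an otherwise exact boundary.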