The /cite endpoint allows you to find the exact location of text within a parsed document. Given a text string, it returns the bounding boxes where that text appears in the original document. This is useful for highlighting citations, building document viewers, or verifying extracted data against source locations.
Prerequisites
The document must be parsed with OCR data enabled:
from reducto import Reducto

client = Reducto()

# Parse with OCR data enabled
result = client.parse.run(
    document_url="https://example.com/document.pdf",
    options={"return_ocr_data": True}
)
Basic usage
Using a job ID
If you have a job ID from a previous parse operation, you can reference it directly:
import requests

response = requests.post(
    "https://your-reducto-instance/cite",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "source": "jobid://your-job-id",
        "queries": [
            {"text": "Total Revenue"}
        ]
    }
)
result = response.json()
Using a parse result directly
You can also pass the full parse result object:
import requests

# First, parse the document with OCR data
parse_response = requests.post(
    "https://your-reducto-instance/parse",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "document_url": "https://example.com/document.pdf",
        "options": {"return_ocr_data": True}
    }
)
parse_result = parse_response.json()["result"]

# Then find citations
cite_response = requests.post(
    "https://your-reducto-instance/cite",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "source": parse_result,
        "queries": [
            {"text": "Total Revenue"},
            {"text": "Net Income"}
        ]
    }
)
citations = cite_response.json()
Request parameters
| Field | Type | Required | Description |
|---|---|---|---|
| source | string or object | Yes | Either a jobid://<job_id> string or the full parse result object. The parse must have been run with return_ocr_data=true. |
| queries | array | Yes | List of query objects, one per text string to locate. |
Query object
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text to locate. Whitespace is normalized for matching. |
| bbox_filter | object | No | Optional bounding box to limit the search region. |
Bounding box filter
When you want to search within a specific region of a page:
{
  "source": "jobid://your-job-id",
  "queries": [
    {
      "text": "Amount",
      "bbox_filter": {
        "page": 1,
        "left": 0.0,
        "top": 0.0,
        "width": 0.5,
        "height": 0.5
      }
    }
  ]
}
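As a rough sketch, a query with an optional region can be assembled in Python before sending the request. The helper name `make_query` is hypothetical, and the check that all five keys are present is an assumption based on the example above, not a documented requirement.

```python
from typing import Optional

# Keys shown in the bbox_filter example above.
BBOX_KEYS = ("page", "left", "top", "width", "height")


def make_query(text: str, region: Optional[dict] = None) -> dict:
    """Build one query object; `region` becomes bbox_filter when given.

    Hypothetical helper: requiring every key in BBOX_KEYS is an assumption.
    """
    query = {"text": text}
    if region is not None:
        missing = [key for key in BBOX_KEYS if key not in region]
        if missing:
            raise ValueError(f"bbox_filter is missing keys: {missing}")
        query["bbox_filter"] = {key: region[key] for key in BBOX_KEYS}
    return query


# Request body limited to the top-left quarter of page 1.
payload = {
    "source": "jobid://your-job-id",
    "queries": [
        make_query(
            "Amount",
            {"page": 1, "left": 0.0, "top": 0.0, "width": 0.5, "height": 0.5},
        )
    ],
}
```

The resulting `payload` can be passed as the `json=` argument of the `requests.post` calls shown earlier.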
Response format
{
  "results": [
    {
      "matches": [
        {
          "page": 1,
          "bboxes": [
            {
              "page": 1,
              "left": 0.123,
              "top": 0.456,
              "width": 0.089,
              "height": 0.023
            }
          ]
        }
      ]
    }
  ],
  "duration": 0.045
}
| Field | Description |
|---|---|
| results | Array of results in the same order as input queries (1:1 correspondence). |
| results[].matches | All locations where the text was found. Empty array if no matches. |
| results[].matches[].page | Page number (1-indexed) where the match was found. |
| results[].matches[].bboxes | Bounding boxes for the match. Multiple boxes are returned for multi-line text. |
| duration | Processing time in seconds. |
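The nested structure described in the table can be walked with a small generator. This is a minimal sketch; `iter_boxes` is a hypothetical name, and the sample dictionary simply mirrors the example response.

```python
from typing import Iterator, Tuple


def iter_boxes(cite_result: dict) -> Iterator[Tuple[int, int, dict]]:
    """Yield (query_index, page, bbox) for every bounding box in a response.

    query_index is the position of the originating query, relying on the
    1:1 ordering between queries and results.
    """
    for query_index, result in enumerate(cite_result.get("results", [])):
        for match in result.get("matches", []):
            for box in match.get("bboxes", []):
                yield query_index, box["page"], box


# Sample data copied from the example response.
sample = {
    "results": [
        {
            "matches": [
                {
                    "page": 1,
                    "bboxes": [
                        {"page": 1, "left": 0.123, "top": 0.456,
                         "width": 0.089, "height": 0.023}
                    ],
                }
            ]
        }
    ],
    "duration": 0.045,
}

boxes = list(iter_boxes(sample))
```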
Text matching behavior
The endpoint normalizes text for matching:
- Converts to lowercase
- Removes punctuation
- Collapses whitespace
This means a query for "Total Revenue" will match "total revenue", "Total Revenue", or "TOTAL REVENUE" in the document.
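To get a feel for which strings are treated as equal, the three steps can be approximated locally. This is only an illustration: the service's exact rules (for example, how it treats non-ASCII punctuation) are not specified here, and `normalize` is a hypothetical helper, not part of any SDK.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Approximate the matching rules: lowercase, drop punctuation, collapse whitespace."""
    lowered = text.lower()
    without_punct = lowered.translate(_STRIP_PUNCT)
    # split() with no arguments splits on any run of whitespace.
    return " ".join(without_punct.split())


same = normalize("TOTAL   Revenue:") == normalize("total revenue")
```

Under these assumed rules, `same` is true because both strings reduce to `total revenue`.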
Multiple queries
You can search for multiple text strings in a single request. Results are returned in the same order as the queries:
import requests

response = requests.post(
    "https://your-reducto-instance/cite",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "source": "jobid://your-job-id",
        "queries": [
            {"text": "Revenue"},
            {"text": "Expenses"},
            {"text": "Net Income"}
        ]
    }
)
result = response.json()
# result["results"][0] corresponds to "Revenue"
# result["results"][1] corresponds to "Expenses"
# result["results"][2] corresponds to "Net Income"
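Because results line up with queries by position, the two lists can be zipped to see which strings were not found. A minimal sketch, with `unmatched_queries` as a hypothetical helper and hand-written sample data:

```python
from typing import List


def unmatched_queries(queries: List[dict], cite_result: dict) -> List[str]:
    """Return the text of every query whose result has an empty matches array."""
    results = cite_result["results"]
    if len(results) != len(queries):
        # The 1:1 correspondence should make this impossible; fail loudly if not.
        raise ValueError("number of results does not match number of queries")
    return [
        query["text"]
        for query, result in zip(queries, results)
        if not result["matches"]
    ]


# Illustrative data only: "Expenses" has no matches.
queries = [{"text": "Revenue"}, {"text": "Expenses"}, {"text": "Net Income"}]
sample_result = {
    "results": [
        {"matches": [{"page": 1, "bboxes": []}]},
        {"matches": []},
        {"matches": [{"page": 2, "bboxes": []}]},
    ]
}

not_found = unmatched_queries(queries, sample_result)
```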
Multi-line text
When the matched text spans multiple lines, the response includes separate bounding boxes for each line:
{
  "matches": [
    {
      "page": 1,
      "bboxes": [
        {
          "page": 1,
          "left": 0.1,
          "top": 0.2,
          "width": 0.3,
          "height": 0.02
        },
        {
          "page": 1,
          "left": 0.1,
          "top": 0.22,
          "width": 0.25,
          "height": 0.02
        }
      ]
    }
  ]
}
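If a single rectangle per match is more convenient than one per line, the per-line boxes can be merged into their enclosing box. This is a sketch under the assumption that all boxes of a match sit on the same page; `union_box` is a hypothetical helper.

```python
from typing import List


def union_box(bboxes: List[dict]) -> dict:
    """Return the smallest box that encloses every box in `bboxes`."""
    if not bboxes:
        raise ValueError("cannot merge an empty list of boxes")
    pages = {box["page"] for box in bboxes}
    if len(pages) != 1:
        raise ValueError("boxes span more than one page")
    left = min(box["left"] for box in bboxes)
    top = min(box["top"] for box in bboxes)
    right = max(box["left"] + box["width"] for box in bboxes)
    bottom = max(box["top"] + box["height"] for box in bboxes)
    return {
        "page": bboxes[0]["page"],
        "left": left,
        "top": top,
        "width": right - left,
        "height": bottom - top,
    }


# The two line boxes from the multi-line example.
lines = [
    {"page": 1, "left": 0.1, "top": 0.2, "width": 0.3, "height": 0.02},
    {"page": 1, "left": 0.1, "top": 0.22, "width": 0.25, "height": 0.02},
]
merged = union_box(lines)
```

For the example data, the merged box starts at the first line's top-left corner, is as wide as the wider line, and covers both line heights.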
Error handling
| Status Code | Description |
|---|---|
| 400 | Invalid source format, or OCR data not available (document was not parsed with return_ocr_data=true). |
| 401 | Invalid or missing API key. |
| 404 | Job ID not found or not accessible. |
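One way to surface these status codes in client code is to map them to readable exceptions. The names `CiteError` and `raise_for_cite_status` are hypothetical, and the fallback message for other codes is an assumption; only 400, 401, and 404 come from the table.

```python
# Messages taken from the status code table.
CITE_ERRORS = {
    400: "Invalid source format, or OCR data not available "
         "(was the document parsed with return_ocr_data=true?)",
    401: "Invalid or missing API key",
    404: "Job ID not found or not accessible",
}


class CiteError(Exception):
    """Raised when a /cite request returns an error status (hypothetical class)."""


def raise_for_cite_status(status_code: int) -> None:
    """Raise CiteError for 4xx/5xx codes; return None otherwise."""
    if status_code < 400:
        return
    reason = CITE_ERRORS.get(status_code, "Unexpected error")
    raise CiteError(f"{status_code}: {reason}")
```

With the `requests` examples above, this would be called as `raise_for_cite_status(response.status_code)` before reading `response.json()`.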
Use cases
Highlighting citations
After extracting structured data, use /cite to highlight where each value appears in the original document:
import requests
from reducto import Reducto

client = Reducto()

# Extract data
extract_response = client.extract.run(
    document_url="https://example.com/invoice.pdf",
    schema={
        "type": "object",
        "properties": {
            "total": {"type": "string"},
            "vendor": {"type": "string"}
        }
    },
    options={"return_ocr_data": True}
)
extracted = extract_response.result.data
job_id = extract_response.job_id

# Find where each value appears
cite_response = requests.post(
    "https://your-reducto-instance/cite",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "source": f"jobid://{job_id}",
        "queries": [
            {"text": extracted["total"]},
            {"text": extracted["vendor"]}
        ]
    }
)
Building document viewers
Use the bounding boxes to draw highlights or annotations on document pages in your application.
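A viewer usually needs pixel coordinates. The sketch below assumes the box values are fractions of the page width and height measured from the top-left corner, which the example values suggest but this page does not state; `to_pixels` is a hypothetical helper.

```python
from typing import Tuple


def to_pixels(box: dict, page_width_px: int, page_height_px: int) -> Tuple[int, int, int, int]:
    """Convert a bbox to (x0, y0, x1, y1) pixel corners on a rendered page.

    Assumes left/top/width/height are fractions in [0, 1] with the origin
    at the top-left of the page.
    """
    x0 = round(box["left"] * page_width_px)
    y0 = round(box["top"] * page_height_px)
    x1 = round((box["left"] + box["width"]) * page_width_px)
    y1 = round((box["top"] + box["height"]) * page_height_px)
    return x0, y0, x1, y1


# Illustrative box on a page rendered at 1000 x 800 pixels.
rect = to_pixels(
    {"page": 1, "left": 0.5, "top": 0.25, "width": 0.25, "height": 0.5},
    page_width_px=1000,
    page_height_px=800,
)
```

The returned corners can be handed to whatever drawing layer the application uses, such as a canvas overlay or an image library's rectangle call.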
Data verification
Verify that extracted values actually appear in the expected locations within the source document.
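One possible check is to test whether any match lies entirely inside an expected region. This is a sketch, not a prescribed method: `box_within` and `found_in_region` are hypothetical helpers, and the region uses the same keys as `bbox_filter`.

```python
def box_within(box: dict, region: dict, tolerance: float = 0.0) -> bool:
    """True if `box` is on the region's page and fully inside it."""
    if box["page"] != region["page"]:
        return False
    return (
        box["left"] >= region["left"] - tolerance
        and box["top"] >= region["top"] - tolerance
        and box["left"] + box["width"] <= region["left"] + region["width"] + tolerance
        and box["top"] + box["height"] <= region["top"] + region["height"] + tolerance
    )


def found_in_region(result: dict, region: dict, tolerance: float = 0.0) -> bool:
    """True if at least one match has all of its boxes inside `region`."""
    return any(
        bool(match["bboxes"])
        and all(box_within(box, region, tolerance) for box in match["bboxes"])
        for match in result["matches"]
    )


# Expected location: top-left quarter of page 1.
expected = {"page": 1, "left": 0.0, "top": 0.0, "width": 0.5, "height": 0.5}
inside = {"matches": [{"page": 1, "bboxes": [
    {"page": 1, "left": 0.25, "top": 0.25, "width": 0.125, "height": 0.125}]}]}
outside = {"matches": [{"page": 1, "bboxes": [
    {"page": 1, "left": 0.5, "top": 0.25, "width": 0.25, "height": 0.125}]}]}

ok = found_in_region(inside, expected)
not_ok = found_in_region(outside, expected)
```

Here `result` is one entry of the `results` array. Passing a `bbox_filter` in the query narrows the search on the server side instead; the local check is useful when the full set of matches is also needed.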