
Overview

Reducto provides bounding box citations so you can trace extracted data back to its original location in a document. This matters most for compliance, debugging, and building trust in industries where extracted data must be grounded in verifiable evidence.

Why use citations?

  • Traceability — Confirm where each extracted value came from in the source file.
  • Compliance — Maintain audit trails for regulated workflows.
  • Debugging — Compare schema outputs against original text to fix errors.
  • User experience — In Studio, click between the output field and the highlighted bounding box on the document.

How to enable citations

result = client.extract.run(
    document_url=upload,
    schema=schema,
    system_prompt=system_prompt,
    generate_citations=True,
)

How citations work

When generate_citations is enabled, Reducto includes bounding box metadata for each extracted field. Citation data is returned in the citations field, separate from result, which holds the data in your specified schema structure. Each bounding box (bbox) gives the location of the cited text in the document, with coordinates relative to the top left corner of the page:
Sample response
{
  "citations": [
    {
      "sample_extracted_field": [
        {
          // Relative to top left corner of the page
          "bbox": {
            "left": 0.1,
            "top": 0.2,
            "width": 0.3,
            "height": 0.05,
            "page": 1,
            "original_page": 1
          },
          "confidence": "high",
          "content": "granular citation",
          "image_url": null,
          // For granular citations, this is the parent block that contains the citation
          "parentBlock": {
            "bbox": {
              "left": 0.1,
              "top": 0.9,
              "width": 0.8,
              "height": 0.05,
              "page": 1
            },
            "block_type": "Text",
            "confidence": "high",
            "content": "This is the full sentence with the granular citation."
          },
          "type": "Text"
        }
      ]
    }
  ],
  "result": ....
}
Fields in each citation:
  • left, top, width, height: Coordinates normalized to the page dimensions (values in the [0, 1] range), measured from the top left corner of the page.
  • parentBlock: Because citations can be very granular, parentBlock is the Parse block that contains the extracted data. It provides surrounding context for the citation.
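To draw a highlight on a rendered page, the normalized values have to be scaled by that page's size. The following is a minimal sketch, assuming you already know the pixel width and height of your own page rendering. The function name and the page dimensions are illustrative and not part of the Reducto API.

```python
def bbox_to_pixels(bbox, page_width_px, page_height_px):
    """Scale a normalized bbox (top-left origin) to pixel edges.

    Returns (left, top, right, bottom) as integers.
    page_width_px and page_height_px describe your own rendering of the
    page; they are assumptions and do not come from the citation itself.
    """
    left = round(bbox["left"] * page_width_px)
    top = round(bbox["top"] * page_height_px)
    right = round((bbox["left"] + bbox["width"]) * page_width_px)
    bottom = round((bbox["top"] + bbox["height"]) * page_height_px)
    return left, top, right, bottom


# bbox values copied from the sample response above
sample_bbox = {"left": 0.1, "top": 0.2, "width": 0.3, "height": 0.05, "page": 1}

# Hypothetical page rendered at 1000 x 2000 pixels
print(bbox_to_pixels(sample_bbox, 1000, 2000))  # (100, 400, 400, 500)
```

The (left, top, right, bottom) form is what most drawing libraries expect for rectangles. Note that page is 1-indexed in the sample, so subtract 1 if your page images live in a zero-indexed list.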
In Studio, citations are two-way links:
  • Click an extracted field to see its highlight in the document.
  • Click a highlighted bounding box to locate the field in the output.
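Outside Studio, a similar field-to-location lookup can be built from the citations field. The sketch below assumes the response has already been parsed into a plain Python dict with the same flat shape as the sample response (a list of objects mapping a field name to a list of citations). Nested schemas may need extra handling, and flatten_citations is a hypothetical helper, not an SDK function.

```python
def flatten_citations(response):
    """Turn the citations field into one row per cited location.

    Assumes the flat shape from the sample response:
    citations -> [ { field_name: [ {bbox, content, parentBlock, ...} ] } ]
    """
    rows = []
    for group in response.get("citations", []):
        for field_name, entries in group.items():
            for entry in entries:
                # parentBlock may be missing for non-granular citations (assumption)
                parent = entry.get("parentBlock") or {}
                rows.append({
                    "field": field_name,
                    "page": entry["bbox"]["page"],
                    "content": entry["content"],
                    "context": parent.get("content"),
                })
    return rows


# Trimmed copy of the sample response above
sample_response = {
    "citations": [
        {
            "sample_extracted_field": [
                {
                    "bbox": {"left": 0.1, "top": 0.2, "width": 0.3,
                             "height": 0.05, "page": 1, "original_page": 1},
                    "content": "granular citation",
                    "parentBlock": {
                        "content": "This is the full sentence with the granular citation."
                    },
                }
            ]
        }
    ]
}

for row in flatten_citations(sample_response):
    print(row["field"], "-> page", row["page"], "|", row["content"])
```

Keeping the parent block's content next to each citation makes it easier to review a value in context without opening the document.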

Spreadsheet citations

Excel and other spreadsheet formats handle citations differently from PDFs and images.
Coordinate system:
  • Excel: Uses actual row/column positions (1-indexed). For example, cell A1 would have coordinates left: 1, top: 1, width: 1, height: 1.
  • Other formats: Use normalized coordinates in the [0, 1] range relative to page dimensions.
Page field:
  • Excel: The page field represents the sheet index (1-indexed). Sheet 1 = page 1, Sheet 2 = page 2, etc.
  • Other formats: The page field represents the actual page number in the document.
Example Excel citation:
JSON output
{
  "bbox": {
    "left": 2,      // Column B (1-indexed)
    "top": 5,       // Row 5 (1-indexed) 
    "width": 1,     // 1 column wide
    "height": 1,    // 1 row tall
    "page": 1,      // First sheet
    "original_page": 1
  }
}
This allows for precise cell-level citations that correspond directly to Excel’s native coordinate system.
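Because the values are plain 1-indexed row and column numbers, an Excel citation can be converted to a familiar A1-style reference. This is a minimal sketch. Both helper names are hypothetical, and treating width and height greater than 1 as a multi-cell range is an assumption based on the "1 column wide" and "1 row tall" comments above.

```python
def column_letter(index):
    """Convert a 1-indexed column number to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def excel_bbox_to_range(bbox):
    """Build an A1-style reference from an Excel citation bbox.

    Assumes left/top are the 1-indexed column/row of the first cell and
    width/height count the columns/rows covered.
    """
    first = f"{column_letter(bbox['left'])}{bbox['top']}"
    last_col = bbox["left"] + bbox["width"] - 1
    last_row = bbox["top"] + bbox["height"] - 1
    last = f"{column_letter(last_col)}{last_row}"
    return first if first == last else f"{first}:{last}"


# bbox copied from the example Excel citation above
excel_bbox = {"left": 2, "top": 5, "width": 1, "height": 1,
              "page": 1, "original_page": 1}

print(excel_bbox_to_range(excel_bbox))  # B5
```

The page value identifies the sheet by its 1-indexed position, so pair the reference with the sheet at that position if the workbook has more than one sheet.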