Citations link each extracted value to its source location in the document. When enabled, every field in your extraction result includes coordinates pointing to where the value was found. This matters for three reasons:
  1. Verification: Confirm extractions are correct by seeing the source text
  2. Debugging: When values are wrong, citations show where the model looked
  3. User experience: Let users click from extracted data to the original location

Enabling Citations

Add citations.enabled to your extraction settings:
result = client.extract.run(
    input=upload.file_id,
    instructions={
        "schema": schema,
        "system_prompt": "Extract invoice details."
    },
    settings={
        "citations": {
            "enabled": True
        }
    }
)

Citation Structure

With citations enabled, the response format changes. The result becomes an object (instead of an array), and each value is wrapped with citation data:
{
  "result": {
    "invoice_total": {
      "value": 1575.00,
      "citations": [
        {
          "type": "Table",
          "content": "Total Due: $1,575.00",
          "bbox": {
            "left": 0.65,
            "top": 0.82,
            "width": 0.25,
            "height": 0.03,
            "page": 1,
            "original_page": 1
          },
          "confidence": "high",
          "granular_confidence": {
            "extract_confidence": 0.95,
            "parse_confidence": 0.91
          },
          "parentBlock": {
            "type": "Table",
            "content": "Invoice Total\nSubtotal: $1,500.00\nTax: $75.00\nTotal Due: $1,575.00",
            "bbox": {"left": 0.60, "top": 0.75, "width": 0.35, "height": 0.12, "page": 1}
          }
        }
      ]
    }
  }
}
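
Downstream code often wants the plain values without the citation wrappers. The following is a minimal sketch, not part of the SDK: it assumes the response has been parsed into ordinary Python dicts and lists (for example with json.loads), and the helper names is_cited_value and unwrap are hypothetical. It treats any object with both a "value" and a "citations" key as a wrapper, so a schema that defines fields with those exact names would need a stricter check.

```python
from typing import Any


def is_cited_value(node: Any) -> bool:
    """True if node looks like a citation wrapper: {"value": ..., "citations": [...]}."""
    return isinstance(node, dict) and "value" in node and "citations" in node


def unwrap(node: Any) -> Any:
    """Recursively replace citation wrappers with their bare values."""
    if is_cited_value(node):
        return unwrap(node["value"])
    if isinstance(node, dict):
        return {key: unwrap(child) for key, child in node.items()}
    if isinstance(node, list):
        return [unwrap(child) for child in node]
    return node


# Trimmed copy of the response shown above
response = {
    "result": {
        "invoice_total": {
            "value": 1575.00,
            "citations": [{"type": "Table", "content": "Total Due: $1,575.00"}],
        }
    }
}

print(unwrap(response["result"]))  # {'invoice_total': 1575.0}
```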

Citation Fields

Field | Description
type | The block type where the value was found: Text, Table, Key Value, Title, etc.
content | The source text that was extracted from. May include more context than just the value.
bbox | Bounding box coordinates for the source location.
confidence | Overall confidence as "high" or "low".
granular_confidence | Detailed scores: extract_confidence (0-1) and parse_confidence (0-1).
parentBlock | The larger Parse block containing this citation. Useful for understanding context.

Bounding Box Coordinates

Coordinates are normalized to the range [0, 1] relative to page dimensions:
Coordinate | Meaning
left | Distance from the left edge of the page (0 = left edge, 1 = right edge)
top | Distance from the top edge of the page (0 = top, 1 = bottom)
width | Width as a fraction of page width
height | Height as a fraction of page height
page | Page number (1-indexed) in the processed result
original_page | Page number in the original document
To convert to pixel coordinates:
def to_pixels(bbox, page_width, page_height):
    return {
        "left": bbox.left * page_width,
        "top": bbox.top * page_height,
        "width": bbox.width * page_width,
        "height": bbox.height * page_height
    }

# For a US Letter page rendered at 72 DPI (612x792 pixels; 1 point = 1 pixel)
pixels = to_pixels(citation.bbox, 612, 792)
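
Drawing or cropping a highlight usually needs integer corner coordinates rather than a left/top/width/height box, and the pixel size depends on the resolution the page was rendered at. The sketch below is one way to do that under stated assumptions: the bbox is a plain dict as it appears in the JSON response, the page has already been rendered to an image of known size, and bbox_to_rect is a hypothetical helper name, not an SDK function.

```python
from typing import Tuple


def bbox_to_rect(bbox: dict, image_width: int, image_height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized bbox to (x0, y0, x1, y1) pixel corners, clamped to the image."""

    def clamp(value: float, upper: int) -> int:
        # Round to the nearest pixel and keep the result inside the image
        return max(0, min(upper, round(value)))

    x0 = clamp(bbox["left"] * image_width, image_width)
    y0 = clamp(bbox["top"] * image_height, image_height)
    x1 = clamp((bbox["left"] + bbox["width"]) * image_width, image_width)
    y1 = clamp((bbox["top"] + bbox["height"]) * image_height, image_height)
    return (x0, y0, x1, y1)


# Bbox from the invoice_total example, on a page rendered at 1700x2200 pixels
# (US Letter at 200 DPI)
bbox = {"left": 0.65, "top": 0.82, "width": 0.25, "height": 0.03}
print(bbox_to_rect(bbox, 1700, 2200))  # (1105, 1804, 1530, 1870)
```

The (x0, y0, x1, y1) ordering matches what most imaging libraries expect for a crop or rectangle call.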

Working with Citations

Accessing Citation Data

result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={"citations": {"enabled": True, "numerical_confidence": True}}
)

# Access a scalar field
invoice_number = result.result["invoice_number"]
print(f"Value: {invoice_number.value}")

if invoice_number.citations:
    citation = invoice_number.citations[0]
    print(f"Found on page {citation.bbox.page}")
    print(f"Source text: {citation.content}")
    print(f"Confidence: {citation.confidence}")

Array Citations

For array fields, each item in the array has its own citations:
for i, item in enumerate(result.result["line_items"]):
    description = item["description"]
    amount = item["amount"]
    
    print(f"Item {i + 1}: {description.value} - ${amount.value}")
    
    if amount.citations:
        print(f"  Amount found at: page {amount.citations[0].bbox.page}")
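
To collect every citation in a result, including those nested inside array items, a recursive walk is enough. This is a sketch that assumes the result is available as plain dicts and lists (the parsed JSON response rather than SDK objects); iter_citations is a hypothetical helper, and the sample values are illustrative only. Here it groups field paths by page, which is the shape a page-by-page highlight overlay would need.

```python
from collections import defaultdict
from typing import Any, Iterator, Tuple


def iter_citations(node: Any, path: str = "") -> Iterator[Tuple[str, dict]]:
    """Yield (field_path, citation) pairs from a citation-wrapped result."""
    if isinstance(node, dict):
        if "value" in node and "citations" in node:
            # Leaf field: emit each of its citations (the list may be empty)
            for citation in node["citations"] or []:
                yield path, citation
            return
        for key, child in node.items():
            yield from iter_citations(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_citations(child, f"{path}[{index}]")


# Illustrative data only
result = {
    "invoice_number": {
        "value": "INV-001",
        "citations": [{"content": "Invoice #INV-001", "bbox": {"page": 1}}],
    },
    "line_items": [
        {
            "description": {
                "value": "Widget",
                "citations": [{"content": "Widget", "bbox": {"page": 2}}],
            },
            "amount": {"value": 10.0, "citations": []},
        }
    ],
}

by_page = defaultdict(list)
for field_path, citation in iter_citations(result):
    by_page[citation["bbox"]["page"]].append(field_path)

print(dict(by_page))  # {1: ['invoice_number'], 2: ['line_items[0].description']}
```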

Filtering by Confidence

Use confidence scores to flag uncertain extractions:
LOW_CONFIDENCE_THRESHOLD = 0.7

for field_name, field_data in result.result.items():
    if hasattr(field_data, 'citations') and field_data.citations:
        citation = field_data.citations[0]
        # citation.confidence is the "high"/"low" label, so compare the numeric score instead
        score = citation.granular_confidence.extract_confidence
        if score < LOW_CONFIDENCE_THRESHOLD:
            print(f"Low confidence ({score:.2f}): {field_name} = {field_data.value}")

Spreadsheet Citations

Excel and other spreadsheet formats use cell coordinates instead of normalized positions.

Coordinate Differences

Aspect | PDFs/Images | Spreadsheets
left | Fraction (0-1) | Column number (1 = A, 2 = B)
top | Fraction (0-1) | Row number (1-indexed)
width | Fraction (0-1) | Columns spanned
height | Fraction (0-1) | Rows spanned
page | Page number | Sheet index (1 = first sheet)

Example Spreadsheet Citation

{
  "bbox": {
    "left": 3,        // Column C
    "top": 15,        // Row 15
    "width": 1,       // Single column
    "height": 1,      // Single row
    "page": 2,        // Second sheet
    "original_page": 2
  }
}
This points to cell C15 on the second sheet. The coordinates map directly to Excel’s A1 notation.
def bbox_to_cell(bbox):
    """Convert spreadsheet bbox to cell reference."""
    col_letter = chr(ord('A') + bbox.left - 1)  # Simplified, doesn't handle AA, AB, etc.
    return f"{col_letter}{bbox.top}"

# bbox with left=3, top=15 becomes "C15"
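
The simplified helper above stops at column Z and ignores width and height. A fuller sketch might look like the following. It assumes the bbox is a plain dict from the JSON response and that width and height count the columns and rows covered, so 1 means a single cell, as in the example above. The names column_letters and bbox_to_range are hypothetical.

```python
def column_letters(col: int) -> str:
    """Convert a 1-indexed column number to letters: 1 -> A, 26 -> Z, 27 -> AA."""
    letters = ""
    while col > 0:
        # Column letters have no zero digit, so shift by one before dividing
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def bbox_to_range(bbox: dict) -> str:
    """Convert a spreadsheet bbox to an A1-style cell or range reference."""
    first_col, first_row = int(bbox["left"]), int(bbox["top"])
    last_col = first_col + int(bbox["width"]) - 1
    last_row = first_row + int(bbox["height"]) - 1
    start = f"{column_letters(first_col)}{first_row}"
    end = f"{column_letters(last_col)}{last_row}"
    return start if start == end else f"{start}:{end}"


print(bbox_to_range({"left": 3, "top": 15, "width": 1, "height": 1}))  # C15
print(bbox_to_range({"left": 27, "top": 2, "width": 2, "height": 3}))  # AA2:AB4
```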

Constraints and Limitations

Citations Disable Chunking

Citations require knowing exactly where each piece of content came from. Chunking merges content across boundaries, which would make citation coordinates ambiguous. When you enable citations:
  • Chunking is automatically disabled in the parsing step
  • The document is processed as a single unit
  • This may increase processing time for very long documents
If you have explicit chunking settings in your parsing configuration, they’ll be ignored when citations are enabled.

Streaming Array Extract Incompatible

The streaming mode for array extraction cannot be used with citations. If you need both complete arrays and citations, use the default (non-streaming) array extraction mode:
# Works: default array_extract mode with citations
result = client.extract.run(
    input=upload.file_id,
    instructions={"schema": schema},
    settings={
        "array_extract": True,
        "citations": {"enabled": True}
    }
)
The default array extraction mode (not streaming) fully supports citations.

Empty Citations

Citations may be empty for fields that were inferred rather than directly extracted:
# If the document says "Payment Terms: Net 30"
# And your schema has a field for "days_until_due"
# The model extracts 30 but may not have a citation since "30" was derived from "Net 30"
Always check that the citations list is non-empty (if field.citations:) before accessing citation data.
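
One way to apply that check across a whole result is to separate fields that have a source location from those that do not, so inferred values can be routed to manual review. This is a sketch under assumptions: the result is a plain dict of citation-wrapped scalar fields (nested arrays and objects are skipped), and first_citation and split_by_citation are hypothetical helper names. The sample data mirrors the Net 30 case described above.

```python
from typing import Optional, Tuple


def first_citation(field: dict) -> Optional[dict]:
    """Return the first citation of a wrapped field, or None if it has none."""
    citations = field.get("citations") or []
    return citations[0] if citations else None


def split_by_citation(result: dict) -> Tuple[dict, dict]:
    """Split scalar fields into (cited, uncited) dicts of plain values."""
    cited, uncited = {}, {}
    for name, field in result.items():
        if not (isinstance(field, dict) and "value" in field):
            continue  # Not a wrapped scalar field; skipped in this simple sketch
        target = cited if first_citation(field) else uncited
        target[name] = field["value"]
    return cited, uncited


# Illustrative data only
result = {
    "payment_terms": {
        "value": "Net 30",
        "citations": [{"content": "Payment Terms: Net 30"}],
    },
    "days_until_due": {"value": 30, "citations": []},
}

cited, uncited = split_by_citation(result)
print(cited)    # {'payment_terms': 'Net 30'}
print(uncited)  # {'days_until_due': 30}
```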

Viewing in Studio

Every extraction response includes a studio_link. In Studio, citations become interactive:
  • Click an extracted field to highlight its source in the document
  • Click a highlight to jump to the corresponding field
  • See all citations overlaid on the document at once
This is particularly useful for debugging when extractions don’t match expectations. You can see exactly what the model identified as the source for each value.