OCR service configuration

For OCR functionality, Reducto uses AWS Textract with automatic region selection and load balancing across multiple regions.

Textract region configuration

Variable | Description | Required | Default
TEXTRACT_REGIONS | Comma-separated list of AWS regions and their quotas for Textract operations | No | us-east-2:100,us-east-1:100,us-west-2:100,ap-south-1:5,eu-west-1:5

Format

The TEXTRACT_REGIONS environment variable accepts a comma-separated list of region:quota pairs:
TEXTRACT_REGIONS=us-gov-west-1:10,us-gov-east-1:5,eu-central-1:15
Format Rules:
  • With quota: region:quota (e.g., us-east-1:100)
  • Without quota: region (defaults to quota of 1)
  • Mixed: us-east-1:50,us-west-2,eu-west-1:25
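
The format rules above can be illustrated with a small Python sketch. This is not Reducto's parser; the function name parse_textract_regions is hypothetical, and skipping empty items is an assumption that the documentation does not state.

```python
def parse_textract_regions(value: str) -> dict[str, int]:
    """Parse 'region:quota,region,...' into a {region: quota} mapping."""
    regions: dict[str, int] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # assumption: stray commas are ignored
        region, sep, quota = item.partition(":")
        # A bare region name defaults to a quota of 1, per the format rules.
        regions[region.strip()] = int(quota) if sep else 1
    return regions


print(parse_textract_regions("us-east-1:50,us-west-2,eu-west-1:25"))
# {'us-east-1': 50, 'us-west-2': 1, 'eu-west-1': 25}
```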

Use cases

This configuration is particularly useful for:
  • Government Cloud deployments requiring specific regions (e.g., us-gov-west-1, us-gov-east-1)
  • Data residency requirements restricting processing to certain geographic regions
  • Custom quota management when you have different service limits across regions
  • Network restrictions where only certain regions are accessible from your deployment

Examples

Government Cloud:
TEXTRACT_REGIONS=us-gov-west-1:10,us-gov-east-1:10
Europe-only processing:
TEXTRACT_REGIONS=eu-west-1:20,eu-central-1:15,eu-north-1:10
Single region:
TEXTRACT_REGIONS=us-west-2:50
When this variable is set, Reducto uses weighted random selection based on the quotas to distribute Textract requests across the specified regions. This provides load balancing with automatic quota handling while keeping requests within the regions you require.
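
To make the idea of quota-weighted random selection concrete, here is a minimal Python sketch. It only illustrates the general technique with the standard library; how Reducto implements the selection internally is not documented here, and the helper name pick_region is hypothetical.

```python
import random
from collections import Counter


def pick_region(quotas: dict[str, int], rng: random.Random) -> str:
    """Pick one region with probability proportional to its quota."""
    regions = list(quotas)
    weights = [quotas[r] for r in regions]
    return rng.choices(regions, weights=weights, k=1)[0]


# Quotas taken from the Europe-only example above.
quotas = {"eu-west-1": 20, "eu-central-1": 15, "eu-north-1": 10}
rng = random.Random(0)  # seeded only to make the demo repeatable
counts = Counter(pick_region(quotas, rng) for _ in range(9000))
print(counts)
```

Over many requests the share of traffic per region approaches quota / sum(quotas), so with these values eu-west-1 receives about 20/45 of the requests.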

Azure Vision OCR

Reducto can also use the Azure Computer Vision service for OCR.

Single endpoint

Variable | Description | Required | Default
AZURE_VISION_ENDPOINT | Azure Computer Vision endpoint URL | Yes | None
AZURE_VISION_KEY | Azure Computer Vision API key | Yes | None

Multiple endpoints (failover and load balancing)

For setups with multiple endpoints, use AZURE_VISION_ARRAY. If any endpoint returns a transient error (rate limits, connection failures, server errors), Reducto automatically tries the remaining endpoints before failing. Non-retryable errors (e.g. 400, 401, 403, 404) on the primary are surfaced immediately: they would reproduce on every endpoint, so cross-region retries would only add latency.
Variable | Description | Required | Default
AZURE_VISION_ARRAY | JSON array of Azure Vision endpoint objects | Yes | None
AZURE_VISION_ARRAY_STRATEGY | load_balance, priority, or primary_with_fanout | No | load_balance

Strategies

Strategy | Behavior
load_balance (default) | Pick a random endpoint per request; on transient error, retry that endpoint up to 3 times with exponential backoff (60-second per-endpoint wall-clock cap), then fail over to the next endpoint and repeat. Use when all endpoints have similar capacity and you want to spread steady-state load.
priority | Try endpoints in the order listed; on transient error, retry that endpoint up to 3 times with exponential backoff (60-second per-endpoint wall-clock cap), then fail over to the next and repeat. Use when you have a preferred region and want others used only as fallbacks.
primary_with_fanout | Try the primary (index 0) once; on a 408/429/5xx/network error, immediately probe each secondary in order with a single attempt each. If every endpoint returns a transient error, return to the primary and retry with exponential backoff (3 tries, 60-second cap). Use when endpoints have unequal capacity and you want to keep secondaries from absorbing sustained overflow: the higher-capacity primary recovers first under backoff while secondaries see only one probe per overloaded request.
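
The order in which each strategy visits endpoints can be sketched in Python as below. This is an illustration and not Reducto's code: attempt_plan is a hypothetical helper, "up to 3 times" is read as three tries in total, and for load_balance the sketch assumes failover continues in listed order from the randomly chosen endpoint, which the strategy descriptions do not specify.

```python
import random
from typing import Optional


def attempt_plan(strategy: str, n_endpoints: int,
                 rng: Optional[random.Random] = None) -> list[tuple[int, int]]:
    """Return (endpoint_index, max_tries) steps in the order they would be tried."""
    if n_endpoints < 1:
        raise ValueError("need at least one endpoint")
    if strategy == "priority":
        # Listed order, each endpoint retried with backoff before failing over.
        return [(i, 3) for i in range(n_endpoints)]
    if strategy == "load_balance":
        start = (rng or random.Random()).randrange(n_endpoints)
        # Assumption: failover wraps around the list starting at the random pick.
        return [((start + k) % n_endpoints, 3) for k in range(n_endpoints)]
    if strategy == "primary_with_fanout":
        # One probe of the primary, one probe per secondary, then back to the primary.
        probes = [(i, 1) for i in range(n_endpoints)]
        return probes + [(0, 3)]
    raise ValueError(f"unknown strategy: {strategy}")


print(attempt_plan("primary_with_fanout", 3))
# [(0, 1), (1, 1), (2, 1), (0, 3)]
```

The plan only describes what happens while every attempt keeps returning a transient error; a successful response ends the sequence at that step.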
Retry classification (all strategies):

HTTP status | Retryable? | Behavior on primary
408 Request Timeout | Yes | retry / fail over
429 Too Many Requests | Yes | retry / fail over
5xx (500–599) | Yes | retry / fail over
Network / connection errors | Yes | retry / fail over
4xx other than 408/429 | No | surface immediately, no failover
Under primary_with_fanout only, a non-retryable error on a secondary (e.g. a misconfigured key on one secondary) is logged and skipped so a single bad secondary does not poison a request the primary could otherwise serve once it recovers.
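
The retry classification can be expressed as a short Python predicate. This is a sketch of the rules listed above and not Reducto's implementation; the function name is hypothetical, and None is used here to stand for a network or connection error where no HTTP status was received.

```python
from typing import Optional


def is_retryable_error(status: Optional[int]) -> bool:
    """Classify a failed call; status=None means a network/connection error."""
    if status is None:
        return True  # network / connection errors are retryable
    if status in (408, 429):
        return True  # request timeout and rate limiting
    return 500 <= status <= 599  # any 5xx; every other 4xx is non-retryable


for code in (408, 429, 503, None, 400, 401, 404):
    print(code, is_retryable_error(code))
```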
# Load balance across regions (default — equal-capacity endpoints)
AZURE_VISION_ARRAY='[
  {"endpoint": "https://my-vision-uscentral.cognitiveservices.azure.com/", "key": "<key-1>"},
  {"endpoint": "https://my-vision-useast.cognitiveservices.azure.com/", "key": "<key-2>"},
  {"endpoint": "https://my-vision-uswest.cognitiveservices.azure.com/", "key": "<key-3>"}
]'

# Priority order: primary first, secondaries only on transient failure
AZURE_VISION_ARRAY_STRATEGY=priority
AZURE_VISION_ARRAY='[
  {"endpoint": "https://my-vision-uscentral.cognitiveservices.azure.com/", "key": "<key-1>"},
  {"endpoint": "https://my-vision-useast.cognitiveservices.azure.com/", "key": "<key-2>"}
]'

# Primary with fanout: high-capacity primary, low-quota secondaries only absorb spillover
AZURE_VISION_ARRAY_STRATEGY=primary_with_fanout
AZURE_VISION_ARRAY='[
  {"endpoint": "https://my-vision-primary.cognitiveservices.azure.com/", "key": "<key-1>"},
  {"endpoint": "https://my-vision-overflow-1.cognitiveservices.azure.com/", "key": "<key-2>"},
  {"endpoint": "https://my-vision-overflow-2.cognitiveservices.azure.com/", "key": "<key-3>"}
]'
When AZURE_VISION_ARRAY is set, AZURE_VISION_ENDPOINT and AZURE_VISION_KEY are ignored.
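
Before deploying, it can help to check that the AZURE_VISION_ARRAY value is valid JSON with the expected fields. The Python sketch below does that and also mirrors the precedence rule above; load_vision_endpoints is a hypothetical helper, not part of Reducto, and the validation rules are assumptions based on the examples.

```python
import json


def load_vision_endpoints(env: dict[str, str]) -> list[dict[str, str]]:
    """Return the endpoint list, preferring AZURE_VISION_ARRAY when it is set."""
    raw = env.get("AZURE_VISION_ARRAY")
    if raw:
        endpoints = json.loads(raw)
        if not isinstance(endpoints, list) or not endpoints:
            raise ValueError("AZURE_VISION_ARRAY must be a non-empty JSON array")
        for entry in endpoints:
            if not isinstance(entry, dict) or not {"endpoint", "key"} <= set(entry):
                raise ValueError("each entry needs 'endpoint' and 'key'")
        return endpoints
    # Single-endpoint configuration is used only when the array is absent.
    return [{"endpoint": env["AZURE_VISION_ENDPOINT"], "key": env["AZURE_VISION_KEY"]}]


example = {
    "AZURE_VISION_ARRAY": '[{"endpoint": "https://my-vision-primary.cognitiveservices.azure.com/", "key": "<key-1>"}]',
    "AZURE_VISION_ENDPOINT": "https://ignored.example/",
    "AZURE_VISION_KEY": "<ignored>",
}
print(load_vision_endpoints(example))
```

In a real check you would pass dict(os.environ) instead of the example mapping.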

LLM provider environment variables

Reducto supports multiple LLM providers through environment variables. Below is a complete list of supported providers and their required environment variables.

LiteLLM proxy

Variable | Description | Required
LITELLM_PROXY_URL | URL of the LiteLLM Proxy | Yes
LITELLM_PROXY_FAST_MODEL | Fast model to route to via the proxy | Yes
LITELLM_PROXY_ACCURATE_MODEL | Accurate model to route to via the proxy | Yes

OpenAI

Variable | Description | Required
OPENAI_API_KEY | Your OpenAI API key | Yes

Azure OpenAI

Variable | Description | Required
AZURE_OPENAI_API_KEY | Your Azure OpenAI API key | Yes
AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint (e.g., https://your-resource-name.openai.azure.com/) | Yes
OPENAI_API_VERSION | Azure OpenAI API version (e.g., 2024-10-21) | Yes
AZURE_OPENAI_MODEL_MAP | Comma-separated map (or single default deployment) used to translate model names to Azure deployment names. Reducto uses the following models and each should resolve to a deployment unless a single default is supplied: gpt-4o-2024-08-06, gpt-4o, gpt-4o-mini-2024-07-18, gpt-4o-mini, gpt-4.1, o1. Example mappings: my-default-deployment (single default) or gpt-4o=my-prod-dep, gpt-4o-mini=gpt4o-mini-dep | Yes
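
The two AZURE_OPENAI_MODEL_MAP forms can be illustrated with a small Python sketch. It is not Reducto's parser: resolve_deployment is a hypothetical name, and treating a bare entry as the default deployment (including when it appears alongside name=deployment pairs) is an assumption that goes beyond what the table states.

```python
def resolve_deployment(model_map: str, model: str) -> str:
    """Translate a model name to an Azure deployment name."""
    default = None
    mapping: dict[str, str] = {}
    for item in model_map.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, deployment = item.partition("=")
        if sep:
            mapping[name.strip()] = deployment.strip()
        else:
            default = name  # assumption: a bare entry is the default deployment
    if model in mapping:
        return mapping[model]
    if default is not None:
        return default
    raise KeyError(f"no deployment configured for {model}")


print(resolve_deployment("gpt-4o=my-prod-dep, gpt-4o-mini=gpt4o-mini-dep", "gpt-4o-mini"))
# gpt4o-mini-dep
print(resolve_deployment("my-default-deployment", "gpt-4.1"))
# my-default-deployment
```

A check like this makes it easy to confirm that every model name listed in the table resolves to a deployment before starting the service.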

Anthropic

Variable | Description | Required
ANTHROPIC_API_KEY | Your Anthropic API key | Yes

Google

Variable | Description | Required
GOOGLE_APPLICATION_CREDENTIALS | Service account key json with roles/aiplatform.user role for Vertex AI | Yes
GCP_PROJECT_ID | GCP project for Cloud Vision API | Yes
GCP_REGION | Region for Vertex AI, defaults to us-central1 | No
GCP_API_KEY | API key with no Application or API restrictions to access Cloud Vision API | Yes

Gemini

Variable | Description | Required
GEMINI_API_KEY | Your Gemini API key | Yes

AWS Bedrock

Variable | Description | Required
USE_CLAUDE_BEDROCK | Set to any value to enable Claude via AWS Bedrock | Yes
AWS_ACCESS_KEY_ID | AWS access key ID | Yes, when using Bedrock
AWS_SECRET_ACCESS_KEY | AWS secret access key | Yes, when using Bedrock
AWS_REGION | AWS region name | Yes, when using Bedrock

GPU-based extraction models

Reducto offers GPU-based models for structured data extraction and fine-grained citations. For best results, deploy both models together. Model weights are downloaded from HuggingFace using a scoped token provided by Reducto.

Prerequisites

Create a Kubernetes secret with the HuggingFace token provided by Reducto:
kubectl create secret generic reducto-hf-token --from-literal=HF_TOKEN=hf_...
We recommend enabling modelStorage to cache weights on a PVC so restarts don’t re-download.

YAML extraction model (30B)

GPU requirement: 1x NVIDIA H200 (will not fit on H100/A100/A10G).
yamlExtract:
  enabled: true
  gpu: "H200"
  modelStorage:
    enabled: true
    size: "100Gi"
    storageClassName: "your-storage-class"
When enabled, REDUCTO_YAML_EXTRACT_URL is automatically injected into all worker and HTTP pods.

Citation model (7B)

GPU requirement: 1x NVIDIA H100 or H200.
citationModel:
  enabled: true
  gpu: "H200"  # or "H100"
  modelStorage:
    enabled: true
    size: "50Gi"
    storageClassName: "your-storage-class"
When enabled, REDUCTO_CITATION_URL is automatically injected into all worker and HTTP pods. If not deployed, citations fall back to your configured external LLM provider.

Model path overrides

Both deployments expose a modelPath field that can be updated if Reducto ships new model weights:
yamlExtract:
  modelPath: "reducto/extract_30b_0108"  # update when directed by Reducto

citationModel:
  modelPath: "reducto/citation_7b_mimo_0812"  # update when directed by Reducto

Extraction without GPU models

If you do not deploy either GPU model, extraction uses your configured external LLM provider (OpenAI, Anthropic, Google, Azure, or Bedrock). No additional configuration is needed.

Fine-tuned OpenAI extraction model (alternative)

Variable | Description | Required
LOCAL_EXTRACT_CITATIONS_MODEL | Fine-tuned OpenAI model ID (e.g., openai:ft:gpt-4.1-2025-04-14:...) | No
When set, this takes priority over the self-hosted extraction model.

Request-level LLM overrides

In addition to environment variables, on-prem deployments can override LLM configuration at the request level using the overrides parameter in experimental_options.

Key-value processing overrides

Override the model and add custom instructions for key-value (form) region processing:
{
  "document_url": "https://example.com/form.pdf",
  "experimental_options": {
    "overrides": {
      "key_value": {
        "model": "google:gemini-2.5-flash-lite",
        "custom_instructions": "Pay special attention to date fields. Use MM/DD/YYYY format."
      }
    }
  }
}
Field | Description
model | Model alias (fast, accurate) or provider:model format
custom_instructions | Additional instructions appended to the default prompt

Resolution order

Model resolution:
  1. Request override - experimental_options.overrides.key_value.model
  2. Environment variable - LOCAL_KV_MODEL
  3. Code default - Based on deployment configuration
Prompt resolution:
  1. Base prompt - LOCAL_KV_PROMPT env var, or built-in default
  2. Custom instructions - Appended from overrides.key_value.custom_instructions
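
The resolution order can be sketched in Python as follows. This is an illustration under stated assumptions, not Reducto's code: the function names and the BUILT_IN_PROMPT placeholder are hypothetical, and joining the base prompt and custom instructions with a blank line is an assumption about how the text is appended.

```python
BUILT_IN_PROMPT = "<built-in KV prompt>"  # placeholder; the real default is not shown here


def _kv_overrides(request: dict) -> dict:
    """Return experimental_options.overrides.key_value, or {} if absent."""
    return request.get("experimental_options", {}).get("overrides", {}).get("key_value", {})


def resolve_kv_model(request: dict, env: dict, code_default: str) -> str:
    # 1. request override, 2. LOCAL_KV_MODEL, 3. code default
    return _kv_overrides(request).get("model") or env.get("LOCAL_KV_MODEL") or code_default


def resolve_kv_prompt(request: dict, env: dict) -> str:
    base = env.get("LOCAL_KV_PROMPT") or BUILT_IN_PROMPT
    extra = _kv_overrides(request).get("custom_instructions")
    # Assumption: custom instructions are appended after a blank line.
    return f"{base}\n\n{extra}" if extra else base


request = {"experimental_options": {"overrides": {"key_value": {
    "model": "google:gemini-2.5-flash-lite",
    "custom_instructions": "Use MM/DD/YYYY format.",
}}}}
print(resolve_kv_model(request, {"LOCAL_KV_MODEL": "fast"}, "accurate"))
# google:gemini-2.5-flash-lite
```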

Environment variable defaults

Variable | Description | Default
LOCAL_KV_MODEL | Override model for KV processing | None (uses built-in cascade)
LOCAL_KV_PROMPT | Base prompt for KV processing (can be fully replaced) | Built-in prompt

AI usage tracking

Reducto includes a comprehensive AI usage tracking system that monitors language model consumption throughout the document processing pipeline. This feature provides detailed insights into token usage, request counts, and model utilization for billing and optimization purposes.

How AI usage tracking works

The AI usage tracking system operates at the block level within the parsing pipeline:
  1. Token Counting: Each AI operation (table summarization, figure analysis, key-value extraction, etc.) records token consumption
  2. Request Tracking: The system counts API calls made to each model
  3. Model Identification: Usage is tracked per model type with provider information
  4. Aggregation: Usage is aggregated across all blocks and pages for comprehensive reporting

Available via /parse API

AI usage information is currently only available through the /parse API endpoint using the custom_format parameter. This feature is not available in other API endpoints.

Usage information structure

When enabled, the system returns an AIUsageInfo object containing:
{
  "did_use_ai_models": true,
  "ai_usage_info": [
    {
      "promptTokenCount": 1500,
      "completionTokenCount": 300,
      "cachedTokenCount": 0,
      "requestCount": 2,
      "modelType": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
      "modelProvider": "anthropic",
      "modelRateLimitFamily": "us.anthropic.claude-3-7-sonnet"
    }
  ]
}

Field descriptions

  • did_use_ai_models: Boolean indicating whether any AI models were used during processing
  • ai_usage_info: Array of usage information objects, one per model type used
  • promptTokenCount: Total input tokens sent to the model
  • completionTokenCount: Total output tokens generated by the model
  • cachedTokenCount: Total cached tokens used (when supported by provider)
  • requestCount: Number of API calls made to this model
  • modelType: Standardized model identifier
  • modelProvider: Provider name (e.g., “anthropic”, “openai”)
  • modelRateLimitFamily: Rate limiting group for the model
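
Because ai_usage_info holds one entry per model, totals across models have to be summed by the caller. The Python sketch below shows one way to do that with the fields described above; summarize_usage is a hypothetical helper, and it assumes you have already extracted the AIUsageInfo object from the /parse response.

```python
def summarize_usage(usage: dict) -> dict[str, int]:
    """Sum token and request counts across all models in an AIUsageInfo object."""
    totals = {
        "promptTokenCount": 0,
        "completionTokenCount": 0,
        "cachedTokenCount": 0,
        "requestCount": 0,
    }
    for entry in usage.get("ai_usage_info", []):
        for field in totals:
            totals[field] += entry.get(field, 0)
    return totals


usage = {
    "did_use_ai_models": True,
    "ai_usage_info": [{
        "promptTokenCount": 1500,
        "completionTokenCount": 300,
        "cachedTokenCount": 0,
        "requestCount": 2,
        "modelType": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
        "modelProvider": "anthropic",
        "modelRateLimitFamily": "us.anthropic.claude-3-7-sonnet",
    }],
}
print(summarize_usage(usage))
# {'promptTokenCount': 1500, 'completionTokenCount': 300, 'cachedTokenCount': 0, 'requestCount': 2}
```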

Enabling AI usage tracking

To retrieve AI usage information, set the custom_format parameter to "ai_usage" in your /parse request:
{
  "input": "your_document_url",
  "settings": {
    "custom_format": "ai_usage"
  }
}

Tracked AI operations

The system tracks usage from these AI-powered features:
  • Table Summarization: Analysis and description of complex tables
  • Figure Summarization: Analysis and description of images and charts
  • Key-Value enrichment: Enrichment for form-like regions within documents

Model name standardization

The system automatically standardizes model names for consistent reporting:
  • Internal model identifiers are mapped to standard formats
  • Provider information is automatically added
  • Rate limit families are identified for capacity planning

Possible model identifiers

The following model identifiers may appear in the modelType field of AI usage tracking responses, if you have OpenAI and Anthropic access enabled:

OpenAI models

  • gpt-4o-2024-08-06
  • gpt-4o-mini-2024-07-18

Anthropic models

  • claude-haiku-4-5-20251001
  • claude-3-7-sonnet-20250219
If you enable other model providers, their model identifiers appear in modelType with provider-specific prefixes.