
# LLM & service configuration options

> Complete guide to LLM configuration and environment variables for Reducto

## OCR service configuration

For OCR functionality, Reducto uses AWS Textract with automatic region selection and load balancing across multiple regions.

### Textract region configuration

| Variable           | Description                                                                  | Required | Default                                                              |
| ------------------ | ---------------------------------------------------------------------------- | -------- | -------------------------------------------------------------------- |
| `TEXTRACT_REGIONS` | Comma-separated list of AWS regions and their quotas for Textract operations | No       | `us-east-2:100,us-east-1:100,us-west-2:100,ap-south-1:5,eu-west-1:5` |

#### Format

The `TEXTRACT_REGIONS` environment variable accepts a comma-separated list of `region:quota` pairs:

```bash theme={null}
TEXTRACT_REGIONS=us-gov-west-1:10,us-gov-east-1:5,eu-central-1:15
```

**Format Rules:**

* **With quota:** `region:quota` (e.g., `us-east-1:100`)
* **Without quota:** `region` (defaults to quota of 1)
* **Mixed:** `us-east-1:50,us-west-2,eu-west-1:25`
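
The format rules above can be illustrated with a minimal parsing sketch. This is not Reducto's parser; the function name `parse_textract_regions` and the tolerance for stray commas are assumptions made for illustration.

```python
def parse_textract_regions(value: str) -> dict[str, int]:
    """Parse 'region:quota' pairs; a bare region gets a quota of 1."""
    regions: dict[str, int] = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # assumption: ignore empty entries from stray commas
        region, sep, quota = item.partition(":")
        # No ':' present means the quota was omitted, so default to 1.
        regions[region.strip()] = int(quota) if sep else 1
    return regions


print(parse_textract_regions("us-east-1:50,us-west-2,eu-west-1:25"))
# {'us-east-1': 50, 'us-west-2': 1, 'eu-west-1': 25}
```

Running a check like this before deploying can catch a non-numeric quota early, since `int()` raises `ValueError` on it.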

#### Use cases

This configuration is particularly useful for:

* **Government Cloud deployments** requiring specific regions (e.g., `us-gov-west-1`, `us-gov-east-1`)
* **Data residency requirements** restricting processing to certain geographic regions
* **Custom quota management** when you have different service limits across regions
* **Network restrictions** where only certain regions are accessible from your deployment

#### Examples

**Government Cloud:**

```bash theme={null}
TEXTRACT_REGIONS=us-gov-west-1:10,us-gov-east-1:10
```

**Europe-only processing:**

```bash theme={null}
TEXTRACT_REGIONS=eu-west-1:20,eu-central-1:15,eu-north-1:10
```

**Single region:**

```bash theme={null}
TEXTRACT_REGIONS=us-west-2:50
```

When this variable is set, Reducto distributes Textract requests across the specified regions by weighted random selection, using each region's quota as its weight. This provides both load balancing and compliance with regional requirements, and quotas are handled automatically.
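
To get a feel for how quotas translate into traffic share, here is a small simulation of weighted random selection. It is a sketch of the general technique using Python's standard `random.choices`, not Reducto's implementation; the helper name `pick_region` and the sample size are illustrative.

```python
import random
from collections import Counter

# Quotas from the Europe-only example above.
quotas = {"eu-west-1": 20, "eu-central-1": 15, "eu-north-1": 10}


def pick_region(quotas: dict[str, int], rng: random.Random) -> str:
    """Pick one region with probability proportional to its quota."""
    regions = list(quotas)
    weights = list(quotas.values())
    return rng.choices(regions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only to make the simulation repeatable
samples = 9000
counts = Counter(pick_region(quotas, rng) for _ in range(samples))

total_quota = sum(quotas.values())
for region, quota in quotas.items():
    expected = quota / total_quota
    observed = counts[region] / samples
    print(f"{region}: expected {expected:.0%}, observed {observed:.0%}")
```

With these quotas the expected shares are roughly 44%, 33%, and 22%, so a region's quota matters only relative to the others in the list.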

### Azure Vision OCR

Configure these variables to run OCR through the Azure Computer Vision service.

#### Single endpoint

| Variable                | Description                        | Required | Default |
| ----------------------- | ---------------------------------- | -------- | ------- |
| `AZURE_VISION_ENDPOINT` | Azure Computer Vision endpoint URL | Yes      | None    |
| `AZURE_VISION_KEY`      | Azure Computer Vision API key      | Yes      | None    |

#### Multiple endpoints (failover and load balancing)

For setups with multiple endpoints, use `AZURE_VISION_ARRAY`. If any endpoint returns a transient error (rate limits, connection failures, server errors), Reducto automatically tries the remaining endpoints before failing. Non-retryable errors (e.g. `400`, `401`, `403`, `404`) on the primary surface immediately — they will reproduce on every endpoint, so cross-region retries would only add latency.

| Variable                      | Description                                          | Required | Default        |
| ----------------------------- | ---------------------------------------------------- | -------- | -------------- |
| `AZURE_VISION_ARRAY`          | JSON array of Azure Vision endpoint objects          | Yes      | None           |
| `AZURE_VISION_ARRAY_STRATEGY` | `load_balance`, `priority`, or `primary_with_fanout` | No       | `load_balance` |

##### Strategies

| Strategy                   | Behavior                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `load_balance` *(default)* | Pick a random endpoint per request; on transient error, retry that endpoint up to 3 times with exponential backoff (60-second per-endpoint wall-clock cap), then fail over to the next endpoint and repeat. Use when all endpoints have similar capacity and you want to spread steady-state load.                                                                                                                                                                                                                  |
| `priority`                 | Try endpoints in the order listed; on transient error, retry that endpoint up to 3 times with exponential backoff (60-second per-endpoint wall-clock cap), then fail over to the next and repeat. Use when you have a preferred region and want others used only as fallbacks.                                                                                                                                                                                                                                      |
| `primary_with_fanout`      | Try the primary (index 0) once; on a `408`/`429`/`5xx`/network error, immediately probe each secondary in order with a single attempt each. If every endpoint returns a transient error, return to the primary and retry with exponential backoff (3 tries, 60-second cap). Use when endpoints have **unequal capacity** and you want to keep secondaries from absorbing sustained overflow — the higher-capacity primary recovers first under backoff while secondaries see only one probe per overloaded request. |

Retry classification (all strategies):

| HTTP status                  | Retryable? | Behavior on primary              |
| ---------------------------- | ---------- | -------------------------------- |
| `408` Request Timeout        | Yes        | retry / fail over                |
| `429` Too Many Requests      | Yes        | retry / fail over                |
| `5xx` (500–599)              | Yes        | retry / fail over                |
| Network / connection errors  | Yes        | retry / fail over                |
| `4xx` other than `408`/`429` | No         | surface immediately, no failover |
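
The classification table can be expressed as a small predicate. This is an illustrative sketch, not Reducto's code; the function name `is_retryable` and the use of `None` to represent a network error are conventions of this example only.

```python
from typing import Optional


def is_retryable(status: Optional[int]) -> bool:
    """Return True for errors the table above marks as retryable.

    `None` stands in for a network/connection error that produced no
    HTTP status (a convention of this sketch).
    """
    if status is None:
        return True
    if status in (408, 429):
        return True
    return 500 <= status <= 599


for status in (408, 429, 503, None, 400, 401, 403, 404):
    print(status, "retry / fail over" if is_retryable(status) else "surface immediately")
```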

Under `primary_with_fanout` only, a non-retryable error on a *secondary* (e.g. a misconfigured key on one secondary) is logged and skipped so a single bad secondary does not poison a request the primary could otherwise serve once it recovers.
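
The attempt order of `primary_with_fanout` can be sketched as follows. This is a simplified model based only on the description above, not Reducto's implementation: the exception classes, the `send` callback, and the `base_delay` value are hypothetical, and the 60-second wall-clock cap is not modeled.

```python
import time


class TransientError(Exception):
    """Hypothetical: a 408, 429, 5xx, or network error."""


class NonRetryableError(Exception):
    """Hypothetical: any other 4xx error."""


def call_with_fanout(endpoints, send, sleep=time.sleep, tries=3, base_delay=1.0):
    primary, secondaries = endpoints[0], endpoints[1:]
    try:
        return send(primary)  # a NonRetryableError here propagates immediately
    except TransientError:
        pass
    for endpoint in secondaries:  # one probe each, in listed order
        try:
            return send(endpoint)
        except (TransientError, NonRetryableError):
            continue  # a failing secondary is skipped, not fatal
    last_error = None
    for attempt in range(tries):  # back to the primary, with backoff
        sleep(base_delay * 2 ** attempt)  # assumed backoff schedule
        try:
            return send(primary)
        except TransientError as exc:
            last_error = exc
    raise last_error


attempts = []


def fake_send(endpoint):
    """Fail transiently everywhere until the primary's second attempt."""
    attempts.append(endpoint)
    if endpoint == "primary" and attempts.count("primary") == 2:
        return "ok"
    raise TransientError(endpoint)


result = call_with_fanout(
    ["primary", "overflow-1", "overflow-2"], fake_send, sleep=lambda s: None
)
print(result)    # ok
print(attempts)  # ['primary', 'overflow-1', 'overflow-2', 'primary']
```

In this model each overloaded request costs every secondary at most one call, which is the property that keeps low-quota secondaries from absorbing sustained overflow.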

```bash theme={null}
# Load balance across regions (default — equal-capacity endpoints)
AZURE_VISION_ARRAY='[
  {"endpoint": "https://my-vision-uscentral.cognitiveservices.azure.com/", "key": "<key-1>"},
  {"endpoint": "https://my-vision-useast.cognitiveservices.azure.com/", "key": "<key-2>"},
  {"endpoint": "https://my-vision-uswest.cognitiveservices.azure.com/", "key": "<key-3>"}
]'

# Priority order: primary first, secondaries only on transient failure
AZURE_VISION_ARRAY_STRATEGY=priority
AZURE_VISION_ARRAY='[
  {"endpoint": "https://my-vision-uscentral.cognitiveservices.azure.com/", "key": "<key-1>"},
  {"endpoint": "https://my-vision-useast.cognitiveservices.azure.com/", "key": "<key-2>"}
]'

# Primary with fanout: high-capacity primary, low-quota secondaries only absorb spillover
AZURE_VISION_ARRAY_STRATEGY=primary_with_fanout
AZURE_VISION_ARRAY='[
  {"endpoint": "https://my-vision-primary.cognitiveservices.azure.com/", "key": "<key-1>"},
  {"endpoint": "https://my-vision-overflow-1.cognitiveservices.azure.com/", "key": "<key-2>"},
  {"endpoint": "https://my-vision-overflow-2.cognitiveservices.azure.com/", "key": "<key-3>"}
]'
```

<Note>
  When `AZURE_VISION_ARRAY` is set, `AZURE_VISION_ENDPOINT` and `AZURE_VISION_KEY` are ignored.
</Note>
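
Because `AZURE_VISION_ARRAY` is JSON embedded in an environment variable, quoting mistakes are easy to make. The sketch below generates the value with `json.dumps` and runs a basic shape check. The helper `validate_vision_array` is hypothetical, and the assumption that each object needs exactly an `endpoint` and a `key` is inferred from the examples above; Reducto's own validation may differ.

```python
import json


def validate_vision_array(raw: str) -> list:
    """Check that raw is a non-empty JSON array of {endpoint, key} objects."""
    endpoints = json.loads(raw)
    if not isinstance(endpoints, list) or not endpoints:
        raise ValueError("AZURE_VISION_ARRAY must be a non-empty JSON array")
    for index, entry in enumerate(endpoints):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a JSON object")
        missing = {"endpoint", "key"} - set(entry)
        if missing:
            raise ValueError(f"entry {index} is missing {sorted(missing)}")
    return endpoints


endpoints = [
    {"endpoint": "https://my-vision-useast.cognitiveservices.azure.com/", "key": "<key-1>"},
    {"endpoint": "https://my-vision-uswest.cognitiveservices.azure.com/", "key": "<key-2>"},
]
raw = json.dumps(endpoints)
validate_vision_array(raw)

# Single quotes keep the shell from interpreting the JSON double quotes.
print(f"AZURE_VISION_ARRAY='{raw}'")
```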

## LLM provider environment variables

Reducto supports multiple LLM providers through environment variables. Below is a complete list of supported providers and their required environment variables.

### LiteLLM proxy

| Variable                       | Description                              | Required |
| ------------------------------ | ---------------------------------------- | -------- |
| `LITELLM_PROXY_URL`            | URL of the LiteLLM Proxy                 | Yes      |
| `LITELLM_PROXY_FAST_MODEL`     | Fast model to route to via the proxy     | Yes      |
| `LITELLM_PROXY_ACCURATE_MODEL` | Accurate model to route to via the proxy | Yes      |

### OpenAI

| Variable         | Description         | Required |
| ---------------- | ------------------- | -------- |
| `OPENAI_API_KEY` | Your OpenAI API key | Yes      |

### Azure OpenAI

| Variable                 | Description                                                                                                                                                                                                                                                                                                                                                                                                                        | Required |
| ------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| `AZURE_OPENAI_API_KEY`   | Your Azure OpenAI API key                                                                                                                                                                                                                                                                                                                                                                                                          | Yes      |
| `AZURE_OPENAI_ENDPOINT`  | Azure OpenAI endpoint (e.g., `https://your-resource-name.openai.azure.com/`)                                                                                                                                                                                                                                                                                                                                                       | Yes      |
| `OPENAI_API_VERSION`     | Azure OpenAI API version (e.g., `2024-10-21`)                                                                                                                                                                                                                                                                                                                                                                                      | Yes      |
| `AZURE_OPENAI_MODEL_MAP` | Comma-separated map (or single default deployment) used to translate model names to Azure deployment names. Reducto uses the following models and each should resolve to a deployment unless a single default is supplied: `gpt-4o-2024-08-06`, `gpt-4o`, `gpt-4o-mini-2024-07-18`, `gpt-4o-mini`, `gpt-4.1`, `o1`. Example mappings: `my-default-deployment` (single default) or `gpt-4o=my-prod-dep, gpt-4o-mini=gpt4o-mini-dep` | Yes      |
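
Before deploying, it can help to confirm that every model listed above resolves to a deployment. The following sketch interprets `AZURE_OPENAI_MODEL_MAP` as described in the table: a value without `=` is treated as a single default, otherwise it is read as `model=deployment` pairs. The function names are hypothetical, and Reducto's actual parsing may differ in edge cases such as mixing a default with explicit pairs.

```python
# Model names taken from the table above.
REQUIRED_MODELS = [
    "gpt-4o-2024-08-06",
    "gpt-4o",
    "gpt-4o-mini-2024-07-18",
    "gpt-4o-mini",
    "gpt-4.1",
    "o1",
]


def resolve_deployment(model: str, model_map: str) -> str:
    """Return the Azure deployment name for a model, or raise KeyError."""
    if "=" not in model_map:
        return model_map.strip()  # single default deployment
    mapping = {}
    for item in model_map.split(","):
        name, sep, deployment = item.strip().partition("=")
        if sep:
            mapping[name.strip()] = deployment.strip()
    return mapping[model]


def missing_models(model_map: str) -> list:
    """List the required models that do not resolve to any deployment."""
    missing = []
    for model in REQUIRED_MODELS:
        try:
            resolve_deployment(model, model_map)
        except KeyError:
            missing.append(model)
    return missing


print(missing_models("my-default-deployment"))
# []
print(missing_models("gpt-4o=my-prod-dep, gpt-4o-mini=gpt4o-mini-dep"))
# ['gpt-4o-2024-08-06', 'gpt-4o-mini-2024-07-18', 'gpt-4.1', 'o1']
```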

### Anthropic

| Variable            | Description            | Required |
| ------------------- | ---------------------- | -------- |
| `ANTHROPIC_API_KEY` | Your Anthropic API key | Yes      |

### Google

| Variable                         | Description                                                                                                                     | Required |
| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------- | -------- |
| `GOOGLE_APPLICATION_CREDENTIALS` | Service account key JSON with `roles/aiplatform.user` role for Vertex AI                                                        | Yes      |
| `GCP_PROJECT_ID`                 | GCP project for Cloud Vision API                                                                                                | Yes      |
| `GCP_REGION`                     | Region for Vertex AI, defaults to `us-central1`                                                                                 | No       |
| `GCP_API_KEY`                    | [API key](https://console.cloud.google.com/apis/credentials) with no Application or API restrictions to access Cloud Vision API | Yes      |

### Gemini

| Variable         | Description         | Required |
| ---------------- | ------------------- | -------- |
| `GEMINI_API_KEY` | Your Gemini API key | Yes      |

### AWS Bedrock

| Variable                | Description                                       | Required                |
| ----------------------- | ------------------------------------------------- | ----------------------- |
| `USE_CLAUDE_BEDROCK`    | Set to any value to enable Claude via AWS Bedrock | Yes                     |
| `AWS_ACCESS_KEY_ID`     | AWS access key ID                                 | Yes, when using Bedrock |
| `AWS_SECRET_ACCESS_KEY` | AWS secret access key                             | Yes, when using Bedrock |
| `AWS_REGION`            | AWS region name                                   | Yes, when using Bedrock |

## GPU-based extraction models

Reducto offers GPU-based models for structured data extraction and fine-grained citations. For best results, deploy both models together. Model weights are downloaded from HuggingFace using a scoped token provided by Reducto.

### Prerequisites

Create a Kubernetes secret with the HuggingFace token provided by Reducto:

```bash theme={null}
kubectl create secret generic reducto-hf-token --from-literal=HF_TOKEN=hf_...
```

We recommend enabling `modelStorage` to cache weights on a PVC so that restarts don't re-download them.

### YAML extraction model (30B)

**GPU requirement:** 1x NVIDIA H200 (will not fit on H100/A100/A10G).

```yaml theme={null}
yamlExtract:
  enabled: true
  gpu: "H200"
  modelStorage:
    enabled: true
    size: "100Gi"
    storageClassName: "your-storage-class"
```

When enabled, `REDUCTO_YAML_EXTRACT_URL` is automatically injected into all worker and HTTP pods.

### Citation model (7B)

**GPU requirement:** 1x NVIDIA H100 or H200.

```yaml theme={null}
citationModel:
  enabled: true
  gpu: "H200"  # or "H100"
  modelStorage:
    enabled: true
    size: "50Gi"
    storageClassName: "your-storage-class"
```

When enabled, `REDUCTO_CITATION_URL` is automatically injected into all worker and HTTP pods. If not deployed, citations fall back to your configured external LLM provider.

### Model path overrides

Both deployments expose a `modelPath` field that can be updated if Reducto ships new model weights:

```yaml theme={null}
yamlExtract:
  modelPath: "reducto/extract_30b_0108"  # update when directed by Reducto

citationModel:
  modelPath: "reducto/citation_7b_mimo_0812"  # update when directed by Reducto
```

### Extraction without GPU models

If you do not deploy either GPU model, extraction uses your configured external LLM provider (OpenAI, Anthropic, Google, Azure, or Bedrock). No additional configuration is needed.

### Fine-tuned OpenAI extraction model (alternative)

| Variable                        | Description                                                           | Required |
| ------------------------------- | --------------------------------------------------------------------- | -------- |
| `LOCAL_EXTRACT_CITATIONS_MODEL` | Fine-tuned OpenAI model ID (e.g., `openai:ft:gpt-4.1-2025-04-14:...`) | No       |

When set, this takes priority over the self-hosted extraction model.

## Request-level LLM overrides

In addition to environment variables, on-prem deployments can override LLM configuration at the request level using the `overrides` parameter in `experimental_options`.

### Key-value processing overrides

Override the model and add custom instructions for key-value (form) region processing:

```json theme={null}
{
  "document_url": "https://example.com/form.pdf",
  "experimental_options": {
    "overrides": {
      "key_value": {
        "model": "google:gemini-2.5-flash-lite",
        "custom_instructions": "Pay special attention to date fields. Use MM/DD/YYYY format."
      }
    }
  }
}
```

| Field                 | Description                                                 |
| --------------------- | ----------------------------------------------------------- |
| `model`               | Model alias (`fast`, `accurate`) or `provider:model` format |
| `custom_instructions` | Additional instructions appended to the default prompt      |

### Resolution order

**Model resolution:**

1. **Request override** - `experimental_options.overrides.key_value.model`
2. **Environment variable** - `LOCAL_KV_MODEL`
3. **Code default** - Based on deployment configuration

**Prompt resolution:**

1. **Base prompt** - `LOCAL_KV_PROMPT` env var, or built-in default
2. **Custom instructions** - Appended from `overrides.key_value.custom_instructions`
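
The two resolution orders can be sketched as plain lookups. This is an illustration of the precedence rules only: `CODE_DEFAULT_MODEL` and `BUILT_IN_PROMPT` are placeholders, and the blank-line separator used when appending custom instructions is an assumption.

```python
from typing import Mapping

CODE_DEFAULT_MODEL = "fast"             # placeholder; the real default depends on deployment
BUILT_IN_PROMPT = "<built-in KV prompt>"  # placeholder for the built-in prompt


def resolve_kv_model(overrides: Mapping, env: Mapping) -> str:
    """Request override first, then LOCAL_KV_MODEL, then the code default."""
    return (
        overrides.get("key_value", {}).get("model")
        or env.get("LOCAL_KV_MODEL")
        or CODE_DEFAULT_MODEL
    )


def resolve_kv_prompt(overrides: Mapping, env: Mapping) -> str:
    """Base prompt from LOCAL_KV_PROMPT or the built-in, plus custom instructions."""
    base = env.get("LOCAL_KV_PROMPT") or BUILT_IN_PROMPT
    extra = overrides.get("key_value", {}).get("custom_instructions")
    return f"{base}\n\n{extra}" if extra else base  # separator is an assumption


overrides = {"key_value": {"custom_instructions": "Use MM/DD/YYYY format."}}
env = {"LOCAL_KV_MODEL": "accurate"}

print(resolve_kv_model(overrides, env))   # accurate
print(resolve_kv_prompt(overrides, env))
```

Note that custom instructions never replace the base prompt; only `LOCAL_KV_PROMPT` does that.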

### Environment variable defaults

| Variable          | Description                                           | Default                      |
| ----------------- | ----------------------------------------------------- | ---------------------------- |
| `LOCAL_KV_MODEL`  | Override model for KV processing                      | None (uses built-in cascade) |
| `LOCAL_KV_PROMPT` | Base prompt for KV processing (can be fully replaced) | Built-in prompt              |

## AI usage tracking

Reducto includes a comprehensive AI usage tracking system that monitors language model consumption throughout the document processing pipeline. This feature provides detailed insights into token usage, request counts, and model utilization for billing and optimization purposes.

### How AI usage tracking works

The AI usage tracking system operates at the block level within the parsing pipeline:

1. **Token Counting**: Each AI operation (table summarization, figure analysis, key-value extraction, etc.) records token consumption
2. **Request Tracking**: The system counts API calls made to each model
3. **Model Identification**: Usage is tracked per model type with provider information
4. **Aggregation**: Usage is aggregated across all blocks and pages for comprehensive reporting

### Available via /parse API

AI usage information is currently available **only through the `/parse` API endpoint**, using the `custom_format` parameter. No other API endpoint returns it.

### Usage information structure

When enabled, the system returns an `AIUsageInfo` object containing:

```json theme={null}
{
  "did_use_ai_models": true,
  "ai_usage_info": [
    {
      "promptTokenCount": 1500,
      "completionTokenCount": 300,
      "cachedTokenCount": 0,
      "requestCount": 2,
      "modelType": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
      "modelProvider": "anthropic",
      "modelRateLimitFamily": "us.anthropic.claude-3-7-sonnet"
    }
  ]
}
```

### Field descriptions

* **`did_use_ai_models`**: Boolean indicating whether any AI models were used during processing
* **`ai_usage_info`**: Array of usage information objects, one per model type used
* **`promptTokenCount`**: Total input tokens sent to the model
* **`completionTokenCount`**: Total output tokens generated by the model
* **`cachedTokenCount`**: Total cached tokens used (when supported by provider)
* **`requestCount`**: Number of API calls made to this model
* **`modelType`**: Standardized model identifier
* **`modelProvider`**: Provider name (e.g., "anthropic", "openai")
* **`modelRateLimitFamily`**: Rate limiting group for the model
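
For billing or capacity reports, the per-model entries can be summed on the client. The sketch below assumes the usage payload has already been extracted from the `/parse` response as a dictionary shaped like the example above; the helper name `summarize_usage` is hypothetical.

```python
# Sample payload copied from the structure shown above.
usage = {
    "did_use_ai_models": True,
    "ai_usage_info": [
        {
            "promptTokenCount": 1500,
            "completionTokenCount": 300,
            "cachedTokenCount": 0,
            "requestCount": 2,
            "modelType": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
            "modelProvider": "anthropic",
            "modelRateLimitFamily": "us.anthropic.claude-3-7-sonnet",
        }
    ],
}


def summarize_usage(payload: dict) -> dict:
    """Sum token and request counts across all models in the payload."""
    totals = {"prompt": 0, "completion": 0, "cached": 0, "requests": 0}
    for entry in payload.get("ai_usage_info", []):
        totals["prompt"] += entry.get("promptTokenCount", 0)
        totals["completion"] += entry.get("completionTokenCount", 0)
        totals["cached"] += entry.get("cachedTokenCount", 0)
        totals["requests"] += entry.get("requestCount", 0)
    return totals


print(summarize_usage(usage))
# {'prompt': 1500, 'completion': 300, 'cached': 0, 'requests': 2}
```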

### Enabling AI usage tracking

To retrieve AI usage information, set the `custom_format` parameter to `"ai_usage"` in your `/parse` request:

```json theme={null}
{
  "input": "your_document_url",
  "settings": {
    "custom_format": "ai_usage"
  }
}
```

### Tracked AI operations

The system tracks usage from these AI-powered features:

* **Table Summarization**: Analysis and description of complex tables
* **Figure Summarization**: Analysis and description of images and charts
* **Key-Value Enrichment**: Enrichment for form-like regions within documents

### Model name standardization

The system automatically standardizes model names for consistent reporting:

* Internal model identifiers are mapped to standard formats
* Provider information is automatically added
* Rate limit families are identified for capacity planning

### Possible model identifiers

The following model identifiers may appear in the `modelType` field of AI usage tracking responses, if you have OpenAI and Anthropic access enabled:

#### OpenAI models

* `gpt-4o-2024-08-06`
* `gpt-4o-mini-2024-07-18`

#### Anthropic models

* `claude-haiku-4-5-20251001`
* `claude-3-7-sonnet-20250219`

If you enable other model providers, their model identifiers appear with provider-specific prefixes.
