
# Operations

> Runtime configuration for on-premise Reducto worker health and restart behavior

Runtime knobs operators use to keep on-prem Reducto worker pods healthy: liveness behavior, restart policy, and watchdog tuning.

Related references:

* [Observability → Pod stack trace dumps](/onprem/observability#pod-stack-trace-dumps-sigusr2) for live thread-stack diagnostics via `SIGUSR2`.
* [Database configuration](/onprem/database_configuration) for DB pool sizing and timeouts.
* [LLM options → Azure Vision](/onprem/llm_options#azure-vision-ocr) for OCR provider timeouts, retries, and failover.

## Worker liveness probe

The DB-queue worker pods (`reducto-worker`, `reducto-priority-worker`, `reducto-gpu-worker`) ship with a Kubernetes liveness probe that restarts a pod **only** when it is stuck mid-processing, not when it is idle waiting for work.

### How it works

1. An asyncio `WorkerWatchdog` task runs alongside the worker's job loops.
2. Every 5 seconds it writes a heartbeat file `/tmp/worker-state` containing `<unix_timestamp> <oldest_in_flight_task_age_seconds>`. Idle workers emit `-1` for the age.
3. The kubelet runs `bin/worker-liveness.sh` as an `exec` probe. The script fails (exit non-zero) and triggers a restart when **either**:
   * the heartbeat file itself is older than `WORKER_LIVENESS_WATCHDOG_STALE_SEC` (event loop is wedged so the watchdog can't tick); **or**
   * the oldest in-flight task age exceeds `WORKER_LIVENESS_STUCK_THRESHOLD_SEC` (a real job has hung beyond the threshold).
4. Idle workers always pass the probe. The watchdog ticks even with no work, and the age sentinel `-1` is always treated as healthy.
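
The decision in steps 3 and 4 can be sketched as a small function. The shipped probe is the shell script `bin/worker-liveness.sh`; the Python rendering below only illustrates the documented rules. The function name `probe_passes`, its parameters, and the handling of a malformed file are assumptions. It also uses the timestamp written inside the file as the heartbeat time, whereas the real script may check the file's modification time instead.

```python
def probe_passes(state_text: str, now: float,
                 stale_sec: float = 60, stuck_sec: float = 1800) -> bool:
    """Return True if the heartbeat contents describe a healthy worker.

    state_text is the heartbeat file content:
    "<unix_timestamp> <oldest_in_flight_task_age_seconds>".
    """
    fields = state_text.split()
    if len(fields) != 2:
        return False  # assumption: a malformed heartbeat counts as unhealthy
    try:
        heartbeat_ts = float(fields[0])
        oldest_age = float(fields[1])
    except ValueError:
        return False  # assumption: same handling for non-numeric fields

    # Rule 1: the watchdog has not ticked recently, so the event loop is wedged.
    if now - heartbeat_ts > stale_sec:
        return False
    # Idle workers report the sentinel -1, which is always healthy.
    if oldest_age < 0:
        return True
    # Rule 2: a real job has been in flight longer than the threshold.
    return oldest_age <= stuck_sec


print(probe_passes("1778631873 -1", now=1778631880))    # True: idle, fresh heartbeat
print(probe_passes("1778631873 1801", now=1778631880))  # False: task over 1800s
print(probe_passes("1778631873 -1", now=1778631934))    # False: heartbeat 61s old
```

Checking staleness first means a wedged event loop fails the probe even if the last recorded task age looked fine.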

A file-based heartbeat is used rather than an in-process HTTP `/health` endpoint because an HTTP server can keep returning `200` while the asyncio event loop is blocked on a syscall. The watchdog can only refresh the file while the event loop is running, so a fresh heartbeat is direct evidence that the loop is alive.
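
A minimal sketch of such a heartbeat writer follows. It assumes a hypothetical `heartbeat_loop` coroutine and an `oldest_age` callback; neither is Reducto's actual `WorkerWatchdog` API, and the write-then-rename step is an illustrative choice rather than documented behavior. Because the writer is an ordinary asyncio task, it stops refreshing the file as soon as something blocks the event loop.

```python
import asyncio
import os
import tempfile
import time
from typing import Callable


async def heartbeat_loop(path: str, oldest_age: Callable[[], int],
                         interval: float = 5.0) -> None:
    """Write '<unix_timestamp> <oldest_age>' to path every `interval` seconds."""
    while True:
        line = f"{int(time.time())} {oldest_age()}\n"
        # Write to a sibling temp file, then rename, so a reader never sees
        # a partially written line (assumption, not documented behavior).
        tmp_path = f"{path}.tmp"
        with open(tmp_path, "w") as fh:
            fh.write(line)
        os.replace(tmp_path, path)
        # If the event loop is blocked, this sleep never resumes and the
        # file goes stale.
        await asyncio.sleep(interval)


async def demo() -> str:
    """Run the loop briefly against a temp file and return what it wrote."""
    path = os.path.join(tempfile.mkdtemp(), "worker-state")
    # An idle worker reports the sentinel -1; a short interval keeps the demo fast.
    task = asyncio.create_task(heartbeat_loop(path, lambda: -1, interval=0.05))
    await asyncio.sleep(0.2)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    with open(path) as fh:
        return fh.read()
```

Running `asyncio.run(demo())` returns a single line holding the current Unix time followed by `-1`, the same shape as the idle heartbeat described above.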

### Helm configuration

Configure the probe via `worker.livenessProbe.*` in your Helm values:

```yaml
worker:
  livenessProbe:
    enabled: true                # set to false to disable the probe entirely
    stuckTaskThresholdSec: 1800  # restart if any in-flight task runs longer than this
    watchdogStaleSec: 60         # restart if the watchdog heartbeat hasn't ticked in this long
    periodSeconds: 30            # how often kubelet runs the probe
    timeoutSeconds: 5            # exec probe timeout
    initialDelaySeconds: 120     # grace period after pod start before probing begins
    failureThreshold: 2          # consecutive probe failures before pod restart
```

The same Helm partial applies the probe to all three worker deployments, so a single block configures `reducto-worker`, `reducto-priority-worker`, and `reducto-gpu-worker` together.

### Defaults

| Knob                    | Default         | When to change                                                                                                                                                                                 |
| ----------------------- | --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `stuckTaskThresholdSec` | `1800` (30 min) | Raise if your workload includes legitimately long single-task work (large multi-thousand-page documents, long extraction prompts). Lower if you'd rather fail fast and rely on client retries. |
| `watchdogStaleSec`      | `60`            | Rarely needs tuning. Lowering it catches an event-loop wedge sooner, but keep it several times larger than the 5s tick interval to avoid false positives. |
| `periodSeconds`         | `30`            | Lower for faster detection at the cost of more probe overhead.                                                                                                                                 |
| `initialDelaySeconds`   | `120`           | Raise if your pods take longer to come up (large image pulls, slow init containers).                                                                                                           |
| `failureThreshold`      | `2`             | Raise to make restart decisions more conservative.                                                                                                                                             |
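
These knobs combine into a rough worst-case delay between a fault and the restart decision. The sketch below is back-of-the-envelope arithmetic, not a guarantee. It assumes the kubelet restarts the container after `failureThreshold` consecutive failed probes spaced `periodSeconds` apart, and it ignores probe execution time and the pod's termination grace period. The function names are illustrative.

```python
TICK_SEC = 5  # watchdog heartbeat interval described under "How it works"


def stuck_task_restart_delay(stuck_sec: int = 1800, period: int = 30,
                             failures: int = 2, tick: int = TICK_SEC) -> int:
    """Approximate worst-case seconds from task start to the restart decision."""
    # The task age must pass the threshold, be written at the next watchdog
    # tick, and then be seen by `failures` consecutive probes.
    return stuck_sec + tick + period * failures


def wedged_loop_restart_delay(stale_sec: int = 60, period: int = 30,
                              failures: int = 2) -> int:
    """Approximate worst-case seconds from an event-loop wedge to the restart decision."""
    # The heartbeat stops at the wedge, must age past stale_sec, and must
    # then fail `failures` consecutive probes.
    return stale_sec + period * failures


print(stuck_task_restart_delay())   # 1865
print(wedged_loop_restart_delay())  # 120
```

With the defaults, a hung job is restarted roughly a minute after it crosses the 30-minute threshold, and a wedged event loop within about two minutes.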

### Environment variables

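The Helm chart maps the values above onto environment variables read by the Python watchdog and the shell probe, which keeps the two in sync. You normally configure these through Helm, but you can set them directly when running outside the chart. In that case, keep `WORKER_STUCK_TASK_THRESHOLD_SEC` and `WORKER_LIVENESS_STUCK_THRESHOLD_SEC` equal so that the watchdog's warning and the probe's restart decision use the same threshold: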

| Variable                              | Read by                  | Default             | Purpose                                                                                                                                                                    |
| ------------------------------------- | ------------------------ | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `WORKER_STUCK_TASK_THRESHOLD_SEC`     | Python watchdog          | `1800`              | Drives the per-task age comparison **and** the `logfire.warn` emitted when the oldest in-flight task age exceeds 80% of the threshold (an early signal for SREs before the kubelet restarts the pod). |
| `WORKER_LIVENESS_STUCK_THRESHOLD_SEC` | `bin/worker-liveness.sh` | `1800`              | Per-task age limit used by the exec probe itself. Helm sets this from `worker.livenessProbe.stuckTaskThresholdSec`.                                                        |
| `WORKER_LIVENESS_WATCHDOG_STALE_SEC`  | `bin/worker-liveness.sh` | `60`                | Maximum heartbeat-file age before the probe fails. Helm sets this from `worker.livenessProbe.watchdogStaleSec`.                                                            |
| `WORKER_WATCHDOG_STATE_PATH`          | Both                     | `/tmp/worker-state` | Heartbeat file location. Almost never needs to change.                                                                                                                     |
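
When running outside the chart, the watchdog-side settings could be resolved along these lines. This is a sketch built around a hypothetical `load_watchdog_settings` helper and `WatchdogSettings` type; only the variable names and defaults come from the table above, and the warning point is derived from the documented 80% rule.

```python
import os
from typing import Mapping, NamedTuple


class WatchdogSettings(NamedTuple):
    stuck_threshold_sec: int
    warn_after_sec: int
    state_path: str


def load_watchdog_settings(env: Mapping[str, str] = os.environ) -> WatchdogSettings:
    """Read the Python-side watchdog settings, falling back to documented defaults."""
    threshold = int(env.get("WORKER_STUCK_TASK_THRESHOLD_SEC", "1800"))
    return WatchdogSettings(
        stuck_threshold_sec=threshold,
        # Early-warning point: 80% of the threshold, in whole seconds.
        warn_after_sec=threshold * 4 // 5,
        state_path=env.get("WORKER_WATCHDOG_STATE_PATH", "/tmp/worker-state"),
    )


print(load_watchdog_settings({}).warn_after_sec)  # 1440
```

With the default threshold of 1800 seconds, the warning fires once the oldest task passes 1440 seconds (24 minutes).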

### Disabling the probe

If you're operating in a constrained environment that can't run exec probes, or you'd rather rely on external orchestration, set:

```yaml
worker:
  livenessProbe:
    enabled: false
```

The Python watchdog still runs and still emits `logfire.warn` when a task exceeds 80% of the threshold. The kubelet simply no longer restarts the pod.

### Verifying it's installed

After deploy, confirm the probe is wired up:

```bash
kubectl describe pod -n reducto -l app=reducto-worker | grep -A 4 Liveness
# Liveness: exec [bin/worker-liveness.sh] delay=120s timeout=5s period=30s ...
```

And confirm the watchdog file is being refreshed:

```bash
kubectl exec -n reducto <pod-name> -- cat /tmp/worker-state
# 1778631873 -1        <- idle (age sentinel -1 is healthy)
# 1778631878 42        <- busy with a 42-second-old task
```
