
Runtime knobs operators use to keep on-prem Reducto worker pods healthy: liveness behavior, restart policy, and watchdog tuning.

Worker liveness probe

The DB-queue worker pods (reducto-worker, reducto-priority-worker, reducto-gpu-worker) ship with a Kubernetes liveness probe that restarts a pod only when it is stuck mid-processing, not when it is idle waiting for work.

How it works

  1. An asyncio WorkerWatchdog task runs alongside the worker’s job loops.
  2. Every 5 seconds it writes a heartbeat file /tmp/worker-state containing <unix_timestamp> <oldest_in_flight_task_age_seconds>. Idle workers emit -1 for the age.
  3. The kubelet runs bin/worker-liveness.sh as an exec probe. The script fails (exit non-zero) and triggers a restart when either:
    • the heartbeat file itself is older than WORKER_LIVENESS_WATCHDOG_STALE_SEC (event loop is wedged so the watchdog can’t tick); or
    • the oldest in-flight task age exceeds WORKER_LIVENESS_STUCK_THRESHOLD_SEC (a real job has hung beyond the threshold).
  4. Idle workers always pass the probe. The watchdog ticks even with no work, and the age sentinel -1 is always treated as healthy.
A file-based heartbeat is used rather than an in-process HTTP /health endpoint because an HTTP server can keep returning 200 while the asyncio event loop is blocked on a syscall. The watchdog has to be alive to refresh the file, so the probe directly tests the thing we care about.
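
For orientation, here is a minimal Python sketch of the heartbeat side of this design. The names are illustrative rather than the actual Reducto implementation (the watchdog is sketched as a plain coroutine, and the in_flight_started_at registry is hypothetical); only the file format, the 5-second tick, the -1 idle sentinel, and the 80%-of-threshold warning come from the behavior described above:

import asyncio
import os
import time

STATE_PATH = os.environ.get("WORKER_WATCHDOG_STATE_PATH", "/tmp/worker-state")
STUCK_THRESHOLD_SEC = float(os.environ.get("WORKER_STUCK_TASK_THRESHOLD_SEC", "1800"))
TICK_INTERVAL_SEC = 5

# Hypothetical registry: each job loop records time.monotonic() when it picks
# up a task and deletes the entry when the task finishes.
in_flight_started_at: dict[str, float] = {}

async def watchdog() -> None:
    """Write '<unix_timestamp> <oldest_in_flight_task_age_seconds>' every tick."""
    while True:
        now = time.monotonic()
        ages = [now - started for started in in_flight_started_at.values()]
        oldest = max(ages) if ages else -1  # -1 is the idle sentinel
        if oldest > 0.8 * STUCK_THRESHOLD_SEC:
            # Early breadcrumb for operators; the real worker emits this
            # via logfire.warn before the kubelet ever restarts the pod.
            print(f"warning: oldest in-flight task is {oldest:.0f}s old")
        # Write-then-rename so the exec probe never reads a half-written file
        # (an assumption about the implementation, not documented above).
        tmp_path = STATE_PATH + ".tmp"
        with open(tmp_path, "w") as f:
            f.write(f"{int(time.time())} {int(oldest)}")
        os.replace(tmp_path, STATE_PATH)
        await asyncio.sleep(TICK_INTERVAL_SEC)

The key property: if the event loop blocks anywhere, this coroutine stops running and the file's first field stops advancing, which is exactly what the probe's staleness check detects.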

Helm configuration

Configure the probe via worker.livenessProbe.* in your Helm values:
worker:
  livenessProbe:
    enabled: true                # set to false to disable the probe entirely
    stuckTaskThresholdSec: 1800  # restart if any in-flight task runs longer than this
    watchdogStaleSec: 60         # restart if the watchdog heartbeat hasn't ticked in this long
    periodSeconds: 30            # how often kubelet runs the probe
    timeoutSeconds: 5            # exec probe timeout
    initialDelaySeconds: 120     # grace period after pod start before probing begins
    failureThreshold: 2          # consecutive probe failures before pod restart
The same Helm partial applies the probe to all three worker deployments, so a single block configures reducto-worker, reducto-priority-worker, and reducto-gpu-worker together.

Defaults

Knob | Default | When to change
stuckTaskThresholdSec | 1800 (30 min) | Raise if your workload includes legitimately long single-task work (large multi-thousand-page documents, long extraction prompts). Lower if you’d rather fail fast and rely on client retries.
watchdogStaleSec | 60 | Rarely needs tuning. This value sets the floor on how quickly an event-loop wedge can be caught; keep it several times larger than the 5s tick interval to avoid false positives.
periodSeconds | 30 | Lower for faster detection at the cost of more probe overhead.
initialDelaySeconds | 120 | Raise if your pods take longer to come up (large image pulls, slow init containers).
failureThreshold | 2 | Raise to make restart decisions more conservative.
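
As a back-of-the-envelope check on these defaults: if the event loop wedges, the heartbeat file goes stale after watchdogStaleSec (60s), and the kubelet then needs failureThreshold consecutive failing probes spaced periodSeconds apart, so a restart lands roughly 60 + 2 × 30 = 120 seconds after the last heartbeat in the worst case. A hung task is caught on a similar schedule: stuckTaskThresholdSec plus one or two probe periods.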

Environment variables

The Helm chart pipes the values above into env vars that the Python watchdog and shell probe both read, so the two stay in sync. You normally configure these via Helm, but you can override them directly when running outside the chart:
Variable | Read by | Default | Purpose
WORKER_STUCK_TASK_THRESHOLD_SEC | Python watchdog | 1800 | Drives the per-task age comparison and the logfire.warn emitted when the oldest age exceeds 80% of the threshold (an early breadcrumb for SREs before the kubelet restarts the pod).
WORKER_LIVENESS_STUCK_THRESHOLD_SEC | bin/worker-liveness.sh | 1800 | Per-task age limit used by the exec probe itself. Helm sets this from worker.livenessProbe.stuckTaskThresholdSec.
WORKER_LIVENESS_WATCHDOG_STALE_SEC | bin/worker-liveness.sh | 60 | Maximum heartbeat-file age before the probe fails. Helm sets this from worker.livenessProbe.watchdogStaleSec.
WORKER_WATCHDOG_STATE_PATH | Both | /tmp/worker-state | Heartbeat file location. Almost never needs to change.
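
The probe's decision logic is small enough to sketch. The following Python translation is illustrative only (the real check is the shell script bin/worker-liveness.sh, and its handling of a missing state file is an assumption here); it reads the same env vars and heartbeat file described above:

import os
import sys
import time

STATE_PATH = os.environ.get("WORKER_WATCHDOG_STATE_PATH", "/tmp/worker-state")
STALE_SEC = float(os.environ.get("WORKER_LIVENESS_WATCHDOG_STALE_SEC", "60"))
STUCK_SEC = float(os.environ.get("WORKER_LIVENESS_STUCK_THRESHOLD_SEC", "1800"))

def main() -> int:
    try:
        with open(STATE_PATH) as f:
            heartbeat_ts, oldest_age = map(int, f.read().split())
    except (OSError, ValueError):
        return 1  # missing or unreadable heartbeat file: assume unhealthy
    if time.time() - heartbeat_ts > STALE_SEC:
        return 1  # watchdog stopped ticking: the event loop is wedged
    if oldest_age > STUCK_SEC:
        return 1  # a real job has hung past the threshold
    return 0  # healthy; the idle sentinel -1 always passes this check

if __name__ == "__main__":
    sys.exit(main())

Exit code 0 passes the kubelet's exec probe; anything non-zero counts toward failureThreshold.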

Disabling the probe

If you’re operating in a constrained environment that can’t run exec probes, or you’d rather rely on external orchestration, set:
worker:
  livenessProbe:
    enabled: false
The Python watchdog still runs and still emits logfire.warn when a task exceeds 80% of the threshold; disabling the probe only removes the automatic restarts.

Verifying it’s installed

After deploy, confirm the probe is wired up:
kubectl describe pod -n reducto -l app=reducto-worker | grep -A 4 Liveness
# Liveness: exec [bin/worker-liveness.sh] delay=120s timeout=5s period=30s ...
And confirm the heartbeat file is being refreshed:
kubectl exec -n reducto <pod-name> -- cat /tmp/worker-state
# 1778631873 -1        <- idle (age sentinel -1 is healthy)
# 1778631878 42        <- busy with a 42-second-old task