
Runtime knobs operators use to keep on-prem Reducto worker pods healthy: liveness behavior, restart policy, and watchdog tuning.

Worker liveness probe

The DB-queue worker pods (reducto-worker, reducto-priority-worker, reducto-gpu-worker) ship with a Kubernetes liveness probe that restarts a pod only when it is stuck mid-processing, not when it is idle waiting for work.

How it works

  1. An asyncio WorkerWatchdog task runs alongside the worker’s job loops.
  2. Every 5 seconds it writes a heartbeat file /tmp/worker-state containing <unix_timestamp> <oldest_in_flight_task_age_seconds>. Idle workers emit -1 for the age.
  3. The kubelet runs bin/worker-liveness.sh as an exec probe. The script fails (exit non-zero) and triggers a restart when either:
    • the heartbeat file itself is older than WORKER_LIVENESS_WATCHDOG_STALE_SEC (event loop is wedged so the watchdog can’t tick); or
    • the oldest in-flight task age exceeds WORKER_LIVENESS_STUCK_THRESHOLD_SEC (a real job has hung beyond the threshold).
  4. Idle workers always pass the probe. The watchdog ticks even with no work, and the age sentinel -1 is always treated as healthy.
A file-based heartbeat is used rather than an in-process HTTP /health endpoint because an HTTP server can keep returning 200 while the asyncio event loop is blocked on a syscall. The watchdog has to be alive to refresh the file, so the probe directly tests the thing we care about.
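
For orientation, here is a minimal Python sketch of the heartbeat side of this design. The names are illustrative rather than the actual Reducto implementation (the watchdog is sketched as a plain coroutine, and the in_flight_started_at registry is hypothetical); only the file format, the 5-second tick, the -1 idle sentinel, and the 80%-of-threshold warning come from the behavior described above:

import asyncio
import os
import time

STATE_PATH = os.environ.get("WORKER_WATCHDOG_STATE_PATH", "/tmp/worker-state")
STUCK_THRESHOLD_SEC = float(os.environ.get("WORKER_STUCK_TASK_THRESHOLD_SEC", "1800"))
TICK_INTERVAL_SEC = 5

# Hypothetical registry: each job loop records time.monotonic() when it picks
# up a task and deletes the entry when the task finishes.
in_flight_started_at: dict[str, float] = {}

async def watchdog() -> None:
    """Write '<unix_timestamp> <oldest_in_flight_task_age_seconds>' every tick."""
    while True:
        now = time.monotonic()
        ages = [now - started for started in in_flight_started_at.values()]
        oldest = max(ages) if ages else -1  # -1 is the idle sentinel
        if oldest > 0.8 * STUCK_THRESHOLD_SEC:
            # Early breadcrumb for operators; the real worker emits this
            # via logfire.warn before the kubelet ever restarts the pod.
            print(f"warning: oldest in-flight task is {oldest:.0f}s old")
        # Write-then-rename so the exec probe never reads a half-written file
        # (an assumption about the implementation, not documented above).
        tmp_path = STATE_PATH + ".tmp"
        with open(tmp_path, "w") as f:
            f.write(f"{int(time.time())} {int(oldest)}")
        os.replace(tmp_path, STATE_PATH)
        await asyncio.sleep(TICK_INTERVAL_SEC)

The key property: if the event loop blocks anywhere, this coroutine stops running and the file's first field stops advancing, which is exactly what the probe's staleness check detects.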

Helm configuration

Configure the probe via worker.livenessProbe.* in your Helm values:
worker:
  livenessProbe:
    enabled: true                # set to false to disable the probe entirely
    stuckTaskThresholdSec: 1800  # restart if any in-flight task runs longer than this
    watchdogStaleSec: 60         # restart if the watchdog heartbeat hasn't ticked in this long
    periodSeconds: 30            # how often kubelet runs the probe
    timeoutSeconds: 5            # exec probe timeout
    initialDelaySeconds: 120     # grace period after pod start before probing begins
    failureThreshold: 2          # consecutive probe failures before pod restart
The same Helm partial applies the probe to all three worker deployments, so a single block configures reducto-worker, reducto-priority-worker, and reducto-gpu-worker together.

Defaults

Knob | Default | When to change
stuckTaskThresholdSec | 1800 (30 min) | Raise if your workload includes legitimately long single-task work (large multi-thousand-page documents, long extraction prompts). Lower if you’d rather fail fast and rely on client retries.
watchdogStaleSec | 60 | Rarely needs tuning. This value sets the floor on how quickly an event-loop wedge can be caught; keep it several times larger than the 5s tick interval to avoid false positives.
periodSeconds | 30 | Lower for faster detection at the cost of more probe overhead.
initialDelaySeconds | 120 | Raise if your pods take longer to come up (large image pulls, slow init containers).
failureThreshold | 2 | Raise to make restart decisions more conservative.
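
As a back-of-the-envelope check on these defaults: if the event loop wedges, the heartbeat file goes stale after watchdogStaleSec (60s), and the kubelet then needs failureThreshold consecutive failing probes spaced periodSeconds apart, so a restart lands roughly 60 + 2 × 30 = 120 seconds after the last heartbeat in the worst case. A hung task is caught on a similar schedule: stuckTaskThresholdSec plus one or two probe periods.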

Environment variables

The Helm chart pipes the values above into env vars that the Python watchdog and shell probe both read, so the two stay in sync. You normally configure these via Helm, but you can override them directly when running outside the chart:
Variable | Read by | Default | Purpose
WORKER_STUCK_TASK_THRESHOLD_SEC | Python watchdog | 1800 | Drives the per-task age comparison and the logfire.warn emitted when the oldest age exceeds 80% of the threshold (an early breadcrumb for SREs before the kubelet restarts the pod).
WORKER_LIVENESS_STUCK_THRESHOLD_SEC | bin/worker-liveness.sh | 1800 | Per-task age limit used by the exec probe itself. Helm sets this from worker.livenessProbe.stuckTaskThresholdSec.
WORKER_LIVENESS_WATCHDOG_STALE_SEC | bin/worker-liveness.sh | 60 | Maximum heartbeat-file age before the probe fails. Helm sets this from worker.livenessProbe.watchdogStaleSec.
WORKER_WATCHDOG_STATE_PATH | Both | /tmp/worker-state | Heartbeat file location. Almost never needs to change.
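
The probe's decision logic is small enough to sketch. The following Python translation is illustrative only (the real check is the shell script bin/worker-liveness.sh, and its handling of a missing state file is an assumption here); it reads the same env vars and heartbeat file described above:

import os
import sys
import time

STATE_PATH = os.environ.get("WORKER_WATCHDOG_STATE_PATH", "/tmp/worker-state")
STALE_SEC = float(os.environ.get("WORKER_LIVENESS_WATCHDOG_STALE_SEC", "60"))
STUCK_SEC = float(os.environ.get("WORKER_LIVENESS_STUCK_THRESHOLD_SEC", "1800"))

def main() -> int:
    try:
        with open(STATE_PATH) as f:
            heartbeat_ts, oldest_age = map(int, f.read().split())
    except (OSError, ValueError):
        return 1  # missing or unreadable heartbeat file: assume unhealthy
    if time.time() - heartbeat_ts > STALE_SEC:
        return 1  # watchdog stopped ticking: the event loop is wedged
    if oldest_age > STUCK_SEC:
        return 1  # a real job has hung past the threshold
    return 0  # healthy; the idle sentinel -1 always passes this check

if __name__ == "__main__":
    sys.exit(main())

Exit code 0 passes the kubelet's exec probe; anything non-zero counts toward failureThreshold.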

Disabling the probe

If you’re operating in a constrained environment that can’t run exec probes, or you’d rather rely on external orchestration, set:
worker:
  livenessProbe:
    enabled: false
The Python watchdog still runs and still emits logfire.warn when a task exceeds 80% of the threshold; disabling the probe only removes the automatic restarts.

Verifying it’s installed

After deploy, confirm the probe is wired up:
kubectl describe pod -n reducto -l app=reducto-worker | grep -A 4 Liveness
# Liveness: exec [bin/worker-liveness.sh] delay=120s timeout=5s period=30s ...
And confirm the heartbeat file is being refreshed:
kubectl exec -n reducto <pod-name> -- cat /tmp/worker-state
# 1778631873 -1        <- idle (age sentinel -1 is healthy)
# 1778631878 42        <- busy with a 42-second-old task