Runtime knobs operators use to keep on-prem Reducto worker pods healthy: liveness behavior, restart policy, and watchdog tuning. Related references:
- Observability → Pod stack trace dumps for live thread-stack diagnostics via SIGUSR2.
- Database configuration for DB pool sizing and timeouts.
- LLM options → Azure Vision for OCR provider timeouts, retries, and failover.
Worker liveness probe
The DB-queue worker pods (`reducto-worker`, `reducto-priority-worker`, `reducto-gpu-worker`) ship with a Kubernetes liveness probe that restarts a pod only when it is stuck mid-processing, not when it is idle waiting for work.
How it works
- An asyncio `WorkerWatchdog` task runs alongside the worker's job loops.
- Every 5 seconds it writes a heartbeat file `/tmp/worker-state` containing `<unix_timestamp> <oldest_in_flight_task_age_seconds>`. Idle workers emit `-1` for the age.
- The kubelet runs `bin/worker-liveness.sh` as an `exec` probe. The script fails (exits non-zero) and triggers a restart when either:
  - the heartbeat file itself is older than `WORKER_LIVENESS_WATCHDOG_STALE_SEC` (the event loop is wedged, so the watchdog can't tick); or
  - the oldest in-flight task age exceeds `WORKER_LIVENESS_STUCK_THRESHOLD_SEC` (a real job has hung beyond the threshold).
- Idle workers always pass the probe. The watchdog ticks even with no work, and the age sentinel `-1` is always treated as healthy.
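For example, a worker whose oldest in-flight task has been running for about 95 seconds writes a single line like this (timestamp and age values are illustrative):

```
1718000000 95.2
```

An idle worker writes `-1` in the second field instead.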
The probe is an exec script rather than an HTTP `/health` endpoint because an HTTP server can keep returning 200 while the asyncio event loop is blocked on a syscall. The watchdog has to be alive to refresh the file, so the probe directly tests the thing we care about.
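Concretely, the probe the chart renders onto each worker pod looks roughly like the sketch below, assuming the default timing values from the table in the next section (the field names are standard Kubernetes probe fields):

```yaml
livenessProbe:
  exec:
    command: ["bin/worker-liveness.sh"]   # exits non-zero when stale or stuck
  initialDelaySeconds: 120                # skip probing while the pod starts up
  periodSeconds: 30                       # how often the kubelet runs the script
  failureThreshold: 2                     # consecutive failures before restart
```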
Helm configuration
Configure the probe via `worker.livenessProbe.*` in your Helm values. One set of values applies to `reducto-worker`, `reducto-priority-worker`, and `reducto-gpu-worker` together.
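Spelled out with their defaults, the values look like this (a sketch; see the table below for when to change each knob):

```yaml
worker:
  livenessProbe:
    stuckTaskThresholdSec: 1800   # max age of a single in-flight task (seconds)
    watchdogStaleSec: 60          # max heartbeat-file age (seconds)
    periodSeconds: 30
    initialDelaySeconds: 120
    failureThreshold: 2
```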
Defaults
| Knob | Default | When to change |
|---|---|---|
| `stuckTaskThresholdSec` | 1800 (30 min) | Raise if your workload includes legitimately long single-task work (large multi-thousand-page documents, long extraction prompts). Lower if you'd rather fail fast and rely on client retries. |
| `watchdogStaleSec` | 60 | Rarely needs tuning. It lower-bounds how quickly an event-loop wedge is caught; keep it several times larger than the 5 s tick interval to avoid false positives. |
| `periodSeconds` | 30 | Lower for faster detection at the cost of more probe overhead. |
| `initialDelaySeconds` | 120 | Raise if your pods take longer to come up (large image pulls, slow init containers). |
| `failureThreshold` | 2 | Raise to make restart decisions more conservative. |
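For example, if your workload routinely runs single tasks for close to an hour, you might raise only the stuck-task threshold and leave everything else at its default (the value here is illustrative):

```yaml
worker:
  livenessProbe:
    stuckTaskThresholdSec: 7200   # tolerate single tasks up to 2 hours
```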
Environment variables
The Helm chart pipes the values above into env vars that the Python watchdog and shell probe both read, so the two stay in sync. You normally configure these via Helm, but you can override directly when running outside the chart:

| Variable | Read by | Default | Purpose |
|---|---|---|---|
| `WORKER_STUCK_TASK_THRESHOLD_SEC` | Python watchdog | 1800 | Drives the per-task age comparison and the `logfire.warn` emitted when the oldest age exceeds 80% of the threshold (an early breadcrumb for SREs before the kubelet restarts the pod). |
| `WORKER_LIVENESS_STUCK_THRESHOLD_SEC` | `bin/worker-liveness.sh` | 1800 | Per-task age limit used by the exec probe itself. Helm sets this from `worker.livenessProbe.stuckTaskThresholdSec`. |
| `WORKER_LIVENESS_WATCHDOG_STALE_SEC` | `bin/worker-liveness.sh` | 60 | Maximum heartbeat-file age before the probe fails. Helm sets this from `worker.livenessProbe.watchdogStaleSec`. |
| `WORKER_WATCHDOG_STATE_PATH` | Both | `/tmp/worker-state` | Heartbeat file location. Almost never needs to change. |
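If you do override these outside the chart, keep the Python-side and probe-side thresholds equal yourself, since only Helm keeps them in sync. A sketch as container env (the values are illustrative):

```yaml
env:
  - name: WORKER_STUCK_TASK_THRESHOLD_SEC      # read by the Python watchdog
    value: "3600"
  - name: WORKER_LIVENESS_STUCK_THRESHOLD_SEC  # read by bin/worker-liveness.sh
    value: "3600"
```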
Disabling the probe
If you're operating in a constrained environment that can't run exec probes, or you'd rather rely on external orchestration, you can disable the probe in your Helm values. The watchdog itself keeps running and still emits its `logfire.warn` when tasks exceed 80% of the threshold; the probe just doesn't trigger restarts.
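A hypothetical values snippet; the `enabled` key below is an assumption, not something this page confirms, so check your chart's values file for the exact toggle:

```yaml
worker:
  livenessProbe:
    enabled: false   # HYPOTHETICAL key name: verify against your chart version
```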