# Observability & Monitoring

> Built-in observability stack for on-premise Reducto deployments

## Overview

On-premise Reducto deployments include a built-in observability stack called **ClickStack**, which provides:

* **HyperDX**: Unified observability UI for logs, traces, and metrics
* **ClickHouse**: High-performance analytics database for telemetry storage
* **OTEL Collector**: OpenTelemetry collector for ingesting and routing telemetry data

Enable ClickStack by setting `clickstack.enabled: true` in your Helm values. The remaining components are configured automatically, with no additional setup required.

Telemetry is part of your on-premise security boundary. Reducto emits logs to stdout and can route traces, metrics, and logs through OpenTelemetry, but you control where telemetry is stored, who can access it, and how long it is retained. See the [on-prem security model](/onprem/security_model) for the shared responsibility model.

Reducto telemetry is designed for operational metadata. It should not contain document content, OCR text, extracted values, prompts, model outputs, API tokens, access keys, secrets, or other customer content. File names and URLs can reveal customer identity and should be redacted or avoided before telemetry leaves the deployment.

## Accessing HyperDX

### Default Credentials

When ClickStack is enabled, a seed admin user is automatically created on first install with default credentials. Contact the Reducto team for the default login details, or configure your own credentials in your Helm values (see [Configuration](#configuration) below).

<Warning>
  Change the default password immediately after first login.
</Warning>

### Accessing the UI

HyperDX can be exposed via:

* **Ingress**: Set `clickstack.hyperdx.ingress.enabled: true` with your domain
* **Tailscale**: Set `clickstack.hyperdx.exposure.tailscale.enabled: true` for private access
* **Cloudflare Tunnel**: Set `clickstack.hyperdx.exposure.cloudflareTunnel.enabled: true`
* **Port-forward** (for testing): `kubectl port-forward svc/<release>-clickstack-app 3000:3000`

## Prometheus Scraping

Prometheus endpoints are intended for internal scraping by monitoring systems. Do not expose `/metrics`, `/prometheus`, ClickStack, HyperDX, ClickHouse, or the OTEL collector to the public internet.

To collect metrics from services that expose Prometheus endpoints (like NGINX ingress controllers), enable the Prometheus receiver on the OTEL collector together with the Target Allocator (TA) subchart. TA shards scrape targets across collector replicas via consistent hashing so that each target is scraped exactly once. See [Scaling the OTEL Collector](#scaling-the-otel-collector) for the multi-replica rationale.

```yaml
prometheusScrape:
  enabled: true

targetAllocator:
  enabled: true
  targetAllocator:
    config:
      collector_selector:
        matchlabels:
          app.kubernetes.io/instance: <release-name>
          app.kubernetes.io/name: otel-collector
      config:
        scrape_configs:
          - job_name: nginx-ingress
            scrape_interval: 30s
            static_configs:
              - targets:
                  - ingress-nginx-controller-metrics.ingress-nginx.svc.cluster.local:10254
          - job_name: kube-state-metrics
            scrape_interval: 30s
            static_configs:
              - targets:
                  - prometheus-stack-kube-state-metrics.monitoring.svc.cluster.local:8080
```

`scrape_configs` follows the standard Prometheus scrape config schema, so `static_configs`, `kubernetes_sd_configs`, `relabel_configs`, and the other standard fields are supported.

<Note>
  When `targetAllocator.enabled: true`, the OTEL collector pod must expose `POD_NAME` via the downward API so that each replica gets a distinct `collector_id` for consistent hashing. Chart rendering fails fast if `POD_NAME` is missing.

  ```yaml
  otelCollector:
    extraEnvs:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
  ```
</Note>

Scraped metrics route to ClickHouse by default. See [Routing](#routing) to fan out to other sinks.

## Scaling the OTEL Collector

The OTEL collector runs as a Deployment with `otelCollector.replicaCount` replicas (HPA optional). Two receivers need explicit coordination once `replicaCount > 1`; otherwise, every replica emits the same data and ClickHouse storage grows linearly with the replica count.

| Receiver                     | Coordination mechanism                                          | Values key                         |
| ---------------------------- | --------------------------------------------------------------- | ---------------------------------- |
| `prometheus` (scrape)        | Target Allocator subchart shards targets via consistent-hashing | `targetAllocator.enabled`          |
| `k8s_cluster` + `k8sobjects` | `k8s_leader_elector` extension — only the lease-holder emits    | `k8sMetrics.leaderElector.enabled` |
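
The sharding idea in the first row can be illustrated with a small hash ring. This is a minimal Python sketch of consistent hashing in general, not the Target Allocator's actual implementation; the `HashRing` class, the `vnodes` parameter, and the collector names are hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring; each collector owns several virtual points."""

    def __init__(self, collector_ids, vnodes=100):
        # Hypothetical layout: vnodes points per collector, sorted by position.
        self._ring = sorted(
            (_hash(f"{cid}#{i}"), cid)
            for cid in collector_ids
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, target: str) -> str:
        """Return the collector whose point follows the target's hash."""
        idx = bisect.bisect_right(self._points, _hash(target)) % len(self._ring)
        return self._ring[idx][1]


def assign(targets, collector_ids):
    """Assign every scrape target to exactly one collector replica."""
    ring = HashRing(collector_ids)
    return {target: ring.owner(target) for target in targets}
```

Because ownership depends only on the ring, adding a third replica moves some targets to the new replica and leaves the rest where they were. This is also why each replica needs a stable, distinct identity such as its pod name.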

### Leader Elector for Kubernetes Metrics & Events

`k8s_cluster` (cluster-level pod/node/container metrics) and `k8sobjects` (K8s events watch) are cluster-singletons — every replica running them independently produces an identical stream. Enable the leader-elector extension so only the lease-holder emits:

```yaml
k8sMetrics:
  enabled: true
  leaderElector:
    enabled: true
```

The chart renders a namespace-scoped `Role` + `RoleBinding` for the `coordination.k8s.io` lease, gated on `leaderElector.enabled`. Lease hand-off on pod rollover is automatic.

Both `targetAllocator.enabled` and `k8sMetrics.leaderElector.enabled` default to `false` — single-replica deployments work out of the box with no extra RBAC or subchart.

## Configuration

### Seed User

Configure the admin user credentials in your Helm values:

```yaml
clickstack:
  hyperdx:
    seedUser:
      email: "admin@yourcompany.com"
      password: "your-secure-password"
      teamName: "Your Team"
```

For production deployments, use a Kubernetes secret instead of a plaintext password:

```yaml
clickstack:
  hyperdx:
    seedUser:
      email: "admin@yourcompany.com"
      existingSecret: "my-hyperdx-secret"
      secretKey: "HYPERDX_ADMIN_PASSWORD"
      teamName: "Your Team"
```

### ClickHouse Storage

```yaml
clickstack:
  clickhouse:
    persistence:
      dataSize: 50Gi   # Adjust based on expected telemetry volume
      logSize: 10Gi
```

### Data Retention

Telemetry data retention is controlled by the OTEL exporter TTL:

```yaml
otelConfig:
  exporters:
    clickhouse:
      ttl: 360h  # 15 days (default: 72h)
```
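
Retention and disk sizing interact: the longer the TTL, the more data ClickHouse holds at steady state. The following Python sketch shows the arithmetic under loud assumptions: `daily_on_disk_gib` is a hypothetical measurement of how much compressed telemetry your deployment writes per day, and the `headroom` factor of 1.5 is an arbitrary safety margin, not a Reducto recommendation.

```python
def ttl_to_days(ttl: str) -> float:
    """Convert an hour-based TTL string such as '360h' into days."""
    if not ttl.endswith("h"):
        raise ValueError("this sketch only handles hour-based TTLs like '72h'")
    return int(ttl[:-1]) / 24


def estimate_disk_gib(daily_on_disk_gib: float, ttl: str, headroom: float = 1.5) -> float:
    """Rough steady-state disk estimate: daily volume x retention days x headroom."""
    return daily_on_disk_gib * ttl_to_days(ttl) * headroom
```

For example, 2 GiB per day on disk with `ttl: 360h` gives 2 × 15 × 1.5 = 45 GiB, which fits within a `dataSize` of `50Gi`.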

## Telemetry Pipeline

The OTEL collector receives telemetry from multiple sources and routes it to configured sinks:

| Source                     | What it collects                                                                                              | Default sink      |
| -------------------------- | ------------------------------------------------------------------------------------------------------------- | ----------------- |
| Application traces/metrics | OTLP from Reducto services                                                                                    | All enabled sinks |
| Kubernetes metrics         | Cluster-level pod, node, container metrics via `k8s_cluster` receiver; node-level kubelet stats via DaemonSet | All enabled sinks |
| Kubernetes events          | K8s events via `k8sobjects` receiver                                                                          | All enabled sinks |
| Prometheus scrape          | Metrics from any Prometheus endpoint                                                                          | ClickHouse        |
| Application logs           | OTLP logs from Reducto services                                                                               | ClickHouse        |

### Routing

Each source can be independently routed to any combination of sinks:

```yaml
otelConfig:
  routing:
    traces: [tinybird, datadog, clickhouse, iceberg]         # default
    metrics: [tinybird, datadog, clickhouse, iceberg]        # default
    k8sMetrics: [tinybird, datadog, clickhouse, iceberg]     # default
    k8sEvents: [tinybird, datadog, clickhouse, iceberg]      # default
    prometheusScrape: [clickhouse]                           # default
    logs: [clickhouse]                                       # default
```

Available sinks: `clickhouse`, `tinybird`, `datadog`, `iceberg`. Each sink must also be enabled in `otelConfig.exporters`. Most defaults list every sink, but a sink receives data only if it is both listed **and** enabled, so the defaults are safe for any exporter combination.
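
The "listed and enabled" rule can be modeled as a simple filter. The sketch below is illustrative Python, not chart code; `effective_sinks` and `enabled_exporters` are hypothetical names, and `DEFAULT_ROUTING` mirrors the defaults shown above.

```python
ALL_SINKS = ["tinybird", "datadog", "clickhouse", "iceberg"]

# Mirrors the documented routing defaults.
DEFAULT_ROUTING = {
    "traces": list(ALL_SINKS),
    "metrics": list(ALL_SINKS),
    "k8sMetrics": list(ALL_SINKS),
    "k8sEvents": list(ALL_SINKS),
    "prometheusScrape": ["clickhouse"],
    "logs": ["clickhouse"],
}


def effective_sinks(routing, enabled_exporters):
    """Keep only the sinks that are both listed for a source and enabled."""
    return {
        source: [sink for sink in sinks if sink in enabled_exporters]
        for source, sinks in routing.items()
    }
```

With only `clickhouse` enabled, every source resolves to `clickhouse`. Enabling `datadog` as well adds it to traces, metrics, and Kubernetes data, while logs and scraped metrics stay on ClickHouse until their routing lists change.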

## Pod Stack Trace Dumps (SIGUSR2)

Every Reducto worker and HTTP pod installs a `SIGUSR2` handler that dumps a per-thread stack trace to `stderr` when signalled. Use this when a pod is unresponsive (stuck event loop, hung downstream call, contended thread pool) and `kubectl logs` alone doesn't explain why.

Coverage:

| Pod                                                                          | Process labelled as                                                          |
| ---------------------------------------------------------------------------- | ---------------------------------------------------------------------------- |
| `reducto-http` (gunicorn)                                                    | `http-worker-<pid>`                                                          |
| `reducto-streaq-worker`                                                      | `streaq-<worker-name>-worker` (e.g. `streaq-io-worker`, `streaq-cpu-worker`) |
| `reducto-worker`, `reducto-priority-worker`, `reducto-gpu-worker` (DB-queue) | `k8s-worker` (or the value of `LOGFIRE_SERVICE_NAME` when set)               |

The handler is always installed, with no Helm flag to disable it. Output is written directly to `stderr` (not through the structured logger), so the trace appears even if the application logging pipeline is itself wedged.
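
For intuition about what such a handler does, Python's standard-library `faulthandler` module can register a comparable per-thread dump on `SIGUSR2`. This is a sketch of the general technique on Unix, not Reducto's handler: it omits the banner and process label, and `install_dump_handler` and `demo` are hypothetical names.

```python
import faulthandler
import os
import signal
import sys
import tempfile


def install_dump_handler(stream=sys.stderr):
    """Dump every thread's stack to `stream` whenever SIGUSR2 arrives (Unix only)."""
    faulthandler.register(signal.SIGUSR2, file=stream, all_threads=True)


def demo() -> str:
    """Signal the current process and return the captured dump text."""
    with tempfile.TemporaryFile(mode="w+") as out:
        install_dump_handler(out)
        # Same effect as `kill -s USR2 <pid>` sent from outside the process.
        os.kill(os.getpid(), signal.SIGUSR2)
        faulthandler.unregister(signal.SIGUSR2)
        out.seek(0)
        return out.read()
```

`faulthandler` writes straight to the file descriptor rather than through the `logging` module, which is the same property that keeps a dump visible when the logging pipeline is stuck.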

### Triggering a Dump

```bash
# Pick a pod that's misbehaving
kubectl get pods -n reducto -l app=reducto-worker

# Find the worker PID (gunicorn / python process)
kubectl exec -n reducto <pod-name> -- ps -eo pid,cmd | grep -E 'gunicorn|streaq|python'

# Send SIGUSR2 to that PID
kubectl exec -n reducto <pod-name> -- kill -s USR2 <pid>

# Read the dump from the pod log
kubectl logs -n reducto <pod-name> --tail=500
```

You'll see a single-line banner followed by one frame block per thread:

```
USR2 signal received [http-worker-42]; dumping thread stacks
USR2 triggered stack trace:

Thread "MainThread" (most recent call first):
  File "/app/.venv/bin/gunicorn", line 8, in <module>
    sys.exit(run())
  ...

Thread "asyncio-loop-0" (most recent call first):
  ...
```

### Notes

* Forked gunicorn and streaq child processes each register their own handler, so signalling the main PID alone won't dump child stacks. Signal each child PID individually if you need full process-tree coverage.
* `SIGUSR2` is not used by any other component in the worker/HTTP processes, so triggering a dump is safe in production. The signal handler is async-safe and only enqueues work onto a dedicated daemon thread.
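* For wider diagnostics (CPU profile, off-CPU sampling), consider `py-spy` from a debug container: `py-spy dump --pid <pid>` prints the current stacks, and `py-spy record -o profile.svg --pid <pid>` samples over time. `SIGUSR2` is the lowest-friction option and works without an extra binary.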
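
When several child processes need a dump, the per-PID commands from the previous section can be generated rather than typed. A minimal Python sketch, assuming the `ps -eo pid,cmd` output format and the same `gunicorn|streaq|python` filter used earlier; `worker_pids` and `signal_commands` are hypothetical helpers, and nothing is executed until you pass each command to `subprocess.run`.

```python
import re

# Same process filter as the grep in "Triggering a dump".
WORKER_PATTERN = re.compile(r"gunicorn|streaq|python")


def worker_pids(ps_output: str) -> list[int]:
    """Extract PIDs of worker-like processes from `ps -eo pid,cmd` output."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header row and any line without a numeric PID.
        if len(parts) == 2 and parts[0].isdigit() and WORKER_PATTERN.search(parts[1]):
            pids.append(int(parts[0]))
    return pids


def signal_commands(namespace: str, pod: str, pids: list[int]) -> list[list[str]]:
    """Build one `kubectl exec ... kill -s USR2 <pid>` command per PID."""
    return [
        ["kubectl", "exec", "-n", namespace, pod, "--", "kill", "-s", "USR2", str(pid)]
        for pid in pids
    ]
```

Feed `worker_pids` the captured output of the `ps` command, then run each generated command and read the combined dump with `kubectl logs` as before.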

## Telemetry Controls

Reducto telemetry is designed to avoid customer content. Logs, traces, and metrics are for debugging and performance analysis, not for storing customer data or business records.

Recommended controls:

* Keep observability UIs and scrape endpoints on private networks.
* Use SSO, VPN, Zero Trust access, or equivalent controls for operator access.
* Send telemetry only to approved sinks.
* Set retention periods that match your security and compliance requirements.
* Review telemetry exports for customer content, file names, URLs, prompts, model outputs, and secrets before sharing them outside your organization.
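
The last control can be partly automated. The following Python sketch masks URLs, credential-like key/value pairs, and common document file names in exported log lines. The patterns and the `redact` helper are illustrative assumptions, not a Reducto feature, and they are not exhaustive, so a manual review is still needed.

```python
import re

# (pattern, replacement) pairs applied in order; all are illustrative.
_RULES = [
    # Whole URLs, including query strings that may carry signed tokens.
    (re.compile(r"https?://\S+"), "[REDACTED_URL]"),
    # key=value or key: value pairs whose key looks like a credential.
    (
        re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\s*[=:]\s*\S+"),
        r"\1=[REDACTED]",
    ),
    # Bare file names with a few common document extensions.
    (
        re.compile(r"(?i)\b[\w.-]+\.(?:pdf|docx?|xlsx?|png|jpe?g|tiff?)\b"),
        "[REDACTED_FILE]",
    ),
]


def redact(line: str) -> str:
    """Return the line with URLs, credentials, and file names masked."""
    for pattern, replacement in _RULES:
        line = pattern.sub(replacement, line)
    return line
```

Running an export through `redact` line by line before attaching it to a support ticket reduces the chance that customer-identifying values leave your organization.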
