Observability Stack for LLMs: What to Track and Why

LLM systems fail in ways traditional services don’t: retrieval drift, prompt injection, silent quality regressions, and runaway cost. Observability is how you turn “AI magic” into an engineered system with SLOs and auditability.

The four signals: metrics, logs, traces, eval telemetry

  • Metrics: fast aggregates for latency, throughput, errors, cost.
  • Logs: structured events for audit trails and incident response.
  • Traces: distributed view across tools, retrieval, model, and external calls.
  • Eval telemetry: quality signals (groundedness proxies, user feedback, regressions).

What to measure (minimum viable set)

Domain      | Metrics                                               | Why it matters
------------|-------------------------------------------------------|--------------------------------------------
Latency     | TTFT (first token), TBT (inter-token), p95 end-to-end | UX and SLO compliance.
Reliability | error rate, timeouts, retries                         | Silent failures look like “bad answers”.
Retrieval   | recall@k, empty hits, filter rejects                  | RAG quality is retrieval quality.
Quality     | hallucination rate (sampled), user ratings            | Detect regressions before stakeholders do.
Cost        | €/call, tokens/call, cache hit rate                   | Unit economics and predictability.
Safety      | policy blocks, PII detections, injection flags        | Risk management and compliance evidence.
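
A minimal sketch of emitting some of these with OpenTelemetry’s Python metrics API. The instrument names, the attributes, and the record_call helper are illustrative (not a standard schema), and a MeterProvider is assumed to be configured elsewhere:

  from opentelemetry import metrics

  meter = metrics.get_meter("llm.observability")

  # Instrument names below are placeholders, not a standard schema.
  ttft_ms = meter.create_histogram("llm.ttft", unit="ms",
                                   description="Time to first token")
  tokens = meter.create_histogram("llm.tokens_per_call", unit="{token}")
  errors = meter.create_counter("llm.errors")
  cost = meter.create_counter("llm.cost", unit="EUR")

  def record_call(model: str, ttft: float, n_tokens: int, eur: float, ok: bool):
      attrs = {"model": model}  # keep attribute cardinality low
      ttft_ms.record(ttft, attrs)
      tokens.record(n_tokens, attrs)
      cost.add(eur, attrs)
      if not ok:
          errors.add(1, attrs)

Histograms give you p95 without pre-aggregating in application code; counters roll up cleanly into €/day and error-rate dashboards.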

Tracing with OpenTelemetry (LLM-specific)

In enterprise workflows, a single “chat request” fans out into multiple spans: retrieval, reranking, tool calls, model inference, and post-processing (a short sketch follows the checklist below).

  • Propagate trace IDs across all internal services and tool calls.
  • Add LLM attributes: model version, prompt template ID, retrieval IDs, and safety policy results.
  • Sample smartly: keep full traces for errors and a small rate for successes.
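
A minimal sketch with the OpenTelemetry Python API, assuming a TracerProvider is configured elsewhere. The retriever and model callables and the attribute values are placeholders; OTel’s incubating GenAI semantic conventions define standard keys such as gen_ai.request.model:

  from opentelemetry import trace

  tracer = trace.get_tracer("llm.chat")

  def answer(question: str, retriever, model) -> str:
      # One parent span per chat request, one child span per stage.
      # Trace-ID propagation to other services happens via the usual
      # OTel propagators on outgoing HTTP/gRPC calls.
      with tracer.start_as_current_span("chat_request") as root:
          root.set_attribute("llm.prompt_template_id", "answer_v3")  # placeholder
          with tracer.start_as_current_span("retrieval") as span:
              chunk_ids, context = retriever(question)
              span.set_attribute("retrieval.chunk_ids", chunk_ids)
              span.set_attribute("retrieval.empty", len(chunk_ids) == 0)
          with tracer.start_as_current_span("model_inference") as span:
              span.set_attribute("gen_ai.request.model", "my-model-2025-01")
              return model(question, context)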

Audit logging (privacy-safe)

Audit trails should connect inputs, retrieved sources, and outputs without leaking PII or secrets (a sketch of such a record follows the checklist below).

  • Log identifiers: doc IDs, chunk IDs, policy versions, model versions.
  • Redact payloads: store hashes or sampled, masked text.
  • Retention: align to GDPR data minimization and incident response needs.
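
A minimal sketch of a privacy-safe audit record following these rules. The field names and the pseudonymization scheme are assumptions, not a standard:

  import hashlib
  import json
  import logging
  import time

  audit = logging.getLogger("llm.audit")

  def audit_event(user_id: str, prompt: str, answer: str,
                  chunk_ids: list[str], policy_version: str) -> None:
      # Identifiers and hashes only: chunk IDs let you rejoin the retrieved
      # sources later, while the raw prompt/answer text never hits the log.
      audit.info(json.dumps({
          "ts": time.time(),
          "user": hashlib.sha256(user_id.encode()).hexdigest(),  # pseudonymized
          "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
          "answer_sha256": hashlib.sha256(answer.encode()).hexdigest(),
          "chunk_ids": chunk_ids,
          "policy_version": policy_version,
          "model_version": "my-model-2025-01",  # placeholder identifier
      }))

Hashes still let you prove during an incident which exact prompt produced which answer, without storing the text itself.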

Alert rules that don’t spam

  • Latency: p95 breach sustained for N minutes, with separate TTFT alerts (the “sustained for N windows” pattern is sketched below this list).
  • Quality regression: golden set score drops after deploy.
  • Cost anomaly: tokens/call spikes or cache hit rate drops.
  • Safety: policy blocks spike (possible injection campaign).
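
Most of these rules share one shape: a condition must hold for N consecutive evaluation windows before the alert fires, which is what keeps them from spamming (Prometheus expresses this with the “for” field on alerting rules). A minimal sketch of that debouncing pattern, with hypothetical names:

  from collections import deque

  class SustainedBreach:
      # Fire only when the metric breaches its threshold for N
      # consecutive evaluation windows, to avoid flapping alerts.
      def __init__(self, threshold: float, windows: int = 5):
          self.threshold = threshold
          self.recent = deque(maxlen=windows)

      def observe(self, value: float) -> bool:
          self.recent.append(value > self.threshold)
          return len(self.recent) == self.recent.maxlen and all(self.recent)

  # e.g. a 2000 ms p95 latency SLO, evaluated once per minute:
  p95_alert = SustainedBreach(threshold=2000.0, windows=5)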
