Observability Stack for LLMs: What to Track and Why

LLM systems fail in ways traditional services don’t: retrieval drift, prompt injection, silent quality regressions, and runaway cost. Observability is how you turn “AI magic” into an engineered system with SLOs and auditability.

The four signals: metrics, logs, traces, eval telemetry

  • Metrics: fast aggregates for latency, throughput, errors, cost.
  • Logs: structured events for audit trails and incident response.
  • Traces: distributed view across tools, retrieval, model, and external calls.
  • Eval telemetry: quality signals (groundedness proxies, user feedback, regressions).

What to measure (minimum viable set)

Domain      | Metrics                                               | Why it matters
------------|-------------------------------------------------------|--------------------------------------------
Latency     | TTFT (first token), TBT (inter-token), p95 end-to-end | UX and SLO compliance.
Reliability | error rate, timeouts, retries                         | Silent failures look like “bad answers”.
Retrieval   | recall@k, empty hits, filter rejects                  | RAG quality is retrieval quality.
Quality     | hallucination rate (sampled), user ratings            | Detect regressions before stakeholders do.
Cost        | €/call, tokens/call, cache hit rate                   | Unit economics and predictability.
Safety      | policy blocks, PII detections, injection flags        | Risk management and compliance evidence.
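
A minimal sketch of emitting some of these with OpenTelemetry’s Python metrics API. The instrument names, the attributes, and the record_call helper are illustrative (not a standard schema), and a MeterProvider is assumed to be configured elsewhere:

  from opentelemetry import metrics

  meter = metrics.get_meter("llm.observability")

  # Instrument names below are placeholders, not a standard schema.
  ttft_ms = meter.create_histogram("llm.ttft", unit="ms",
                                   description="Time to first token")
  tokens = meter.create_histogram("llm.tokens_per_call", unit="{token}")
  errors = meter.create_counter("llm.errors")
  cost = meter.create_counter("llm.cost", unit="EUR")

  def record_call(model: str, ttft: float, n_tokens: int, eur: float, ok: bool):
      attrs = {"model": model}  # keep attribute cardinality low
      ttft_ms.record(ttft, attrs)
      tokens.record(n_tokens, attrs)
      cost.add(eur, attrs)
      if not ok:
          errors.add(1, attrs)

Histograms give you p95 without pre-aggregating in application code; counters roll up cleanly into €/day and error-rate dashboards.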

Tracing with OpenTelemetry (LLM-specific)

In enterprise workflows, a single “chat request” fans out into multiple spans: retrieval, reranking, tool calls, model inference, and post-processing (a short sketch follows the checklist below).

  • Propagate trace IDs across all internal services and tool calls.
  • Add LLM attributes: model version, prompt template ID, retrieval IDs, and safety policy results.
  • Sample smartly: keep full traces for errors and a small rate for successes.
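
A minimal sketch with the OpenTelemetry Python API, assuming a TracerProvider is configured elsewhere. The retriever and model callables and the attribute values are placeholders; OTel’s incubating GenAI semantic conventions define standard keys such as gen_ai.request.model:

  from opentelemetry import trace

  tracer = trace.get_tracer("llm.chat")

  def answer(question: str, retriever, model) -> str:
      # One parent span per chat request, one child span per stage.
      # Trace-ID propagation to other services happens via the usual
      # OTel propagators on outgoing HTTP/gRPC calls.
      with tracer.start_as_current_span("chat_request") as root:
          root.set_attribute("llm.prompt_template_id", "answer_v3")  # placeholder
          with tracer.start_as_current_span("retrieval") as span:
              chunk_ids, context = retriever(question)
              span.set_attribute("retrieval.chunk_ids", chunk_ids)
              span.set_attribute("retrieval.empty", len(chunk_ids) == 0)
          with tracer.start_as_current_span("model_inference") as span:
              span.set_attribute("gen_ai.request.model", "my-model-2025-01")
              return model(question, context)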

Audit logging (privacy-safe)

Audit trails should connect inputs, retrieved sources, and outputs without leaking PII or secrets (a sketch of such a record follows the checklist below).

  • Log identifiers: doc IDs, chunk IDs, policy versions, model versions.
  • Redact payloads: store hashes or sampled, masked text.
  • Retention: align to GDPR data minimization and incident response needs.
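
A minimal sketch of a privacy-safe audit record following these rules. The field names and the pseudonymization scheme are assumptions, not a standard:

  import hashlib
  import json
  import logging
  import time

  audit = logging.getLogger("llm.audit")

  def audit_event(user_id: str, prompt: str, answer: str,
                  chunk_ids: list[str], policy_version: str) -> None:
      # Identifiers and hashes only: chunk IDs let you rejoin the retrieved
      # sources later, while the raw prompt/answer text never hits the log.
      audit.info(json.dumps({
          "ts": time.time(),
          "user": hashlib.sha256(user_id.encode()).hexdigest(),  # pseudonymized
          "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
          "answer_sha256": hashlib.sha256(answer.encode()).hexdigest(),
          "chunk_ids": chunk_ids,
          "policy_version": policy_version,
          "model_version": "my-model-2025-01",  # placeholder identifier
      }))

Hashes still let you prove during an incident which exact prompt produced which answer, without storing the text itself.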

Alert rules that don’t spam

  • Latency: p95 breach sustained for N minutes, with separate TTFT alerts (the “sustained for N windows” pattern is sketched below this list).
  • Quality regression: golden set score drops after deploy.
  • Cost anomaly: tokens/call spikes or cache hit rate drops.
  • Safety: policy blocks spike (possible injection campaign).
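
Most of these rules share one shape: a condition must hold for N consecutive evaluation windows before the alert fires, which is what keeps them from spamming (Prometheus expresses this with the “for” field on alerting rules). A minimal sketch of that debouncing pattern, with hypothetical names:

  from collections import deque

  class SustainedBreach:
      # Fire only when the metric breaches its threshold for N
      # consecutive evaluation windows, to avoid flapping alerts.
      def __init__(self, threshold: float, windows: int = 5):
          self.threshold = threshold
          self.recent = deque(maxlen=windows)

      def observe(self, value: float) -> bool:
          self.recent.append(value > self.threshold)
          return len(self.recent) == self.recent.maxlen and all(self.recent)

  # e.g. a 2000 ms p95 latency SLO, evaluated once per minute:
  p95_alert = SustainedBreach(threshold=2000.0, windows=5)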
