## The four signals: metrics, logs, traces, eval telemetry
- Metrics: fast aggregates for latency, throughput, errors, cost.
- Logs: structured events for audit trails and incident response.
- Traces: distributed view across tools, retrieval, model, and external calls.
- Eval telemetry: quality signals (groundedness proxies, user feedback, regressions).
## What to measure (minimum viable set)
| Domain | Metrics | Why it matters |
|---|---|---|
| Latency | TTFT (time to first token), TBT (time between tokens), p95 end-to-end | UX and SLO compliance. |
| Reliability | error rate, timeouts, retries | Silent failures look like “bad answers”. |
| Retrieval | recall@k, empty hits, filter rejects | RAG quality is retrieval quality. |
| Quality | hallucination rate (sampled), user rating | Detect regressions before stakeholders do. |
| Cost | €/call, tokens/call, cache hit rate | Unit economics and predictability. |
| Safety | policy blocks, PII detects, injection flags | Risk management and compliance evidence. |
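A minimal sketch of recording a few of these metrics with the OpenTelemetry Python SDK; the instrument names, attribute keys, and the `record_call` helper are illustrative assumptions, not a fixed standard:

```python
# Sketch: recording latency, token, and cache metrics with the OpenTelemetry SDK.
# Instrument names and attribute keys below are illustrative, not a standard.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Console export for brevity; production would use an OTLP exporter instead.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("llm-service")

ttft_ms = meter.create_histogram("llm.ttft", unit="ms", description="Time to first token")
tokens = meter.create_counter("llm.tokens", unit="{token}", description="Tokens per call")
cache_hits = meter.create_counter("llm.cache_hits", description="Prompt/result cache hits")

def record_call(ttft: float, prompt_tokens: int, completion_tokens: int, cached: bool) -> None:
    labels = {"model_version": "model-v1"}  # hypothetical label value
    ttft_ms.record(ttft, attributes=labels)
    tokens.add(prompt_tokens + completion_tokens, attributes=labels)
    if cached:
        cache_hits.add(1, attributes=labels)

record_call(ttft=212.0, prompt_tokens=850, completion_tokens=120, cached=False)
```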
## Tracing with OpenTelemetry (LLM-specific)
In enterprise workflows, a single “chat request” fans out into multiple spans: retrieval, reranking, tool calls, model inference, and post-processing.
- Propagate trace IDs across all internal services and tool calls.
- Add LLM attributes: model version, prompt template ID, retrieval IDs, and safety policy results.
- Sample smartly: keep full traces for all errors and only a small fraction of successes (a sketch follows this list).
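A minimal tracing sketch under these assumptions: the span names and `llm.*` attribute keys are our own convention (OpenTelemetry semantic conventions for generative AI are still evolving), and the retrieval and model calls are stand-ins:

```python
# Sketch: one chat request fanning out into retrieval, model, and post-processing
# spans. Span names and llm.* attribute keys are our own convention, not a standard.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("chat-service")

def handle_chat(question: str) -> str:
    # Child spans share the root's trace ID automatically; propagation to other
    # services goes over W3C traceparent headers via OpenTelemetry propagators.
    with tracer.start_as_current_span("chat_request") as root:
        root.set_attribute("llm.model_version", "model-v1")        # hypothetical values
        root.set_attribute("llm.prompt_template_id", "support-v3")
        with tracer.start_as_current_span("retrieval") as span:
            doc_ids = ["doc-42", "doc-97"]                          # stand-in for a real search
            span.set_attribute("retrieval.doc_ids", doc_ids)
        with tracer.start_as_current_span("model_inference") as span:
            answer = f"(answer to {question!r} from {doc_ids})"     # stand-in for a model call
            span.set_attribute("llm.completion_tokens", 120)
        with tracer.start_as_current_span("safety_post_processing") as span:
            span.set_attribute("safety.policy_result", "pass")
        return answer

handle_chat("How do I rotate an API key?")
# Error-biased sampling (keep all failed traces, a small rate of successes) is
# usually done via tail sampling in the OpenTelemetry Collector, not in-process.
```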
## Audit logging (privacy-safe)
Audit trails should connect inputs, retrieved sources, and outputs without leaking PII or secrets.
- Log identifiers: doc IDs, chunk IDs, policy versions, model versions.
- Redact payloads: store hashes or sampled, masked text.
- Retention: align with GDPR data minimization and incident-response needs (a privacy-safe record sketch follows).
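A sketch of such a record, assuming a hypothetical `audit` logger and illustrative field names; the key point is logging identifiers and SHA-256 hashes instead of raw prompts or documents:

```python
# Sketch: a privacy-safe audit record. Field names are illustrative; the point
# is logging identifiers and hashes, never raw payloads.
import hashlib
import json
import logging
from datetime import datetime, timezone

audit = logging.getLogger("audit")
logging.basicConfig(level=logging.INFO)

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def log_audit_event(request_id: str, prompt: str, answer: str,
                    doc_ids: list[str], chunk_ids: list[str]) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "model_version": "model-v1",      # hypothetical values
        "policy_version": "policy-7",
        "doc_ids": doc_ids,               # identifiers only, never document text
        "chunk_ids": chunk_ids,
        "prompt_sha256": sha256(prompt),  # hash links records without storing PII
        "answer_sha256": sha256(answer),
    }
    audit.info(json.dumps(record))

log_audit_event("req-123", "How do I rotate an API key?", "Go to settings...",
                ["doc-42"], ["doc-42#c3"])
```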
## Alert rules that don’t spam
- Latency: p95 breach sustained for N minutes (with separate TTFT alerts).
- Quality regression: golden-set score drops after a deploy.
- Cost anomaly: tokens/call spikes or cache hit rate drops.
- Safety: policy blocks spike (a possible injection campaign); a minimal sustained-breach check is sketched below.
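In practice these rules live in your alerting backend; as a minimal sketch of the “breach sustained for N minutes” idea (the threshold and window values are made up):

```python
# Sketch: fire only when p95 breaches its threshold for N consecutive minutes,
# which suppresses one-off spikes. Threshold and window values are made up.
from collections import deque

class SustainedBreachAlert:
    def __init__(self, threshold_ms: float, window_minutes: int):
        self.threshold_ms = threshold_ms
        self.recent = deque(maxlen=window_minutes)  # one p95 sample per minute

    def observe(self, p95_ms: float) -> bool:
        """Record this minute's p95; return True if the alert should fire."""
        self.recent.append(p95_ms)
        full_window = len(self.recent) == self.recent.maxlen
        return full_window and all(v > self.threshold_ms for v in self.recent)

alert = SustainedBreachAlert(threshold_ms=2500.0, window_minutes=5)
for minute, p95 in enumerate([2600, 2700, 2650, 2800, 2900]):
    if alert.observe(p95):
        print(f"minute {minute}: page on-call, p95 breached for 5 consecutive minutes")
```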