Executive summary
- Default architecture: hybrid retrieval (keyword + vector) → reranker → context budget + citations → answer + self-check.
- Most common failure: no eval harness, so “quality” is subjective and regressions slip into production.
- Most common scaling issue: metadata is missing (tenant, doc type, ACL), so filtering is inaccurate and retrieval becomes unsafe.
Pattern 1 — Hybrid retrieval (BM25 + vectors)
Vectors are great for semantic similarity, but they can miss exact strings, codes, version numbers, and product identifiers. Hybrid retrieval combines strengths:
- Keyword / BM25: exact matches (SKUs, error codes, product names, policy IDs).
- Vector search: semantic similarity for “how do I…?” questions and paraphrases.
- Filters first: tenant, ACL, doc type, region, and “effective date” should narrow the search space before scoring.
Pattern 2 — Chunking that matches how humans read
Chunking is not a technical detail: it is your retrieval granularity. The default that works across enterprise docs:
- Semantic chunks: split by headings/sections and keep paragraphs together.
- Small overlap: avoid losing definitions spanning two paragraphs.
- Structure-aware: tables, SOPs, and runbooks benefit from specialized chunking.
| Doc type | Chunk strategy | Notes |
|---|---|---|
| Policies / Legal | Section-based | Preserve clause boundaries; citations matter. |
| Runbooks / Ops | Step-based | Prefer “procedure blocks” over paragraphs. |
| Tickets / KB | Thread-based | Keep resolution + context together. |
Pattern 3 — Query rewriting and expansion
Users rarely write the best retrieval query. A lightweight “query rewrite” step improves recall without changing the UI:
- Normalize: expand abbreviations, map synonyms, keep the original.
- Extract entities: product names, regions, dates, ticket IDs.
- Generate 2–3 variants: one keyword-heavy, one semantic, one “problem→solution”.
Pattern 4 — Reranking (cheap, high impact)
Retrievers optimize speed; rerankers optimize relevance. In most stacks, reranking is the single best quality lever after filters.
- Use reranking when: you have long documents, similar sections, or high “near-duplicate” content.
- Keep it bounded: rerank top 20–50 results, not thousands.
- Measure: run eval before/after; keep a rollback path.
Pattern 5 — Context budgeting + citations
Long context windows don’t solve retrieval. They hide errors. Budget context explicitly:
- Top-k with diversity: avoid 5 chunks from the same section when 3 topics are needed.
- Citations: tie claims to sources; in regulated environments, this is non-negotiable.
- Refuse gracefully: if evidence is missing, respond with “I can’t find it” and ask clarifying questions.
Pattern 6 — Eval harness (offline + online)
Without evaluation, “quality” is a feeling. With evaluation, it becomes an SLO.
- Offline: golden set questions + expected citations; regression tests on every change.
- Online: sample production traffic; track groundedness proxies and user feedback.
- Failure taxonomy: retrieval miss, stale doc, wrong filter, hallucination, tool failure.
Pattern 7 — Guardrails for RAG (prompt injection & data safety)
RAG increases attack surface: documents can contain malicious instructions. Treat retrieval as untrusted input.
- Instruction hierarchy: system > developer > user; retrieved text is evidence, not instructions.
- Policy filters: block unsafe tools/actions and sensitive data exfiltration.
- Audit trail: log query, retrieved doc IDs, and citations (privacy-safe).
Decision matrix
| Problem | Most likely fix | What to measure |
|---|---|---|
| Wrong answers with “confident” tone | Reranking + citations + refusal policy | Hallucination rate, groundedness |
| Answers ignore the newest policy | Metadata (effective date) + filtering | Staleness, doc coverage |
| Misses exact identifiers / codes | Hybrid retrieval | Recall on code-heavy queries |
| Too slow at scale | Index sizing, batching, caching | p95 latency, throughput |