RAG Architecture: 7 Patterns for Quality Retrieval

RAG failures are rarely “model problems”. They are retrieval problems: wrong chunks, wrong filters, or missing evaluation. This guide captures seven production patterns that consistently raise groundedness and reduce hallucinations—without turning your stack into a science project.

Executive summary

  • Default architecture: hybrid retrieval (keyword + vector) → reranker → context budget + citations → answer + self-check.
  • Most common failure: no eval harness, so “quality” is subjective and regressions slip into production.
  • Most common scaling issue: metadata is missing (tenant, doc type, ACL), so filtering is inaccurate and retrieval becomes unsafe.

Pattern 1 — Hybrid retrieval (BM25 + vectors)

Vectors are great for semantic similarity, but they can miss exact strings, codes, version numbers, and product identifiers. Hybrid retrieval combines strengths:

  • Keyword / BM25: exact matches (SKUs, error codes, product names, policy IDs).
  • Vector search: semantic similarity for “how do I…?” questions and paraphrases.
  • Filters first: tenant, ACL, doc type, region, and “effective date” should narrow the search space before scoring.

Pattern 2 — Chunking that matches how humans read

Chunking is not a technical detail: it is your retrieval granularity. The default that works across enterprise docs:

  • Semantic chunks: split by headings/sections and keep paragraphs together.
  • Small overlap: avoid losing definitions spanning two paragraphs.
  • Structure-aware: tables, SOPs, and runbooks benefit from specialized chunking.
Doc type Chunk strategy Notes
Policies / Legal Section-based Preserve clause boundaries; citations matter.
Runbooks / Ops Step-based Prefer “procedure blocks” over paragraphs.
Tickets / KB Thread-based Keep resolution + context together.

Pattern 3 — Query rewriting and expansion

Users rarely write the best retrieval query. A lightweight “query rewrite” step improves recall without changing the UI:

  • Normalize: expand abbreviations, map synonyms, keep the original.
  • Extract entities: product names, regions, dates, ticket IDs.
  • Generate 2–3 variants: one keyword-heavy, one semantic, one “problem→solution”.

Pattern 4 — Reranking (cheap, high impact)

Retrievers optimize speed; rerankers optimize relevance. In most stacks, reranking is the single best quality lever after filters.

  • Use reranking when: you have long documents, similar sections, or high “near-duplicate” content.
  • Keep it bounded: rerank top 20–50 results, not thousands.
  • Measure: run eval before/after; keep a rollback path.

Pattern 5 — Context budgeting + citations

Long context windows don’t solve retrieval. They hide errors. Budget context explicitly:

  • Top-k with diversity: avoid 5 chunks from the same section when 3 topics are needed.
  • Citations: tie claims to sources; in regulated environments, this is non-negotiable.
  • Refuse gracefully: if evidence is missing, respond with “I can’t find it” and ask clarifying questions.

Pattern 6 — Eval harness (offline + online)

Without evaluation, “quality” is a feeling. With evaluation, it becomes an SLO.

  • Offline: golden set questions + expected citations; regression tests on every change.
  • Online: sample production traffic; track groundedness proxies and user feedback.
  • Failure taxonomy: retrieval miss, stale doc, wrong filter, hallucination, tool failure.

Pattern 7 — Guardrails for RAG (prompt injection & data safety)

RAG increases attack surface: documents can contain malicious instructions. Treat retrieval as untrusted input.

  • Instruction hierarchy: system > developer > user; retrieved text is evidence, not instructions.
  • Policy filters: block unsafe tools/actions and sensitive data exfiltration.
  • Audit trail: log query, retrieved doc IDs, and citations (privacy-safe).

Decision matrix

Problem Most likely fix What to measure
Wrong answers with “confident” tone Reranking + citations + refusal policy Hallucination rate, groundedness
Answers ignore the newest policy Metadata (effective date) + filtering Staleness, doc coverage
Misses exact identifiers / codes Hybrid retrieval Recall on code-heavy queries
Too slow at scale Index sizing, batching, caching p95 latency, throughput

Related Articles