Exam focus
Primary domain: Developing LLM-Based Applications (24%). Secondary: Data for LLM Applications (10%).
- Retrieval-Augmented Generation (RAG)
- Vector embeddings
- Embedding models
- Vector databases
- Semantic search
- Similarity metrics (cosine, dot product, Euclidean)
- Chunking strategies
- Overlapping chunks
- Metadata filtering
- Hybrid search (keyword + semantic)
- Re-ranking
- Grounded generation
- Knowledge base construction
- Indexing pipelines
Scope bullet explanations
- Retrieval-Augmented Generation (RAG): Combines retrieval from external knowledge with model generation.
- Vector embeddings: Dense numeric representations capturing semantic meaning.
- Embedding models: Models that convert text (or multimodal data) into vectors.
- Vector databases: Systems that store and index embeddings for nearest-neighbor search.
- Semantic search: Retrieves by meaning similarity rather than exact keyword matches.
- Similarity metrics (cosine, dot product, Euclidean): Distance/similarity formulas used for vector ranking.
- Chunking strategies: Methods for splitting documents into retrievable units.
- Overlapping chunks: Preserve context continuity across adjacent chunk boundaries.
- Metadata filtering: Restricts retrieval by fields like source, date, tenant, or policy.
- Hybrid search (keyword + semantic): Combines lexical and vector retrieval to improve recall/precision.
- Re-ranking: Secondary ranking stage that improves relevance among initial retrieval results.
- Grounded generation: Producing answers that are explicitly supported by retrieved evidence.
- Knowledge base construction: Building and maintaining curated source corpora for retrieval.
- Indexing pipelines: End-to-end ingest, parse, embed, and index workflows for searchable knowledge.
Chapter overview
RAG connects LLMs to external knowledge so answers are grounded, current, and auditable. This chapter covers retrieval architecture, embedding strategy, chunking, ranking, and evaluation patterns that determine whether RAG helps or harms quality.
Learning objectives
- Explain the full RAG lifecycle from ingestion to grounded response.
- Compare embedding models, vector indexes, and similarity metrics.
- Design chunking, filtering, and reranking for retrieval quality.
- Build validation loops for faithfulness, relevance, and freshness.
5.1 RAG system architecture
A practical RAG pipeline has these stages:
- Source ingestion.
- Parsing and cleaning.
- Chunking and metadata attachment.
- Embedding generation.
- Vector index creation.
- Retrieval and reranking.
- Grounded generation with citations.
If any stage is weak, final responses degrade regardless of model size. The sketch below walks through these stages end to end.
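The sketch strings the stages together with stand-in components (a toy character-frequency "embedding" and a templated grounded answer) so the control flow is runnable without any external model or vector store; a real pipeline would swap in an embedding model, a vector index, and an LLM call.

```python
# Minimal, runnable sketch of the RAG stages above with stand-in components.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    vector: list  # filled by the embedding step

def embed(text: str) -> list:
    # Stand-in for a real embedding model: a tiny character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Stages 1-4: ingest, parse/clean, chunk (one chunk per document here), embed.
documents = {"refund_policy.txt": "Refunds are issued within 14 days of purchase."}
index = [Chunk(text, src, embed(text)) for src, text in documents.items()]

# Stages 5-6: vector index (a plain list here), retrieval, and ranking.
query = "How long do refunds take?"
qvec = embed(query)
ranked = sorted(index, key=lambda c: cosine(qvec, c.vector), reverse=True)

# Stage 7: grounded generation, passing only retrieved evidence and citing it.
top = ranked[0]
print(f"According to {top.source}: {top.text}")
```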
5.2 Embeddings and vector retrieval
Embedding models
Embedding quality determines retrieval relevance. Choose based on domain vocabulary, multilingual needs, latency budget, and licensing constraints.
Vector databases
Vector stores optimize nearest-neighbor search at scale. Evaluate indexing strategy, update latency, filter support, and operational overhead.
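As one concrete example, the sketch below builds an exact inner-product index with FAISS; any ANN library or managed vector store plays the same role. The dimensionality and random vectors are placeholders for real embeddings.

```python
# Sketch: nearest-neighbor search over embeddings with a FAISS flat index.
# Assumes faiss and numpy are installed; vectors would normally come from
# an embedding model rather than np.random.
import numpy as np
import faiss

dim = 384                                   # embedding dimensionality (placeholder)
corpus_vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(corpus_vectors)          # normalize so inner product equals cosine

index = faiss.IndexFlatIP(dim)              # exact inner-product index
index.add(corpus_vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)        # top-5 nearest neighbors
print(ids[0], scores[0])
```

At larger scale, approximate indexes (IVF, HNSW, and similar) trade a little recall for much lower latency, which is part of the indexing-strategy evaluation mentioned above.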
Similarity metrics
- Cosine similarity: direction-focused, common default.
- Dot product: sensitive to vector norms.
- Euclidean distance: straight-line (geometric) distance, sensitive to both direction and magnitude.
Metric choice should match the embedding model's training assumptions; the example below computes all three.
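A quick NumPy illustration of the three metrics on two example vectors; note that for unit-normalized embeddings, cosine similarity and dot product produce the same ranking.

```python
# The three similarity/distance measures above, computed for two small vectors.
import numpy as np

a = np.array([0.2, 0.8, 0.4])
b = np.array([0.1, 0.9, 0.3])

dot = float(np.dot(a, b))                                 # norm-sensitive
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))    # direction only
euclidean = float(np.linalg.norm(a - b))                  # straight-line distance

print(cosine, dot, euclidean)
```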
5.3 Chunking and index design
Chunking strategies
- Fixed-size chunks are simple but may split semantic units.
- Semantic chunks align with headings/sections.
- Hybrid approaches balance consistency and meaning boundaries.
Overlapping chunks
Overlap preserves context around boundaries but increases index size and possible duplicate evidence.
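A minimal fixed-size chunker with configurable overlap, counting words rather than tokens for simplicity; the sizes shown are illustrative, not recommendations.

```python
# Sketch: fixed-size chunking with overlap. Sizes are in words for simplicity;
# production chunkers usually count tokens with the model's tokenizer.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40):
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
pieces = chunk_text(doc, chunk_size=200, overlap=40)
print(len(pieces), [len(p.split()) for p in pieces])  # 3 chunks: 200, 200, 180 words
```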
Metadata filtering
Metadata fields (source, timestamp, business unit, policy scope, tenant) are critical for relevance and governance.
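A small sketch of metadata filtering applied before vector scoring; the chunk records, field names, and in-memory list stand in for whatever filter syntax your vector store provides.

```python
# Sketch: restrict candidates by metadata before (or alongside) vector scoring.
# Field names and values here are illustrative.
chunks = [
    {"text": "2023 travel policy ...", "tenant": "acme", "year": 2023},
    {"text": "2024 travel policy ...", "tenant": "acme", "year": 2024},
    {"text": "2024 travel policy ...", "tenant": "globex", "year": 2024},
]

def filter_chunks(chunks, **required):
    # Keep only chunks whose metadata matches every required field.
    return [c for c in chunks if all(c.get(k) == v for k, v in required.items())]

candidates = filter_chunks(chunks, tenant="acme", year=2024)
print([c["text"] for c in candidates])  # only acme's current policy survives
```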
Hybrid retrieval
Combining lexical and vector retrieval improves performance on keyword-heavy queries and rare terms.
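One common way to combine the two result lists is reciprocal rank fusion (RRF); the sketch below fuses a lexical ranking and a vector ranking by rank position alone, so no score normalization is needed. The document IDs are illustrative.

```python
# Sketch: reciprocal rank fusion of a keyword ranking and a vector ranking.
# Each input is a list of doc IDs ordered best-first.
def rrf(rankings, k: int = 60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d3", "d1", "d7"]   # lexical match on a rare term
vector_hits = ["d1", "d4", "d3"]    # semantic neighbors
print(rrf([keyword_hits, vector_hits]))  # d1 and d3 rise to the top
```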
Reranking
A reranker applies deeper scoring to top retrieved candidates, often improving precision before generation.
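A sketch of cross-encoder reranking using the sentence-transformers package; the model name is one commonly used public checkpoint, not a requirement, and the candidate passages are illustrative.

```python
# Sketch: rescore the top retrieved candidates with a cross-encoder reranker.
from sentence_transformers import CrossEncoder

query = "How long do refunds take?"
candidates = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping usually takes 3-5 business days.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort candidates by reranker score, best first.
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the refund passage should score highest
```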
5.4 Grounded generation
Grounded generation means response claims are traceable to retrieved evidence.
Implementation patterns (a minimal prompt-assembly sketch follows this list):
- Cite source chunks inline.
- Separate the answer from the evidence section.
- Refuse or hedge when evidence is insufficient.
- Avoid unsupported synthesis.
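A minimal sketch of assembling a grounded prompt with inline citation IDs and refusing when evidence is weak; the score threshold, prompt wording, and tuple format are illustrative assumptions, and the resulting prompt would be passed to whatever LLM client you use.

```python
# Sketch: build a grounded prompt from retrieved evidence, or refuse when the
# best evidence score falls below an (illustrative) threshold.
MIN_EVIDENCE_SCORE = 0.45  # tune per corpus and retriever

def build_grounded_prompt(question, retrieved):
    # `retrieved` is a list of (score, source_id, text) tuples, best-first.
    if not retrieved or retrieved[0][0] < MIN_EVIDENCE_SCORE:
        return None  # caller should return a refusal / "not enough evidence" answer
    evidence = "\n".join(f"[{src}] {text}" for _, src, text in retrieved)
    return (
        "Answer using ONLY the evidence below. Cite sources as [id]. "
        "If the evidence does not answer the question, say so.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "How long do refunds take?",
    [(0.82, "refund_policy.txt", "Refunds are issued within 14 days of purchase.")],
)
print(prompt if prompt else "I don't have enough evidence to answer that.")
```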
5.5 Knowledge base operations
Ingestion pipelines
Automate source sync, parsing, deduplication, quality checks, and reindexing triggers.
Versioning and freshness
Track document versions and index versions. Freshness is essential in policy, legal, pricing, or procedural knowledge domains.
Access controls
Apply tenant and role filters at retrieval time, not only at the UI level.
5.6 Evaluation framework for RAG
Evaluate at three layers:
- Retrieval quality (recall@k, precision@k, hit quality).
- Generation faithfulness (claim-evidence consistency).
- User outcome (task completion, escalation rate, trust).
Use adversarial tests: conflicting documents, stale content, missing evidence, and policy-sensitive prompts. A sketch of the retrieval-layer metrics follows.
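The sketch computes recall@k and precision@k for a single query, assuming a labeled set of relevant chunk IDs; averaging over a representative query set gives the reported retrieval-quality numbers.

```python
# Sketch: retrieval-layer metrics for one query against ground-truth labels.
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

retrieved = ["c7", "c2", "c9", "c4", "c1"]   # ranked retriever output
relevant = {"c2", "c4", "c8"}                # ground-truth evidence chunks
print(precision_recall_at_k(retrieved, relevant, k=5))  # (0.4, 0.666...)
```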
5.7 Failure modes
- Chunk size too large, causing diluted relevance.
- Chunk size too small, losing essential context.
- Missing metadata filters, causing cross-tenant leakage.
- No reranking, resulting in top-k noise.
- Hallucinated answers when retrieval returns weak evidence.
Chapter summary
RAG quality is retrieval engineering plus generation discipline. Better grounding comes from better data and ranking controls, not from larger prompts alone.
Mini-lab: retrieval quality tuning
Goal: improve answer grounding for one domain corpus.
- Build baseline index with fixed-size chunks.
- Run 20 representative questions and log retrieved chunks.
- Implement metadata filtering and reranking.
- Compare relevance and faithfulness before/after.
- Add refusal rule for low-evidence cases.
- Document the final retrieval configuration.
Deliverable in Notion:
- Retrieval tuning report with chosen chunk size, overlap, filters, and reranker policy.
Review questions
- Why does chunking strategy directly influence answer quality?
- When is hybrid retrieval better than pure vector search?
- Why is metadata filtering a security and relevance control?
- What does reranking add beyond approximate nearest-neighbor (ANN) retrieval?
- How do you detect ungrounded generation behavior?
- Why should source freshness be explicitly tracked?
- What is a safe response pattern when retrieval evidence is weak?
- How can overlap improve recall yet increase noise?
- Which metrics evaluate retrieval versus generation separately?
- Why is RAG not a replacement for model safety controls?
Key terms
RAG, embeddings, vector database, semantic search, cosine similarity, chunking, overlap, metadata filtering, hybrid retrieval, reranking, grounded generation.
Exam traps
- Treating all retrieval misses as model hallucination.
- Assuming top-k retrieval order is already optimal.
- Ignoring access-control filtering in multi-tenant deployments.