Exam focus
Primary domain: Developing LLM-Based Applications (24%). Secondary: Data for LLM Applications (10%).
- Retrieval-Augmented Generation (RAG)
- Vector embeddings
- Embedding models
- Vector databases
- Semantic search
- Similarity metrics (cosine, dot product, Euclidean)
- Chunking strategies
- Overlapping chunks
- Metadata filtering
- Hybrid search (keyword + semantic)
- Re-ranking
- Grounded generation
- Knowledge base construction
- Indexing pipelines
Scope bullet explanations
- Retrieval-Augmented Generation (RAG): Combines retrieval from external knowledge with model generation.
- Vector embeddings: Dense numeric representations capturing semantic meaning.
- Embedding models: Models that convert text (or multimodal data) into vectors.
- Vector databases: Systems that store and index embeddings for nearest-neighbor search.
- Semantic search: Retrieves by meaning similarity rather than exact keyword matches.
- Similarity metrics (cosine, dot product, Euclidean): Distance/similarity formulas used for vector ranking.
- Chunking strategies: Methods for splitting documents into retrievable units.
- Overlapping chunks: Preserve context continuity across adjacent chunk boundaries.
- Metadata filtering: Restricts retrieval by fields like source, date, tenant, or policy.
- Hybrid search (keyword + semantic): Combines lexical and vector retrieval to improve recall/precision.
- Re-ranking: Secondary ranking stage that improves relevance among initial retrieval results.
- Grounded generation: Producing answers that are explicitly supported by retrieved evidence.
- Knowledge base construction: Building and maintaining curated source corpora for retrieval.
- Indexing pipelines: End-to-end ingest, parse, embed, and index workflows for searchable knowledge.
Chapter overview
RAG connects LLMs to external knowledge so answers are grounded, current, and auditable. This chapter covers retrieval architecture, embedding strategy, chunking, ranking, and evaluation patterns that determine whether RAG helps or harms quality.
Learning objectives
- Explain the full RAG lifecycle from ingestion to grounded response.
- Compare embedding models, vector indexes, and similarity metrics.
- Design chunking, filtering, and reranking for retrieval quality.
- Build validation loops for faithfulness, relevance, and freshness.
5.1 RAG system architecture
A practical RAG pipeline has these stages:
- Source ingestion.
- Parsing and cleaning.
- Chunking and metadata attachment.
- Embedding generation.
- Vector index creation.
- Retrieval and reranking.
- Grounded generation with citations.
If any stage is weak, final responses degrade regardless of model size. The sketch below walks through these stages end to end.
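The sketch strings the stages together with stand-in components (a toy character-frequency "embedding" and a templated grounded answer) so the control flow is runnable without any external model or vector store; a real pipeline would swap in an embedding model, a vector index, and an LLM call.

```python
# Minimal, runnable sketch of the RAG stages above with stand-in components.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str
    vector: list  # filled by the embedding step

def embed(text: str) -> list:
    # Stand-in for a real embedding model: a tiny character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Stages 1-4: ingest, parse/clean, chunk (one chunk per document here), embed.
documents = {"refund_policy.txt": "Refunds are issued within 14 days of purchase."}
index = [Chunk(text, src, embed(text)) for src, text in documents.items()]

# Stages 5-6: vector index (a plain list here), retrieval, and ranking.
query = "How long do refunds take?"
qvec = embed(query)
ranked = sorted(index, key=lambda c: cosine(qvec, c.vector), reverse=True)

# Stage 7: grounded generation, passing only retrieved evidence and citing it.
top = ranked[0]
print(f"According to {top.source}: {top.text}")
```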
5.2 Embeddings and vector retrieval
Embedding models
Embedding quality determines retrieval relevance. Choose based on domain vocabulary, multilingual needs, latency budget, and licensing constraints.
Vector databases
Vector stores optimize nearest-neighbor search at scale. Evaluate indexing strategy, update latency, filter support, and operational overhead.
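As one concrete example, the sketch below builds an exact inner-product index with FAISS; any ANN library or managed vector store plays the same role. The dimensionality and random vectors are placeholders for real embeddings.

```python
# Sketch: nearest-neighbor search over embeddings with a FAISS flat index.
# Assumes faiss and numpy are installed; vectors would normally come from
# an embedding model rather than np.random.
import numpy as np
import faiss

dim = 384                                   # embedding dimensionality (placeholder)
corpus_vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(corpus_vectors)          # normalize so inner product equals cosine

index = faiss.IndexFlatIP(dim)              # exact inner-product index
index.add(corpus_vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)        # top-5 nearest neighbors
print(ids[0], scores[0])
```

At larger scale, approximate indexes (IVF, HNSW, and similar) trade a little recall for much lower latency, which is part of the indexing-strategy evaluation mentioned above.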
Similarity metrics
- Cosine similarity: direction-focused, common default.
- Dot product: sensitive to vector norms.
- Euclidean distance: straight-line (geometric) distance, sensitive to both direction and magnitude.
Metric choice should match the embedding model's training assumptions; the example below computes all three.
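A quick NumPy illustration of the three metrics on two example vectors; note that for unit-normalized embeddings, cosine similarity and dot product produce the same ranking.

```python
# The three similarity/distance measures above, computed for two small vectors.
import numpy as np

a = np.array([0.2, 0.8, 0.4])
b = np.array([0.1, 0.9, 0.3])

dot = float(np.dot(a, b))                                 # norm-sensitive
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))    # direction only
euclidean = float(np.linalg.norm(a - b))                  # straight-line distance

print(cosine, dot, euclidean)
```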
5.3 Chunking and index design
Chunking strategies
- Fixed-size chunks are simple but may split semantic units.
- Semantic chunks align with headings/sections.
- Hybrid approaches balance consistency and meaning boundaries.
Overlapping chunks
Overlap preserves context around boundaries but increases index size and possible duplicate evidence.
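A minimal fixed-size chunker with configurable overlap, counting words rather than tokens for simplicity; the sizes shown are illustrative, not recommendations.

```python
# Sketch: fixed-size chunking with overlap. Sizes are in words for simplicity;
# production chunkers usually count tokens with the model's tokenizer.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40):
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
pieces = chunk_text(doc, chunk_size=200, overlap=40)
print(len(pieces), [len(p.split()) for p in pieces])  # 3 chunks: 200, 200, 180 words
```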
Metadata filtering
Metadata fields (source, timestamp, business unit, policy scope, tenant) are critical for relevance and governance.
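A small sketch of metadata filtering applied before vector scoring; the chunk records, field names, and in-memory list stand in for whatever filter syntax your vector store provides.

```python
# Sketch: restrict candidates by metadata before (or alongside) vector scoring.
# Field names and values here are illustrative.
chunks = [
    {"text": "2023 travel policy ...", "tenant": "acme", "year": 2023},
    {"text": "2024 travel policy ...", "tenant": "acme", "year": 2024},
    {"text": "2024 travel policy ...", "tenant": "globex", "year": 2024},
]

def filter_chunks(chunks, **required):
    # Keep only chunks whose metadata matches every required field.
    return [c for c in chunks if all(c.get(k) == v for k, v in required.items())]

candidates = filter_chunks(chunks, tenant="acme", year=2024)
print([c["text"] for c in candidates])  # only acme's current policy survives
```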
Hybrid retrieval
Combining lexical and vector retrieval improves performance on keyword-heavy queries and rare terms.
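One common way to combine the two result lists is reciprocal rank fusion (RRF); the sketch below fuses a lexical ranking and a vector ranking by rank position alone, so no score normalization is needed. The document IDs are illustrative.

```python
# Sketch: reciprocal rank fusion of a keyword ranking and a vector ranking.
# Each input is a list of doc IDs ordered best-first.
def rrf(rankings, k: int = 60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d3", "d1", "d7"]   # lexical match on a rare term
vector_hits = ["d1", "d4", "d3"]    # semantic neighbors
print(rrf([keyword_hits, vector_hits]))  # d1 and d3 rise to the top
```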
Reranking
A reranker applies deeper scoring to top retrieved candidates, often improving precision before generation.
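A sketch of cross-encoder reranking using the sentence-transformers package; the model name is one commonly used public checkpoint, not a requirement, and the candidate passages are illustrative.

```python
# Sketch: rescore the top retrieved candidates with a cross-encoder reranker.
from sentence_transformers import CrossEncoder

query = "How long do refunds take?"
candidates = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping usually takes 3-5 business days.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort candidates by reranker score, best first.
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the refund passage should score highest
```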
5.4 Grounded generation
Grounded generation means response claims are traceable to retrieved evidence.
Implementation patterns (a minimal prompt-assembly sketch follows this list):
- Cite source chunks inline.
- Separate the answer from the evidence section.
- Refuse or hedge when evidence is insufficient.
- Avoid unsupported synthesis.
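A minimal sketch of assembling a grounded prompt with inline citation IDs and refusing when evidence is weak; the score threshold, prompt wording, and tuple format are illustrative assumptions, and the resulting prompt would be passed to whatever LLM client you use.

```python
# Sketch: build a grounded prompt from retrieved evidence, or refuse when the
# best evidence score falls below an (illustrative) threshold.
MIN_EVIDENCE_SCORE = 0.45  # tune per corpus and retriever

def build_grounded_prompt(question, retrieved):
    # `retrieved` is a list of (score, source_id, text) tuples, best-first.
    if not retrieved or retrieved[0][0] < MIN_EVIDENCE_SCORE:
        return None  # caller should return a refusal / "not enough evidence" answer
    evidence = "\n".join(f"[{src}] {text}" for _, src, text in retrieved)
    return (
        "Answer using ONLY the evidence below. Cite sources as [id]. "
        "If the evidence does not answer the question, say so.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_grounded_prompt(
    "How long do refunds take?",
    [(0.82, "refund_policy.txt", "Refunds are issued within 14 days of purchase.")],
)
print(prompt if prompt else "I don't have enough evidence to answer that.")
```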
5.5 Knowledge base operations
Ingestion pipelines
Automate source sync, parsing, deduplication, quality checks, and reindexing triggers.
Versioning and freshness
Track document versions and index versions. Freshness is essential in policy, legal, pricing, or procedural knowledge domains.
Access controls
Apply tenant and role filters at retrieval time, not only at the UI level.
5.6 Evaluation framework for RAG
Evaluate at three layers:
- Retrieval quality (recall@k, precision@k, hit quality).
- Generation faithfulness (claim-evidence consistency).
- User outcome (task completion, escalation rate, trust).
Use adversarial tests: conflicting documents, stale content, missing evidence, and policy-sensitive prompts. A sketch of the retrieval-layer metrics follows.
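The sketch computes recall@k and precision@k for a single query, assuming a labeled set of relevant chunk IDs; averaging over a representative query set gives the reported retrieval-quality numbers.

```python
# Sketch: retrieval-layer metrics for one query against ground-truth labels.
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    top_k = retrieved_ids[:k]
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

retrieved = ["c7", "c2", "c9", "c4", "c1"]   # ranked retriever output
relevant = {"c2", "c4", "c8"}                # ground-truth evidence chunks
print(precision_recall_at_k(retrieved, relevant, k=5))  # (0.4, 0.666...)
```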
5.7 Failure modes
- Chunk size too large, causing diluted relevance.
- Chunk size too small, losing essential context.
- Missing metadata filters, causing cross-tenant leakage.
- No reranking, resulting in top-k noise.
- Hallucinated answers when retrieval returns weak evidence.
Chapter summary
RAG quality is retrieval engineering plus generation discipline. Better grounding comes from better data and ranking controls, not from larger prompts alone.
Mini-lab: retrieval quality tuning
Goal: improve answer grounding for one domain corpus.
- Build baseline index with fixed-size chunks.
- Run 20 representative questions and log retrieved chunks.
- Implement metadata filtering and reranking.
- Compare relevance and faithfulness before/after.
- Add refusal rule for low-evidence cases.
- Document the final retrieval configuration.
Deliverable in Notion:
- Retrieval tuning report with chosen chunk size, overlap, filters, and reranker policy.
Review questions
- Why does chunking strategy directly influence answer quality?
- When is hybrid retrieval better than pure vector search?
- Why is metadata filtering a security and relevance control?
- What does reranking add beyond approximate nearest-neighbor (ANN) retrieval?
- How do you detect ungrounded generation behavior?
- Why should source freshness be explicitly tracked?
- What is a safe response pattern when retrieval evidence is weak?
- How can overlap improve recall yet increase noise?
- Which metrics evaluate retrieval versus generation separately?
- Why is RAG not a replacement for model safety controls?
Key terms
RAG, embeddings, vector database, semantic search, cosine similarity, chunking, overlap, metadata filtering, hybrid retrieval, reranking, grounded generation.
Exam traps
- Treating all retrieval misses as model hallucination.
- Assuming top-k retrieval order is already optimal.
- Ignoring access-control filtering in multi-tenant deployments.