
Chapter 2: Transformer Architecture and LLM Mechanics


Chapter 2 of 12 · Core ML and DL Concepts for LLMs (30%).

Chapter Content

Exam focus

Primary domain: Core ML and DL Concepts for LLMs (30%).

Transformer Core

  • Self-attention
  • Multi-head attention
  • Positional encoding
  • Encoder-only vs Decoder-only vs Encoder-Decoder
  • Feed-forward blocks
  • Residual connections
  • Layer normalization
  • Parameter scaling

Tokenization

  • Subword tokenization (BPE, WordPiece, SentencePiece)
  • Token limits
  • Context window constraints

Advanced

  • Emergent abilities
  • In-context learning
  • KV cache
  • Attention complexity (O(n²))
  • Long-context strategies

Scope Bullet Explanations

  • Self-attention: Lets each token weigh relevance of other tokens in the sequence.
  • Multi-head attention: Uses multiple attention projections so the model can learn different relation types in parallel.
  • Positional encoding: Adds token-order information because attention alone is permutation-invariant.
  • Encoder-only vs Decoder-only vs Encoder-Decoder: Encoder-only is strong for understanding, decoder-only for generation, and encoder-decoder for input-to-output transformation.
  • Feed-forward blocks: Per-token nonlinear layers that refine representations after attention.
  • Residual connections: Skip paths that preserve signal and stabilize deep model training.
  • Layer normalization: Normalizes activations to improve optimization stability.
  • Parameter scaling: Increasing model size boosts capability but increases memory/latency cost.
  • Subword tokenization (BPE, WordPiece, SentencePiece): Splits text into subword units to balance vocabulary size and coverage.
  • Token limits: Maximum token budget per request (prompt plus output).
  • Context window constraints: Bound how much text the model can attend to at once.
  • Emergent abilities: Capabilities that become apparent at certain model/data/compute scales.
  • In-context learning: Adapting behavior from examples in the prompt without changing weights.
  • KV cache: Stores prior attention keys/values to speed autoregressive decoding.
  • Attention complexity (O(n²)): Full attention cost grows quadratically with sequence length.
  • Long-context strategies: Techniques like chunking, retrieval, and summarization to keep quality and cost manageable on long inputs.

Chapter overview

Transformers are the operating system of modern LLMs. This chapter explains how tokens move through attention blocks, why context length is expensive, and what model architecture choices imply for quality and serving cost.

Learning objectives

  • Explain self-attention, multi-head attention, feed-forward layers, residuals, and normalization.
  • Compare encoder-only, decoder-only, and encoder-decoder model families.
  • Understand tokenization, token budgets, and context-window constraints.
  • Analyze emergent abilities, in-context learning, and long-context strategies.

2.1 Transformer block internals

Self-attention

Self-attention lets each token attend to other tokens in the sequence. Instead of fixed local windows, the model learns relevance dynamically. This is why transformers handle long-range dependencies better than many earlier architectures.
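A minimal pure-Python sketch of scaled dot-product self-attention for a single head. The toy input vectors stand in for learned query/key/value projections, which a real model would compute with trained weight matrices:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(Q, K, V):
    # Scaled dot-product attention: Q, K, V hold one d-dim vector per token.
    d = len(Q[0])
    out = []
    for q in Q:
        # Each query scores every key: this pairwise loop is the O(n^2) cost.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Output is a weighted mix of value vectors (a convex combination).
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

# Toy 3-token, 2-dim example (illustrative numbers, not trained projections).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = self_attention(x, x, x)
```

Note how every token's output depends on every other token's value vector, weighted by learned relevance rather than fixed position.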

Multi-head attention

Multiple attention heads project different relational views in parallel. In practical terms, one head may track syntax while another tracks entity relationships or long-range context cues.

Feed-forward network (FFN)

After attention mixes context, the FFN applies a per-token nonlinear transformation. Attention handles “which context matters”; the FFN handles “how to transform that contextualized representation.”
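A sketch of the position-wise FFN pattern (expand to a wider hidden layer, apply a nonlinearity, project back). The weights here are made-up toy numbers; real FFNs typically expand to roughly 4x the model width:

```python
def relu(z):
    return max(0.0, z)

def ffn(x, W1, b1, W2, b2):
    # Hidden layer: one unit per row of W1 (wider than the model dimension).
    h = [relu(sum(xi * wi for xi, wi in zip(x, row)) + b) for row, b in zip(W1, b1)]
    # Project back down to the model width.
    return [sum(hi * wi for hi, wi in zip(h, row)) + b for row, b in zip(W2, b2)]

# Toy 2-wide token vector with a 4-wide hidden layer (made-up weights).
x = [0.5, -1.0]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
b2 = [0.0, 0.0]
y = ffn(x, W1, b1, W2, b2)
```

The same function runs on every token independently; unlike attention, no information moves between positions here.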

Residual connections and layer normalization

Residual paths preserve signal and stabilize deep networks. Layer normalization keeps activation scales under control. Together they support very deep stacks without immediate optimization collapse.
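The residual-plus-normalization pattern can be sketched as follows. This uses the pre-norm arrangement (normalize, transform, then add back), one common variant; the original transformer paper used post-norm:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a token vector to zero mean and unit variance.
    mu = sum(x) / len(x)
    var = sum((xi - mu) ** 2 for xi in x) / len(x)
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]

def block_step(x, sublayer):
    # Pre-norm residual pattern: x + Sublayer(LayerNorm(x)).
    # The untouched "+ x" path is what preserves signal through deep stacks.
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

# Toy sublayer that just scales its input by 0.1.
y = block_step([1.0, 3.0], lambda v: [0.1 * vi for vi in v])
```

Because the sublayer only adds a small correction to the identity path, gradients flow through the skip connection even when the sublayer itself is poorly conditioned.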

2.2 Architecture variants

Encoder-only

  • Best for understanding tasks (classification, retrieval embeddings).
  • Bidirectional context over input sequence.

Decoder-only

  • Best for autoregressive generation.
  • Predicts next token conditioned on previous tokens.
  • Dominant in chat and coding assistants.

Encoder-decoder

  • Effective for input-to-output transformations (translation, structured conversion).
  • Separate encoding and generation pathways.

Exam tip: identify the task pattern first, then choose the architecture family.

2.3 Tokenization and context budgeting

Subword tokenization

Common strategies include BPE, WordPiece, and SentencePiece. They trade off vocabulary size, unknown-token behavior, multilingual coverage, and token count efficiency.
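The core BPE idea, repeatedly merging the most frequent adjacent symbol pair, can be sketched in a few lines. The corpus and frequencies below are toy values for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the corpus (word -> frequency).
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Replace every occurrence of the pair with its merged symbol.
    a, b = pair
    return {word.replace(f"{a} {b}", a + b): freq for word, freq in words.items()}

# Toy corpus: words split into characters, with occurrence counts.
corpus = {"h u g": 10, "p u g": 5, "h u g s": 5}
pair = most_frequent_pair(corpus)   # ('u', 'g') appears 20 times
corpus = merge_pair(corpus, pair)
```

Running this loop until a target vocabulary size is reached yields subword units: frequent words become single tokens while rare words stay decomposable, which is the vocabulary-size/coverage tradeoff described above.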

Token limits

Context budget includes:

  • system instructions,
  • user prompt,
  • retrieved evidence,
  • tool traces,
  • generated response.

Failing to budget all segments leads to truncation or degraded answers.
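A rough budget audit over those segments might look like this. Whitespace splitting is a crude stand-in here; real tokenizers usually produce more tokens than words, so treat the counts as a lower bound:

```python
def audit_budget(segments, limit):
    # Approximate token count per segment (whitespace split is a crude proxy).
    counts = {name: len(text.split()) for name, text in segments.items()}
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return counts, shares, total, total <= limit

# Hypothetical prompt flow (illustrative text, not from a real application).
segments = {
    "system": "You are a concise assistant.",
    "user": "Summarize the attached report in three bullets.",
    "retrieval": "Q3 revenue grew 12 percent while churn held flat.",
}
counts, shares, total, fits = audit_budget(segments, limit=4096)
```

An audit like this makes it obvious when one segment (often retrieval) is consuming the budget that the response needs.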

Context window constraints

Even when a model supports large windows, quality can degrade from:

  • attention dilution,
  • irrelevant retrieval chunks,
  • stale or conflicting context.

Large context is not a substitute for retrieval quality and prompt discipline.

2.4 Advanced mechanics

In-context learning

Models infer task behavior from prompt examples without parameter updates. This is powerful but brittle if examples conflict or formatting is inconsistent.

Emergent abilities

Certain reasoning or instruction-following capabilities appear at specific scale thresholds. Treat emergence as observed behavior, not guaranteed feature.

KV cache

In autoregressive decoding, caching key/value tensors avoids recomputing previous-token attention states. This dramatically reduces latency for long outputs.
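A minimal sketch of the caching pattern: each decode step appends one new key/value pair and attends over the accumulated cache, instead of recomputing K and V for the entire prefix. Vectors are toy values standing in for learned projections:

```python
import math

def attend(q, Ks, Vs):
    # One query against all cached keys/values (scaled dot product + softmax).
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in Ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    w = [e / z for e in exps]
    return [sum(wi * v[j] for wi, v in zip(w, Vs)) for j in range(len(Vs[0]))]

class KVCache:
    # Append-only cache: each step adds one key/value pair, so step t costs
    # O(t) instead of recomputing all t keys and values from scratch.
    def __init__(self):
        self.Ks, self.Vs = [], []

    def step(self, q, k, v):
        self.Ks.append(k)
        self.Vs.append(v)
        return attend(q, self.Ks, self.Vs)

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 0.0], [0.5, 0.5])  # first token: cache size 1
out2 = cache.step([0.0, 1.0], [0.0, 1.0], [0.2, 0.8])  # second token: cache size 2
```

The cost that does not go away is memory: the cache grows linearly with generated length, which is why serving stacks track KV-cache footprint per request.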

Attention complexity

Full attention typically scales O(n²) with sequence length, driving memory and compute costs upward quickly as context grows.
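The quadratic cost is easy to see with a back-of-the-envelope estimate of the attention score matrix alone (hypothetical head count and fp16 storage; real implementations often avoid materializing this matrix):

```python
def attn_matrix_bytes(seq_len, n_heads, dtype_bytes=2):
    # One n x n score matrix per head; fp16 = 2 bytes per entry.
    return seq_len * seq_len * n_heads * dtype_bytes

# Doubling the sequence length quadruples the score-matrix footprint.
small = attn_matrix_bytes(4096, 32)   # 1 GiB at 4k context
large = attn_matrix_bytes(8192, 32)   # 4 GiB at 8k context
```

This quadratic growth, not parameter count, is what makes naive long-context inference expensive and motivates the strategies below.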

Long-context strategies

  • retrieval and chunk selection,
  • hierarchical summarization,
  • windowed memory buffers,
  • sparse or grouped attention variants (model dependent).

2.5 Design patterns and tradeoffs

  1. Use smaller high-quality context over maximum possible context.
  2. Spend budget on relevant evidence, not verbose instructions.
  3. Cache aggressively for multi-turn generation workloads.
  4. Pair tokenizer-aware preprocessing with prompt templates.

2.6 Failure modes

  • Treating character count as token count.
  • Overfilling context with duplicate retrieval chunks.
  • Ignoring KV cache settings in latency debugging.
  • Assuming architecture family can be swapped with no evaluation change.

Chapter summary

Transformer performance is determined by attention mechanics, architecture selection, and context management discipline. Most production quality regressions in LLM apps are context-engineering failures rather than purely model-weight failures.

Mini-lab: token and context budget audit

Goal: quantify context pressure in one application flow.

  1. Capture a real prompt flow: system + user + retrieval + response target.
  2. Tokenize each segment and compute percentage of total budget.
  3. Reduce total tokens by 30 percent while preserving answer quality.
  4. Document retrieval and prompt edits used.
  5. Re-test latency and quality.

Deliverable in Notion:

  • Before/after token budget table and observed quality delta.

Review questions

  1. Why does self-attention improve long-range dependency modeling?
  2. What role does FFN play after attention?
  3. How do residual connections stabilize deep transformers?
  4. When is encoder-decoder preferable over decoder-only?
  5. Why can large context windows still produce weak answers?
  6. How does KV caching reduce inference cost?
  7. What practical issues come from O(n²) attention complexity?
  8. How do tokenization choices influence serving cost?
  9. Why is in-context learning powerful but brittle?
  10. What is a robust strategy for long-document QA quality?

Key terms

Self-attention, multi-head attention, feed-forward network, residual connection, layer normalization, encoder-only, decoder-only, encoder-decoder, tokenization, context window, KV cache.

Exam traps

  • Confusing token budget with visible prompt length.
  • Assuming large context always improves factual quality.
  • Treating emergent behavior as deterministic capability.
