Exam focus
Primary domain: Core ML and DL Concepts for LLMs (30%).
Transformer Core
- Self-attention
- Multi-head attention
- Positional encoding
- Encoder-only vs Decoder-only vs Encoder-Decoder
- Feed-forward blocks
- Residual connections
- Layer normalization
- Parameter scaling
Tokenization
- Subword tokenization (BPE, WordPiece, SentencePiece)
- Token limits
- Context window constraints
Advanced
- Emergent abilities
- In-context learning
- KV cache
- Attention complexity (O(n²))
- Long-context strategies
Scope Bullet Explanations
- Self-attention: Lets each token weigh relevance of other tokens in the sequence.
- Multi-head attention: Uses multiple attention projections so the model can learn different relation types in parallel.
- Positional encoding: Adds token-order information because attention alone is permutation-invariant.
- Encoder-only vs Decoder-only vs Encoder-Decoder: Encoder-only is strong for understanding, decoder-only for generation, and encoder-decoder for input-to-output transformation.
- Feed-forward blocks: Per-token nonlinear layers that refine representations after attention.
- Residual connections: Skip paths that preserve signal and stabilize deep model training.
- Layer normalization: Normalizes activations to improve optimization stability.
- Parameter scaling: Increasing model size boosts capability but increases memory/latency cost.
- Subword tokenization (BPE, WordPiece, SentencePiece): Splits text into subword units to balance vocabulary size and coverage.
- Token limits: Maximum token budget per request (prompt plus output).
- Context window constraints: Bound how much text the model can attend to at once.
- Emergent abilities: Capabilities that become apparent at certain model/data/compute scales.
- In-context learning: Adapting behavior from examples in the prompt without changing weights.
- KV cache: Stores prior attention keys/values to speed autoregressive decoding.
- Attention complexity (O(n²)): Full attention cost grows quadratically with sequence length.
- Long-context strategies: Techniques like chunking, retrieval, and summarization to keep quality and cost manageable on long inputs.
Chapter overview
Transformers are the operating system of modern LLMs. This chapter explains how tokens move through attention blocks, why context length is expensive, and what model architecture choices imply for quality and serving cost.
Learning objectives
- Explain self-attention, multi-head attention, feed-forward layers, residuals, and normalization.
- Compare encoder-only, decoder-only, and encoder-decoder model families.
- Understand tokenization, token budgets, and context-window constraints.
- Analyze emergent abilities, in-context learning, and long-context strategies.
2.1 Transformer block internals
Self-attention
Self-attention lets each token attend to other tokens in the sequence. Instead of fixed local windows, the model learns relevance dynamically. This is why transformers handle long-range dependencies better than many earlier architectures.
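The mechanics above can be sketched as scaled dot-product attention. This is a minimal pure-Python illustration (names `softmax` and `self_attention` are ours); it assumes the query/key/value projections have already been applied and operates on tiny lists rather than real tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Scaled dot-product attention over a tiny sequence.
    Q, K, V are lists of d-dimensional token vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # relevance of every token to this query, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # output is a weighted mix of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Because the weights come from a softmax, each output vector is a convex combination of the value vectors; that is the "learned relevance" in one line of math.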
Multi-head attention
Multiple attention heads project different relational views in parallel. In practical terms, one head may track syntax while another tracks entity relationships or long-range context cues.
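The head-splitting idea reduces to slicing and re-concatenating vectors. A hedged sketch (helper names `split_heads`/`merge_heads` are ours; real implementations use learned per-head projections rather than raw slices):

```python
def split_heads(x, n_heads):
    """Slice each token's d_model vector into n_heads smaller vectors;
    each head then runs attention on its own slice in parallel."""
    d = len(x[0])
    assert d % n_heads == 0, "d_model must divide evenly across heads"
    hd = d // n_heads
    return [[tok[h * hd:(h + 1) * hd] for tok in x] for h in range(n_heads)]

def merge_heads(heads):
    """Concatenate per-head outputs back into one d_model vector per token."""
    n_tokens = len(heads[0])
    return [sum((head[i] for head in heads), []) for i in range(n_tokens)]
```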
Feed-forward network (FFN)
After attention mixes context, the FFN applies a per-token nonlinear transformation. Attention handles "which context matters"; the FFN handles "how to transform that contextualized representation."
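The per-token structure is the key point: the same two-layer MLP is applied to each position independently. A toy sketch (function name `ffn_token` is ours; weights here are lists of columns, and we use ReLU where real models often use GELU):

```python
def relu(x):
    return max(0.0, x)

def ffn_token(v, W1, b1, W2, b2):
    """Position-wise feed-forward: expand one token vector, apply a
    nonlinearity, then project back to the model dimension."""
    hidden = [relu(sum(vi * w for vi, w in zip(v, col)) + b)
              for col, b in zip(W1, b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(W2, b2)]
```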
Residual connections and layer normalization
Residual paths preserve signal and stabilize deep networks. Layer normalization keeps activation scales under control. Together they support very deep stacks without immediate optimization collapse.
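Both mechanisms fit in a few lines. A minimal sketch (names `layer_norm` and `residual_block` are ours; it shows the pre-norm arrangement common in modern LLMs, omitting the learned scale/shift parameters):

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one activation vector to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def residual_block(v, sublayer):
    """Pre-norm residual: output = x + sublayer(LayerNorm(x)).
    The skip path passes x through untouched, preserving signal."""
    return [a + b for a, b in zip(v, sublayer(layer_norm(v)))]
```

Note that if the sublayer outputs zeros, the block is the identity, which is exactly why deep stacks of these blocks remain trainable.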
2.2 Architecture variants
Encoder-only
- Best for understanding tasks (classification, retrieval embeddings).
- Bidirectional context over input sequence.
Decoder-only
- Best for autoregressive generation.
- Predicts next token conditioned on previous tokens.
- Dominant in chat and coding assistants.
Encoder-decoder
- Effective for input-to-output transformations (translation, structured conversion).
- Separate encoding and generation pathways.
Exam tip: identify the task pattern first, then choose the architecture family.
2.3 Tokenization and context budgeting
Subword tokenization
Common strategies include BPE, WordPiece, and SentencePiece. They trade off vocabulary size, unknown-token behavior, multilingual coverage, and token count efficiency.
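The core BPE operation is merging frequent adjacent pairs into single tokens. A toy sketch of one merge step (function name `bpe_merge` is ours; real tokenizers learn a ranked merge table from corpus statistics and apply it repeatedly):

```python
def bpe_merge(tokens, pair):
    """Apply a single BPE merge: fuse every adjacent occurrence of
    `pair` into one token, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```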
Token limits
Context budget includes:
- system instructions,
- user prompt,
- retrieved evidence,
- tool traces,
- generated response.
Failing to budget all segments leads to truncation or degraded answers.
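A simple budget check over those segments (function name `fits_budget` is ours; it assumes token counts are already computed with the model's tokenizer):

```python
def fits_budget(segments, context_limit, reserve_for_output):
    """Check whether all prompt segments plus reserved output tokens fit
    the context window. `segments` maps name -> pre-computed token count.
    Returns (fits, remaining slack in tokens)."""
    used = sum(segments.values())
    slack = context_limit - used - reserve_for_output
    return slack >= 0, slack
```

Reserving output tokens up front is the point: a prompt that "fits" but leaves no room for the response still fails.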
Context window constraints
Even when a model supports large windows, quality can degrade from:
- attention dilution,
- irrelevant retrieval chunks,
- stale or conflicting context.
Large context is not a substitute for retrieval quality and prompt discipline.
2.4 Advanced mechanics
In-context learning
Models infer task behavior from prompt examples without parameter updates. This is powerful but brittle if examples conflict or formatting is inconsistent.
Emergent abilities
Certain reasoning or instruction-following capabilities appear at specific scale thresholds. Treat emergence as observed behavior, not a guaranteed feature.
KV cache
In autoregressive decoding, caching key/value tensors avoids recomputing previous-token attention states. This dramatically reduces latency for long outputs.
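The savings are easy to quantify. A back-of-envelope sketch (function name `kv_work` is ours): without a cache, decoding step t recomputes K/V for all t prior tokens; with a cache, each token's K/V is computed exactly once.

```python
def kv_work(n_tokens, cached):
    """Key/value computations needed to decode n_tokens autoregressively.
    Uncached: sum of 1..n = n(n+1)/2. Cached: n (one per token)."""
    if cached:
        return n_tokens
    return n_tokens * (n_tokens + 1) // 2
```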
Attention complexity
Full attention typically scales O(n²) with sequence length, driving memory and compute costs upward quickly as context grows.
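A concrete memory estimate makes the quadratic term tangible. This sketch (function name `attn_scores_bytes` is ours) counts only the attention score matrices for one layer, assuming fp16 (2 bytes per entry):

```python
def attn_scores_bytes(seq_len, n_heads, bytes_per_elem=2):
    """Memory for one layer's attention score matrices: each head holds
    an n x n matrix, so cost grows quadratically with sequence length."""
    return n_heads * seq_len * seq_len * bytes_per_elem
```

Doubling the sequence length quadruples this cost, which is why long-context serving is expensive even before weights are counted.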
Long-context strategies
- retrieval and chunk selection,
- hierarchical summarization,
- windowed memory buffers,
- sparse or grouped attention variants (model dependent).
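The chunking strategy above can be sketched in a few lines (function name `chunk_tokens` is ours; overlap between chunks preserves context at the boundaries for downstream retrieval or summarization):

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into overlapping fixed-size chunks."""
    step = chunk_size - overlap
    assert step > 0, "overlap must be smaller than chunk_size"
    chunks, i = [], 0
    while i < len(tokens):
        chunks.append(tokens[i:i + chunk_size])
        if i + chunk_size >= len(tokens):
            break
        i += step
    return chunks
```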
2.5 Design patterns and tradeoffs
- Use smaller high-quality context over maximum possible context.
- Spend budget on relevant evidence, not verbose instructions.
- Cache aggressively for multi-turn generation workloads.
- Pair tokenizer-aware preprocessing with prompt templates.
2.6 Failure modes
- Treating character count as token count.
- Overfilling context with duplicate retrieval chunks.
- Ignoring KV cache settings in latency debugging.
- Assuming architecture family can be swapped with no evaluation change.
Chapter summary
Transformer performance is determined by attention mechanics, architecture selection, and context-management discipline. Most production quality regressions in LLM apps are context-engineering failures, not model-weight failures.
Mini-lab: token and context budget audit
Goal: quantify context pressure in one application flow.
- Capture a real prompt flow: system + user + retrieval + response target.
- Tokenize each segment and compute percentage of total budget.
- Reduce total tokens by 30 percent while preserving answer quality.
- Document retrieval and prompt edits used.
- Re-test latency and quality.
Deliverable in Notion:
- Before/after token budget table and observed quality delta.
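The percentage step of the audit can be sketched as follows (function name `budget_report` is ours; token counts per segment are assumed to come from the model's tokenizer):

```python
def budget_report(segments):
    """Percentage of total prompt tokens consumed by each segment.
    Input maps segment name -> token count."""
    total = sum(segments.values())
    return {name: round(100 * count / total, 1)
            for name, count in segments.items()}
```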
Review questions
- Why does self-attention improve long-range dependency modeling?
- What role does FFN play after attention?
- How do residual connections stabilize deep transformers?
- When is encoder-decoder preferable over decoder-only?
- Why can large context windows still produce weak answers?
- How does KV caching reduce inference cost?
- What practical issues come from O(n²) attention complexity?
- How do tokenization choices influence serving cost?
- Why is in-context learning powerful but brittle?
- What is a robust strategy for long-document QA quality?
Key terms
Self-attention, multi-head attention, feed-forward network, residual connection, layer normalization, decoder-only, encoder-decoder, tokenization, context window, KV cache.
Exam traps
- Confusing token budget with visible prompt length.
- Assuming large context always improves factual quality.
- Treating emergent behavior as deterministic capability.