Chapter 2: Generative AI Fundamentals
Exam focus
- Autoregressive models
- Diffusion models
- VAEs
- GANs
- Encoder-decoder architectures
- LLMs, ViTs, multimodal transformers, cross-modal transformers
- Prompt engineering, zero-shot, few-shot, chain-of-thought
- Conditioning tokens and system prompts
Scope bullet explanations
- Autoregressive models: Generate outputs token-by-token conditioned on prior context.
- Diffusion models: Generate by iterative denoising from noise toward structured output.
- VAEs: Learn compressed latent representation with probabilistic decoding.
- GANs: Generator/discriminator competition for synthetic realism.
- Encoder-decoder: Encode source signal then decode target sequence/signal.
- Transformer families: Shared attention mechanics specialized across modalities.
- Prompting methods: Input-design strategies for behavior control without weight updates.
- Conditioning/system control: Explicit steering context for role, style, safety, and constraints.
Chapter overview
This chapter maps the model families and prompting controls expected for GENM. The exam typically tests conceptual tradeoffs and architectural selection logic rather than implementation internals.
Assumed foundational awareness
Expected baseline:
- tokenization concept,
- embedding intuition,
- loss minimization and gradient-based training,
- train/validation split discipline.
Learning objectives
- Compare major generative model classes and where each fits.
- Explain transformer adaptation from text to vision and multimodal settings.
- Select prompting strategies based on task uncertainty and control needs.
- Recognize practical limitations and risks of each model family.
2.1 Generative model family tradeoffs
Autoregressive systems
Strengths:
- natural sequence generation,
- high controllability with prompt context,
- strong alignment with language tasks.
Limitations:
- sequential decoding latency,
- context window constraints.
Diffusion systems
Strengths:
- strong fidelity for image/media generation,
- flexible conditioning pathways.
Limitations:
- iterative sampling cost,
- tuning complexity for speed-quality balance.
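The iterative-sampling cost can be made concrete with a toy reverse process. This is a sketch under heavy simplification: the single nudge-toward-target step stands in for a learned denoiser, and `toy_denoise`, `strength`, and `steps` are illustrative names, not part of any real diffusion library.

```python
def toy_denoise(x_noisy, steps=50, strength=0.1, target=0.0):
    """Illustrative reverse process: repeatedly nudge a noisy sample toward
    a 'clean' target. Each loop iteration mimics one denoiser pass, which is
    where diffusion's inference cost comes from."""
    x = x_noisy
    for _ in range(steps):
        x = x + strength * (target - x)  # stand-in for one learned denoising step
    return x

noisy = 5.0
print(toy_denoise(noisy, steps=5))   # partially refined
print(toy_denoise(noisy, steps=50))  # much closer to target, 10x the passes
```

The speed-quality tuning problem is visible even here: fewer steps finish faster but leave more residual noise, which is why real schedulers trade step count against sample fidelity.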
VAEs and GANs
VAEs are useful for latent representation learning and controllable compression workflows. GANs remain conceptually important for adversarial generation, but their training can be unstable (for example, mode collapse or oscillating generator/discriminator losses), and they are less dominant than diffusion for many high-fidelity media tasks.
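The probabilistic decoding mentioned for VAEs rests on the reparameterization trick. The sketch below shows the sampling step only, with illustrative names (`reparameterize`, `mu`, `log_var`); a real VAE would learn these parameters with an encoder network.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """VAE reparameterization trick: sample z = mu + sigma * eps, drawing the
    noise eps outside the learned path so mu and log_var stay differentiable."""
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)  # external noise source
    return mu + sigma * eps

rng = random.Random(42)
z = reparameterize(mu=1.0, log_var=0.0, rng=rng)  # sigma = 1.0 here
```

As `log_var` shrinks, samples collapse onto `mu`; as it grows, the latent code becomes noisier, which is the knob that makes VAE latents controllable.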
2.2 Transformer families in GENM context
LLMs
LLMs provide the text reasoning and generation backbone for assistant behavior and orchestration logic.
Vision transformers (ViTs)
ViTs tokenize image patches and apply attention for global context modeling.
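The patch-tokenization step can be sketched in a few lines. This is a toy version under simple assumptions: the "image" is a 2-D list of scalars rather than a real tensor, there is no linear projection or positional embedding, and `image_to_patches` is an illustrative name.

```python
def image_to_patches(image, patch=2):
    """Split an H x W 'image' (list of rows) into flattened patch tokens,
    the first step a ViT performs before attention models global context."""
    h, w = len(image), len(image[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            flat = [image[i + di][j + dj]
                    for di in range(patch) for dj in range(patch)]
            tokens.append(flat)
    return tokens

img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
print(image_to_patches(img))  # four patch tokens, each of length 4
```

Once patches are tokens, attention treats them exactly like word tokens, which is why the same transformer machinery transfers across modalities.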
Multimodal and cross-modal transformers
Multimodal transformers integrate heterogeneous streams. Cross-modal transformers explicitly model inter-modality dependency (for example text query attending to image regions).
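The text-query-attending-to-image-regions example can be sketched as single-head cross-attention. This is a pure-Python toy with illustrative names and tiny hand-picked vectors, not any library's API; a real cross-modal transformer would use learned projections and many heads.

```python
import math

def cross_attention(query, keys, values):
    """Single-head cross-attention: one 'text' query vector attends over
    'image region' key/value vectors (all plain lists of floats)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)  # subtract max for numerical stability
    exp = [math.exp(s - m) for s in scores]
    weights = [e / sum(exp) for e in exp]
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out

q = [1.0, 0.0]                    # "text" query
ks = [[1.0, 0.0], [0.0, 1.0]]     # two "image region" keys
vs = [[10.0, 0.0], [0.0, 10.0]]   # matching values
w, o = cross_attention(q, ks, vs)
print(w[0] > w[1])  # the query attends more to the aligned region
```

The attention weights make the inter-modality dependency explicit: the output is a mixture of image-region values selected by the text query.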
2.3 Prompting and conditioning for reliable outputs
Zero-shot, few-shot, chain-of-thought
- Zero-shot: fast baseline with no examples.
- Few-shot: improves format adherence and style consistency.
- Chain-of-thought: can improve reasoning transparency for some tasks but increases verbosity and leakage risk.
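The few-shot approach above is purely input design; a sketch of prompt assembly, with illustrative names (`build_few_shot_prompt`) and an assumed `Input:`/`Output:` template, makes that concrete.

```python
def build_few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: worked examples anchor format and style
    before the real query. No weight updates are involved."""
    lines = [task]
    for inp, out in examples:
        lines.append(f"Input: {inp}\nOutput: {out}")
    lines.append(f"Input: {query}\nOutput:")  # model completes from here
    return "\n\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify sentiment as positive or negative.",
    [("Great service!", "positive"), ("Never again.", "negative")],
    "The food was excellent.",
)
print(prompt)
```

Dropping the examples list recovers zero-shot prompting; the consistent `Input:`/`Output:` framing is what drives the format-adherence gains attributed to few-shot.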
System prompts and conditioning tokens
System prompts define high-priority behavior and policy boundaries. Conditioning tokens can steer domain role, output style, safety mode, or generation constraints.
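One way to picture this layering is as message assembly before the user turn. The message-dict shape below is a common convention but is illustrative here, not tied to any specific provider API; `assemble_messages` and the directive keys are assumed names.

```python
def assemble_messages(system_prompt, conditioning, user_text):
    """Place a high-priority system prompt plus conditioning directives
    (role, style, safety mode) ahead of the user turn."""
    directives = "; ".join(f"{k}={v}" for k, v in conditioning.items())
    return [
        {"role": "system", "content": f"{system_prompt} [{directives}]"},
        {"role": "user", "content": user_text},
    ]

msgs = assemble_messages(
    "You are a cautious technical assistant.",
    {"style": "concise", "safety_mode": "strict"},
    "Summarize the incident report.",
)
print(msgs[0]["content"])
```

Keeping the steering context in a separate, higher-priority slot is what lets system-level policy constrain user-level requests.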
2.4 Model selection under constraints
Use this exam-ready selection flow:
- Define modality and output target.
- Determine latency and quality requirements.
- Choose candidate model families.
- Evaluate controllability and safety implications.
- Select prompt/conditioning strategy and validation plan.
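The scoring step of this flow can be sketched as a weighted comparison. The weights and candidate scores below are illustrative placeholders for the mini-lab exercise, not empirical benchmarks, and `select_family` is an assumed name.

```python
def select_family(candidates, weights):
    """Score candidate model families against weighted requirements and
    return the best overall fit (higher score = better on each axis)."""
    def total(scores):
        return sum(weights[k] * scores[k] for k in weights)
    return max(candidates, key=lambda name: total(candidates[name]))

# Example requirement profile favoring quality, then latency.
weights = {"quality": 0.4, "latency": 0.3, "control": 0.2, "safety": 0.1}
candidates = {
    "autoregressive": {"quality": 4, "latency": 3, "control": 5, "safety": 4},
    "diffusion":      {"quality": 5, "latency": 2, "control": 3, "safety": 3},
}
print(select_family(candidates, weights))  # -> autoregressive
```

Changing the weights changes the winner, which is the point: family selection follows from the requirement profile, not from a universally best architecture.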
Common failure modes
- Picking model class by hype instead of requirement fit.
- Assuming chain-of-thought always improves outcomes.
- Ignoring serving cost while choosing diffusion-heavy workflows.
- Treating prompts as a replacement for evaluation and guardrails.
Chapter summary
GENM readiness requires clear reasoning across model family choice, transformer specialization, and prompt-control strategies. Practical tradeoff awareness is usually more important than memorizing architecture trivia.
Mini-lab: model family comparison card
- Pick one task from text, one from vision, one multimodal.
- Select candidate model families for each.
- Score each candidate on quality, latency, control, and safety.
- Choose final candidate and justify decision.
Deliverable:
- one comparison table with final recommendation per task.
Review questions
- Why are autoregressive models naturally aligned to text generation?
- What quality-vs-latency tradeoff appears in diffusion inference?
- When is an encoder-decoder architecture preferable to a decoder-only one?
- How is a multimodal transformer different from a plain LLM wrapper?
- Why does few-shot prompting often improve output consistency?
- What is one risk of over-reliance on chain-of-thought prompts?
- How do conditioning tokens improve controllability?
- Why is model family selection an operational decision, not only a research decision?
- What is one practical limitation of GAN training stability?
- How should system prompts interact with downstream guardrail layers?
Key terms
Autoregressive model, diffusion model, VAE, GAN, encoder-decoder, multimodal transformer, conditioning tokens, system prompt.
Exam traps
- Assuming one generative family is universally best.
- Confusing prompt cleverness with robust model design.
- Ignoring inference economics in architecture choice.