
Chapter 1: Foundations of Generative AI and Deep Learning


Chapter 1 of 12 · Core ML and DL Concepts for LLMs (30%).

Chapter Content

Exam focus

Primary domain: Core ML and DL Concepts for LLMs (30%).

Core AI Foundations

  • Generative vs Discriminative models
  • Supervised vs Unsupervised vs Self-supervised learning
  • Representation learning
  • Foundation models
  • Scaling laws
  • Transfer learning

Deep Learning Basics (Added from DLI)

  • Neural networks (MLP basics)
  • Activation functions (ReLU, GELU, Sigmoid, Tanh)
  • Loss functions (Cross-Entropy, MSE)
  • Backpropagation
  • Gradient descent
  • Vanishing/exploding gradients
  • Regularization (L1/L2, dropout)

Scope Bullet Explanations

  • Generative vs Discriminative models: Generative models create new outputs from learned distributions, while discriminative models classify or score existing inputs.
  • Supervised vs Unsupervised vs Self-supervised learning: Supervised uses labels, unsupervised finds structure, and self-supervised builds labels from raw data (the core of LLM pretraining).
  • Representation learning: The model learns useful internal features (embeddings/hidden states) that can transfer across many tasks.
  • Foundation models: Broad pretrained models that can be adapted with prompting, RAG, or fine-tuning for specific use cases.
  • Scaling laws: Performance generally improves with more parameters, data, and compute, but with cost and diminishing-return tradeoffs.
  • Transfer learning: Reusing pretrained model knowledge to reduce training cost/time for downstream tasks.
  • Neural networks (MLP basics): Stacked linear and nonlinear layers transform inputs into progressively richer features.
  • Activation functions (ReLU, GELU, Sigmoid, Tanh): Nonlinearities that let networks model complex patterns; GELU/ReLU are common in modern transformer stacks.
  • Loss functions (Cross-Entropy, MSE): Cross-entropy is typical for token prediction/classification; MSE is common for regression targets.
  • Backpropagation: Algorithm that computes gradients of loss with respect to model parameters.
  • Gradient descent: Optimization process that updates parameters using gradients to reduce loss.
  • Vanishing/exploding gradients: Gradients become too small or too large, causing stalled learning or unstable updates.
  • Regularization (L1/L2, dropout): Techniques that reduce overfitting and improve generalization.

Chapter overview

This chapter builds the mental model you need for the rest of the book: what generative AI is solving, why modern LLMs are trained the way they are, and which deep learning principles directly affect quality, cost, and stability. If Chapter 1 is weak, every later chapter becomes memorization without reasoning.

Learning objectives

  • Differentiate generative and discriminative modeling goals in practical systems.
  • Explain supervised, unsupervised, and self-supervised learning and where each appears in an LLM lifecycle.
  • Describe representation learning, foundation models, scaling laws, and transfer learning.
  • Apply core deep learning mechanics: losses, backpropagation, optimization, and regularization.

1.1 Core AI foundations

Generative vs discriminative models

A discriminative model predicts labels or outcomes given input features (for example, sentiment class from text). A generative model learns how data is distributed and can synthesize new samples (for example, generate a support answer, summary, or code snippet).

In production systems, these categories are often combined:

  • A discriminative classifier routes user intent.
  • A generative model composes the final response.
  • A discriminative safety classifier validates the output before returning it to the user.
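The hybrid pattern above can be sketched as a toy pipeline. All three functions are hypothetical stand-ins (simple keyword rules in place of real models), not a real API:

```python
# Toy sketch of a hybrid pipeline: discriminative routing, generative
# composition, then discriminative safety validation.

def route_intent(text: str) -> str:
    """Discriminative step: map the input to a fixed label set."""
    return "billing" if "invoice" in text.lower() else "general"

def generate_response(text: str, intent: str) -> str:
    """Generative step: compose a new output (stubbed here)."""
    return f"[{intent}] Here is a drafted answer to: {text}"

def passes_safety(response: str) -> bool:
    """Discriminative step: classify the output as safe/unsafe."""
    blocked = {"password", "ssn"}
    return not any(word in response.lower() for word in blocked)

def handle(user_text: str) -> str:
    intent = route_intent(user_text)
    draft = generate_response(user_text, intent)
    return draft if passes_safety(draft) else "Sorry, I can't share that."

print(handle("Where is my invoice?"))
```

The design point: each stage can be built, evaluated, and replaced independently, which is why production systems rarely consist of a single generative model.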

Supervised, unsupervised, and self-supervised learning

  • Supervised learning depends on labeled examples.
  • Unsupervised learning discovers structure (clusters, latent factors).
  • Self-supervised learning creates training targets from raw data itself. LLM pretraining is mostly self-supervised (next-token prediction), while many downstream adaptations (instruction tuning, evaluation datasets) use supervised data.
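The self-supervised idea is concrete enough to show in a few lines: next-token targets are built from the raw sequence itself, with no human labels. The toy word-level tokens below stand in for real subword token IDs:

```python
# Self-supervised target construction for next-token prediction:
# inputs and labels come from the same sequence, shifted by one position.

tokens = ["the", "model", "predicts", "the", "next", "token"]

# Each training pair is (context, target): the label is just the next token.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(context, "->", target)
```

This is why pretraining scales: every span of raw text yields many training examples for free.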

Representation learning

Representation learning transforms raw inputs into latent features. For language, token embeddings and hidden states capture syntax, semantics, and contextual relationships. High-quality representations reduce the amount of task-specific labeled data needed later.
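A minimal sketch of what "useful internal features" means in practice: semantically related inputs end up geometrically close. The 3-dimensional vectors below are made-up toy values; real embeddings have hundreds or thousands of dimensions:

```python
import math

# Toy embeddings: similarity in meaning shows up as similarity in geometry.

embeddings = {
    "dog":   [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.75, 0.20],
    "car":   [0.10, 0.20, 0.90],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["dog"], embeddings["puppy"]))  # high
print(cosine(embeddings["dog"], embeddings["car"]))    # lower
```

Downstream tasks (retrieval, clustering, classification) can reuse these vectors directly, which is the practical payoff of representation learning.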

Foundation models

Foundation models are large pretrained models intended for broad transfer. Their value is not just size; it is the combination of:

  • broad-domain pretraining,
  • reusable capability,
  • adaptation mechanisms (prompting, fine-tuning, PEFT).

Scaling laws

Scaling laws describe how model performance trends as you increase parameters, data, and compute. The practical exam takeaway is not theory proof; it is planning:

  • bigger models can improve capability,
  • but quality data and compute efficiency still dominate ROI,
  • and inference cost rises quickly if optimization is ignored.
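The diminishing-returns point can be illustrated with a Chinchilla-style parametric loss curve. The constants below are made up for demonstration; only the shape of the curve matters here:

```python
# Illustrative only: a power-law loss curve of the form
# L(N, D) = E + A / N**alpha + B / D**beta, with made-up constants.

E, A, B = 1.7, 400.0, 410.0
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling parameters while holding data fixed helps less and less:
for n in [1e9, 2e9, 4e9, 8e9]:
    print(f"{n:.0e} params -> loss {loss(n, 1e11):.3f}")
```

Each doubling of parameters buys a smaller loss reduction than the last, which is exactly the budgeting intuition the exam expects.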

Transfer learning

Transfer learning lets you start from pretrained weights instead of training from scratch. For enterprise teams, this reduces:

  • time-to-value,
  • training cost,
  • data requirements for initial deployment.

1.2 Deep learning essentials for LLMs

Neural network basics

A neural network applies repeated linear transformations and nonlinear activations. In transformers, this is seen in attention blocks and feed-forward networks.
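The "linear transform, nonlinearity, linear transform" pattern can be shown end to end in pure Python. Inputs and weights are arbitrary toy values chosen for readability:

```python
# A minimal two-layer MLP forward pass: linear -> ReLU -> linear.

def relu(x):
    return [max(0.0, v) for v in x]

def linear(x, W, b):
    # y_j = sum_i x_i * W[i][j] + b_j
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

x = [1.0, 2.0]                          # input features
W1, b1 = [[0.5, -0.3], [0.8, 0.1]], [0.0, 0.2]
W2, b2 = [[1.0], [-1.0]], [0.1]

h = relu(linear(x, W1, b1))             # hidden layer
y = linear(h, W2, b2)                   # output layer
print(y)
```

A transformer's feed-forward sublayer is this same pattern at scale (with GELU instead of ReLU and much wider matrices).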

Activation functions

  • ReLU: simple and efficient, common in many architectures.
  • GELU: smooth gating behavior, common in transformers.
  • Sigmoid/Tanh: appear in gating and legacy recurrent patterns.

Activation choice influences gradient flow and convergence behavior.
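The four activations above side by side. GELU is shown in its common tanh approximation; exact formulations vary slightly across frameworks:

```python
import math

# The activation functions named above, evaluated on a few inputs.

def relu(x):    return max(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + math.exp(-x))
def tanh(x):    return math.tanh(x)
def gelu(x):    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2 / math.pi)
                                      * (x + 0.044715 * x**3)))

for v in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"x={v:+.1f}  relu={relu(v):+.3f}  gelu={gelu(v):+.3f}  "
          f"sigmoid={sigmoid(v):.3f}  tanh={tanh(v):+.3f}")
```

Note how GELU stays slightly negative for small negative inputs instead of clipping to zero like ReLU; this smoothness is one reason it is favored in transformer stacks.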

Loss functions

  • Cross-entropy is standard for next-token prediction and classification.
  • MSE is common for regression but not usually ideal for token generation.

For language modeling, cross-entropy aligns naturally with probabilistic token prediction.
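A small sketch of why cross-entropy fits token prediction: it directly penalizes the negative log-probability assigned to the correct next token. The probability vectors below are toy softmax outputs:

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct token."""
    return -math.log(probs[target_index])

# Model A is confident in the right token; model B is not.
probs_a = [0.05, 0.90, 0.05]   # correct token at index 1
probs_b = [0.30, 0.40, 0.30]

print(cross_entropy(probs_a, 1))  # ~0.105
print(cross_entropy(probs_b, 1))  # ~0.916
```

The loss drops sharply as probability mass concentrates on the right token, giving a strong, well-scaled gradient signal, which MSE over a probability vector would not.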

Backpropagation and gradient descent

Backpropagation computes parameter gradients from loss to earlier layers. Gradient descent (or adaptive variants) uses those gradients to update weights.

If gradients are noisy, unstable, or tiny, optimization slows or diverges. This is why optimizer choice, learning rate schedules, and normalization matter later.
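The update loop itself is simple enough to show on a 1-dimensional loss. The gradient here is computed analytically; backpropagation automates exactly this computation for deep networks:

```python
# Gradient descent on the toy loss L(w) = (w - 3)**2.

def grad(w):
    # Analytic derivative: dL/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1
for step in range(50):
    w -= lr * grad(w)      # step opposite the gradient direction

print(round(w, 4))          # converges toward the minimum at w = 3
```

Try setting `lr = 1.5` and the iterates diverge, a one-line preview of why learning-rate choice matters so much in real training runs.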

Vanishing and exploding gradients

  • Vanishing gradients: updates become too small; model learns slowly.
  • Exploding gradients: updates become too large; loss spikes or becomes NaN.

Common mitigations include normalization, residual connections, gradient clipping, better initialization, and schedule tuning.
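One of those mitigations, global-norm gradient clipping, is easy to sketch: if the overall gradient norm exceeds a threshold, rescale the whole gradient vector so its direction is preserved but its magnitude is capped:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

grads = [30.0, -40.0]                    # norm = 50: an "exploding" step
clipped = clip_by_global_norm(grads, 1.0)
print(clipped)                            # direction kept, norm capped at 1
```

Framework implementations (e.g. PyTorch's `clip_grad_norm_`) follow this same idea across all parameter tensors at once.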

Regularization

Regularization controls overfitting by constraining effective capacity:

  • L1/L2 penalties,
  • dropout,
  • early stopping,
  • data augmentation.

Large models are not immune to overfitting; they can memorize faster when datasets are narrow or noisy.
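Two of these techniques in miniature, with made-up numbers for illustration: an L2 penalty added to the data loss, and inverted dropout applied to activations:

```python
import random

def l2_penalty(weights, lam):
    """L2 regularization term: lambda * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, p, training=True):
    """Inverted dropout: zero activations with probability p during
    training and rescale survivors by 1/(1-p); identity at inference."""
    if not training:
        return activations
    return [0.0 if random.random() < p else a / (1.0 - p)
            for a in activations]

data_loss = 0.42
weights = [0.5, -1.2, 3.0]
total_loss = data_loss + l2_penalty(weights, lam=0.01)
print(round(total_loss, 4))

print(dropout([1.0, 1.0, 1.0, 1.0], p=0.5))
```

The L2 term pushes the optimizer toward smaller weights; dropout forces the network not to rely on any single activation. Both shrink effective capacity without shrinking the model.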

1.3 Applied decision patterns

When designing or debugging an LLM system, use this sequence:

  1. Define task type: generation, classification, extraction, ranking.
  2. Choose training/adaptation approach: prompt-only, PEFT, or fine-tuning.
  3. Validate data regime: label quality, domain coverage, drift risk.
  4. Set optimization constraints: cost, latency, accuracy, safety.
  5. Define evaluation gates before deployment.

This method prevents the common failure of choosing model architecture before clarifying business constraints.

1.4 Common failure modes

  • Confusing benchmark score improvements with user value improvements.
  • Overfitting to a narrow instruction dataset while claiming general capability.
  • Treating larger model size as a substitute for data quality.
  • Ignoring training instability signals until late-stage runs fail.

Chapter summary

Chapter 1 establishes the cause-and-effect chain: learning paradigm -> representation quality -> optimization behavior -> system capability and risk. This foundation enables correct decisions in transformers, tuning, prompt engineering, and production operations.

Mini-lab: model strategy canvas

Goal: choose a realistic model strategy for one use case.

  1. Pick a use case (internal Q&A assistant, ticket summarizer, code helper).
  2. Label each pipeline step as generative, discriminative, or hybrid.
  3. Write expected data sources and whether labels exist.
  4. Decide baseline method (prompt-only, RAG, PEFT).
  5. List two overfitting risks and two mitigation controls.
  6. Define one quality metric and one efficiency metric.

Deliverable in Notion:

  • One-page strategy table with model choice rationale and validation plan.

Review questions

  1. Why does self-supervised pretraining scale better than classic supervised labeling for language models?
  2. What practical difference between generative and discriminative models affects architecture design?
  3. Why is cross-entropy preferred over MSE for next-token objectives?
  4. What symptoms signal exploding gradients during training?
  5. How can transfer learning reduce project risk in first release cycles?
  6. Why can larger models overfit faster on narrow domain data?
  7. Which regularization methods are useful when fine-tuning instruction datasets?
  8. How do scaling laws influence budget and timeline planning?
  9. What does representation learning contribute beyond raw token lookup?
  10. Why should evaluation criteria be defined before training starts?

Key terms

Generative model, discriminative model, self-supervised learning, representation learning, foundation model, scaling laws, transfer learning, cross-entropy, backpropagation, gradient descent, regularization.

Exam traps

  • Confusing training objective with business KPI.
  • Equating parameter count with guaranteed production quality.
  • Assuming dropout or regularization settings are universal defaults.
