Chapter 3: Training Large Language Models

Chapter study guide page

Chapter 3 of 12 · Core ML and DL Concepts for LLMs (30%). Secondary: Data for LLM Applications (10%).

Chapter Content

Exam focus

Primary domain: Core ML and DL Concepts for LLMs (30%). Secondary: Data for LLM Applications (10%).

  • Pretraining
  • Fine-tuning
  • Supervised fine-tuning (SFT)
  • Instruction tuning
  • Transfer learning
  • Dataset curation
  • Data preprocessing
  • Data augmentation
  • Synthetic data generation
  • Curriculum learning
  • Distributed training
  • Data parallelism
  • Model parallelism
  • Mixed precision training
  • Gradient accumulation
  • Checkpointing
  • Optimizers (Adam, AdamW)
  • Learning rate scheduling
  • Overfitting vs underfitting

Scope Bullet Explanations

  • Pretraining: Large-scale self-supervised training to build general language capability.
  • Fine-tuning: Additional training to adapt the base model to a target task/domain.
  • Supervised fine-tuning (SFT): Fine-tuning with labeled prompt-response examples.
  • Instruction tuning: Training the model to better follow user instructions and response formats.
  • Transfer learning: Reuse pretrained knowledge instead of training from scratch.
  • Dataset curation: Selecting, cleaning, balancing, and deduplicating training data.
  • Data preprocessing: Normalizing, filtering, and formatting data before training.
  • Data augmentation: Expanding training examples to improve robustness.
  • Synthetic data generation: Using models/rules to create additional examples for weak coverage areas.
  • Curriculum learning: Ordering training data from simpler to harder patterns.
  • Distributed training: Spreading training across multiple GPUs/nodes.
  • Data parallelism: Replicating model copies while splitting batches across workers.
  • Model parallelism: Splitting one large model across multiple devices.
  • Mixed precision training: Using lower-precision math (FP16/BF16) for speed and memory efficiency.
  • Gradient accumulation: Combining gradients over multiple micro-batches to emulate larger batches.
  • Checkpointing: Saving model states for recovery and experiment tracking.
  • Optimizers (Adam, AdamW): Algorithms that convert gradients into parameter updates.
  • Learning rate scheduling: Controlled learning-rate changes (warmup/decay) for stable convergence.
  • Overfitting vs underfitting: Overfitting memorizes training patterns; underfitting fails to learn enough signal.

Chapter overview

Training quality is a systems problem: objective design, data quality, optimization stability, and distributed infrastructure must all align. This chapter covers how LLMs are trained in practice and how to diagnose where runs fail.

Learning objectives

  • Differentiate pretraining, fine-tuning, SFT, and instruction tuning.
  • Build data curation and preprocessing pipelines that improve downstream performance.
  • Compare data parallelism, model parallelism, and mixed precision strategies.
  • Apply optimizer, schedule, checkpoint, and regularization controls to reduce failure risk.

3.1 Training lifecycle

Pretraining

Large-scale self-supervised learning on broad corpora establishes linguistic and reasoning priors. Pretraining determines the base capability envelope.

Fine-tuning and SFT

Fine-tuning adapts pretrained models to narrower tasks. SFT specifically uses labeled input-output examples to shape assistant behavior.

Instruction tuning

Instruction tuning emphasizes compliance with user intent, formatting constraints, and conversational helpfulness.

Transfer learning perspective

Think of adaptation as constrained updates on a strong prior. The stronger the prior and cleaner the target data, the less expensive adaptation can be.

3.2 Data pipeline engineering

Dataset curation

Core controls:

  • source quality ranking,
  • deduplication,
  • domain balancing,
  • toxicity and policy filtering,
  • leakage prevention.
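
Exact deduplication is usually the first pass. A minimal pure-Python sketch (hypothetical `normalize`/`dedupe` helper names; real pipelines add near-duplicate detection such as MinHash) might look like:

```python
import hashlib

def normalize(text):
    # Collapse whitespace and lowercase so trivially different copies collide.
    return " ".join(text.lower().split())

def dedupe(docs):
    # Keep the first occurrence of each normalized document.
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = ["Hello   world", "hello world", "Different text"]
kept = dedupe(corpus)  # the second entry collapses into the first
```

Near-duplicates (paraphrases, boilerplate variants) need fuzzier techniques, but even exact hashing removes a surprising amount of web-crawl repetition.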

Preprocessing

Normalize encoding, remove malformed content, segment long documents, and preserve metadata needed for audit and debugging.
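
As an illustrative sketch (assumed helper name, not a prescribed pipeline), per-document normalization and filtering can be as simple as:

```python
import unicodedata

def preprocess(text):
    # Unicode-normalize, strip non-printable characters, collapse whitespace,
    # and drop documents that end up too short after cleaning.
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t ")
    text = " ".join(text.split())
    return text if len(text) >= 3 else None
```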

Augmentation and synthetic data

Synthetic data can increase coverage for sparse tasks, but weak generation pipelines can inject systematic errors. Always validate synthetic distributions against real production queries.
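
One crude way to validate a synthetic batch against real traffic is a distribution gate. The sketch below (hypothetical `length_gate`, checking only token-count statistics) illustrates the idea; production gates compare much richer features:

```python
from statistics import mean, stdev

def length_gate(real, synthetic, tol=0.5):
    # Reject a synthetic batch whose token-length distribution drifts too far
    # from real traffic (a crude proxy for distribution match).
    r = [len(x.split()) for x in real]
    s = [len(x.split()) for x in synthetic]
    return abs(mean(s) - mean(r)) <= tol * (stdev(r) or 1.0)
```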

Curriculum learning

Ordering examples from easier to harder tasks can stabilize early learning and improve final convergence on complex patterns.
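
A minimal sketch of the idea, using token count as a stand-in difficulty proxy (real curricula use task-specific difficulty signals):

```python
def curriculum_order(examples, buckets=3):
    # Sort by a difficulty proxy (here: token count) and return easy buckets first.
    ranked = sorted(examples, key=lambda ex: len(ex.split()))
    size = max(1, len(ranked) // buckets)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```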

3.3 Distributed training and efficiency

Data parallelism

Replicate the model on every worker and split each batch across them; gradients are averaged (all-reduced) so the replicas stay in sync. Scales throughput well as long as communication overhead is managed.
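
The synchronization step can be illustrated in miniature; each inner list below stands in for one worker's gradients on its batch shard:

```python
def all_reduce_mean(worker_grads):
    # Each worker computes gradients on its own batch shard; a synchronous
    # all-reduce then averages them so every replica applies the same update.
    n = len(worker_grads)
    return [sum(g[i] for g in worker_grads) / n for i in range(len(worker_grads[0]))]

# Two workers, two parameters: the averaged gradient is what each replica applies.
synced = all_reduce_mean([[0.2, -0.4], [0.6, 0.0]])
```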

Model parallelism

Shard a single large model across devices (by layers or by tensors) when its parameters and activations exceed single-device memory.

Mixed precision

FP16/BF16 arithmetic reduces memory use and raises throughput. Numerical stability must be managed, typically with loss scaling for FP16 plus continuous monitoring of loss and gradients.
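
The loss-scaling control loop can be sketched as follows; this is a simplified stand-in for a framework scaler such as PyTorch's GradScaler, and the constants are illustrative:

```python
import math

class DynamicLossScaler:
    """Sketch of dynamic loss scaling: multiply the loss before backward so
    small FP16 gradients don't underflow, unscale before the optimizer step,
    and skip the step (shrinking the scale) when gradients overflow."""

    def __init__(self, scale=2.0**15, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._stable_steps = 0

    def unscale(self, grads):
        return [g / self.scale for g in grads]

    def update(self, grads):
        # Returns True if the optimizer step should proceed.
        if any(math.isnan(g) or math.isinf(g) for g in grads):
            self.scale /= 2          # overflow: back off and skip this step
            self._stable_steps = 0
            return False
        self._stable_steps += 1
        if self._stable_steps >= self.growth_interval:
            self.scale *= 2          # long stable run: try a larger scale
            self._stable_steps = 0
        return True
```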

Gradient accumulation

Emulates larger effective batch sizes when memory is constrained.
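
A framework-agnostic sketch of the loop (the `compute_grads`/`apply_update` callables are placeholders for a real training step):

```python
def train_with_accumulation(micro_batches, compute_grads, apply_update, accum_steps):
    # Average gradients over `accum_steps` micro-batches before one optimizer
    # update, emulating a batch `accum_steps` times larger in the same memory.
    running = None
    for step, batch in enumerate(micro_batches, start=1):
        grads = compute_grads(batch)
        running = grads if running is None else [a + b for a, b in zip(running, grads)]
        if step % accum_steps == 0:
            apply_update([g / accum_steps for g in running])
            running = None
```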

Checkpointing

Defines recoverability and experiment traceability. Checkpoint cadence should reflect run cost, failure probability, and storage policy.
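
Whatever the cadence, checkpoint writes should be atomic so a crash mid-write never corrupts the last good state. A minimal sketch using JSON metadata (real checkpoints also serialize model and optimizer tensors):

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    # Write to a temp file, then rename: os.replace is atomic on a single
    # filesystem, so readers never observe a half-written checkpoint.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)
```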

3.4 Optimization controls

Optimizers

Adam and AdamW are common because adaptive per-parameter learning rates improve convergence in large parameter spaces. AdamW decouples weight decay from the gradient-based update, giving more predictable regularization than Adam with an L2 penalty folded into the gradient.
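
The decoupling is easiest to see in the update rule itself. A single-scalar sketch of one AdamW step (standard AdamW equations; hyperparameter defaults are illustrative):

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for a single scalar parameter p with gradient g.
    Weight decay acts directly on the parameter (decoupled), not on the
    gradient as in classic Adam + L2 regularization."""
    m = b1 * m + (1 - b1) * g          # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment estimate
    m_hat = m / (1 - b1**t)            # bias correction for step t
    v_hat = v / (1 - b2**t)
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```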

Learning rate scheduling

Warmup prevents unstable early updates. Decay schedules (cosine, linear, step) control later-stage convergence.
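
A common concrete shape is linear warmup followed by cosine decay; a minimal sketch (assumed function name):

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps):
    # Linear warmup to peak_lr, then cosine decay toward zero.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))
```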

Overfitting vs underfitting

  • Underfitting: training and validation losses both remain poor.
  • Overfitting: training loss keeps improving while validation loss degrades.

Diagnose with validation curves over the whole run, not single snapshots.
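
A curve-based heuristic can be sketched as follows (thresholds are illustrative, not canonical):

```python
def diagnose(train_losses, val_losses, gap_tol=0.05):
    # Look at whole curves: train improving while val regresses from its best
    # point suggests overfitting; train not improving suggests underfitting.
    train_improving = train_losses[-1] < train_losses[0] - gap_tol
    val_regressing = val_losses[-1] > min(val_losses) + gap_tol
    if train_improving and val_regressing:
        return "overfitting"
    if not train_improving:
        return "underfitting"
    return "healthy"
```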

3.5 Runbook for stable training

  1. Freeze objective and success metrics before run start.
  2. Lock dataset version and preprocessing commit hash.
  3. Start with proven optimizer/schedule baseline.
  4. Add mixed precision and parallelism incrementally.
  5. Track loss, gradient norms, throughput, and GPU memory continuously.
  6. Stop early on divergence signatures.

3.6 Failure modes

  • Training on unversioned datasets.
  • Changing multiple hyperparameters simultaneously.
  • Ignoring gradient norm spikes.
  • Using synthetic data without quality gate.
  • Running distributed jobs without deterministic logging and seeds.

Chapter summary

Training LLMs is an engineering pipeline, not a single model setting. Reliable outcomes require controlled data, stable optimization, and reproducible distributed execution.

Mini-lab: training run design review

Goal: produce a defensible training plan for an instruction assistant.

  1. Define task and acceptance criteria.
  2. Select dataset version and quality filters.
  3. Specify optimizer, learning-rate schedule, and batch strategy.
  4. Choose parallelism mode and precision mode.
  5. Set checkpoint intervals and early-stop conditions.
  6. Document failure triggers and rollback plan.

Deliverable in Notion:

  • Run card with exact hyperparameters, data lineage, and stop criteria.

Review questions

  1. What differs between pretraining and instruction tuning objectives?
  2. Why does dataset deduplication matter for generalization?
  3. When is model parallelism required?
  4. How does mixed precision improve training efficiency?
  5. Why is warmup commonly used with transformers?
  6. What signals indicate impending divergence?
  7. How can synthetic data improve and degrade outcomes?
  8. Why should checkpoint frequency be risk-based?
  9. What is a clean way to distinguish underfitting from overfitting?
  10. Why is experiment reproducibility essential for production teams?

Key terms

Pretraining, fine-tuning, SFT, instruction tuning, data curation, augmentation, synthetic data, data parallelism, model parallelism, mixed precision, gradient accumulation, AdamW, learning-rate schedule, checkpointing.

Exam traps

  • Assuming larger batch size always improves model quality.
  • Treating one successful run as reproducible without lineage controls.
  • Ignoring data pipeline quality while tuning model hyperparameters aggressively.
