
Machine Learning

Module study guide

Priority 2 of 6 · Domain 5 in exam order

Scope

Exam study content

This module contains expanded study notes, practical drills, and an exam-style question set.

Exam weight: 20%
Priority tier: Tier 1
Why this domain: Multi-GPU, mixed precision, profiling, and experimentation depth.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes that show no reproducible benefit.

Exam Scope Coverage

What this module covers

This module is structured for the Machine Learning exam domain with emphasis on GPU-accelerated training, scaling strategy, profiling, and reliable generalization.

Track 1: Parallelism strategy selection

Exam prompts often test whether you can pick the right scaling strategy for memory and throughput constraints.

  • Data Parallel (DDP) is usually the baseline for models that fit on one GPU per replica.
  • Tensor Parallel and Pipeline Parallel are used when model size or layer size exceeds single GPU limits.
  • ZeRO-style sharding reduces optimizer and state memory pressure and can be combined with other techniques.

Drill: Given three scenarios (model fits, model barely fits, largest layer does not fit), map each to the best parallelism strategy and justify tradeoffs.
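The drill above can be sketched as a small decision helper. The thresholds and strategy labels are illustrative only, a minimal sketch of the reasoning order (fit constraints first, then memory pressure, then plain replication), not an official decision procedure.

```python
def pick_parallelism(model_fits_on_gpu: bool,
                     largest_layer_fits_on_gpu: bool,
                     optimizer_state_is_dominant: bool) -> str:
    """Map fit constraints to a first-choice scaling strategy (illustrative)."""
    if not largest_layer_fits_on_gpu:
        # A single layer exceeds device memory: it must be split
        # across devices (tensor parallelism), possibly staged as well.
        return "tensor-parallel (possibly combined with pipeline-parallel)"
    if not model_fits_on_gpu:
        # The whole model is too large, but each layer fits: stage layers
        # across devices, or shard parameters/states ZeRO-style.
        return "pipeline-parallel or ZeRO-style sharding"
    if optimizer_state_is_dominant:
        # The model fits, but optimizer/gradient state is the bottleneck:
        # shard those states while keeping data-parallel compute.
        return "DDP + ZeRO-style optimizer-state sharding"
    # Default baseline when everything fits per replica.
    return "DDP"

# The three drill scenarios: fits, barely fits, largest layer does not fit.
for case in [(True, True, False), (False, True, False), (False, False, False)]:
    print(pick_parallelism(*case))
```

In an exam answer, the same order of checks justifies the tradeoffs: only escalate past DDP when a fit constraint forces it.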

Track 2: Mixed precision and Tensor Core efficiency

NCP-ADS machine learning scope is strongly GPU-centric and expects precision-aware optimization decisions.

  • Mixed precision combines lower precision compute with selective FP32 stability steps.
  • Loss scaling helps preserve small gradients and avoid underflow in reduced precision training.
  • Tensor Core throughput benefits are maximized when dimensions and batch sizes satisfy hardware-friendly multiples.

Drill: Run one training job in fp32, then mixed precision with proper loss scaling, and compare step time and memory utilization.
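Why loss scaling preserves small gradients can be shown without a GPU: Python's stdlib `struct` module supports IEEE-754 half precision (format `'e'`), so an fp16 round-trip can be emulated directly. The gradient value and the scale factor below are illustrative; 65536 (2^16) is a common initial loss scale.

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip a float through IEEE-754 half precision
    # (struct's 'e' format) to emulate fp16 storage.
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8                 # a gradient smaller than fp16 can represent
scale = 65536.0             # typical initial loss scale (2**16)

naive = to_fp16(grad)                    # underflows to exactly 0.0
scaled = to_fp16(grad * scale) / scale   # survives the fp16 round-trip

print(f"without scaling: {naive}")
print(f"with scaling:    {scaled:.3e}")
```

Scaling the loss shifts gradients into fp16's representable range before the cast; unscaling afterward recovers the true magnitude with only rounding error.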

Track 3: Throughput and memory tuning on single GPU

Practical exam cases can start on single GPU before escalating to multi-GPU.

  • Batch size choice is the first lever; maximize useful utilization without destabilizing convergence.
  • Gradient accumulation increases effective batch size but adds extra forward/backward overhead.
  • Gradient checkpointing lowers activation memory at the cost of additional recomputation.

Drill: Tune one workload with baseline, gradient accumulation, and checkpointing; document tokens/s or samples/s and memory footprint.
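The accumulation tradeoff can be made concrete in plain Python: averaging per-micro-batch gradients over equal-sized micro-batches reproduces the full-batch gradient of a mean loss, at the cost of extra forward/backward passes per optimizer step. The toy model and data below are hypothetical.

```python
# Toy mean-squared-error loss for a scalar weight w: L = mean((w*x - y)**2),
# so dL/dw = mean(2 * (w*x - y) * x).

def grad_mse(w, xs, ys):
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.0, 2.1, 2.9, 4.2, 5.0, 5.9, 7.1, 8.0]
w = 0.3
accum_steps = 4
micro = len(xs) // accum_steps  # per-micro-batch size

full_grad = grad_mse(w, xs, ys)

# Gradient accumulation: run `accum_steps` forward/backward passes,
# average their gradients, then take a single optimizer step.
acc = 0.0
for s in range(accum_steps):
    lo, hi = s * micro, (s + 1) * micro
    acc += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(full_grad, acc)  # identical up to float rounding
```

The effective batch size grows (per-device batch × accumulation steps × replicas) while peak activation memory stays at the micro-batch level, which is exactly the runtime-for-memory trade the drill asks you to document.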

Track 4: Profiling and bottleneck diagnosis

Familiarity with profiling and optimization tools is core software literacy for accelerated machine learning.

  • Start with nvidia-smi and utilization metrics to identify obvious underutilization.
  • Use profiler traces to isolate time-heavy kernels and non-Tensor-Core bottlenecks.
  • Optimization should be evidence-driven: batch size, precision modes, and kernel hotspots.

Drill: Capture one profile pass, rank top kernels by time, and produce a prioritized optimization action list.
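The "measure after warm-up" habit behind this drill can be rehearsed with a stdlib timing harness before reaching for nsys or a framework profiler. The workload below is a stand-in for a training step; the warm-up and window sizes are illustrative.

```python
import time

def profile_window(step_fn, warmup=5, steps=20):
    """Time `steps` iterations after `warmup` untimed iterations.

    Warm-up absorbs one-off costs (allocator growth, compilation,
    caches) so the window reflects steady-state step time.
    """
    for _ in range(warmup):
        step_fn()
    t0 = time.perf_counter()
    for _ in range(steps):
        step_fn()
    elapsed = time.perf_counter() - t0
    return elapsed / steps  # mean seconds per step

# Stand-in "training step": a small fixed amount of CPU work.
def fake_step():
    sum(i * i for i in range(10_000))

mean_step = profile_window(fake_step)
print(f"mean step time: {mean_step * 1e3:.3f} ms")
```

The same discipline applies to real profiles: a short, steady-state window ranks hotspots more reliably than a trace that includes startup noise.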

Track 5: Data quality, splitting, and generalization

Model quality failures are commonly caused by split leakage, imbalance, and unstable preprocessing.

  • Stratified splitting helps preserve target distribution across train/validation/test.
  • Feature scaling and normalization choices influence gradient-based and distance-based methods.
  • Overfitting and underfitting should be diagnosed from train/validation behavior, not a single metric snapshot.

Drill: Evaluate one dataset with random split vs stratified split and compare validation stability and minority-class behavior.
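The mechanics of stratification can be sketched in plain Python; in practice a library routine such as scikit-learn's `train_test_split(..., stratify=y)` does this for you. The synthetic labels below model a 90/10 imbalance.

```python
import random
from collections import Counter

def stratified_split(labels, test_frac=0.2, seed=42):
    """Return (train_idx, test_idx) preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        # Split each class separately so proportions carry over;
        # keep at least one test example per class.
        n_test = max(1, round(len(idxs) * test_frac))
        test += idxs[:n_test]
        train += idxs[n_test:]
    return train, test

# Imbalanced synthetic labels: 90 of class 0, 10 of class 1.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in test_idx))  # minority class is represented
```

With a purely random 20% split of these labels, the test set could easily contain zero minority examples, which is exactly the instability the drill asks you to compare.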

Track 6: Hyperparameter optimization workflow

The exam may include strategy-level HPO decisions rather than only algorithm internals.

  • Random search scales well in parallel; Bayesian search is sequential but increasingly informed.
  • Hyperband-style early stopping can cut compute cost for large search spaces.
  • Search-space sizing and scale choice (linear vs log) can dominate tuning efficiency.

Drill: Design an HPO plan with explicit search ranges, scaling assumptions, and parallel job limits for a constrained GPU budget.
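The linear-vs-log scale point can be demonstrated directly: log-uniform sampling gives each decade of a learning-rate range equal probability mass, while linear-uniform sampling almost never proposes small values. Ranges and counts below are illustrative.

```python
import math
import random

def sample_lr_log(rng, lo=1e-5, hi=1e-1):
    # Log-uniform: uniform in exponent space, so 1e-5..1e-4 gets the
    # same probability mass as 1e-2..1e-1.
    return 10 ** rng.uniform(math.log10(lo), math.log10(hi))

def sample_lr_linear(rng, lo=1e-5, hi=1e-1):
    # Linear-uniform: ~99% of the mass sits in the top two decades.
    return rng.uniform(lo, hi)

rng = random.Random(0)
log_draws = [sample_lr_log(rng) for _ in range(1000)]
lin_draws = [sample_lr_linear(rng) for _ in range(1000)]

def below(xs, t=1e-3):
    return sum(x < t for x in xs)

print(f"draws below 1e-3: log={below(log_draws)}, linear={below(lin_draws)}")
```

Because each trial is independent, this sampler also parallelizes trivially across a fixed GPU budget, which is the practical argument for random search in the bullet above.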

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.

Parallelism decision framework

Choose scaling mode based on fit constraints first, then optimize communication and efficiency.

  • If model fits on one GPU, start with DDP baseline and measure scaling.
  • If layer/model does not fit, evaluate TP/PP or ZeRO-style memory sharding.
  • Use topology and interconnect awareness when interpreting scaling results.

Precision and stability engineering

Mixed precision boosts throughput, but stability checks and loss-scaling controls are required for safe training.

  • Compare fp32 and mixed precision with fixed seed and identical data path.
  • Monitor overflow/underflow signals and convergence behavior.
  • Report throughput and validation quality together.

Profiler-driven optimization loop

Optimization should follow measured bottlenecks: input pipeline, kernels, memory, or communication.

  • Capture a targeted profiling window after warm-up.
  • Rank hotspots by contribution to total step time.
  • Apply one change at a time and validate improvement.

Scenario Playbooks

Exam-style scenario explanations

Scenario A: Mixed precision improves speed but validation quality regresses

After enabling mixed precision, training throughput increases but validation accuracy drops unexpectedly.

Architecture Diagram

[Data Loader] -> [Forward/Backward] -> [Optimizer Step]
                     |
             [Precision + Loss Scaling]

Response Flow

  1. Validate loss-scaling configuration and overflow logs.
  2. Check whether numerically sensitive operations remain in higher precision where needed.
  3. Compare convergence curves against fp32 baseline for equal training budget.

Success Signals

  • Throughput gain is retained with stable convergence.
  • Validation gap to fp32 is within acceptable threshold.
  • Precision settings are documented for reproducibility.

AMP training toggle (PyTorch-style)

python train.py --precision bf16 --grad-accum 1 --log-interval 50

Expected output (example)

step=500 loss=... throughput=... tokens/s

Scenario B: DDP scale-out from 1 to 8 GPUs gives weak speedup

Training scales poorly despite additional GPUs. You need to classify whether the bottleneck is data loading, communication, or kernel efficiency.

Architecture Diagram

[Sharded Dataset] -> [GPU Worker x N] -> [AllReduce]
          |                              |
     [Input Pipeline]               [Interconnect]

Response Flow

  1. Measure input pipeline throughput and GPU idle time.
  2. Profile communication overhead versus compute time per step.
  3. Tune batch sizing/sharding strategy and rerun controlled benchmark.

Success Signals

  • GPU idle time decreases across workers.
  • Scaling efficiency improves with documented mitigation.
  • Final recommendation includes topology-aware limits.

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute command with explicit scope (node, interface, GPU set).
  3. Compare output against expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).

Training baseline and precision comparison

Establish reproducible baseline before advanced scaling changes.

Single-GPU fp32 baseline

python train.py --precision fp32 --devices 1 --seed 42

Expected output (example)

epoch=1 step_time=... val_metric=...

Single-GPU mixed precision run

python train.py --precision bf16 --devices 1 --seed 42

Expected output (example)

epoch=1 step_time=... val_metric=...

  • Do not change dataset, seed, or augmentation settings between comparisons.
  • Track both throughput and validation metric delta.

Distributed scaling and profiling runbook

Measure DDP efficiency and inspect bottlenecks with profiler tooling.

DDP scaling sweep

for n in 1 2 4 8; do torchrun --nproc_per_node=$n train.py --precision bf16 --seed 42; done

Expected output (example)

n=1 ... samples/s
n=2 ...
n=4 ...

Nsight Systems profile sample

nsys profile -o train_profile --trace=cuda,nvtx,osrt python train.py --max-steps 200

Expected output (example)

Profile report generated: train_profile.qdrep

  • Use short profiling windows after warm-up to reduce trace noise.
  • Re-run only one tuned parameter at a time after profile analysis.

Common Problems

Failure patterns and fixes

OOM and unstable throughput when increasing effective batch size

Symptoms

  • Training crashes when batch or accumulation is raised.
  • Step time becomes inconsistent before failure.
  • GPU memory remains near ceiling with frequent spikes.

Likely Cause

Batch strategy exceeded memory headroom without checkpointing/sharding adjustments.

Remediation

  • Reduce per-device batch and evaluate gradient accumulation tradeoff.
  • Enable checkpointing or memory-sharding strategy where appropriate.
  • Re-profile memory usage after each change.

Prevention: Use memory budget planning before increasing batch-size experiments.
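Memory budget planning can start from a back-of-envelope estimate. The sketch below assumes an Adam-style optimizer with an fp32 master copy plus two fp32 moment buffers; it is a rough planning aid, not exact accounting, since activations (which scale with batch size and fall with checkpointing) are excluded.

```python
def training_state_gib(n_params: float,
                       param_bytes: int = 2,    # bf16/fp16 weights
                       grad_bytes: int = 2,     # bf16/fp16 gradients
                       optim_bytes: int = 12) -> float:
    """Static per-parameter training state in GiB.

    optim_bytes=12 assumes Adam-style state: fp32 master copy (4)
    plus two fp32 moment buffers (4 + 4). Activations are NOT
    included; they depend on batch size and checkpointing.
    """
    total_bytes = n_params * (param_bytes + grad_bytes + optim_bytes)
    return total_bytes / 2**30

# A hypothetical 7B-parameter model: static state alone is ~104 GiB,
# exceeding a single 80 GiB GPU before any activations are counted.
print(f"{training_state_gib(7e9):.0f} GiB")
```

Running this kind of estimate before a batch-size sweep tells you whether checkpointing or state sharding is needed up front, rather than after an OOM.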

High training score but weak generalization

Symptoms

  • Training loss drops steadily while validation performance plateaus or degrades.
  • Model sensitivity to split choice is high.
  • Minority-class behavior is inconsistent.

Likely Cause

Overfitting combined with insufficient split discipline or preprocessing robustness.

Remediation

  • Re-check split strategy with stratification and leakage safeguards.
  • Add regularization and monitor early-stopping criteria.
  • Evaluate class-wise metrics and recalibrate preprocessing choices.

Prevention: Track train/validation divergence and class-wise metrics in every experiment review.
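Train/validation divergence tracking can be captured in a small early-stopping helper. The patience value, lower-is-better metric direction, and loss curve below are illustrative.

```python
class EarlyStopping:
    """Stop when the validation metric fails to improve for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")   # assumes lower-is-better (e.g. val loss)
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Classic overfitting curve: validation loss bottoms out, then degrades
# even while training loss keeps falling.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stopper = EarlyStopping(patience=3)
for epoch, v in enumerate(val_losses, start=1):
    if stopper.step(v):
        print(f"stop at epoch {epoch}, best val loss {stopper.best:.2f}")
        break
```

Logging `best` alongside the current value in every experiment review is what makes the train/validation gap visible as a single tracked number.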

Lab Walkthroughs

Step-by-step execution guides

Walkthrough A: Precision and memory tradeoff benchmark

Compare fp32 and mixed precision with stable, reproducible setup.

Prerequisites

  • Single-GPU training script with fixed seed support.
  • Validation metric pipeline available.
  • Ability to log throughput and memory usage.

  1. Run fp32 baseline and capture throughput, memory, and validation metrics.

    python train.py --precision fp32 --devices 1 --seed 42

    Expected: Baseline metrics are stored for direct comparison.

  2. Run mixed precision with identical data and seed settings.

    python train.py --precision bf16 --devices 1 --seed 42

    Expected: Throughput and memory change is measured without confounding factors.

  3. Decide deployment precision policy with quality guardrail.

    Expected: Final decision includes both speed gain and validation tolerance.

Success Criteria

  • Experiment setup is identical except precision mode.
  • Quality regression, if any, is explicitly quantified.
  • Precision policy is documented for reproducible training runs.

Walkthrough B: Distributed scaling diagnosis lab

Investigate why scaling efficiency drops as GPU count increases.

Prerequisites

  • Training job runnable with torchrun/DDP.
  • Access to profiler tooling and cluster telemetry.
  • Fixed dataset and batch policy.

  1. Run scaling sweep at 1, 2, and 4+ GPUs and capture throughput.

    Expected: Efficiency curve reveals where scaling starts to degrade.

  2. Profile one degraded run for communication and input stalls.

    Expected: Primary bottleneck category is identified with evidence.

  3. Apply one targeted change and rerun sweep.

    Expected: Efficiency improves or limitation is clearly documented.

Success Criteria

  • Scaling report includes efficiency and bottleneck evidence.
  • Mitigation decision is tied to measured root cause.
  • Runbook includes tested limits and recommended topology.

Study Sprint

10-day execution plan

Day 1 · Baseline model training and metric capture on single GPU. Output: Reference run report with throughput, memory, and validation metrics.
Day 2 · Precision experiments (fp32 vs mixed precision settings). Output: Precision comparison table and stability notes.
Day 3 · Memory tactics: accumulation and checkpointing sweep. Output: Memory and runtime tradeoff matrix.
Day 4 · Parallelism strategy rehearsal (DDP, ZeRO concepts, TP/PP decision rules). Output: Decision tree for selecting parallelism mode.
Day 5 · Profiling deep dive and kernel hotspot analysis. Output: Top bottleneck list and mitigation plan.
Day 6 · Data split quality and stratification checks. Output: Leakage and class-balance checklist with corrected splits.
Day 7 · Normalization and preprocessing ablations. Output: Performance comparison across preprocessing variants.
Day 8 · HPO design and constrained-budget tuning run. Output: Tuning strategy report and best-trial summary.
Day 9 · Distributed case study with memory-risk controls. Output: Runbook for OOM prevention and recovery.
Day 10 · Exam simulation and final revision pass. Output: Final cheat sheet and weak-area remediation list.

Hands-on Labs

Practical module work

Each lab includes an execution sample with representative CLI usage and expected output.

Lab A: Precision and throughput

Quantify impact of mixed precision and TF32/BF16 options on performance and accuracy.

  • Establish fp32 baseline with fixed seed and data pipeline.
  • Enable mixed precision path and compare wall-clock step time.
  • Confirm no unacceptable quality regression on validation metrics.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Training baseline and precision comparison)

python train.py --precision fp32 --devices 1 --seed 42

Expected output (example)

epoch=1 step_time=... val_metric=...

Lab B: Multi-GPU strategy fit

Choose and validate a scaling strategy for model-size and interconnect constraints.

  • Classify workload into fit/no-fit/layer-too-large category.
  • Prototype DDP and one memory-sharding method where relevant.
  • Record communication overhead observations and effective speedup.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Training baseline and precision comparison)

python train.py --precision bf16 --devices 1 --seed 42

Expected output (example)

epoch=1 step_time=... val_metric=...

Lab C: Profiling-led optimization

Use profiler evidence to improve utilization.

  • Capture one targeted profiling window after warm-up iterations.
  • Identify top kernels and non-compute stalls.
  • Apply two targeted changes and verify measurable improvement.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Distributed scaling and profiling runbook)

for n in 1 2 4 8; do torchrun --nproc_per_node=$n train.py --precision bf16 --seed 42; done

Expected output (example)

n=1 ... samples/s
n=2 ...
n=4 ...

Lab D: Generalization and HPO discipline

Improve model robustness while controlling tuning cost.

  • Use stratified splits for imbalanced targets.
  • Run bounded HPO with explicit scaling and parallel-job limits.
  • Explain why chosen configuration generalizes beyond training metrics.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Distributed scaling and profiling runbook)

nsys profile -o train_profile --trace=cuda,nvtx,osrt python train.py --max-steps 200

Expected output (example)

Profile report generated: train_profile.qdrep

Exam Pitfalls

Common failure patterns

  • Using large effective batch sizes without validating optimization stability.
  • Assuming mixed precision is always stable without loss-scaling or overflow checks.
  • Ignoring interconnect limits when selecting distributed strategy.
  • Treating low training loss as success while validation quality degrades.
  • Over-searching hyperparameters with poorly bounded ranges and wrong scale assumptions.
  • Skipping stratification on imbalanced data and reporting misleading metrics.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. Your model fits on one GPU but training is too slow. What is the best first distributed baseline?
  • A. Pipeline parallel only
  • B. Tensor parallel only
  • C. Distributed Data Parallel (DDP)
  • D. Full model rewrite before profiling

Answer: C

When the model fits on one GPU, DDP is generally the simplest and strongest baseline for scaling throughput.

Q2. Which statement about mixed precision is most accurate?
  • A. It always reduces memory usage in every setup
  • B. It combines low precision compute with selective higher-precision stability steps
  • C. It removes all need for numerical validation
  • D. It is only useful for inference

Answer: B

Mixed precision is a selective precision strategy; stability-critical computations often remain in higher precision.

Q3. Why can gradient accumulation hurt runtime even while helping memory limits?
  • A. It increases data-loading randomness only
  • B. It adds extra forward/backward passes before each optimizer update
  • C. It disables Tensor Cores
  • D. It removes gradient computations

Answer: B

Accumulation increases the number of micro-steps, which can reduce training throughput.

Q4. If the largest layer cannot fit on one GPU, what is commonly required?
  • A. Random search only
  • B. Tensor parallelism or equivalent layer-level sharding
  • C. Only increasing batch size
  • D. Disabling distributed communication

Answer: B

Layer-level size constraints often require tensor-level splitting beyond simple data parallel replication.

Q5. What is the main value of profiler-guided optimization?
  • A. It guarantees perfect convergence
  • B. It replaces validation metrics
  • C. It pinpoints actual bottlenecks before tuning decisions
  • D. It removes the need for experiment tracking

Answer: C

Profiling directs optimization effort to measured bottlenecks rather than guesswork.

Q6. Why is stratified splitting useful for many classification tasks?
  • A. It enforces equal feature means
  • B. It preserves target distribution across splits
  • C. It guarantees higher training accuracy
  • D. It removes all data leakage risk

Answer: B

Stratification keeps class proportions more consistent between train/validation/test.

Q7. Which HPO strategy is usually best when you need many independent parallel trials quickly?
  • A. Bayesian-only sequential tuning
  • B. Random search
  • C. Manual grid over every parameter without bounds
  • D. No search, only defaults

Answer: B

Random search scales well in parallel because each trial is independent.

Q8. What is a frequent signal of overfitting?
  • A. Training and validation both improve steadily
  • B. Training error decreases while validation error worsens
  • C. Validation improves while training worsens
  • D. All metrics are unchanged

Answer: B

A widening train-validation gap typically indicates memorization without robust generalization.

Q9. What is the key reason to use loss scaling in reduced-precision training?
  • A. Increase dataset size automatically
  • B. Preserve small gradient values that might otherwise underflow
  • C. Disable optimizer updates
  • D. Force all operations to int8

Answer: B

Loss scaling protects gradient signal in low-precision arithmetic by shifting values into representable ranges.

Q10. For gradient-based models, what is one core benefit of feature normalization?
  • A. It always improves every model equally
  • B. It helps stabilize optimization by reducing scale imbalance across features
  • C. It removes need for train-test split
  • D. It guarantees no outliers

Answer: B

Normalization improves optimization behavior by reducing disproportionate influence from large-scale features.

Primary References

Curated from your NCP-ADS vault and official documentation.

Objectives

  1. 5.1 Perform feature engineering for model development.
  2. 5.2 Identify when data scale or workload profile requires acceleration.
  3. 5.3 Run rapid experiments to balance accuracy and inference performance.
  4. 5.4 Optimize machine learning hyperparameters.
  5. 5.5 Train models and compare single-GPU versus multi-GPU strategies.
  6. 5.6 Apply GPU memory optimization techniques such as batching and mixed precision.
