1. Stabilize Before Optimizing
- Verify hardware and management-plane integrity first.
- Confirm firmware/software baseline consistency.
- Only then make performance-tuning decisions.
Module study guide
Priority 2 of 6 · Domain 5 in exam order
Scope
This module contains expanded study notes, practical drills, and an exam-style question set.
Exam Framework
Exam Scope Coverage
This module is structured for the Machine Learning exam domain with emphasis on GPU-accelerated training, scaling strategy, profiling, and reliable generalization.
Exam prompts often test whether you can pick the right scaling strategy for memory and throughput constraints.
Drill: Given three scenarios (model fits, model barely fits, largest layer does not fit), map each to the best parallelism strategy and justify tradeoffs.
NCP-ADS machine learning scope is strongly GPU-centric and expects precision-aware optimization decisions.
Drill: Run one training job in fp32, then mixed precision with proper loss scaling, and compare step time and memory utilization.
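The drill above can be sketched with PyTorch's autocast/GradScaler pattern; the tiny model, data, and hyperparameters here are placeholders, and the loop falls back to bf16-on-CPU (no scaling needed) when CUDA is absent:

```python
import torch

model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model.to(device)
# GradScaler matters for fp16 on CUDA; disabled, it is a transparent no-op.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

x = torch.randn(8, 16, device=device)
y = torch.randn(8, 4, device=device)

for step in range(3):
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()  # scale loss to protect small fp16 gradients
    scaler.step(opt)               # unscales grads; skips step on inf/nan
    scaler.update()                # adapts the scale factor over time
```

Comparing step time and memory against an fp32 run of the same loop is the point of the drill.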
Practical exam cases can start on a single GPU before escalating to multi-GPU.
Drill: Tune one workload with baseline, gradient accumulation, and checkpointing; document tokens/s or samples/s and memory footprint.
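A minimal gradient-accumulation loop (PyTorch, illustrative sizes) for the drill above; activation checkpointing via `torch.utils.checkpoint` would be the complementary memory tactic:

```python
import torch

model = torch.nn.Linear(32, 8)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
w0 = model.weight.detach().clone()  # snapshot to confirm an update happened

accum_steps = 4   # effective batch = micro_batch * accum_steps
micro_batch = 8
data = [(torch.randn(micro_batch, 32), torch.randn(micro_batch, 8))
        for _ in range(accum_steps)]

opt.zero_grad(set_to_none=True)
for i, (x, y) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # average gradients across micro-batches
    if (i + 1) % accum_steps == 0:
        opt.step()                   # one optimizer step per effective batch
        opt.zero_grad(set_to_none=True)
```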
Knowing optimization tools is part of software literacy for accelerated machine learning.
Drill: Capture one profile pass, rank top kernels by time, and produce a prioritized optimization action list.
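Ranking hotspots can be rehearsed offline; the kernel names and timings below are hypothetical stand-ins for what a profiler report would contain:

```python
# Hypothetical kernel timings (name, total_ms) as a profiler might report them.
kernels = [
    ("ampere_sgemm_128x64", 412.0),
    ("elementwise_kernel", 95.0),
    ("ncclAllReduce", 260.0),
    ("vectorized_copy", 33.0),
]

total = sum(ms for _, ms in kernels)
ranked = sorted(kernels, key=lambda kv: kv[1], reverse=True)

# Print a cumulative-share table: optimize top entries first.
cumulative = 0.0
for name, ms in ranked:
    cumulative += ms
    print(f"{name:<24} {ms:8.1f} ms  {100 * ms / total:5.1f}%  "
          f"cum {100 * cumulative / total:5.1f}%")
```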
Model quality failures are commonly caused by split leakage, imbalance, and unstable preprocessing.
Drill: Evaluate one dataset with random split vs stratified split and compare validation stability and minority-class behavior.
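A stratified split can be sketched in plain Python (in practice, scikit-learn's `train_test_split(..., stratify=labels)` does this); the labels here are synthetic:

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.2, seed=42):
    """Return (train_idx, val_idx) keeping per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, val = [], []
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        k = max(1, round(len(idxs) * val_frac))  # at least one sample per class
        val.extend(idxs[:k])
        train.extend(idxs[k:])
    return sorted(train), sorted(val)

labels = ["a"] * 90 + ["b"] * 10  # 10% minority class
train_idx, val_idx = stratified_split(labels)
val_minority = sum(labels[i] == "b" for i in val_idx)
print(len(val_idx), val_minority)  # minority class is represented in validation
```

A purely random split of this dataset can leave the validation set with zero minority samples, which is exactly the instability the drill asks you to observe.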
The exam may include strategy-level HPO decisions rather than only algorithm internals.
Drill: Design an HPO plan with explicit search ranges, scaling assumptions, and parallel job limits for a constrained GPU budget.
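One way to frame such a plan as code; the search ranges, trial budget, and parallel-job limit below are illustrative assumptions:

```python
import random

def sample_trial(rng):
    # Explicit search ranges, as the drill asks for.
    return {
        "lr": 10 ** rng.uniform(-5, -2),        # log-uniform learning rate
        "batch_size": rng.choice([32, 64, 128]),
        "weight_decay": 10 ** rng.uniform(-6, -3),
    }

def plan_hpo(budget_trials=24, max_parallel=4, seed=0):
    """Group independent random-search trials into waves bounded by GPU budget."""
    rng = random.Random(seed)
    trials = [sample_trial(rng) for _ in range(budget_trials)]
    return [trials[i:i + max_parallel]
            for i in range(0, budget_trials, max_parallel)]

waves = plan_hpo()
print(len(waves), len(waves[0]))  # 6 waves of 4 parallel jobs
```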
Concept Explanations
Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.
Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.
Choose scaling mode based on fit constraints first, then optimize communication and efficiency.
Mixed precision boosts throughput, but stability checks and loss-scaling controls are required for safe training.
Optimization should follow measured bottlenecks: input pipeline, kernels, memory, or communication.
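The fit-first scaling rule above can be written as a rough decision table (strategy names follow common usage; the mapping is a rule of thumb, not an exam key):

```python
def pick_parallelism(fit: str) -> str:
    """fit: 'comfortable' | 'barely' | 'layer_too_large' (illustrative)."""
    return {
        # Model fits with headroom: replicate the model, shard the data.
        "comfortable": "DDP (data parallelism)",
        # Model barely fits: shard optimizer state / activations first.
        "barely": "ZeRO-style sharding + activation checkpointing, then DDP",
        # A single layer exceeds one GPU: split the layer itself.
        "layer_too_large": "tensor parallelism (plus pipeline parallelism at scale)",
    }[fit]

for case in ("comfortable", "barely", "layer_too_large"):
    print(case, "->", pick_parallelism(case))
```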
Scenario Playbooks
After enabling mixed precision, training throughput increases but validation accuracy drops unexpectedly.
Architecture Diagram

```
[Data Loader] -> [Forward/Backward] -> [Optimizer Step]
                        |
             [Precision + Loss Scaling]
```

Response Flow
Success Signals
AMP training toggle (PyTorch-style)

```shell
python train.py --precision bf16 --grad-accum 1 --log-interval 50
```

Expected output (example)

```
step=500 loss=... throughput=... tokens/s
```

Training scales poorly despite additional GPUs. You need to classify whether the bottleneck is data loading, communication, or kernel efficiency.
Architecture Diagram
```
[Sharded Dataset] -> [GPU Worker x N] -> [AllReduce]
        |                   |
 [Input Pipeline]     [Interconnect]
```

Response Flow
Success Signals
CLI and Commands
Establish reproducible baseline before advanced scaling changes.
Single-GPU fp32 baseline

```shell
python train.py --precision fp32 --devices 1 --seed 42
```

Expected output (example)

```
epoch=1 step_time=... val_metric=...
```

Single-GPU mixed precision run

```shell
python train.py --precision bf16 --devices 1 --seed 42
```

Expected output (example)

```
epoch=1 step_time=... val_metric=...
```

Measure DDP efficiency and inspect bottlenecks with profiler tooling.

DDP scaling sweep

```shell
for n in 1 2 4 8; do torchrun --nproc_per_node=$n train.py --precision bf16 --seed 42; done
```

Expected output (example)

```
n=1 ... samples/s
n=2 ...
n=4 ...
```

Nsight Systems profile sample

```shell
nsys profile -o train_profile --trace=cuda,nvtx,osrt python train.py --max-steps 200
```

Expected output (example)

```
Profile report generated: train_profile.qdrep
```

Common Problems
Symptoms
Out-of-memory errors or crashes shortly after increasing batch size.
Likely Cause
Batch strategy exceeded memory headroom without checkpointing/sharding adjustments.
Remediation
Reduce the micro-batch size, then recover effective batch size with gradient accumulation, activation checkpointing, or optimizer-state sharding.
Prevention: Use memory budget planning before increasing batch-size experiments.
Symptoms
Training metrics keep improving while validation metrics stall or degrade.
Likely Cause
Overfitting combined with insufficient split discipline or preprocessing robustness.
Remediation
Re-audit splits for leakage, stratify by class, and stabilize preprocessing before adding regularization.
Prevention: Track train/validation divergence and class-wise metrics in every experiment review.
Lab Walkthroughs
Compare fp32 and mixed precision with stable, reproducible setup.
Prerequisites
Run fp32 baseline and capture throughput, memory, and validation metrics.
```shell
python train.py --precision fp32 --devices 1 --seed 42
```

Expected: Baseline metrics are stored for direct comparison.
Run mixed precision with identical data and seed settings.
```shell
python train.py --precision bf16 --devices 1 --seed 42
```

Expected: Throughput and memory change is measured without confounding factors.
Decide deployment precision policy with quality guardrail.
Expected: Final decision includes both speed gain and validation tolerance.
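A quality-guardrail decision for this step might be encoded like so; the tolerance and minimum-speedup thresholds are hypothetical:

```python
def choose_precision(fp32_metric, bf16_metric, speedup,
                     tolerance=0.002, min_speedup=1.2):
    """Adopt mixed precision only if quality stays within tolerance
    and the speed gain is large enough to justify the change."""
    quality_ok = (fp32_metric - bf16_metric) <= tolerance
    return "bf16" if (quality_ok and speedup >= min_speedup) else "fp32"

print(choose_precision(fp32_metric=0.912, bf16_metric=0.911, speedup=1.6))  # bf16
```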
Success Criteria
Investigate why scaling efficiency drops as GPU count increases.
Prerequisites
Run scaling sweep at 1, 2, and 4+ GPUs and capture throughput.
Expected: Efficiency curve reveals where scaling starts to degrade.
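Scaling efficiency from such a sweep is measured throughput divided by ideal linear throughput; the sweep numbers below are made up for illustration:

```python
def scaling_efficiency(throughput):
    """throughput: {gpu_count: samples_per_sec}; efficiency vs linear scaling from n=1."""
    base = throughput[1]
    return {n: tps / (n * base) for n, tps in sorted(throughput.items())}

# Hypothetical sweep results (samples/s)
eff = scaling_efficiency({1: 1000, 2: 1900, 4: 3400, 8: 5600})
for n, e in eff.items():
    print(f"n={n}: {e:.0%}")  # efficiency falling with n flags a scaling bottleneck
```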
Profile one degraded run for communication and input stalls.
Expected: Primary bottleneck category is identified with evidence.
Apply one targeted change and rerun sweep.
Expected: Efficiency improves or limitation is clearly documented.
Success Criteria
Study Sprint
| Day | Focus | Output |
|---|---|---|
| 1 | Baseline model training and metric capture on single GPU. | Reference run report with throughput, memory, and validation metrics. |
| 2 | Precision experiments (fp32 vs mixed precision settings). | Precision comparison table and stability notes. |
| 3 | Memory tactics: accumulation and checkpointing sweep. | Memory and runtime tradeoff matrix. |
| 4 | Parallelism strategy rehearsal (DDP, ZeRO concepts, TP/PP decision rules). | Decision tree for selecting parallelism mode. |
| 5 | Profiling deep dive and kernel hotspot analysis. | Top bottleneck list and mitigation plan. |
| 6 | Data split quality and stratification checks. | Leakage and class-balance checklist with corrected splits. |
| 7 | Normalization and preprocessing ablations. | Performance comparison across preprocessing variants. |
| 8 | HPO design and constrained-budget tuning run. | Tuning strategy report and best-trial summary. |
| 9 | Distributed case study with memory-risk controls. | Runbook for OOM prevention and recovery. |
| 10 | Exam simulation and final revision pass. | Final cheat sheet and weak-area remediation list. |
Hands-on Labs
Each lab includes a collapsed execution sample with representative CLI usage and expected output.
Quantify impact of mixed precision and TF32/BF16 options on performance and accuracy.

Sample Command (Training baseline and precision comparison)

```shell
python train.py --precision fp32 --devices 1 --seed 42
```

Expected output (example)

```
epoch=1 step_time=... val_metric=...
```

Choose and validate a scaling strategy for model-size and interconnect constraints.

Sample Command (Training baseline and precision comparison)

```shell
python train.py --precision bf16 --devices 1 --seed 42
```

Expected output (example)

```
epoch=1 step_time=... val_metric=...
```

Use profiler evidence to improve utilization.

Sample Command (Distributed scaling and profiling runbook)

```shell
for n in 1 2 4 8; do torchrun --nproc_per_node=$n train.py --precision bf16 --seed 42; done
```

Expected output (example)

```
n=1 ... samples/s
n=2 ...
n=4 ...
```

Improve model robustness while controlling tuning cost.

Sample Command (Distributed scaling and profiling runbook)

```shell
nsys profile -o train_profile --trace=cuda,nvtx,osrt python train.py --max-steps 200
```

Expected output (example)

```
Profile report generated: train_profile.qdrep
```

Exam Pitfalls
Practice Set
Attempt each question first, then open the answer and explanation.
Answer: C
When the model fits on one GPU, DDP is generally the simplest and strongest baseline for scaling throughput.
Answer: B
Mixed precision is a selective precision strategy; stability-critical computations often remain in higher precision.
Answer: B
Accumulation increases the number of micro-steps, which can reduce training throughput.
Answer: B
Layer-level size constraints often require tensor-level splitting beyond simple data parallel replication.
Answer: C
Profiling directs optimization effort to measured bottlenecks rather than guesswork.
Answer: B
Stratification keeps class proportions more consistent between train/validation/test.
Answer: B
Random search scales well in parallel because each trial is independent.
Answer: B
A widening train-validation gap typically indicates memorization without robust generalization.
Answer: B
Loss scaling protects gradient signal in low-precision arithmetic by shifting values into representable ranges.
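The mechanism can be seen numerically with fp16 via NumPy: a small gradient underflows to zero unscaled but survives when scaled; the scale factor 1024 is an arbitrary example:

```python
import numpy as np

grad = 1e-8                        # small but meaningful gradient value
print(np.float16(grad))            # underflows to 0.0 in fp16

scale = 1024.0
scaled = np.float16(grad * scale)  # shifted into fp16's representable range
recovered = float(scaled) / scale  # unscale after backward, as AMP does
print(scaled, recovered)
```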
Answer: B
Normalization improves optimization behavior by reducing disproportionate influence from large-scale features.