1. Stabilize Before Optimizing
- Verify hardware and management-plane integrity first.
- Confirm firmware/software baseline consistency.
- Only then make performance-tuning decisions.
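The ordering above can be enforced as hard gates in a runbook script. This is a sketch; `run_gate` and the placeholder `true` commands stand in for site-specific hardware, firmware, and validation checks.

```shell
#!/usr/bin/env bash
# Sketch: enforce the stabilize-before-optimize order as hard gates.
# The gate commands (true) are placeholders for real site checks.
run_gate() {
  local name="$1"; shift
  if "$@"; then
    echo "PASS: $name"
  else
    echo "STOP: $name failed; do not proceed to tuning"
    return 1
  fi
}

run_gate "hardware/management-plane integrity" true &&
run_gate "firmware/software baseline" true &&
run_gate "controlled validation" true &&
echo "cleared for performance tuning"
```

The `&&` chain guarantees that a failed earlier gate blocks every later one, matching the "only then" rule above.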
Module study guide
Priority 2 of 5 · Domain 4 in exam order
Scope
This module contains expanded study notes, practical drills, and an exam-style question set.
Exam Framework
Exam Scope Coverage
Domain 4 is the highest-weight area and emphasizes proving cluster readiness using stress tests, HPL/NCCL checks, firmware and cable validation, burn-in routines, and storage verification.
You are expected to run tests in a disciplined order from single-node sanity to multi-node fabric confidence.
Drill: Design a layered validation plan showing test order, thresholds, and stop conditions.
HPL is explicitly listed in exam objectives for verification and burn-in.
Drill: Run an HPL baseline protocol and document acceptance criteria for burn-in promotion.
NCCL tests verify east-west bandwidth, collectives behavior, and NVLink/NVSwitch paths.
Drill: Execute a two-stage NCCL test sequence (single-node then multi-node) and explain result deltas.
Blueprint explicitly calls out ClusterKit for broad node and communication assessment.
Drill: Run a minimal ClusterKit scenario and produce a triage list from its findings.
Fabric quality issues can invalidate benchmark conclusions if not verified first.
Drill: Create a pre-benchmark fabric checklist covering cable, transceiver, switch, and BlueField state.
Storage bottlenecks can mimic compute/network instability in end-to-end validation.
Drill: Pair one storage test with one NCCL/HPL run and explain cross-layer interpretation.
Concept Explanations
Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.
Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.
Reliable cluster verification uses a strict progression from local sanity to distributed stress and long-duration burn-in.
Benchmark numbers are only useful when interpreted with topology, firmware, and thermal context.
Cluster readiness is end-to-end; storage and networking signals must be evaluated alongside compute metrics.
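The evidence-producing discipline above (command, output, timestamp, expected vs observed) can be wrapped in a small helper. This is a sketch; `log_evidence` and the log path are illustrative, not part of any NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Sketch: capture command, timestamp, output, and exit code as one
# evidence record. The log path /tmp/evidence.log is an assumption.
log_evidence() {
  local logfile="$1"; shift
  {
    echo "command: $*"
    echo "timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "--- output ---"
    "$@" 2>&1
    echo "exit_code: $?"
  } >> "$logfile"
}

log_evidence /tmp/evidence.log echo "single-node sanity check"
```

Running every validation command through a wrapper like this yields a single artifact bundle for expected-vs-observed comparison later.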
Scenario Playbooks
After infrastructure changes, multi-node all_reduce performance drops sharply while single-node tests remain stable.
Architecture Diagram
[Node A GPUs] -- NVLink/NVSwitch -- [Node A]
       |                               |
    InfiniBand/Ethernet Fabric (E/W) between nodes
       |                               |
[Node B GPUs] -- NVLink/NVSwitch -- [Node B]
Response Flow
Success Signals
NCCL all_reduce sample
all_reduce_perf -b 8 -e 1G -f 2 -g 8
Expected output (example)
# size count type redop root time(us) algbw(GB/s)
...
1073741824 ... 356.2
Quick HPL run meets target, but long-duration burn-in reports periodic failures.
Architecture Diagram
[Validation Pipeline]
  -> [HPL Functional] -> [NCCL Functional] -> [Burn-in Window]
                |
  [Thermal/Power Telemetry]
Response Flow
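The pipeline above can be sketched as a burn-in loop with concurrent telemetry capture. `run_cycle` and `sample_telemetry` are stand-ins for `./hpl.sh` and `nvidia-smi`; swap in the real commands on a live node.

```shell
#!/usr/bin/env bash
# Sketch: burn-in cycles with telemetry running in the background.
# run_cycle and sample_telemetry are placeholders for ./hpl.sh and
# the nvidia-smi query loop; the file paths are assumptions.
sample_telemetry() { while :; do date -u +%s; sleep 1; done; }
run_cycle() { echo "cycle $1 complete"; }

sample_telemetry > /tmp/telemetry.log &
telem_pid=$!
for i in 1 2 3; do run_cycle "$i"; done > /tmp/burnin.log
sleep 2                      # let telemetry collect a couple of samples
kill "$telem_pid" 2>/dev/null
```

Keeping telemetry in a separate background process means a crashed burn-in cycle still leaves the thermal/power record needed for root-cause analysis.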
Success Signals
GPU telemetry sampling during burn-in
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu --format=csv -l 30
Expected output (example)
2026/02/17 10:10:00, 62 C, 287 W, 96 %
2026/02/17 10:10:30, 63 C, 291 W, 97 %
CLI and Commands
Apply this sequence to validate communication path progressively and avoid ambiguous multi-node failures.
Single-node NCCL check
all_reduce_perf -b 8 -e 512M -f 2 -g 8
Expected output (example)
Single-node NCCL baseline captured with expected intra-node bandwidth.
Multi-node NCCL check
mpirun -np 16 -H node1:8,node2:8 all_reduce_perf -b 8 -e 1G -f 2 -g 1
Expected output (example)
Cross-node bandwidth and latency metrics generated for comparison.
Use a two-phase pattern: functional performance check, then long-duration stability check.
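The single-node vs multi-node comparison above can be gated automatically before promoting results. This is a sketch; `check_bandwidth`, the 0.85 floor, and the GB/s figures are illustrative assumptions, not part of the NCCL tests.

```shell
#!/usr/bin/env bash
# Sketch: promote a multi-node result only if it reaches a fraction
# of the single-node baseline. The 0.85 floor is a policy assumption.
check_bandwidth() {
  local baseline="$1" measured="$2" floor="${3:-0.85}"
  awk -v b="$baseline" -v m="$measured" -v f="$floor" \
    'BEGIN { exit (m >= b * f) ? 0 : 1 }'
}

check_bandwidth 356.2 320.0 && echo "PASS: within floor of baseline"
check_bandwidth 356.2 100.0 || echo "HOLD: investigate fabric before promoting"
```

Encoding the threshold as a function makes the promotion decision reproducible instead of a per-operator judgment call.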
HPL functional sample
./hpl.sh --dat ./HPL.dat
Expected output (example)
HPL run complete: baseline GFLOPS captured.
Burn-in sample loop
for i in {1..12}; do ./hpl.sh --dat ./HPL.dat; sleep 300; done
Expected output (example)
12-cycle burn-in complete with no fatal errors.
Common Problems
Symptoms
Likely Cause
Fabric-level issue (cabling, transceiver, switch/BlueField firmware, or topology mismatch).
Remediation
Prevention: Enforce pre-NCCL fabric integrity checklist before every full cluster validation.
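The pre-NCCL checklist above can include an automated pass over `ibstat`-style port state. This is a sketch; the field labels follow common `ibstat` formatting, but verify them against your tool version, and on a live node pipe in real `ibstat` output.

```shell
#!/usr/bin/env bash
# Sketch: flag InfiniBand ports that are not Active/LinkUp in
# ibstat-style output. Field labels are assumptions to verify.
check_ports() {
  awk '
    /^CA / { ca = $2 }
    /State:/ && !/Active/ {
      line = $0; sub(/^[ \t]+/, "", line); print "CHECK", ca, line }
    /Physical state:/ && !/LinkUp/ {
      line = $0; sub(/^[ \t]+/, "", line); print "CHECK", ca, line }'
}

# Illustrative input; on a live node use: ibstat | check_ports
check_ports <<'EOF'
CA 'mlx5_0'
        State: Active
        Physical state: LinkUp
CA 'mlx5_1'
        State: Down
        Physical state: Polling
EOF
```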
Symptoms
Likely Cause
Thermal, power, or marginal hardware behavior under sustained load.
Remediation
Prevention: Treat burn-in as mandatory gate, not optional validation.
Symptoms
Likely Cause
Storage path throughput/latency instability affecting end-to-end pipeline behavior.
Remediation
Prevention: Include storage checks in every validation cycle and artifact bundle.
Lab Walkthroughs
Execute the complete readiness ladder from single-node checks to burn-in and storage validation.
Prerequisites
Run single-node stress and NCCL functional checks.
all_reduce_perf -b 8 -e 512M -f 2 -g 8
Expected: Single-node communication baseline captured with no errors.
Execute HPL functional baseline.
./hpl.sh --dat ./HPL.dat
Expected: Baseline output recorded and within expected range.
Run multi-node NCCL E/W bandwidth validation.
mpirun -np 16 -H node1:8,node2:8 all_reduce_perf -b 8 -e 1G -f 2 -g 1
Expected: Cross-node bandwidth within policy threshold.
Start NCCL/HPL burn-in cycles with telemetry capture.
Expected: No repeated instability signatures during burn-in window.
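One way to scan the captured telemetry for instability signatures is a threshold pass over the CSV. This is a sketch; `flag_hot_samples`, the 85 C limit, and the sample rows are assumptions matching the nvidia-smi query used earlier.

```shell
#!/usr/bin/env bash
# Sketch: flag telemetry samples above a temperature limit in the
# CSV captured by nvidia-smi (timestamp, temperature.gpu,
# power.draw, utilization.gpu). The 85 C limit is a policy choice.
flag_hot_samples() {
  awk -F',' -v limit=85 '
    NR > 1 {                        # skip the CSV header row
      temp = $2
      gsub(/[^0-9.]/, "", temp)     # strip spaces/units from the field
      if (temp + 0 > limit) print "HOT:", $0
    }'
}

# Illustrative input; on a live node pipe in the nvidia-smi CSV log.
flag_hot_samples <<'EOF'
timestamp, temperature.gpu, power.draw, utilization.gpu
2026/02/17 10:10:00, 62, 287.10 W, 96 %
2026/02/17 10:10:30, 91, 402.50 W, 99 %
EOF
```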
Run storage validation and correlate with benchmark artifacts.
fio --name=clusterread --directory=/mnt/ai-data --rw=read --bs=1M --size=16G --numjobs=8
Expected: Storage throughput and latency meet readiness policy.
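A sketch of turning the fio result into a pass/fail gate follows. The summary line, the `sed` pattern, and the 2000 MB/s floor are assumptions; fio's human-readable format varies by version, and `--output-format=json` is more robust for real pipelines.

```shell
#!/usr/bin/env bash
# Sketch: gate a fio read test on a minimum aggregate bandwidth.
# The captured summary line and 2000 MB/s floor are illustrative.
min_mbs=2000
fio_summary='READ: bw=2481MiB/s (2602MB/s), io=16.0GiB (17.2GB), run=6601-6601msec'
mbs=$(printf '%s\n' "$fio_summary" | sed -n 's/.*(\([0-9]*\)MB\/s).*/\1/p')
if [ "${mbs:-0}" -ge "$min_mbs" ]; then
  echo "PASS: ${mbs} MB/s >= ${min_mbs} MB/s"
else
  echo "FAIL: ${mbs:-0} MB/s below floor"
fi
```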
Success Criteria
Study Sprint
| Day | Focus | Output |
|---|---|---|
| 1 | Validation plan design with layered test order. | Cluster validation matrix and thresholds. |
| 2 | Single-node stress and baseline health checks. | Single-node readiness report. |
| 3 | HPL functional run and parameter tuning baseline. | HPL baseline results sheet. |
| 4 | Single-node NCCL and NVLink/NVSwitch-oriented checks. | NCCL single-node validation report. |
| 5 | Fabric firmware and cabling/transceiver audit. | Fabric integrity checklist. |
| 6 | ClusterKit multifaceted assessment run. | ClusterKit triage findings. |
| 7 | Multi-node NCCL E/W bandwidth verification. | Multi-node NCCL bandwidth summary. |
| 8 | NCCL and HPL burn-in protocol execution. | Burn-in stability log with failure signatures. |
| 9 | NeMo workload burn-in and storage verification. | Application-level readiness report. |
| 10 | Exam simulation and root-cause reasoning drill. | Final verification quick-reference pack. |
Hands-on Labs
Each lab includes a collapsed execution sample with representative CLI usage and expected output.
Execute validation in progressive layers and capture confidence evidence.
Sample Command (Single-node to multi-node NCCL runbook)
all_reduce_perf -b 8 -e 512M -f 2 -g 8
Expected output (example)
Single-node NCCL baseline captured with expected intra-node bandwidth.
Use multiple tools to isolate communication issues.
Sample Command (Single-node to multi-node NCCL runbook)
mpirun -np 16 -H node1:8,node2:8 all_reduce_perf -b 8 -e 1G -f 2 -g 1
Expected output (example)
Cross-node bandwidth and latency metrics generated for comparison.
Validate cluster stability under prolonged load.
Sample Command (HPL + burn-in runbook)
./hpl.sh --dat ./HPL.dat
Expected output (example)
HPL run complete: baseline GFLOPS captured.
Validate compute, communication, and storage readiness together.
Sample Command (HPL + burn-in runbook)
for i in {1..12}; do ./hpl.sh --dat ./HPL.dat; sleep 300; done
Expected output (example)
12-cycle burn-in complete with no fatal errors.
Exam Pitfalls
Practice Set
Attempt each question first, then open the answer and explanation.
Answer: B
Single-node validation reduces variables and prevents misattributing local faults to cluster fabric.
Answer: A
Burn-in extends duration and stresses reliability characteristics beyond quick benchmark checks.
Answer: B
NCCL E/W tests reflect cross-node communication behavior relevant to distributed AI workloads.
Answer: B
Multiple tools provide cross-validation and better fault localization.
Answer: B
Physical and firmware issues can masquerade as software benchmark problems.
Answer: B
Burn-in validation must define measurable criteria for reliability, not just peak speed.
Answer: B
Storage readiness directly affects data-loading-heavy training and validation pipelines.
Answer: C
Without criteria, teams cannot make consistent promotion decisions.
Answer: B
Promotion should be based on consistent evidence across readiness, performance, and stability checks.
Answer: B
NCP-AII scenarios emphasize operational judgment based on test evidence, not single-tool memorization.
Primary References
Curated from the NCP-AII blueprint/study-guide sources and official documentation.