
Cluster Test and Verification

Module study guide

Priority 2 of 5 · Domain 4 in exam order

Scope

Exam study content

This module contains expanded study notes, practical drills, and an exam-style question set.

Exam weight
33%
Priority tier
Tier 1
Why this domain
Highest exam weight; core to proving cluster readiness, bandwidth, burn-in, and storage reliability.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.
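The single-variable discipline above lends itself to a small evidence-capture helper. A minimal POSIX-sh sketch; `run_logged`, the `evidence/` directory, and the echoed bandwidth lines are illustrative stand-ins for real benchmark runs, not part of any official tooling:

```shell
#!/bin/sh
# Sketch of before/after evidence capture (assumed helper, not an official
# tool): run a command, store its timestamped output, then diff the two runs.
run_logged() {
  label="$1"; shift
  mkdir -p evidence
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  log="evidence/${label}.log"
  {
    echo "# cmd: $*"
    echo "# utc: $ts"
    "$@"
  } > "$log" 2>&1
  echo "$log"
}

# Placeholder commands stand in for a real benchmark run before/after one change.
before=$(run_logged baseline  echo "busbw 480 GB/s")
after=$(run_logged candidate  echo "busbw 452 GB/s")
diff "$before" "$after" || true   # the delta, plus timestamps, is the evidence
```

The same workload input goes into both runs; only the one parameter under test changes between them.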

Exam Scope Coverage

What this module now covers

Domain 4 is the highest-weight area and emphasizes proving cluster readiness using stress tests, HPL/NCCL checks, firmware and cable validation, burn-in routines, and storage verification.

Track 1: Validation strategy and test layering

You are expected to run tests in a disciplined order from single-node sanity to multi-node fabric confidence.

  • Start with single-node validation before multi-node collectives.
  • Use progressively stronger tests: readiness, performance, and burn-in stability.
  • Define clear pass/fail thresholds per test category.

Drill: Design a layered validation plan showing test order, thresholds, and stop conditions.
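The layered order with explicit stop conditions can be sketched as a gate loop. The `stage_*` functions below are placeholders: each would wrap a real stress/NCCL/HPL invocation and return non-zero when its pass/fail threshold is missed:

```shell
#!/bin/sh
# Sketch of a layered validation ladder with promotion gates. Each stage_*
# function is a stand-in for a real test stage; non-zero return = failed gate.
stage_single_node() { echo "single-node sanity: PASS"; }
stage_multi_node()  { echo "multi-node NCCL: PASS"; }
stage_burn_in()     { echo "burn-in stability: PASS"; }

run_ladder() {
  for stage in stage_single_node stage_multi_node stage_burn_in; do
    if ! "$stage"; then
      echo "STOP at $stage: remediate before promoting further"
      return 1
    fi
  done
  echo "all gates passed: promote"
}
run_ladder
```

The stop condition is structural: a failed gate halts the ladder rather than letting later stages mask the fault.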

Track 2: HPL execution and burn-in

HPL is explicitly listed in exam objectives for verification and burn-in.

  • Differentiate quick HPL functional runs from long-duration burn-in runs.
  • Capture not just peak output but stability and repeatability over time.
  • Treat thermal/power stability as part of HPL result interpretation.

Drill: Run an HPL baseline protocol and document acceptance criteria for burn-in promotion.
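A hedged sketch of an HPL acceptance gate: extract the Gflops figure from an HPL-style result line and compare it to a documented floor. The sample row and the 500 GFLOPS threshold are illustrative, not real captured output:

```shell
#!/bin/sh
# Illustrative HPL result table (format imitates HPL output; values invented).
cat > hpl_baseline.txt <<'EOF'
T/V                N    NB     P     Q               Time                 Gflops
WR11C2R4      100000   384     4     4             123.45              5.401e+02
EOF

# Gate: extract the Gflops column from the result row and compare to a floor.
threshold=500
gflops=$(awk '/^WR/ {printf "%d", $7}' hpl_baseline.txt)
if [ "$gflops" -ge "$threshold" ]; then
  echo "HPL gate: PASS (${gflops} >= ${threshold} GFLOPS)"
else
  echo "HPL gate: FAIL (${gflops} < ${threshold} GFLOPS)"
fi
```

For burn-in promotion, the same gate would be applied to every cycle's result, not just the best one.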

Track 3: NCCL communication validation

NCCL tests verify east-west bandwidth, collectives behavior, and NVLink/NVSwitch paths.

  • Run single-node NCCL first, then expand to multi-node paths.
  • Use NCCL diagnostics and debug settings when behavior diverges from baseline.
  • Correlate NCCL outcomes with topology, cabling, and firmware state.

Drill: Execute a two-stage NCCL test sequence (single-node then multi-node) and explain result deltas.
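When multi-node behavior diverges from baseline, NCCL's standard debug environment variables (`NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`) narrow the investigation. A minimal sketch; the commented command shows where the settings apply on a real GPU node:

```shell
#!/bin/sh
# NCCL reads these from the environment at init time.
export NCCL_DEBUG=INFO              # log init, topology, and transport choices
export NCCL_DEBUG_SUBSYS=INIT,NET   # narrow logging to startup and network paths
# all_reduce_perf -b 8 -e 1G -f 2 -g 8   # run the real test on a GPU node
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
```

The INIT and NET subsystems are usually the first to inspect when a multi-node run falls back to an unexpected transport.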

Track 4: ClusterKit multifaceted node assessment

The exam blueprint explicitly calls out ClusterKit for broad node and communication assessment.

  • ClusterKit combines latency, bandwidth, collective, and stress-style checks.
  • ClusterKit execution should align with scheduler or passwordless-SSH orchestration prerequisites.
  • Use ClusterKit findings to prioritize deeper NCCL/HPL investigations.

Drill: Run a minimal ClusterKit scenario and produce a triage list from its findings.

Track 5: Firmware, cabling, and transceiver verification

Fabric quality issues can invalidate benchmark conclusions if not verified first.

  • Confirm firmware/software alignment on switches and BlueField components.
  • Validate cable routes and transceiver compatibility to prevent hidden link issues.
  • Treat signal-quality checks as prerequisites for interpreting bandwidth anomalies.

Drill: Create a pre-benchmark fabric checklist covering cable, transceiver, switch, and BlueField state.
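A pre-benchmark fabric sweep can take the shape below. The per-node link check is stubbed with a local function so the control flow runs anywhere; node names and the 8-link expectation are assumptions, and the comment shows the real `ibstat`-based call it stands in for:

```shell
#!/bin/sh
# Stand-in for: ssh "$1" 'ibstat | grep -c "State: Active"'
check_node() {
  case "$1" in node1|node2) echo 8 ;; *) echo 0 ;; esac
}

expected_active=8
for node in node1 node2 node3; do
  active=$(check_node "$node")
  if [ "$active" -eq "$expected_active" ]; then
    echo "$node: $active/$expected_active links Active"
  else
    echo "$node: only $active/$expected_active links Active -> hold benchmarks"
  fi
done
```

Any node that fails the sweep is pulled from benchmarking until cabling, transceivers, and firmware are re-verified, so bandwidth anomalies are never interpreted over a degraded link.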

Track 6: Storage verification in cluster readiness

Storage bottlenecks can mimic compute/network instability in end-to-end validation.

  • Include storage checks in readiness and burn-in flows, not only in post-failure triage.
  • Correlate storage throughput baselines with workload data path expectations.
  • Escalate storage anomalies before interpreting model or benchmark regressions.

Drill: Pair one storage test with one NCCL/HPL run and explain cross-layer interpretation.
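One way to encode cross-layer interpretation is to gate a storage throughput figure against a readiness floor before trusting NCCL/HPL conclusions. The fio-style summary line and the 900 MB/s floor below are assumed for illustration:

```shell
#!/bin/sh
# Illustrative summary line imitating a fio read report (values invented).
cat > fio_summary.txt <<'EOF'
READ: bw=1024MiB/s (1074MB/s), io=16.0GiB (17.2GB), run=16001-16001msec
EOF

# Gate: pull the MB/s figure and compare to an assumed readiness floor.
floor_mbs=900
bw_mbs=$(sed -n 's/.*(\([0-9]*\)MB\/s).*/\1/p' fio_summary.txt)
if [ "$bw_mbs" -ge "$floor_mbs" ]; then
  echo "storage gate: PASS (${bw_mbs} MB/s >= ${floor_mbs} MB/s)"
else
  echo "storage gate: FAIL (${bw_mbs} MB/s < ${floor_mbs} MB/s)"
fi
```

If the storage gate fails, benchmark regressions in the same window are re-examined as possible data-path effects rather than compute or fabric faults.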

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.

Validation ladder design

Reliable cluster verification uses a strict progression from local sanity to distributed stress and long-duration burn-in.

  • Start single-node to reduce fault-space before multi-node collectives.
  • Separate quick readiness checks from prolonged stability tests.
  • Use explicit promotion gates between each layer.

HPL and NCCL interpretation discipline

Benchmark numbers are only useful when interpreted with topology, firmware, and thermal context.

  • A high peak score does not guarantee stable production operation.
  • NCCL anomalies often require fabric and firmware correlation.
  • Burn-in success requires duration + error-free behavior, not one run.

Integrated readiness across compute, fabric, and storage

Cluster readiness is end-to-end; storage and networking signals must be evaluated alongside compute metrics.

  • Include storage tests in the validation plan, not only post-incident.
  • Run cross-layer diagnostics when regressions appear.
  • Preserve test artifacts for trend comparison across maintenance cycles.

Scenario Playbooks

Exam-style scenario explanations

Scenario A: Multi-node NCCL bandwidth lower than baseline

After infrastructure changes, multi-node all_reduce performance drops sharply while single-node tests remain stable.

Architecture Diagram

[Node A GPUs] -- NVLink/NVSwitch -- [Node A]
      |                                |
  InfiniBand/Ethernet Fabric (E/W) between nodes
      |                                |
[Node B GPUs] -- NVLink/NVSwitch -- [Node B]

Response Flow

  1. Confirm single-node NCCL remains baseline to narrow fault domain.
  2. Audit cabling/transceivers and switch/BlueField firmware state.
  3. Run targeted NCCL tests by message size and topology segment.
  4. Re-run full ladder after physical/firmware remediation.

Success Signals

  • Bandwidth returns near expected baseline across key message ranges.
  • No link or signal-quality anomalies remain.
  • Burn-in test remains stable over planned duration.

NCCL all_reduce sample

all_reduce_perf -b 8 -e 1G -f 2 -g 8

Expected output (example)

# size  count  type  redop  root  time(us)  algbw(GB/s)
...
1073741824 ... 356.2

Scenario B: HPL peak looks good but burn-in fails intermittently

Quick HPL run meets target, but long-duration burn-in reports periodic failures.

Architecture Diagram

[Validation Pipeline]
  -> [HPL Functional] -> [NCCL Functional] -> [Burn-in Window]
                                  |
                             [Thermal/Power Telemetry]

Response Flow

  1. Correlate failure timestamps with thermal and power telemetry.
  2. Check error logs for recurring node/component signatures.
  3. Isolate suspect nodes and perform focused stress diagnostics.
  4. Promote only when burn-in criteria pass on full target set.

Success Signals

  • No recurring error signature through full burn-in window.
  • Thermal and power telemetry remain within policy thresholds.
  • Validation artifacts show stable reproducibility.

GPU telemetry sampling during burn-in

nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu --format=csv -l 30

Expected output (example)

2026/02/17 10:10:00, 62 C, 287 W, 96 %
2026/02/17 10:10:30, 63 C, 291 W, 97 %
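The telemetry stream above can be screened automatically. A minimal sketch that scans CSV samples in the same format and counts intervals over an assumed 85 C policy threshold:

```shell
#!/bin/sh
# Sample telemetry in the nvidia-smi CSV format shown above (values invented).
cat > burnin_telemetry.csv <<'EOF'
2026/02/17 10:10:00, 62 C, 287 W, 96 %
2026/02/17 10:10:30, 63 C, 291 W, 97 %
EOF

# Count samples whose GPU temperature exceeds the assumed policy limit.
limit=85
violations=$(awk -F', ' -v lim="$limit" '{ t=$2+0; if (t > lim) n++ } END { print n+0 }' burnin_telemetry.csv)
echo "samples over ${limit}C: $violations"
```

A non-zero count is correlated against burn-in failure timestamps before any node is cleared for promotion.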

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  • 1. Capture baseline state before running any intrusive command.
  • 2. Execute command with explicit scope (node, interface, GPU set).
  • 3. Compare output against expected baseline signature.
  • 4. Record timestamp and decision (pass, investigate, remediate).
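Step 3 of the pattern, comparing output against an expected baseline signature, can be sketched as follows; the file names are illustrative:

```shell
#!/bin/sh
# Baseline signature captured earlier vs. the output of the current run.
printf 'Links: 8 Active\n' > baseline.sig
printf 'Links: 8 Active\n' > current.out

# Pass if output is byte-identical to the signature; otherwise escalate.
if cmp -s baseline.sig current.out; then
  echo "PASS: output matches baseline signature"
else
  echo "INVESTIGATE: output diverges from baseline"
  diff baseline.sig current.out || true
fi
```

The decision (pass, investigate, remediate) is recorded alongside the timestamp per step 4 of the pattern.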

Single-node to multi-node NCCL runbook

Apply this sequence to validate communication path progressively and avoid ambiguous multi-node failures.

Single-node NCCL check

all_reduce_perf -b 8 -e 512M -f 2 -g 8

Expected output (example)

Single-node NCCL baseline captured with expected intra-node bandwidth.

Multi-node NCCL check

mpirun -np 16 -H node1:8,node2:8 all_reduce_perf -b 8 -e 1G -f 2 -g 1

Expected output (example)

Cross-node bandwidth and latency metrics generated for comparison.

  • Keep message-size and process mapping consistent between baseline and comparison runs.
  • Pair results with a fabric and firmware state snapshot.

HPL + burn-in runbook

Use a two-phase pattern: functional performance check, then long-duration stability check.

HPL functional sample

./hpl.sh --dat ./HPL.dat

Expected output (example)

HPL run complete: baseline GFLOPS captured.

Burn-in sample loop

for i in {1..12}; do ./hpl.sh --dat ./HPL.dat; sleep 300; done

Expected output (example)

12-cycle burn-in complete with no fatal errors.

  • Collect thermal/power telemetry in parallel.
  • Fail promotion if any cycle produces unexplained node-level instability.
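A hedged variant of the sample loop that records each cycle and stops on the first failure; `hpl_run` is a stub standing in for `./hpl.sh --dat ./HPL.dat` so the control flow runs anywhere:

```shell
#!/bin/sh
# Burn-in loop with per-cycle status and fail-fast promotion gating.
hpl_run() { true; }   # stand-in for: ./hpl.sh --dat ./HPL.dat

burn_in() {
  cycles=$1
  i=1
  while [ "$i" -le "$cycles" ]; do
    if ! hpl_run; then
      echo "cycle $i FAILED: abort burn-in, fail promotion"
      return 1
    fi
    echo "cycle $i OK"
    # sleep 300   # cool-down between cycles on real hardware
    i=$((i + 1))
  done
  echo "$cycles-cycle burn-in complete with no fatal errors"
}
burn_in 12
```

Failing fast preserves the failure state for diagnosis instead of burying it under later cycles.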

Common Problems

Failure patterns and fixes

Single-node healthy, multi-node NCCL degraded

Symptoms

  • Intra-node bandwidth is normal.
  • Inter-node collectives are slower or unstable.

Likely Cause

Fabric-level issue (cabling, transceiver, switch/BlueField firmware, or topology mismatch).

Remediation

  • Audit physical links and transceiver compatibility.
  • Confirm switch and BlueField firmware/software alignment.
  • Run targeted NCCL tests across suspected path segments.

Prevention: Enforce pre-NCCL fabric integrity checklist before every full cluster validation.

Burn-in instability despite good functional benchmark score

Symptoms

  • Quick benchmark passes.
  • Intermittent failures appear in long-duration runs.

Likely Cause

Thermal, power, or marginal hardware behavior under sustained load.

Remediation

  • Correlate failures with telemetry timelines.
  • Isolate and stress suspect nodes individually.
  • Repeat burn-in after remediation to confirm durability.

Prevention: Treat burn-in as a mandatory gate, not an optional validation step.

Storage bottleneck misread as compute/network issue

Symptoms

  • Benchmark variability increases with data-heavy workloads.
  • Communication metrics alone do not explain slowdown.

Likely Cause

Storage path throughput/latency instability affecting end-to-end pipeline behavior.

Remediation

  • Run storage baseline tests alongside cluster benchmarks.
  • Compare compute/network metrics with storage telemetry windows.
  • Tune or remediate storage path before further benchmark interpretation.

Prevention: Include storage checks in every validation cycle and artifact bundle.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: Full cluster verification ladder

Execute complete readiness ladder from single-node checks to burn-in and storage validation.

Prerequisites

  • Nodes passed bring-up and control-plane readiness gates.
  • Baseline firmware and topology reports available.
  • Benchmark tools and telemetry collection configured.

  1. Run single-node stress and NCCL functional checks.

    all_reduce_perf -b 8 -e 512M -f 2 -g 8

    Expected: Single-node communication baseline captured with no errors.

  2. Execute HPL functional baseline.

    ./hpl.sh --dat ./HPL.dat

    Expected: Baseline output recorded and within expected range.

  3. Run multi-node NCCL E/W bandwidth validation.

    mpirun -np 16 -H node1:8,node2:8 all_reduce_perf -b 8 -e 1G -f 2 -g 1

    Expected: Cross-node bandwidth within policy threshold.

  4. Start NCCL/HPL burn-in cycles with telemetry capture.

    Expected: No repeated instability signatures during burn-in window.

  5. Run storage validation and correlate with benchmark artifacts.

    fio --name=clusterread --directory=/mnt/ai-data --rw=read --bs=1M --size=16G --numjobs=8

    Expected: Storage throughput and latency meet readiness policy.

Success Criteria

  • All ladder stages pass with documented evidence.
  • No unresolved anomalies in burn-in or storage checks.
  • Cluster marked ready for production workload onboarding.

Study Sprint

10-day execution plan

Day | Focus | Output
1 | Validation plan design with layered test order. | Cluster validation matrix and thresholds.
2 | Single-node stress and baseline health checks. | Single-node readiness report.
3 | HPL functional run and parameter tuning baseline. | HPL baseline results sheet.
4 | Single-node NCCL and NVLink/NVSwitch-oriented checks. | NCCL single-node validation report.
5 | Fabric firmware and cabling/transceiver audit. | Fabric integrity checklist.
6 | ClusterKit multifaceted assessment run. | ClusterKit triage findings.
7 | Multi-node NCCL E/W bandwidth verification. | Multi-node NCCL bandwidth summary.
8 | NCCL and HPL burn-in protocol execution. | Burn-in stability log with failure signatures.
9 | NeMo workload burn-in and storage verification. | Application-level readiness report.
10 | Exam simulation and root-cause reasoning drill. | Final verification quick-reference pack.

Hands-on Labs

Practical module work

Each lab includes a collapsed execution sample with representative CLI usage and expected output.

Lab A: Single-node to multi-node validation ladder

Execute validation in progressive layers and capture confidence evidence.

  • Run single-node stress/HPL/NCCL baseline.
  • Promote to multi-node NCCL only after pass criteria.
  • Document where failures first appear in the ladder.

Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Single-node to multi-node NCCL runbook)

all_reduce_perf -b 8 -e 512M -f 2 -g 8

Expected output (example)

Single-node NCCL baseline captured with expected intra-node bandwidth.

Lab B: ClusterKit + NCCL triangulation

Use multiple tools to isolate communication issues.

  • Run ClusterKit assessment and capture anomalies.
  • Run targeted NCCL tests for confirmation.
  • Map anomalies to topology/cabling/firmware candidates.

Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Single-node to multi-node NCCL runbook)

mpirun -np 16 -H node1:8,node2:8 all_reduce_perf -b 8 -e 1G -f 2 -g 1

Expected output (example)

Cross-node bandwidth and latency metrics generated for comparison.

Lab C: Burn-in reliability workflow

Validate cluster stability under prolonged load.

  • Execute NCCL and HPL burn-in runs with monitoring.
  • Track thermal and error-state behavior over time.
  • Define go/no-go criteria for production entry.

Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (HPL + burn-in runbook)

./hpl.sh --dat ./HPL.dat

Expected output (example)

HPL run complete: baseline GFLOPS captured.

Lab D: End-to-end data-path verification

Validate compute, communication, and storage readiness together.

  • Run one NeMo or representative workload stress pass.
  • Pair with storage throughput/latency checks.
  • Produce integrated readiness statement with evidence.

Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (HPL + burn-in runbook)

for i in {1..12}; do ./hpl.sh --dat ./HPL.dat; sleep 300; done

Expected output (example)

12-cycle burn-in complete with no fatal errors.

Exam Pitfalls

Common failure patterns

  • Skipping single-node checks and jumping directly to multi-node tests.
  • Reporting only peak benchmark numbers without stability context.
  • Interpreting NCCL regressions before validating firmware/cabling state.
  • Using one tool only when results conflict (no triangulation).
  • Declaring burn-in success without clear pass/fail criteria.
  • Ignoring storage effects when diagnosing end-to-end performance issues.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. Why should cluster validation begin with single-node checks?
  • A. Multi-node tests are always faster
  • B. It isolates local hardware/runtime faults before fabric complexity
  • C. Single-node tests are unrelated
  • D. It avoids benchmarking

Answer: B

Single-node validation reduces variables and prevents misattributing local faults to cluster fabric.

Q2. What is the main difference between an HPL baseline run and HPL burn-in?
  • A. Baseline focuses on quick functional/performance sanity; burn-in emphasizes stability over time
  • B. Burn-in skips validation
  • C. Baseline requires no telemetry
  • D. They are identical

Answer: A

Burn-in extends duration and stresses reliability characteristics beyond quick benchmark checks.

Q3. What does multi-node NCCL E/W bandwidth testing primarily validate?
  • A. Disk encryption
  • B. Fabric communication efficiency across nodes
  • C. Login shell configuration
  • D. Package manager health

Answer: B

NCCL E/W tests reflect cross-node communication behavior relevant to distributed AI workloads.

Q4. Why combine ClusterKit with targeted NCCL checks?
  • A. To duplicate logs only
  • B. To triangulate and validate communication/performance findings
  • C. To avoid root-cause analysis
  • D. To skip topology checks

Answer: B

Multiple tools provide cross-validation and better fault localization.

Q5. Which is a prerequisite before interpreting low bandwidth results?
  • A. Skip firmware and cable checks
  • B. Validate firmware, cabling, transceivers, and signal quality
  • C. Reinstall OS immediately
  • D. Ignore topology

Answer: B

Physical and firmware issues can masquerade as software benchmark problems.

Q6. What should a burn-in acceptance policy include?
  • A. Only one maximum throughput number
  • B. Duration, error thresholds, thermal behavior, and pass/fail criteria
  • C. No criteria
  • D. Only dashboard screenshots

Answer: B

Burn-in validation must define measurable criteria for reliability, not just peak speed.

Q7. Why include storage tests in cluster verification domain?
  • A. Storage never affects AI workloads
  • B. Storage path issues can bottleneck or destabilize end-to-end runs
  • C. It is purely optional
  • D. It only matters in backup systems

Answer: B

Storage readiness directly affects data-loading-heavy training and validation pipelines.

Q8. What is a common verification anti-pattern?
  • A. Layered testing
  • B. Threshold-driven evaluation
  • C. Running tests without predefined acceptance criteria
  • D. Recording telemetry

Answer: C

Without criteria, teams cannot make consistent promotion decisions.

Q9. Which output best supports promotion to production?
  • A. One successful command
  • B. Multi-tool validation evidence with stable burn-in results
  • C. Unstructured notes
  • D. No logs

Answer: B

Promotion should be based on consistent evidence across readiness, performance, and stability checks.

Q10. In this domain, what is the strongest exam strategy?
  • A. Memorize command flags only
  • B. Practice interpretation and escalation decisions across tool outputs
  • C. Skip storage and fabric topics
  • D. Focus only on one benchmark

Answer: B

NCP-AII scenarios emphasize operational judgment based on test evidence, not single-tool memorization.

Primary References

Curated from the NCP-AII blueprint/study-guide sources and official documentation.

Objectives

  1. 4.1 Perform a single-node stress test.
  2. 4.2 Execute HPL (High-Performance Linpack).
  3. 4.3 Perform single-node NCCL (including verifying NVLink Switch).
  4. 4.4 Validate cables by verifying signal quality.
  5. 4.5 Confirm cabling is correct.
  6. 4.6 Confirm FW/SW on switches.
  7. 4.7 Confirm FW/SW on BlueField-3.
  8. 4.8 Confirm FW on transceivers.
  9. 4.9 Run ClusterKit to perform a multifaceted node assessment.
  10. 4.10 Run NCCL to verify E/W fabric bandwidth.
  11. 4.11 Perform NCCL burn-in.
  12. 4.12 Perform HPL burn-in.
  13. 4.13 Perform NeMo burn-in.
  14. 4.14 Test storage.
