
Cluster Test and Verification

Module study guide

Priority 2 of 5 · Domain 4 in exam order

Scope

Exam study content

This module contains expanded study notes, practical drills, and an exam-style question set.

Exam weight
33%
Priority tier
Tier 1
Why this domain
Highest exam weight; core to proving cluster readiness, bandwidth, burn-in, and storage reliability.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.
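The single-variable discipline above lends itself to a small evidence-capture helper. A minimal POSIX-sh sketch; `run_logged`, the `evidence/` directory, and the echoed bandwidth lines are illustrative stand-ins for real benchmark runs, not part of any official tooling:

```shell
#!/bin/sh
# Sketch of before/after evidence capture (assumed helper, not an official
# tool): run a command, store its timestamped output, then diff the two runs.
run_logged() {
  label="$1"; shift
  mkdir -p evidence
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  log="evidence/${label}.log"
  {
    echo "# cmd: $*"
    echo "# utc: $ts"
    "$@"
  } > "$log" 2>&1
  echo "$log"
}

# Placeholder commands stand in for a real benchmark run before/after one change.
before=$(run_logged baseline  echo "busbw 480 GB/s")
after=$(run_logged candidate  echo "busbw 452 GB/s")
diff "$before" "$after" || true   # the delta, plus timestamps, is the evidence
```

The same workload input goes into both runs; only the one parameter under test changes between them.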

Exam Scope Coverage

What this module now covers

Domain 4 is the highest-weight area and emphasizes proving cluster readiness using stress tests, HPL/NCCL checks, firmware and cable validation, burn-in routines, and storage verification.

Track 1: Validation strategy and test layering

You are expected to run tests in a disciplined order from single-node sanity to multi-node fabric confidence.

  • Start with single-node validation before multi-node collectives.
  • Use progressively stronger tests: readiness, performance, and burn-in stability.
  • Define clear pass/fail thresholds per test category.

Drill: Design a layered validation plan showing test order, thresholds, and stop conditions.
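The layered order with explicit stop conditions can be sketched as a gate loop. The `stage_*` functions below are placeholders: each would wrap a real stress/NCCL/HPL invocation and return non-zero when its pass/fail threshold is missed:

```shell
#!/bin/sh
# Sketch of a layered validation ladder with promotion gates. Each stage_*
# function is a stand-in for a real test stage; non-zero return = failed gate.
stage_single_node() { echo "single-node sanity: PASS"; }
stage_multi_node()  { echo "multi-node NCCL: PASS"; }
stage_burn_in()     { echo "burn-in stability: PASS"; }

run_ladder() {
  for stage in stage_single_node stage_multi_node stage_burn_in; do
    if ! "$stage"; then
      echo "STOP at $stage: remediate before promoting further"
      return 1
    fi
  done
  echo "all gates passed: promote"
}
run_ladder
```

The stop condition is structural: a failed gate halts the ladder rather than letting later stages mask the fault.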

Track 2: HPL execution and burn-in

HPL is explicitly listed in exam objectives for verification and burn-in.

  • Differentiate quick HPL functional runs from long-duration burn-in runs.
  • Capture not just peak output but stability and repeatability over time.
  • Treat thermal/power stability as part of HPL result interpretation.

Drill: Run an HPL baseline protocol and document acceptance criteria for burn-in promotion.
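A hedged sketch of an HPL acceptance gate: extract the Gflops figure from an HPL-style result line and compare it to a documented floor. The sample row and the 500 GFLOPS threshold are illustrative, not real captured output:

```shell
#!/bin/sh
# Illustrative HPL result table (format imitates HPL output; values invented).
cat > hpl_baseline.txt <<'EOF'
T/V                N    NB     P     Q               Time                 Gflops
WR11C2R4      100000   384     4     4             123.45              5.401e+02
EOF

# Gate: extract the Gflops column from the result row and compare to a floor.
threshold=500
gflops=$(awk '/^WR/ {printf "%d", $7}' hpl_baseline.txt)
if [ "$gflops" -ge "$threshold" ]; then
  echo "HPL gate: PASS (${gflops} >= ${threshold} GFLOPS)"
else
  echo "HPL gate: FAIL (${gflops} < ${threshold} GFLOPS)"
fi
```

For burn-in promotion, the same gate would be applied to every cycle's result, not just the best one.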

Track 3: NCCL communication validation

NCCL tests verify east-west bandwidth, collectives behavior, and NVLink/NVSwitch paths.

  • Run single-node NCCL first, then expand to multi-node paths.
  • Use NCCL diagnostics and debug settings when behavior diverges from baseline.
  • Correlate NCCL outcomes with topology, cabling, and firmware state.

Drill: Execute a two-stage NCCL test sequence (single-node then multi-node) and explain result deltas.
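When multi-node behavior diverges from baseline, NCCL's standard debug environment variables (`NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`) narrow the investigation. A minimal sketch; the commented command shows where the settings apply on a real GPU node:

```shell
#!/bin/sh
# NCCL reads these from the environment at init time.
export NCCL_DEBUG=INFO              # log init, topology, and transport choices
export NCCL_DEBUG_SUBSYS=INIT,NET   # narrow logging to startup and network paths
# all_reduce_perf -b 8 -e 1G -f 2 -g 8   # run the real test on a GPU node
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_DEBUG_SUBSYS=$NCCL_DEBUG_SUBSYS"
```

The INIT and NET subsystems are usually the first to inspect when a multi-node run falls back to an unexpected transport.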

Track 4: ClusterKit multifaceted node assessment

The exam blueprint explicitly calls out ClusterKit for broad node and communication assessment.

  • ClusterKit combines latency, bandwidth, collective, and stress-style checks.
  • ClusterKit execution should align with scheduler or passwordless-SSH orchestration prerequisites.
  • Use ClusterKit findings to prioritize deeper NCCL/HPL investigations.

Drill: Run a minimal ClusterKit scenario and produce a triage list from its findings.

Track 5: Firmware, cabling, and transceiver verification

Fabric quality issues can invalidate benchmark conclusions if not verified first.

  • Confirm firmware/software alignment on switches and BlueField components.
  • Validate cable routes and transceiver compatibility to prevent hidden link issues.
  • Treat signal-quality checks as prerequisites for interpreting bandwidth anomalies.

Drill: Create a pre-benchmark fabric checklist covering cable, transceiver, switch, and BlueField state.
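A pre-benchmark fabric sweep can take the shape below. The per-node link check is stubbed with a local function so the control flow runs anywhere; node names and the 8-link expectation are assumptions, and the comment shows the real `ibstat`-based call it stands in for:

```shell
#!/bin/sh
# Stand-in for: ssh "$1" 'ibstat | grep -c "State: Active"'
check_node() {
  case "$1" in node1|node2) echo 8 ;; *) echo 0 ;; esac
}

expected_active=8
for node in node1 node2 node3; do
  active=$(check_node "$node")
  if [ "$active" -eq "$expected_active" ]; then
    echo "$node: $active/$expected_active links Active"
  else
    echo "$node: only $active/$expected_active links Active -> hold benchmarks"
  fi
done
```

Any node that fails the sweep is pulled from benchmarking until cabling, transceivers, and firmware are re-verified, so bandwidth anomalies are never interpreted over a degraded link.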

Track 6: Storage verification in cluster readiness

Storage bottlenecks can mimic compute/network instability in end-to-end validation.

  • Include storage checks in readiness and burn-in flows, not only in post-failure triage.
  • Correlate storage throughput baselines with workload data path expectations.
  • Escalate storage anomalies before interpreting model or benchmark regressions.

Drill: Pair one storage test with one NCCL/HPL run and explain cross-layer interpretation.
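One way to encode cross-layer interpretation is to gate a storage throughput figure against a readiness floor before trusting NCCL/HPL conclusions. The fio-style summary line and the 900 MB/s floor below are assumed for illustration:

```shell
#!/bin/sh
# Illustrative summary line imitating a fio read report (values invented).
cat > fio_summary.txt <<'EOF'
READ: bw=1024MiB/s (1074MB/s), io=16.0GiB (17.2GB), run=16001-16001msec
EOF

# Gate: pull the MB/s figure and compare to an assumed readiness floor.
floor_mbs=900
bw_mbs=$(sed -n 's/.*(\([0-9]*\)MB\/s).*/\1/p' fio_summary.txt)
if [ "$bw_mbs" -ge "$floor_mbs" ]; then
  echo "storage gate: PASS (${bw_mbs} MB/s >= ${floor_mbs} MB/s)"
else
  echo "storage gate: FAIL (${bw_mbs} MB/s < ${floor_mbs} MB/s)"
fi
```

If the storage gate fails, benchmark regressions in the same window are re-examined as possible data-path effects rather than compute or fabric faults.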

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.

Validation ladder design

Reliable cluster verification uses a strict progression from local sanity to distributed stress and long-duration burn-in.

  • Start single-node to reduce fault-space before multi-node collectives.
  • Separate quick readiness checks from prolonged stability tests.
  • Use explicit promotion gates between each layer.

HPL and NCCL interpretation discipline

Benchmark numbers are only useful when interpreted with topology, firmware, and thermal context.

  • A high peak score does not guarantee stable production operation.
  • NCCL anomalies often require fabric and firmware correlation.
  • Burn-in success requires duration + error-free behavior, not one run.

Integrated readiness across compute, fabric, and storage

Cluster readiness is end-to-end; storage and networking signals must be evaluated alongside compute metrics.

  • Include storage tests in the validation plan, not only post-incident.
  • Run cross-layer diagnostics when regressions appear.
  • Preserve test artifacts for trend comparison across maintenance cycles.

Scenario Playbooks

Exam-style scenario explanations

Scenario A: Multi-node NCCL bandwidth lower than baseline

After infrastructure changes, multi-node all_reduce performance drops sharply while single-node tests remain stable.

Architecture Diagram

[Node A GPUs] -- NVLink/NVSwitch -- [Node A]
      |                                |
  InfiniBand/Ethernet Fabric (E/W) between nodes
      |                                |
[Node B GPUs] -- NVLink/NVSwitch -- [Node B]

Response Flow

  1. Confirm single-node NCCL remains baseline to narrow fault domain.
  2. Audit cabling/transceivers and switch/BlueField firmware state.
  3. Run targeted NCCL tests by message size and topology segment.
  4. Re-run full ladder after physical/firmware remediation.

Success Signals

  • Bandwidth returns near expected baseline across key message ranges.
  • No link or signal-quality anomalies remain.
  • Burn-in test remains stable over planned duration.

NCCL all_reduce sample

all_reduce_perf -b 8 -e 1G -f 2 -g 8

Expected output (example)

# size  count  type  redop  root  time(us)  algbw(GB/s)
...
1073741824 ... 356.2

Scenario B: HPL peak looks good but burn-in fails intermittently

Quick HPL run meets target, but long-duration burn-in reports periodic failures.

Architecture Diagram

[Validation Pipeline]
  -> [HPL Functional] -> [NCCL Functional] -> [Burn-in Window]
                                  |
                             [Thermal/Power Telemetry]

Response Flow

  1. Correlate failure timestamps with thermal and power telemetry.
  2. Check error logs for recurring node/component signatures.
  3. Isolate suspect nodes and perform focused stress diagnostics.
  4. Promote only when burn-in criteria pass on full target set.

Success Signals

  • No recurring error signature through full burn-in window.
  • Thermal and power telemetry remain within policy thresholds.
  • Validation artifacts show stable reproducibility.

GPU telemetry sampling during burn-in

nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,utilization.gpu --format=csv -l 30

Expected output (example)

2026/02/17 10:10:00, 62 C, 287 W, 96 %
2026/02/17 10:10:30, 63 C, 291 W, 97 %
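The telemetry stream above can be screened automatically. A minimal sketch that scans CSV samples in the same format and counts intervals over an assumed 85 C policy threshold:

```shell
#!/bin/sh
# Sample telemetry in the nvidia-smi CSV format shown above (values invented).
cat > burnin_telemetry.csv <<'EOF'
2026/02/17 10:10:00, 62 C, 287 W, 96 %
2026/02/17 10:10:30, 63 C, 291 W, 97 %
EOF

# Count samples whose GPU temperature exceeds the assumed policy limit.
limit=85
violations=$(awk -F', ' -v lim="$limit" '{ t=$2+0; if (t > lim) n++ } END { print n+0 }' burnin_telemetry.csv)
echo "samples over ${limit}C: $violations"
```

A non-zero count is correlated against burn-in failure timestamps before any node is cleared for promotion.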

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  • 1. Capture baseline state before running any intrusive command.
  • 2. Execute command with explicit scope (node, interface, GPU set).
  • 3. Compare output against expected baseline signature.
  • 4. Record timestamp and decision (pass, investigate, remediate).
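Step 3 of the pattern, comparing output against an expected baseline signature, can be sketched as follows; the file names are illustrative:

```shell
#!/bin/sh
# Baseline signature captured earlier vs. the output of the current run.
printf 'Links: 8 Active\n' > baseline.sig
printf 'Links: 8 Active\n' > current.out

# Pass if output is byte-identical to the signature; otherwise escalate.
if cmp -s baseline.sig current.out; then
  echo "PASS: output matches baseline signature"
else
  echo "INVESTIGATE: output diverges from baseline"
  diff baseline.sig current.out || true
fi
```

The decision (pass, investigate, remediate) is recorded alongside the timestamp per step 4 of the pattern.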

Single-node to multi-node NCCL runbook

Apply this sequence to validate communication path progressively and avoid ambiguous multi-node failures.

Single-node NCCL check

all_reduce_perf -b 8 -e 512M -f 2 -g 8

Expected output (example)

Single-node NCCL baseline captured with expected intra-node bandwidth.

Multi-node NCCL check

mpirun -np 16 -H node1:8,node2:8 all_reduce_perf -b 8 -e 1G -f 2 -g 1

Expected output (example)

Cross-node bandwidth and latency metrics generated for comparison.

  • Keep message-size and process mapping consistent between baseline and comparison runs.
  • Pair results with a fabric and firmware state snapshot.

HPL + burn-in runbook

Use a two-phase pattern: functional performance check, then long-duration stability check.

HPL functional sample

./hpl.sh --dat ./HPL.dat

Expected output (example)

HPL run complete: baseline GFLOPS captured.

Burn-in sample loop

for i in {1..12}; do ./hpl.sh --dat ./HPL.dat; sleep 300; done

Expected output (example)

12-cycle burn-in complete with no fatal errors.

  • Collect thermal/power telemetry in parallel.
  • Fail promotion if any cycle produces unexplained node-level instability.
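A hedged variant of the sample loop that records each cycle and stops on the first failure; `hpl_run` is a stub standing in for `./hpl.sh --dat ./HPL.dat` so the control flow runs anywhere:

```shell
#!/bin/sh
# Burn-in loop with per-cycle status and fail-fast promotion gating.
hpl_run() { true; }   # stand-in for: ./hpl.sh --dat ./HPL.dat

burn_in() {
  cycles=$1
  i=1
  while [ "$i" -le "$cycles" ]; do
    if ! hpl_run; then
      echo "cycle $i FAILED: abort burn-in, fail promotion"
      return 1
    fi
    echo "cycle $i OK"
    # sleep 300   # cool-down between cycles on real hardware
    i=$((i + 1))
  done
  echo "$cycles-cycle burn-in complete with no fatal errors"
}
burn_in 12
```

Failing fast preserves the failure state for diagnosis instead of burying it under later cycles.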

Common Problems

Failure patterns and fixes

Single-node healthy, multi-node NCCL degraded

Symptoms

  • Intra-node bandwidth is normal.
  • Inter-node collectives are slower or unstable.

Likely Cause

Fabric-level issue (cabling, transceiver, switch/BlueField firmware, or topology mismatch).

Remediation

  • Audit physical links and transceiver compatibility.
  • Confirm switch and BlueField firmware/software alignment.
  • Run targeted NCCL tests across suspected path segments.

Prevention: Enforce pre-NCCL fabric integrity checklist before every full cluster validation.

Burn-in instability despite good functional benchmark score

Symptoms

  • Quick benchmark passes.
  • Intermittent failures appear in long-duration runs.

Likely Cause

Thermal, power, or marginal hardware behavior under sustained load.

Remediation

  • Correlate failures with telemetry timelines.
  • Isolate and stress suspect nodes individually.
  • Repeat burn-in after remediation to confirm durability.

Prevention: Treat burn-in as a mandatory gate, not an optional validation step.

Storage bottleneck misread as compute/network issue

Symptoms

  • Benchmark variability increases with data-heavy workloads.
  • Communication metrics alone do not explain slowdown.

Likely Cause

Storage path throughput/latency instability affecting end-to-end pipeline behavior.

Remediation

  • Run storage baseline tests alongside cluster benchmarks.
  • Compare compute/network metrics with storage telemetry windows.
  • Tune or remediate storage path before further benchmark interpretation.

Prevention: Include storage checks in every validation cycle and artifact bundle.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: Full cluster verification ladder

Execute complete readiness ladder from single-node checks to burn-in and storage validation.

Prerequisites

  • Nodes passed bring-up and control-plane readiness gates.
  • Baseline firmware and topology reports available.
  • Benchmark tools and telemetry collection configured.

  1. Run single-node stress and NCCL functional checks.

    all_reduce_perf -b 8 -e 512M -f 2 -g 8

    Expected: Single-node communication baseline captured with no errors.

  2. Execute HPL functional baseline.

    ./hpl.sh --dat ./HPL.dat

    Expected: Baseline output recorded and within expected range.

  3. Run multi-node NCCL E/W bandwidth validation.

    mpirun -np 16 -H node1:8,node2:8 all_reduce_perf -b 8 -e 1G -f 2 -g 1

    Expected: Cross-node bandwidth within policy threshold.

  4. Start NCCL/HPL burn-in cycles with telemetry capture.

    Expected: No repeated instability signatures during burn-in window.

  5. Run storage validation and correlate with benchmark artifacts.

    fio --name=clusterread --directory=/mnt/ai-data --rw=read --bs=1M --size=16G --numjobs=8

    Expected: Storage throughput and latency meet readiness policy.

Success Criteria

  • All ladder stages pass with documented evidence.
  • No unresolved anomalies in burn-in or storage checks.
  • Cluster marked ready for production workload onboarding.

Study Sprint

10-day execution plan

Day | Focus | Output
1 | Validation plan design with layered test order. | Cluster validation matrix and thresholds.
2 | Single-node stress and baseline health checks. | Single-node readiness report.
3 | HPL functional run and parameter tuning baseline. | HPL baseline results sheet.
4 | Single-node NCCL and NVLink/NVSwitch-oriented checks. | NCCL single-node validation report.
5 | Fabric firmware and cabling/transceiver audit. | Fabric integrity checklist.
6 | ClusterKit multifaceted assessment run. | ClusterKit triage findings.
7 | Multi-node NCCL E/W bandwidth verification. | Multi-node NCCL bandwidth summary.
8 | NCCL and HPL burn-in protocol execution. | Burn-in stability log with failure signatures.
9 | NeMo workload burn-in and storage verification. | Application-level readiness report.
10 | Exam simulation and root-cause reasoning drill. | Final verification quick-reference pack.

Hands-on Labs

Practical module work

Each lab includes a collapsed execution sample with representative CLI usage and expected output.

Lab A: Single-node to multi-node validation ladder

Execute validation in progressive layers and capture confidence evidence.

  • Run single-node stress/HPL/NCCL baseline.
  • Promote to multi-node NCCL only after pass criteria.
  • Document where failures first appear in the ladder.

Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Single-node to multi-node NCCL runbook)

all_reduce_perf -b 8 -e 512M -f 2 -g 8

Expected output (example)

Single-node NCCL baseline captured with expected intra-node bandwidth.

Lab B: ClusterKit + NCCL triangulation

Use multiple tools to isolate communication issues.

  • Run ClusterKit assessment and capture anomalies.
  • Run targeted NCCL tests for confirmation.
  • Map anomalies to topology/cabling/firmware candidates.

Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Single-node to multi-node NCCL runbook)

mpirun -np 16 -H node1:8,node2:8 all_reduce_perf -b 8 -e 1G -f 2 -g 1

Expected output (example)

Cross-node bandwidth and latency metrics generated for comparison.

Lab C: Burn-in reliability workflow

Validate cluster stability under prolonged load.

  • Execute NCCL and HPL burn-in runs with monitoring.
  • Track thermal and error-state behavior over time.
  • Define go/no-go criteria for production entry.

Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (HPL + burn-in runbook)

./hpl.sh --dat ./HPL.dat

Expected output (example)

HPL run complete: baseline GFLOPS captured.

Lab D: End-to-end data-path verification

Validate compute, communication, and storage readiness together.

  • Run one NeMo or representative workload stress pass.
  • Pair with storage throughput/latency checks.
  • Produce integrated readiness statement with evidence.

Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (HPL + burn-in runbook)

for i in {1..12}; do ./hpl.sh --dat ./HPL.dat; sleep 300; done

Expected output (example)

12-cycle burn-in complete with no fatal errors.

Exam Pitfalls

Common failure patterns

  • Skipping single-node checks and jumping directly to multi-node tests.
  • Reporting only peak benchmark numbers without stability context.
  • Interpreting NCCL regressions before validating firmware/cabling state.
  • Using one tool only when results conflict (no triangulation).
  • Declaring burn-in success without clear pass/fail criteria.
  • Ignoring storage effects when diagnosing end-to-end performance issues.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. Why should cluster validation begin with single-node checks?
  • A. Multi-node tests are always faster
  • B. It isolates local hardware/runtime faults before fabric complexity
  • C. Single-node tests are unrelated
  • D. It avoids benchmarking

Answer: B

Single-node validation reduces variables and prevents misattributing local faults to cluster fabric.

Q2. What is the main difference between an HPL baseline run and HPL burn-in?
  • A. Baseline focuses on quick functional/performance sanity; burn-in emphasizes stability over time
  • B. Burn-in skips validation
  • C. Baseline requires no telemetry
  • D. They are identical

Answer: A

Burn-in extends duration and stresses reliability characteristics beyond quick benchmark checks.

Q3. What does multi-node NCCL E/W bandwidth testing primarily validate?
  • A. Disk encryption
  • B. Fabric communication efficiency across nodes
  • C. Login shell configuration
  • D. Package manager health

Answer: B

NCCL E/W tests reflect cross-node communication behavior relevant to distributed AI workloads.

Q4. Why combine ClusterKit with targeted NCCL checks?
  • A. To duplicate logs only
  • B. To triangulate and validate communication/performance findings
  • C. To avoid root-cause analysis
  • D. To skip topology checks

Answer: B

Multiple tools provide cross-validation and better fault localization.

Q5. Which is a prerequisite before interpreting low bandwidth results?
  • A. Skip firmware and cable checks
  • B. Validate firmware, cabling, transceivers, and signal quality
  • C. Reinstall OS immediately
  • D. Ignore topology

Answer: B

Physical and firmware issues can masquerade as software benchmark problems.

Q6. What should a burn-in acceptance policy include?
  • A. Only one maximum throughput number
  • B. Duration, error thresholds, thermal behavior, and pass/fail criteria
  • C. No criteria
  • D. Only dashboard screenshots

Answer: B

Burn-in validation must define measurable criteria for reliability, not just peak speed.

Q7. Why include storage tests in cluster verification domain?
  • A. Storage never affects AI workloads
  • B. Storage path issues can bottleneck or destabilize end-to-end runs
  • C. It is purely optional
  • D. It only matters in backup systems

Answer: B

Storage readiness directly affects data-loading-heavy training and validation pipelines.

Q8. What is a common verification anti-pattern?
  • A. Layered testing
  • B. Threshold-driven evaluation
  • C. Running tests without predefined acceptance criteria
  • D. Recording telemetry

Answer: C

Without criteria, teams cannot make consistent promotion decisions.

Q9. Which output best supports promotion to production?
  • A. One successful command
  • B. Multi-tool validation evidence with stable burn-in results
  • C. Unstructured notes
  • D. No logs

Answer: B

Promotion should be based on consistent evidence across readiness, performance, and stability checks.

Q10. In this domain, what is the strongest exam strategy?
  • A. Memorize command flags only
  • B. Practice interpretation and escalation decisions across tool outputs
  • C. Skip storage and fabric topics
  • D. Focus only on one benchmark

Answer: B

NCP-AII scenarios emphasize operational judgment based on test evidence, not single-tool memorization.

Primary References

Curated from the NCP-AII blueprint/study-guide sources and official documentation.

Objectives

  1. 4.1 Perform a single-node stress test.
  2. 4.2 Execute HPL (High-Performance Linpack).
  3. 4.3 Perform single-node NCCL (including verifying NVLink Switch).
  4. 4.4 Validate cables by verifying signal quality.
  5. 4.5 Confirm cabling is correct.
  6. 4.6 Confirm FW/SW on switches.
  7. 4.7 Confirm FW/SW on BlueField-3.
  8. 4.8 Confirm FW on transceivers.
  9. 4.9 Run ClusterKit to perform a multifaceted node assessment.
  10. 4.10 Run NCCL to verify E/W fabric bandwidth.
  11. 4.11 Perform NCCL burn-in.
  12. 4.12 Perform HPL burn-in.
  13. 4.13 Perform NeMo burn-in.
  14. 4.14 Test storage.
