1. Stabilize Before Optimizing
- Verify hardware and management-plane integrity first.
- Confirm firmware/software baseline consistency.
- Only then make performance-tuning decisions.
Module study guide
Priority 1 of 5 · Domain 1 in exam order
Scope
This module contains expanded study notes, practical drills, and an exam-style question set.
Exam Framework
Exam Scope Coverage
Domain 1 focuses on first-principles bring-up: deployment sequence, topology decisions, BMC/OOB setup, firmware readiness, thermal/power checks, cable validation, and workload-ready hardware state.
This domain expects disciplined sequencing from rack state to workload-ready server validation.
Drill: Write a one-node bring-up runbook with stage gates and artifacts collected at each gate.
Wrong topology and cabling assumptions create hidden bottlenecks before software even starts.
Drill: Given a sample rack diagram, identify one topology risk and one cable/transceiver risk.
Initial management-plane setup is required for safe operations and remote lifecycle management.
Drill: Create a secure first-boot checklist for BMC/OOB onboarding and validation.
Exam scope explicitly includes firmware upgrade flow (including HGX) and fault detection.
Drill: Plan one firmware update cycle and list pre-checks, execution steps, and rollback conditions.
Thermal and power misconfiguration causes performance throttling and false-negative diagnostics.
Drill: Build a pre-benchmark readiness checklist covering power, thermals, and fan-state metrics.
Bring-up is incomplete if storage path parameters are undefined or unstable.
Drill: Run a single-node storage sanity test and record minimum throughput thresholds for promotion.
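All of the drills above share one pattern: a strict stage order where each gate must pass and produce evidence before the next stage runs. A minimal sketch of such a gate runner — the stage names, artifact strings, and pass/fail values here are illustrative, not from the exam blueprint:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    stage: str
    passed: bool
    artifact: str  # evidence captured at this gate

def run_bringup(stages: list[tuple[str, Callable[[], tuple[bool, str]]]]) -> list[GateResult]:
    """Run stages in order; stop at the first failed gate."""
    results: list[GateResult] = []
    for name, check in stages:
        passed, artifact = check()
        results.append(GateResult(name, passed, artifact))
        if not passed:
            break  # do not advance past a failed gate
    return results

# Illustrative stage order mirroring the domain: management plane ->
# firmware baseline -> thermal/power -> cabling. Checks are stubbed.
stages = [
    ("bmc_reachable", lambda: (True, "chassis status: System Power on")),
    ("firmware_baseline", lambda: (True, "GPU FW 96.00.7A matches matrix")),
    ("thermal_power", lambda: (False, "GPU1 power draw above policy")),
    ("cabling", lambda: (True, "all links Active at 400G")),
]

results = run_bringup(stages)
# Execution stops after the failed thermal_power gate; cabling never runs.
```

The point of the sketch is the stop-on-failure semantics: a failed gate leaves behind its artifact for triage instead of letting later stages mask the fault.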
Concept Explanations
Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.
Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.
Treat bring-up as a gated workflow: each stage must produce artifacts before the next stage starts.
Topology choices determine collective communication behavior long before software tuning begins.
Firmware, thermals, power, and management-plane health must be validated as a single readiness package.
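The "evidence-producing" discipline above (command, output, timestamp, expected vs observed) can be captured with a small record type. A sketch, with field names and the containment check being assumptions rather than any prescribed format:

```python
import datetime
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    command: str
    output: str
    expected: str
    timestamp: str

    @property
    def matches_expected(self) -> bool:
        # Simple containment check; real gates may need structured parsing.
        return self.expected in self.output

def capture(command: str, output: str, expected: str) -> Evidence:
    """Stamp a command/output pair with a UTC timestamp for the evidence log."""
    ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return Evidence(command, output, expected, ts)

ev = capture(
    "ipmitool -I lanplus -H <bmc_ip> -U <user> chassis status",
    "System Power: on\nPower Overload: false",
    "System Power: on",
)
# ev.matches_expected is True: observed output contains the expected marker.
```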
Scenario Playbooks
A newly installed node passes OS boot but fails first cluster communication tests. The goal is to isolate whether failure is physical, firmware, or runtime readiness.
Architecture Diagram

```
            [BMC/OOB Network]
                   |
[Mgmt Switch]---[Host Node]---[Top-of-Rack Switch]
                   |
               [HGX GPUs]
                   |
             [Storage Path]
```

Response Flow
Success Signals

GPU and thermal health snapshot

```shell
nvidia-smi --query-gpu=name,temperature.gpu,power.draw,clocks.sm --format=csv
```

Expected output (example):

```
NVIDIA H100, 38 C, 185 W, 1410 MHz
NVIDIA H100, 40 C, 191 W, 1410 MHz
```

Link and transceiver status sample

```shell
mlxlink -d <device> --json
```

Expected output (example):

```json
{
  "state": "Active",
  "speed": "400G",
  "signal_quality": "Pass"
}
```

After a maintenance window, one node shows random communication failures. The objective is a safe rollback-or-promote decision.
Architecture Diagram

```
[Firmware Repo] -> [Staging Node] -> [Canary Node] -> [Production Node Group]
       |                 |                |                     |
 version matrix      pre-checks      post-checks           burn-in gate
```

Response Flow
Success Signals

Firmware inventory baseline

```shell
sudo nvfwupd --query --format table
```

Expected output (example):

```
Component    Version
GPU FW       96.00.7A
NVSwitch FW  28.42.1000
BMC FW       24.05.01
```

CLI and Commands
Use this sequence before any cluster-scale benchmark to eliminate common false negatives from bring-up gaps.
Management-plane reachability

```shell
ipmitool -I lanplus -H <bmc_ip> -U <user> chassis status
```

Expected output (example):

```
System Power: on
Power Overload: false
Cooling/Fan Fault: false
```

GPU inventory and ECC mode

```shell
nvidia-smi --query-gpu=index,name,ecc.mode.current --format=csv
```

Expected output (example):

```
0, NVIDIA H100, Enabled
1, NVIDIA H100, Enabled
```

Topology sanity

```shell
nvidia-smi topo -m
```

Expected output (example):

```
      GPU0  GPU1  CPU Affinity
GPU0   X    NV2   0-55
GPU1  NV2    X    0-55
```

Validate storage path behavior before interpreting training or verification performance.
Mount and path check

```shell
mount | rg -i '<storage_mount>'
```

Expected output (example):

```
10.0.0.24:/ai-data on /mnt/ai-data type nfs4 (rw,hard,timeo=600)
```

Quick throughput sanity

```shell
fio --name=readcheck --directory=/mnt/ai-data --rw=read --bs=1M --size=8G --numjobs=4 --iodepth=16
```

Expected output (example):

```
READ: bw=11.2GiB/s, iops=11468, lat (msec): avg=5.34
```

Common Problems
Symptoms
Likely Cause
Fabric link quality, cable/transceiver mismatch, or post-maintenance firmware drift.
Remediation
Prevention: Enforce pre-NCCL physical and firmware gate checklist for every maintenance window.
Symptoms
Likely Cause
Thermal/power variance, partial hardware install issue, or hidden management-plane fault.
Remediation
Prevention: Capture baseline thermal/power signatures per node at initial bring-up and compare during every audit.
Symptoms
Likely Cause
Compatibility mismatch or incomplete update scope.
Remediation
Prevention: Use staged upgrade path with explicit canary criteria and rollback automation.
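The staged-upgrade prevention step implies checking installed versions against an approved compatibility matrix before promoting a canary. A sketch of that check — the matrix structure is an assumption, and the version strings reuse this module's example values:

```python
# Approved component combinations per a (hypothetical) vendor compatibility matrix.
APPROVED_MATRIX = [
    {"GPU FW": "96.00.7A", "NVSwitch FW": "28.42.1000", "BMC FW": "24.05.01"},
]

def matrix_check(installed: dict[str, str]) -> tuple[bool, list[str]]:
    """Return (compatible, per-component drift notes) against the approved matrix."""
    drift: list[str] = []
    for approved in APPROVED_MATRIX:
        drift = [
            f"{comp}: installed {installed.get(comp)} != approved {ver}"
            for comp, ver in approved.items()
            if installed.get(comp) != ver
        ]
        if not drift:
            return True, []  # every component matches one approved row
    return False, drift

ok, drift = matrix_check(
    {"GPU FW": "96.00.7A", "NVSwitch FW": "28.42.1000", "BMC FW": "24.04.99"}
)
# ok is False: BMC FW drifted, so the canary should roll back, not promote.
```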
Lab Walkthroughs
Complete first-node onboarding with evidence-driven validation across management, hardware, and storage paths.
Prerequisites
Validate BMC/OOB access and baseline health.
```shell
ipmitool -I lanplus -H <bmc_ip> -U <user> sel list | tail -n 10
```

Expected: No critical unresolved hardware events.

Confirm GPU inventory and topology visibility.

```shell
nvidia-smi -L && nvidia-smi topo -m
```

Expected: All expected GPUs visible with a stable topology map.

Run thermal/power readiness check under light load.

```shell
nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv
```

Expected: Thermal and power values remain within policy thresholds.

Validate link/transceiver state for data-path correctness.

```shell
mlxlink -d <device> --json
```

Expected: All required links active at expected speed with passing signal quality.

Run storage sanity benchmark and record baseline.

```shell
fio --name=sanity --directory=/mnt/ai-data --rw=read --bs=1M --size=4G --numjobs=2
```

Expected: Throughput meets the minimum readiness threshold for domain policy.
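The thermal check and storage benchmark above can be promoted from eyeball checks to automated gates by parsing the command output. A sketch under stated assumptions: the thresholds are placeholders for site policy, and the sample output strings mirror this module's examples rather than live tool output:

```python
import csv
import io
import re

TEMP_LIMIT_C = 85      # placeholder policy threshold
MIN_READ_GIBS = 8.0    # placeholder promotion threshold

def gpus_within_thermal_policy(smi_csv: str) -> bool:
    """Parse `nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv` output."""
    rows = list(csv.reader(io.StringIO(smi_csv)))
    temps = [float(r[0].strip().split()[0]) for r in rows[1:]]  # skip header row
    return all(t <= TEMP_LIMIT_C for t in temps)

def fio_read_gibs(summary_line: str) -> float:
    """Extract bandwidth from an fio summary line like 'READ: bw=11.2GiB/s, ...'."""
    m = re.search(r"bw=([\d.]+)GiB/s", summary_line)
    return float(m.group(1)) if m else 0.0

smi = "temperature.gpu, power.draw\n38, 185 W\n40, 191 W"
fio_line = "READ: bw=11.2GiB/s, iops=11468, lat (msec): avg=5.34"

node_ready = gpus_within_thermal_policy(smi) and fio_read_gibs(fio_line) >= MIN_READ_GIBS
# node_ready is True for these sample values: 38/40 C under limit, 11.2 GiB/s over floor.
```

Either parser failing (or returning a value outside policy) fails the node forward to triage instead of letting it reach benchmark or production stages.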
Success Criteria
Study Sprint
| Day | Focus | Output |
|---|---|---|
| 1 | Deployment sequence design and stage-gate definition. | Bring-up runbook with pass/fail gates. |
| 2 | Topology and cabling map review for target AI workload class. | Topology diagram with risk annotations. |
| 3 | BMC/OOB onboarding and security baseline checks. | Management-plane baseline report. |
| 4 | Firmware inventory and upgrade path planning. | Firmware matrix and upgrade plan. |
| 5 | Power/cooling validation and thermal telemetry baseline. | Readiness checklist signed for stress testing. |
| 6 | Hardware install/validation checklist rehearsal. | Hardware validation evidence log. |
| 7 | Cables/transceivers compatibility and signal quality pre-check. | Cabling validation worksheet. |
| 8 | Single-node workload readiness dry run. | Node readiness report. |
| 9 | Third-party storage initial parameter validation. | Storage baseline and tuning notes. |
| 10 | Final domain revision and timed objective drill. | Bring-up quick-reference sheet. |
Hands-on Labs
Each lab includes a collapsed execution sample with representative CLI usage and expected output.
Execute first-node bring-up with documented gate checks.
Sample Command (Pre-benchmark readiness runbook)
```shell
ipmitool -I lanplus -H <bmc_ip> -U <user> chassis status
```

Expected output (example):

```
System Power: on
Power Overload: false
Cooling/Fan Fault: false
```

Build a consistent firmware/software baseline across components.

Sample Command (Pre-benchmark readiness runbook)

```shell
nvidia-smi --query-gpu=index,name,ecc.mode.current --format=csv
```

Expected output (example):

```
0, NVIDIA H100, Enabled
1, NVIDIA H100, Enabled
```

Prove that physical connectivity matches intended design.

Sample Command (Pre-benchmark readiness runbook)

```shell
nvidia-smi topo -m
```

Expected output (example):

```
      GPU0  GPU1  CPU Affinity
GPU0   X    NV2   0-55
GPU1  NV2    X    0-55
```

Confirm storage path is operational before performance validation.

Sample Command (Initial storage readiness runbook)

```shell
mount | rg -i '<storage_mount>'
```

Expected output (example):

```
10.0.0.24:/ai-data on /mnt/ai-data type nfs4 (rw,hard,timeo=600)
```

Exam Pitfalls
Practice Set
Attempt each question first, then open the answer and explanation.
Answer: B
Stage gates localize failures and create a deterministic path from installation to workload readiness.
Answer: B
Incorrect topology or cabling can introduce bottlenecks and make benchmark results misleading.
Answer: B
Management-plane access enables early control, monitoring, and recovery during provisioning.
Answer: B
Compatibility matrix checks plus validation reduce drift and upgrade-induced instability.
Answer: B
Unvalidated thermal and power state can distort performance and hide root causes.
Answer: B
Single-node storage baselining prevents amplifying unresolved path issues across the cluster.
Answer: B
Correct cable/transceiver pairing and signal quality are prerequisites for reliable fabric behavior.
Answer: B
A complete bring-up result requires multi-domain validation evidence, not single-command success.
Answer: B
Rollback readiness limits downtime and risk when post-upgrade checks fail.
Answer: B
Artifact discipline improves troubleshooting speed and standardizes future node deployments.
Primary References
Curated from the NCP-AII blueprint/study-guide sources and official documentation.