
System and Server Bring-up

Module study guide

Priority 1 of 5 · Domain 1 in exam order

Scope

Exam study content

This module contains expanded study notes, practical drills, and an exam-style question set.

Exam weight
31%
Priority tier
Tier 1
Why this domain
Foundation domain for deployment sequencing, firmware, hardware, and readiness checks.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard any change that shows no reproducible benefit.
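The before/after discipline above can be captured with a small logging helper. This is a sketch: `record_evidence` and the log path are illustrative names, not vendor tooling, and `echo` stands in for the real query command.

```shell
#!/usr/bin/env bash
# Hypothetical evidence-capture helper: logs the command, a UTC
# timestamp, and the output so a before/after pair can be compared.
EVIDENCE_LOG="${EVIDENCE_LOG:-/tmp/evidence.log}"
: > "$EVIDENCE_LOG"   # start a fresh log for this session

record_evidence() {
  local label="$1"; shift
  {
    echo "=== ${label} $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
    echo "\$ $*"
    "$@" 2>&1
  } >> "$EVIDENCE_LOG"
}

# Capture state before and after exactly one parameter change.
# echo stands in for the real query (nvidia-smi, ipmitool, ...).
record_evidence before echo "param=old"
record_evidence after  echo "param=new"

echo "evidence recorded in $EVIDENCE_LOG"
```

Because each entry carries its command line and timestamp, another engineer can replay the exact sequence when reviewing a regression.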

Exam Scope Coverage

What this module covers

Domain 1 focuses on first-principles bring-up: deployment sequence, topology decisions, BMC/OOB setup, firmware readiness, thermal/power checks, cable validation, and workload-ready hardware state.

Track 1: Bring-up sequence and validation gates

This domain expects disciplined sequencing from rack state to workload-ready server validation.

  • Use a repeatable bring-up checklist with explicit pass/fail gates for each step.
  • Separate physical installation, firmware baseline, and workload validation into distinct phases.
  • Capture evidence per phase so later failures can be traced quickly.

Drill: Write a one-node bring-up runbook with stage gates and artifacts collected at each gate.
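A minimal sketch of the stage-gate idea: each gate is a name plus a check command, and the first failure blocks every later stage. The `true` placeholders are assumptions; in practice each gate would run the real ipmitool, nvidia-smi, or mlxlink check.

```shell
#!/usr/bin/env bash
# Hypothetical stage-gate runner: the first failing gate stops
# promotion so faults cannot compound into later phases.
set -u

run_gate() {
  local name="$1"; shift
  if "$@"; then
    echo "GATE ${name}: PASS"
  else
    echo "GATE ${name}: FAIL -- stopping bring-up"
    exit 1
  fi
}

# Placeholder checks; swap `true` for real validation commands.
run_gate physical-install  true
run_gate firmware-baseline true
run_gate workload-sanity   true
echo "node promoted to cluster validation"
```

Keeping the gates in one script also produces a consistent PASS/FAIL transcript that can be archived as the per-phase evidence the checklist calls for.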

Track 2: AI factory topology and cabling basics

Wrong topology and cabling assumptions create hidden bottlenecks before software even starts.

  • Map workload communication patterns to north-south and east-west traffic needs.
  • Validate cable type, speed class, and transceiver compatibility before cluster tests.
  • Confirm that planned topology supports target NCCL/HPL validation goals.

Drill: Given a sample rack diagram, identify one topology risk and one cable/transceiver risk.

Track 3: BMC, OOB, and platform security baseline

Initial management-plane setup is required for safe operations and remote lifecycle management.

  • Bring up BMC access and out-of-band management before in-band OS workflows.
  • Establish trusted boot and hardware security baseline checks (including TPM state).
  • Document management network segmentation and credentials handling policy.

Drill: Create a secure first-boot checklist for BMC/OOB onboarding and validation.

Track 4: Firmware lifecycle and fault detection

Exam scope explicitly includes firmware upgrade flow (including HGX) and fault detection.

  • Standardize firmware version baseline across compute, switch, and management components.
  • Use vendor-supported tooling for upgrade planning and post-upgrade verification.
  • Treat partial upgrades as a risk and verify firmware/software matrix consistency.

Drill: Plan one firmware update cycle and list pre-checks, execution steps, and rollback conditions.
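One way to detect drift against the approved matrix is a plain `diff` over component/version lists, for example parsed from firmware inventory output. The versions below are sample values for illustration, not a published baseline.

```shell
#!/usr/bin/env bash
# Hypothetical drift check: compare an approved firmware matrix
# against observed versions and surface only the changed components.
approved=$(mktemp); observed=$(mktemp)

cat > "$approved" <<'EOF'
BMC 24.05.01
GPU 96.00.7A
NVSwitch 28.42.1000
EOF

cat > "$observed" <<'EOF'
BMC 24.05.01
GPU 96.00.5E
NVSwitch 28.42.1000
EOF

if diff -u "$approved" "$observed" > drift.txt; then
  echo "firmware baseline: MATCH"
else
  echo "firmware baseline: DRIFT detected"
  grep '^[+-][^+-]' drift.txt   # show only changed component lines
fi
```

A non-empty drift report is exactly the "partial upgrade" risk named above and should trigger the rollback conditions in the update plan.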

Track 5: Power, cooling, and hardware validation

Thermal and power misconfiguration causes performance throttling and false-negative diagnostics.

  • Validate power budget against server profile and rack-level limits.
  • Verify cooling and airflow conditions before stress or burn-in tests.
  • Use baseline health telemetry before running performance workloads.

Drill: Build a pre-benchmark readiness checklist covering power, thermals, and fan-state metrics.
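A sketch of the thermal/power gate, assuming illustrative policy thresholds and a canned CSV sample in place of a live `nvidia-smi` query (the real source would be `nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader,nounits`).

```shell
#!/usr/bin/env bash
# Hypothetical readiness check: flag any GPU whose sampled
# temperature or power exceeds the assumed policy thresholds.
TEMP_LIMIT=85    # degrees C (assumed policy value)
POWER_LIMIT=700  # watts (assumed policy value)

report=$(awk -F', ' -v t="$TEMP_LIMIT" -v p="$POWER_LIMIT" '
  $2 > t || $3 > p { print "GPU " $1 ": OUT OF POLICY (temp=" $2 "C power=" $3 "W)"; next }
                   { print "GPU " $1 ": OK" }
' <<'EOF'
0, 38, 185.2
1, 92, 640.0
EOF
)
echo "$report"
# Any OUT OF POLICY line should block benchmark promotion.
```

Running the same check before and after a workload gives the baseline telemetry comparison the checklist asks for.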

Track 6: Third-party storage readiness

Bring-up is incomplete if storage path parameters are undefined or unstable.

  • Define initial storage connectivity and performance expectations before cluster validation.
  • Validate mount and throughput sanity on first node before scaling cluster-wide.
  • Treat storage readiness as a gating dependency for later burn-in tests.

Drill: Run a single-node storage sanity test and record minimum throughput thresholds for promotion.

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.

Bring-up gate design and evidence discipline

Treat bring-up as a gated workflow: each stage must produce artifacts before the next stage starts.

  • A gate should have a binary pass/fail rule and named owner.
  • Artifacts should include command outputs, firmware versions, and telemetry snapshots.
  • Gate failures should stop promotion to avoid compounding unresolved faults.

Topology-first thinking for AI factories

Topology choices determine collective communication behavior long before software tuning begins.

  • Match topology design to workload traffic shape and scaling profile.
  • Validate transceiver and cable classes against target bandwidth lanes.
  • Use topology review as a prerequisite for NCCL and ClusterKit test interpretation.

Firmware and platform readiness as one system

Firmware, thermals, power, and management-plane health must be validated as a single readiness package.

  • Mixed firmware state can mimic application-layer instability.
  • Thermal and power limits can invalidate benchmark confidence if unchecked.
  • BMC/OOB visibility should be operational before in-band troubleshooting.

Scenario Playbooks

Exam-style scenario explanations

Scenario A: New HGX node fails distributed validation on day 1

A newly installed node passes OS boot but fails first cluster communication tests. The goal is to isolate whether failure is physical, firmware, or runtime readiness.

Architecture Diagram

        [BMC/OOB Network]
              |
[Mgmt Switch]---[Host Node]---[Top-of-Rack Switch]
                  |
               [HGX GPUs]
                  |
             [Storage Path]

Response Flow

  1. Validate BMC reachability and management health before touching workload stack.
  2. Confirm firmware baseline for host, GPUs, switches, and transceivers.
  3. Run cable/transceiver verification and link-state checks.
  4. Execute single-node workload sanity before rejoining cluster tests.
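Step 3 can be gated mechanically by checking the link JSON fields. This sketch uses a canned string shaped like the mlxlink sample output below rather than a live device, and the field names are taken from that sample.

```shell
#!/usr/bin/env bash
# Hypothetical link gate over mlxlink-style JSON. The canned string
# stands in for: mlxlink -d <device> --json
json='{"state": "Active", "speed": "400G", "signal_quality": "Pass"}'

if echo "$json" | grep -q '"state": "Active"' &&
   echo "$json" | grep -q '"speed": "400G"'; then
  link_status="PASS"
else
  link_status="FAIL"
fi
echo "link gate: $link_status"
```

A FAIL here means the node should not rejoin cluster tests until the cable, transceiver, or switch port is remediated.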

Success Signals

  • All physical links report expected speed and stable state.
  • Firmware matrix aligns with approved baseline.
  • Single-node GPU workload runs without thermal or power alarms.

GPU and thermal health snapshot

nvidia-smi --query-gpu=name,temperature.gpu,power.draw,clocks.sm --format=csv

Expected output (example)

NVIDIA H100, 38 C, 185 W, 1410 MHz
NVIDIA H100, 40 C, 191 W, 1410 MHz

Link and transceiver status sample

mlxlink -d <device> --json

Expected output (example)

{
  "state": "Active",
  "speed": "400G",
  "signal_quality": "Pass"
}

Scenario B: Firmware update window introduces intermittent node instability

After a maintenance window, one node shows random communication failures. Objective is safe rollback-or-promote decision.

Architecture Diagram

[Firmware Repo] -> [Staging Node] -> [Canary Node] -> [Production Node Group]
      |                |                |                 |
  version matrix    pre-checks      post-checks       burn-in gate

Response Flow

  1. Compare current versions against approved firmware matrix.
  2. Run canary post-upgrade health and workload checks.
  3. If mismatch or instability appears, execute rollback plan.
  4. Promote only after stable burn-in evidence.

Success Signals

  • No version drift between planned and observed state.
  • No thermal/power anomalies under load.
  • Canary node remains stable through validation window.

Firmware inventory baseline

sudo nvfwupd --query --format table

Expected output (example)

Component        Version
GPU FW           96.00.7A
NVSwitch FW      28.42.1000
BMC FW           24.05.01

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute the command with explicit scope (node, interface, GPU set).
  3. Compare output against the expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).
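The four steps can be wrapped in a small helper. This is a sketch: the baseline string and the `echo` placeholder are assumptions standing in for a real scoped CLI call such as a chassis-status query.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the baseline -> execute -> compare ->
# record pattern described above.
baseline="System Power: on"             # 1. expected baseline signature
observed=$(echo "System Power: on")     # 2. scoped command (placeholder)

if [ "$observed" = "$baseline" ]; then  # 3. compare against baseline
  decision=pass
else
  decision=investigate
fi

echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) chassis-status decision=$decision"  # 4. record
```

Logging the timestamped decision line per check yields the audit trail that the evidence standard in this module expects.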

Pre-benchmark readiness runbook

Use this sequence before any cluster-scale benchmark to eliminate common false negatives from bring-up gaps.

Management-plane reachability

ipmitool -I lanplus -H <bmc_ip> -U <user> chassis status

Expected output (example)

System Power: on
Power Overload: false
Cooling/Fan Fault: false

GPU inventory and ECC mode

nvidia-smi --query-gpu=index,name,ecc.mode.current --format=csv

Expected output (example)

0, NVIDIA H100, Enabled
1, NVIDIA H100, Enabled

Topology sanity

nvidia-smi topo -m

Expected output (example)

GPU0  GPU1  CPU Affinity
GPU0   X    NV2   0-55
GPU1  NV2    X    0-55

  • Run with consistent ambient conditions and record timestamp.
  • Any non-pass result should block benchmark promotion.

Initial storage readiness runbook

Validate storage path behavior before interpreting training or verification performance.

Mount and path check

mount | rg -i '<storage_mount>'

Expected output (example)

10.0.0.24:/ai-data on /mnt/ai-data type nfs4 (rw,hard,timeo=600)

Quick throughput sanity

fio --name=readcheck --directory=/mnt/ai-data --rw=read --bs=1M --size=8G --numjobs=4 --iodepth=16

Expected output (example)

READ: bw=11.2GiB/s, iops=11468, lat (msec): avg=5.34

  • Use the same dataset profile as planned benchmark workload.
  • Capture baseline to compare against later incident investigations.
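The promotion threshold can be enforced by parsing the fio summary line. This is a sketch: `MIN_GIB_S` is an assumed policy value, and the canned summary line stands in for real fio output.

```shell
#!/usr/bin/env bash
# Hypothetical promotion gate: extract read bandwidth from a
# fio-style summary line and compare it to a minimum threshold.
MIN_GIB_S=8   # assumed policy minimum, in GiB/s

line='READ: bw=11.2GiB/s, iops=11468'   # stands in for: fio ... | grep 'READ:'
bw=$(echo "$line" | sed -n 's/.*bw=\([0-9.]*\)GiB\/s.*/\1/p')

if awk -v bw="$bw" -v min="$MIN_GIB_S" 'BEGIN { exit !(bw >= min) }'; then
  verdict="PASS"
else
  verdict="FAIL"
fi
echo "storage gate: $verdict (${bw} GiB/s, minimum ${MIN_GIB_S})"
```

Recording the measured value alongside the verdict gives the baseline number to compare against in later incident investigations.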

Common Problems

Failure patterns and fixes

Node passes boot but fails first NCCL communication test

Symptoms

  • NCCL reports timeout or low bandwidth on first multi-node run.
  • Single-node GPU checks appear healthy.

Likely Cause

Fabric link quality, cable/transceiver mismatch, or post-maintenance firmware drift.

Remediation

  • Validate switch, transceiver, and link state before rerunning NCCL.
  • Compare observed firmware versions with approved matrix.
  • Re-run single-node then multi-node ladder after corrections.

Prevention: Enforce pre-NCCL physical and firmware gate checklist for every maintenance window.

Inconsistent performance across identical nodes

Symptoms

  • One node is slower during baseline sanity workload.
  • No obvious software config difference in scheduler.

Likely Cause

Thermal/power variance, partial hardware install issue, or hidden management-plane fault.

Remediation

  • Capture thermal and power telemetry under same load conditions.
  • Re-validate hardware inventory and link topology on affected node.
  • Check BMC event logs for persistent hardware warnings.

Prevention: Capture baseline thermal/power signatures per node at initial bring-up and compare during every audit.

Post-upgrade instability after firmware change

Symptoms

  • Intermittent failures begin only after firmware maintenance.
  • Issues are not reproduced on pre-upgrade canary baseline.

Likely Cause

Compatibility mismatch or incomplete update scope.

Remediation

  • Run compatibility matrix diff between planned and actual versions.
  • Apply rollback to known-good baseline if mismatch persists.
  • Re-validate with controlled canary gate before cluster-wide promotion.

Prevention: Use staged upgrade path with explicit canary criteria and rollback automation.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: First-node bring-up to workload-ready state

Complete first-node onboarding with evidence-driven validation across management, hardware, and storage paths.

Prerequisites

  • Rack power and network ports provisioned.
  • Access to BMC credentials and approved firmware matrix.
  • Storage endpoint reachable from management network.

  1. Validate BMC/OOB access and baseline health.

    ipmitool -I lanplus -H <bmc_ip> -U <user> sel list | tail -n 10

    Expected: No critical unresolved hardware events.

  2. Confirm GPU inventory and topology visibility.

    nvidia-smi -L && nvidia-smi topo -m

    Expected: All expected GPUs visible with stable topology map.

  3. Run thermal/power readiness check under light load.

    nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv

    Expected: Thermal and power values remain within policy thresholds.

  4. Validate link/transceiver state for data-path correctness.

    mlxlink -d <device> --json

    Expected: All required links active at expected speed with pass signal quality.

  5. Run storage sanity benchmark and record baseline.

    fio --name=sanity --directory=/mnt/ai-data --rw=read --bs=1M --size=4G --numjobs=2

    Expected: Throughput meets minimum readiness threshold for domain policy.

Success Criteria

  • All bring-up gates marked PASS with collected artifacts.
  • Node eligible for cluster validation ladder (single-node to multi-node).
  • Runbook evidence saved for future drift comparison.

Study Sprint

10-day execution plan

Day 1: Deployment sequence design and stage-gate definition. Output: bring-up runbook with pass/fail gates.
Day 2: Topology and cabling map review for the target AI workload class. Output: topology diagram with risk annotations.
Day 3: BMC/OOB onboarding and security baseline checks. Output: management-plane baseline report.
Day 4: Firmware inventory and upgrade path planning. Output: firmware matrix and upgrade plan.
Day 5: Power/cooling validation and thermal telemetry baseline. Output: readiness checklist signed for stress testing.
Day 6: Hardware install/validation checklist rehearsal. Output: hardware validation evidence log.
Day 7: Cable/transceiver compatibility and signal-quality pre-check. Output: cabling validation worksheet.
Day 8: Single-node workload readiness dry run. Output: node readiness report.
Day 9: Third-party storage initial parameter validation. Output: storage baseline and tuning notes.
Day 10: Final domain revision and timed objective drill. Output: bring-up quick-reference sheet.

Hands-on Labs

Practical module work

Each lab includes an execution sample with representative CLI usage and expected output.

Lab A: End-to-end first node bring-up

Execute first-node bring-up with documented gate checks.

  • Run physical, firmware, and management checks in defined order.
  • Capture baseline telemetry before and after each major step.
  • Fail fast on gating issues and document corrective actions.

Execution Sample

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Pre-benchmark readiness runbook)

ipmitool -I lanplus -H <bmc_ip> -U <user> chassis status

Expected output (example)

System Power: on
Power Overload: false
Cooling/Fan Fault: false

Lab B: Firmware baseline integrity

Build a consistent firmware/software baseline across components.

  • Inventory versions for host, GPU, switch, and DPU-adjacent components.
  • Execute one controlled upgrade simulation in staging.
  • Verify post-upgrade health and rollback readiness.

Execution Sample

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Firmware inventory baseline)

sudo nvfwupd --query --format table

Expected output (example)

Component        Version
GPU FW           96.00.7A
NVSwitch FW      28.42.1000
BMC FW           24.05.01

Lab C: Topology and cable validation

Prove that physical connectivity matches intended design.

  • Validate transceiver and cable compatibility against design specs.
  • Run basic signal-quality and link-state checks.
  • Document topology mismatches before cluster-scale tests.

Execution Sample

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Pre-benchmark readiness runbook)

nvidia-smi topo -m

Expected output (example)

GPU0  GPU1  CPU Affinity
GPU0   X    NV2   0-55
GPU1  NV2    X    0-55

Lab D: Storage readiness sanity

Confirm storage path is operational before performance validation.

  • Validate mount and path-level accessibility.
  • Run quick throughput and latency spot checks.
  • Record safe baseline thresholds for cluster test entry.

Execution Sample

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Initial storage readiness runbook)

mount | rg -i '<storage_mount>'

Expected output (example)

10.0.0.24:/ai-data on /mnt/ai-data type nfs4 (rw,hard,timeo=600)

Exam Pitfalls

Common failure patterns

  • Skipping explicit bring-up gates and discovering faults only during cluster tests.
  • Treating topology design as static without validating workload communication behavior.
  • Applying firmware updates without a compatibility matrix and rollback plan.
  • Ignoring thermal/power constraints until throttling appears in benchmark runs.
  • Assuming cable insertion is enough without transceiver/signal verification.
  • Declaring readiness before storage path baselines are validated.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. What is the best operational reason to use stage gates during server bring-up?
  • A. To make the process longer
  • B. To isolate failures early and preserve reproducibility
  • C. To avoid firmware baselines
  • D. To skip topology validation

Answer: B

Stage gates localize failures and create a deterministic path from installation to workload readiness.

Q2. Why should topology assumptions be validated before NCCL/HPL cluster testing?
  • A. Topology does not affect communication
  • B. Topology and cabling can cap bandwidth and skew test outcomes
  • C. NCCL only tests CPUs
  • D. HPL ignores network behavior

Answer: B

Incorrect topology or cabling can introduce bottlenecks and make benchmark results misleading.

Q3. What is a key benefit of initial BMC/OOB setup in bring-up flow?
  • A. It replaces all in-band diagnostics
  • B. It provides remote lifecycle control and early fault visibility
  • C. It removes need for firmware updates
  • D. It disables security policies

Answer: B

Management-plane access enables early control, monitoring, and recovery during provisioning.

Q4. Which firmware practice is most defensible for production readiness?
  • A. Upgrade components independently without matrix checks
  • B. Use a validated version matrix and post-upgrade verification
  • C. Ignore rollback planning
  • D. Keep mixed firmware versions indefinitely

Answer: B

Compatibility matrix checks plus validation reduce drift and upgrade-induced instability.

Q5. Why are power and cooling checks required before workload validation?
  • A. They only matter for idle systems
  • B. Thermal/power limits can trigger throttling and false diagnostics
  • C. They are optional once GPUs are detected
  • D. They apply only to storage nodes

Answer: B

Unvalidated thermal and power state can distort performance and hide root causes.

Q6. What should happen before scaling storage checks cluster-wide?
  • A. Skip all single-node checks
  • B. Validate storage baseline on first node
  • C. Run only network tests
  • D. Disable monitoring

Answer: B

Single-node storage baselining prevents amplifying unresolved path issues across the cluster.

Q7. In bring-up context, why validate cable and transceiver types explicitly?
  • A. Physical link presence always implies correct performance
  • B. Compatibility and signal quality directly affect stability and bandwidth
  • C. Only switch firmware matters
  • D. It is irrelevant once OS boots

Answer: B

Correct cable/transceiver pairing and signal quality are prerequisites for reliable fabric behavior.

Q8. Which outcome best indicates a successful initial bring-up phase?
  • A. GPU visible in one command only
  • B. Hardware, firmware, thermal, and management checks all pass documented gates
  • C. Cluster benchmarks started immediately
  • D. Logs are not required

Answer: B

A complete bring-up result requires multi-domain validation evidence, not single-command success.

Q9. What is the operational purpose of a firmware rollback plan?
  • A. To avoid documenting versions
  • B. To recover quickly if upgrade validation fails
  • C. To keep old hardware unsupported
  • D. To skip pre-checks

Answer: B

Rollback readiness limits downtime and risk when post-upgrade checks fail.

Q10. What is the strongest reason to keep bring-up artifacts (logs/checklists) per node?
  • A. They are not needed later
  • B. They support root-cause analysis and consistent scaling
  • C. They replace all benchmarks
  • D. They only help UI dashboards

Answer: B

Artifact discipline improves troubleshooting speed and standardizes future node deployments.

Primary References

Curated from the NCP-AII blueprint/study-guide sources and official documentation.

Objectives

  • 1.1 Describe the sequence of events for deployment and validation.
  • 1.2 Describe network topologies for AI factories.
  • 1.3 Perform initial configuration of BMC, OOB, and TPM.
  • 1.4 Perform firmware upgrades (including on HGX) and fault detection.
  • 1.5 Validate power and cooling parameters.
  • 1.6 Install GPU-based servers (SMI).
  • 1.7 Validate installed hardware.
  • 1.8 Describe and validate cable types and transceivers.
  • 1.9 Install physical GPUs.
  • 1.10 Validate hardware operation for workloads.
  • 1.11 Configure initial parameters for third-party storage.
