NCP-AII Study Guide

This landing page centralizes module-level study notes, drills, and references for AI Infrastructure Professional (NCP-AII) exam prep, aligned to the NCP-AII exam blueprint and study guide.

Admin-only · Official NVIDIA blueprint/study guide sources · 5/5 modules published

Recommended Study Priority

Priority | Domain | Why
Tier 1 | System and Server Bring-up | Foundation domain for deployment sequencing, firmware, hardware, and readiness checks.
Tier 1 | Cluster Test and Verification | Highest exam weight (33%); core to proving cluster readiness, bandwidth, burn-in, and storage reliability.
Tier 2 | Control Plane Installation and Configuration | Critical integration domain for OS, scheduler stack, drivers, containers, and tooling.
Tier 2 | Troubleshoot and Optimize | Operational excellence domain for fault isolation, replacement workflows, and performance tuning.
Tier 3 | Physical Layer Management | Lowest exam weight (5%), but covers specialized operational tasks (BlueField and MIG).
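
To make the weighting concrete, the minimal Python sketch below splits a study budget across the five domains in proportion to the exam weights listed in the domain sections of this guide. The helper name `allocate_hours` and any hour budget passed to it are illustrative assumptions, not part of the official blueprint. Proportional allocation is only one reasonable starting point and does not account for the tiering above.

```python
# Exam weights (percent) as listed in the domain sections of this guide.
EXAM_WEIGHTS = {
    "System and Server Bring-up": 31,
    "Physical Layer Management": 5,
    "Control Plane Installation and Configuration": 19,
    "Cluster Test and Verification": 33,
    "Troubleshoot and Optimize": 12,
}


def allocate_hours(total_hours):
    """Split a study budget proportionally to exam weight (hypothetical helper)."""
    return {
        domain: total_hours * weight / 100
        for domain, weight in EXAM_WEIGHTS.items()
    }
```

For example, `allocate_hours(40)` assigns the largest share to Cluster Test and Verification and the smallest to Physical Layer Management.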

Domain 1 - System and Server Bring-up

Exam Weight: 31%

Deployment sequence, topology awareness, firmware/hardware validation, and initial storage integration.

Domain Overview

Build operational readiness from first power-on through validated hardware state for AI infrastructure nodes.

Objectives

  • Describe the sequence of events for deployment and validation.
  • Describe network topologies for AI factories.
  • Perform initial configuration of BMC, OOB, and TPM.
  • Perform firmware upgrades (including on HGX) and fault detection.
  • Validate power and cooling parameters.
  • Install GPU-based servers (SMI).
  • Validate installed hardware.
  • Describe and validate cable types and transceivers.
  • Install physical GPUs.
  • Validate hardware operation for workloads.
  • Configure initial parameters for third-party storage.
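
As one way to practice the "validate installed hardware" objective, the sketch below queries GPUs through `nvidia-smi --query-gpu` and checks the detected count against an expected value. It assumes an NVIDIA driver is installed on the node. The choice of fields, the helper names, and the expected count are assumptions for illustration, not a prescribed validation procedure.

```python
import csv
import io
import subprocess

# Fields accepted by `nvidia-smi --query-gpu`; this selection is an assumption
# about what a basic readiness check might look at.
FIELDS = ["index", "name", "driver_version", "temperature.gpu", "power.draw"]


def query_gpus():
    """Run nvidia-smi and return its CSV output (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpus(csv_text):
    """Turn the CSV text into one dict per GPU, keyed by field name."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in rows if row]


def check_gpu_count(gpus, expected):
    """Return True when the number of detected GPUs matches the expectation."""
    return len(gpus) == expected
```

On a live node, `check_gpu_count(parse_gpus(query_gpus()), 8)` would confirm that eight GPUs are visible; the value 8 is only an example.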

Domain 2 - Physical Layer Management

Exam Weight: 5%

BlueField platform management and MIG configuration patterns for AI and HPC usage models.

Domain Overview

Operate physical-layer resources for performance isolation, utilization control, and fabric reliability.

Objectives

  • Configure and manage a BlueField network platform.
  • Configure MIG (AI and HPC).
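
For the MIG objective, the sketch below only builds (and does not run) the `nvidia-smi` argument lists for a basic setup on one GPU: enable MIG mode, list the available GPU instance profiles, then create GPU instances with default compute instances. The helper name `mig_commands` is hypothetical, and the profile names are placeholders: valid profiles depend on the GPU model and should be taken from the `-lgip` listing. Running these commands typically requires root privileges and an idle GPU.

```python
def mig_commands(gpu_index, profiles):
    """Build nvidia-smi argument lists for a basic MIG setup on one GPU.

    `profiles` holds GPU instance profile names or IDs as reported by
    `nvidia-smi mig -lgip`; valid values depend on the GPU model.
    """
    gpu = str(gpu_index)
    return [
        # Enable MIG mode on the selected GPU.
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        # List the GPU instance profiles this GPU supports.
        ["nvidia-smi", "mig", "-i", gpu, "-lgip"],
        # Create GPU instances plus their default compute instances (-C).
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", ",".join(profiles), "-C"],
    ]
```

Keeping the commands as data makes it easy to review them before execution, for example by printing each list with `" ".join(cmd)`.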

Domain 3 - Control Plane Installation and Configuration

Exam Weight: 19%

Install and configure cluster management stack, GPU/DOCA drivers, container runtime, and NGC tooling.

Domain Overview

Establish the software control plane required to run AI workloads across clustered NVIDIA infrastructure.

Objectives

  • Install Base Command Manager (BCM), configure and verify HA.
  • Install OS.
  • Install Cluster (configure category, configure interfaces, install Slurm/Enroot/Pyxis).
  • Install/update/remove NVIDIA GPU and DOCA drivers.
  • Install the NVIDIA container toolkit.
  • Demonstrate how to use NVIDIA GPUs with Docker.
  • Install NGC CLI on hosts.
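
A common smoke test for the "NVIDIA GPUs with Docker" objective is to run `nvidia-smi` inside a container started with the `--gpus` flag. The sketch below wraps that in Python. It assumes Docker and the NVIDIA Container Toolkit are already installed; the helper names are hypothetical, and the image is whatever the caller supplies.

```python
import subprocess


def docker_gpu_command(image, gpus="all"):
    """Build a `docker run` argument list that exposes GPUs and runs nvidia-smi."""
    return ["docker", "run", "--rm", "--gpus", gpus, image, "nvidia-smi"]


def run_smoke_test(image):
    """Run the container; return True when nvidia-smi exits with status 0."""
    result = subprocess.run(
        docker_gpu_command(image), capture_output=True, text=True
    )
    return result.returncode == 0
```

A non-zero exit status usually points at a missing driver, a missing toolkit, or a runtime that is not configured for GPUs, which narrows down where to look next.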

Domain 4 - Cluster Test and Verification

Exam Weight: 33%

Stress tests, HPL/NCCL validation, firmware checks, burn-in procedures, and storage testing.

Domain Overview

Validate end-to-end cluster performance and stability before production workload onboarding.

Objectives

  • Perform a single-node stress test.
  • Execute HPL (High-Performance Linpack).
  • Perform single-node NCCL (including verifying NVLink Switch).
  • Validate cables by verifying signal quality.
  • Confirm cabling is correct.
  • Confirm FW/SW on switches.
  • Confirm FW/SW on BlueField-3.
  • Confirm FW on transceivers.
  • Run ClusterKit to perform a multifaceted node assessment.
  • Run NCCL to verify E/W fabric bandwidth.
  • Perform NCCL burn-in.
  • Perform HPL burn-in.
  • Perform NeMo burn-in.
  • Test storage.
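
For the NCCL objectives, a pass/fail check can be sketched by reading the `# Avg bus bandwidth` summary line that the nccl-tests binaries (for example `all_reduce_perf`) print at the end of a run. The helper names below are hypothetical, and the minimum acceptable bandwidth is a site-specific assumption that depends on the GPU and fabric generation; this guide does not define one.

```python
import re

# Matches the summary line printed at the end of an nccl-tests run.
AVG_BUSBW = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)")


def avg_bus_bandwidth(output):
    """Return the average bus bandwidth (GB/s) from nccl-tests output, or None."""
    match = AVG_BUSBW.search(output)
    return float(match.group(1)) if match else None


def meets_threshold(output, min_gbps):
    """True when the summary line is present and at or above the threshold."""
    value = avg_bus_bandwidth(output)
    return value is not None and value >= min_gbps
```

Treating a missing summary line as a failure is a deliberate choice here: a run that crashed or was truncated should not count as a passing burn-in.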

Domain 5 - Troubleshoot and Optimize

Exam Weight: 12%

Fault diagnosis, component replacement, server optimization, and storage tuning workflows.

Domain Overview

Identify root causes quickly and optimize infrastructure performance under production constraints.

Objectives

  • Identify and troubleshoot hardware faults (e.g., GPU, fan, network card).
  • Identify faulty cards, GPUs, and power supplies.
  • Replace faulty cards, GPUs, and power supplies.
  • Execute performance optimization for AMD and Intel servers.
  • Optimize storage.
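
When isolating GPU faults, the NVIDIA kernel driver's Xid messages are a useful first signal. The sketch below scans kernel log text (for example, saved `dmesg` output) for `NVRM: Xid` lines and counts events per PCI address. The helper names are hypothetical, and interpreting individual Xid codes is left to NVIDIA's Xid documentation.

```python
import re
from collections import Counter

# Matches lines such as "NVRM: Xid (PCI:0000:3b:00): 79, pid=..., ...".
XID_LINE = re.compile(r"NVRM: Xid \((PCI:[0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code) pairs for every Xid line in the log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_LINE.finditer(log_text)]


def xid_counts_by_device(log_text):
    """Count Xid events per PCI address to show which GPU reports errors."""
    return Counter(pci for pci, _ in find_xid_events(log_text))
```

A device that accumulates repeated Xid events is a natural candidate for deeper diagnostics before a replacement workflow is started.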