1. Stabilize Before Optimizing
- Verify hardware and management-plane integrity first.
- Confirm firmware/software baseline consistency.
- Only then make performance-tuning decisions.
Module study guide
Priority 3 of 5 · Domain 3 in exam order
Scope
This module contains expanded study notes, practical drills, and an exam-style question set.
Exam Framework
Exam Scope Coverage
Domain 3 focuses on control-plane assembly: BCM/HA setup, OS and cluster stack deployment, drivers (GPU + DOCA), container runtime integration, and host-level NGC tooling.
Installation order mistakes create hard-to-debug cascading failures across scheduler and runtime layers.
Drill: Create a dependency-aware install order from bare host to workload-capable node.
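A dependency-aware order can be sketched as a gated sequence in which each layer must validate before the next begins. The phase names below follow this module's layers and are illustrative, and the `gate` function is a placeholder for each layer's real check:

```shell
# Hedged sketch: one plausible install order from bare host to workload-capable
# node. Each phase must pass its gate before the next one starts.
phases="os_baseline bcm_install bcm_ha gpu_driver doca_driver container_toolkit docker_runtime slurm enroot_pyxis ngc_cli"

gate() {
    # Placeholder: substitute the layer's real validation
    # (e.g. nvidia-smi for gpu_driver, a Docker GPU run for docker_runtime).
    echo "validated: $1"
}

for phase in $phases; do
    gate "$phase" || { echo "stop at $phase: do not proceed to the next layer"; exit 1; }
done
```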
The blueprint explicitly calls out Base Command Manager (BCM) and high-availability checks.
Drill: Design a BCM HA validation test with expected outcomes for failover and recovery.
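A minimal drill recorder can capture expected-vs-observed outcomes per failover step. This sketch assumes BCM's `cmha status` as the HA status source, stubbed here so the recording logic runs anywhere:

```shell
# Hedged HA drill sketch. ha_check stubs the real status command so the
# expected-vs-observed recording is runnable without a BCM head node.
ha_check() {
    # Replace with the real status query, e.g. `cmha status` on a BCM head node.
    echo "head01 active, head02 passive"
}

expect() {
    step="$1"; pattern="$2"
    observed=$(ha_check)
    if echo "$observed" | grep -q "$pattern"; then
        echo "PASS $step"
    else
        echo "FAIL $step expected=<$pattern> observed=<$observed>"
    fi
}

expect "pre-failover: one active head" "active"
# Drill continues: trigger failover, expect roles swapped, then verify recovery.
```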
Scheduler and container integration are central to production AI workload orchestration.
Drill: Submit one Slurm job through Enroot/Pyxis and document end-to-end execution path.
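The end-to-end path can be documented alongside the submission itself. This sketch composes the Pyxis-backed `srun` line used elsewhere in this module and lists the stages the job traverses; the stage labels are descriptive, not literal component log lines:

```shell
# Compose the containerized smoke-test command; --container-image is the Pyxis flag.
IMAGE="nvidia/cuda:12.4.1-base-ubuntu22.04"
SRUN_CMD="srun --nodes=1 --gpus-per-node=1 --container-image=${IMAGE} nvidia-smi -L"
echo "submit: ${SRUN_CMD}"

# Execution path to document, in order (descriptive labels only):
for stage in \
    "slurmctld: job accepted and scheduled" \
    "slurmd: job step launched on node" \
    "pyxis: SPANK plugin intercepts --container-image" \
    "enroot: image imported and container started" \
    "container: nvidia-smi -L lists allocated GPU"
do
    echo "stage: ${stage}"
done
```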
Driver install/update/remove flows are explicit exam objectives and operational risk points.
Drill: Build a driver lifecycle checklist covering install, update, rollback, and verification.
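One checklist item, the version gate, can be made executable. A sketch assuming a pinned target version; the installed-version query is stubbed where `nvidia-smi --query-gpu=driver_version` would run on a real node:

```shell
# Hedged version gate: promote only when the installed driver matches the
# approved target; otherwise take the rollback branch of the runbook.
TARGET="550.90.07"   # illustrative target, pin per your change record

installed_driver() {
    # Production: nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1
    echo "550.90.07"   # stub so the gate logic runs anywhere
}

current=$(installed_driver)
if [ "$current" = "$TARGET" ]; then
    echo "driver OK: $current"
else
    echo "driver MISMATCH: have $current want $TARGET (execute rollback step)"
fi
```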
GPU workload execution depends on correct container toolkit and runtime wiring.
Drill: Run a GPU-enabled Docker validation flow and capture each verification checkpoint.
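The validation flow can be driven as a dry run so every checkpoint is captured before anything executes. `nvidia-ctk runtime configure --runtime=docker` is the Container Toolkit's Docker wiring step; the wrapper below only echoes each command, as a reviewable sketch:

```shell
# Dry-run driver for the Docker GPU validation flow: each checkpoint echoes the
# command it would run, so the sequence is inspectable without cluster access.
DRY_RUN=1
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "checkpoint: $*"; else "$@"; fi
}

run nvidia-ctk runtime configure --runtime=docker   # register NVIDIA runtime with Docker
run systemctl restart docker                        # reload daemon with new runtime config
run docker info --format '{{.Runtimes}}'            # verify the nvidia runtime is listed
run docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
```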
Host-level tooling readiness affects artifact access and operational speed.
Drill: Create a host onboarding checklist that includes NGC CLI validation and access policy steps.
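The NGC CLI check can be scripted into onboarding. This sketch validates the version-string format, with the pattern assumed from the `NGC CLI Version 3.x.x` examples in this module:

```shell
# Hedged onboarding check: accept only output matching the expected
# "NGC CLI Version X.Y.Z" shape before marking the host tooling-ready.
ngc_version_ok() {
    echo "$1" | grep -Eq 'NGC CLI Version [0-9]+\.[0-9]+\.[0-9]+'
}

# In production: sample=$(ngc --version 2>&1); stubbed here with example output.
sample="NGC CLI Version 3.41.4"
if ngc_version_ok "$sample"; then
    echo "ngc: OK ($sample)"
else
    echo "ngc: missing or unparseable output"
fi
```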
Concept Explanations
Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.
Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.
Control-plane setup is a layered dependency system where each layer must validate before the next one begins.
Slurm + Enroot + Pyxis defines the path from scheduling policy to GPU container execution.
GPU/DOCA drivers and container toolkit updates can destabilize clusters if changed without staged controls.
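The evidence discipline above can be wrapped in a small helper that records command, timestamp, output, and exit code for each key action; the log path and format here are illustrative:

```shell
# Evidence logger sketch: one entry per key action, with timestamp, command,
# captured output, and exit code. evidence.log is an illustrative path.
log_evidence() {
    ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    printf '=== %s\n$ %s\n' "$ts" "$*" >> evidence.log
    out=$("$@" 2>&1); rc=$?
    printf '%s\nexit=%d\n' "$out" "$rc" >> evidence.log
    return $rc
}

# Wrap any validation command, e.g.:
log_evidence echo "driver check placeholder"
```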
Scenario Playbooks
You need to bring a new 16-node environment to workload-ready status with BCM HA, scheduler stack, and GPU container runtime.
Architecture Diagram
[BCM HA Pair]
|
[Management Network]---[Head/Control Nodes]---[Compute Nodes x16]
|
[Slurm + Enroot/Pyxis]
|
[Docker + NVIDIA Toolkit]
Response Flow
Success Signals
Docker GPU runtime validation
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Expected output (example)
NVIDIA-SMI 550.xx ...
GPU Name Persistence-M|...
After a planned driver update, scheduled container jobs fail to launch on a subset of compute nodes.
Architecture Diagram
[Scheduler]
|
[Node Group A: updated + validated] [Node Group B: updated + failing]
\ /
[Container Runtime + Driver Stack]
Response Flow
Success Signals
Node driver version check
nvidia-smi --query-gpu=driver_version --format=csv,noheader
Expected output (example)
550.90.07
CLI and Commands
Use this sequence to confirm host, driver, runtime, and scheduler path health after installation.
Host GPU driver check
nvidia-smi
Expected output (example)
NVIDIA-SMI 550.xx ...
GPU Name Persistence-M|...
Container runtime check
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
Expected output (example)
GPU 0: NVIDIA H100
GPU 1: NVIDIA H100
Slurm path smoke test
srun --nodes=1 --gpus-per-node=1 --container-image=nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
Expected output (example)
GPU 0: NVIDIA H100
Validate host-level tooling readiness for NVIDIA artifact workflows used in operations and testing.
NGC CLI version check
ngc --version
Expected output (example)
NGC CLI Version 3.x.x
Authenticated catalog test
ngc registry image list nvidia
Expected output (example)
NAMESPACE  IMAGE
nvidia     cuda
nvidia     pytorch
...
Common Problems
Symptoms
Likely Cause
Scheduler-container integration misconfiguration (resource mapping or plugin/runtime wiring).
Remediation
Prevention: Include Docker-path and scheduler-path checks in every post-change validation set.
Symptoms
Likely Cause
Driver/runtime version drift or incomplete update across node groups.
Remediation
Prevention: Enforce staged rollout with node-group compliance checks before promotion.
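The compliance check can be automated over node-group inventory. A sketch that flags driver-version drift from `node version` lines; in production the input would come from a pdsh/ssh fan-out, so sample data stands in here:

```shell
# Drift detector sketch: count distinct driver versions in "node version" lines
# and flag DRIFT when more than one is present.
check_drift() {
    awk '!seen[$2]++ { n++ } END { if (n > 1) print "DRIFT"; else print "CONSISTENT" }'
}

# Sample inventory standing in for:
#   pdsh -w <nodegroup> 'nvidia-smi --query-gpu=driver_version --format=csv,noheader'
result=$(printf 'node01 550.90.07\nnode02 550.90.07\nnode03 535.161.08\n' | check_drift)
echo "node group: $result"
```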
Symptoms
Likely Cause
HA configured but not validated under realistic failover conditions.
Remediation
Prevention: Schedule recurring HA drills with evidence retention.
Lab Walkthroughs
Validate OS baseline, driver/runtime readiness, and scheduler-container execution path on a representative node set.
Prerequisites
Verify host GPU driver health on each role group.
nvidia-smi
Expected: All target nodes report expected driver and visible GPUs.
Validate Docker GPU runtime path.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
Expected: Containerized GPU visibility works on target nodes.
Run scheduler-managed containerized GPU test.
srun --nodes=1 --gpus-per-node=1 --container-image=nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
Expected: Scheduler path launches and maps GPU successfully.
Verify NGC CLI availability on management hosts.
ngc --version
Expected: CLI returns expected version and is usable.
Record evidence bundle for promotion gate.
Expected: All checks archived with pass/fail status and timestamps.
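The bundle step can be a short archive script with a UTC timestamp in the artifact name; paths and file names here are illustrative:

```shell
# Promotion-gate bundle sketch: archive the evidence directory with a UTC
# timestamp so each gate review maps to one immutable artifact.
ts=$(date -u +"%Y%m%dT%H%M%SZ")
mkdir -p evidence
echo "PASS nvidia-smi check $ts" > evidence/driver_check.log   # stand-in for real logs
tar -czf "evidence-bundle-${ts}.tar.gz" evidence
echo "bundle: evidence-bundle-${ts}.tar.gz"
```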
Success Criteria
Study Sprint
| Day | Focus | Output |
|---|---|---|
| 1 | Dependency graph and installation order planning. | Control-plane install sequence blueprint. |
| 2 | BCM initial setup and baseline validation. | BCM install evidence log. |
| 3 | BCM HA test scenario and failover validation. | HA verification report. |
| 4 | OS + interface configuration for cluster nodes. | Node baseline configuration checklist. |
| 5 | Slurm/Enroot/Pyxis integration and job-path checks. | Scheduler-container integration report. |
| 6 | GPU and DOCA driver lifecycle rehearsal. | Driver lifecycle runbook. |
| 7 | NVIDIA Container Toolkit install + Docker GPU validation. | Container runtime readiness checklist. |
| 8 | NGC CLI install and host-level tooling checks. | Tooling compliance checklist. |
| 9 | Timed end-to-end cluster setup simulation. | Execution trace and gap list. |
| 10 | Final objective revision and command-level recall. | Control-plane exam cheat sheet. |
Hands-on Labs
Each lab includes an execution sample with representative CLI usage and expected output.
Install BCM and validate high-availability behavior.
Sample Command (Control-plane post-install validation runbook)
nvidia-smi
Expected output (example)
NVIDIA-SMI 550.xx ...
GPU Name Persistence-M|...
Validate Slurm + Enroot + Pyxis execution path.
Sample Command (Control-plane post-install validation runbook)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
Expected output (example)
GPU 0: NVIDIA H100
GPU 1: NVIDIA H100
Practice safe install/update/remove workflows for GPU and DOCA drivers.
Sample Command (Control-plane post-install validation runbook)
srun --nodes=1 --gpus-per-node=1 --container-image=nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi -L
Expected output (example)
GPU 0: NVIDIA H100
Confirm container and artifact tooling readiness on host baseline.
Sample Command (NGC CLI and artifact access runbook)
ngc --version
Expected output (example)
NGC CLI Version 3.x.x
Exam Pitfalls
Practice Set
Attempt each question first, then review the answer and explanation.
Answer: B
Layered infrastructure depends on correct sequencing to avoid compounded misconfigurations.
Answer: B
HA testing confirms management continuity when a component fails.
Answer: A
The blueprint explicitly names Slurm, Enroot, and Pyxis in cluster installation scope.
Answer: B
Staged rollout with compatibility validation reduces outage risk.
Answer: B
Runtime validation confirms container GPU access but does not prove scheduler or HA readiness.
Answer: B
NGC CLI provides consistent access to NVIDIA resources used in operations and validation.
Answer: B
Driver readiness is necessary but not sufficient for complete control-plane readiness.
Answer: B
Versioned runbooks make installations reproducible and auditable.
Answer: B
Version drift across nodes introduces unpredictable runtime and scheduler behavior.
Answer: B
The domain spans coordinated installation and validation across all these control-plane layers.
Primary References
Curated from the NCP-AII blueprint/study-guide sources and official documentation.