1. Stabilize Before Optimizing
- Verify hardware and management-plane integrity first.
- Confirm firmware/software baseline consistency.
- Only then make performance-tuning decisions.
Module study guide
Priority 5 of 5 · Domain 2 in exam order
Scope
This module contains expanded study notes, practical drills, and an exam-style question set.
Exam Framework
Exam Scope Coverage
Domain 2 covers low-level resource control: BlueField network platform operations and MIG partitioning strategies for AI and HPC multi-tenant environments.
The exam expects operational familiarity with DPU-backed networking and infrastructure control planes.
Drill: Diagram one host-network path and annotate BlueField responsibilities in the flow.
Domain scope includes configuration and management, not just conceptual understanding.
Drill: Write a BlueField maintenance runbook with pre-checks, execution controls, and verification steps.
MIG configuration is explicitly in scope and can affect utilization and isolation outcomes.
Drill: Given three workloads, assign MIG profiles and justify expected tradeoffs.
Correct MIG setup requires operational checks and stable inventory handling.
Drill: Perform a MIG create/validate/reset cycle and record all state transitions.
The blueprint highlights both AI and HPC contexts for physical-layer decisions.
Drill: Build an allocation policy table: workload type, MIG policy, and expected operational impact.
Most operational regressions come from drift, undocumented changes, or profile mismatch.
Drill: Create a post-maintenance audit checklist for BlueField and MIG state consistency.
Concept Explanations
Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.
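The priority order above can be treated as a sequence of gates, where a later stage runs only if every earlier one passed. The following is a minimal sketch under that assumption; the stage keys and the check callables are hypothetical placeholders for site-specific checks, not part of any NVIDIA tooling.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Stage names mirror the priority order in the text. The check functions
# supplied by the caller are hypothetical stand-ins for real checks.
STAGE_ORDER = [
    "safety_and_hardware_integrity",
    "baseline_consistency",
    "controlled_validation",
    "optimization",
]


def run_gates(checks: Dict[str, Callable[[], bool]]) -> Tuple[List[str], Optional[str]]:
    """Run checks in STAGE_ORDER and stop at the first failing stage.

    Returns (stages_passed, first_failed_stage), where the second item
    is None when every stage passed.
    """
    passed: List[str] = []
    for stage in STAGE_ORDER:
        if not checks[stage]():
            return passed, stage
        passed.append(stage)
    return passed, None
```

If the baseline check fails, the function returns before the validation and optimization stages are attempted, which is the behavior the ordering is meant to enforce.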
Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.
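One way to make an action evidence-producing is to wrap each command so that the four items named above are captured together. This is a sketch only: the `Evidence` record and `capture` helper are illustrative names, and "expected" is simplified here to a substring that should appear in the output.

```python
import json
import subprocess
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Evidence:
    command: List[str]
    output: str
    timestamp: str   # UTC, ISO 8601
    expected: str    # substring expected in the output (simplification)
    matched: bool    # expected vs observed


def capture(command: List[str], expected: str) -> Evidence:
    """Run a command and record its output, a UTC timestamp, and whether
    the expected substring was observed."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = result.stdout + result.stderr
    return Evidence(
        command=list(command),
        output=output,
        timestamp=datetime.now(timezone.utc).isoformat(),
        expected=expected,
        matched=expected in output,
    )


def to_json_line(evidence: Evidence) -> str:
    """Serialize one record as a single JSON line for an operations log."""
    return json.dumps(asdict(evidence))
```

For example, `capture(["nvidia-smi", "mig", "-lgi"], "1g.10gb")` would record whether a 1g.10gb instance appears in the inventory listing on a host where `nvidia-smi` is installed.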
BlueField acts as a programmable data-path and infrastructure-control element that can influence networking, isolation, and observability outcomes.
MIG is a resource isolation mechanism, not just a utilization toggle. Profile choice must align with workload SLOs.
Most physical-layer incidents come from undocumented drift across node-level MIG or platform-level BlueField states.
Scenario Playbooks
A subset of nodes starts violating inference latency SLOs after maintenance. Investigation reveals inconsistent MIG profiles across nodes.
Architecture Diagram
[Scheduler] -> [Node Group A: MIG 1g/2g mix] -> [Inference Services]
[Node Group B: drifted profiles] -> [Latency spikes]
Response Flow
Success Signals
MIG inventory check
nvidia-smi mig -lgi
Expected output (example)
GPU 0: GI 1 Profile 1g.10gb
GPU 0: GI 2 Profile 2g.20gb
After a BlueField update, a subset of nodes exhibits unstable communication despite healthy host-level checks.
Architecture Diagram
[Control Plane]
|
[BlueField Set A: baseline] --- [Switch Fabric] --- [BlueField Set B: updated]
\ intermittent path /
[GPU Hosts]
Response Flow
Success Signals
Network interface state sample
ethtool <iface> | rg -i 'Speed|Link detected'
Expected output (example)
Speed: 400000Mb/s
Link detected: yes
CLI and Commands
Use this runbook when enabling, validating, or resetting MIG layout for workload classes.
Enable MIG mode
sudo nvidia-smi -i 0 -mig 1
Expected output (example)
Enabled MIG mode for GPU 00000000:17:00.0
List available MIG profiles
nvidia-smi mig -lgip
Expected output (example)
GPU instance profiles:
1g.10gb
2g.20gb
3g.40gb
List created GPU instances
nvidia-smi mig -lgi
Expected output (example)
GPU 0: GI 1 Profile 1g.10gb
GPU 0: GI 2 Profile 2g.20gb
Validate link-level health when physical-layer regressions are suspected.
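A link-level check of this kind can also be scripted. The sketch below assumes the `ethtool` text has already been captured as a string and only looks for the `Speed:` and `Link detected:` fields; the function names and the minimum-speed threshold are illustrative.

```python
import re
from typing import Dict, Optional


def parse_ethtool(text: str) -> Dict[str, Optional[object]]:
    """Extract speed (Mb/s) and link state from ethtool-style output.

    A field that is absent or non-numeric (for example a speed reported
    as unknown on a down link) is returned as None.
    """
    speed = re.search(r"Speed:\s*(\d+)Mb/s", text)
    link = re.search(r"Link detected:\s*(yes|no)", text)
    return {
        "speed_mbps": int(speed.group(1)) if speed else None,
        "link_up": (link.group(1) == "yes") if link else None,
    }


def link_ok(text: str, min_speed_mbps: int) -> bool:
    """True only if the link is up and at or above the required speed."""
    state = parse_ethtool(text)
    return bool(state["link_up"]) and (state["speed_mbps"] or 0) >= min_speed_mbps
```

Checking speed as well as link state matters because a link that renegotiated to a lower rate still reports `Link detected: yes`.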
Link status and speed check
ethtool <iface>
Expected output (example)
Speed: 400000Mb/s
Duplex: Full
Link detected: yes
Mellanox link diagnostics sample
mlxlink -d <device> --json
Expected output (example)
{
 "state": "Active",
 "BER": "Within threshold",
 "signal_quality": "Pass"
}
Common Problems
Symptoms
Likely Cause
Undocumented MIG changes or failed policy application after maintenance.
Remediation
Prevention: Automate MIG policy enforcement and post-change audit checks.
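A post-change audit of MIG state can be as simple as comparing each node's instance profiles against the approved baseline. The following sketch assumes the per-node profile lists have already been collected (for example from `nvidia-smi mig -lgi`); the node names and the `mig_drift` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def mig_drift(baseline: List[str], observed: Dict[str, List[str]]) -> Dict[str, dict]:
    """Compare per-node MIG profile lists against an approved baseline.

    Profiles are compared as multisets, so instance order does not matter
    but instance counts do. Only drifted nodes appear in the result.
    """
    want = Counter(baseline)
    report: Dict[str, dict] = {}
    for node, profiles in observed.items():
        have = Counter(profiles)
        missing = sorted((want - have).elements())
        unexpected = sorted((have - want).elements())
        if missing or unexpected:
            report[node] = {"missing": missing, "unexpected": unexpected}
    return report
```

An empty result means every node matches the baseline; any non-empty entry is a candidate for remediation before workloads are rescheduled onto that node.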
Symptoms
Likely Cause
BlueField software/firmware mismatch across nodes or incomplete rollout.
Remediation
Prevention: Use canary-first lifecycle workflow and block partial production rollout.
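A canary-first workflow can be expressed as a small decision function that refuses to continue until the canary nodes are on the target version and healthy. This is a sketch under stated assumptions: the version strings, node names, and the single `canary_healthy` flag are placeholders for whatever inventory and health signals a site actually collects.

```python
from typing import Dict, List


def version_groups(node_versions: Dict[str, str]) -> Dict[str, List[str]]:
    """Group node names by their reported software/firmware version."""
    groups: Dict[str, List[str]] = {}
    for node, version in sorted(node_versions.items()):
        groups.setdefault(version, []).append(node)
    return groups


def rollout_decision(node_versions: Dict[str, str], target: str,
                     canaries: List[str], canary_healthy: bool) -> str:
    """Return the next rollout action for a canary-first workflow."""
    if not all(node_versions.get(n) == target for n in canaries):
        return "update-canaries"   # canaries must move first
    if not canary_healthy:
        return "halt"              # block the wider rollout
    if set(version_groups(node_versions)) == {target}:
        return "complete"          # whole fleet is consistent
    return "proceed"               # safe to update remaining nodes
```

`version_groups` is also useful on its own: more than one key in its result means the fleet is in a mixed-version state.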
Symptoms
Likely Cause
MIG profiles chosen without workload class modeling.
Remediation
Prevention: Review profile policy during capacity planning cycles.
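Workload-class modeling can start from a simple table that maps each class to the smallest profile that fits it. The sketch below uses only the profile names from this guide's example output and reads the memory size from the name suffix; the workload classes and their memory requirements are hypothetical, and memory is only one input (compute slices and latency SLOs would also need to be considered).

```python
from typing import Dict, Optional

# Profile name -> memory in GB, taken from the name suffix.
PROFILES: Dict[str, int] = {"1g.10gb": 10, "2g.20gb": 20, "3g.40gb": 40}


def smallest_fitting_profile(required_gb: int,
                             profiles: Dict[str, int] = PROFILES) -> Optional[str]:
    """Return the smallest profile with at least required_gb of memory,
    or None if no listed profile is large enough."""
    fitting = [(gb, name) for name, gb in profiles.items() if gb >= required_gb]
    if not fitting:
        return None
    return min(fitting)[1]


def build_policy(workloads: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Map each workload class (name -> required GB) to a profile."""
    return {name: smallest_fitting_profile(gb) for name, gb in workloads.items()}
```

A `None` entry flags a class that does not fit any listed profile and therefore needs a separate decision, such as a larger profile or a full GPU.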
Lab Walkthroughs
Implement and validate a workload-class MIG policy with reproducible inventory state.
Prerequisites
Enable MIG mode for target GPUs.
sudo nvidia-smi -i 0 -mig 1
Expected: MIG mode enabled successfully.
Create planned instance profiles for each class.
nvidia-smi mig -cgi <profile_id> -C
Expected: GPU instances created matching plan.
Validate created inventory against design.
nvidia-smi mig -lgi
Expected: Inventory output equals policy baseline.
Run one validation workload per class.
python run_validation.py --class <class_name>
Expected: Each class runs within expected performance bounds.
Record baseline and integrate with scheduler metadata.
Expected: Resource definitions and node states are synchronized.
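To keep the lab reproducible, the command sequence for the first three steps can be generated from the plan rather than typed by hand. The sketch below only builds and prints the commands (a dry run) and uses the same command forms as the steps above; `mig_lab_plan` and `render` are illustrative names, and the profile IDs are supplied by the caller.

```python
import shlex
from typing import List


def mig_lab_plan(gpu_index: int, profile_ids: List[str]) -> List[List[str]]:
    """Build the ordered command list for the MIG lab: enable MIG mode,
    create one GPU instance (with compute instance) per planned profile,
    then list the resulting inventory."""
    plan = [["sudo", "nvidia-smi", "-i", str(gpu_index), "-mig", "1"]]
    for profile_id in profile_ids:
        plan.append(["nvidia-smi", "mig", "-cgi", str(profile_id), "-C"])
    plan.append(["nvidia-smi", "mig", "-lgi"])
    return plan


def render(plan: List[List[str]]) -> List[str]:
    """Render each command as a shell-quoted string for review or logging."""
    return [shlex.join(cmd) for cmd in plan]
```

Reviewing the rendered plan before execution gives a record of intended state that the later inventory check can be compared against.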
Success Criteria
Study Sprint
| Day | Focus | Output |
|---|---|---|
| 1 | BlueField architecture and role mapping review. | Annotated architecture notes. |
| 2 | BlueField management surface and lifecycle workflow. | BlueField operations checklist. |
| 3 | MIG concepts and profile taxonomy review. | MIG profile quick-reference card. |
| 4 | Hands-on MIG create/list/validate/reset drill. | Validated MIG operations log. |
| 5 | AI workload to MIG mapping exercise. | AI mapping decision matrix. |
| 6 | HPC workload to MIG/full-GPU policy design. | HPC policy recommendation sheet. |
| 7 | BlueField and host consistency checks. | Cross-component consistency report. |
| 8 | Incident scenario: drifted MIG/BlueField state. | Drift response runbook. |
| 9 | Timed objective rehearsal with command-level decisions. | Exam scenario answer sheet. |
| 10 | Final revision and weak-area patch. | Physical-layer rapid revision notes. |
Hands-on Labs
Each lab includes an execution sample with representative CLI usage and expected output.
Confirm BlueField platform readiness in a controlled environment.
Sample Command (MIG lifecycle verification runbook)
sudo nvidia-smi -i 0 -mig 1
Expected output (example)
Enabled MIG mode for GPU 00000000:17:00.0
Practice deterministic MIG lifecycle management.
Sample Command (MIG lifecycle verification runbook)
nvidia-smi mig -lgip
Expected output (example)
GPU instance profiles:
1g.10gb
2g.20gb
3g.40gb
Translate workload requirements into physical-layer configuration policy.
Sample Command (MIG lifecycle verification runbook)
nvidia-smi mig -lgi
Expected output (example)
GPU 0: GI 1 Profile 1g.10gb
GPU 0: GI 2 Profile 2g.20gb
Detect and remediate state drift after simulated maintenance.
Sample Command (Physical-link readiness runbook)
ethtool <iface>
Expected output (example)
Speed: 400000Mb/s
Duplex: Full
Link detected: yes
Exam Pitfalls
Practice Set
Attempt each question first, then open the answer and explanation.
Answer: B
MIG partitions a GPU into isolated instances, supporting controlled multi-tenant usage.
Answer: B
BlueField updates can affect core platform behavior and must be validated like other critical changes.
Answer: B
MIG profile strategy should match workload needs for isolation, latency, and utilization.
Answer: B
Undocumented ad-hoc changes create drift and complicate troubleshooting.
Answer: B
Workload-class policies allow controlled partitioning without sacrificing operational consistency.
Answer: B
Post-reset validation confirms that resource state matches intended baseline.
Answer: B
Drift is measured as divergence from approved, expected operational baseline.
Answer: B
Physical-layer issues can manifest as higher-level symptoms; explicit checks speed triage.
Answer: B
Baseline + checklist creates consistency and traceability across operations.
Answer: B
Even with lower exam weight, physical-layer misconfiguration can undermine multiple higher-weight domains.
Primary References
Curated from the NCP-AII blueprint/study-guide sources and official documentation.