1. Stabilize Before Optimizing
- Verify hardware and management-plane integrity first.
- Confirm firmware/software baseline consistency.
- Only then make performance-tuning decisions.
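The baseline-consistency step can be sketched as a small filter. This is a minimal illustration only: it assumes you have already collected one `host firmware_version` pair per line (for example from the `Firmware version` field that `ibstat` prints), and the function name `fw_drift` and the input format are hypothetical.

```shell
# fw_drift EXPECTED_VERSION
# Reads "host firmware_version" pairs on stdin (a hypothetical inventory
# format) and prints every host whose version differs from the baseline.
# Exit status is 0 when all hosts match and 1 when any host drifts.
fw_drift() {
    awk -v want="$1" '
        NF >= 2 && $2 != want { print $1 " " $2; drift = 1 }
        END { exit drift + 0 }
    '
}
```

A non-zero exit status can then act as the gate: tuning work starts only once `fw_drift` reports no deviating hosts.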
Module study guide
Priority 2 of 6 · Domain 3 in exam order
Scope
This module contains expanded study notes, scenario playbooks, command runbooks, and exam-style checkpoint questions.
Exam Framework
Exam Scope Coverage
Domain 3 focuses on NVIDIA InfiniBand operations with UFM-led configuration, validation, monitoring, congestion troubleshooting, and multi-tenant security controls.
You need to reason about InfiniBand behavior and management surfaces before tuning commands.
Drill: Explain which InfiniBand components you would check first for cluster-wide communication regressions.
Blueprint scope explicitly requires configuring and validating InfiniBand using UFM.
Drill: Run a UFM health snapshot and identify one high-priority warning class.
Command-line diagnostics remain critical for rapid fault isolation and exam-style troubleshooting.
Drill: Create a command sequence to isolate whether an issue is endpoint-, path-, or congestion-related.
Partitioning and access controls are required for multi-team AI clusters with shared infrastructure.
Drill: Draft a partition and access model for two tenant groups sharing one fabric.
A dedicated objective covers congestion and bottleneck optimization.
Drill: Design a congestion triage loop that ends with measurable acceptance criteria.
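One way to make "measurable acceptance criteria" concrete is to require that every measurement window meets a minimum bandwidth. The sketch below assumes one bandwidth value in Gb/s per line, one line per window (for instance, values copied from repeated `ib_write_bw --report_gbits` runs); the function name `meets_acceptance` and the threshold you pass are hypothetical choices, not values from the exam blueprint.

```shell
# meets_acceptance MIN_GBPS
# Reads one measured bandwidth value (Gb/s) per line on stdin, one line per
# measurement window. Succeeds only if at least one window was measured and
# every window is at or above MIN_GBPS.
meets_acceptance() {
    awk -v min="$1" '
        NF == 0 { next }                          # ignore blank lines
        { n++; if ($1 + 0 < min + 0) bad++ }      # count windows below threshold
        END {
            printf "windows=%d below_threshold=%d\n", n, bad
            rc = (n > 0 && bad == 0) ? 0 : 1
            exit rc
        }
    '
}
```

The triage loop ends when `meets_acceptance` succeeds across all windows; a single passing run is not treated as closure.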
Module Resources
Concept Explanations
Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.
Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.
Reliable diagnosis combines management-plane visibility with low-level path and endpoint diagnostics.
Congestion often emerges from interacting workload, routing, and policy conditions rather than one broken link.
Security controls should enforce tenant boundaries while preserving controlled operations workflows.
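The evidence-producing habit described above (command, output, timestamp, expected vs observed) can be sketched as a wrapper function. This is an illustrative helper, not part of any InfiniBand or UFM tooling; the name `record_evidence` and the field layout are assumptions.

```shell
# record_evidence EXPECTED_TEXT COMMAND [ARGS...]
# Runs the command, then prints a record holding the evidence fields:
# command, UTC timestamp, expected behavior, exit status, observed output.
# Note: POSIX sh has no local variables, so these names are global.
record_evidence() {
    expected=$1
    shift
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    # Capture stdout and stderr together; keep the exit status either way.
    observed=$("$@" 2>&1) && status=0 || status=$?
    printf 'command: %s\n' "$*"
    printf 'timestamp: %s\n' "$ts"
    printf 'expected: %s\n' "$expected"
    printf 'exit_status: %s\n' "$status"
    printf 'observed:\n%s\n' "$observed"
}
```

For example, `record_evidence "ports Active at expected rate" ibstat >> evidence.log` would append one record per check to a log file of your choosing.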
Scenario Playbooks
Nightly distributed training jobs exhibit intermittent latency spikes and reduced all-reduce efficiency.
Architecture Diagram
GPU Pods
|
InfiniBand Fabric
|-- Core Switches
|-- Edge Switches
|-- Management via UFM
Response Flow
Success Signals
Port and link status
ibstat
Expected output (example)
All relevant ports report Active state and expected link rate.
Path diagnostics
ibdiagnet -v
Expected output (example)
No critical path errors; warnings are mapped to specific remediation candidates.
A newly provisioned tenant reports communication failures across assigned compute resources.
Architecture Diagram
Tenant A Partition
Tenant B Partition
|
Shared InfiniBand Fabric
|
UFM + Access Control
Response Flow
Success Signals
CLI and Commands
Capture baseline link and topology health before advanced troubleshooting.
Link and port state
ibstat
Expected output (example)
Ports are Active with expected physical and link layer attributes.
Fabric topology discovery
ibnetdiscover | head -n 40
Expected output (example)
Topology inventory aligns with expected node and switch map.
Management health snapshot
ufm_health --summary
Expected output (example)
Management summary highlights stable fabric state or specific warning categories.
Localize and remediate sustained bandwidth or latency bottlenecks.
Bandwidth probe
ib_write_bw -d mlx5_0 -F --report_gbits <peer_host>
Expected output (example)
Measured bandwidth indicates whether path performance meets baseline.
Performance counters
perfquery -x
Expected output (example)
Counter output reveals error trends and congestion-linked symptoms.
Path validation
saquery -s
Expected output (example)
Service and path queries return expected records for active fabric routes.
Common Problems
Symptoms
Likely Cause
Sustained congestion on specific paths with insufficient targeted remediation.
Remediation
Prevention: Integrate periodic congestion audit into production operations cadence.
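A periodic congestion audit can be approximated by comparing two saved `perfquery` snapshots of the same port and listing the counters that grew. The sketch assumes the usual `Name:....value` line layout of `perfquery` output; the function name `counter_deltas` and the snapshot file names are hypothetical, and very large 64-bit counter values may exceed awk's exact integer range.

```shell
# counter_deltas BEFORE_FILE AFTER_FILE
# Compares two saved perfquery outputs and prints "<counter> +<increase>"
# for every decimal counter whose value grew between the snapshots.
counter_deltas() {
    awk -F: '
        # Strip the dot padding placed between the counter name and value.
        { val = $2; gsub(/[. ]/, "", val) }
        val !~ /^[0-9]+$/ { next }              # skip headers and hex fields
        FNR == NR { before[$1] = val; next }    # first file: remember values
        ($1 in before) && val + 0 > before[$1] + 0 {
            printf "%s +%d\n", $1, val - before[$1]
        }
    ' "$1" "$2"
}
```

Growth in a wait or discard counter on the same port across several audits is the kind of repeated evidence that points at a sustained hotspot rather than a transient.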
Symptoms
Likely Cause
Partition policy not applied consistently across all endpoints.
Remediation
Prevention: Use automated onboarding checks for partition integrity and access validation.
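An automated onboarding check for partition integrity could look like the sketch below. It assumes you have gathered one `host pkey` pair per line for every P_Key present on each endpoint (on Linux these are exposed under `/sys/class/infiniband/<device>/ports/<port>/pkeys/`); the function name `missing_pkey` and the inventory format are hypothetical.

```shell
# missing_pkey EXPECTED_PKEY
# Reads "host pkey" pairs on stdin (hypothetical inventory format, one line
# per P_Key found on a host) and prints, sorted, every host that never
# lists EXPECTED_PKEY. The comparison is exact apart from letter case.
missing_pkey() {
    awk -v want="$1" '
        NF >= 2 {
            seen[$1] = seen[$1] + 0             # register the host
            if (tolower($2) == tolower(want)) seen[$1] = 1
        }
        END { for (h in seen) if (!seen[h]) print h }
    ' | sort
}
```

Because the match is exact, the membership bit matters: the high-order bit of the 16-bit P_Key marks full membership, so `0x8001` and `0x0001` are treated as different values here.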
Symptoms
Likely Cause
Health status lacks workload-context correlation; hidden path inefficiencies remain.
Remediation
Prevention: Adopt workload-aware validation as standard post-change gate.
Lab Walkthroughs
Use UFM and CLI diagnostics to validate fabric readiness and isolate candidate bottlenecks.
Prerequisites
Capture management-plane health snapshot.
ufm_health --summary
Expected: Health summary identifies stable fabric or actionable warnings.
Validate endpoint link state.
ibstat
Expected: Relevant ports are Active and configured as expected.
Run topology/path diagnostics.
ibdiagnet -v
Expected: Critical path errors are absent or isolated to explicit links.
Measure path bandwidth baseline.
ib_write_bw -d mlx5_0 -F --report_gbits <peer_host>
Expected: Bandwidth remains inside accepted baseline range.
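The endpoint link-state step in this walkthrough can be checked mechanically instead of by eye. The following is a minimal sketch that assumes the standard `ibstat` text layout (a `CA '<name>'` line, then `Port <n>:` and `State:` lines); the function name `inactive_ports` is hypothetical.

```shell
# inactive_ports
# Reads ibstat output on stdin and prints "<CA> port <N>: <state>" for
# every port whose State is not Active. Prints nothing when all are Active.
inactive_ports() {
    awk '
        # "CA <name>" line: keep the device name without its quotes.
        /^CA / { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
        # "Port 1:" line (not "Port GUID:"): remember the port number.
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:$/, "", port) }
        # "State: <value>" line: report anything other than Active.
        $1 == "State:" && $2 != "Active" { print ca " port " port ": " $2 }
    '
}
```

Running `ibstat | inactive_ports` on each endpoint and getting empty output is one way to record that the link-state expectation was met.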
Success Criteria
Validate security and multi-tenant behavior without degrading operations access.
Prerequisites
Query partition and service state.
saquery -s
Expected: Service entries and partition records are present as expected.
Run intra-tenant connectivity checks.
On the target node: ibping -S
On the source node: ibping -G <target_guid>
Expected: Allowed tenant paths succeed with stable response.
Run cross-tenant validation check.
ibping -G <other_tenant_guid> (with ibping -S running on that node)
Expected: Disallowed path is blocked per policy design.
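The intra-tenant and cross-tenant checks above amount to comparing observed reachability against the policy. A sketch of that comparison follows; it assumes you record each test as a `src dst expected observed` line where the last two fields are `allow` or `block`. The function name `isolation_mismatches` and this result format are hypothetical.

```shell
# isolation_mismatches
# Reads "src dst expected observed" on stdin, where expected and observed
# are "allow" or "block" (a hypothetical result format), and prints each
# pair whose observed behavior contradicts the policy.
# Exit status is 0 when everything matches and 1 on any mismatch.
isolation_mismatches() {
    awk '
        NF >= 4 && $3 != $4 {
            printf "%s -> %s: expected %s, observed %s\n", $1, $2, $3, $4
            bad = 1
        }
        END { exit bad + 0 }
    '
}
```

Both directions of failure are reported: an allowed path that is blocked (an operations problem) and a disallowed path that succeeds (an isolation problem).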
Success Criteria
Study Sprint
| Day | Focus | Output |
|---|---|---|
| 1 | InfiniBand objective mapping and architecture review. | Domain 3 objective checklist. |
| 2 | UFM inventory, topology, and baseline health workflows. | UFM baseline capture template. |
| 3 | CLI diagnostics for link and path validation. | Core command runbook. |
| 4 | Bandwidth and latency test interpretation drills. | Interpretation matrix for common outcomes. |
| 5 | Partitioning and tenant isolation design. | Security and partition policy plan. |
| 6 | Congestion and hot-link triage simulation. | Congestion response flowchart. |
| 7 | Fabric change validation and rollback planning. | Change-validation checklist. |
| 8 | End-to-end workload communication validation. | Workload communication readiness report. |
| 9 | Timed troubleshooting scenario practice. | Exam-style remediation notes. |
| 10 | Final revision and weak-area closeout. | Domain 3 quick reference guide. |
Hands-on Labs
Each lab includes an execution sample with representative CLI usage and expected output.
Capture fabric topology and health state in UFM and validate consistency with expected design.
Sample Command (InfiniBand health baseline runbook)
ibstat
Expected output (example)
Ports are Active with expected physical and link layer attributes.
Use InfiniBand commands to isolate performance regression source.
Sample Command (InfiniBand health baseline runbook)
ibnetdiscover | head -n 40
Expected output (example)
Topology inventory aligns with expected node and switch map.
Verify multi-tenant controls are applied and enforceable.
Sample Command (InfiniBand health baseline runbook)
ufm_health --summary
Expected output (example)
Management summary highlights stable fabric state or specific warning categories.
Locate and remediate congestion hotspots with evidence-driven tuning.
Sample Command (Congestion and bottleneck triage runbook)
ib_write_bw -d mlx5_0 -F --report_gbits <peer_host>
Expected output (example)
Measured bandwidth indicates whether path performance meets baseline.
Exam Pitfalls
Practice Set
Attempt each question first, then open the answer and explanation.
Answer: B
The blueprint explicitly calls for configuring and validating InfiniBand with UFM.
Answer: B
Bottleneck isolation requires repeated evidence on the same path or component.
Answer: B
Security and multi-tenant controls are in scope and can directly impact connectivity outcomes.
Answer: B
Single-change validation preserves causality and avoids hidden regressions.
Answer: B
A cross-source evidence package supports accurate diagnosis and defensible remediation.
Answer: B
Transient improvements can hide unresolved root causes; repeated windows confirm stability.
Answer: B
Context-free diagnostics often misattribute symptoms and prolong remediation.
Answer: B
Readiness requires validated health, security, and performance behavior across the fabric.
Primary References
Curated from official NVIDIA NCP-AIN blueprint/study guide sources and primary InfiniBand/UFM documentation.