
NVIDIA InfiniBand Networking

Module study guide

Priority 2 of 6 · Domain 3 in exam order

Scope

Exam study content

This module contains expanded study notes, scenario playbooks, command runbooks, and exam-style checkpoint questions.

Exam weight
30%
Priority tier
Tier 1
Why this domain
High-weight operational domain for UFM-driven deployment, congestion-aware optimization, and secure fabric operations.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then proceed to performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.
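The single-variable discipline above can be sketched as a small before/after wrapper. `MEASURE_CMD` is a stand-in (an assumption, not a real probe); in practice it would wrap a fixed-peer bandwidth test so the workload input stays constant.

```shell
#!/bin/sh
# Sketch: single-variable change validation with constant workload input.
# MEASURE_CMD is a placeholder; substitute a real constant-workload probe.
MEASURE_CMD=${MEASURE_CMD:-"echo 95.0"}

before=$($MEASURE_CMD)   # evidence before the change
# >>> apply exactly ONE parameter change here <<<
after=$($MEASURE_CMD)    # evidence after the change

echo "before=${before} after=${after}"
# If "after" shows no reproducible benefit, discard the change.
```

Because only one parameter moved between the two measurements, any delta can be attributed to that change.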

Exam Scope Coverage

What this module now covers

Domain 3 focuses on NVIDIA InfiniBand operations with UFM-led configuration, validation, monitoring, congestion troubleshooting, and multi-tenant security controls.

Track 1: InfiniBand architecture essentials

You need to reason about InfiniBand behavior and management surfaces before tuning commands.

  • Understand subnet management and control-plane role in fabric health.
  • Map collective communication sensitivity to path and congestion behavior.
  • Separate host, switch, and management responsibilities in diagnostics.

Drill: Explain which InfiniBand components you would check first for cluster-wide communication regressions.

Track 2: UFM-based configuration and validation

Blueprint scope explicitly requires configuring and validating InfiniBand using UFM.

  • Use UFM for inventory, topology state, and policy-aware operations.
  • Validate post-change topology and link-state consistency.
  • Preserve rollback-capable evidence before large-scale updates.

Drill: Run a UFM health snapshot and identify one high-priority warning class.

Track 3: CLI monitoring and optimization

Command-line diagnostics remain critical for rapid fault isolation and exam-style troubleshooting.

  • Use InfiniBand CLI tools to verify link state, errors, and throughput behavior.
  • Correlate host-level and switch-level indicators when diagnosing bottlenecks.
  • Tune only after confirming topology and firmware baseline integrity.

Drill: Create a command sequence to isolate whether an issue is endpoint-, path-, or congestion-related.

Track 4: Security and multi-tenancy

Partitioning and access controls are required for multi-team AI clusters with shared infrastructure.

  • Apply partitioning strategy for tenant separation.
  • Validate management and operations access boundaries.
  • Audit policy consistency after fabric changes.

Drill: Draft a partition and access model for two tenant groups sharing one fabric.

Track 5: Congestion and bottleneck troubleshooting

A dedicated objective covers congestion and bottleneck optimization.

  • Differentiate transient spikes from sustained congestion.
  • Use targeted tests to locate hot links or misrouted traffic.
  • Confirm improvement with repeat validation, not one-time samples.

Drill: Design a congestion triage loop that ends with measurable acceptance criteria.

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.
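The evidence standard above can be sketched as a tiny wrapper function. The function name and log path are illustrative assumptions, not part of any official tooling; the point is that every key command leaves a timestamped, reproducible record.

```shell
#!/bin/sh
# Sketch of an evidence-producing wrapper (name and log path are
# illustrative): records command line, UTC timestamp, and output.
EVIDENCE_LOG=${EVIDENCE_LOG:-/tmp/fabric_evidence.log}

evidence() {
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  out=$("$@" 2>&1)
  {
    echo "=== ${ts}"
    echo "cmd: $*"
    echo "${out}"
  } >> "$EVIDENCE_LOG"
  printf '%s\n' "$out"
}

# Usage (shown with a portable command; substitute ibstat, perfquery, ...):
evidence uname -s
```

Another engineer can replay the log entry verbatim: the command, its timestamp, and its observed output are all captured together.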

Evidence stack for InfiniBand troubleshooting

Reliable diagnosis combines management-plane visibility with low-level path and endpoint diagnostics.

  • Start from topology state, then move to path-level counters.
  • Anchor diagnostics to workload time windows.
  • Record all remediation decisions with expected impact.

Congestion as a systems problem

Congestion often emerges from interacting workload, routing, and policy conditions rather than one broken link.

  • Measure persistence and locality of pressure before tuning.
  • Validate that tuning does not break isolation or reliability.
  • Retest under representative load after each change.

Partitioning with operations in mind

Security controls should enforce tenant boundaries while preserving controlled operations workflows.

  • Keep management access paths explicit and auditable.
  • Validate partition membership continuously, not only at creation.
  • Treat policy drift as ongoing operational risk.

Scenario Playbooks

Exam-style scenario explanations

Scenario: Fabric-level latency spikes under nightly training load

Nightly distributed training jobs exhibit intermittent latency spikes and reduced all-reduce efficiency.

Architecture Diagram

GPU Pods
   |
InfiniBand Fabric
   |-- Core Switches
   |-- Edge Switches
   |-- Management via UFM

Response Flow

  1. Capture UFM health and topology state during spike window.
  2. Run endpoint and path-level CLI checks for error and throughput signatures.
  3. Identify sustained hot links and policy/routing contributors.
  4. Apply one targeted adjustment and verify in next workload cycle.

Success Signals

  • Latency spikes reduce below threshold in repeated windows.
  • No regression in tenant isolation and management access.
  • Root cause and remediation are traceable in evidence log.

Port and link status

ibstat

Expected output (example)

All relevant ports report Active state and expected link rate.

Path diagnostics

ibdiagnet -v

Expected output (example)

No critical path errors; warnings are mapped to specific remediation candidates.
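Step 1 of the response flow depends on capturing evidence inside the spike window. A minimal gating helper might look like the following; the 02:00-04:00 UTC boundaries are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: only capture snapshots inside the nightly training window.
# Window boundaries are assumed values, not from the scenario itself.
WINDOW_START="02:00"
WINDOW_END="04:00"

in_window() {
  # lexicographic compare is valid for zero-padded HH:MM strings
  t=$1
  [ "$t" \> "$WINDOW_START" ] && [ "$t" \< "$WINDOW_END" ]
}

if in_window "$(date -u +%H:%M)"; then
  echo "in spike window: capture UFM health + CLI counters now"
else
  echo "outside window: no capture"
fi
```

Triggering captures only inside the window keeps snapshots anchored to the workload period being diagnosed.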

Scenario: New tenant cannot communicate with assigned resources

A newly provisioned tenant reports communication failures across assigned compute resources.

Architecture Diagram

Tenant A Partition
Tenant B Partition
   |
Shared InfiniBand Fabric
   |
UFM + Access Control

Response Flow

  1. Validate partition membership and access rules in management plane.
  2. Run endpoint-level checks from tenant nodes.
  3. Confirm no cross-tenant leakage exists while restoring expected connectivity.
  4. Record corrected policy and post-change validation output.

Success Signals

  • Tenant connectivity is restored within policy boundaries.
  • Cross-tenant isolation remains intact.
  • Runbook is updated with prevention checks for future onboarding.

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute the command with explicit scope (node, interface, GPU set).
  3. Compare output against the expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).
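The four steps above can be sketched as one script. `CHECK_CMD` and the `/tmp` paths are placeholders (assumptions); a real run would point `CHECK_CMD` at a scoped diagnostic such as ibstat.

```shell
#!/bin/sh
# Sketch of the four-step execution pattern with stand-in values.
BASE=/tmp/baseline_check.txt
CUR=/tmp/current_check.txt
CHECK_CMD=${CHECK_CMD:-"echo State: Active"}

[ -f "$BASE" ] || $CHECK_CMD > "$BASE"       # 1. capture baseline once
$CHECK_CMD > "$CUR"                          # 2. scoped execution
if diff -q "$BASE" "$CUR" > /dev/null; then  # 3. compare to signature
  decision=pass
else
  decision=investigate
fi
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) decision=$decision"   # 4. record
```

The recorded line gives a timestamped pass/investigate decision that can be appended to an evidence log.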

InfiniBand health baseline runbook

Capture baseline link and topology health before advanced troubleshooting.

Link and port state

ibstat

Expected output (example)

Ports are Active with expected physical and link layer attributes.

Fabric topology discovery

ibnetdiscover | head -n 40

Expected output (example)

Topology inventory aligns with expected node and switch map.

Management health snapshot

ufm_health --summary

Expected output (example)

Management summary highlights stable fabric state or specific warning categories.
  • Run before and after any fabric policy or routing changes.
  • Store snapshots with timestamps tied to workload events.
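The three runbook captures above can be combined into one timestamped snapshot, as in this sketch. The snapshot path is an assumption, and each tool is skipped gracefully when it is not installed on the host.

```shell
#!/bin/sh
# Sketch: capture the runbook outputs into one timestamped directory.
# Tool names come from the runbook above; missing tools are skipped.
SNAP="/tmp/ib_baseline_$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$SNAP"

for c in "ibstat" "ibnetdiscover" "ufm_health --summary"; do
  tool=${c%% *}                       # first word = executable name
  if command -v "$tool" > /dev/null 2>&1; then
    $c > "$SNAP/$tool.out" 2>&1
    echo "captured: $c"
  else
    echo "skipped (tool not installed): $c"
  fi
done
echo "snapshot stored in $SNAP"
```

Storing the directory name with a UTC timestamp makes it easy to pair each snapshot with a workload event window.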

Congestion and bottleneck triage runbook

Localize and remediate sustained bandwidth or latency bottlenecks.

Bandwidth probe

ib_write_bw -d mlx5_0 -F --report_gbits <peer_host>

Expected output (example)

Measured bandwidth indicates whether path performance meets baseline.

Performance counters

perfquery -x

Expected output (example)

Counter output reveals error trends and congestion-linked symptoms.

Path validation

saquery -s

Expected output (example)

Service and path queries return expected records for active fabric routes.
  • Re-run after each corrective action to validate improvement.
  • Avoid broad tuning without locating persistent hot links first.
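The re-run guidance above can be expressed as an acceptance loop: improvement counts only if every repeated window meets the threshold. `THRESHOLD_GBPS` and the `PROBE` stand-in are assumptions; in practice `PROBE` would wrap ib_write_bw against a fixed peer.

```shell
#!/bin/sh
# Sketch: repeat-validation acceptance loop for congestion tuning.
THRESHOLD_GBPS=90
RUNS=3
PROBE=${PROBE:-"echo 95"}   # stand-in returning a Gb/s reading

passes=0
i=1
while [ "$i" -le "$RUNS" ]; do
  bw=$($PROBE)
  bw_int=${bw%.*}           # integer part of the reading
  [ "$bw_int" -ge "$THRESHOLD_GBPS" ] && passes=$((passes + 1))
  i=$((i + 1))
done

if [ "$passes" -eq "$RUNS" ]; then
  echo "ACCEPT: ${passes}/${RUNS} windows >= ${THRESHOLD_GBPS} Gb/s"
else
  echo "RETUNE: only ${passes}/${RUNS} windows met the threshold"
fi
```

Requiring all windows to pass, rather than one, filters out transient improvements that would otherwise mask an unresolved root cause.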

Common Problems

Failure patterns and fixes

Intermittent all-reduce slowdown during peak windows

Symptoms

  • Latency spikes align with scheduled training bursts.
  • Bandwidth tests vary widely between runs.

Likely Cause

Sustained congestion on specific paths with insufficient targeted remediation.

Remediation

  • Identify persistent hot links using repeated counter snapshots.
  • Apply one scoped routing or policy adjustment.
  • Revalidate under same load profile.

Prevention: Integrate periodic congestion audit into production operations cadence.

Tenant partition misconfiguration after onboarding

Symptoms

  • Assigned resources are unreachable within tenant scope.
  • Policy audits show inconsistent partition membership.

Likely Cause

Partition policy not applied consistently across all endpoints.

Remediation

  • Audit partition membership from management and endpoint perspectives.
  • Correct mismatched assignments and retest communication.
  • Log final policy state for future onboarding template.

Prevention: Use automated onboarding checks for partition integrity and access validation.

Fabric appears healthy but workload still underperforms

Symptoms

  • Management dashboards show mostly green state.
  • Workload throughput remains below target.

Likely Cause

Health status lacks workload-context correlation; hidden path inefficiencies remain.

Remediation

  • Pair management snapshots with endpoint bandwidth tests.
  • Map symptoms to specific job windows and path segments.
  • Tune and retest with representative workload profile.

Prevention: Adopt workload-aware validation as standard post-change gate.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: UFM-led InfiniBand validation cycle

Use UFM and CLI diagnostics to validate fabric readiness and isolate candidate bottlenecks.

Prerequisites

  • UFM access with topology visibility.
  • Endpoint access for CLI diagnostics.
  • Known baseline performance targets.

  1. Capture management-plane health snapshot.

    ufm_health --summary

    Expected: Health summary identifies stable fabric or actionable warnings.

  2. Validate endpoint link state.

    ibstat

    Expected: Relevant ports are Active and configured as expected.

  3. Run topology/path diagnostics.

    ibdiagnet -v

    Expected: Critical path errors are absent or isolated to explicit links.

  4. Measure path bandwidth baseline.

    ib_write_bw -d mlx5_0 -F --report_gbits <peer_host>

    Expected: Bandwidth remains inside accepted baseline range.

Success Criteria

  • Fabric state aligns with expected topology and health criteria.
  • Performance checks are reproducible across two windows.
  • Any anomalies are mapped to prioritized remediation queue.

Walkthrough: Partition and isolation verification

Validate security and multi-tenant behavior without degrading operations access.

Prerequisites

  • At least two tenant groups configured.
  • Documented partition policy and expected access matrix.
  • Test nodes in each tenant partition.

  1. Query partition and service state.

    saquery -s

    Expected: Service entries and partition records are present as expected.

  2. Run intra-tenant connectivity checks.

ibping -S    (start the responder on the target node first)
ibping -G <target_guid>    (then run from the initiating node)

    Expected: Allowed tenant paths succeed with stable response.

  3. Run cross-tenant validation check.

ibping -G <other_tenant_guid>

    Expected: Disallowed path is blocked per policy design.

Success Criteria

  • Tenant communication follows policy matrix.
  • Management and diagnostics access remains functional.
  • Post-check policy record is saved for audit.
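The policy-matrix comparison in this walkthrough can be sketched as a small checker. The `probe` function here is a stand-in assumption; a real probe would attempt ibping between partition members and map the outcome to allow/deny.

```shell
#!/bin/sh
# Sketch: compare an expected tenant access matrix against observed
# probe results. probe() is a stand-in, not a real fabric check.
probe() {
  # stand-in behavior: intra-tenant reachable, cross-tenant blocked
  [ "$1" = "$2" ] && echo allow || echo deny
}

check() {
  src=$1; dst=$2; expect=$3
  observed=$(probe "$src" "$dst")
  if [ "$observed" = "$expect" ]; then
    echo "OK    $src -> $dst ($expect)"
  else
    echo "DRIFT $src -> $dst expected=$expect observed=$observed"
  fi
}

# Expected access matrix for two tenants sharing one fabric:
check tenantA tenantA allow
check tenantB tenantB allow
check tenantA tenantB deny
```

Any `DRIFT` line is a policy violation to remediate and record; running the checker after every fabric change catches policy drift early.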

Study Sprint

10-day execution plan

Day | Focus | Output
1 | InfiniBand objective mapping and architecture review | Domain 3 objective checklist
2 | UFM inventory, topology, and baseline health workflows | UFM baseline capture template
3 | CLI diagnostics for link and path validation | Core command runbook
4 | Bandwidth and latency test interpretation drills | Interpretation matrix for common outcomes
5 | Partitioning and tenant isolation design | Security and partition policy plan
6 | Congestion and hot-link triage simulation | Congestion response flowchart
7 | Fabric change validation and rollback planning | Change-validation checklist
8 | End-to-end workload communication validation | Workload communication readiness report
9 | Timed troubleshooting scenario practice | Exam-style remediation notes
10 | Final revision and weak-area closeout | Domain 3 quick reference guide

Hands-on Labs

Practical module work

Each lab includes a collapsed execution sample with representative CLI usage and expected output.

Lab A: UFM-driven fabric health baseline

Capture fabric topology and health state in UFM and validate consistency with expected design.

  • Inventory nodes, links, and switch states.
  • Record warnings and classify by severity.
  • Confirm topology consistency with documented architecture.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (InfiniBand health baseline runbook)

ibstat

Expected output (example)

Ports are Active with expected physical and link layer attributes.

Lab B: CLI-based throughput and error triage

Use InfiniBand commands to isolate performance regression source.

  • Run link and error counters on suspect paths.
  • Run bandwidth probe across candidate endpoints.
  • Correlate results with workload timing window.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Congestion and bottleneck triage runbook)

perfquery -x

Expected output (example)

Counter output reveals error trends and congestion-linked symptoms.

Lab C: Partition and tenant policy validation

Verify multi-tenant controls are applied and enforceable.

  • Apply partition policy and verify endpoint membership.
  • Test allowed and denied communication paths.
  • Review management access boundaries.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Congestion and bottleneck triage runbook)

saquery -s

Expected output (example)

Service and path queries return expected records for active fabric routes.

Lab D: Congestion hotspot investigation

Locate and remediate congestion hotspots with evidence-driven tuning.

  • Identify sustained high-pressure links.
  • Apply targeted adjustment and rerun validation tests.
  • Confirm stability across multiple windows.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Congestion and bottleneck triage runbook)

ib_write_bw -d mlx5_0 -F --report_gbits <peer_host>

Expected output (example)

Measured bandwidth indicates whether path performance meets baseline.

Exam Pitfalls

Common failure patterns

  • Assuming UFM green status means every path is workload-ready.
  • Skipping endpoint-level checks and blaming fabric globally.
  • Changing multiple congestion levers simultaneously.
  • Ignoring partition policy drift in shared environments.
  • Accepting one successful test as proof of sustained stability.
  • Troubleshooting without timestamped evidence across tools.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. Why is UFM central to this domain?
  • A. It replaces all CLI tools
  • B. It provides topology/health control surface aligned to blueprint objectives
  • C. It only changes host BIOS settings
  • D. It is optional for validation

Answer: B

The blueprint explicitly calls for configuring and validating InfiniBand with UFM.

Q2. What is a high-confidence sign of fabric bottleneck localization?
  • A. One ping timeout
  • B. Repeated high-pressure counters and throughput degradation on specific links
  • C. Random log warning
  • D. User complaint without metrics

Answer: B

Bottleneck isolation requires repeated evidence on the same path or component.

Q3. Why should partition policy validation be included in troubleshooting?
  • A. Security is unrelated to performance
  • B. Policy errors can appear as communication failures or route anomalies
  • C. Partitions only affect storage
  • D. It is not part of exam scope

Answer: B

Security and multi-tenant controls are in scope and can directly impact connectivity outcomes.

Q4. Which approach best handles congestion optimization?
  • A. Tune everything at once
  • B. Apply one change, validate, and compare against baseline
  • C. Restart cluster repeatedly
  • D. Ignore telemetry

Answer: B

Single-change validation preserves causality and avoids hidden regressions.

Q5. What is the strongest evidence package for exam-style incident response?
  • A. Memory of previous outage
  • B. UFM snapshot, CLI counters, workload symptom timeline, and remediation result
  • C. Email thread
  • D. Screenshot of one metric

Answer: B

A cross-source evidence package supports accurate diagnosis and defensible remediation.

Q6. Why are repeated validation windows important after congestion tuning?
  • A. They are unnecessary
  • B. They confirm improvements persist beyond one transient window
  • C. They only help with documentation
  • D. They reduce cluster size

Answer: B

Transient improvements can hide unresolved root causes; repeated windows confirm stability.

Q7. What is a frequent anti-pattern when using CLI diagnostics?
  • A. Comparing endpoint and fabric outputs
  • B. Running tools without workload context and drawing immediate conclusions
  • C. Keeping timestamps
  • D. Validating after changes

Answer: B

Context-free diagnostics often misattribute symptoms and prolong remediation.

Q8. Which outcome best indicates successful Domain 3 readiness?
  • A. One link is healthy
  • B. UFM and CLI validations show stable performance, secure partitioning, and resolved congestion risks
  • C. Documentation is complete
  • D. Cluster has no users

Answer: B

Readiness requires validated health, security, and performance behavior across the fabric.

Primary References

Curated from official NVIDIA NCP-AIN blueprint/study guide sources and primary InfiniBand/UFM documentation.

Objectives

  • Explain architecture and technologies of InfiniBand.
  • Configure and validate InfiniBand network by using UFM.
  • Monitor and optimize network by using command line (CLI).
  • Configure and validate security and multi-tenant network.
  • Troubleshoot and optimize network performance.
  • Troubleshoot and optimize network bottlenecks and congestion.
