Troubleshooting Tools

Module study guide

Priority 3 of 6 · Domain 5 in exam order

Scope

Exam study content

This module contains expanded study notes, scenario playbooks, command runbooks, and exam-style checkpoint questions.

Exam weight: 20%
Priority tier: Tier 2
Why this domain: High-value scenario domain focused on layered diagnostics across switches, hosts, and fabric control planes.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.
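The before/after discipline can be sketched as a quick counter diff. The file names and two-column format (interface, drop count) below are illustrative assumptions, not a real switch export:

```shell
# Hypothetical before/after comparison of per-interface drop counters.
# File format (interface drops) and values are illustrative only.
cat > before.txt <<'EOF'
swp1 120
swp2 0
EOF
cat > after.txt <<'EOF'
swp1 340
swp2 0
EOF
# Join on interface name; print only interfaces whose drops grew.
join before.txt after.txt | awk '$3 > $2 { print $1, "delta=" ($3 - $2) }'
```

With a constant workload input, a nonzero delta on exactly one interface is the kind of reproducible evidence that justifies keeping a change.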

Exam Scope Coverage

What this module now covers

This module focuses on the Troubleshooting Tools domain: using switch and host diagnostics, CM trace/Cumulus CLI/BERT workflows, packet evidence, and benchmark interpretation to isolate and remediate AI network issues.

Track 1: Layered troubleshooting model

High-severity incidents are resolved faster when symptoms are classified by layer before tuning.

  • Start with symptom classification: host, switch, fabric, or policy.
  • Use deterministic command order and timestamp evidence.
  • Correlate telemetry with workload windows before declaring root cause.

Drill: Build a one-page triage flow from application symptom to network-layer candidate.

Track 2: Switch-side diagnostics (Cumulus CLI and BERT)

Switch health and link-quality evidence often reveals root cause before host changes are attempted.

  • Validate interface states, counters, errors, and queue pressure first.
  • Use BERT and optics/link diagnostics to isolate physical-layer instability.
  • Confirm route/policy correctness before performance tuning.

Drill: Collect switch evidence for one degraded link and propose first remediation step.
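As a rough sketch of BERT-style reasoning, a bit-error-rate verdict is just errors divided by bits tested, compared against a threshold. The numbers and the 1e-12 threshold here are assumptions for illustration, not output from a real BERT run:

```shell
# Illustrative BER verdict: bits tested, error count, and the 1e-12
# threshold are sample assumptions, not real BERT output.
awk 'BEGIN {
  bits_tested = 5.0e13
  bit_errors  = 120
  ber = bit_errors / bits_tested
  printf "ber=%.2e verdict=%s\n", ber, (ber > 1e-12 ? "FAIL" : "PASS")
}'
```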

Track 3: Host-side diagnostics and end-to-end path checks

Host NIC/runtime conditions can mimic fabric-level failures.

  • Use host tools to verify NIC health, drops, offload state, and interface errors.
  • Confirm endpoint reachability and throughput from both ends of a path.
  • Separate host-local bottlenecks from shared-fabric congestion.

Drill: Run host diagnostics on two nodes and identify whether the issue is local or network-wide.
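The host-side filter pattern can be rehearsed offline against a saved statistics dump. The counter names below mirror common ethtool -S output but are illustrative, not from a real NIC:

```shell
# Filter a saved NIC statistics dump for drop/error/timeout counters.
# Counter names and values are illustrative assumptions.
cat > nic-stats.txt <<'EOF'
     rx_packets: 182938
     rx_dropped: 412
     tx_errors: 0
     tx_timeout: 3
EOF
grep -E 'drop|error|timeout' nic-stats.txt
```

Running the same filter on dumps from both endpoints of a path is a cheap way to separate host-local defects from shared-fabric symptoms.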

Track 4: Packet and benchmark evidence

Packet captures and benchmark tools provide objective proof, not assumptions.

  • Use tcpdump and counters to confirm packet loss, retransmit, or policy drops.
  • Use repeatable benchmark windows for before/after remediation validation.
  • Track latency percentile and throughput changes together.

Drill: Capture and compare packet and benchmark evidence before and after one change.
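Tracking latency percentiles can be sketched with coreutils alone. The sample values and the simple nearest-rank percentile method are illustrative assumptions:

```shell
# Rough p50/p99 from a file of latency samples (ms), one value per line.
# Sample data and the nearest-rank method are illustrative.
printf '%s\n' 1.2 1.3 1.1 9.8 1.2 1.4 1.3 1.2 1.5 1.3 > lat.txt
sort -n lat.txt | awk '
  { v[NR] = $1 }
  END {
    i50 = int(NR * 0.50); if (i50 < 1) i50 = 1
    i99 = int(NR * 0.99); if (i99 < 1) i99 = 1
    print "p50=" v[i50], "p99=" v[i99]
  }'
```

Comparing the same percentile pair before and after a change keeps latency and throughput evidence on one footing.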

Track 5: Congestion and bottleneck troubleshooting

Bottleneck and congestion scenarios are explicit blueprint expectations.

  • Identify sustained hot links versus short burst events.
  • Apply one controlled change at a time and validate impact.
  • Use rollback-safe plans when tuning buffer/queue behavior.

Drill: Design a congestion remediation plan with measurable acceptance criteria.
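Separating sustained hot links from short bursts can be sketched by counting how many samples exceed a threshold. The utilization samples and the 90% hot threshold are illustrative assumptions:

```shell
# Classify links from saved utilization samples (interface, percent).
# Data and the 90% hot threshold are illustrative assumptions.
cat > util.txt <<'EOF'
swp1 92
swp1 95
swp1 91
swp2 97
swp2 12
swp2 10
EOF
awk '{ n[$1]++; if ($2 >= 90) hot[$1]++ }
     END { for (i in n)
             print i, (hot[i] == n[i] ? "sustained-hot" : (hot[i] ? "bursty" : "normal")) }' \
    util.txt | sort
```

Only the sustained-hot class justifies structural remediation; bursty links call for repeated-window observation first.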


Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.
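A stable command template can be as small as a wrapper that stamps each command with a UTC timestamp. The function name and header format here are assumptions, not part of any official tooling:

```shell
# Minimal evidence-capture wrapper: prints a UTC timestamp and the exact
# command line before the command's own output. Naming is illustrative.
evidence() {
  printf '=== %s UTC | %s ===\n' "$(date -u +%Y-%m-%dT%H:%M:%S)" "$*"
  "$@"
}

# Usage: wrap any read-only diagnostic command.
evidence echo "link check placeholder"
```

Because every capture carries the command line and a timestamp, another engineer can replay the same step in the same order.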

Evidence hierarchy in incident response

Symptoms become actionable only when anchored to objective evidence from multiple layers.

  • Correlate workload behavior with network counters and packet data.
  • Capture command output with timestamps for replayability.
  • Prioritize reversible actions during uncertainty.

Switch and host parity checks

Cross-validating switch and host views prevents false conclusions from one-sided telemetry.

  • If counters disagree, verify collection windows and interface mapping.
  • Host-local defects can present as network-wide symptoms.
  • Consistent two-sided checks reduce misdiagnosis risk.

Troubleshooting as controlled experimentation

Each remediation is an experiment with a hypothesis, action, and measurable outcome.

  • Define expected change before execution.
  • Reject fixes without repeatable improvement.
  • Keep rollback trigger criteria explicit.

Scenario Playbooks

Exam-style scenario explanations

Scenario: Throughput drops 25% during nightly distributed training

Training jobs started failing SLO after a maintenance window; throughput dropped and latency spikes increased.

Architecture Diagram

Training Nodes
   |
Leaf-Spine Fabric
   |
Storage and Services

Response Flow

  1. Capture workload timeline and identify exact degradation window.
  2. Collect switch counters and queue pressure at suspected links.
  3. Run host-side NIC and interface checks on source/destination nodes.
  4. Apply one corrective action and rerun benchmark validation.

Success Signals

  • Throughput returns within baseline tolerance.
  • Queue pressure and error counters normalize.
  • No new policy or routing regressions are introduced.

Switch interface and counter snapshot

nv show interface && nv show interface counters

Expected output (example)

Counter trends identify whether errors/drops are localized.

Host-level NIC statistics

ethtool -S <nic> | egrep 'drop|error|timeout'

Expected output (example)

Host-side drops/errors confirm or reject endpoint-local cause.

Scenario: Intermittent packet loss appears after tenant policy change

A security policy update introduced intermittent communication failure for one tenant during peak hours.

Architecture Diagram

Tenant A/B Workloads
     |
Policy Controls + Fabric
     |
Shared Services

Response Flow

  1. Validate policy diff and intended allow/deny matrix.
  2. Capture packets at affected host interfaces during failure.
  3. Correlate packet/counter evidence with policy hit paths.
  4. Patch policy minimally and rerun end-to-end validation.

Success Signals

  • Packet loss is eliminated for allowed traffic paths.
  • Denied paths remain blocked by design.
  • Incident report includes policy diff and packet evidence.

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute command with explicit scope (node, interface, GPU set).
  3. Compare output against expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).
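Steps 3 and 4 of this pattern can be sketched as a comparison that records a timestamped decision. The counter name, values, and decision labels are illustrative assumptions:

```shell
# Compare an observed counter against an expected baseline and log a
# timestamped decision. All names and values are illustrative.
expected=0
observed=412
if [ "$observed" -le "$expected" ]; then decision=pass; else decision=investigate; fi
printf '%s iface=swp1 rx_dropped expected<=%s observed=%s decision=%s\n' \
  "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$expected" "$observed" "$decision"
```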

Runbook: Switch and host baseline collection

Collect deterministic pre-change evidence for high-confidence root-cause analysis.

Switch route and interface state

nv show interface && nv show route

Expected output (example)

State matches expected topology and policy intent.

Host interface and errors

ip -s link show <nic> && ethtool -S <nic>

Expected output (example)

Error/drop patterns are identified with clear directionality.
  • Always compare to known-good baseline or previous window.
  • Record both source and destination endpoint evidence.

Runbook: Packet and benchmark validation

Prove remediation with packet-level evidence and workload impact checks.

Packet capture at affected endpoint

sudo tcpdump -i <nic> host <peer_ip> -w incident-window.pcap

Expected output (example)

Capture supports packet-loss/retransmit or policy-drop analysis.

Benchmark confirmation after fix

iperf3 -c <peer_ip> -P 8 -t 30

Expected output (example)

Throughput and stability align with target baseline range.
  • Use the same benchmark profile for before/after comparison.
  • Store pcap and benchmark logs with incident timestamps.

Common Problems

Failure patterns and fixes

False root cause due to single-layer troubleshooting

Symptoms

  • Repeated fixes fail to stabilize performance.
  • Metrics conflict between host and switch views.

Likely Cause

Diagnosis relied on one telemetry source and skipped cross-layer validation.

Remediation

  • Rebuild incident timeline with switch, host, and workload evidence.
  • Validate data collection window consistency.
  • Re-run triage using layered decision tree.

Prevention: Standardize incident templates requiring evidence from all critical layers.
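Rebuilding the incident timeline can be as simple as merging per-layer logs by ISO-8601 timestamp, which sorts lexically. The file names, dates, and log line format are illustrative assumptions:

```shell
# Merge per-layer evidence logs into a single timeline. ISO-8601
# timestamps sort lexically, so plain sort orders events correctly.
# File names, dates, and log format are illustrative assumptions.
cat > switch.log <<'EOF'
2024-05-01T02:10:00Z switch swp1 rx_dropped grew by 220
EOF
cat > host.log <<'EOF'
2024-05-01T02:09:30Z host nic0 tx_timeout grew by 3
EOF
sort switch.log host.log
```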

Congestion returns after temporary mitigation

Symptoms

  • Short-term improvement followed by recurring queue pressure.
  • Throughput variance increases again within days.

Likely Cause

Mitigation addressed the symptom but not the underlying topology or path bottleneck.

Remediation

  • Identify persistent hot links and path imbalance.
  • Apply targeted structural remediation with staged rollout.
  • Validate in repeated peak windows.

Prevention: Use repeated-window validation and long-horizon trend monitoring.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: End-to-end troubleshooting evidence capture

Build complete evidence package for a network performance regression.

Prerequisites

  • Access to one affected workload and test window.
  • Switch CLI and host shell access.
  • Benchmark and packet capture tooling available.
  1. Capture baseline switch and host counters.

    nv show interface counters && ip -s link show <nic>

    Expected: Pre-change evidence captured with timestamps.

  2. Capture packet sample during incident window.

    sudo tcpdump -i <nic> -w pre-fix.pcap

    Expected: Packet file includes representative failure window.

  3. Apply one scoped remediation and rerun benchmark.

    iperf3 -c <peer_ip> -P 8 -t 30

    Expected: Post-fix benchmark shows measurable improvement.

Success Criteria

  • Root cause and remediation are linked with evidence.
  • Performance and stability metrics improve against baseline.
  • Rollback plan is documented if regression reappears.

Walkthrough: Congestion hotspot isolation

Identify and remediate one sustained congestion hotspot safely.

Prerequisites

  • Topology map with known high-traffic links.
  • Queue and interface telemetry access.
  • Permission for one controlled configuration change.
  1. Locate persistent high-pressure interfaces.

    nv show interface counters

    Expected: Hotspot links are clearly identified.

  2. Validate endpoint behavior and route consistency.

    ip route show && ethtool -S <nic>

    Expected: Endpoint and path evidence supports congestion hypothesis.

  3. Apply one remediation and revalidate over repeated windows.

    Expected: Queue pressure and latency variance reduce sustainably.

Success Criteria

  • Hotspot root cause is confirmed with cross-layer evidence.
  • Post-fix behavior remains stable across multiple runs.
  • No unintended policy or isolation regressions occur.

Study Sprint

10-day execution plan

Day | Focus | Output
1 | Domain objective mapping and troubleshooting framework setup | Layered triage decision tree
2 | Switch CLI and interface counter diagnostics drills | Switch evidence checklist
3 | Host diagnostics and endpoint path validation | Host-vs-fabric isolation worksheet
4 | Packet capture techniques for policy/loss verification | tcpdump evidence template
5 | Benchmark interpretation for regression detection | Before/after benchmark validation matrix
6 | Congestion and queue-pressure incident simulations | Congestion runbook
7 | Physical layer checks and BERT workflow | Link-quality validation sheet
8 | Cross-layer incident drill (switch + host + policy) | Integrated incident timeline
9 | Timed exam-style troubleshooting scenarios | Scenario answer templates with command evidence
10 | Final command recall and remediation checklist review | Troubleshooting Tools quick revision sheet

Hands-on Labs

Practical module work

Each lab includes a collapsed execution sample with representative CLI usage and expected output.

Lab A: Switch-first incident triage

Isolate whether issue originates in interface/link/queue state.

  • Capture baseline switch interface and error counters.
  • Collect queue/utilization evidence during workload peak.
  • Classify issue as physical, route/policy, or congestion-related.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Switch and host baseline collection)

nv show interface && nv show route

Expected output (example)

State matches expected topology and policy intent.

Lab B: Host and path diagnostics

Differentiate host-local NIC issues from shared network faults.

  • Run host NIC and interface checks on source and destination nodes.
  • Execute controlled path validation tests.
  • Correlate endpoint evidence with switch telemetry.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Switch and host baseline collection)

ip -s link show <nic> && ethtool -S <nic>

Expected output (example)

Error/drop patterns are identified with clear directionality.

Lab C: Packet evidence and policy validation

Use packet capture and counters to prove drop cause.

  • Capture traffic at host edge during failure window.
  • Map packet behavior to policy and route expectations.
  • Document root cause with packet + counter evidence.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Packet and benchmark validation)

sudo tcpdump -i <nic> host <peer_ip> -w incident-window.pcap

Expected output (example)

Capture supports packet-loss/retransmit or policy-drop analysis.

Lab D: Congestion remediation validation

Apply one tuning change and confirm measurable improvement.

  • Define baseline latency/throughput/counter metrics.
  • Apply one targeted remediation and rerun workload.
  • Validate improvement and absence of side effects.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Packet and benchmark validation)

iperf3 -c <peer_ip> -P 8 -t 30

Expected output (example)

Throughput and stability align with target baseline range.

Exam Pitfalls

Common failure patterns

  • Skipping baseline capture and troubleshooting by guesswork.
  • Changing multiple controls at once and losing causality.
  • Using only dashboard views without CLI/packet validation.
  • Assuming host issues are always fabric-wide incidents.
  • Calling a fix complete after one successful test window.
  • Ignoring policy and segmentation checks during performance incidents.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. What is the strongest first action in a new performance incident?
  • A. Tune queue settings immediately
  • B. Capture baseline evidence across switch, host, and workload surfaces
  • C. Restart all services
  • D. Rebuild cluster configuration

Answer: B

Baseline evidence prevents misattribution and makes remediation outcomes measurable.

Q2. Why is single-variable change discipline essential in troubleshooting?
  • A. It reduces command count
  • B. It preserves cause-and-effect attribution
  • C. It disables rollback
  • D. It removes need for logging

Answer: B

Without single-variable changes, you cannot identify which action created the observed result.

Q3. Which tool is most directly aligned with packet-level debugging?
  • A. tcpdump
  • B. scheduler UI
  • C. BIOS splash screen
  • D. package manager

Answer: A

tcpdump provides packet-level visibility needed to validate drops, retransmits, and flow behavior.

Q4. What does sustained high queue pressure usually indicate?
  • A. Normal idle behavior
  • B. Potential bottleneck or congestion on the path
  • C. DNS-only issue
  • D. Time synchronization error

Answer: B

Persistent queue growth is a strong congestion/bottleneck signal requiring path-level investigation.

Q5. Why combine benchmark output with switch/host counters?
  • A. To increase command volume
  • B. To connect workload impact with infrastructure root cause
  • C. To avoid policy validation
  • D. To replace packet capture

Answer: B

Benchmark deltas alone show symptom; counters and telemetry show where and why.

Q6. Which statement best reflects troubleshooting completion?
  • A. Ticket closed by timestamp
  • B. Root cause identified, fix applied, and repeated validation confirms stability
  • C. One ping succeeds
  • D. Dashboard color changed to green

Answer: B

Completion requires reproducible evidence that remediation resolved the issue safely.

Q7. What is a common anti-pattern in congestion incidents?
  • A. Measuring pre/post metrics
  • B. Applying broad config changes without localized evidence
  • C. Reviewing queue counters
  • D. Testing both source and destination hosts

Answer: B

Unscoped changes often add new variables and delay true root-cause isolation.

Q8. Which blueprint scope item belongs directly to this domain?
  • A. Use CM trace, Cumulus CLI, and BERT for troubleshooting
  • B. Build LLM prompts
  • C. Only install Kubernetes control-plane
  • D. Replace all user workflows with automation

Answer: A

That tooling stack is explicitly listed under the Troubleshooting Tools domain.

Primary References

Curated from official NVIDIA NCP-AIN blueprint/study guide sources and primary troubleshooting/tooling documentation.

Objectives

  • Use CM trace, Cumulus Linux command line, and BERT for troubleshooting.
  • Troubleshoot and optimize network performance by using host and switch tools.
  • Troubleshoot and optimize network bottlenecks and congestion.
  • Explain and validate network and infrastructure benchmarking tools.
  • Perform packet-level debugging with tcpdump and interface counters.
