Training / NCP-AII

Troubleshoot and Optimize

Module study guide

Priority 4 of 5 · Domain 5 in exam order

Scope

Exam study content

This module contains expanded study notes, practical drills, and an exam-style question set.

Exam weight: 12%
Priority tier: Tier 2
Why this domain: Operational excellence domain for fault isolation, replacement workflows, and performance tuning.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.
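The single-variable rule above can be sketched as a small shell check: a change is kept only if every repeat run beats the baseline. The `keep_change` helper, the score values, and the 2% threshold are illustrative assumptions.

```shell
# Keep a single tuning change only if every repeat run beats the
# baseline by a margin. Scores are throughput-style numbers (higher is
# better); the 2% threshold is an illustrative assumption.
keep_change() {
  baseline="$1"; shift
  for run in "$@"; do
    ok=$(awk -v b="$baseline" -v r="$run" 'BEGIN { print ((r >= b * 1.02) ? 1 : 0) }')
    if [ "$ok" -ne 1 ]; then echo "discard"; return 1; fi
  done
  echo "keep"
}

keep_change 100 104 99 105   # one run regressed   -> prints "discard"
keep_change 100 104 105 103  # reproducible gain   -> prints "keep"
```

The same pattern applies whether the score comes from a training-throughput counter or a storage benchmark, as long as the workload input is held constant.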

Exam Scope Coverage

What this module now covers

Domain 5 focuses on fault isolation and performance recovery: diagnosing hardware issues, replacing faulty components safely, and optimizing host + storage behavior for sustained cluster performance.

Track 1: Fault-domain triage framework

Fast isolation across hardware, firmware, driver, and fabric layers is critical in production incidents.

  • Start with symptom classification: compute, thermal, power, network, or storage.
  • Use deterministic triage order to avoid random command-driven debugging.
  • Correlate telemetry across host and fabric before replacing hardware.

Drill: Build an incident triage flowchart mapping symptom patterns to likely fault domains.
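The triage flowchart drill can start from a keyword-to-domain map like the sketch below; the `classify_symptom` helper and its keyword table are illustrative assumptions, not an official taxonomy.

```shell
# Map a raw symptom string to a likely first fault domain to
# investigate. The keyword table is an illustrative assumption.
classify_symptom() {
  s=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$s" in
    *xid*|*ecc*|*nvrm*)              echo "compute" ;;
    *thermal*|*throttl*|*overtemp*)  echo "thermal" ;;
    *power*|*psu*|*brownout*)        echo "power" ;;
    *nccl*|*rdma*|*"link down"*)     echo "network" ;;
    *iowait*|*nvme*|*mount*|*fio*)   echo "storage" ;;
    *)                               echo "unclassified" ;;
  esac
}

classify_symptom "NVRM: Xid 79 reported"      # -> compute
classify_symptom "GPU thermal throttling"     # -> thermal
classify_symptom "NCCL ring timeout on node"  # -> network
```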

Track 2: GPU and host health diagnostics

Exam tasks include identifying faulty GPUs/cards/power components, not only software errors.

  • Use standardized GPU telemetry and diagnostics to detect thermal, ECC, and stability issues.
  • Differentiate transient software errors from persistent hardware fault signatures.
  • Capture pre-replacement evidence to support RMA or maintenance decisions.

Drill: Run a structured GPU diagnostic pass and produce a replace vs observe decision memo.
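A minimal replace-vs-observe decision rule might look like the sketch below; `replace_or_observe`, its inputs, and its thresholds are illustrative assumptions, not RMA policy.

```shell
# Decide replace vs observe from two evidence inputs: the
# uncorrectable-ECC count and whether a focused stress test reproduced
# the fault. Thresholds are illustrative assumptions.
replace_or_observe() {
  uncorrectable="$1"; reproduced="$2"
  if [ "$uncorrectable" -gt 0 ] && [ "$reproduced" = "yes" ]; then
    echo "replace"
  elif [ "$reproduced" = "yes" ]; then
    echo "investigate"
  else
    echo "observe"
  fi
}

replace_or_observe 4 yes  # persistent hardware signature -> replace
replace_or_observe 0 yes  # reproducible, no ECC evidence -> investigate
replace_or_observe 0 no   # keep watching telemetry       -> observe
```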

Track 3: Component replacement workflow discipline

Safe replacement procedures reduce repeat incidents and post-maintenance regressions.

  • Define replacement prerequisites: evidence, approvals, spares, rollback and validation plan.
  • Re-validate firmware/driver alignment after component swaps.
  • Use post-replacement burn-in checks before returning the node to service.

Drill: Write a component replacement SOP with evidence capture, execution controls, and exit criteria.

Track 4: CPU platform optimization (AMD/Intel server context)

The blueprint includes optimization workflows on mixed server CPU platforms.

  • Treat BIOS/NUMA/power-profile tuning as workload-dependent, not one-size-fits-all.
  • Benchmark after each tuning change to avoid placebo or regression effects.
  • Validate CPU and memory policy interaction with GPU pipeline performance.

Drill: Create a CPU tuning experiment table with one change per run and measured impact.
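An experiment table for this drill can be generated as CSV rows, one change per run; the `log_run` helper, the change names, metric values, and file path are illustrative assumptions.

```shell
# One-change-per-run experiment log: each row records a single tuning
# change plus its measured impact versus baseline. Change names, metric
# values, and the file path are illustrative assumptions.
table=/tmp/cpu_tuning_runs.csv
echo "run,change,metric_before,metric_after,delta_pct" > "$table"
log_run() {
  awk -v r="$1" -v c="$2" -v b="$3" -v a="$4" \
    'BEGIN { printf "%s,%s,%s,%s,%.1f\n", r, c, b, a, (a - b) / b * 100 }' >> "$table"
}

log_run 1 "numa_interleave=off" 100 108
log_run 2 "cstates=disabled" 100 99
cat "$table"
```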

Track 5: Storage path optimization

Storage bottlenecks frequently appear as training or inference slowdowns.

  • Separate storage latency issues from network or compute issues during triage.
  • Use storage benchmarks and telemetry for evidence-driven tuning.
  • Evaluate direct storage-to-GPU paths where supported.

Drill: Run storage baseline tests and produce a prioritized tuning backlog.
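A baseline test result can be turned into a tuning-backlog signal by parsing the summary line; the sample fio-style line and the 5.0 GiB/s target below are illustrative assumptions.

```shell
# Parse the READ bandwidth out of an fio-style summary line and flag it
# against a target. The sample line and the 5.0 GiB/s target are
# illustrative assumptions.
fio_summary='READ: bw=8.4GiB/s (9.0GB/s), io=8192MiB'
read_bw=$(printf '%s\n' "$fio_summary" | sed -n 's/.*bw=\([0-9.]*\)GiB\/s.*/\1/p')
status=$(awk -v b="$read_bw" 'BEGIN { print ((b >= 5.0) ? "ok" : "tune") }')
echo "read_bw=${read_bw}GiB/s status=$status"
```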

Track 6: Verification after remediation

Troubleshooting is incomplete without proving the fix holds under realistic load.

  • Require post-fix validation with representative workload and duration.
  • Track recurrence indicators and define rollback/escalation thresholds.
  • Update runbooks and knowledge base from incident outcomes.

Drill: Create a post-remediation validation template with recurrence watchpoints.
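A recurrence watchpoint can be a simple signature count over the validation log window; the sample log lines and the signature list below are illustrative assumptions.

```shell
# Post-remediation recurrence check: count known failure signatures in
# a captured log window; zero hits passes the validation gate. Sample
# log lines and the signature list are illustrative assumptions.
log_window='kernel: clean boot
app: job 42 completed
kernel: eth0 link up'
hits=$(printf '%s\n' "$log_window" | grep -c -i -E 'xid|uncorrectable|aer' || true)
if [ "$hits" -eq 0 ]; then
  echo "validation: pass"
else
  echo "validation: escalate ($hits signature hits)"
fi
```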

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.

Evidence-first triage model

Reliable troubleshooting starts with symptom-to-domain mapping before remediation action.

  • Classify incident by compute, thermal/power, network, storage, or mixed signals.
  • Use smallest set of high-signal diagnostics first.
  • Preserve evidence before component replacement.

Replace vs tune vs rollback decisions

Operational excellence requires choosing the right remediation path based on confidence and blast radius.

  • Replace components only when evidence supports hardware fault signatures.
  • Use tuning when behavior is stable but below expected performance target.
  • Rollback after high-risk changes when confidence in root cause is low.

Post-fix verification and recurrence control

A fix is incomplete until it holds under representative workload duration and no recurrence indicators remain.

  • Validate with the same workload class that exposed the issue.
  • Track recurrence watchpoints for at least one stability window.
  • Update runbooks and alerts with discovered failure signatures.

Scenario Playbooks

Exam-style scenario explanations

Scenario A: Intermittent GPU Xid errors under load

A node passes smoke tests but emits intermittent Xid-like failures during sustained workload periods.

Architecture Diagram

[Workload Scheduler] -> [GPU Node]
                      |
               [Telemetry + Kernel Logs]
                      |
                [Decision: replace/tune/rollback]

Response Flow

  1. Capture kernel and GPU telemetry around failure windows.
  2. Correlate errors with thermal, power, and job characteristics.
  3. Run focused stress test to reproduce pattern.
  4. Decide replace vs observe based on recurrence and severity.

Success Signals

  • Root cause confidence is documented before action.
  • Post-remediation workload run is stable through monitoring window.
  • Incident signature added to runbook and alert policy.

Kernel error scan

dmesg | rg -i 'xid|nvrm|pcie' | tail -n 30

Expected output (example)

[NVRM] Xid ...
[NVRM] Channel exception ... (example)

Scenario B: Throughput regression after platform tuning changes

Performance dropped after combined BIOS and runtime tuning changes on a mixed AMD/Intel fleet.

Architecture Diagram

[Baseline Config] -> [Change Set A+B+C] -> [Regression]
      |                     |
  benchmark pack       unclear causality
      \_________________rollback + one-change retest________________/

Response Flow

  1. Rollback to baseline and verify regression disappears.
  2. Re-apply tuning one change at a time with controlled benchmarks.
  3. Keep only changes with reproducible improvement.
  4. Update platform-specific tuning matrix for AMD and Intel groups.

Success Signals

  • Regression source isolated to specific change.
  • Final tuned profile shows stable gain over baseline.
  • No unintended side effects in adjacent workloads.

CPU/memory context snapshot

lscpu && numactl --hardware

Expected output (example)

Architecture: x86_64
NUMA node(s): 2
node distances: ...

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute the command with explicit scope (node, interface, GPU set).
  3. Compare output against the expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).
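The execution pattern above can be folded into one evidence-producing wrapper; `run_with_evidence` and the evidence-file path are illustrative assumptions.

```shell
# Evidence wrapper sketch: every command run through it is recorded
# with a UTC timestamp, the command line, exit code, and captured
# output. The evidence-file path is an illustrative assumption.
EVIDENCE_LOG="${EVIDENCE_LOG:-/tmp/evidence.log}"
run_with_evidence() {
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  out=$("$@" 2>&1); rc=$?
  printf '%s | rc=%s | cmd=%s\n%s\n---\n' "$ts" "$rc" "$*" "$out" >> "$EVIDENCE_LOG"
  printf '%s\n' "$out"
  return "$rc"
}

run_with_evidence echo "baseline captured"
```

Because the record includes the exit code and raw output, another engineer can replay the decision from the evidence file alone.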

Fault isolation command runbook

Use this compact set to classify incidents quickly without command sprawl.

GPU health details

nvidia-smi -q -d TEMPERATURE,POWER,UTILIZATION,ECC,PERFORMANCE

Expected output (example)

Temperature : 61 C
Power Draw : 286 W
Ecc Mode : Enabled
...

Recent kernel-level fault signatures

journalctl -k -n 200 | rg -i 'nvrm|xid|pcie|aer'

Expected output (example)

kernel: NVRM: Xid ...
kernel: pcieport ... (example)

Fabric communication sanity

all_reduce_perf -b 8 -e 256M -f 2 -g 8

Expected output (example)

NCCL sanity run complete with measured bandwidth table.

  • Use timestamp alignment across logs and telemetry to build root-cause confidence.
  • Capture command outputs before any replacement action.

Storage and host optimization runbook

Separate host and storage effects when investigating throughput regressions.

Storage throughput baseline

fio --name=optcheck --directory=/mnt/ai-data --rw=readwrite --bs=1M --size=8G --numjobs=4 --iodepth=16

Expected output (example)

READ: bw=8.4GiB/s ...
WRITE: bw=6.9GiB/s ...

CPU scheduler/load context

mpstat -P ALL 1 5

Expected output (example)

Average: CPU %usr %sys %iowait ... (example)

  • Keep workload and dataset constant when comparing tuning outcomes.
  • If the storage bottleneck dominates, host tuning alone will not recover target performance.
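The last point can be made quantitative with an Amdahl-style bound: if a fraction f of step time is storage-bound, host-only tuning caps end-to-end speedup near 1/f. The fraction value below is an illustrative assumption.

```shell
# Rough ceiling on host-only tuning: if a fraction f of step time is
# storage-bound, improving everything else caps end-to-end speedup
# near 1 / f. The fraction value is an illustrative assumption.
f=0.4  # 40% of step time spent waiting on storage
max_speedup=$(awk -v f="$f" 'BEGIN { printf "%.2f", 1 / f }')
echo "host-only tuning speedup ceiling: ${max_speedup}x"
```

With 40% of time in storage waits, no amount of CPU or BIOS tuning can exceed roughly a 2.5x end-to-end gain, which is why storage must stay in scope during performance incidents.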

Common Problems

Failure patterns and fixes

False hardware replacement due to incomplete diagnosis

Symptoms

  • Component replaced but issue persists.
  • No clear fault evidence was captured before swap.

Likely Cause

Premature replacement decision without triage confidence.

Remediation

  • Reconstruct incident timeline from logs/telemetry.
  • Re-run structured diagnostics with domain classification.
  • Apply targeted remediation after root-cause confidence improves.

Prevention: Require evidence checklist before physical replacement approval.

Performance tuning causes unstable workloads

Symptoms

  • Throughput improves on one workload but stability degrades on another.
  • Regression appears after multi-parameter tuning change.

Likely Cause

Over-broad tuning changes without workload-specific validation.

Remediation

  • Rollback to baseline.
  • Re-test one parameter at a time with controlled benchmarks.
  • Adopt per-platform (AMD/Intel) and per-workload tuning matrix.

Prevention: Use controlled experiment design with one-change-per-run policy.

Storage bottleneck hidden behind GPU utilization drops

Symptoms

  • GPU utilization oscillates while job input stalls.
  • Network and compute diagnostics show no major anomalies.

Likely Cause

Storage path latency or throughput instability.

Remediation

  • Run storage benchmarks during workload windows.
  • Tune storage parameters and retest end-to-end workload.
  • Correlate improvements with GPU utilization recovery.

Prevention: Include storage telemetry in standard incident dashboards and runbooks.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: Detect -> isolate -> remediate -> verify cycle

Execute full incident-response cycle for a realistic infrastructure fault and prove durable recovery.

Prerequisites

  • Access to node logs, telemetry, and benchmark tools.
  • Known-good baseline metrics available for comparison.
  • Change-control path approved for remediation actions.

Steps

  1. Collect initial high-signal diagnostics and classify fault domain.

    nvidia-smi -q -d TEMPERATURE,POWER,ECC && journalctl -k -n 200

    Expected: Symptoms mapped to a prioritized fault domain.

  2. Run focused reproducer for suspected domain.

    all_reduce_perf -b 8 -e 256M -f 2 -g 8

    Expected: Issue is reproduced or ruled out in communication path.

  3. Apply least-risk remediation (tune, rollback, or replace).

    Expected: Remediation executed with change record and rollback plan.

  4. Perform post-fix workload validation window.

    nvidia-smi --query-gpu=timestamp,utilization.gpu,temperature.gpu,power.draw --format=csv -l 30

    Expected: No recurrence indicators during defined validation duration.

  5. Close incident with updated runbook entry.

    Expected: Knowledge base updated with signature, fix, and prevention controls.

Success Criteria

  • Root cause documented with evidence.
  • Remediation outcome stable across validation window.
  • Runbook and alerting improvements applied.

Study Sprint

10-day execution plan

Day 1: Incident taxonomy and triage workflow design -> Fault-domain triage decision tree.
Day 2: GPU/host telemetry baseline and alert threshold review -> Health baseline and threshold sheet.
Day 3: Hands-on diagnostic drill for simulated GPU fault -> Diagnostic evidence report.
Day 4: Component replacement SOP and validation policy -> Replacement runbook with exit criteria.
Day 5: AMD/Intel server tuning experiment design -> CPU platform tuning matrix.
Day 6: Storage bottleneck isolation and measurement -> Storage path performance baseline.
Day 7: NCCL + system telemetry correlation for communication faults -> Cross-layer fault-correlation notes.
Day 8: Run one full remediation cycle (detect -> fix -> verify) -> End-to-end remediation case file.
Day 9: Timed troubleshooting scenario with constrained evidence -> Decision log and escalation rationale.
Day 10: Final revision and high-yield incident pattern recap -> Troubleshoot/optimize rapid cheat sheet.

Hands-on Labs

Practical module work

Each lab includes an execution sample with representative CLI usage and expected output.

Lab A: Fault-domain isolation

Identify root cause domain using minimal but high-signal diagnostics.

  • Collect initial symptoms and classify likely domain.
  • Run focused diagnostics and avoid blind command sweeps.
  • Document root-cause confidence and alternatives.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Fault isolation command runbook)

nvidia-smi -q -d TEMPERATURE,POWER,UTILIZATION,ECC,PERFORMANCE

Expected output (example)

Temperature : 61 C
Power Draw : 286 W
Ecc Mode : Enabled
...

Lab B: Replace-and-validate workflow

Execute safe component replacement with proof of fix.

  • Capture pre-replacement fault evidence.
  • Perform controlled replacement and baseline restore.
  • Run post-replacement validation under representative load.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Fault isolation command runbook)

journalctl -k -n 200 | rg -i 'nvrm|xid|pcie|aer'

Expected output (example)

kernel: NVRM: Xid ...
kernel: pcieport ... (example)

Lab C: CPU and host-level optimization

Measure performance impact of one-at-a-time platform tuning changes.

  • Select one AMD/Intel host tuning candidate per run.
  • Collect before/after metrics with constant workload.
  • Keep only changes with reproducible improvement.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Fault isolation command runbook)

all_reduce_perf -b 8 -e 256M -f 2 -g 8

Expected output (example)

NCCL sanity run complete with measured bandwidth table.

Lab D: Storage tuning and regression guardrails

Improve storage path performance while preventing regressions.

  • Measure baseline throughput/latency and variability.
  • Apply prioritized storage tuning changes.
  • Define regression alarms for post-change monitoring.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Storage and host optimization runbook)

fio --name=optcheck --directory=/mnt/ai-data --rw=readwrite --bs=1M --size=8G --numjobs=4 --iodepth=16

Expected output (example)

READ: bw=8.4GiB/s ...
WRITE: bw=6.9GiB/s ...

Exam Pitfalls

Common failure patterns

  • Jumping to hardware replacement before evidence-driven diagnosis.
  • Mixing multiple tuning changes per test and losing causality.
  • Ignoring firmware/driver alignment after component swap.
  • Stopping at symptom reduction without validating long-duration stability.
  • Treating storage as out-of-scope during performance incidents.
  • Failing to convert incident outcomes into updated runbooks.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. What is the best first principle in troubleshooting this domain?
  • A. Replace parts immediately
  • B. Isolate likely fault domain before remediation
  • C. Reinstall all software
  • D. Ignore telemetry

Answer: B

Efficient troubleshooting starts with narrowing fault domain using evidence and structured triage.

Q2. Why capture pre-replacement evidence before swapping hardware?
  • A. It is optional paperwork only
  • B. It supports correct diagnosis, auditability, and post-fix comparison
  • C. It removes need for validation
  • D. It guarantees no future faults

Answer: B

Pre-replacement evidence validates the decision and helps confirm whether the fix addressed root cause.

Q3. What is a reliable optimization experiment pattern?
  • A. Change many BIOS/runtime parameters at once
  • B. One change per run with controlled workload and baseline comparison
  • C. No baseline
  • D. Keep only best outlier run

Answer: B

Single-variable testing preserves causal interpretation of optimization results.

Q4. Why should post-fix validation include representative workload duration?
  • A. Quick smoke tests always catch recurrence
  • B. Some faults appear only under sustained load
  • C. Duration is unrelated to reliability
  • D. It only matters for UI services

Answer: B

Longer validation windows reveal intermittent faults missed by short checks.

Q5. Which is a common cause of false optimization conclusions?
  • A. Stable baseline metrics
  • B. Simultaneous multi-parameter changes
  • C. Controlled test design
  • D. Repeatable measurements

Answer: B

Multiple concurrent changes make it impossible to attribute outcomes confidently.

Q6. Why include storage in performance troubleshooting?
  • A. Storage cannot bottleneck AI systems
  • B. Storage latency/throughput issues can impact end-to-end workload performance
  • C. It only affects backups
  • D. It is never measurable

Answer: B

Storage path health is a frequent contributor to performance regressions in data-heavy workloads.

Q7. What should be done after component replacement?
  • A. Return node immediately to production without checks
  • B. Verify firmware/driver alignment and run validation workload
  • C. Disable monitoring
  • D. Skip documentation

Answer: B

Post-replacement validation confirms operational integrity and prevents silent drift.

Q8. For CPU platform tuning (AMD/Intel), what is a safe assumption?
  • A. One universal profile works for all workloads
  • B. Tuning should be workload-specific and benchmark-validated
  • C. CPU tuning never affects GPU pipelines
  • D. Defaults are always optimal

Answer: B

Host-level tuning impact depends on workload behavior and needs measured validation.

Q9. What is the strongest indicator of remediation success?
  • A. Incident closed quickly
  • B. Root cause evidence, stable post-fix metrics, and no recurrence signals
  • C. One successful command
  • D. Team consensus only

Answer: B

Durable remediation requires evidence and sustained stable operation.

Q10. Why should runbooks be updated after incidents?
  • A. To increase page count
  • B. To institutionalize learnings and reduce repeated failure modes
  • C. To hide prior mistakes
  • D. To replace monitoring

Answer: B

Operational maturity depends on feeding incident lessons back into standard procedures.

Primary References

Curated from the NCP-AII blueprint/study-guide sources and official documentation.

Objectives

  1. 5.1 Identify and troubleshoot hardware faults (e.g., GPU, fan, network card).
  2. 5.2 Identify faulty cards, GPUs, and power supplies.
  3. 5.3 Replace faulty cards, GPUs, and power supplies.
  4. 5.4 Execute performance optimization for AMD and Intel servers.
  5. 5.5 Optimize storage.
