1. Stabilize Before Optimizing
- Verify hardware and management-plane integrity first.
- Confirm firmware/software baseline consistency.
- Only then make performance-tuning decisions.
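This ordering can be sketched as a short shell gate where each stage must pass before the next is attempted; `check_hardware`, `check_baseline`, and `run_tuning` below are hypothetical placeholders for site-specific checks, not real commands:

```shell
#!/usr/bin/env bash
# Sketch: enforce the stabilize-before-optimize order.
# check_hardware and check_baseline are hypothetical stand-ins for
# site-specific health and firmware-consistency checks.
set -euo pipefail

check_hardware() { echo "hardware: OK"; }   # e.g. BMC/IPMI sweep + GPU inventory
check_baseline() { echo "baseline: OK"; }   # e.g. firmware inventory diff vs golden
run_tuning()     { echo "tuning: allowed"; }

# Short-circuit chain: tuning never runs if an earlier stage fails.
check_hardware && check_baseline && run_tuning
```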
Module study guide
Priority 4 of 5 · Domain 5 in exam order
Scope
This module contains expanded study notes, practical drills, and an exam-style question set.
Exam Framework
Exam Scope Coverage
Domain 5 focuses on fault isolation and performance recovery: diagnosing hardware issues, replacing faulty components safely, and optimizing host + storage behavior for sustained cluster performance.
Fast isolation across hardware, firmware, driver, and fabric layers is critical in production incidents.
Drill: Build an incident triage flowchart mapping symptom patterns to likely fault domains.
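One way to start the triage flowchart drill is a small classifier that maps kernel-log signatures to a candidate fault domain. The patterns below are illustrative examples, not an exhaustive or authoritative mapping:

```shell
#!/usr/bin/env bash
# Sketch: map common kernel-log signatures to a likely fault domain.
# Patterns are illustrative starting points for a triage flowchart.
classify_line() {
  case "$1" in
    *Xid*|*NVRM*)         echo "gpu" ;;
    *pcieport*|*AER*)     echo "pcie" ;;
    *nvme*|*"I/O error"*) echo "storage" ;;
    *)                    echo "unknown" ;;
  esac
}

classify_line "kernel: NVRM: Xid (PCI:0000:3b:00): 79"              # -> gpu
classify_line "kernel: pcieport 0000:3a:00.0: AER: Corrected error" # -> pcie
```

In practice this would be fed by `dmesg` or `journalctl -k` output, with each unknown line routed to manual review.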
Exam tasks include identifying faulty GPUs/cards/power components, not only software errors.
Drill: Run a structured GPU diagnostic pass and produce a replace vs observe decision memo.
Safe replacement procedures reduce repeat incidents and post-maintenance regressions.
Drill: Write a component replacement SOP with evidence capture, execution controls, and exit criteria.
The exam blueprint includes optimization workflows on mixed server CPU platforms (AMD and Intel).
Drill: Create a CPU tuning experiment table with one change per run and measured impact.
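A minimal sketch of the one-change-per-run experiment table as a CSV logger, assuming a placeholder `run_benchmark` function in place of the real workload benchmark:

```shell
#!/usr/bin/env bash
# Sketch: log exactly one tuning change per benchmark run to a CSV,
# so every measured delta stays attributable to a single change.
# run_benchmark is a hypothetical stand-in for the real benchmark.
set -euo pipefail
log="tuning_runs.csv"
echo "run_id,change,metric" > "$log"

run_benchmark() { echo "123.4"; }   # placeholder: emits a single metric

record_run() {   # record_run <run_id> <single-change-description>
  local metric
  metric="$(run_benchmark)"
  echo "$1,$2,$metric" >> "$log"
}

record_run 1 "baseline"
record_run 2 "numa_balancing=off"
cat "$log"
```

Each row pairs one change with one measurement, which is what makes the later regression analysis trustworthy.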
Storage bottlenecks frequently appear as training or inference slowdowns.
Drill: Run storage baseline tests and produce a prioritized tuning backlog.
Troubleshooting is incomplete without proving the fix holds under realistic load.
Drill: Create a post-remediation validation template with recurrence watchpoints.
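A sketch of a recurrence watch for the validation template; `fetch_recent_logs` is a hypothetical stand-in for something like a `journalctl -k --since` query against the real node:

```shell
#!/usr/bin/env bash
# Sketch: a recurrence watch over a validation window. Polls a log
# source N times and fails fast if a watchpoint pattern reappears.
watch_for_recurrence() {   # watch_for_recurrence <iterations> <pattern>
  local i
  for i in $(seq 1 "$1"); do
    if fetch_recent_logs | grep -Eq "$2"; then
      echo "RECURRENCE at check $i"
      return 1
    fi
  done
  echo "CLEAN after $1 checks"
}

fetch_recent_logs() { echo "kernel: normal operation"; }  # placeholder source
watch_for_recurrence 5 'Xid|AER'
```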
Concept Explanations
Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.
Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.
Reliable troubleshooting starts with symptom-to-domain mapping before remediation action.
Operational excellence requires choosing the right remediation path based on confidence and blast radius.
A fix is incomplete until it holds under representative workload duration and no recurrence indicators remain.
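The evidence discipline above (command, output, timestamp, expected vs observed) can be wrapped in a small helper so no key action goes unrecorded; this is a minimal sketch, not a full evidence pipeline:

```shell
#!/usr/bin/env bash
# Sketch: run any diagnostic command and append an evidence record
# (UTC timestamp, command line, captured output) to a log file.
evidence() {   # evidence <logfile> <command...>
  local log="$1"; shift
  {
    printf 'ts=%s cmd=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"
    "$@" 2>&1
    printf -- '---\n'
  } >> "$log"
}

evidence /tmp/evidence.log echo "probe ok"
tail -n 2 /tmp/evidence.log
```

Real usage would wrap calls like `nvidia-smi -q` or `journalctl -k`, giving each record the same timestamped shape.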
Scenario Playbooks
A node passes smoke tests but emits intermittent Xid-like failures during sustained workload periods.
Architecture Diagram
[Workload Scheduler] -> [GPU Node]
|
[Telemetry + Kernel Logs]
|
[Decision: replace/tune/rollback]
Response Flow
Success Signals
Kernel error scan
dmesg | rg -i 'xid|nvrm|pcie' | tail -n 30
Expected output (example):
[NVRM] Xid ...
[NVRM] Channel exception ... (example)
Performance dropped after combined BIOS and runtime tuning changes on a mixed AMD/Intel fleet.
Architecture Diagram
[Baseline Config] -> [Change Set A+B+C] -> [Regression]
| |
benchmark pack unclear causality
\_________________rollback + one-change retest________________/
Response Flow
Success Signals
CPU/memory context snapshot
lscpu && numactl --hardware
Expected output (example):
Architecture: x86_64
NUMA node(s): 2
node distances: ...
CLI and Commands
Use this compact set to classify incidents quickly without command sprawl.
GPU health details
nvidia-smi -q -d TEMPERATURE,POWER,UTILIZATION,ECC,PERFORMANCE
Expected output (example):
Temperature : 61 C
Power Draw : 286 W
Ecc Mode : Enabled
...
Recent kernel-level fault signatures
journalctl -k -n 200 | rg -i 'nvrm|xid|pcie|aer'
Expected output (example):
kernel: NVRM: Xid ...
kernel: pcieport ... (example)
Fabric communication sanity
all_reduce_perf -b 8 -e 256M -f 2 -g 8
Expected output (example):
NCCL sanity run complete with measured bandwidth table.
Separate host and storage effects when investigating throughput regressions.
Storage throughput baseline
fio --name=optcheck --directory=/mnt/ai-data --rw=readwrite --bs=1M --size=8G --numjobs=4 --iodepth=16
Expected output (example):
READ: bw=8.4GiB/s ...
WRITE: bw=6.9GiB/s ...
CPU scheduler/load context
mpstat -P ALL 1 5
Expected output (example):
Average: CPU %usr %sys %iowait ... (example)
Common Problems
Symptoms
Likely Cause
Premature replacement decision without triage confidence.
Remediation
Prevention: Require evidence checklist before physical replacement approval.
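The prevention control above can be enforced mechanically with a gate that blocks approval until every evidence artifact exists. In this sketch the required file names are illustrative placeholders for the real evidence artifacts:

```shell
#!/usr/bin/env bash
# Sketch: refuse a replacement approval unless every required evidence
# file exists and is non-empty. File names are illustrative examples.
approve_replacement() {   # approve_replacement <evidence_dir>
  local f
  for f in gpu_diag.txt kernel_log.txt decision_memo.txt; do
    if [ ! -s "$1/$f" ]; then
      echo "BLOCKED: missing $f"
      return 1
    fi
  done
  echo "APPROVED"
}

mkdir -p /tmp/evidence
for f in gpu_diag.txt kernel_log.txt decision_memo.txt; do
  echo sample > "/tmp/evidence/$f"
done
approve_replacement /tmp/evidence   # -> APPROVED
```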
Symptoms
Likely Cause
Over-broad tuning changes without workload-specific validation.
Remediation
Prevention: Use controlled experiment design with one-change-per-run policy.
Symptoms
Likely Cause
Storage path latency or throughput instability.
Remediation
Prevention: Include storage telemetry in standard incident dashboards and runbooks.
Lab Walkthroughs
Execute full incident-response cycle for a realistic infrastructure fault and prove durable recovery.
Prerequisites
Collect initial high-signal diagnostics and classify fault domain.
nvidia-smi -q -d TEMPERATURE,POWER,ECC && journalctl -k -n 200
Expected: Symptoms mapped to a prioritized fault domain.
Run focused reproducer for suspected domain.
all_reduce_perf -b 8 -e 256M -f 2 -g 8
Expected: Issue is reproduced or ruled out in communication path.
Apply least-risk remediation (tune, rollback, or replace).
Expected: Remediation executed with change record and rollback plan.
Perform post-fix workload validation window.
nvidia-smi --query-gpu=timestamp,utilization.gpu,temperature.gpu,power.draw --format=csv -l 30
Expected: No recurrence indicators during defined validation duration.
Close incident with updated runbook entry.
Expected: Knowledge base updated with signature, fix, and prevention controls.
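The validation-window step can be checked mechanically against the captured telemetry. The sample CSV rows and the 85 C threshold below are illustrative assumptions, not exam-mandated values:

```shell
#!/usr/bin/env bash
# Sketch: scan an nvidia-smi-style CSV validation capture for
# threshold breaches. Rows and threshold are illustrative; a real
# pass would read the actual capture file from the validation run.
cat > /tmp/validation.csv <<'EOF'
timestamp, utilization.gpu [%], temperature.gpu, power.draw [W]
2024/01/01 10:00:00, 98 %, 72, 310.5 W
2024/01/01 10:00:30, 97 %, 74, 308.2 W
EOF

# Flag any sample whose temperature column exceeds the threshold.
awk -F',' 'NR>1 && $3+0 > 85 {bad++} END {print (bad ? "FAIL" : "PASS")}' /tmp/validation.csv
```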
Success Criteria
Study Sprint
| Day | Focus | Output |
|---|---|---|
| 1 | Incident taxonomy and triage workflow design. | Fault-domain triage decision tree. |
| 2 | GPU/host telemetry baseline and alert threshold review. | Health baseline and threshold sheet. |
| 3 | Hands-on diagnostic drill for simulated GPU fault. | Diagnostic evidence report. |
| 4 | Component replacement SOP and validation policy. | Replacement runbook with exit criteria. |
| 5 | AMD/Intel server tuning experiment design. | CPU platform tuning matrix. |
| 6 | Storage bottleneck isolation and measurement. | Storage path performance baseline. |
| 7 | NCCL + system telemetry correlation for communication faults. | Cross-layer fault-correlation notes. |
| 8 | Run one full remediation cycle (detect -> fix -> verify). | End-to-end remediation case file. |
| 9 | Timed troubleshooting scenario with constrained evidence. | Decision log and escalation rationale. |
| 10 | Final revision and high-yield incident pattern recap. | Troubleshoot/optimize rapid cheat sheet. |
Hands-on Labs
Each lab includes a collapsed execution sample with representative CLI usage and expected output.
Identify root cause domain using minimal but high-signal diagnostics.
Sample Command (Fault isolation command runbook)
nvidia-smi -q -d TEMPERATURE,POWER,UTILIZATION,ECC,PERFORMANCE
Expected output (example):
Temperature : 61 C
Power Draw : 286 W
Ecc Mode : Enabled
...
Execute safe component replacement with proof of fix.
Sample Command (Fault isolation command runbook)
journalctl -k -n 200 | rg -i 'nvrm|xid|pcie|aer'
Expected output (example):
kernel: NVRM: Xid ...
kernel: pcieport ... (example)
Measure performance impact of one-at-a-time platform tuning changes.
Sample Command (Fault isolation command runbook)
all_reduce_perf -b 8 -e 256M -f 2 -g 8
Expected output (example):
NCCL sanity run complete with measured bandwidth table.
Improve storage path performance while preventing regressions.
Sample Command (Storage and host optimization runbook)
fio --name=optcheck --directory=/mnt/ai-data --rw=readwrite --bs=1M --size=8G --numjobs=4 --iodepth=16
Expected output (example):
READ: bw=8.4GiB/s ...
WRITE: bw=6.9GiB/s ...
Exam Pitfalls
Practice Set
Attempt each question first, then open the answer and explanation.
Answer: B
Efficient troubleshooting starts with narrowing fault domain using evidence and structured triage.
Answer: B
Pre-replacement evidence validates the decision and helps confirm whether the fix addressed root cause.
Answer: B
Single-variable testing preserves causal interpretation of optimization results.
Answer: B
Longer validation windows reveal intermittent faults missed by short checks.
Answer: B
Multiple concurrent changes make it impossible to attribute outcomes confidently.
Answer: B
Storage path health is a frequent contributor to performance regressions in data-heavy workloads.
Answer: B
Post-replacement validation confirms operational integrity and prevents silent drift.
Answer: B
Host-level tuning impact depends on workload behavior and needs measured validation.
Answer: B
Durable remediation requires evidence and sustained stable operation.
Answer: B
Operational maturity depends on feeding incident lessons back into standard procedures.
Primary References
Curated from the NCP-AII blueprint/study-guide sources and official documentation.