1. Stabilize Before Optimizing
- Verify hardware and management-plane integrity first.
- Confirm firmware/software baseline consistency.
- Only then make performance-tuning decisions.
Module study guide
Priority 6 of 6 · Domain 4 in exam order
Scope
This module contains expanded study notes, scenario playbooks, command runbooks, and exam-style checkpoint questions.
Exam Framework
Exam Scope Coverage
This module covers Kubernetes integration scope for NCP-AIN: cluster bring-up dependencies, NVIDIA GPU Operator validation, CNI and networking plugin behavior, and kubectl-driven diagnostics for AI workloads.
You need to understand where orchestration interacts with networking and GPU scheduling before troubleshooting symptoms.
Drill: Map one training workload from API submit to pod networking and GPU allocation steps.
Many Kubernetes networking failures originate in node prerequisites, not in overlay or routing logic.
Drill: Create a pre-flight checklist and run it on one control-plane and one worker node.
The exam explicitly expects operator usage and operational verification patterns.
Drill: Run end-to-end validation proving a GPU pod is admitted and can see expected devices.
Network plugin misconfiguration can mimic compute or application failures.
Drill: Test pod-to-pod, pod-to-service, and node-to-pod connectivity with one policy-enabled namespace.
Fast, deterministic command flow is required for exam scenarios and production incidents.
Drill: Resolve a pending GPU pod incident and produce a short root-cause timeline.
Module Resources
Concept Explanations
Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.
Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.
Kubernetes integration failures often appear as application incidents but originate from node, CNI, or policy coupling issues.
Consistent command flow shortens mean-time-to-isolation and improves exam response quality.
Performance fixes must preserve namespace isolation and intended policy behavior.
Scenario Playbooks
A cluster update completed, but new training pods requesting GPUs remain Pending in one namespace.
Architecture Diagram
API Server
|
Scheduler ---- etcd
|
GPU Worker Nodes
|
CNI Plugin + GPU Operator Components
Response Flow
Success Signals
Pod and event triage
kubectl describe pod <pod> -n <ns> && kubectl get events -n <ns> --sort-by=.lastTimestamp
Expected output (example): Scheduling or resource errors are explicit and time-correlated.
GPU allocatable check
kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOC:.status.allocatable.nvidia\.com/gpu
Expected output (example): Node allocatable GPU values match expected capacity.
A namespace policy hardening rollout was applied; cross-namespace service calls now fail intermittently.
Architecture Diagram
Namespace A Pods ---- Service A
Namespace B Pods ---- Service B
|
CNI Plugin + NetworkPolicy Engine
Response Flow
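A hardening rollout like the one in this scenario typically layers allow rules on top of a default-deny baseline. A minimal sketch of such a baseline policy (the namespace name is hypothetical):

```yaml
# Default-deny ingress for every pod in ns-b (hypothetical namespace).
# Cross-namespace calls into ns-b fail unless a later policy allows them.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ns-b
spec:
  podSelector: {}        # empty selector: applies to all pods in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so all ingress is denied
```

Intermittent cross-namespace failures often mean a baseline like this was applied in some namespaces but not others, or that allow rules lagged behind the deny rollout.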
Success Signals
CLI and Commands
Establish whether the issue originates in the scheduler, node resources, or operator stack before tuning the CNI.
Node and system pod health
kubectl get nodes -o wide && kubectl get pods -n kube-system
Expected output (example): All required nodes are Ready and system pods are healthy.
GPU operator and device plugin health
kubectl get pods -n gpu-operator && kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=80
Expected output (example): Operator components are Running with no device registration errors.
Isolate data-path versus policy-enforcement failures.
Policy inventory and namespace scoping
kubectl get networkpolicy -A
Expected output (example): Policy set matches the intended namespace segmentation model.
Endpoint-level connectivity probes
kubectl exec -n <ns> <pod> -- sh -c 'nc -vz <target> <port>'
Expected output (example): Connectivity result aligns with expected allow/deny behavior.
Common Problems
Symptoms
GPU pods remain Pending, and nodes report missing or zero nvidia.com/gpu allocatable capacity.
Likely Cause
Driver/runtime mismatch or incomplete node prerequisites.
Remediation
Align driver and container-runtime versions with operator requirements, complete the node prerequisites, then restart the affected operator components and re-check node allocatable values.
Prevention: Run node pre-flight compatibility checks before operator rollout.
Symptoms
Pod-to-pod or pod-to-service traffic fails intermittently after an update, even though nodes and system pods appear healthy.
Likely Cause
CNI config drift or MTU mismatch introduced during update.
Remediation
Restore a known-good CNI configuration, align MTU settings end to end, and re-run connectivity probes to confirm the data path.
Prevention: Use staged CNI rollouts with explicit pre/post connectivity tests.
Lab Walkthroughs
Confirm end-to-end GPU workload admission and device visibility in Kubernetes.
Prerequisites
Verify cluster and operator baseline health.
kubectl get nodes && kubectl get pods -n gpu-operator
Expected: Nodes are Ready and operator components are Running.
Launch a GPU validation pod.
kubectl apply -f gpu-smoke-test.yaml
Expected: Pod schedules onto a GPU node and enters the Running state.
Confirm GPU visibility from inside pod.
kubectl exec -it <gpu-pod> -- nvidia-smi
Expected: The expected GPU inventory is visible without runtime errors.
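The gpu-smoke-test.yaml referenced above is not given in this module; a minimal sketch, assuming the GPU Operator's device plugin advertises the nvidia.com/gpu resource (the image tag is illustrative):

```yaml
# gpu-smoke-test.yaml (illustrative): a pod that requests one GPU
# and runs nvidia-smi once, then exits.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request a single GPU from the device plugin
```

If this pod stays Pending, check node allocatable nvidia.com/gpu values and operator health before touching the CNI.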
Success Criteria
Prove intended connectivity behavior across namespaces with policy controls.
Prerequisites
List active policies and namespace bindings.
kubectl get networkpolicy -A
Expected: Policies appear in intended namespaces with expected selectors.
Run allow-path connectivity test.
kubectl exec -n ns-a pod-a -- nc -vz service-b.ns-b.svc.cluster.local 8080
Expected: Allowed path succeeds.
Run deny-path connectivity test.
kubectl exec -n ns-a pod-a -- nc -vz denied-target.ns-b.svc.cluster.local 8080
Expected: Denied path fails as intended.
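The allow/deny split exercised above implies a policy in ns-b that admits ns-a traffic only toward service-b's backing pods. A sketch under those assumptions (namespace and service names come from the lab steps; the pod label is hypothetical):

```yaml
# Allows ingress to service-b's backend pods in ns-b from pods in ns-a
# on TCP 8080. Other ns-b pods (e.g. denied-target's backends) remain
# blocked by the namespace's default-deny baseline.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ns-a-to-service-b
  namespace: ns-b
spec:
  podSelector:
    matchLabels:
      app: service-b        # hypothetical label on service-b's backend pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ns-a   # automatic namespace label
      ports:
        - protocol: TCP
          port: 8080
```

Pairing this allow rule with the deny-path probe gives the controlled positive and negative tests the policy lab calls for.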
Success Criteria
Study Sprint
| Day | Focus | Output |
|---|---|---|
| 1 | Kubernetes architecture refresh for AI networking context. | Control-plane and data-path dependency map. |
| 2 | Node prerequisite and baseline validation workflow. | Pre-flight checklist for production clusters. |
| 3 | GPU Operator install and health verification. | Operator readiness runbook with pass/fail gates. |
| 4 | CNI plugin architecture and config inspection. | CNI verification checklist. |
| 5 | Network policy and namespace isolation checks. | Policy test matrix (allow/deny paths). |
| 6 | kubectl diagnostics drill for pending/not-ready workloads. | Incident triage command sequence. |
| 7 | Service routing and DNS failure simulation. | Troubleshooting decision tree for service reachability. |
| 8 | GPU scheduling and capacity constraints scenarios. | Scheduler and allocatable-capacity playbook. |
| 9 | Timed scenario set combining CNI and GPU Operator faults. | Exam-style remediation notes. |
| 10 | Final revision and checklist compression. | Kubernetes Integration quick revision sheet. |
Hands-on Labs
Each lab includes a collapsed execution sample with representative CLI usage and expected output.
Prove control-plane and worker nodes are ready before CNI/operator work.
Sample Command (Runbook: Cluster and GPU readiness baseline)
kubectl get nodes -o wide && kubectl get pods -n kube-system
Expected output (example): All required nodes are Ready and system pods are healthy.
Deploy and validate operator-managed components on GPU workers.
Sample Command (Runbook: Cluster and GPU readiness baseline)
kubectl get pods -n gpu-operator && kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=80
Expected output (example): Operator components are Running with no device registration errors.
Validate data-path and policy behavior with deterministic checks.
Sample Command (Runbook: CNI and policy troubleshooting)
kubectl get networkpolicy -A
Expected output (example): Policy set matches the intended namespace segmentation model.
Diagnose and remediate a simulated cluster networking incident.
Sample Command (Runbook: CNI and policy troubleshooting)
kubectl exec -n <ns> <pod> -- sh -c 'nc -vz <target> <port>'
Expected output (example): Connectivity result aligns with expected allow/deny behavior.
Exam Pitfalls
Practice Set
Attempt each question first, then open the answer and explanation.
Answer: B
Pending state is usually scheduler/resource related first; events and node allocatable fields provide the quickest signal.
Answer: B
GPU Operator focuses on GPU stack lifecycle and related components, not full CNI management.
Answer: A
Misordered or invalid CNI configs can break pod networking even when nodes appear healthy.
Answer: B
Policy behavior should be proven with controlled positive and negative traffic tests.
Answer: B
These provide immediate scheduling/runtime/policy context needed to isolate root cause.
Answer: B
MTU mismatches commonly create hard-to-diagnose network performance issues.
Answer: B
Exam answers are strongest when they include cause, action, and validated outcome.
Answer: A
GPU Operator installation and usage is explicitly included in this blueprint domain.
Primary References
Curated from official NVIDIA NCP-AIN blueprint/study guide sources and primary Kubernetes/GPU operator documentation.