
Kubernetes Integration

Module study guide

Priority 6 of 6 · Domain 4 in exam order

Scope

Exam study content

This module contains expanded study notes, scenario playbooks, command runbooks, and exam-style checkpoint questions.

Exam weight: 5%
Priority tier: Tier 3
Why this domain: Low-weight but critical integration scope for GPU platform readiness and CNI-aware orchestration.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.

Exam Scope Coverage

What this module covers

This module covers Kubernetes integration scope for NCP-AIN: cluster bring-up dependencies, NVIDIA GPU Operator validation, CNI and networking plugin behavior, and kubectl-driven diagnostics for AI workloads.

Track 1: Kubernetes architecture for AI networking

You need to understand where orchestration interacts with networking and GPU scheduling before troubleshooting symptoms.

  • Control-plane, worker-plane, and CNI responsibilities are separate failure domains.
  • Pod scheduling, device plugins, and CNI pathing jointly determine workload readiness.
  • Node labels, taints, and runtime classes directly affect GPU workload placement.

Drill: Map one training workload from API submit to pod networking and GPU allocation steps.
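
The placement controls above (labels, taints, runtime classes, device resources) can be sketched in a single pod spec. This is a hedged illustration only: the label, taint key, runtime class name, and image tag are assumptions that vary by cluster standard.

```yaml
# Illustrative only: label, taint key, runtime class, and image vary by cluster.
apiVersion: v1
kind: Pod
metadata:
  name: placement-demo
spec:
  runtimeClassName: nvidia            # selects the GPU-enabled container runtime
  nodeSelector:
    nvidia.com/gpu.present: "true"    # node label (e.g. from GPU feature discovery)
  tolerations:
    - key: nvidia.com/gpu             # tolerates a GPU-only taint, if one is applied
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1           # extended resource advertised by the device plugin
```

Walking this spec through scheduling (label match, taint toleration, allocatable GPU check, runtime class resolution) is a compact version of the drill above.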

Track 2: Cluster/node prerequisites and baseline validation

Many Kubernetes networking failures originate in node prerequisites, not in overlay or routing logic.

  • Kernel modules, container runtime, and kubelet settings must match cluster standard.
  • Time sync, DNS reachability, and MTU consistency are required before CNI validation.
  • GPU nodes require deterministic driver/runtime compatibility for stable operator behavior.

Drill: Create a pre-flight checklist and run it on one control-plane and one worker node.

Track 3: NVIDIA GPU Operator and node readiness

The exam explicitly expects operator usage and operational verification patterns.

  • Validate operator-managed components: driver, toolkit, device plugin, and DCGM exporter.
  • Distinguish operator rollout failure from CNI or scheduler issues.
  • Confirm allocatable GPU resources at node and namespace scope.

Drill: Run end-to-end validation proving a GPU pod is admitted and can see expected devices.

Track 4: CNI plugin and Kubernetes network integration

Network plugin misconfiguration can mimic compute or application failures.

  • Know CNI config locations and plugin chain order on worker nodes.
  • Validate network policy behavior separately from CNI data-path health.
  • Understand secondary interface patterns for high-performance data paths.

Drill: Test pod-to-pod, pod-to-service, and node-to-pod connectivity with one policy-enabled namespace.
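
The secondary-interface pattern mentioned above is commonly implemented with the Multus meta-plugin. The following is a hedged sketch assuming Multus is installed; the parent interface name, namespace, and subnet are placeholders, not cluster values.

```yaml
# Hypothetical secondary interface for a high-performance data path (assumes Multus).
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: data-path
  namespace: ml-workloads
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens1f0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.100.0/24"
      }
    }
```

A pod would request the attachment with the annotation `k8s.v1.cni.cncf.io/networks: data-path`, leaving the primary CNI interface untouched for cluster traffic.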

Track 5: kubectl diagnostics and incident workflow

Fast, deterministic command flow is required for exam scenarios and production incidents.

  • Use describe/events/logs to isolate scheduling vs runtime vs network causes.
  • Capture node condition, daemonset health, and namespace-scoped policy state.
  • Tie every remediation to objective before/after validation evidence.

Drill: Resolve a pending GPU pod incident and produce a short root-cause timeline.

Module Resources

Downloads and quick links

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.

Orchestrator and network coupling

Kubernetes integration failures often appear as application incidents but originate from node, CNI, or policy coupling issues.

  • Always classify failures by layer before applying fixes.
  • Node readiness and CNI readiness are separate checks.
  • Operator health does not guarantee workload path health.

Deterministic kubectl triage

Consistent command flow shortens mean-time-to-isolation and improves exam response quality.

  • Start with pod status and events.
  • Validate node conditions and allocatable resources next.
  • Confirm policy and service routing only after baseline checks.

Policy-safe performance operations

Performance fixes must preserve namespace isolation and intended policy behavior.

  • Do not bypass policy controls to mask data-path defects.
  • Validate allowed and denied flows after every change.
  • Keep rollback conditions explicit before policy updates.

Scenario Playbooks

Exam-style scenario explanations

Scenario: GPU training pods stay Pending after cluster update

A cluster update completed, but new training pods requesting GPUs remain Pending in one namespace.

Architecture Diagram

API Server
  |
Scheduler ---- etcd
  |
GPU Worker Nodes
  |
CNI Plugin + GPU Operator Components

Response Flow

  1. Check pod events and scheduler reasons for failed placement.
  2. Validate node allocatable GPU resources and taints/tolerations alignment.
  3. Confirm GPU Operator daemonsets and device plugin state.
  4. Apply one targeted remediation and re-validate placement.

Success Signals

  • Pods transition from Pending to Running with expected GPU allocation.
  • No policy or namespace isolation regressions are introduced.
  • Root cause is tied to one control layer and documented.

Pod and event triage

kubectl describe pod <pod> -n <ns> && kubectl get events -n <ns> --sort-by=.lastTimestamp

Expected output (example)

Scheduling or resource errors are explicit and time-correlated.

GPU allocatable check

kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOC:.status.allocatable.nvidia\.com/gpu

Expected output (example)

Node allocatable GPU values match expected capacity.

Scenario: Pod-to-pod traffic fails across namespaces after policy change

A namespace policy hardening rollout was applied; cross-namespace service calls now fail intermittently.

Architecture Diagram

Namespace A Pods ---- Service A
Namespace B Pods ---- Service B
        |
CNI Plugin + NetworkPolicy Engine

Response Flow

  1. Validate network policy objects and intended allow/deny matrix.
  2. Run controlled connectivity probes between namespace endpoints.
  3. Inspect CNI plugin health and node-level logs for policy application errors.
  4. Rollback or patch policy with minimal scope and retest.
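
A minimal-scope patch for step 4 might look like the following NetworkPolicy sketch. The namespace names, pod labels, and port are illustrative assumptions chosen to match this scenario, not a prescribed policy.

```yaml
# Hypothetical minimal-scope allow rule: Namespace A may reach Service B on one port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ns-a
  namespace: ns-b
spec:
  podSelector:
    matchLabels:
      app: service-b              # scope enforcement to the affected service pods only
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ns-a   # standard namespace name label
      ports:
        - protocol: TCP
          port: 8080
```

Scoping the rule to one pod selector and one port keeps the blast radius small and makes the retest in step 4 a clean allow/deny comparison.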

Success Signals

  • Expected traffic is restored and denied paths remain blocked.
  • Policy behavior is validated in repeat tests.
  • CNI/plugin health remains stable post-change.

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute the command with explicit scope (node, interface, GPU set).
  3. Compare output against the expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).

Runbook: Cluster and GPU readiness baseline

Establish whether issue source is scheduler/resource/operator before CNI tuning.

Node and system pod health

kubectl get nodes -o wide && kubectl get pods -n kube-system

Expected output (example)

All required nodes are Ready and system pods are healthy.

GPU operator and device plugin health

kubectl get pods -n gpu-operator && kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=80

Expected output (example)

Operator components are Running with no device registration errors.
  • If nodes are NotReady, resolve node/runtime baseline before CNI troubleshooting.
  • If operator is unhealthy, remediation should focus on GPU stack lifecycle first.

Runbook: CNI and policy troubleshooting

Isolate data-path versus policy enforcement failures.

Policy inventory and namespace scoping

kubectl get networkpolicy -A

Expected output (example)

Policy set matches intended namespace segmentation model.

Endpoint-level connectivity probes

kubectl exec -n <ns> <pod> -- sh -c 'nc -vz <target> <port>'

Expected output (example)

Connectivity result aligns with expected allow/deny behavior.
  • Use both positive and negative tests for policy validation.
  • Capture timestamped evidence before and after changes.

Common Problems

Failure patterns and fixes

GPU Operator components remain CrashLoopBackOff

Symptoms

  • Pods in gpu-operator namespace repeatedly restart.
  • GPU resources are not allocatable on worker nodes.

Likely Cause

Driver/runtime mismatch or incomplete node prerequisites.

Remediation

  • Check operator and device plugin logs for registration failures.
  • Verify runtime/driver compatibility and node kernel dependencies.
  • Redeploy operator components after prerequisite correction.

Prevention: Run node pre-flight compatibility checks before operator rollout.

Pod networking unstable after CNI update

Symptoms

  • Intermittent packet loss between namespaces.
  • Service reachability flaps during peak traffic.

Likely Cause

CNI config drift or MTU mismatch introduced during update.

Remediation

  • Validate CNI config consistency across worker nodes.
  • Check MTU alignment for node and pod interfaces.
  • Rollback/patch CNI config and rerun connectivity matrix.

Prevention: Use staged CNI rollouts with explicit pre/post connectivity tests.
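
Where the CNI exposes MTU explicitly, pinning it in configuration removes one source of drift. As one hedged example, assuming a Calico operator-managed install (other CNIs set MTU elsewhere), the value lives in the Installation resource:

```yaml
# Hypothetical example assuming Calico's operator; the value must leave
# encapsulation headroom below the node/fabric MTU (e.g. 9000 jumbo frames).
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    mtu: 8950
```

An explicit, version-controlled MTU value makes pre/post comparison during staged rollouts a simple diff rather than a packet-capture exercise.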

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: Validate GPU Operator readiness

Confirm end-to-end GPU workload admission and device visibility in Kubernetes.

Prerequisites

  • Kubernetes cluster with at least one GPU worker node.
  • Admin access for namespace and daemonset inspection.
  • GPU test container image available.
  1. Verify cluster and operator baseline health.

    kubectl get nodes && kubectl get pods -n gpu-operator

    Expected: Nodes are Ready and operator components are Running.

  2. Launch a GPU validation pod.

    kubectl apply -f gpu-smoke-test.yaml

    Expected: Pod schedules onto GPU node and enters Running state.

  3. Confirm GPU visibility from inside pod.

    kubectl exec -it <gpu-pod> -- nvidia-smi

    Expected: GPU inventory is visible without runtime errors.
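
The `gpu-smoke-test.yaml` applied in step 2 is not defined by the exam blueprint; a minimal sketch that satisfies steps 2 and 3 could look like this (pod name and image tag are illustrative):

```yaml
# Hypothetical smoke-test manifest; sleep keeps the pod Running so nvidia-smi
# can be exec'd in step 3.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request exactly one GPU from the device plugin
```

If the pod stays Pending instead of Running, the triage runbooks earlier in this module (events, allocatable GPUs, taints/tolerations) apply directly.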

Success Criteria

  • Scheduler allocates GPU resources as requested.
  • Operator and device plugin show no critical errors.
  • Validation output is captured for future baseline comparison.

Walkthrough: CNI and network policy validation

Prove intended connectivity behavior across namespaces with policy controls.

Prerequisites

  • Two namespaces with test pods deployed.
  • Defined allow/deny policy matrix.
  • Access to kubectl exec for test pods.
  1. List active policies and namespace bindings.

    kubectl get networkpolicy -A

    Expected: Policies appear in intended namespaces with expected selectors.

  2. Run allow-path connectivity test.

    kubectl exec -n ns-a pod-a -- nc -vz service-b.ns-b.svc.cluster.local 8080

    Expected: Allowed path succeeds.

  3. Run deny-path connectivity test.

    kubectl exec -n ns-a pod-a -- nc -vz denied-target.ns-b.svc.cluster.local 8080

    Expected: Denied path fails as intended.

Success Criteria

  • Policy behavior matches design for both allow and deny cases.
  • No unexpected cross-namespace leakage is observed.
  • Test evidence is attached to change record.

Study Sprint

10-day execution plan

Day 1: Kubernetes architecture refresh for AI networking context. Output: Control-plane and data-path dependency map.
Day 2: Node prerequisite and baseline validation workflow. Output: Pre-flight checklist for production clusters.
Day 3: GPU Operator install and health verification. Output: Operator readiness runbook with pass/fail gates.
Day 4: CNI plugin architecture and config inspection. Output: CNI verification checklist.
Day 5: Network policy and namespace isolation checks. Output: Policy test matrix (allow/deny paths).
Day 6: kubectl diagnostics drill for pending/not-ready workloads. Output: Incident triage command sequence.
Day 7: Service routing and DNS failure simulation. Output: Troubleshooting decision tree for service reachability.
Day 8: GPU scheduling and capacity constraints scenarios. Output: Scheduler and allocatable-capacity playbook.
Day 9: Timed scenario set combining CNI and GPU Operator faults. Output: Exam-style remediation notes.
Day 10: Final revision and checklist compression. Output: Kubernetes Integration quick revision sheet.

Hands-on Labs

Practical module work

Each lab includes a collapsed execution sample with representative CLI usage and expected output.

Lab A: Baseline cluster health and node prerequisites

Prove control-plane and worker nodes are ready before CNI/operator work.

  • Validate node readiness and kube-system pod health.
  • Confirm runtime, DNS, and clock state are consistent across nodes.
  • Record baseline outputs for comparison during incident drills.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Cluster and GPU readiness baseline)

kubectl get nodes -o wide && kubectl get pods -n kube-system

Expected output (example)

All required nodes are Ready and system pods are healthy.

Lab B: GPU Operator validation lab

Deploy and validate operator-managed components on GPU workers.

  • Verify operator deployment and daemonset status.
  • Launch a GPU test pod and confirm device visibility.
  • Collect logs/events for failure-path handling.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Cluster and GPU readiness baseline)

kubectl get pods -n gpu-operator && kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=80

Expected output (example)

Operator components are Running with no device registration errors.

Lab C: CNI and policy behavior lab

Validate data-path and policy behavior with deterministic checks.

  • Inspect CNI configuration and plugin chain on worker nodes.
  • Run cross-namespace connectivity tests with policy toggles.
  • Confirm expected allow/deny behavior with packet evidence.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: CNI and policy troubleshooting)

kubectl get networkpolicy -A

Expected output (example)

Policy set matches intended namespace segmentation model.

Lab D: kubectl-led incident response lab

Diagnose and remediate a simulated cluster networking incident.

  • Identify symptom class from events and pod status.
  • Narrow root cause to scheduler, runtime, CNI, or policy layer.
  • Apply one fix and validate workload recovery.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: CNI and policy troubleshooting)

kubectl exec -n <ns> <pod> -- sh -c 'nc -vz <target> <port>'

Expected output (example)

Connectivity result aligns with expected allow/deny behavior.

Exam Pitfalls

Common failure patterns

  • Troubleshooting CNI before verifying node and runtime prerequisites.
  • Assuming GPU Operator readiness means network path is healthy.
  • Skipping namespace or network policy checks during connectivity incidents.
  • Changing multiple cluster controls at once and losing causality.
  • Reading pod status without checking events and node conditions.
  • Declaring fix success without rerunning workload-level validation.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. What is the most reliable first step when GPU pods remain Pending?
  • A. Restart all pods
  • B. Check pod events, node allocatable resources, and taints/tolerations
  • C. Disable network policies immediately
  • D. Reinstall Kubernetes control-plane

Answer: B

Pending state is usually scheduler/resource related first; events and node allocatable fields provide the quickest signal.

Q2. Which statement best describes GPU Operator scope?
  • A. It replaces Kubernetes networking plugins
  • B. It manages GPU software stack components and device exposure workflows
  • C. It configures all storage backends
  • D. It disables node labels and taints

Answer: B

GPU Operator focuses on GPU stack lifecycle and related components, not full CNI management.

Q3. Why should CNI debugging include plugin-chain inspection?
  • A. Plugin order and config directly affect pod network setup behavior
  • B. It increases scheduler throughput
  • C. It removes need for policies
  • D. It only matters for control-plane nodes

Answer: A

Misordered or invalid CNI configs can break pod networking even when nodes appear healthy.

Q4. What is the best way to validate network policy behavior?
  • A. Assume defaults are secure
  • B. Execute explicit allow/deny connectivity tests across namespaces
  • C. Disable policy engine
  • D. Test only DNS queries

Answer: B

Policy behavior should be proven with controlled positive and negative traffic tests.

Q5. Which command set is highest value for fast incident triage?
  • A. kubectl get pods only
  • B. kubectl describe pods, kubectl get events, and node condition checks
  • C. External benchmark tools only
  • D. Git log history

Answer: B

These provide immediate scheduling/runtime/policy context needed to isolate root cause.

Q6. Why does MTU consistency matter in Kubernetes AI clusters?
  • A. MTU has no effect on pod networking
  • B. Mismatched MTU can trigger fragmentation, drops, and unstable throughput
  • C. It only affects storage and not pods
  • D. It is controlled only by the API server

Answer: B

MTU mismatches commonly create hard-to-diagnose network performance issues.

Q7. In exam scenarios, what proves your remediation is complete?
  • A. One pod reached Running
  • B. Root cause identified, targeted fix applied, and workload-level validation passes
  • C. Monitoring dashboards are open
  • D. Incident ticket is closed

Answer: B

Exam answers are strongest when they include cause, action, and validated outcome.

Q8. Which scope item is explicitly tied to this domain?
  • A. Install and use NVIDIA GPU Operator in Kubernetes
  • B. Replace all switch firmware
  • C. Build only bare-metal clusters
  • D. Disable kubectl

Answer: A

GPU Operator installation and usage is explicitly included in this blueprint domain.

Primary References

Curated from official NVIDIA NCP-AIN blueprint/study guide sources and primary Kubernetes/GPU operator documentation.

Objectives

  • Describe architecture and technologies of Kubernetes.
  • Install and configure node and software infrastructure for Kubernetes.
  • Install and use NVIDIA GPU Operator in Kubernetes.
  • Describe architecture and configuration of Kubernetes networking plugins and CNIs.
  • Install and configure Kubernetes networking plugin.
  • Use Kubernetes command line (kubectl) for management and diagnostics.
