
Kubernetes Integration

Module study guide

Priority 6 of 6 · Domain 4 in exam order

Scope

Exam study content

This module contains expanded study notes, scenario playbooks, command runbooks, and exam-style checkpoint questions.

Exam weight: 5%
Priority tier: Tier 3
Why this domain: Low-weight but critical integration scope for GPU platform readiness and CNI-aware orchestration.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes without reproducible benefit.

Exam Scope Coverage

What this module covers

This module covers Kubernetes integration scope for NCP-AIN: cluster bring-up dependencies, NVIDIA GPU Operator validation, CNI and networking plugin behavior, and kubectl-driven diagnostics for AI workloads.

Track 1: Kubernetes architecture for AI networking

You need to understand where orchestration interacts with networking and GPU scheduling before troubleshooting symptoms.

  • Control-plane, worker-plane, and CNI responsibilities are separate failure domains.
  • Pod scheduling, device plugins, and CNI pathing jointly determine workload readiness.
  • Node labels, taints, and runtime classes directly affect GPU workload placement.

Drill: Map one training workload from API submit to pod networking and GPU allocation steps.
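
The placement controls above (labels, taints, runtime classes, device resources) can be sketched in a single pod spec. This is a hedged illustration only: the label, taint key, runtime class name, and image tag are assumptions that vary by cluster standard.

```yaml
# Illustrative only: label, taint key, runtime class, and image vary by cluster.
apiVersion: v1
kind: Pod
metadata:
  name: placement-demo
spec:
  runtimeClassName: nvidia            # selects the GPU-enabled container runtime
  nodeSelector:
    nvidia.com/gpu.present: "true"    # node label (e.g. from GPU feature discovery)
  tolerations:
    - key: nvidia.com/gpu             # tolerates a GPU-only taint, if one is applied
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1           # extended resource advertised by the device plugin
```

Walking this spec through scheduling (label match, taint toleration, allocatable GPU check, runtime class resolution) is a compact version of the drill above.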

Track 2: Cluster/node prerequisites and baseline validation

Many Kubernetes networking failures originate in node prerequisites, not in overlay or routing logic.

  • Kernel modules, container runtime, and kubelet settings must match cluster standard.
  • Time sync, DNS reachability, and MTU consistency are required before CNI validation.
  • GPU nodes require deterministic driver/runtime compatibility for stable operator behavior.

Drill: Create a pre-flight checklist and run it on one control-plane and one worker node.

Track 3: NVIDIA GPU Operator and node readiness

The exam explicitly expects operator usage and operational verification patterns.

  • Validate operator-managed components: driver, toolkit, device plugin, and DCGM exporter.
  • Distinguish operator rollout failure from CNI or scheduler issues.
  • Confirm allocatable GPU resources at node and namespace scope.

Drill: Run end-to-end validation proving a GPU pod is admitted and can see expected devices.

Track 4: CNI plugin and Kubernetes network integration

Network plugin misconfiguration can mimic compute or application failures.

  • Know CNI config locations and plugin chain order on worker nodes.
  • Validate network policy behavior separately from CNI data-path health.
  • Understand secondary interface patterns for high-performance data paths.

Drill: Test pod-to-pod, pod-to-service, and node-to-pod connectivity with one policy-enabled namespace.
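
The secondary-interface pattern mentioned above is commonly implemented with the Multus meta-plugin. The following is a hedged sketch assuming Multus is installed; the parent interface name, namespace, and subnet are placeholders, not cluster values.

```yaml
# Hypothetical secondary interface for a high-performance data path (assumes Multus).
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: data-path
  namespace: ml-workloads
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens1f0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.100.0/24"
      }
    }
```

A pod would request the attachment with the annotation `k8s.v1.cni.cncf.io/networks: data-path`, leaving the primary CNI interface untouched for cluster traffic.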

Track 5: kubectl diagnostics and incident workflow

Fast, deterministic command flow is required for exam scenarios and production incidents.

  • Use describe/events/logs to isolate scheduling vs runtime vs network causes.
  • Capture node condition, daemonset health, and namespace-scoped policy state.
  • Tie every remediation to objective before/after validation evidence.

Drill: Resolve a pending GPU pod incident and produce a short root-cause timeline.

Module Resources

Downloads and quick links

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.

Orchestrator and network coupling

Kubernetes integration failures often appear as application incidents but originate from node, CNI, or policy coupling issues.

  • Always classify failures by layer before applying fixes.
  • Node readiness and CNI readiness are separate checks.
  • Operator health does not guarantee workload path health.

Deterministic kubectl triage

Consistent command flow shortens mean-time-to-isolation and improves exam response quality.

  • Start with pod status and events.
  • Validate node conditions and allocatable resources next.
  • Confirm policy and service routing only after baseline checks.

Policy-safe performance operations

Performance fixes must preserve namespace isolation and intended policy behavior.

  • Do not bypass policy controls to mask data-path defects.
  • Validate allowed and denied flows after every change.
  • Keep rollback conditions explicit before policy updates.

Scenario Playbooks

Exam-style scenario explanations

Scenario: GPU training pods stay Pending after cluster update

A cluster update completed, but new training pods requesting GPUs remain Pending in one namespace.

Architecture Diagram

API Server
  |
Scheduler ---- etcd
  |
GPU Worker Nodes
  |
CNI Plugin + GPU Operator Components

Response Flow

  1. Check pod events and scheduler reasons for failed placement.
  2. Validate node allocatable GPU resources and taints/tolerations alignment.
  3. Confirm GPU Operator daemonsets and device plugin state.
  4. Apply one targeted remediation and re-validate placement.

Success Signals

  • Pods transition from Pending to Running with expected GPU allocation.
  • No policy or namespace isolation regressions are introduced.
  • Root cause is tied to one control layer and documented.

Pod and event triage

kubectl describe pod <pod> -n <ns> && kubectl get events -n <ns> --sort-by=.lastTimestamp

Expected output (example)

Scheduling or resource errors are explicit and time-correlated.

GPU allocatable check

kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOC:.status.allocatable.nvidia\.com/gpu

Expected output (example)

Node allocatable GPU values match expected capacity.

Scenario: Pod-to-pod traffic fails across namespaces after policy change

A namespace policy hardening rollout was applied; cross-namespace service calls now fail intermittently.

Architecture Diagram

Namespace A Pods ---- Service A
Namespace B Pods ---- Service B
        |
CNI Plugin + NetworkPolicy Engine

Response Flow

  1. Validate network policy objects and intended allow/deny matrix.
  2. Run controlled connectivity probes between namespace endpoints.
  3. Inspect CNI plugin health and node-level logs for policy application errors.
  4. Rollback or patch policy with minimal scope and retest.
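
A minimal-scope patch for step 4 might look like the following NetworkPolicy sketch. The namespace names, pod labels, and port are illustrative assumptions chosen to match this scenario, not a prescribed policy.

```yaml
# Hypothetical minimal-scope allow rule: Namespace A may reach Service B on one port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ns-a
  namespace: ns-b
spec:
  podSelector:
    matchLabels:
      app: service-b              # scope enforcement to the affected service pods only
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ns-a   # standard namespace name label
      ports:
        - protocol: TCP
          port: 8080
```

Scoping the rule to one pod selector and one port keeps the blast radius small and makes the retest in step 4 a clean allow/deny comparison.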

Success Signals

  • Expected traffic is restored and denied paths remain blocked.
  • Policy behavior is validated in repeat tests.
  • CNI/plugin health remains stable post-change.

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute the command with explicit scope (node, interface, GPU set).
  3. Compare output against the expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).

Runbook: Cluster and GPU readiness baseline

Establish whether issue source is scheduler/resource/operator before CNI tuning.

Node and system pod health

kubectl get nodes -o wide && kubectl get pods -n kube-system

Expected output (example)

All required nodes are Ready and system pods are healthy.

GPU operator and device plugin health

kubectl get pods -n gpu-operator && kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=80

Expected output (example)

Operator components are Running with no device registration errors.
  • If nodes are NotReady, resolve node/runtime baseline before CNI troubleshooting.
  • If operator is unhealthy, remediation should focus on GPU stack lifecycle first.

Runbook: CNI and policy troubleshooting

Isolate data-path versus policy enforcement failures.

Policy inventory and namespace scoping

kubectl get networkpolicy -A

Expected output (example)

Policy set matches intended namespace segmentation model.

Endpoint-level connectivity probes

kubectl exec -n <ns> <pod> -- sh -c 'nc -vz <target> <port>'

Expected output (example)

Connectivity result aligns with expected allow/deny behavior.
  • Use both positive and negative tests for policy validation.
  • Capture timestamped evidence before and after changes.

Common Problems

Failure patterns and fixes

GPU Operator components remain CrashLoopBackOff

Symptoms

  • Pods in gpu-operator namespace repeatedly restart.
  • GPU resources are not allocatable on worker nodes.

Likely Cause

Driver/runtime mismatch or incomplete node prerequisites.

Remediation

  • Check operator and device plugin logs for registration failures.
  • Verify runtime/driver compatibility and node kernel dependencies.
  • Redeploy operator components after prerequisite correction.

Prevention: Run node pre-flight compatibility checks before operator rollout.

Pod networking unstable after CNI update

Symptoms

  • Intermittent packet loss between namespaces.
  • Service reachability flaps during peak traffic.

Likely Cause

CNI config drift or MTU mismatch introduced during update.

Remediation

  • Validate CNI config consistency across worker nodes.
  • Check MTU alignment for node and pod interfaces.
  • Rollback/patch CNI config and rerun connectivity matrix.

Prevention: Use staged CNI rollouts with explicit pre/post connectivity tests.
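
Where the CNI exposes MTU explicitly, pinning it in configuration removes one source of drift. As one hedged example, assuming a Calico operator-managed install (other CNIs set MTU elsewhere), the value lives in the Installation resource:

```yaml
# Hypothetical example assuming Calico's operator; the value must leave
# encapsulation headroom below the node/fabric MTU (e.g. 9000 jumbo frames).
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    mtu: 8950
```

An explicit, version-controlled MTU value makes pre/post comparison during staged rollouts a simple diff rather than a packet-capture exercise.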

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: Validate GPU Operator readiness

Confirm end-to-end GPU workload admission and device visibility in Kubernetes.

Prerequisites

  • Kubernetes cluster with at least one GPU worker node.
  • Admin access for namespace and daemonset inspection.
  • GPU test container image available.
  1. Verify cluster and operator baseline health.

    kubectl get nodes && kubectl get pods -n gpu-operator

    Expected: Nodes are Ready and operator components are Running.

  2. Launch a GPU validation pod.

    kubectl apply -f gpu-smoke-test.yaml

    Expected: Pod schedules onto GPU node and enters Running state.

  3. Confirm GPU visibility from inside pod.

    kubectl exec -it <gpu-pod> -- nvidia-smi

    Expected: GPU inventory is visible without runtime errors.
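
The `gpu-smoke-test.yaml` applied in step 2 is not defined by the exam blueprint; a minimal sketch that satisfies steps 2 and 3 could look like this (pod name and image tag are illustrative):

```yaml
# Hypothetical smoke-test manifest; sleep keeps the pod Running so nvidia-smi
# can be exec'd in step 3.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request exactly one GPU from the device plugin
```

If the pod stays Pending instead of Running, the triage runbooks earlier in this module (events, allocatable GPUs, taints/tolerations) apply directly.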

Success Criteria

  • Scheduler allocates GPU resources as requested.
  • Operator and device plugin show no critical errors.
  • Validation output is captured for future baseline comparison.

Walkthrough: CNI and network policy validation

Prove intended connectivity behavior across namespaces with policy controls.

Prerequisites

  • Two namespaces with test pods deployed.
  • Defined allow/deny policy matrix.
  • Access to kubectl exec for test pods.
  1. List active policies and namespace bindings.

    kubectl get networkpolicy -A

    Expected: Policies appear in intended namespaces with expected selectors.

  2. Run allow-path connectivity test.

    kubectl exec -n ns-a pod-a -- nc -vz service-b.ns-b.svc.cluster.local 8080

    Expected: Allowed path succeeds.

  3. Run deny-path connectivity test.

    kubectl exec -n ns-a pod-a -- nc -vz denied-target.ns-b.svc.cluster.local 8080

    Expected: Denied path fails as intended.

Success Criteria

  • Policy behavior matches design for both allow and deny cases.
  • No unexpected cross-namespace leakage is observed.
  • Test evidence is attached to change record.

Study Sprint

10-day execution plan

Day 1: Kubernetes architecture refresh for AI networking context. Output: Control-plane and data-path dependency map.
Day 2: Node prerequisite and baseline validation workflow. Output: Pre-flight checklist for production clusters.
Day 3: GPU Operator install and health verification. Output: Operator readiness runbook with pass/fail gates.
Day 4: CNI plugin architecture and config inspection. Output: CNI verification checklist.
Day 5: Network policy and namespace isolation checks. Output: Policy test matrix (allow/deny paths).
Day 6: kubectl diagnostics drill for pending/not-ready workloads. Output: Incident triage command sequence.
Day 7: Service routing and DNS failure simulation. Output: Troubleshooting decision tree for service reachability.
Day 8: GPU scheduling and capacity constraints scenarios. Output: Scheduler and allocatable-capacity playbook.
Day 9: Timed scenario set combining CNI and GPU Operator faults. Output: Exam-style remediation notes.
Day 10: Final revision and checklist compression. Output: Kubernetes Integration quick revision sheet.

Hands-on Labs

Practical module work

Each lab includes a collapsed execution sample with representative CLI usage and expected output.

Lab A: Baseline cluster health and node prerequisites

Prove control-plane and worker nodes are ready before CNI/operator work.

  • Validate node readiness and kube-system pod health.
  • Confirm runtime, DNS, and clock state are consistent across nodes.
  • Record baseline outputs for comparison during incident drills.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Cluster and GPU readiness baseline)

kubectl get nodes -o wide && kubectl get pods -n kube-system

Expected output (example)

All required nodes are Ready and system pods are healthy.

Lab B: GPU Operator validation lab

Deploy and validate operator-managed components on GPU workers.

  • Verify operator deployment and daemonset status.
  • Launch a GPU test pod and confirm device visibility.
  • Collect logs/events for failure-path handling.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Cluster and GPU readiness baseline)

kubectl get pods -n gpu-operator && kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=80

Expected output (example)

Operator components are Running with no device registration errors.

Lab C: CNI and policy behavior lab

Validate data-path and policy behavior with deterministic checks.

  • Inspect CNI configuration and plugin chain on worker nodes.
  • Run cross-namespace connectivity tests with policy toggles.
  • Confirm expected allow/deny behavior with packet evidence.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: CNI and policy troubleshooting)

kubectl get networkpolicy -A

Expected output (example)

Policy set matches intended namespace segmentation model.

Lab D: kubectl-led incident response lab

Diagnose and remediate a simulated cluster networking incident.

  • Identify symptom class from events and pod status.
  • Narrow root cause to scheduler, runtime, CNI, or policy layer.
  • Apply one fix and validate workload recovery.
Execution Sample (Collapsed)
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: CNI and policy troubleshooting)

kubectl exec -n <ns> <pod> -- sh -c 'nc -vz <target> <port>'

Expected output (example)

Connectivity result aligns with expected allow/deny behavior.

Exam Pitfalls

Common failure patterns

  • Troubleshooting CNI before verifying node and runtime prerequisites.
  • Assuming GPU Operator readiness means network path is healthy.
  • Skipping namespace or network policy checks during connectivity incidents.
  • Changing multiple cluster controls at once and losing causality.
  • Reading pod status without checking events and node conditions.
  • Declaring fix success without rerunning workload-level validation.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. What is the most reliable first step when GPU pods remain Pending?
  • A. Restart all pods
  • B. Check pod events, node allocatable resources, and taints/tolerations
  • C. Disable network policies immediately
  • D. Reinstall Kubernetes control-plane

Answer: B

Pending state is usually scheduler/resource related first; events and node allocatable fields provide the quickest signal.

Q2. Which statement best describes GPU Operator scope?
  • A. It replaces Kubernetes networking plugins
  • B. It manages GPU software stack components and device exposure workflows
  • C. It configures all storage backends
  • D. It disables node labels and taints

Answer: B

GPU Operator focuses on GPU stack lifecycle and related components, not full CNI management.

Q3. Why should CNI debugging include plugin-chain inspection?
  • A. Plugin order and config directly affect pod network setup behavior
  • B. It increases scheduler throughput
  • C. It removes need for policies
  • D. It only matters for control-plane nodes

Answer: A

Misordered or invalid CNI configs can break pod networking even when nodes appear healthy.

Q4. What is the best way to validate network policy behavior?
  • A. Assume defaults are secure
  • B. Execute explicit allow/deny connectivity tests across namespaces
  • C. Disable policy engine
  • D. Test only DNS queries

Answer: B

Policy behavior should be proven with controlled positive and negative traffic tests.

Q5. Which command set is highest value for fast incident triage?
  • A. kubectl get pods only
  • B. kubectl describe pods, kubectl get events, and node condition checks
  • C. External benchmark tools only
  • D. Git log history

Answer: B

These provide immediate scheduling/runtime/policy context needed to isolate root cause.

Q6. Why does MTU consistency matter in Kubernetes AI clusters?
  • A. MTU has no effect on pod networking
  • B. Mismatched MTU can trigger fragmentation, drops, and unstable throughput
  • C. It only affects storage and not pods
  • D. It is controlled only by the API server

Answer: B

MTU mismatches commonly create hard-to-diagnose network performance issues.

Q7. In exam scenarios, what proves your remediation is complete?
  • A. One pod reached Running
  • B. Root cause identified, targeted fix applied, and workload-level validation passes
  • C. Monitoring dashboards are open
  • D. Incident ticket is closed

Answer: B

Exam answers are strongest when they include cause, action, and validated outcome.

Q8. Which scope item is explicitly tied to this domain?
  • A. Install and use NVIDIA GPU Operator in Kubernetes
  • B. Replace all switch firmware
  • C. Build only bare-metal clusters
  • D. Disable kubectl

Answer: A

GPU Operator installation and usage is explicitly included in this blueprint domain.

Primary References

Curated from official NVIDIA NCP-AIN blueprint/study guide sources and primary Kubernetes/GPU operator documentation.

Objectives

  • Describe architecture and technologies of Kubernetes.
  • Install and configure node and software infrastructure for Kubernetes.
  • Install and use NVIDIA GPU Operator in Kubernetes.
  • Describe architecture and configuration of Kubernetes networking plugins and CNIs.
  • Install and configure Kubernetes networking plugin.
  • Use Kubernetes command line (kubectl) for management and diagnostics.
