
Automation and Configuration

Module study guide

Priority 4 of 6 · Domain 6 in exam order

Scope

Exam study content

This module contains expanded study notes, scenario playbooks, command runbooks, and exam-style checkpoint questions.

Exam weight: 10%
Priority tier: Tier 2
Why this domain: Configuration at scale underpins repeatability, drift control, and controlled rollout in AI networking environments.

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard changes that show no reproducible benefit.

Exam Scope Coverage

What this module now covers

This module covers Automation and Configuration scope: scalable network configuration workflows, automation tooling patterns, drift detection, controlled change rollout, and rollback-safe operations for AI networking environments.

Track 1: Configuration-at-scale fundamentals

Manual configuration does not scale for AI fabrics; deterministic automation is required.

  • Configuration intent must be template-driven and version-controlled.
  • Pre-check, change, and post-check phases should be explicit in every rollout.
  • Rollback criteria must be defined before execution.

Drill: Design a three-phase change workflow for a 32-switch maintenance window.
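
The three-phase workflow above can be sketched as a small wrapper script. The helper names and stand-in commands here are illustrative assumptions, not NVIDIA tooling; in a real rollout each argument would be an actual check or apply step (for example, `nv show interface` as a pre-check).

```shell
#!/bin/sh
# Sketch of a pre-check / change / post-check wrapper (hypothetical helpers).
# Each phase is logged with a timestamp so the change record is auditable.

phase() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1"
}

run_change() {
  pre="$1"; change="$2"; post="$3"
  phase "PRE-CHECK"
  sh -c "$pre"    || { phase "ABORT: pre-check failed, nothing applied"; return 1; }
  phase "CHANGE"
  sh -c "$change" || { phase "ROLLBACK: apply failed"; return 1; }
  phase "POST-CHECK"
  sh -c "$post"   || { phase "ROLLBACK: post-check gate failed"; return 1; }
  phase "PROMOTE"
}

# Stand-in commands; a real maintenance window substitutes CLI validation
# and the automation apply step for each of the three arguments.
run_change "true" "echo applying >/dev/null" "true"
```

Note that the rollback criterion is encoded before execution: any failed phase returns early with an explicit decision line, which matches the third bullet above.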

Track 2: Automation tools and control surfaces

The blueprint expects practical use of tools to automate and scale configuration tasks.

  • Use automation frameworks to enforce consistent config deployment and validation.
  • Separate source-of-truth, rendered config, and runtime state checks.
  • Use idempotent operations to avoid unintended repeated changes.

Drill: Create an automation inventory and map each tool to its control responsibility.
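
The idempotency bullet can be illustrated with a minimal sketch: the apply step compares rendered intent with live state and becomes a no-op once converged. Local files stand in for the rendered config and the device state here; that substitution is an assumption for the sake of a runnable example.

```shell
#!/bin/sh
# Idempotent apply sketch: re-running never re-touches a converged target.
# Files stand in for rendered intent and live device config (an assumption).
apply_if_changed() {
  intended="$1"; live="$2"
  if cmp -s "$intended" "$live"; then
    echo "no-op: already converged"
  else
    cp "$intended" "$live"          # stand-in for the real config push
    echo "applied: state changed"
  fi
}

printf 'mtu 9216\n' > intent.cfg
printf 'mtu 1500\n' > live.cfg
apply_if_changed intent.cfg live.cfg   # first run changes state
apply_if_changed intent.cfg live.cfg   # second run is a no-op
```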

Track 3: CLI validation and optimization loops

Automation does not remove CLI responsibility; CLI still validates runtime reality.

  • After automation rollout, verify interfaces, routes, and policy state with CLI checks.
  • Tie config changes to measurable network performance outcomes.
  • Use staged rollout to reduce blast radius.

Drill: Run one staged change in lab and produce before/after validation evidence.

Track 4: Drift detection and policy compliance

Configuration drift is a top source of recurring incidents in multi-team environments.

  • Regularly compare intended config with live device state.
  • Classify drift by risk level and remediation urgency.
  • Enforce policy checks before change promotion.

Drill: Build a drift report format that includes severity, owner, and remediation SLA.
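
A minimal drift scan consistent with the bullets above might look like the following sketch. The directory layout (one rendered intent file and one captured live config per host) is an assumption, and the severity/owner fields are placeholders a real classifier would fill.

```shell
#!/bin/sh
# Drift scan sketch: emit one report line per host whose captured live
# config diverges from rendered intent (directory layout is illustrative).
drift_report() {
  intent_dir="$1"; live_dir="$2"
  for f in "$intent_dir"/*.cfg; do
    host=$(basename "$f" .cfg)
    if ! cmp -s "$f" "$live_dir/$host.cfg"; then
      # a real classifier would assign severity, owner, and remediation SLA
      echo "host=$host status=DRIFT severity=unclassified owner=tbd"
    fi
  done
}
```

The drill's report format would extend each line with the remediation SLA and a link to change history.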

Track 5: Safe change management for AI workloads

AI cluster traffic is sensitive to network instability; change safety directly protects workload uptime.

  • Use canary scope and maintenance windows for high-risk modifications.
  • Define stop/rollback conditions linked to objective metrics.
  • Capture change evidence for audit and incident replay.

Drill: Define go/no-go and rollback gates for a fabric-wide configuration update.
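
The stop/rollback condition in the second bullet can be expressed as a numeric gate. The metric, units, and tolerance below are placeholders; in practice the values would come from whatever objective measurement (for example, iperf3 throughput) the change plan defines.

```shell
#!/bin/sh
# Metric gate sketch: promote only if the observed value stays within a
# percentage tolerance of baseline (integer arithmetic, illustrative only).
gate() {
  baseline="$1"; observed="$2"; tolerance_pct="$3"
  floor=$(( baseline * (100 - tolerance_pct) / 100 ))
  if [ "$observed" -ge "$floor" ]; then
    echo "PROMOTE"
  else
    echo "ROLLBACK"
  fi
}

gate 100 97 5    # within 5% of baseline: promote
gate 100 90 5    # below the 95 floor: roll back
```

Because the gate is pure arithmetic on pre-agreed numbers, the go/no-go decision in the drill stays objective rather than judgment-based during the window.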

Module Resources

Downloads and quick links

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.
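
One way to meet this standard is to wrap every validation command so its invocation, timestamp, and raw output land in a single artifact. The record format below is an assumption, not a mandated schema.

```shell
#!/bin/sh
# Evidence wrapper sketch: records the command, a UTC timestamp, and the raw
# output so another engineer can reproduce the check verbatim.
evidence() {
  echo "ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "cmd=$*"
  echo "--- output ---"
  "$@"
}

# Real usage would be e.g.: evidence nv show interface
# A stand-in command keeps this example runnable anywhere:
evidence echo link up
```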

Intent vs runtime state

Template/render output defines intent, but operational correctness is determined by live runtime state.

  • Validate intent and runtime separately.
  • Use post-checks that directly query device state.
  • Treat divergence as operational risk, not cosmetic drift.

Change safety economics

Canary rollout and rollback planning reduce outage blast radius and recovery cost.

  • High-risk changes should start with minimal scope.
  • Promotion requires objective gate success.
  • Rollback should be executable in minutes, not hours.
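
"Minutes, not hours" usually means restoring a snapshot captured before the change, rather than re-deriving configuration under pressure. A file-based sketch (the paths and push mechanism are illustrative):

```shell
#!/bin/sh
# Snapshot/rollback sketch: rollback is a single restore of the pre-change
# state captured up front, not a re-render of intent.
snapshot() { cp "$1" "$1.bak"; }
rollback() { cp "$1.bak" "$1"; echo "rolled back $1"; }

printf 'mtu 9216\n' > device.cfg
snapshot device.cfg                 # capture before the change
printf 'mtu 1500\n' > device.cfg    # bad change lands
rollback device.cfg                 # one copy restores known-good state
```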

Automation as a control loop

Automation should include sensing, deciding, acting, and validating in a closed loop.

  • Pre-checks establish safe starting state.
  • Execution applies deterministic intent.
  • Post-checks confirm policy and performance outcomes.
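
The sense/decide/act/validate loop can be sketched as retry-bounded convergence. A local file stands in for live device state, and the "act" step is a hypothetical stand-in for a real config push.

```shell
#!/bin/sh
# Closed-loop sketch: sense live state, compare to intent, act if divergent,
# and validate on the next pass; a retry budget bounds the loop.
converge() {
  intent="$1"; state_file="$2"; tries=3
  while [ "$tries" -gt 0 ]; do
    current=$(cat "$state_file")             # sense
    if [ "$current" = "$intent" ]; then      # decide: converged?
      echo "converged"; return 0
    fi
    echo "$intent" > "$state_file"           # act (stand-in for config push)
    tries=$((tries - 1))
  done                                        # validate happens on re-sense
  echo "failed to converge"; return 1
}

echo "old" > state.txt
converge "new" state.txt    # acts once, then validates on the next pass
```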

Scenario Playbooks

Exam-style scenario explanations

Scenario: Config rollout succeeds but training latency worsens

An automation job reports success across the fabric, but distributed training latency increases immediately after rollout.

Architecture Diagram

Source-of-Truth
    |
Automation Pipeline
    |
Switch Fleet ---- AI Workloads

Response Flow

  1. Diff intended config and previous known-good state.
  2. Validate live runtime state with CLI on canary and affected nodes.
  3. Correlate performance metrics with changed control points.
  4. Rollback targeted scope if post-check gates fail.

Success Signals

  • Latency returns to baseline tolerance after remediation.
  • Root cause ties to one specific change group.
  • Automation pipeline includes stronger post-check gates.

State verification and diff cues

nv show interface && nv show route

Expected output (example)

Runtime state confirms whether intended change converged correctly.

Performance regression confirmation

iperf3 -c <peer_ip> -P 8 -t 30

Expected output (example)

Benchmark verifies regression before rollback and recovery after fix.

Scenario: Drift detected on subset of switches after emergency change

Emergency manual edits solved an incident, but drift now exists on part of the fleet.

Architecture Diagram

Source-of-Truth Repo
      |
Drift Detection Job
      |
Fleet Nodes (canary + production)

Response Flow

  1. Run drift scan and classify deviations by risk.
  2. Restore intended state in canary group first.
  3. Validate policy/performance behavior post-remediation.
  4. Promote remediation to full fleet with rollback readiness.

Success Signals

  • Drift severity is classified and remediated with traceability.
  • Fleet converges to intended state.
  • No performance or security regressions are introduced.

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute command with explicit scope (node, interface, GPU set).
  3. Compare output against expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).

Runbook: Pre-check and staged rollout

Execute a safe configuration rollout with explicit gates.

Baseline state snapshot

nv show interface && nv show route

Expected output (example)

Known-good baseline captured before rollout.

Apply automation workflow

ansible-playbook -i inventory fabric-rollout.yml --limit canary

Expected output (example)

Canary scope converges with no execution errors.
  • Promote beyond canary only after post-check gates pass.
  • Rollback trigger should be metric-based and pre-defined.

Runbook: Drift detection and remediation

Detect divergence and safely restore intended state.

Drift report generation

ansible-playbook -i inventory drift-check.yml

Expected output (example)

Drift list includes host, control area, and severity.

Targeted remediation apply

ansible-playbook -i inventory remediate-drift.yml --limit <host_or_group>

Expected output (example)

Selected nodes converge to intended configuration.
  • Always rerun validation commands after remediation.
  • Preserve audit trail linking drift to change history.

Common Problems

Failure patterns and fixes

Automation success reported, but runtime state is inconsistent

Symptoms

  • Some devices show expected config while others diverge.
  • Performance behavior is inconsistent by rack or tenant.

Likely Cause

Pipeline success reflects execution status, not validated convergence.

Remediation

  • Add runtime CLI post-check gates to pipeline.
  • Re-run rollout on failed subset with canary boundaries.
  • Verify convergence and workload metrics after correction.

Prevention: Require convergence and SLO checks before marking job as successful.

Recurring drift after manual emergency changes

Symptoms

  • Same deviations reappear after routine automation runs.
  • Policy compliance reports fluctuate across weeks.

Likely Cause

Emergency edits bypassed source-of-truth update process.

Remediation

  • Backport emergency fix into source-of-truth templates.
  • Reconcile drift with staged remediation workflow.
  • Enforce change policy requiring post-incident template updates.

Prevention: Integrate emergency change reconciliation into standard incident closeout.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: Canary rollout with rollback gates

Execute one production-like change with deterministic safety controls.

Prerequisites

  • Inventory with canary and production groups.
  • Versioned template change ready for deployment.
  • Baseline metrics and CLI access.
  1. Capture pre-change baseline and success thresholds.

    nv show interface && nv show route

    Expected: Baseline state and gate thresholds are recorded.

  2. Apply change to canary scope only.

    ansible-playbook -i inventory fabric-rollout.yml --limit canary

    Expected: Canary converges without execution failures.

  3. Run post-checks and decide promote/rollback.

    iperf3 -c <peer_ip> -P 8 -t 30

    Expected: Metrics remain within allowed tolerance.

Success Criteria

  • Canary gate decision is evidence-based.
  • Rollback path is tested and documented.
  • Change record includes before/after artifacts.

Walkthrough: Drift remediation cycle

Detect and remediate drift while preserving workload stability.

Prerequisites

  • Active source-of-truth repository.
  • Automation job for drift detection.
  • CLI validation access on target devices.
  1. Generate drift report and prioritize fixes.

    ansible-playbook -i inventory drift-check.yml

    Expected: Drift entries include severity and owner.

  2. Apply targeted remediation to highest-risk nodes.

    ansible-playbook -i inventory remediate-drift.yml --limit high-risk

    Expected: High-risk nodes converge to intended state.

  3. Validate policy and performance behavior post-fix.

    nv show route && iperf3 -c <peer_ip> -P 8 -t 30

    Expected: No policy/performance regressions are observed.

Success Criteria

  • Drift is reduced and documented with traceability.
  • Runtime state matches intent on remediated scope.
  • Follow-up cadence is defined to prevent recurrence.

Study Sprint

10-day execution plan

Day | Focus | Output
1 | Blueprint objective mapping for automation and config scope | Domain objective-to-tool matrix
2 | Source-of-truth and config templating model setup | Template and inventory structure
3 | Automation execution workflow (pre-check, apply, post-check) | Standard rollout playbook
4 | CLI verification gates after automated change | Post-change validation command set
5 | Drift detection and compliance reporting workflow | Drift report template
6 | Canary rollout and rollback trigger design | Risk-gated rollout plan
7 | Scenario: partial rollout failure and recovery | Failure containment runbook
8 | Scenario: performance regression after config push | Performance validation checklist
9 | Timed exam-style automation troubleshooting drills | Scenario response templates
10 | Final revision with command and policy recall | Automation and Configuration quick revision sheet

Hands-on Labs

Practical module work

Each lab includes an execution sample with representative CLI usage and expected output.

Lab A: Template-driven baseline deployment

Deploy standardized config from source-of-truth to multiple devices safely.

  • Render intended configuration for target scope.
  • Execute pre-checks and snapshot baseline state.
  • Apply config and verify expected state convergence.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Pre-check and staged rollout)

nv show interface && nv show route

Expected output (example)

Known-good baseline captured before rollout.

Lab B: Drift detection and remediation

Detect unauthorized or accidental drift and restore intended state.

  • Run drift comparison between intent and live state.
  • Classify drift by severity and operational risk.
  • Apply corrective action and re-verify compliance.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Drift detection and remediation)

ansible-playbook -i inventory drift-check.yml

Expected output (example)

Drift list includes host, control area, and severity.

Lab C: Canary rollout with rollback guardrails

Practice low-risk staged rollout for high-impact changes.

  • Select canary subset and define success criteria.
  • Promote only if canary metrics pass thresholds.
  • Trigger rollback when gate conditions fail.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Runbook: Pre-check and staged rollout)

ansible-playbook -i inventory fabric-rollout.yml --limit canary

Expected output (example)

Canary scope converges with no execution errors.

Lab D: Config change performance validation

Ensure configuration changes preserve workload SLOs.

  • Capture benchmark and telemetry baseline before change.
  • Apply one controlled change and rerun benchmark.
  • Validate throughput/latency and queue behavior post-change.

Execution Sample
  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Scenario: Performance regression confirmation)

iperf3 -c <peer_ip> -P 8 -t 30

Expected output (example)

Benchmark confirms throughput and latency stay within tolerance after the change.

Exam Pitfalls

Common failure patterns

  • Automating rollout without pre-check and rollback criteria.
  • Confusing rendered template output with live runtime state.
  • Skipping CLI verification after automation completes.
  • Ignoring drift until incidents occur.
  • Promoting wide-scope changes without canary validation.
  • Declaring success without workload-level performance checks.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. What is the highest-value property of network automation in this domain?
  • A. Fewer commands
  • B. Deterministic, repeatable configuration with validation gates
  • C. No need for backups
  • D. Automatic elimination of all incidents

Answer: B

Automation value comes from repeatability, consistency, and controlled validation, not command count reduction alone.

Q2. Which sequence best matches safe rollout practice?
  • A. Apply to all devices, then test
  • B. Pre-check, canary apply, post-check, promote or rollback
  • C. Disable monitoring first
  • D. Skip baseline capture

Answer: B

Staged rollout with explicit gates and rollback reduces blast radius in production fabrics.

Q3. Why is drift detection critical in AI networking operations?
  • A. Drift is harmless
  • B. Drift can silently invalidate policy, performance, and troubleshooting assumptions
  • C. Drift only affects UI dashboards
  • D. Drift replaces documentation

Answer: B

Uncontrolled drift undermines deterministic operations and causes recurring hard-to-diagnose issues.

Q4. What proves an automated change is complete?
  • A. Automation job returned success
  • B. Live CLI state and workload metrics match intended outcome
  • C. Configuration file exists in Git
  • D. One device converged

Answer: B

Completion requires runtime validation, not just orchestration success messages.

Q5. Which anti-pattern increases incident risk the most?
  • A. Canary rollout
  • B. Wide-scope deployment with no rollback trigger
  • C. Baseline snapshot capture
  • D. Drift report review

Answer: B

Unbounded rollout without rollback gates can turn minor errors into large outages.

Q6. Why keep CLI checks in automated workflows?
  • A. CLI is obsolete
  • B. CLI validates actual live state and catches tooling blind spots
  • C. CLI always replaces automation
  • D. CLI only helps with documentation

Answer: B

CLI provides direct state verification that complements automated execution reports.

Q7. In exam responses, what strengthens automation answers?
  • A. Mentioning tool names only
  • B. Including validation gates, rollback criteria, and evidence artifacts
  • C. Avoiding metric references
  • D. Skipping change sequencing

Answer: B

Blueprint-aligned answers are operationally precise and include measurable controls.

Q8. Which objective is explicitly in this domain?
  • A. Use tools to automate and scale configuration tasks
  • B. Build AI model tokenizer
  • C. Replace packet captures with intuition
  • D. Disable command-line usage

Answer: A

Automation and scaled configuration control is central to this blueprint area.

Primary References

Curated from official NVIDIA NCP-AIN blueprint/study guide sources and primary automation/configuration documentation.

Objectives

  • Use tools to automate and scale configuration tasks.
  • Configure and optimize networking by using command line (CLI).
  • Apply repeatable configuration templates for multi-device rollout.
  • Validate configuration drift and enforce policy compliance.
  • Design rollback-safe change workflows for production environments.
