
Physical Layer Management

Module study guide

Priority 5 of 5 · Domain 2 in exam order

Scope

Exam study content

This module contains expanded study notes, practical drills, and an exam-style question set.

Exam weight
5%
Priority tier
Tier 3
Why this domain
Low exam weight, but it covers specialized operational tasks (BlueField and MIG).

Exam Framework

How to reason under pressure

1. Stabilize Before Optimizing

  • Verify hardware and management-plane integrity first.
  • Confirm firmware/software baseline consistency.
  • Only then make performance-tuning decisions.

2. Single-Variable Changes

  • Change one parameter at a time when investigating regressions.
  • Use before/after evidence with constant workload input.
  • Discard any change that shows no reproducible benefit.
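
The before/after discipline above can be sketched as a small decision helper. This is a minimal illustration, not exam-mandated tooling: the function name, the repeat counts, and the 2% gain threshold are all assumptions chosen for the example.

```python
# Sketch of a single-variable change review: compare a benchmark metric
# before and after one parameter change, with constant workload input,
# and keep the change only if the benefit is reproducible. The 2%
# threshold is an illustrative assumption, not a mandated value.
from statistics import median

def keep_change(before_runs, after_runs, min_gain=0.02):
    """Return True only if the median 'after' result beats the median
    'before' result by at least min_gain (relative improvement)."""
    base, new = median(before_runs), median(after_runs)
    return (new - base) / base >= min_gain

# Same workload input, three repeats each side for reproducibility.
before = [100.0, 101.0, 99.5]   # e.g. throughput in samples/s
after = [106.0, 105.5, 107.0]
print(keep_change(before, after))  # -> True (reproducible ~6% gain)
```

Running each side several times and comparing medians guards against keeping a change on the strength of a single lucky run.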

Exam Scope Coverage

What this module now covers

Domain 2 covers low-level resource control: BlueField network platform operations and MIG partitioning strategies for AI and HPC multi-tenant environments.

Track 1: BlueField platform role in AI infrastructure

The exam expects operational familiarity with DPU-backed networking and infrastructure control planes.

  • Understand where BlueField sits in data-path and management-path architecture.
  • Map BlueField functions to provisioning, security, and networking operations.
  • Verify BlueField firmware/software alignment with switch and host baselines.

Drill: Diagram one host-network path and annotate BlueField responsibilities in the flow.

Track 2: BlueField lifecycle operations

Domain scope includes configuration and management, not just conceptual understanding.

  • Establish a repeatable update and validation process for BlueField software and firmware.
  • Monitor basic health and connectivity after each lifecycle operation.
  • Treat BlueField changes as part of cluster-wide change control.

Drill: Write a BlueField maintenance runbook with pre-checks, execution controls, and verification steps.

Track 3: MIG fundamentals and partition strategy

MIG configuration is explicitly in scope and can affect utilization and isolation outcomes.

  • MIG partitions one GPU into isolated instances with dedicated compute and memory slices.
  • Profile choices should match workload shape and service-level expectations.
  • MIG planning differs across AI inference, AI training, and HPC usage patterns.

Drill: Given three workloads, assign MIG profiles and justify expected tradeoffs.

Track 4: MIG operations and observability

Correct MIG setup requires operational checks and stable inventory handling.

  • Use supported tooling to create, list, and validate MIG instances.
  • Track MIG allocation changes with auditability for multi-tenant environments.
  • Validate that scheduler/runtime stack maps jobs to intended MIG resources.

Drill: Perform a MIG create/validate/reset cycle and record all state transitions.
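
A reconciliation step for the drill above can be sketched as follows. The inventory format (GPU name mapped to a list of profile names) is an assumption chosen for illustration; it is not nvidia-smi's native output format.

```python
# Sketch of MIG inventory reconciliation: compare the approved layout
# against what a node actually reports and list per-GPU discrepancies.
# Inventory shape (gpu -> list of profile names) is an assumption.

def reconcile(expected, observed):
    """Return per-GPU differences between expected and observed profiles."""
    drift = {}
    for gpu in expected.keys() | observed.keys():
        want = sorted(expected.get(gpu, []))
        have = sorted(observed.get(gpu, []))
        if want != have:
            drift[gpu] = {"expected": want, "observed": have}
    return drift

expected = {"GPU0": ["1g.10gb", "1g.10gb", "2g.20gb"]}
observed = {"GPU0": ["1g.10gb", "2g.20gb", "2g.20gb"]}
print(reconcile(expected, observed))
```

An empty result means the node matches policy; any non-empty result is a state transition worth recording in the drill log.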

Track 5: AI vs HPC configuration tradeoffs

The blueprint highlights both AI and HPC contexts for physical-layer decisions.

  • Latency-sensitive AI inference may prioritize predictable partition behavior.
  • HPC or large-model workloads may require fewer or no partitions for full-GPU access.
  • Resource isolation policies should align with job classes and operational objectives.

Drill: Build an allocation policy table: workload type, MIG policy, and expected operational impact.
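
The allocation policy table from the drill can be encoded directly. The class names, profile choices, and impact notes below are illustrative assumptions showing the shape of the artifact, not blueprint values.

```python
# Illustrative allocation policy table: workload class -> MIG policy and
# expected operational impact. All entries are example assumptions.
POLICY = {
    "ai-inference": {"mig": "1g.10gb slices",
                     "impact": "predictable isolation, higher instance count"},
    "ai-training":  {"mig": "3g.40gb or full GPU",
                     "impact": "fewer tenants, larger memory per job"},
    "hpc-solver":   {"mig": "full GPU (MIG off)",
                     "impact": "maximum per-job bandwidth, no partition isolation"},
}

def policy_for(workload_class):
    """Look up the approved MIG policy for a workload class; unknown
    classes are rejected rather than silently defaulted."""
    if workload_class not in POLICY:
        raise KeyError(f"no approved policy for class {workload_class!r}")
    return POLICY[workload_class]

print(policy_for("ai-inference")["mig"])  # -> 1g.10gb slices
```

Rejecting unknown classes, rather than defaulting them, is what prevents the ad-hoc allocation drift described later in this module.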

Track 6: Failure patterns in physical-layer management

Most operational regressions come from drift, undocumented changes, or profile mismatch.

  • Avoid undocumented ad-hoc profile changes across production nodes.
  • Reconcile BlueField and MIG state after maintenance windows.
  • Integrate physical-layer checks into release and incident workflows.

Drill: Create a post-maintenance audit checklist for BlueField and MIG state consistency.

Concept Explanations

Deep-dive concept library

Exam Decision Hierarchy

Prioritize decisions in this order: safety and hardware integrity, baseline consistency, controlled validation, then optimization.

  • If integrity checks fail, stop optimization and remediate first.
  • Compare against known-good baseline before changing multiple variables.
  • Document rationale for each decision to support incident replay.

Operational Evidence Standard

Treat every key action as evidence-producing: command, output, timestamp, and expected vs observed behavior.

  • Evidence should be reproducible by another engineer.
  • Use stable command templates for repeated environments.
  • Keep concise but complete validation artifacts for exam-style reasoning.
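
One way to make each action evidence-producing is a small record structure. The field names and the pass/investigate verdict below are illustrative assumptions, not a mandated schema.

```python
# Sketch of an evidence record for one operational action: command,
# output, timestamp, and an expected-vs-observed verdict. Field names
# are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Evidence:
    command: str
    output: str
    expected: str
    timestamp: str = ""

    def verdict(self):
        """'pass' when observed output contains the expected signature."""
        return "pass" if self.expected in self.output else "investigate"

rec = Evidence(
    command="nvidia-smi mig -lgi",
    output="GPU 0: GI 1 Profile 1g.10gb",
    expected="1g.10gb",
    timestamp=datetime.now(timezone.utc).isoformat(),
)
print(rec.verdict())  # -> pass
```

Because every record carries the command and raw output, another engineer can reproduce the check, which is the reproducibility bar set above.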

BlueField in the AI infrastructure stack

BlueField acts as a programmable data-path and infrastructure-control element that can influence networking, isolation, and observability outcomes.

  • Treat BlueField state as part of cluster baseline, not an isolated appliance.
  • Changes to BlueField stack should follow the same change control rigor as host firmware.
  • BlueField validation should include connectivity and lifecycle consistency checks.

MIG partitioning strategy for AI and HPC

MIG is a resource isolation mechanism, not just a utilization toggle. Profile choice must align with workload SLOs.

  • Inference workloads often benefit from predictable partition isolation.
  • Large training/HPC workloads may require full GPU or larger slices.
  • MIG policy should be encoded by workload class to avoid ad-hoc allocation drift.

State drift and governance

Most physical-layer incidents come from undocumented drift across node-level MIG or platform-level BlueField states.

  • Record before/after state for every maintenance operation.
  • Reconcile expected vs observed state at end of each change window.
  • Escalate discrepancies before higher-level benchmark validation.

Scenario Playbooks

Exam-style scenario explanations

Scenario A: Multi-tenant inference pool with MIG drift

A subset of nodes starts violating inference latency SLOs after maintenance. Investigation reveals inconsistent MIG profiles across nodes.

Architecture Diagram

[Scheduler] -> [Node Group A: MIG 1g/2g mix] -> [Inference Services]
               [Node Group B: drifted profiles] -> [Latency spikes]

Response Flow

  1. Inventory MIG state across all target nodes.
  2. Compare profiles with approved workload class policy.
  3. Normalize profiles and rerun validation workloads.
  4. Lock policy enforcement in deployment workflow.

Success Signals

  • All nodes report expected MIG inventory per policy.
  • Latency stabilizes across previously drifting nodes.
  • Scheduler mapping aligns to intended partition classes.

MIG inventory check

nvidia-smi mig -lgi

Expected output (example)

GPU 0: GI 1 Profile 1g.10gb
GPU 0: GI 2 Profile 2g.20gb
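
Inventory lines in this example format can be parsed into a per-GPU profile list so drifted nodes can be flagged against policy. The regular expression below matches the example lines shown here, not nvidia-smi's full table output.

```python
# Sketch: parse example-style MIG inventory lines into a mapping of
# 'GPU N' -> list of profile names, for comparison against policy.
import re

def parse_inventory(text):
    """Extract GPU instance profiles from example-style output lines."""
    inv = {}
    for line in text.splitlines():
        m = re.match(r"(GPU \d+): GI \d+ Profile (\S+)", line.strip())
        if m:
            inv.setdefault(m.group(1), []).append(m.group(2))
    return inv

sample = "GPU 0: GI 1 Profile 1g.10gb\nGPU 0: GI 2 Profile 2g.20gb"
print(parse_inventory(sample))  # -> {'GPU 0': ['1g.10gb', '2g.20gb']}
```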

Scenario B: BlueField maintenance causes partial network inconsistency

After BlueField update, a node subset exhibits unstable communication despite healthy host-level checks.

Architecture Diagram

[Control Plane]
        |
[BlueField Set A: baseline] --- [Switch Fabric] --- [BlueField Set B: updated]
              \             intermittent path             /
               \------------ [GPU Hosts] ----------------/

Response Flow

  1. Verify BlueField firmware/software alignment on all nodes.
  2. Run targeted connectivity checks for affected paths.
  3. Rollback inconsistent nodes or complete update to consistent baseline.
  4. Re-run cluster communication validation.

Success Signals

  • BlueField version/state consistent across node groups.
  • Connectivity tests pass without intermittent link drops.
  • Cluster communication tests return to baseline behavior.

Network interface state sample

ethtool <iface> | rg -i 'Speed|Link detected'

Expected output (example)

Speed: 400000Mb/s
Link detected: yes

CLI and Commands

High-yield command runbooks

CLI Execution Pattern

  1. Capture baseline state before running any intrusive command.
  2. Execute command with explicit scope (node, interface, GPU set).
  3. Compare output against expected baseline signature.
  4. Record timestamp and decision (pass, investigate, remediate).
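
The four steps above can be sketched as a reusable helper. The runner is injected so the same flow works in tests and on real nodes; the helper name and record fields are illustrative assumptions.

```python
# Sketch of the four-step CLI pattern: capture/execute a scoped command,
# compare against an expected signature, and record a timestamped
# decision. On a real node the runner would wrap subprocess.
from datetime import datetime, timezone

def checked_step(runner, command, expected_signature):
    """Execute one command and return an auditable record."""
    output = runner(command)                  # step 2: scoped execution
    ok = expected_signature in output         # step 3: baseline comparison
    return {                                  # step 4: recorded decision
        "command": command,
        "output": output,
        "decision": "pass" if ok else "investigate",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Stand-in runner for illustration only.
fake = lambda cmd: "Link detected: yes"
print(checked_step(fake, "ethtool eth0", "Link detected: yes")["decision"])  # -> pass
```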

MIG lifecycle verification runbook

Use this runbook when enabling, validating, or resetting MIG layout for workload classes.

Enable MIG mode

sudo nvidia-smi -i 0 -mig 1

Expected output (example)

Enabled MIG mode for GPU 00000000:17:00.0

List available MIG profiles

nvidia-smi mig -lgip

Expected output (example)

GPU instance profiles:
1g.10gb
2g.20gb
3g.40gb

List created GPU instances

nvidia-smi mig -lgi

Expected output (example)

GPU 0: GI 1 Profile 1g.10gb
GPU 0: GI 2 Profile 2g.20gb

  • Reboot or service restart behavior depends on platform policy; validate after change.
  • Always reconcile instance inventory with scheduler/resource manager.

Physical-link readiness runbook

Validate link-level health when physical-layer regressions are suspected.

Link status and speed check

ethtool <iface>

Expected output (example)

Speed: 400000Mb/s
Duplex: Full
Link detected: yes

Mellanox link diagnostics sample

mlxlink -d <device> --json

Expected output (example)

{
  "state": "Active",
  "BER": "Within threshold",
  "signal_quality": "Pass"
}

  • Run both baseline and suspect nodes for side-by-side comparison.
  • Escalate any signal-quality anomaly before workload-level debugging.
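
The side-by-side comparison recommended above can be sketched against the JSON-style sample output. The field names follow the example; real mlxlink JSON output has a richer structure.

```python
# Sketch of a side-by-side link comparison: parse baseline and suspect
# node diagnostics and flag any field where the suspect diverges.
import json

def compare_links(baseline_json, suspect_json):
    """Return fields where the suspect node diverges from baseline."""
    base, susp = json.loads(baseline_json), json.loads(suspect_json)
    return {k: (base.get(k), susp.get(k))
            for k in base.keys() | susp.keys()
            if base.get(k) != susp.get(k)}

baseline = '{"state": "Active", "BER": "Within threshold"}'
suspect = '{"state": "Active", "BER": "Degraded"}'
print(compare_links(baseline, suspect))  # -> {'BER': ('Within threshold', 'Degraded')}
```

Any non-empty result is the signal-quality anomaly that should be escalated before moving to workload-level debugging.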

Common Problems

Failure patterns and fixes

MIG profile drift across node pool

Symptoms

  • Same service class shows inconsistent latency/throughput by node.
  • Scheduler placements appear random or capacity mismatched.

Likely Cause

Undocumented MIG changes or failed policy application after maintenance.

Remediation

  • Inventory all MIG states and compare against policy.
  • Normalize profiles and resync scheduler resources.
  • Run canary workload to confirm restored consistency.

Prevention: Automate MIG policy enforcement and post-change audit checks.

BlueField subset inconsistency after lifecycle update

Symptoms

  • Intermittent communication instability on specific node group.
  • Host-level health appears normal.

Likely Cause

BlueField software/firmware mismatch across nodes or incomplete rollout.

Remediation

  • Collect version matrix for all BlueField units.
  • Rollback or complete update for consistent baseline.
  • Re-run communication validation after alignment.

Prevention: Use canary-first lifecycle workflow and block partial production rollout.

Workload class policy not aligned with MIG design

Symptoms

  • Large jobs fail admission while small jobs overconsume available partitions.
  • Resource fragmentation increases over time.

Likely Cause

MIG profiles chosen without workload class modeling.

Remediation

  • Define workload classes and required profile sizes.
  • Rebuild partition policy around demand distribution.
  • Validate utilization and SLO impact after policy change.

Prevention: Review profile policy during capacity planning cycles.

Lab Walkthroughs

Step-by-step execution guides

Walkthrough: MIG policy implementation for mixed AI workloads

Implement and validate a workload-class MIG policy with reproducible inventory state.

Prerequisites

  • GPU model supports MIG.
  • Approved workload classes and profile mapping available.
  • Scheduler/resource manager integration plan prepared.

  1. Enable MIG mode for target GPUs.

    sudo nvidia-smi -i 0 -mig 1

    Expected: MIG mode enabled successfully.

  2. Create planned instance profiles for each class.

    nvidia-smi mig -cgi <profile_id> -C

    Expected: GPU instances created matching plan.

  3. Validate created inventory against design.

    nvidia-smi mig -lgi

    Expected: Inventory output equals policy baseline.

  4. Run one validation workload per class.

    python run_validation.py --class <class_name>

    Expected: Each class runs within expected performance bounds.

  5. Record baseline and integrate with scheduler metadata.

    Expected: Resource definitions and node states are synchronized.

Success Criteria

  • Policy inventory stable across all target nodes.
  • Scheduler sees expected partition resources.
  • Class-specific workloads meet latency/throughput expectations.

Study Sprint

10-day execution plan

Day 1 · Focus: BlueField architecture and role mapping review · Output: Annotated architecture notes
Day 2 · Focus: BlueField management surface and lifecycle workflow · Output: BlueField operations checklist
Day 3 · Focus: MIG concepts and profile taxonomy review · Output: MIG profile quick-reference card
Day 4 · Focus: Hands-on MIG create/list/validate/reset drill · Output: Validated MIG operations log
Day 5 · Focus: AI workload to MIG mapping exercise · Output: AI mapping decision matrix
Day 6 · Focus: HPC workload to MIG/full-GPU policy design · Output: HPC policy recommendation sheet
Day 7 · Focus: BlueField and host consistency checks · Output: Cross-component consistency report
Day 8 · Focus: Incident scenario with drifted MIG/BlueField state · Output: Drift response runbook
Day 9 · Focus: Timed objective rehearsal with command-level decisions · Output: Exam scenario answer sheet
Day 10 · Focus: Final revision and weak-area patch · Output: Physical-layer rapid revision notes

Hands-on Labs

Practical module work

Each lab includes an execution sample with representative CLI usage and expected output.

Lab A: BlueField baseline validation

Confirm BlueField platform readiness in a controlled environment.

  • Validate current software and firmware baseline.
  • Run health and connectivity checks after baseline capture.
  • Document a maintenance-safe rollback point.

Execution Sample

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Physical-link readiness runbook)

ethtool <iface>

Expected output (example)

Speed: 400000Mb/s
Link detected: yes

Lab B: MIG profile operations

Practice deterministic MIG lifecycle management.

  • Enable MIG mode and create selected instance profiles.
  • Validate visible inventory and mapping behavior.
  • Reset to baseline state and confirm cleanup.

Execution Sample

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (MIG lifecycle verification runbook)

nvidia-smi mig -lgip

Expected output (example)

GPU instance profiles:
1g.10gb
2g.20gb
3g.40gb

Lab C: Policy-by-workload mapping

Translate workload requirements into physical-layer configuration policy.

  • Classify workloads by latency, throughput, and isolation needs.
  • Select MIG or full-GPU policy per class.
  • Capture expected operational tradeoffs.

Execution Sample

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (MIG lifecycle verification runbook)

nvidia-smi mig -lgi

Expected output (example)

GPU 0: GI 1 Profile 1g.10gb
GPU 0: GI 2 Profile 2g.20gb

Lab D: Drift detection and recovery

Detect and remediate state drift after simulated maintenance.

  • Compare expected vs observed BlueField/MIG states.
  • Apply controlled remediation steps.
  • Record evidence and update runbook controls.

Execution Sample

  1. Capture baseline state for the target node/group before changes.
  2. Run scoped validation command for this lab objective.
  3. Compare observed output against expected signature.

Sample Command (Physical-link readiness runbook)

ethtool <iface>

Expected output (example)

Speed: 400000Mb/s
Duplex: Full
Link detected: yes

Exam Pitfalls

Common failure patterns

  • Treating BlueField as static infrastructure and skipping lifecycle controls.
  • Applying MIG profiles without workload-fit validation.
  • Assuming MIG settings persist or match across all nodes without audit.
  • Ignoring AI vs HPC policy differences when partitioning resources.
  • Making emergency physical-layer changes without documentation.
  • Skipping post-maintenance consistency checks.

Practice Set

Domain checkpoint questions

Attempt each question first, then open the answer and explanation.

Q1. What is the key operational value of MIG?
  • A. It increases PSU wattage
  • B. It provides GPU resource isolation via partitioning
  • C. It replaces firmware tooling
  • D. It disables monitoring

Answer: B

MIG partitions a GPU into isolated instances, supporting controlled multi-tenant usage.

Q2. Why should BlueField lifecycle changes follow change-control processes?
  • A. They never affect production behavior
  • B. They can influence networking and platform stability
  • C. They are UI-only updates
  • D. They remove need for validation

Answer: B

BlueField updates can affect core platform behavior and must be validated like other critical changes.

Q3. Which statement best describes MIG profile selection?
  • A. One profile fits all workloads
  • B. Profile choice should map to workload shape and SLO targets
  • C. MIG only applies to CPUs
  • D. MIG is unrelated to scheduling

Answer: B

MIG profile strategy should match workload needs for isolation, latency, and utilization.

Q4. What is a high-risk pattern in physical-layer management?
  • A. Versioned runbooks
  • B. Undocumented ad-hoc profile changes
  • C. Post-change validation
  • D. Baseline inventory capture

Answer: B

Undocumented ad-hoc changes create drift and complicate troubleshooting.

Q5. For mixed AI and HPC environments, what is a practical MIG policy approach?
  • A. Always enforce a single profile for all jobs
  • B. Define workload classes and map policy per class
  • C. Disable all partitioning options
  • D. Ignore scheduler mapping

Answer: B

Workload-class policies allow controlled partitioning without sacrificing operational consistency.

Q6. What should be validated after a MIG reset cycle?
  • A. Only user permissions
  • B. Inventory state and expected resource visibility
  • C. Marketing dashboards
  • D. BIOS logo

Answer: B

Post-reset validation confirms that resource state matches intended baseline.

Q7. Which outcome indicates BlueField drift after maintenance?
  • A. Expected version/state matches baseline
  • B. Observed software/firmware state diverges from approved baseline
  • C. Logs are archived
  • D. Runbook version increments

Answer: B

Drift is measured as divergence from approved, expected operational baseline.

Q8. Why include physical-layer checks in incident workflows?
  • A. To avoid service restoration
  • B. To rapidly rule in/out low-level causes of performance or stability issues
  • C. To replace application logs
  • D. To skip diagnosis

Answer: B

Physical-layer issues can manifest as higher-level symptoms; explicit checks speed triage.

Q9. What is the best first artifact for physical-layer governance?
  • A. Unstructured chat notes
  • B. Versioned baseline and validation checklist
  • C. Ad-hoc command snippets only
  • D. No records

Answer: B

Baseline + checklist creates consistency and traceability across operations.

Q10. In exam context, why is this low-weight domain still important?
  • A. It never impacts cluster outcomes
  • B. BlueField/MIG mistakes can cascade into larger test and reliability failures
  • C. It is unrelated to AI infrastructure
  • D. It only applies to demos

Answer: B

Even with lower exam weight, physical-layer misconfiguration can undermine multiple higher-weight domains.

Primary References

Curated from the NCP-AII blueprint/study-guide sources and official documentation.

Objectives

  1. 2.1 Configure and manage a BlueField network platform.
  2. 2.2 Configure MIG (AI and HPC).
