

AII-SME-WD-03 Nvidia AI Subject Matter Expert (x3) - Weekday (onsite)

Onsite weekday cohort spanning AI infrastructure, operations, and networking for NVIDIA GPU platforms.

Advanced · Weekday · Onsite · 120 hrs (40+40+0+40) · May 2026

Overview

A practitioner-led, advanced track combining AI infrastructure, operations, and networking - focused on operating GPU platforms under sustained load, with shared clusters, real failure modes, and production constraints.

This program emphasizes production failure domains, operational decision making, and real incident workflows across compute, network, and storage - not toy examples.

Covers the NVIDIA exam blueprint scopes for AI Infrastructure Professional, AI Operations Professional, and AI Networking Professional.

Details

  • Delivery: Weekday
  • Location: Onsite
  • Price: TBD
  • Class size: TBD

Requirements

  • Basic understanding of Linux
  • Networking fundamentals
  • Familiarity with clusters and shared file systems
  • Kubernetes (CKA/CKS/CKAD-level familiarity)
  • Prior exposure to on-call or production support (strongly recommended)

Syllabus

Table of contents

Core Foundations

  1. AI Infrastructure Fundamentals (Compute, GPU, Memory, NUMA)
  2. GPU Architecture and Acceleration Concepts
  3. Containerization for AI (Docker, OCI, GPU Containers)
  4. Resource Management and Scheduling Concepts
  5. Multi-Tenancy, Isolation, and Quotas
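
The quota ideas in topic 5 can be sketched in a few lines. This is an illustrative model only, with hypothetical names; real clusters enforce quotas through the scheduler (for example, Kubernetes ResourceQuota objects or Slurm accounts), not application code.

```python
# Illustrative per-tenant GPU quota accounting. All names are hypothetical;
# production systems enforce this at the scheduler layer.

class GpuQuota:
    def __init__(self, limit):
        self.limit = limit      # maximum GPUs the tenant may hold
        self.in_use = 0         # GPUs currently allocated to the tenant

    def allocate(self, n):
        """Admit the request only if it fits within the remaining quota."""
        if self.in_use + n > self.limit:
            return False        # reject: would exceed the tenant's limit
        self.in_use += n
        return True

    def release(self, n):
        self.in_use = max(0, self.in_use - n)

tenant = GpuQuota(limit=8)
print(tenant.allocate(6))    # True  (6 of 8 in use)
print(tenant.allocate(4))    # False (6 + 4 > 8, rejected)
tenant.release(2)
print(tenant.allocate(4))    # True  (4 + 4 <= 8)
```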

Orchestration and Platform Layer

  1. Kubernetes (AI and HPC Orchestration)
  2. Slurm Workload Manager
  3. AI Workload Orchestration Platforms (Run:AI / GPU Scheduler)
  4. Cluster Provisioning and Lifecycle Management
  5. AI Platform Operations and Day-2 Tasks
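
As a taste of the Kubernetes material, the snippet below shows the standard shape of a GPU-requesting pod. It assumes the NVIDIA device plugin is installed and exposes nvidia.com/gpu as a schedulable resource; the pod name and image are illustrative.

```yaml
# Minimal GPU smoke-test pod. Assumes the NVIDIA device plugin advertises
# nvidia.com/gpu on the node; names and image tag are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request exactly one GPU
```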

Storage, Monitoring and Automation

  1. High-Performance Storage (Lustre Filesystem)
  2. Monitoring, Telemetry, and Observability
  3. Performance Benchmarking and Validation
  4. Automation and Configuration Management
  5. Security and Access Control (RBAC, IAM, Secrets)
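
The monitoring topic above can be illustrated with a simple threshold check over a telemetry sample. Metric names and thresholds here are hypothetical; production stacks typically export such metrics via DCGM and evaluate rules in Prometheus/Alertmanager.

```python
# Illustrative telemetry threshold check. Metric names and limits are
# hypothetical; real deployments evaluate alert rules in the monitoring stack.

THRESHOLDS = {
    "gpu_temp_c": (None, 85),   # (min, max): alert at or above 85 C
    "ecc_errors": (None, 1),    # alert on any uncorrectable ECC error
}

def check(sample):
    """Return the list of metrics that violate their thresholds."""
    alerts = []
    for metric, (lo, hi) in THRESHOLDS.items():
        value = sample.get(metric)
        if value is None:
            continue            # metric missing from this sample
        if (lo is not None and value < lo) or (hi is not None and value >= hi):
            alerts.append(metric)
    return alerts

sample = {"gpu_temp_c": 88, "ecc_errors": 0}
print(check(sample))   # -> ['gpu_temp_c']
```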

Advanced Networking and Fabric

  1. AI Networking Fundamentals (Ethernet, InfiniBand, RDMA)
  2. High-Performance Fabric Design and Optimization
  3. Data Path Acceleration (DPU / SmartNIC Concepts)
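
Fabric design discussions frequently come down to oversubscription ratios, so a back-of-envelope calculation is worth sketching. The port counts and link speeds below are hypothetical examples, not a recommended design.

```python
# Back-of-envelope leaf-switch oversubscription ratio, a common
# fabric-design calculation. Port counts and speeds are hypothetical.

def oversubscription(downlinks, down_gbps, uplinks, up_gbps):
    """Ratio of host-facing bandwidth to fabric-facing bandwidth."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)

# 32 x 400G ports to hosts, 16 x 400G uplinks to spines -> 2:1 oversubscribed
print(oversubscription(32, 400, 16, 400))   # -> 2.0
```

A ratio of 1.0 is a non-blocking leaf; AI training fabrics often aim for 1:1 on the compute fabric because collective operations are sensitive to congestion.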

Specialized Hardware and Expensive Labs

  1. NVSwitch and GPU-to-GPU Interconnects
  2. Advanced Fabric Configuration and Troubleshooting
  3. DPU Configuration, Offload, and Troubleshooting

Final Operational Mastery

  1. Fault Tolerance, High Availability, and Recovery
  2. Capacity Planning and Infrastructure Optimization
  3. End-to-End Troubleshooting (Compute, Network, Storage)
  4. Reference Architectures and Design Patterns

Labs Covered

Labs are progressively introduced as the cohort advances, aligned to production scenarios and failure modes discussed in class. Detailed lab notes are provided during the course.

Scope

  • Production operations, troubleshooting, and incident response
  • Kubernetes platform realities for GPU workloads
  • Networking and storage considerations for AI infrastructure
  • Runbook thinking, failure domains, and operational vocabulary

Included

  • Study notes, provided as the class progresses
  • Lab notes, provided as the class progresses
  • Flash cards for exam preparation
  • Use cases for interview preparation
  • Additional materials are planned as the class progresses