Training
AII-SME-WD-03 NVIDIA AI Subject Matter Expert (x3) - Weekday (onsite)
Onsite weekday cohort spanning AI infrastructure, operations, and networking for NVIDIA GPU platforms.
Advanced · Weekday · Onsite · 120 hrs (40+40+0+40) · May 2026
Overview
A practitioner-led, advanced track combining AI infrastructure, operations, and networking, focused on operating GPU platforms under sustained load: shared clusters, real failure modes, and production constraints.
The program emphasizes production failure domains, operational decision-making, and real incident workflows across compute, network, and storage, not toy examples.
Covers the NVIDIA exam blueprint scopes for AI Infrastructure Professional, AI Operations Professional, and AI Networking Professional.
Details
- Delivery: Weekday
- Location: Onsite
- Price: TBD
- Class size: TBD
Requirements
- Basic understanding of Linux
- Networking fundamentals
- Clusters and shared file systems
- Kubernetes (CKA/CKS/CKAD-level familiarity)
- Prior exposure to on-call or production support is strongly recommended
Syllabus
Table of contents
Core Foundations
- AI Infrastructure Fundamentals (Compute, GPU, Memory, NUMA)
- GPU Architecture and Acceleration Concepts
- Containerization for AI (Docker, OCI, GPU Containers)
- Resource Management and Scheduling Concepts
- Multi-Tenancy, Isolation, and Quotas
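To give a flavor of the scheduling, multi-tenancy, and quota concepts above, here is a deliberately simplified first-fit placement sketch. All names and numbers are hypothetical; real schedulers (Kubernetes, Slurm) are far more involved.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    total_gpus: int
    used_gpus: int = 0

    def free(self) -> int:
        return self.total_gpus - self.used_gpus

def first_fit(nodes, tenant, gpus, quotas, usage):
    """Place a GPU request on the first node with capacity,
    honoring a simple per-tenant GPU quota."""
    if usage.get(tenant, 0) + gpus > quotas.get(tenant, 0):
        return None  # quota exceeded: reject or queue the job
    for node in nodes:
        if node.free() >= gpus:
            node.used_gpus += gpus
            usage[tenant] = usage.get(tenant, 0) + gpus
            return node.name
    return None  # no node fits: job waits

# Hypothetical 2-node cluster with an 8-GPU quota for one team
nodes = [Node("gpu-node-a", 8), Node("gpu-node-b", 8)]
quotas = {"team-ml": 8}
usage = {}
print(first_fit(nodes, "team-ml", 4, quotas, usage))  # gpu-node-a
print(first_fit(nodes, "team-ml", 6, quotas, usage))  # None (quota exceeded)
```

The in-class treatment covers what this sketch omits: gang scheduling, preemption, topology-aware placement, and fragmentation.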
Orchestration and Platform Layer
- Kubernetes (AI and HPC Orchestration)
- Slurm Workload Manager
- AI Workload Orchestration Platforms (Run:AI / GPU Scheduler)
- Cluster Provisioning and Lifecycle Management
- AI Platform Operations and Day-2 Tasks
Storage, Monitoring and Automation
- High-Performance Storage (Lustre Filesystem)
- Monitoring, Telemetry, and Observability
- Performance Benchmarking and Validation
- Automation and Configuration Management
- Security and Access Control (RBAC, IAM, Secrets)
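As a taste of the observability checks covered under monitoring and day-2 operations, here is a minimal sketch that flags allocated-but-idle GPUs from utilization samples. The sample data and threshold are made up; a real stack would pull this from telemetry exporters and a metrics store.

```python
from statistics import mean

# Hypothetical per-GPU utilization samples (percent), as a
# monitoring pipeline might expose them
samples = {
    "gpu-node-a:0": [96, 94, 98, 97],  # busy training job
    "gpu-node-a:1": [3, 5, 2, 4],      # allocated but idle
}

def flag_idle(samples, threshold=10.0):
    """Flag GPUs whose mean utilization is below the threshold:
    a common day-2 check for stranded capacity."""
    return [gpu for gpu, util in samples.items() if mean(util) < threshold]

print(flag_idle(samples))  # ['gpu-node-a:1']
```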
Advanced Networking and Fabric
- AI Networking Fundamentals (Ethernet, InfiniBand, RDMA)
- High-Performance Fabric Design and Optimization
- Data Path Acceleration (DPU / SmartNIC Concepts)
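One example of the fabric-design arithmetic this section touches on: a back-of-envelope cost model for a bandwidth-optimal ring all-reduce. Latency and protocol overheads are ignored, and the numbers are illustrative.

```python
def ring_allreduce_seconds(bytes_per_gpu: float, num_gpus: int,
                           link_gbps: float) -> float:
    """A bandwidth-optimal ring all-reduce moves 2*(N-1)/N of the
    buffer over each link; divide by per-GPU link bandwidth."""
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * bytes_per_gpu
    link_bytes_per_sec = link_gbps * 1e9 / 8  # Gbit/s -> bytes/s
    return traffic_bytes / link_bytes_per_sec

# 1 GB of gradients across 8 GPUs over 400 Gb/s links
t = ring_allreduce_seconds(1e9, 8, 400)
print(f"{t * 1000:.1f} ms")  # 35.0 ms
```

Estimates like this are the starting point for the fabric sizing and oversubscription discussions in class.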
Specialized Hardware and Expensive Labs
- NVSwitch and GPU-to-GPU Interconnects
- Advanced Fabric Configuration and Troubleshooting
- DPU Configuration, Offload, and Troubleshooting
Final Operational Mastery
- Fault Tolerance, High Availability, and Recovery
- Capacity Planning and Infrastructure Optimization
- End-to-End Troubleshooting (Compute, Network, Storage)
- Reference Architectures and Design Patterns
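A small sketch of the capacity-planning arithmetic named above: estimating GPU-hours for a training run and wall-clock time on a given cluster. All figures are placeholders, not benchmarks, and the derating factor is a simplifying assumption.

```python
def gpu_hours_needed(tokens: float, tokens_per_gpu_per_sec: float) -> float:
    """Rough GPU-hours estimate for a training run."""
    return tokens / tokens_per_gpu_per_sec / 3600

def days_on_cluster(gpu_hours: float, num_gpus: int,
                    utilization: float = 0.8) -> float:
    """Wall-clock days, derated for scheduling gaps,
    failures, and restarts."""
    return gpu_hours / (num_gpus * utilization) / 24

# Placeholder workload: 1T tokens at 3,000 tokens/GPU/s on 256 GPUs
hours = gpu_hours_needed(tokens=1e12, tokens_per_gpu_per_sec=3000)
print(round(hours))                           # 92593 GPU-hours
print(round(days_on_cluster(hours, 256), 1))  # 18.8 days
```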
Labs Covered
Labs are progressively introduced as the cohort advances, aligned to production scenarios and failure modes discussed in class. Detailed lab notes are provided during the course.
Scope
- Production operations, troubleshooting, and incident response
- Kubernetes platform realities for GPU workloads
- Networking and storage considerations for AI infrastructure
- Runbook thinking, failure domains, and operational vocabulary
Included
- Study notes will be provided as the class progresses
- Lab notes will be provided as the class progresses
- Flash cards to prepare for the exams will be provided
- Use cases for interview preparation will be provided
- Additional benefits will be added as the class progresses