Training
AII-SME-WD-03 NVIDIA AI Subject Matter Expert (x3) - Weekday (onsite)
Onsite weekday cohort spanning AI infrastructure, operations, and networking for NVIDIA GPU platforms.
Advanced · Weekday · Onsite · 120 hrs (40+40+0+40) · May 2026
Overview
A practitioner-led, advanced track combining AI infrastructure, operations, and networking, focused on operating GPU platforms under sustained load: shared clusters, real failure modes, and production constraints.
The program emphasizes production failure domains, operational decision-making, and real incident workflows across compute, network, and storage, not toy examples.
Covers the NVIDIA exam blueprint scopes for AI Infrastructure Professional, AI Operations Professional, and AI Networking Professional.
Details
- Delivery: Weekday
- Location: Onsite
- Price: TBD
- Class size: TBD
Requirements
- Basic understanding of Linux
- Networking fundamentals
- Clusters and shared file systems
- Kubernetes (CKA/CKS/CKAD-level familiarity)
- Prior exposure to on-call or production support is strongly recommended
Syllabus
Table of contents
Core Foundations
- AI Infrastructure Fundamentals (Compute, GPU, Memory, NUMA)
- GPU Architecture and Acceleration Concepts
- Containerization for AI (Docker, OCI, GPU Containers)
- Resource Management and Scheduling Concepts
- Multi-Tenancy, Isolation, and Quotas
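To give a flavor of the scheduling, multi-tenancy, and quota concepts above, here is a deliberately simplified first-fit placement sketch. All names and numbers are hypothetical; real schedulers (Kubernetes, Slurm) are far more involved.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    total_gpus: int
    used_gpus: int = 0

    def free(self) -> int:
        return self.total_gpus - self.used_gpus

def first_fit(nodes, tenant, gpus, quotas, usage):
    """Place a GPU request on the first node with capacity,
    honoring a simple per-tenant GPU quota."""
    if usage.get(tenant, 0) + gpus > quotas.get(tenant, 0):
        return None  # quota exceeded: reject or queue the job
    for node in nodes:
        if node.free() >= gpus:
            node.used_gpus += gpus
            usage[tenant] = usage.get(tenant, 0) + gpus
            return node.name
    return None  # no node fits: job waits

# Hypothetical 2-node cluster with an 8-GPU quota for one team
nodes = [Node("gpu-node-a", 8), Node("gpu-node-b", 8)]
quotas = {"team-ml": 8}
usage = {}
print(first_fit(nodes, "team-ml", 4, quotas, usage))  # gpu-node-a
print(first_fit(nodes, "team-ml", 6, quotas, usage))  # None (quota exceeded)
```

The in-class treatment covers what this sketch omits: gang scheduling, preemption, topology-aware placement, and fragmentation.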
Orchestration and Platform Layer
- Kubernetes (AI and HPC Orchestration)
- Slurm Workload Manager
- AI Workload Orchestration Platforms (Run:AI / GPU Scheduler)
- Cluster Provisioning and Lifecycle Management
- AI Platform Operations and Day-2 Tasks
Storage, Monitoring and Automation
- High-Performance Storage (Lustre Filesystem)
- Monitoring, Telemetry, and Observability
- Performance Benchmarking and Validation
- Automation and Configuration Management
- Security and Access Control (RBAC, IAM, Secrets)
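As a taste of the observability checks covered under monitoring and day-2 operations, here is a minimal sketch that flags allocated-but-idle GPUs from utilization samples. The sample data and threshold are made up; a real stack would pull this from telemetry exporters and a metrics store.

```python
from statistics import mean

# Hypothetical per-GPU utilization samples (percent), as a
# monitoring pipeline might expose them
samples = {
    "gpu-node-a:0": [96, 94, 98, 97],  # busy training job
    "gpu-node-a:1": [3, 5, 2, 4],      # allocated but idle
}

def flag_idle(samples, threshold=10.0):
    """Flag GPUs whose mean utilization is below the threshold:
    a common day-2 check for stranded capacity."""
    return [gpu for gpu, util in samples.items() if mean(util) < threshold]

print(flag_idle(samples))  # ['gpu-node-a:1']
```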
Advanced Networking and Fabric
- AI Networking Fundamentals (Ethernet, InfiniBand, RDMA)
- High-Performance Fabric Design and Optimization
- Data Path Acceleration (DPU / SmartNIC Concepts)
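One example of the fabric-design arithmetic this section touches on: a back-of-envelope cost model for a bandwidth-optimal ring all-reduce. Latency and protocol overheads are ignored, and the numbers are illustrative.

```python
def ring_allreduce_seconds(bytes_per_gpu: float, num_gpus: int,
                           link_gbps: float) -> float:
    """A bandwidth-optimal ring all-reduce moves 2*(N-1)/N of the
    buffer over each link; divide by per-GPU link bandwidth."""
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * bytes_per_gpu
    link_bytes_per_sec = link_gbps * 1e9 / 8  # Gbit/s -> bytes/s
    return traffic_bytes / link_bytes_per_sec

# 1 GB of gradients across 8 GPUs over 400 Gb/s links
t = ring_allreduce_seconds(1e9, 8, 400)
print(f"{t * 1000:.1f} ms")  # 35.0 ms
```

Estimates like this are the starting point for the fabric sizing and oversubscription discussions in class.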
Specialized Hardware and Expensive Labs
- NVSwitch and GPU-to-GPU Interconnects
- Advanced Fabric Configuration and Troubleshooting
- DPU Configuration, Offload, and Troubleshooting
Final Operational Mastery
- Fault Tolerance, High Availability, and Recovery
- Capacity Planning and Infrastructure Optimization
- End-to-End Troubleshooting (Compute, Network, Storage)
- Reference Architectures and Design Patterns
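A small sketch of the capacity-planning arithmetic named above: estimating GPU-hours for a training run and wall-clock time on a given cluster. All figures are placeholders, not benchmarks, and the derating factor is a simplifying assumption.

```python
def gpu_hours_needed(tokens: float, tokens_per_gpu_per_sec: float) -> float:
    """Rough GPU-hours estimate for a training run."""
    return tokens / tokens_per_gpu_per_sec / 3600

def days_on_cluster(gpu_hours: float, num_gpus: int,
                    utilization: float = 0.8) -> float:
    """Wall-clock days, derated for scheduling gaps,
    failures, and restarts."""
    return gpu_hours / (num_gpus * utilization) / 24

# Placeholder workload: 1T tokens at 3,000 tokens/GPU/s on 256 GPUs
hours = gpu_hours_needed(tokens=1e12, tokens_per_gpu_per_sec=3000)
print(round(hours))                           # 92593 GPU-hours
print(round(days_on_cluster(hours, 256), 1))  # 18.8 days
```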
Labs Covered
Labs are progressively introduced as the cohort advances, aligned to production scenarios and failure modes discussed in class. Detailed lab notes are provided during the course.
Scope
- Production operations, troubleshooting, and incident response
- Kubernetes platform realities for GPU workloads
- Networking and storage considerations for AI infrastructure
- Runbook thinking, failure domains, and operational vocabulary
Included
- Study notes will be provided as the class progresses
- Lab notes will be provided as the class progresses
- Flash cards to prepare for the exams will be provided
- Use cases for interview preparation will be provided
- Additional benefits will be added as the class progresses