Slurm Study Guide

This track follows the full 25-chapter Slurm concept map in strict sequence. It presents chapter lessons, practice questions, and review navigation as one indexed study flow.

Structured chapter-by-chapter Slurm study path

Login required · 25/25 chapters published

Study Flow

  1. Read chapters in order from 1 through 25 to preserve concept dependencies.
  2. Use the diagrams and mechanism sections to build internal scheduling and control-plane intuition.
  3. Run command samples in a non-production environment and validate observed signals.
  4. Use flashcards and mock questions to find weak sections, then revisit those chapters.

Chapter Distribution

Part                              Chapters
Part I    Foundations                    5
Part II   Operations                     4
Part III  Scheduling                     5
Part IV   Accelerated Workloads          4
Part V    Scale and Reliability          5
Part VI   AI Integration                 2

Chapter Index

All 25 chapters

Chapter 1: HPC Workload Management Foundations

Part I Foundations

HPC workload management is defined here as the discipline of governing computational work in high-performance computing clusters. The definition is intentionally strict: the concept is not limited to command usage, but includes policy semantics, internal coordination logic, and measurable operational outcomes. A novice reader should treat each chapter's topic as a systems concept with explicit boundaries rather than a collection of isolated tools.

  • Cluster: A coordinated set of networked compute resources managed as one scheduling domain.
  • Workload: A submitted unit of computational work, typically represented as a job or set of jobs.
  • Scheduler: The decision function that maps queued jobs to available resources over time.
  • Arbitration: The policy process that resolves contention among competing users.
Open chapter page

Chapter 2: Slurm Architecture

Part I Foundations

This chapter covers the distributed control-plane architecture of Slurm and its daemon coordination model.

  • slurmctld: Primary controller daemon responsible for scheduling and state orchestration.
  • slurmd: Node-resident daemon that launches and monitors tasks on compute nodes.
  • slurmdbd: Accounting daemon that persists usage records to a database backend.
  • Control plane: Logical layer responsible for coordination and policy decisions.
Open chapter page

Chapter 3: Slurm Resource Model

Part I Foundations

This chapter covers the formal resource semantics and allocation logic of Slurm scheduling.

  • TRES: Trackable resources such as CPU, memory, GPU, and custom generic resources.
  • GRES: Generic resources, commonly used for GPUs or specialized devices.
  • Affinity: Binding of tasks to specific CPUs or NUMA regions.
  • NUMA: Non-Uniform Memory Access; memory latency depends on placement locality.
Open chapter page

Chapter 4: Nodes, Partitions, and Queues

Part I Foundations

This chapter covers node-state management and partition-level queue policy in multi-user clusters.

  • Node state: Operational status such as idle, allocated, drain, down, or unknown.
  • Partition: A policy-scoped subset of nodes presented as a queue target.
  • Drain: State indicating intentional exclusion from new scheduling.
  • Access control: Rules limiting which users or accounts can submit to a partition.
Open chapter page

Chapter 5: Slurm Configuration

Part I Foundations

This chapter covers configuration governance and correctness across Slurm's control and runtime layers.

  • slurm.conf: Primary scheduler and cluster topology configuration file.
  • gres.conf: Device-level generic resource inventory definitions.
  • cgroup.conf: Runtime enforcement policy for CPU, memory, and device isolation.
  • Configuration drift: Deviation between intended and actual deployed settings.
Open chapter page
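
The three configuration files above can be illustrated with minimal fragments. Every name, count, and path below is a hypothetical placeholder, not a recommended deployment:

```ini
# slurm.conf (fragment) -- cluster topology and scheduler selection
ClusterName=demo                 # hypothetical cluster name
SlurmctldHost=head01             # hypothetical controller host
SchedulerType=sched/backfill
SelectType=select/cons_tres      # track CPU, memory, and GPU as consumable resources
NodeName=node[01-04] CPUs=32 RealMemory=128000 Gres=gpu:4 State=UNKNOWN
PartitionName=batch Nodes=node[01-04] Default=YES MaxTime=24:00:00 State=UP

# gres.conf (fragment) -- device-level GPU inventory on each node
Name=gpu Type=a100 File=/dev/nvidia[0-3]

# cgroup.conf (fragment) -- runtime enforcement policy
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
```

Configuration drift, the last term above, is exactly a mismatch between fragments like these and what is actually deployed on the nodes.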

Chapter 6: Authentication and Security

Part II Operations

This chapter covers authentication trust boundaries and multi-tenant security in Slurm clusters.

  • Munge: Credential signing service used for authenticated communication in Slurm.
  • Authorization: Policy decision about whether an authenticated identity may perform an action.
  • Least privilege: Security principle granting only the minimum required permissions.
  • Isolation boundary: Administrative separation preventing one tenant from affecting another.
Open chapter page

Chapter 7: Slurm Commands and User Interface

Part II Operations

This chapter covers the operator command interface: the taxonomy of tools for observability, submission, and control.

  • sinfo: Command for cluster and partition state visibility.
  • squeue: Command for queued and running job visibility.
  • sbatch: Command for script-based batch submission.
  • scancel: Command for job termination by identifier or filter.
Open chapter page
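
The four commands above can be exercised as follows on a cluster where Slurm is installed; the script name and job ID are hypothetical:

```shell
# Cluster and partition state, one line per partition
sinfo

# Queue visibility: all jobs, then only the current user's jobs
squeue
squeue --me

# Script-based batch submission; prints the assigned job ID
sbatch train.sh                       # train.sh is a hypothetical batch script

# Termination by identifier, or by filter
scancel 12345                         # 12345 is a hypothetical job ID
scancel --user=$USER --state=PENDING  # cancel all of one user's pending jobs
```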

Chapter 8: Job Lifecycle

Part II Operations

This chapter covers the Slurm job lifecycle, interpreted as a formal state machine.

  • Pending: Job admitted to queue but not yet allocated resources.
  • Running: Job has active resource allocation and executing tasks.
  • Completion: Terminal state with successful execution status.
  • Timeout: Terminal state due to policy-enforced runtime limit breach.
Open chapter page
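
squeue reports these lifecycle states as short codes; PD, R, CD, and TO are the standard codes for the four states above. The helper below is an illustrative convenience, not part of Slurm itself:

```shell
# Map squeue short state codes to the lifecycle states described above.
state_name() {
  case "$1" in
    PD) echo "PENDING"   ;;  # admitted to queue, no allocation yet
    R)  echo "RUNNING"   ;;  # active allocation, tasks executing
    CD) echo "COMPLETED" ;;  # terminal: successful execution
    TO) echo "TIMEOUT"   ;;  # terminal: runtime limit breached
    *)  echo "UNKNOWN"   ;;
  esac
}

state_name PD   # prints PENDING
state_name TO   # prints TIMEOUT
```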

Chapter 9: Job Types

Part II Operations

This chapter covers the taxonomy of execution shapes: batch, interactive, array, and step-level models.

  • Batch job: Script-defined workload submitted for deferred execution.
  • Interactive job: Allocation enabling immediate user-driven command execution.
  • Job array: Parametrized set of homogeneous job instances.
  • Job step: Execution subdivision inside an allocated parent job.
Open chapter page
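
A job array can be sketched as a batch-script fragment; the job name, array range, and limits below are illustrative placeholders:

```shell
#!/bin/bash
# Hypothetical job array: 10 homogeneous instances, indices 0-9.
#SBATCH --job-name=demo-array
#SBATCH --array=0-9
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=1

# Each instance receives its own index via SLURM_ARRAY_TASK_ID.
echo "running instance ${SLURM_ARRAY_TASK_ID}"
```

An interactive allocation of similar shape would typically be requested with salloc, or with srun --pty for an immediate shell; srun inside an existing allocation launches job steps.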

Chapter 10: Scheduling Algorithms

Part III Scheduling

This chapter covers algorithmic scheduling behavior across FIFO, backfill, fairshare, and priority models.

  • FIFO: First-in-first-out ordering based on submission sequence.
  • Backfill: Scheduling strategy filling idle windows without delaying reservations.
  • Fairshare: Policy model adjusting priority using historical usage.
  • Multifactor priority: Composite ranking using weighted scheduling factors.
Open chapter page
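
The essence of backfill, filling an idle window only when the candidate job cannot delay the earliest reservation, can be sketched with plain shell arithmetic; all times below are invented minutes, not scheduler output:

```shell
# Illustrative backfill feasibility test (all numbers hypothetical).
# A lower-priority job may start now only if its time limit fits inside
# the idle window before the top-priority job's reserved start time.
can_backfill() {
  # $1 = current time, $2 = reserved start time, $3 = candidate time limit
  if [ $(( $1 + $3 )) -le "$2" ]; then echo yes; else echo no; fi
}

can_backfill 0 120 90    # 0 + 90 <= 120, prints yes
can_backfill 0 120 150   # 0 + 150 > 120, prints no
```

This is why accurate time limits matter under backfill: a job with an inflated limit fails the window test and keeps waiting even when it would actually fit.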

Chapter 11: Priority Factors

Part III Scheduling

This chapter covers the formal decomposition of Slurm priority factors and their weighting behavior.

  • Weight: Administrative coefficient controlling influence of a priority factor.
  • Decay: Temporal reduction of historical usage influence over time.
  • Normalization: Scaling factors to comparable ranges before weighted summation.
  • Fairshare score: Usage-based contribution to current job priority.
Open chapter page
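
The multifactor model reduces to a weighted sum of normalized factors. The sketch below, with invented weights and factor values rather than Slurm defaults, shows the shape of the computation:

```shell
# Sketch of multifactor priority: priority = sum(weight_i * factor_i),
# with each factor normalized to [0,1]. Weights and factors are invented.
priority=$(awk 'BEGIN {
  w_age = 1000;  f_age = 0.50   # time spent pending, normalized
  w_fs  = 2000;  f_fs  = 0.25   # fairshare score from historical usage
  w_qos = 500;   f_qos = 1.00   # QoS-assigned factor
  printf "%d", w_age*f_age + w_fs*f_fs + w_qos*f_qos
}')
echo "$priority"   # 1000*0.5 + 2000*0.25 + 500*1.0 = 1500
```

Because the factors are normalized before summation, the weights alone decide each factor's relative influence; decay then erodes f_fs's usage history over time.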

Chapter 12: Quality of Service (QoS)

Part III Scheduling

This chapter covers quality-of-service policy as a governance layer for runtime limits and resource entitlement.

  • QoS: Policy object encoding limits, priorities, and preemption behavior.
  • Preemption: Policy action allowing one workload to displace another based on rules.
  • MaxWall: Maximum wall-clock runtime allowed under a policy class.
  • Resource cap: Upper bound on allocatable resources for a user, account, or QoS class.
Open chapter page

Chapter 13: Job Dependencies

Part III Scheduling

This chapter covers dependency-driven workflow orchestration in Slurm job graphs.

  • after: Dependency requiring predecessor job start before dependent eligibility.
  • afterok: Dependency requiring predecessor successful completion.
  • afternotok: Dependency requiring predecessor failure state.
  • singleton: Constraint ensuring one active job of a given key context.
Open chapter page
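
The dependency types above compose into workflow chains at submission time. The script names below are hypothetical; --parsable makes sbatch print only the job ID so it can be captured:

```shell
# Hypothetical three-stage chain: preprocess -> train -> cleanup-on-failure.
pre_id=$(sbatch --parsable preprocess.sh)

# Eligible only after the preprocess job completes successfully.
train_id=$(sbatch --parsable --dependency=afterok:${pre_id} train.sh)

# Eligible only if the training job ends in a failure state.
sbatch --dependency=afternotok:${train_id} cleanup.sh

# At most one active job for this job-name/user key at a time.
sbatch --job-name=nightly --dependency=singleton nightly.sh
```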

Chapter 14: Resource Isolation with Cgroups

Part III Scheduling

This chapter covers runtime resource isolation through Linux control groups (cgroups) in Slurm.

  • cgroup: Kernel mechanism for hierarchical resource control and accounting.
  • task/cgroup plugin: Slurm plugin integrating job/task boundaries with cgroup enforcement.
  • ConstrainRAMSpace: Policy control limiting memory usage within cgroups.
  • Device whitelist: Allowed device access set for constrained workloads.
Open chapter page

Chapter 15: GPU Scheduling

Part IV Accelerated Workloads

This chapter covers accelerator-aware scheduling using GRES, topology, and binding semantics.

  • GRES: Generic resource abstraction used for GPUs and specialized devices.
  • MIG: Multi-Instance GPU partitioning for hardware-level sub-device isolation.
  • GPU binding: Mapping tasks to specific GPU identifiers for locality control.
  • Topology awareness: Placement logic incorporating physical interconnect structure.
Open chapter page
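
A GPU request combining GRES, type, and binding can be sketched as a batch-script fragment; the GPU type, counts, and binary name are illustrative placeholders:

```shell
#!/bin/bash
# Hypothetical GPU job: 2 GPUs of one type on a single node.
#SBATCH --nodes=1
#SBATCH --gres=gpu:a100:2       # GRES request; type and count are examples
#SBATCH --cpus-per-task=8

# Bind each task to its topologically closest GPU(s); Slurm exposes the
# assigned devices to the task via CUDA_VISIBLE_DEVICES.
srun --gpu-bind=closest ./gpu_app    # gpu_app is a placeholder binary
```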

Chapter 16: Heterogeneous Jobs

Part IV Accelerated Workloads

This chapter covers heterogeneous job composition for multi-stage workflows spanning dissimilar node classes.

  • Heterogeneous job: Single logical job containing components with different resource vectors.
  • Component group: One resource-homogeneous segment inside a heterogeneous submission.
  • Stage coupling: Dependency relation between computational stages in a pipeline.
  • Cross-stage locality: Data placement relationship affecting transition overhead between stages.
Open chapter page

Chapter 17: Slurm Accounting

Part IV Accelerated Workloads

This chapter covers the accounting model and governance analytics available through slurmdbd and its reporting interfaces.

  • slurmdbd: Daemon responsible for collecting and persisting accounting records.
  • sacct: Command-line interface for querying job-level accounting records.
  • sreport: Command for aggregated utilization and accounting reports.
  • Chargeback: Attribution of resource consumption cost to organizational entities.
Open chapter page
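
Typical queries against the accounting layer look like the following; the account name and date window are hypothetical:

```shell
# Job-level records for one account over a fixed window.
sacct --accounts=ml-team --starttime=2024-01-01 --endtime=2024-01-31 \
      --format=JobID,JobName,Partition,AllocTRES,Elapsed,State

# Aggregated utilization by account and user over the same window,
# a common starting point for chargeback attribution.
sreport cluster AccountUtilizationByUser start=2024-01-01 end=2024-01-31
```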

Chapter 18: Monitoring and Metrics

Part IV Accelerated Workloads

This chapter covers observability strategy and metric semantics for scheduler and cluster health.

  • Queue depth: Number of jobs waiting for allocation at a given observation point.
  • Wait time: Duration from submission until execution start.
  • Scheduling latency: Controller time to evaluate and dispatch eligible jobs.
  • Utilization: Fraction of available resources actively consumed by workloads.
Open chapter page
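
Utilization, as defined above, is a simple ratio that can be computed from observed counts. The counts below are invented; on a real cluster they could be read from sinfo output such as sinfo -o "%C", which reports allocated/idle/other/total CPUs:

```shell
# Utilization as a fraction: allocated CPUs / total CPUs.
alloc_cpus=384   # invented sample value
total_cpus=512   # invented sample value

utilization=$(awk -v a="$alloc_cpus" -v t="$total_cpus" \
  'BEGIN { printf "%.2f", a / t }')
echo "$utilization"   # 384/512 = 0.75
```

Interpreting this number needs the companion metrics above: high utilization with deep queues and long wait times signals saturation, not health.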

Chapter 19: High Availability

Part V Scale and Reliability

This chapter covers high-availability controller design and failover semantics in Slurm.

  • Primary controller: Active scheduler authority in normal operation.
  • Backup controller: Standby controller prepared for failover takeover.
  • Failover: Controlled transfer of scheduler authority to backup infrastructure.
  • State replication: Propagation of controller state required for coherent takeover.
Open chapter page

Chapter 20: Federation and Multi-Cluster

Part V Scale and Reliability

This chapter covers federated scheduling and governance across multiple Slurm clusters.

  • Federation: Logical linkage of multiple Slurm clusters for coordinated scheduling semantics.
  • Global scheduling: Policy-aware placement decisions spanning cluster boundaries.
  • Cross-cluster sharing: Capability to place workloads where capacity exists across federated members.
  • Distributed accounting: Usage visibility and governance across multiple cluster domains.
Open chapter page

Chapter 21: Power Management

Part V Scale and Reliability

This chapter covers power-state orchestration and elastic capacity control in Slurm-managed environments.

  • Suspend: Transition node to low-power state when not needed for active scheduling.
  • Resume: Transition node back to schedulable state in response to demand.
  • Elasticity: Ability to vary active capacity based on workload intensity.
  • Provisioning latency: Time required to activate capacity and make it schedulable.
Open chapter page

Chapter 22: Slurm Automation

Part V Scale and Reliability

This chapter covers automation strategies for reproducible Slurm deployment and lifecycle management.

  • Provisioning: Process of creating and initializing cluster infrastructure resources.
  • Idempotency: Property of automation where repeated execution yields consistent final state.
  • Drift management: Detection and remediation of state divergence over time.
  • Bootstrap: Initial configuration sequence required to make nodes operational.
Open chapter page

Chapter 23: Containers with Slurm

Part V Scale and Reliability

This chapter covers containerized workload execution in Slurm, with emphasis on reproducibility and dependency isolation.

  • Apptainer/Singularity: HPC-friendly container runtimes with user-space execution patterns.
  • Image immutability: Property of fixed container content across executions.
  • Dependency isolation: Separation of software libraries to avoid cross-workload conflicts.
  • Portability: Ability to run the same workload in multiple environments with consistent behavior.
Open chapter page

Chapter 24: Slurm for AI Infrastructure

Part VI AI Integration

This chapter covers integrating Slurm with AI infrastructure stacks built on NCCL, MPI, CUDA, and GPU nodes.

  • NCCL: NVIDIA communication library optimized for multi-GPU collective operations.
  • MPI: Message Passing Interface standard for distributed process coordination.
  • CUDA: NVIDIA parallel computing platform for GPU-accelerated execution.
  • Distributed training: Model training across multiple processes and often multiple nodes.
Open chapter page
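
A distributed training launch can be sketched as a batch-script fragment; node counts, GPU counts, and the training entry point are illustrative placeholders, not a complete NCCL/MPI recipe:

```shell
#!/bin/bash
# Hypothetical multi-node training job: 2 nodes x 4 GPUs, one task per GPU.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=04:00:00

# srun launches one process per task; rank/rendezvous wiring and the
# training script itself are placeholders.
srun python train.py    # train.py is a hypothetical entry point
```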

Chapter 25: Slurm Performance Optimization

Part VI AI Integration

This chapter covers system-level performance optimization in Slurm through scheduler, topology, and locality tuning.

  • Topology-aware scheduling: Placement strategy respecting network and hardware locality structure.
  • Locality: Proximity relationship affecting latency and bandwidth between compute resources.
  • Scheduler cycle: Interval in which the scheduler evaluates queue and dispatch opportunities.
  • Bottleneck: Dominant constraint limiting end-to-end workload throughput or latency.
Open chapter page