Chapter 1: HPC Workload Management Foundations
Part I, Foundations. HPC workload management is the discipline of workload governance in high-performance computing clusters. The definition is intentionally strict: the concept is not limited to command usage; it also includes policy semantics, internal coordination logic, and measurable operational outcomes. A novice reader should treat it as a systems concept with explicit boundaries rather than a collection of isolated tools. The same framing applies to every chapter below.
- Cluster: A coordinated set of networked compute resources managed as one scheduling domain.
- Workload: A submitted unit of computational work, typically represented as a job or set of jobs.
- Scheduler: The decision function that maps queued jobs to available resources over time.
- Arbitration: The policy process that resolves contention among competing users.
Chapter 2: Slurm Architecture
Part I, Foundations. Slurm Architecture covers the distributed control plane of Slurm and its daemon coordination model.
- slurmctld: Primary controller daemon responsible for scheduling and state orchestration.
- slurmd: Node-resident daemon that launches and monitors tasks on compute nodes.
- slurmdbd: Accounting daemon that persists usage records to a database backend.
- Control plane: Logical layer responsible for coordination and policy decisions.
Chapter 3: Slurm Resource Model
Part I, Foundations. The Slurm Resource Model covers formal resource semantics and allocation logic in Slurm scheduling.
- TRES: Trackable resources such as CPU, memory, GPU, and custom generic resources.
- GRES: Generic resources, commonly used for GPUs or specialized devices.
- Affinity: Binding of tasks to specific CPUs or NUMA regions.
- NUMA: Non-Uniform Memory Access; memory latency depends on placement locality.
Chapter 4: Nodes, Partitions, and Queues
Part I, Foundations. Nodes, Partitions, and Queues covers node-state management and partition-level queue policy in multi-user clusters.
- Node state: Operational status such as idle, allocated, drain, down, or unknown.
- Partition: A policy-scoped subset of nodes presented as a queue target.
- Drain: State indicating intentional exclusion from new scheduling.
- Access control: Rules limiting which users or accounts can submit to a partition.
Chapter 5: Slurm Configuration
Part I, Foundations. Slurm Configuration covers configuration governance and correctness across Slurm's control and runtime layers.
- slurm.conf: Primary scheduler and cluster topology configuration file.
- gres.conf: Device-level generic resource inventory definitions.
- cgroup.conf: Runtime enforcement policy for CPU, memory, and device isolation.
- Configuration drift: Deviation between intended and actual deployed settings.
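Configuration drift, the last term above, can be checked mechanically: render the intended settings, collect what each node actually reports, and diff the two. A minimal sketch of that diff (the setting names shown are real slurm.conf keys, but the values and the flat key/value model are illustrative, not a real slurm.conf parser):

```python
def find_drift(intended: dict, deployed: dict) -> dict:
    """Return keys whose deployed value differs from, or is missing
    relative to, the intended configuration."""
    drift = {}
    for key, want in intended.items():
        have = deployed.get(key)
        if have != want:
            drift[key] = {"intended": want, "deployed": have}
    return drift

# Illustrative values only.
intended = {"SelectType": "select/cons_tres", "SchedulerType": "sched/backfill"}
deployed = {"SelectType": "select/cons_tres", "SchedulerType": "sched/builtin"}

print(find_drift(intended, deployed))
# {'SchedulerType': {'intended': 'sched/backfill', 'deployed': 'sched/builtin'}}
```

In practice the "deployed" side would come from querying each daemon, since a file on disk can differ from what a running slurmctld actually loaded.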
Chapter 6: Authentication and Security
Part II, Operations. Authentication and Security covers authentication trust boundaries and multi-tenant security in Slurm clusters.
- Munge: Credential signing service used for authenticated communication in Slurm.
- Authorization: Policy decision about whether an authenticated identity may perform an action.
- Least privilege: Security principle granting only the minimum required permissions.
- Isolation boundary: Administrative separation preventing one tenant from affecting another.
Chapter 7: Slurm Commands and User Interface
Part II, Operations. Slurm Commands and User Interface covers the operator command taxonomy for observability, submission, and control.
- sinfo: Command for cluster and partition state visibility.
- squeue: Command for queued and running job visibility.
- sbatch: Command for script-based batch submission.
- scancel: Command for job termination by identifier or filter.
Chapter 8: Job Lifecycle
Part II, Operations. Job Lifecycle covers the formal state machine through which every Slurm job passes.
- Pending: Job admitted to the queue but not yet allocated resources.
- Running: Job has an active resource allocation and executing tasks.
- Completion: Terminal state with successful execution status.
- Timeout: Terminal state due to a policy-enforced runtime limit breach.
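The lifecycle is most useful when read as an explicit transition table: every state has a fixed set of legal successors, and terminal states have none. A small sketch using a subset of Slurm's real state names (the full set is larger and includes states like SUSPENDED and NODE_FAIL):

```python
# Legal transitions in a simplified job state machine.
# Terminal states have no outgoing edges.
TRANSITIONS = {
    "PENDING":   {"RUNNING", "CANCELLED"},
    "RUNNING":   {"COMPLETED", "FAILED", "TIMEOUT", "CANCELLED"},
    "COMPLETED": set(),
    "FAILED":    set(),
    "TIMEOUT":   set(),
    "CANCELLED": set(),
}

def can_transition(src: str, dst: str) -> bool:
    """True if dst is a legal successor state of src."""
    return dst in TRANSITIONS.get(src, set())

print(can_transition("PENDING", "RUNNING"))    # True
print(can_transition("COMPLETED", "RUNNING"))  # False: terminal states are final
```

Modeling the lifecycle this way makes invalid reports (e.g. a "completed" job that starts running again) detectable as data errors rather than surprises.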
Chapter 9: Job Types
Part II, Operations. Job Types covers the execution-shape taxonomy across batch, interactive, array, and step-level models.
- Batch job: Script-defined workload submitted for deferred execution.
- Interactive job: Allocation enabling immediate user-driven command execution.
- Job array: Parametrized set of homogeneous job instances.
- Job step: Execution subdivision inside an allocated parent job.
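Job arrays are specified compactly at submission time with an index expression such as `0-9:2` (a range with a step) or `1,3,7`, optionally suffixed with `%limit` to cap concurrency. A sketch of how such a spec expands into individual task IDs, mirroring the `sbatch --array` range syntax (the concurrency suffix is parsed but ignored here):

```python
def expand_array_spec(spec: str) -> list:
    """Expand an array index spec such as '0-9:2' or '1,3,7%2' into task IDs."""
    spec = spec.split("%")[0]  # drop the optional '%limit' concurrency suffix
    ids = []
    for part in spec.split(","):
        if "-" in part:
            rng, _, step = part.partition(":")  # 'start-end' and optional step
            start, end = map(int, rng.split("-"))
            ids.extend(range(start, end + 1, int(step) if step else 1))
        else:
            ids.append(int(part))
    return ids

print(expand_array_spec("0-9:2"))    # [0, 2, 4, 6, 8]
print(expand_array_spec("1,3,7%2"))  # [1, 3, 7]
```

Each expanded ID becomes one array task, visible to the script through the `SLURM_ARRAY_TASK_ID` environment variable.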
Chapter 10: Scheduling Algorithms
Part III, Scheduling. Scheduling Algorithms covers algorithmic scheduling behavior across FIFO, backfill, fairshare, and priority models.
- FIFO: First-in-first-out ordering based on submission sequence.
- Backfill: Scheduling strategy filling idle windows without delaying reservations.
- Fairshare: Policy model adjusting priority using historical usage.
- Multifactor priority: Composite ranking using weighted scheduling factors.
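The backfill idea is easiest to see in a deliberately simplified model: walk the queue in priority order; when the highest-priority job cannot fit, compute its earliest possible start (the "shadow time") and let lower-priority jobs run now only if they both fit in the idle nodes and finish before that shadow time. This is a one-resource-dimension sketch of the concept, not Slurm's sched/backfill implementation:

```python
def backfill(jobs, free_nodes, now=0):
    """Greedy backfill sketch. jobs: (name, nodes, walltime) in priority order."""
    started, ending = [], []   # ending: (finish_time, nodes) of running jobs
    shadow = None              # earliest start time of the first blocked job
    for name, nodes, wall in jobs:
        if shadow is None:
            if nodes <= free_nodes:
                started.append(name)
                free_nodes -= nodes
                ending.append((now + wall, nodes))
            else:
                # Reserve: find the earliest time enough nodes free up.
                avail = free_nodes
                for t, n in sorted(ending):
                    avail += n
                    if avail >= nodes:
                        shadow = t
                        break
        else:
            # Backfill only if it fits now AND ends before the reservation.
            if nodes <= free_nodes and now + wall <= shadow:
                started.append(name)
                free_nodes -= nodes
    return started

# 8 free nodes: A starts, B (6 nodes) is blocked and reserves t=10,
# C (5 min) backfills, D (20 min) would delay B and must wait.
jobs = [("A", 4, 10), ("B", 6, 10), ("C", 2, 5), ("D", 2, 20)]
print(backfill(jobs, free_nodes=8))  # ['A', 'C']
```

The key invariant is that backfilled jobs never push the reserved start time of the blocked high-priority job; real backfill schedulers enforce the same invariant across many resource dimensions and many reservations at once.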
Chapter 11: Priority Factors
Part III, Scheduling. Priority Factors covers the formal decomposition of Slurm priority factors and their weighting behavior.
- Weight: Administrative coefficient controlling the influence of a priority factor.
- Decay: Temporal reduction of historical usage influence over time.
- Normalization: Scaling factors to comparable ranges before weighted summation.
- Fairshare score: Usage-based contribution to current job priority.
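These four terms compose directly: each factor is normalized to [0.0, 1.0], multiplied by its administrative weight, and summed; historical usage feeding the fairshare factor is discounted with an exponential half-life. A sketch of both pieces (weight values and the usage-event shape are illustrative, not real site configuration):

```python
def job_priority(factors: dict, weights: dict) -> int:
    """Weighted sum of normalized priority factors, in the spirit of
    Slurm's multifactor priority plugin."""
    return int(sum(weights[name] * value for name, value in factors.items()))

def decayed_usage(usage_events, now, half_life):
    """Historical usage discounted with an exponential half-life.
    usage_events: iterable of (timestamp, amount)."""
    return sum(u * 0.5 ** ((now - t) / half_life) for t, u in usage_events)

weights = {"age": 1000, "fairshare": 5000, "qos": 2000}  # illustrative
factors = {"age": 0.5, "fairshare": 0.2, "qos": 1.0}     # each normalized to [0, 1]

print(job_priority(factors, weights))          # 3500
print(decayed_usage([(0, 100)], now=7, half_life=7))  # 50.0: one half-life elapsed
```

Normalization is what keeps the weights meaningful: without it, a factor measured in seconds would silently dominate one measured as a ratio.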
Chapter 12: Quality of Service (QoS)
Part III, Scheduling. Quality of Service (QoS) covers quality-of-service policy as a governance layer for runtime limits and resource entitlement.
- QoS: Policy object encoding limits, priorities, and preemption behavior.
- Preemption: Policy action allowing one workload to displace another based on rules.
- MaxWall: Maximum wall-clock runtime allowed under a policy class.
- Resource cap: Upper bound on allocatable resources for a user, account, or QoS class.
Chapter 13: Job Dependencies
Part III, Scheduling. Job Dependencies covers dependency-driven workflow orchestration in Slurm job graphs.
- after: Dependency requiring the predecessor job to start before the dependent becomes eligible.
- afterok: Dependency requiring successful completion of the predecessor.
- afternotok: Dependency requiring a failure state of the predecessor.
- singleton: Constraint ensuring only one active job for a given key context.
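The dependency types above reduce to predicates over the predecessor's state, which is how a workflow engine (or the scheduler itself) decides eligibility. A simplified sketch of that evaluation, modeled loosely on `sbatch --dependency` semantics (the exact state sets Slurm consults are richer than shown here):

```python
def eligible(dep_type: str, pred_state: str) -> bool:
    """Is a dependent job eligible, given one dependency and the
    predecessor's current state? Simplified model."""
    if dep_type == "after":       # predecessor has at least started
        return pred_state in {"RUNNING", "COMPLETED", "FAILED", "TIMEOUT", "CANCELLED"}
    if dep_type == "afterok":     # predecessor finished successfully
        return pred_state == "COMPLETED"
    if dep_type == "afternotok":  # predecessor ended in a failure state
        return pred_state in {"FAILED", "TIMEOUT", "CANCELLED"}
    raise ValueError("unknown dependency type: " + dep_type)

def singleton_ok(active_job_names, name: str) -> bool:
    """singleton: eligible only if no active job shares the key (job name)."""
    return name not in active_job_names

print(eligible("afterok", "COMPLETED"))     # True
print(eligible("afternotok", "COMPLETED")) # False
print(singleton_ok({"train"}, "train"))    # False
```

Note the asymmetry: `afterok` and `afternotok` partition the terminal states, so a cleanup job chained with `afternotok` never runs when the main job succeeds, and vice versa.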
Chapter 14: Resource Isolation with Cgroups
Part III, Scheduling. Resource Isolation with Cgroups covers runtime resource isolation through Linux control groups in Slurm.
- cgroup: Kernel mechanism for hierarchical resource control and accounting.
- task/cgroup plugin: Slurm plugin integrating job/task boundaries with cgroup enforcement.
- ConstrainRAMSpace: Policy control limiting memory usage within cgroups.
- Device whitelist: Allowed device access set for constrained workloads.
Chapter 15: GPU Scheduling
Part IV, Accelerated Workloads. GPU Scheduling covers accelerator-aware scheduling using GRES, topology, and binding semantics.
- GRES: Generic resource abstraction used for GPUs and specialized devices.
- MIG: Multi-Instance GPU partitioning for hardware-level sub-device isolation.
- GPU binding: Mapping tasks to specific GPU identifiers for locality control.
- Topology awareness: Placement logic incorporating physical interconnect structure.
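GPU binding and topology awareness meet in one small decision: given a task pinned to some NUMA domain, prefer free GPUs attached to that domain before reaching across the interconnect. A toy sketch with a hypothetical four-GPU, two-NUMA-node topology (the map and the greedy policy are illustrative, not Slurm's selection logic):

```python
# Hypothetical topology: which NUMA domain each GPU is attached to.
GPU_NUMA = {0: 0, 1: 0, 2: 1, 3: 1}

def pick_gpus(task_numa: int, count: int, free: set) -> list:
    """Prefer free GPUs local to the task's NUMA domain; fall back to
    remote GPUs only when locals run out. Returns [] if demand can't be met."""
    local  = [g for g in sorted(free) if GPU_NUMA[g] == task_numa]
    remote = [g for g in sorted(free) if GPU_NUMA[g] != task_numa]
    chosen = (local + remote)[:count]
    return chosen if len(chosen) == count else []

print(pick_gpus(task_numa=1, count=2, free={0, 2, 3}))  # [2, 3]: both local
print(pick_gpus(task_numa=0, count=2, free={0, 3}))     # [0, 3]: one remote fallback
```

Crossing the NUMA or PCIe boundary does not make a job fail; it quietly costs host-to-device bandwidth, which is why locality-first selection matters at scale.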
Chapter 16: Heterogeneous Jobs
Part IV, Accelerated Workloads. Heterogeneous Jobs covers heterogeneous job composition for multi-stage workflows spanning dissimilar node classes.
- Heterogeneous job: Single logical job containing components with different resource vectors.
- Component group: One resource-homogeneous segment inside a heterogeneous submission.
- Stage coupling: Dependency relation between computational stages in a pipeline.
- Cross-stage locality: Data placement relationship affecting transition overhead between stages.
Chapter 17: Slurm Accounting
Part IV, Accelerated Workloads. Slurm Accounting covers the accounting model and governance analytics through slurmdbd and its reporting interfaces.
- slurmdbd: Daemon responsible for collecting and persisting accounting records.
- sacct: Command-line interface for querying job-level accounting records.
- sreport: Command for aggregated utilization and accounting reports.
- Chargeback: Attribution of resource consumption cost to organizational entities.
Chapter 18: Monitoring and Metrics
Part IV, Accelerated Workloads. Monitoring and Metrics covers observability strategy and metric semantics for scheduler and cluster health.
- Queue depth: Number of jobs waiting for allocation at a given observation point.
- Wait time: Duration from submission until execution start.
- Scheduling latency: Controller time to evaluate and dispatch eligible jobs.
- Utilization: Fraction of available resources actively consumed by workloads.
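Wait time and utilization both fall out of per-job timestamps, which accounting systems already record. A sketch over hypothetical job records (the tuple layout is an assumption for illustration; in practice these fields would come from accounting queries):

```python
from statistics import mean

# Hypothetical records: (submit_time, start_time, end_time, cpus), in minutes.
jobs = [(0, 10, 40, 8), (5, 12, 30, 4), (20, 25, 60, 16)]

# Wait time: submission to execution start, per job.
wait_times = [start - submit for submit, start, _end, _cpus in jobs]
print(wait_times)        # [10, 7, 5]
print(mean(wait_times))  # mean wait in minutes

def utilization(jobs, cluster_cpus, t0, t1):
    """Fraction of the cluster's CPU-time consumed by jobs within [t0, t1].
    Each job contributes cpus * (overlap of its run interval with the window)."""
    used = sum(max(0, min(end, t1) - max(start, t0)) * cpus
               for _submit, start, end, cpus in jobs)
    return used / (cluster_cpus * (t1 - t0))

print(utilization(jobs, cluster_cpus=32, t0=0, t1=60))  # ≈ 0.454
```

Clipping each job's run interval to the observation window is the detail that matters: without it, long jobs straddling the window boundary inflate the utilization figure.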
Chapter 19: High Availability
Part V, Scale and Reliability. High Availability covers high-availability controller design and failover semantics in Slurm.
- Primary controller: Active scheduler authority in normal operation.
- Backup controller: Standby controller prepared for failover takeover.
- Failover: Controlled transfer of scheduler authority to backup infrastructure.
- State replication: Propagation of controller state required for coherent takeover.
Chapter 20: Federation and Multi-Cluster
Part V, Scale and Reliability. Federation and Multi-Cluster covers federated scheduling and governance across multiple Slurm clusters.
- Federation: Logical linkage of multiple Slurm clusters for coordinated scheduling semantics.
- Global scheduling: Policy-aware placement decisions spanning cluster boundaries.
- Cross-cluster sharing: Capability to place workloads where capacity exists across federated members.
- Distributed accounting: Usage visibility and governance across multiple cluster domains.
Chapter 21: Power Management
Part V, Scale and Reliability. Power Management covers power-state orchestration and elastic capacity control in Slurm-managed environments.
- Suspend: Transition a node to a low-power state when not needed for active scheduling.
- Resume: Transition a node back to a schedulable state in response to demand.
- Elasticity: Ability to vary active capacity based on workload intensity.
- Provisioning latency: Time required to activate capacity and make it schedulable.
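At its core, elastic power management is a control loop: observe queue depth and idle capacity, then decide whether to resume nodes, suspend surplus, or hold. A toy policy sketch (thresholds and the three-way decision are illustrative; real policies also weigh provisioning latency so they do not thrash):

```python
def power_action(queue_depth: int, idle_nodes: int,
                 min_idle: int = 2, max_idle: int = 8) -> str:
    """Toy suspend/resume policy keeping a bounded pool of idle nodes.

    - resume:  demand exists and the idle buffer is too thin
    - suspend: no demand and more idle nodes than the buffer allows
    - hold:    otherwise, do nothing this cycle
    """
    if queue_depth > 0 and idle_nodes < min_idle:
        return "resume"
    if queue_depth == 0 and idle_nodes > max_idle:
        return "suspend"
    return "hold"

print(power_action(queue_depth=5, idle_nodes=0))   # resume
print(power_action(queue_depth=0, idle_nodes=12))  # suspend
print(power_action(queue_depth=1, idle_nodes=4))   # hold
```

The gap between `min_idle` and `max_idle` is deliberate hysteresis: without it, a single job arriving and finishing could trigger a resume immediately followed by a suspend, paying provisioning latency twice for nothing.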
Chapter 22: Slurm Automation
Part V, Scale and Reliability. Slurm Automation covers automation strategies for reproducible Slurm deployment and lifecycle management.
- Provisioning: Process of creating and initializing cluster infrastructure resources.
- Idempotency: Property of automation where repeated execution yields a consistent final state.
- Drift management: Detection and remediation of state divergence over time.
- Bootstrap: Initial configuration sequence required to make nodes operational.
Chapter 23: Containers with Slurm
Part V, Scale and Reliability. Containers with Slurm covers containerized workload execution in Slurm with reproducibility and dependency isolation.
- Apptainer/Singularity: HPC-friendly container runtimes with user-space execution patterns.
- Image immutability: Property of fixed container content across executions.
- Dependency isolation: Separation of software libraries to avoid cross-workload conflicts.
- Portability: Ability to run the same workload in multiple environments with consistent behavior.
Chapter 24: Slurm for AI Infrastructure
Part VI, AI Integration. Slurm for AI Infrastructure covers integrating Slurm with AI infrastructure stacks built on NCCL, MPI, CUDA, and GPU nodes.
- NCCL: NVIDIA communication library optimized for multi-GPU collective operations.
- MPI: Message Passing Interface standard for distributed process coordination.
- CUDA: NVIDIA parallel computing platform for GPU-accelerated execution.
- Distributed training: Model training across multiple processes and often multiple nodes.
Chapter 25: Slurm Performance Optimization
Part VI, AI Integration. Slurm Performance Optimization covers system-level performance optimization through scheduler, topology, and locality tuning.
- Topology-aware scheduling: Placement strategy respecting network and hardware locality structure.
- Locality: Proximity relationship affecting latency and bandwidth between compute resources.
- Scheduler cycle: Interval in which the scheduler evaluates queue and dispatch opportunities.
- Bottleneck: Dominant constraint limiting end-to-end workload throughput or latency.