Slurm Flashcards
Terminology recall cards across all chapters
100 concept cards
Study Method
- Read the term and attempt to define it before opening the card.
- Check whether your definition covers the mechanism, its constraints, and its purpose.
- Mark weak cards and revisit corresponding chapters for deeper review.
Chapter 1: HPC Workload Management Foundations
4 cards
Cluster
A coordinated set of networked compute resources managed as one scheduling domain.
Workload
A submitted unit of computational work, typically represented as a job or set of jobs.
Scheduler
The decision function that maps queued jobs to available resources over time.
Arbitration
The policy process that resolves contention among competing users.
Chapter 2: Slurm Architecture
4 cards
slurmctld
Primary controller daemon responsible for scheduling and state orchestration.
slurmd
Node-resident daemon that launches and monitors tasks on compute nodes.
slurmdbd
Accounting daemon that persists usage records to a database backend.
Control plane
Logical layer responsible for coordination and policy decisions.
Chapter 3: Slurm Resource Model
4 cards
TRES
Trackable resources such as CPU, memory, GPU, and custom generic resources.
GRES
Generic resources, commonly used for GPUs or specialized devices.
Affinity
Binding of tasks to specific CPUs or NUMA regions.
NUMA
Non-Uniform Memory Access; memory latency depends on placement locality.
Chapter 4: Nodes, Partitions, and Queues
4 cards
Node state
Operational status such as idle, allocated, drain, down, or unknown.
Partition
A policy-scoped subset of nodes presented as a queue target.
Drain
State excluding a node from new allocations while any running jobs are allowed to finish.
Access control
Rules limiting which users or accounts can submit to a partition.
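The node states and controls above are typically inspected and changed with `sinfo` and `scontrol`. A minimal sketch, assuming a live Slurm cluster; the node name `node01` and the reason string are placeholders:

```shell
# List partitions and the state of their nodes
sinfo

# Drain a node: no new jobs are scheduled, running work finishes
scontrol update NodeName=node01 State=DRAIN Reason="disk maintenance"

# Return the node to service
scontrol update NodeName=node01 State=RESUME
```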
Chapter 5: Slurm Configuration
4 cards
slurm.conf
Primary scheduler and cluster topology configuration file.
gres.conf
Device-level generic resource inventory definitions.
cgroup.conf
Runtime enforcement policy for CPU, memory, and device isolation.
Configuration drift
Deviation between intended and actual deployed settings.
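The files above are plain-text key/value configs. An illustrative fragment using standard Slurm directive names; the hostnames, counts, and device paths are placeholders:

```shell
# slurm.conf (fragment): controller, nodes, and a partition
SlurmctldHost=head01
NodeName=node[01-04] CPUs=32 RealMemory=128000 State=UNKNOWN
PartitionName=batch Nodes=node[01-04] Default=YES MaxTime=24:00:00 State=UP

# gres.conf (fragment): declare the GPU devices present on each node
NodeName=node[01-04] Name=gpu File=/dev/nvidia[0-1]
```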
Chapter 6: Authentication and Security
4 cards
Munge
Credential signing service used for authenticated communication in Slurm.
Authorization
Policy decision about whether an authenticated identity may perform an action.
Least privilege
Security principle granting only the minimum required permissions.
Isolation boundary
Administrative separation preventing one tenant from affecting another.
Chapter 7: Slurm Commands and User Interface
4 cards
sinfo
Command for cluster and partition state visibility.
squeue
Command for queued and running job visibility.
sbatch
Command for script-based batch submission.
scancel
Command for job termination by identifier or filter.
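A minimal session using the four commands above, assuming a live Slurm cluster; `job.sh` and the job ID `12345` are placeholders:

```shell
# Cluster and partition state
sinfo

# Your queued and running jobs
squeue -u $USER

# Submit a batch script; prints the assigned job ID
sbatch job.sh

# Cancel a job by ID
scancel 12345
```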
Chapter 8: Job Lifecycle
4 cards
Pending
Job admitted to queue but not yet allocated resources.
Running
Job has active resource allocation and executing tasks.
Completion
Terminal state with successful execution status.
Timeout
Terminal state due to policy-enforced runtime limit breach.
Chapter 9: Job Types
4 cards
Batch job
Script-defined workload submitted for deferred execution.
Interactive job
Allocation enabling immediate user-driven command execution.
Job array
Parametrized set of homogeneous job instances.
Job step
Execution subdivision inside an allocated parent job.
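Each job type above corresponds to a distinct submission pattern. A sketch with placeholder script and program names:

```shell
# Batch submission of a 10-instance job array (indices 0-9)
sbatch --array=0-9 array_job.sh

# Interactive allocation, then a shell on the allocated node
salloc --ntasks=1 --time=00:30:00
srun --pty bash

# Inside a batch script, each srun invocation is a job step
srun ./preprocess
srun ./train
```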
Chapter 10: Scheduling Algorithms
4 cards
FIFO
First-in-first-out ordering based on submission sequence.
Backfill
Scheduling strategy filling idle windows without delaying reservations.
Fairshare
Policy model adjusting priority using historical usage.
Multifactor priority
Composite ranking using weighted scheduling factors.
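In Slurm, the backfill and multifactor policies above are selected in `slurm.conf`. An illustrative fragment:

```shell
# slurm.conf (fragment): enable backfill scheduling and multifactor priority
SchedulerType=sched/backfill
PriorityType=priority/multifactor
```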
Chapter 11: Priority Factors
4 cards
Weight
Administrative coefficient controlling influence of a priority factor.
Decay
Temporal reduction of historical usage influence over time.
Normalization
Scaling factors to comparable ranges before weighted summation.
Fairshare score
Usage-based contribution to current job priority.
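These factors are tuned through the `PriorityWeight*` and decay parameters in `slurm.conf`. An illustrative fragment; the weight values are placeholders a site would tune:

```shell
# slurm.conf (fragment): weights and decay for multifactor priority
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=2000
PriorityDecayHalfLife=7-0    # halve historical usage influence every 7 days
```

A pending job's per-factor breakdown can be inspected with `sprio`.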
Chapter 12: Quality of Service (QoS)
4 cards
QoS
Policy object encoding limits, priorities, and preemption behavior.
Preemption
Policy action allowing one workload to displace another based on rules.
MaxWall
Maximum wall-clock runtime allowed under a policy class.
Resource cap
Upper bound on allocatable resources for a user, account, or QoS class.
Chapter 13: Job Dependencies
4 cards
after
Dependency requiring predecessor job start before dependent eligibility.
afterok
Dependency requiring predecessor successful completion.
afternotok
Dependency requiring predecessor failure state.
singleton
Constraint ensuring one active job of a given key context.
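Each dependency type above maps to a `--dependency` flag on `sbatch`. A sketch with placeholder job IDs and script names:

```shell
# Run b.sh only if job 12345 completes successfully
sbatch --dependency=afterok:12345 b.sh

# Run c.sh only if job 12345 fails
sbatch --dependency=afternotok:12345 c.sh

# At most one job with this name runs (or waits) at a time
sbatch --dependency=singleton --job-name=nightly-etl etl.sh
```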
Chapter 14: Resource Isolation with Cgroups
4 cards
cgroup
Kernel mechanism for hierarchical resource control and accounting.
task/cgroup plugin
Slurm plugin integrating job/task boundaries with cgroup enforcement.
ConstrainRAMSpace
Policy control limiting memory usage within cgroups.
Device whitelist
Allowed device access set for constrained workloads.
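Cgroup enforcement is wired up across `slurm.conf` and `cgroup.conf`. An illustrative fragment using standard parameter names:

```shell
# slurm.conf (fragment): route task management through cgroups
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

# cgroup.conf (fragment): enforce memory limits and the device whitelist
ConstrainRAMSpace=yes
ConstrainDevices=yes
```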
Chapter 15: GPU Scheduling
4 cards
GRES
Generic resource abstraction used for GPUs and specialized devices.
MIG
Multi-Instance GPU partitioning for hardware-level sub-device isolation.
GPU binding
Mapping tasks to specific GPU identifiers for locality control.
Topology awareness
Placement logic incorporating physical interconnect structure.
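A GPU batch script requests devices through GRES and can steer binding for locality. A sketch; `gpu_app` is a placeholder:

```shell
#!/bin/bash
#SBATCH --gres=gpu:2          # request two GPUs per node
#SBATCH --gpu-bind=closest    # bind each task to its nearest GPU

srun ./gpu_app
```

MIG slices are requested the same way, using the site-defined GRES type name.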
Chapter 16: Heterogeneous Jobs
4 cards
Heterogeneous job
Single logical job containing components with different resource vectors.
Component group
One resource-homogeneous segment inside a heterogeneous submission.
Stage coupling
Dependency relation between computational stages in a pipeline.
Cross-stage locality
Data placement relationship affecting transition overhead between stages.
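A heterogeneous submission declares one `#SBATCH` block per component group, separated by `#SBATCH hetjob` lines. A sketch with placeholder programs:

```shell
#!/bin/bash
#SBATCH --ntasks=1 --cpus-per-task=8   # component 0: CPU-side stage
#SBATCH hetjob
#SBATCH --ntasks=4 --gres=gpu:1        # component 1: GPU stage

# Launch one program per component group
srun --het-group=0 ./prepare : --het-group=1 ./train
```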
Chapter 17: Slurm Accounting
4 cards
slurmdbd
Daemon responsible for collecting and persisting accounting records.
sacct
Command-line interface for querying job-level accounting records.
sreport
Command for aggregated utilization and accounting reports.
Chargeback
Attribution of resource consumption cost to organizational entities.
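Both query tools above accept format and time-range options. A sketch; the job ID and dates are placeholders:

```shell
# Per-job accounting records with selected fields
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State

# Aggregated cluster utilization for a date range
sreport cluster Utilization Start=2024-01-01 End=2024-02-01
```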
Chapter 18: Monitoring and Metrics
4 cards
Queue depth
Number of jobs waiting for allocation at a given observation point.
Wait time
Duration from submission until execution start.
Scheduling latency
Controller time to evaluate and dispatch eligible jobs.
Utilization
Fraction of available resources actively consumed by workloads.
Chapter 19: High Availability
4 cards
Primary controller
Active scheduler authority in normal operation.
Backup controller
Standby controller prepared for failover takeover.
Failover
Controlled transfer of scheduler authority to backup infrastructure.
State replication
Propagation of controller state required for coherent takeover.
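Both controllers are declared in `slurm.conf`, and coherent takeover requires the saved state to live on storage both can reach. An illustrative fragment with placeholder hostnames and path:

```shell
# slurm.conf (fragment): primary and backup controllers
SlurmctldHost=ctl01
SlurmctldHost=ctl02                      # backup; takes over on failover
StateSaveLocation=/shared/slurm/state    # must be on shared storage
```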
Chapter 20: Federation and Multi-Cluster
4 cards
Federation
Logical linkage of multiple Slurm clusters for coordinated scheduling semantics.
Global scheduling
Policy-aware placement decisions spanning cluster boundaries.
Cross-cluster sharing
Capability to place workloads where capacity exists across federated members.
Distributed accounting
Usage visibility and governance across multiple cluster domains.
Chapter 21: Power Management
4 cards
Suspend
Transition node to low-power state when not needed for active scheduling.
Resume
Transition node back to schedulable state in response to demand.
Elasticity
Ability to vary active capacity based on workload intensity.
Provisioning latency
Time required to activate capacity and make it schedulable.
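These transitions are driven by site-provided hooks referenced from `slurm.conf`. An illustrative fragment; the script paths and timings are placeholders:

```shell
# slurm.conf (fragment): power-saving hooks
SuspendProgram=/usr/local/sbin/node_suspend.sh   # site-provided script
ResumeProgram=/usr/local/sbin/node_resume.sh     # site-provided script
SuspendTime=600        # seconds idle before a node is suspended
ResumeTimeout=300      # seconds allowed for a node to become usable again
```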
Chapter 22: Slurm Automation
4 cards
Provisioning
Process of creating and initializing cluster infrastructure resources.
Idempotency
Property of automation where repeated execution yields consistent final state.
Drift management
Detection and remediation of state divergence over time.
Bootstrap
Initial configuration sequence required to make nodes operational.
Chapter 23: Containers with Slurm
4 cards
Apptainer/Singularity
HPC-friendly container runtimes with user-space execution patterns.
Image immutability
Property of fixed container content across executions.
Dependency isolation
Separation of software libraries to avoid cross-workload conflicts.
Portability
Ability to run the same workload in multiple environments with consistent behavior.
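A containerized job simply wraps the payload in the container runtime. A sketch; the image and script names are placeholders:

```shell
#!/bin/bash
#SBATCH --ntasks=1

# Run the workload inside an immutable container image
srun apptainer exec training.sif python train.py
```

For GPU work, `apptainer exec --nv` additionally passes through the host's NVIDIA devices.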
Chapter 24: Slurm for AI Infrastructure
4 cards
NCCL
NVIDIA communication library optimized for multi-GPU collective operations.
MPI
Message Passing Interface standard for distributed process coordination.
CUDA
NVIDIA parallel computing platform for GPU-accelerated execution.
Distributed training
Model training across multiple processes and often multiple nodes.
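A typical multi-node training submission requests one task per GPU and lets the framework use NCCL underneath. A sketch; `train.py` is a placeholder, and the exact launcher invocation varies by framework:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

# One task per GPU across both nodes; the framework handles
# NCCL/MPI rendezvous between the launched processes
srun python train.py
```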
Chapter 25: Slurm Performance Optimization
4 cards
Topology-aware scheduling
Placement strategy respecting network and hardware locality structure.
Locality
Proximity relationship affecting latency and bandwidth between compute resources.
Scheduler cycle
Interval in which the scheduler evaluates queue and dispatch opportunities.
Bottleneck
Dominant constraint limiting end-to-end workload throughput or latency.