Slurm Flashcards
Terminology recall cards across all chapters
100 concept cards
Study Method
- Read the term and attempt to define it before opening the card.
- Check whether your definition covers the mechanism, its constraints, and its purpose.
- Mark weak cards and revisit corresponding chapters for deeper review.
Chapter 1: HPC Workload Management Foundations
4 cards
Cluster
A coordinated set of networked compute resources managed as one scheduling domain.
Workload
A submitted unit of computational work, typically represented as a job or set of jobs.
Scheduler
The decision function that maps queued jobs to available resources over time.
Arbitration
The policy process that resolves contention among competing users.
Chapter 2: Slurm Architecture
4 cards
slurmctld
Primary controller daemon responsible for scheduling and state orchestration.
slurmd
Node-resident daemon that launches and monitors tasks on compute nodes.
slurmdbd
Accounting daemon that persists usage records to a database backend.
Control plane
Logical layer responsible for coordination and policy decisions.
Chapter 3: Slurm Resource Model
4 cards
TRES
Trackable resources such as CPU, memory, GPU, and custom generic resources.
GRES
Generic resources, commonly used for GPUs or specialized devices.
Affinity
Binding of tasks to specific CPUs or NUMA regions.
NUMA
Non-Uniform Memory Access; memory latency depends on placement locality.
Chapter 4: Nodes, Partitions, and Queues
4 cards
Node state
Operational status such as idle, allocated, drain, down, or unknown.
Partition
A policy-scoped subset of nodes presented as a queue target.
Drain
State excluding a node from new allocations while any running jobs are allowed to finish.
Access control
Rules limiting which users or accounts can submit to a partition.
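The node states and controls above are typically inspected and changed with `sinfo` and `scontrol`. A minimal sketch, assuming a live Slurm cluster; the node name `node01` and the reason string are placeholders:

```shell
# List partitions and the state of their nodes
sinfo

# Drain a node: no new jobs are scheduled, running work finishes
scontrol update NodeName=node01 State=DRAIN Reason="disk maintenance"

# Return the node to service
scontrol update NodeName=node01 State=RESUME
```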
Chapter 5: Slurm Configuration
4 cards
slurm.conf
Primary scheduler and cluster topology configuration file.
gres.conf
Device-level generic resource inventory definitions.
cgroup.conf
Runtime enforcement policy for CPU, memory, and device isolation.
Configuration drift
Deviation between intended and actual deployed settings.
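The files above are plain-text key/value configs. An illustrative fragment using standard Slurm directive names; the hostnames, counts, and device paths are placeholders:

```shell
# slurm.conf (fragment): controller, nodes, and a partition
SlurmctldHost=head01
NodeName=node[01-04] CPUs=32 RealMemory=128000 State=UNKNOWN
PartitionName=batch Nodes=node[01-04] Default=YES MaxTime=24:00:00 State=UP

# gres.conf (fragment): declare the GPU devices present on each node
NodeName=node[01-04] Name=gpu File=/dev/nvidia[0-1]
```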
Chapter 6: Authentication and Security
4 cards
Munge
Credential signing service used for authenticated communication in Slurm.
Authorization
Policy decision about whether an authenticated identity may perform an action.
Least privilege
Security principle granting only the minimum required permissions.
Isolation boundary
Administrative separation preventing one tenant from affecting another.
Chapter 7: Slurm Commands and User Interface
4 cards
sinfo
Command for cluster and partition state visibility.
squeue
Command for queued and running job visibility.
sbatch
Command for script-based batch submission.
scancel
Command for job termination by identifier or filter.
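A minimal session using the four commands above, assuming a live Slurm cluster; `job.sh` and the job ID `12345` are placeholders:

```shell
# Cluster and partition state
sinfo

# Your queued and running jobs
squeue -u $USER

# Submit a batch script; prints the assigned job ID
sbatch job.sh

# Cancel a job by ID
scancel 12345
```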
Chapter 8: Job Lifecycle
4 cards
Pending
Job admitted to queue but not yet allocated resources.
Running
Job has active resource allocation and executing tasks.
Completion
Terminal state with successful execution status.
Timeout
Terminal state due to policy-enforced runtime limit breach.
Chapter 9: Job Types
4 cards
Batch job
Script-defined workload submitted for deferred execution.
Interactive job
Allocation enabling immediate user-driven command execution.
Job array
Parametrized set of homogeneous job instances.
Job step
Execution subdivision inside an allocated parent job.
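Each job type above corresponds to a distinct submission pattern. A sketch with placeholder script and program names:

```shell
# Batch submission of a 10-instance job array (indices 0-9)
sbatch --array=0-9 array_job.sh

# Interactive allocation, then a shell on the allocated node
salloc --ntasks=1 --time=00:30:00
srun --pty bash

# Inside a batch script, each srun invocation is a job step
srun ./preprocess
srun ./train
```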
Chapter 10: Scheduling Algorithms
4 cards
FIFO
First-in-first-out ordering based on submission sequence.
Backfill
Scheduling strategy filling idle windows without delaying reservations.
Fairshare
Policy model adjusting priority using historical usage.
Multifactor priority
Composite ranking using weighted scheduling factors.
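In Slurm, the backfill and multifactor policies above are selected in `slurm.conf`. An illustrative fragment:

```shell
# slurm.conf (fragment): enable backfill scheduling and multifactor priority
SchedulerType=sched/backfill
PriorityType=priority/multifactor
```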
Chapter 11: Priority Factors
4 cards
Weight
Administrative coefficient controlling influence of a priority factor.
Decay
Temporal reduction of historical usage influence over time.
Normalization
Scaling factors to comparable ranges before weighted summation.
Fairshare score
Usage-based contribution to current job priority.
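These factors are tuned through the `PriorityWeight*` and decay parameters in `slurm.conf`. An illustrative fragment; the weight values are placeholders a site would tune:

```shell
# slurm.conf (fragment): weights and decay for multifactor priority
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=2000
PriorityDecayHalfLife=7-0    # halve historical usage influence every 7 days
```

A pending job's per-factor breakdown can be inspected with `sprio`.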
Chapter 12: Quality of Service (QoS)
4 cards
QoS
Policy object encoding limits, priorities, and preemption behavior.
Preemption
Policy action allowing one workload to displace another based on rules.
MaxWall
Maximum wall-clock runtime allowed under a policy class.
Resource cap
Upper bound on allocatable resources for a user, account, or QoS class.
Chapter 13: Job Dependencies
4 cards
after
Dependency requiring predecessor job start before dependent eligibility.
afterok
Dependency requiring predecessor successful completion.
afternotok
Dependency requiring predecessor failure state.
singleton
Constraint ensuring one active job of a given key context.
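Each dependency type above maps to a `--dependency` flag on `sbatch`. A sketch with placeholder job IDs and script names:

```shell
# Run b.sh only if job 12345 completes successfully
sbatch --dependency=afterok:12345 b.sh

# Run c.sh only if job 12345 fails
sbatch --dependency=afternotok:12345 c.sh

# At most one job with this name runs (or waits) at a time
sbatch --dependency=singleton --job-name=nightly-etl etl.sh
```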
Chapter 14: Resource Isolation with Cgroups
4 cards
cgroup
Kernel mechanism for hierarchical resource control and accounting.
task/cgroup plugin
Slurm plugin integrating job/task boundaries with cgroup enforcement.
ConstrainRAMSpace
Policy control limiting memory usage within cgroups.
Device whitelist
Allowed device access set for constrained workloads.
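Cgroup enforcement is wired up across `slurm.conf` and `cgroup.conf`. An illustrative fragment using standard parameter names:

```shell
# slurm.conf (fragment): route task management through cgroups
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

# cgroup.conf (fragment): enforce memory limits and the device whitelist
ConstrainRAMSpace=yes
ConstrainDevices=yes
```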
Chapter 15: GPU Scheduling
4 cards
GRES
Generic resource abstraction used for GPUs and specialized devices.
MIG
Multi-Instance GPU partitioning for hardware-level sub-device isolation.
GPU binding
Mapping tasks to specific GPU identifiers for locality control.
Topology awareness
Placement logic incorporating physical interconnect structure.
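A GPU batch script requests devices through GRES and can steer binding for locality. A sketch; `gpu_app` is a placeholder:

```shell
#!/bin/bash
#SBATCH --gres=gpu:2          # request two GPUs per node
#SBATCH --gpu-bind=closest    # bind each task to its nearest GPU

srun ./gpu_app
```

MIG slices are requested the same way, using the site-defined GRES type name.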
Chapter 16: Heterogeneous Jobs
4 cards
Heterogeneous job
Single logical job containing components with different resource vectors.
Component group
One resource-homogeneous segment inside a heterogeneous submission.
Stage coupling
Dependency relation between computational stages in a pipeline.
Cross-stage locality
Data placement relationship affecting transition overhead between stages.
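A heterogeneous submission declares one `#SBATCH` block per component group, separated by `#SBATCH hetjob` lines. A sketch with placeholder programs:

```shell
#!/bin/bash
#SBATCH --ntasks=1 --cpus-per-task=8   # component 0: CPU-side stage
#SBATCH hetjob
#SBATCH --ntasks=4 --gres=gpu:1        # component 1: GPU stage

# Launch one program per component group
srun --het-group=0 ./prepare : --het-group=1 ./train
```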
Chapter 17: Slurm Accounting
4 cards
slurmdbd
Daemon responsible for collecting and persisting accounting records.
sacct
Command-line interface for querying job-level accounting records.
sreport
Command for aggregated utilization and accounting reports.
Chargeback
Attribution of resource consumption cost to organizational entities.
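Both query tools above accept format and time-range options. A sketch; the job ID and dates are placeholders:

```shell
# Per-job accounting records with selected fields
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State

# Aggregated cluster utilization for a date range
sreport cluster Utilization Start=2024-01-01 End=2024-02-01
```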
Chapter 18: Monitoring and Metrics
4 cards
Queue depth
Number of jobs waiting for allocation at a given observation point.
Wait time
Duration from submission until execution start.
Scheduling latency
Controller time to evaluate and dispatch eligible jobs.
Utilization
Fraction of available resources actively consumed by workloads.
Chapter 19: High Availability
4 cards
Primary controller
Active scheduler authority in normal operation.
Backup controller
Standby controller prepared for failover takeover.
Failover
Controlled transfer of scheduler authority to backup infrastructure.
State replication
Propagation of controller state required for coherent takeover.
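Both controllers are declared in `slurm.conf`, and coherent takeover requires the saved state to live on storage both can reach. An illustrative fragment with placeholder hostnames and path:

```shell
# slurm.conf (fragment): primary and backup controllers
SlurmctldHost=ctl01
SlurmctldHost=ctl02                      # backup; takes over on failover
StateSaveLocation=/shared/slurm/state    # must be on shared storage
```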
Chapter 20: Federation and Multi-Cluster
4 cards
Federation
Logical linkage of multiple Slurm clusters for coordinated scheduling semantics.
Global scheduling
Policy-aware placement decisions spanning cluster boundaries.
Cross-cluster sharing
Capability to place workloads where capacity exists across federated members.
Distributed accounting
Usage visibility and governance across multiple cluster domains.
Chapter 21: Power Management
4 cards
Suspend
Transition node to low-power state when not needed for active scheduling.
Resume
Transition node back to schedulable state in response to demand.
Elasticity
Ability to vary active capacity based on workload intensity.
Provisioning latency
Time required to activate capacity and make it schedulable.
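These transitions are driven by site-provided hooks referenced from `slurm.conf`. An illustrative fragment; the script paths and timings are placeholders:

```shell
# slurm.conf (fragment): power-saving hooks
SuspendProgram=/usr/local/sbin/node_suspend.sh   # site-provided script
ResumeProgram=/usr/local/sbin/node_resume.sh     # site-provided script
SuspendTime=600        # seconds idle before a node is suspended
ResumeTimeout=300      # seconds allowed for a node to become usable again
```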
Chapter 22: Slurm Automation
4 cards
Provisioning
Process of creating and initializing cluster infrastructure resources.
Idempotency
Property of automation where repeated execution yields consistent final state.
Drift management
Detection and remediation of state divergence over time.
Bootstrap
Initial configuration sequence required to make nodes operational.
Chapter 23: Containers with Slurm
4 cards
Apptainer/Singularity
HPC-friendly container runtimes with user-space execution patterns.
Image immutability
Property of fixed container content across executions.
Dependency isolation
Separation of software libraries to avoid cross-workload conflicts.
Portability
Ability to run the same workload in multiple environments with consistent behavior.
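A containerized job simply wraps the payload in the container runtime. A sketch; the image and script names are placeholders:

```shell
#!/bin/bash
#SBATCH --ntasks=1

# Run the workload inside an immutable container image
srun apptainer exec training.sif python train.py
```

For GPU work, `apptainer exec --nv` additionally passes through the host's NVIDIA devices.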
Chapter 24: Slurm for AI Infrastructure
4 cards
NCCL
NVIDIA communication library optimized for multi-GPU collective operations.
MPI
Message Passing Interface standard for distributed process coordination.
CUDA
NVIDIA parallel computing platform for GPU-accelerated execution.
Distributed training
Model training across multiple processes and often multiple nodes.
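A typical multi-node training submission requests one task per GPU and lets the framework use NCCL underneath. A sketch; `train.py` is a placeholder, and the exact launcher invocation varies by framework:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

# One task per GPU across both nodes; the framework handles
# NCCL/MPI rendezvous between the launched processes
srun python train.py
```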
Chapter 25: Slurm Performance Optimization
4 cards
Topology-aware scheduling
Placement strategy respecting network and hardware locality structure.
Locality
Proximity relationship affecting latency and bandwidth between compute resources.
Scheduler cycle
Interval in which the scheduler evaluates queue and dispatch opportunities.
Bottleneck
Dominant constraint limiting end-to-end workload throughput or latency.