`seisflows.system.slurm`

The Simple Linux Utility for Resource Management (SLURM) is a workload manager commonly used on high performance computers and clusters. The Slurm system class provides generalized utilities for interacting with SLURM systems.

Useful commands for figuring out system-specific required parameters:

$ sinfo --Node --long  # Determine the cores-per-node for partitions

Note

The main development system for SeisFlows used SLURM, so the other system classes will not be up to date until access to those systems is granted. This Rosetta stone for converting from SLURM to other workload management tools will be useful: https://slurm.schedmd.com/rosetta.pdf

Note

SLURM systems expect walltime/tasktime in one of the formats “minutes”, “minutes:seconds”, or “hours:minutes:seconds”. SeisFlows uses the last of these and converts task times and walltimes from an input in minutes to a time string.
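The conversion described in the note above can be sketched as follows. This is a minimal illustration, not the SeisFlows implementation: the helper name `minutes_to_slurm_time` is hypothetical, and it assumes the input is a number of minutes (possibly fractional).

```python
def minutes_to_slurm_time(minutes):
    """Convert a duration in minutes to an 'hours:minutes:seconds' string.

    Hypothetical helper for illustration only; not part of SeisFlows.
    """
    total_seconds = int(round(minutes * 60))
    hours, remainder = divmod(total_seconds, 3600)
    mins, secs = divmod(remainder, 60)
    # Hours are not wrapped into days, so long walltimes stay in one field
    return f"{hours:02d}:{mins:02d}:{secs:02d}"


print(minutes_to_slurm_time(90))    # 01:30:00
print(minutes_to_slurm_time(0.5))   # 00:00:30
```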

TODO: Create ‘slurm_singularity’, a child class for singularity-based runs which loads and runs programs through singularity, OR add a parameter option which will change the run and/or submit calls

Module Contents

Classes

Slurm

System Slurm

Functions

`check_job_status_array`(job_id)	Repeatedly check the status of a currently running job using 'sacct'.
`check_job_status_list`(job_ids)	Check the status of a list of currently running jobs.
`query_job_states`(job_id[, _recheck])	Queries completion status of an array job by running the SLURM cmd sacct
`modify_run_call_single_proc`(run_call)	Modifies a SLURM SBATCH command to use only 1 processor as a single run

Attributes

BAD_STATES

seisflows.system.slurm.BAD_STATES = ['TIMEOUT', 'FAILED', 'NODE_FAIL', 'OUT_OF_MEMORY', 'CANCELLED']

class seisflows.system.slurm.Slurm(ntask_max=100, slurm_args='', **kwargs)

Bases: seisflows.system.cluster.Cluster

System Slurm

Interface for submitting and monitoring jobs on HPC systems running the Simple Linux Utility for Resource Management (SLURM) workload manager.

Parameters

slurm_args (str) – Any (optional) additional SLURM arguments that will be passed to the SBATCH scripts. Should be in the form: ‘--key1=value1 --key2=value2’
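To make the expected form of `slurm_args` concrete, here is a small sketch that splits such a string into key/value pairs and rejects tokens that do not look like `--key=value`. The helper `split_slurm_args` and the example partition and account values are hypothetical; SeisFlows itself may pass the string through unchanged.

```python
import shlex


def split_slurm_args(slurm_args):
    """Split a '--key1=value1 --key2=value2' string into a dict.

    Hypothetical helper for illustration only; not part of SeisFlows.
    """
    pairs = {}
    for token in shlex.split(slurm_args):
        if not token.startswith("--") or "=" not in token:
            raise ValueError(f"expected '--key=value', got {token!r}")
        # Strip the leading '--' and split on the first '=' only
        key, value = token[2:].split("=", 1)
        pairs[key] = value
    return pairs


# Example values are made up
print(split_slurm_args("--partition=debug --account=proj01"))
```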

Paths

***

property nodes: Defines the number of nodes which is derived from system node size

property node_size: Defines the node size of a given cluster partition. This is a hard set number defined by the system architecture

property submit_call_header

The submit call defines the SBATCH header which is used to submit a workflow task list to the system. It is usually dictated by the system’s required parameters, such as account names and partitions. Submit calls are modified and called by the submit function.

Return type: str
Returns: the system-dependent portion of a submit call

property run_call_header

The run call defines the SBATCH header which is used to run tasks during an executing workflow. Like the submit call, its arguments are dictated by the given system. Run calls are modified and called by the run function.

Return type: str
Returns: the system-dependent portion of a run call

__doc__

check(): Checks parameters and paths

static _stdout_to_job_id(stdout)

The stdout message after an SBATCH job is submitted, from which we get the job number, differs between systems, so this function allows the parsing to vary.

Note

Examples:

1) standard example: Submitted batch job 4738244
2) (1) with ‘--parsable’ flag: 4738244
3) federated cluster: Submitted batch job 4738244; Maui
4) (3) with ‘--parsable’ flag: 4738244; Maui

This function deals with cases (2) and (4). Other systems that have more complicated stdout messages will need to override this function.

Parameters: stdout (str) – standard SBATCH response after submitting a job with the ‘--parsable’ flag
Return type: str
Returns: a matching job ID. We convert str->int->str to ensure that the job id is an integer value (which it must be)
Raises: SystemExit – if the job id does not evaluate as an integer
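The parsing of cases (2) and (4) can be sketched as below. This is an assumption-laden illustration rather than the actual method body: the function name `stdout_to_job_id` is hypothetical, and it simply keeps the text before the first semicolon and checks that it is an integer.

```python
def stdout_to_job_id(stdout):
    """Extract the job ID from '--parsable' SBATCH output.

    Hypothetical stand-in for illustration; not the SeisFlows source.
    """
    # Keep the text before the first ';' (drops a federated cluster name)
    candidate = stdout.strip().split(";")[0].strip()
    try:
        # str -> int -> str confirms the job ID is an integer value
        return str(int(candidate))
    except ValueError:
        raise SystemExit(f"could not parse job id from: {stdout!r}")


print(stdout_to_job_id("4738244"))          # 4738244
print(stdout_to_job_id("4738244; Maui\n"))  # 4738244
```

Output without the ‘--parsable’ flag, such as "Submitted batch job 4738244", does not evaluate as an integer in this sketch and therefore exits.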

run(funcs, single=False, **kwargs)

Runs a task multiple times in embarrassingly parallel fashion on a SLURM cluster. Executes the list of functions (funcs) NTASK times with each task occupying NPROC cores.

Note

Completely overrides the Cluster.run() command

Parameters

funcs (list of methods) – a list of functions that should be run in order. All kwargs passed to run() will be passed into the functions.
single (bool) – run a single-process, non-parallel task, such as smoothing the gradient, which only needs to be run once. This will change how the job array and the number of tasks are defined, such that the job is submitted as a single-core job to the system.

seisflows.system.slurm.check_job_status_array(job_id)

Repeatedly check the status of a currently running job using ‘sacct’. If the job goes into a bad state like ‘FAILED’, log the failing jobs’ IDs and their states. If all jobs complete nominally, return.

Note

The time.sleep() is critical before querying job status because the system will likely take a second to initiate jobs. If we call query_job_states before this has happened, it will return empty lists and cause the function to error out.

Parameters: job_id (str) – main job id to query, returned from the subprocess.run that ran the jobs
Return type: int
Returns: status of all running jobs. 1 for pass (all jobs COMPLETED). -1 for fail (one or more jobs returned failing status)

Raises: FileNotFoundError – if ‘sacct’ does not return any output for ~1 min.
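The pass/fail decision described above can be sketched as a single evaluation step over the states returned by one status query. The helper `evaluate_job_states` is hypothetical, and returning `None` for "keep waiting" is an assumption of this sketch; `BAD_STATES` is copied from the module attribute listed earlier.

```python
BAD_STATES = ["TIMEOUT", "FAILED", "NODE_FAIL", "OUT_OF_MEMORY", "CANCELLED"]


def evaluate_job_states(states):
    """Return 1 if all jobs COMPLETED, -1 if any job is in a bad state,
    and None if jobs are still pending or running.

    Hypothetical helper for illustration only; not part of SeisFlows.
    """
    if any(state in BAD_STATES for state in states):
        return -1
    # An empty list means nothing was reported yet, so keep waiting
    if states and all(state == "COMPLETED" for state in states):
        return 1
    return None


print(evaluate_job_states(["COMPLETED", "COMPLETED"]))  # 1
print(evaluate_job_states(["COMPLETED", "FAILED"]))     # -1
print(evaluate_job_states(["PENDING", "COMPLETED"]))    # None
```

A polling loop would call a check like this after each status query and sleep between calls until it gets back 1 or -1.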

seisflows.system.slurm.check_job_status_list(job_ids)

Check the status of a list of currently running jobs. This is used for systems that cannot submit array jobs (e.g., Frontera) where we instead submit jobs one by one and have to check the status of all those jobs together.

Parameters: job_ids – job IDs to query with SACCT. They are treated as one group of jobs, all of which must finish successfully; otherwise the entire group is considered failed
Return type: int
Returns: status of all running jobs. 1 for pass (all jobs COMPLETED). -1 for fail (one or more jobs returned failing status)
Raises: FileNotFoundError – if ‘sacct’ does not return any output for ~1 min.

seisflows.system.slurm.query_job_states(job_id, _recheck=0)

Queries completion status of an array job by running the SLURM cmd sacct. Available job states are listed here: https://slurm.schedmd.com/sacct.html

Note

The actual command line call will look something like this:

$ sacct -nLX -o jobid,state -j 441630
441630_0 PENDING
441630_1 COMPLETED

Note

SACCT flag options are described as follows:

-L: queries all available clusters, not just the cluster that ran the sacct call. Used for federated clusters
-X: suppresses the .batch and .extern job names that are normally returned but don’t represent the actual running job

Parameters

job_id (str) – main job id to query, returned from the subprocess.run that ran the jobs
_recheck (int) – Used for recursive calling of the function. It can take time for jobs to be initiated on a system, which may cause the stdout of the ‘sacct’ command to be empty. In this case we wait and call the function again. Rechecks prevent endless loops by imposing a stopping criterion

Raises

FileNotFoundError – if ‘sacct’ does not return any output for ~1 min.
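As an illustration of how the two-column ‘sacct’ output shown above could be turned into job IDs and states, here is a minimal parsing sketch. The helper `parse_sacct_output` is hypothetical and works on an already-captured stdout string; it does not call ‘sacct’ and does not reproduce the recheck logic.

```python
def parse_sacct_output(stdout):
    """Split 'jobid state' lines into two parallel lists.

    Hypothetical helper for illustration only; not part of SeisFlows.
    """
    job_ids, states = [], []
    for line in stdout.strip().splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        job_ids.append(fields[0])
        states.append(fields[1])
    return job_ids, states


example_stdout = "441630_0 PENDING\n441630_1 COMPLETED\n"
print(parse_sacct_output(example_stdout))
# (['441630_0', '441630_1'], ['PENDING', 'COMPLETED'])
```

Empty stdout yields two empty lists here, which is the situation the recheck mechanism is meant to handle.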

seisflows.system.slurm.modify_run_call_single_proc(run_call)

Modifies a SLURM SBATCH command to use only 1 processor as a single run by replacing the --array and --ntasks options

Parameters: run_call (str) – The SBATCH command to modify
Return type: str
Returns: a modified SBATCH command that should only run on 1 processor
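One way such a replacement could work is sketched below with regular expressions. This is an assumption, not the SeisFlows source: the function name `to_single_proc`, the replacement values `--array=0-0` and `--ntasks=1`, and the example command are all hypothetical, and the sketch only handles options written in `--key=value` form.

```python
import re


def to_single_proc(run_call):
    """Rewrite --array and --ntasks so the call runs one task on one core.

    Hypothetical helper for illustration only; not part of SeisFlows.
    """
    # '=' directly after the option name avoids matching --ntasks-per-node
    run_call = re.sub(r"--array=\S+", "--array=0-0", run_call)
    run_call = re.sub(r"--ntasks=\S+", "--ntasks=1", run_call)
    return run_call


call = "sbatch --ntasks=40 --array=0-9%5 --time=00:10:00"
print(to_single_proc(call))
# sbatch --ntasks=1 --array=0-0 --time=00:10:00
```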
