Migrating Casper jobs from Slurm to PBS


Updated March 24: This documentation is intended to help you prepare Casper job scripts for use with PBS Pro if you have been using Slurm. Also see Starting Casper jobs with PBS, which was updated today.

The scheduler used for submitting jobs to run on Casper nodes is being transitioned from Slurm to the PBS Pro workload manager, which is used on the Cheyenne supercomputer. This is intended to simplify job scheduling for users.

Some of the Casper nodes in service before March 1, 2021, will remain available through Slurm until April 7, 2021, to allow users to modify their job scripts and prepare for the transition.

New high-throughput computing (HTC) nodes added to the Casper cluster in March are accessible only by using PBS Pro. The HTC nodes are for batch jobs that use up to one full node.

This documentation is provided to guide you through the transition to PBS Pro. Using PBS to submit Casper jobs will be familiar to anyone who has used it on Cheyenne, but some additional features have been added to carry capabilities over from Slurm.

Download the NCAR Slurm to PBS Quick Guide

Fundamentals

In PBS terminology, the Cheyenne and Casper clusters are separate PBS “servers.” Each server has its own queue names and queue structure. At present, jobs running on one server cannot be acted on from the other server with most PBS commands; for example, qdel cannot delete a job running on the other server.

Note: Because each server is distinct, use qsub to submit Casper jobs only from Casper login nodes. To submit Casper jobs from Cheyenne login nodes, use qsubcasper instead. To delete a Casper job from Cheyenne, use qdelcasper.

Submissions to Cheyenne from Casper are not possible at present.
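
For example (the script name and job IDs below are placeholders):

# From a Casper login node
qsub casper_job.pbs          # submit a Casper job
qdel <jobid>                 # delete a Casper job

# From a Cheyenne login node
qsubcasper casper_job.pbs    # submit a job to the Casper server
qdelcasper <jobid>           # delete a job on the Casper server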

Slurm users submit all jobs to the dav partition. PBS users will submit all Casper jobs to the casper queue. Those jobs will be charged against your Casper allocation, so you must have a valid project with available Casper compute hours to run jobs in this queue.

As with Slurm, but unlike PBS on Cheyenne, Linux control groups (cgroups) will enforce your CPU, memory, and V100 GPU requests. When your job starts, PBS will assign your job a block of CPU cores and pool of memory equal to the amount you request, and a number of V100 GPUs if specified; no other users will have access to those resources until your job is complete. This configuration provides greater (though not total) isolation of jobs on shared resources.


Requesting resources

As on Cheyenne, use a select statement to request one or more chunks of resources. The general format is as follows for a batch job:

#PBS -l select=<num_chunks>:<chunk_specification>+<num_chunks>:<chunk_specification>...

For exclusive jobs on Cheyenne, the number of chunks is equivalent to the number of nodes, but on Casper multiple chunks can be assigned to the same node if they fit. The chunk specification is where you request specific resources. Here is a table with the most common resource requests:

Resource   | Options   | Description                                                      | Default
ncpus      | 1-36      | Number of CPU cores to assign to the chunk (exclusive)           | 1
mpiprocs   | 1-72      | Number of MPI ranks to use in this chunk                         | 1
ompthreads | 1-72      | Number of parallel threads to use in this chunk                  | Same as ncpus
mem        | 1-1024 GB | Amount of memory (in GB or TB) to assign to the chunk            | 10 GB
ngpus      | 1-8       | Number of GPUs to assign to this chunk (exclusive if using V100) | 0
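
To illustrate how chunks are combined with the + syntax, here is a hedged example; the resource values are chosen only for demonstration and should be adjusted to fit actual node limits. It requests one 36-core chunk with 300 GB of memory plus a second 4-core chunk with one V100 GPU and 100 GB of memory:

#PBS -l select=1:ncpus=36:mpiprocs=36:mem=300GB+1:ncpus=4:ngpus=1:mem=100GB
#PBS -l gpu_type=v100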

In addition to the select statement, you can restrict the GPU and CPU types by specifying separate resource flags. Here is an example of each type:

#PBS -l gpu_type=<GPU>
#PBS -l cpu_type=<CPU>

For example, suppose you want to request a single chunk with 16 CPUs, each running a single thread, along with 2 NVIDIA V100 GPUs and 80 GB of RAM. Here is what the batch directives would look like:

#PBS -l select=1:ncpus=16:mem=80GB:ngpus=2
#PBS -l gpu_type=v100

The 80 GB of memory refers to node (host) memory, not GPU memory. If you request a GPU, you have exclusive use of its onboard memory; there is no way to request or restrict access to GPU VRAM separately.

As with Slurm, it is possible to submit requests that are incompatible with each other. For example, no Casper nodes have more than one gp100 GPU. So if you request ngpus=2 and gpu_type=gp100, your job will not be scheduled.


Batch job example

Let’s say you want to convert the Slurm script shown below into a PBS job for running on Casper. The job uses 8 tasks spread across two nodes with four tasks per node. The script reserves 20 GB of memory on each node, then runs an MPI job on the 8 tasks.

When the job script is translated to PBS, the format of the resource requests is different but otherwise the general process is the same. The biggest difference is found when calling the MPI runtime. With Slurm, the MPI libraries are built with scheduler integration, and srun launches the executable. This also allows for resource subsetting via sub-jobs. PBS does not have a similar concept, so you use the native launcher to start the MPI runtime (in this case Open MPI).


Slurm script

#!/bin/bash -l
#SBATCH --job-name=mpi_job
#SBATCH --account=project_code
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00
#SBATCH --partition=dav
#SBATCH --mem=20G

export TMPDIR=/glade/scratch/$USER/temp
mkdir -p $TMPDIR

### Run program
srun ./executable_name

PBS Pro script

#!/bin/bash -l
#PBS -N mpi_job
#PBS -A project_code
#PBS -l select=2:ncpus=4:mpiprocs=4:mem=20GB
#PBS -l place=scatter
#PBS -l walltime=00:10:00
#PBS -q casper

export TMPDIR=/glade/scratch/$USER/temp
mkdir -p $TMPDIR

### Run program 
mpirun ./executable_name

In the PBS example, -l place=scatter forces two chunks of size 4-ncpus to run on separate physical nodes. This directive replicates the effect of the --ntasks-per-node limitation set with Slurm. While illustrative for this comparison, we generally recommend not setting place=scatter, as your jobs will often be dispatched more quickly without that restriction.


Interactive job example

There is a bigger difference between PBS and Slurm for interactive jobs. In Slurm, allocating resources can be considered a separate step from assigning resources to a task. For an interactive job, you could run salloc to request certain compute resources and, once provisioned, use srun to assign some or all of those resources to a task. The execdav command simplified this process by combining both steps, and the default behavior was to assign all allocated resources to a shell (e.g., bash or tcsh).

PBS does not distinguish between those steps, so all requested resources are assigned to the interactive shell and are available to all processes launched from within that shell. This is easier to use in most cases, but it does add complexity if you want to run multiple MPI and/or GPU tasks concurrently using subsets of the allocated resources. See manuals for the specific MPI or GPU libraries for those tasks.
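
As a minimal sketch, a native PBS interactive job can be requested with qsub -I (the project code and resource values here are placeholders); the execcasper wrapper described below is usually more convenient:

qsub -I -q casper -A PROJ0001 -l select=1:ncpus=4:mem=20GB -l walltime=01:00:00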

execdav / execcasper

The execdav command accepts Slurm options as command-line arguments. A new command called execcasper accepts PBS (qsub) arguments. The default behavior will be the same: You will be assigned 1 core for 6 hours and receive a default amount of memory (around 10 GB) if no other specification is given.

# Slurm example
execdav -A PROJ0001 -n 4 --mem=20G --gres=gpu:v100:4
# PBS example
execcasper -A PROJ0001 -l select=1:ncpus=4:mem=20GB:ngpus=4 -l gpu_type=v100
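
Because execcasper accepts qsub arguments, you can also pass additional resource options. For example, the following sketch (with placeholder values) assumes the wrapper forwards a walltime request to qsub:

execcasper -A PROJ0001 -l walltime=02:00:00 -l select=1:ncpus=1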

The execdav command will be disabled on April 7.


Migrating MPI programs

Some MPI libraries, such as Open MPI, rely on scheduler integration to determine where to place tasks on nodes and CPUs. Because of this, Open MPI installations through version 4.0.5 should be used in Slurm jobs only. As you transition your workflows to PBS, you will need to switch to openmpi/4.1.0, which has been compiled with PBS support, and recompile your application executables and any libraries you maintain. If you mix Open MPI versions (for example, by running an executable built against a Slurm-integrated version under the PBS-enabled version), you will see either program crashes or very poor performance.
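
A minimal sketch of switching to the PBS-enabled build and recompiling might look like the following; the compiler module and source file names are hypothetical, so substitute whatever your application actually uses:

module load intel                      # hypothetical compiler module; load the compiler you normally use
module load openmpi/4.1.0              # PBS-enabled Open MPI build
mpif90 -o executable_name model.f90    # recompile against the PBS-enabled library (model.f90 is a placeholder)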

Intel MPI uses runtime settings to determine which scheduler to use, and we configure these settings for you in each impi module, so you should not need to make any adjustments or recompile Intel MPI applications.

In addition to Intel MPI and Open MPI v4.1.0, we are also providing a new installation of MVAPICH2 for use in PBS jobs; it should not be integrated into any Slurm workflows.

During the transition, nodes configured for PBS will show only the PBS-enabled MPI modules, and Slurm nodes will show only the Slurm-enabled modules. You will see both sets of modules on the login nodes, however, so be mindful of which module you need to use. Once all resources have been migrated to PBS, the old Slurm MPI installations will be removed from the system.

MPI library | Slurm-enabled versions      | PBS-enabled versions
Open MPI    | 3.0.x, 3.1.x, 4.0.3, 4.0.5  | 4.1.0
Intel MPI   | All versions                | All versions
MVAPICH2    | None                        | 2.3.5

 


Slurm commands and PBS equivalents

The submission and query commands have obvious analogs between Slurm and PBS. Use qsub instead of sbatch/salloc/srun to submit jobs (or qsubcasper as noted above). Use qstat instead of squeue to query running and pending jobs. Note that we cache some qstat output to reduce load on the scheduler, so there can be a small (10 seconds or less) delay in the output. You can obtain output similar to scontrol show job using qstat -f, though we request that you use this formulation of qstat sparingly since it is a computationally expensive command.
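
For example (the job ID is a placeholder):

qstat -u $USER      # list your running and pending jobs
qstat -f <jobid>    # detailed, scontrol-like output; use sparingly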

There are other Slurm commands with no clear analogs in PBS. Most notably, users could query information about past jobs in Slurm using sacct. CISL maintains a utility called qhist that can provide much of the same information from PBS (on both Cheyenne and Casper).

Slurm also has a command, sprio, that provides information about your job’s priority among all active jobs. There is no PBS equivalent for this command.


Q&A

How can I submit a job with a success dependency on another job? Can I create dependencies between Cheyenne and Casper jobs?

In general, dependencies work in the same fashion as they do on Cheyenne. See this example. Dependencies between Cheyenne and Casper jobs are currently not supported, though this is a capability we are looking to add in the future.
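
A minimal sketch of a success (afterok) dependency between two Casper jobs, using standard PBS syntax (the script names are placeholders):

JID=$(qsub first_job.pbs)                      # qsub prints the job ID of the first job
qsub -W depend=afterok:$JID second_job.pbs     # second job runs only if the first completes successfully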

What about JupyterHub, FastX, and TurboVNC?

The JupyterHub job spawner will continue to use Slurm until the existing Casper nodes are transitioned to PBS (between March 24 and April 7). Once that transition is complete, JupyterHub will use PBS syntax for both Cheyenne and Casper jobs. Use GPU flags only when creating a Casper server.

FastX sessions will continue operating as documented. The vncmgr script for starting VNC sessions will be migrated from Slurm to PBS on March 24.

I use Dask-jobqueue to submit to Slurm. How should I migrate this workflow?

The Dask-jobqueue project has support for both PBS and Slurm. Please see the official documentation for guidance on converting SLURMCluster usage to PBSCluster. Using the correct queue in your cluster object will ensure that the workers are launched on the correct system.

Is there any quick conversion guide between Slurm and PBS commands and directives?

A comparison of frequently used commands and syntax is available here.

Is there a utility that will convert my Slurm scripts into PBS scripts?

We do not offer a utility to auto-convert Slurm scripts to PBS scripts. While such a utility might provide some convenience, the breadth of use cases submitted to the scheduler means automated conversion would frequently produce errors, and it is impractical to account for the many edge cases.