Starting Geyser and Caldera jobs from Cheyenne

Interactive jobs | Batch jobs | Compiling your code

Cheyenne HPC system users run jobs on the Geyser and Caldera clusters by submitting them with the open-source Slurm Workload Manager.

Procedures for starting both interactive jobs and batch jobs are described below. 

Begin by logging in on Cheyenne.

Compiling code

You will need to compile your code on Geyser or Caldera to run it on these nodes*. See Compiling your code below.


Interactive jobs

Using execdav

Run the execdav script/command to start an interactive job. Invoking it without an argument will start an interactive shell on the first available Geyser/Caldera data analysis and visualization (DAV) node. The default wall-clock time is 6 hours.

The execdav command has these optional arguments:

  • -a project_code (defaults to value of DAV_PROJECT)
  • -t time (minutes:seconds or hours:minutes:seconds; defaults to 6 hours)
  • -n number_of_cores (defaults to 1 core: -n 1)
  • -m nG (amount of memory to use on the node, from 1 to 900 gigabytes; for example, -m 300G. If you do not specify memory, the default is 1.87G per core that you request.)
  • -C constraint (options include sandybridge, pronghorn, westmere, geyser, caldera, gpu, k20, k5000, and x11). Example: -C gpu

To specify which project code to charge for your CPU time, set the DAV_PROJECT environment variable before invoking execdav. For example: DAV_PROJECT=UABC0001.
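
For example, a session on a Geyser node with four cores, 100 GB of memory, and a two-hour wall-clock limit could be requested as shown below (the project code and resource values are illustrative):

export DAV_PROJECT=UABC0001      # bash; in tcsh, use: setenv DAV_PROJECT UABC0001
execdav -t 2:00:00 -n 4 -m 100G -C geyser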

* * *

Using execgy and execca

The execgy and execca commands execute scripts that start interactive sessions on Geyser and Caldera respectively. A session started with one of these commands uses a single core and has a wall-clock time of 6 hours. Use execdav (see above) if you want to specify different resource needs.

Example with output

cheyenne6:~> execgy
mem =
amount of memory is default
Submitting interactive job to slurm using account SCSG0001 ...
submit cmd is
salloc  -C geyser   -N 1  -n 1 -t 6:00:00 -p dav --account=SCSG0001 srun --pty  ... (shortened for space)
salloc: Pending job allocation 132885
salloc: job 132885 queued and waiting for resources
salloc: job 132885 has been allocated resources
salloc: Granted job allocation 132885
salloc: Waiting for resource configuration
salloc: Nodes geyser10 are ready for job
username@geyser10:~>

To end the session, run exit.

Run execgy -help or execca -help for additional information.

* * *

Using exechpss

The exechpss command is used to initiate HSI and HTAR file transfers. See examples in Managing files with HSI and Using HTAR to transfer files.

* * *

See https://slurm.schedmd.com/documentation.html for in-depth Slurm documentation.


Batch jobs

Prepare a batch script by following one of the examples below. The system does not import your Cheyenne environment, so be sure your script loads the software modules that you will need to run the job.

Basic Slurm commands

When your script is ready, run sbatch to submit the job.

sbatch script_name

To check on your job's progress, run squeue.

squeue -u $USER

To get a detailed status report, run scontrol show job followed by the job number.

scontrol show job nnn

To kill a job, run scancel with the job number.

scancel nnn
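
Put together, a typical session might look like this (the script name mpi_job.sh and the job ID are illustrative):

cheyenne6:~> sbatch mpi_job.sh
Submitted batch job 132901
cheyenne6:~> squeue -u $USER
cheyenne6:~> scontrol show job 132901
cheyenne6:~> scancel 132901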

-C option to specify node type

Many user jobs can run on any DAV node, and such jobs spend less time waiting in the queue when no node type is specified. If you do need to specify a node type for your job, use the -C option to set that constraint by including one of the following lines in your script:

#SBATCH -C geyser
#SBATCH -C caldera
#SBATCH -C pronghorn

You can also combine constraints to specify that the job can run on either caldera or pronghorn. Pronghorn nodes do not have GPUs but are otherwise equivalent to caldera nodes.

#SBATCH -C caldera|pronghorn
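
You can also supply the constraint on the sbatch command line instead of editing the script; Slurm options given on the command line override the corresponding #SBATCH directives. For example:

sbatch -C caldera script_name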

In general, minimize resource constraints when possible to decrease the length of time your job waits in the queue.

Wall-clock

The wall-clock limit on these clusters is 24 hours.

Specify the wall-clock time your job needs as in the examples below, which use the hours:minutes:seconds format; it can be shortened to minutes:seconds.
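
For example, the first directive below requests 12 hours and the second requests 30 minutes; use only one -t directive per job:

#SBATCH -t 12:00:00
#SBATCH -t 30:00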

Script examples

The examples below show how to create a script for running an MPI job. See the related DAV documentation pages for other examples.

Earlier script examples included a source command for initializing the Slurm environment. That command is no longer needed and should be removed from scripts used in the CentOS 7 environment. The first line of the bash scripts for Slurm jobs has also been revised to include the -l option.

For tcsh users

Insert your own project code where indicated and customize other settings as needed for your own job. Do not use the shell's -f option in your first line; it prevents the environment from initializing properly.

#!/bin/tcsh
#SBATCH -J job_name
#SBATCH -n 8
#SBATCH --ntasks-per-node=4
#SBATCH --mem=8G
#SBATCH -t 00:60:00
#SBATCH -A project_code
#SBATCH -p dav
#SBATCH -e job_name.err.%J
#SBATCH -o job_name.out.%J

setenv TMPDIR /glade/scratch/$USER/temp
mkdir -p $TMPDIR

module purge
module load gnu/7.3.0 ncarenv ncarcompilers
module load openmpi

srun ./mpihello

For bash users

Insert your own project code where indicated and customize other settings as needed for your own job.

#!/bin/bash -l
#SBATCH -J job_name
#SBATCH -n 8
#SBATCH --ntasks-per-node=4
#SBATCH --mem=8G
#SBATCH -t 00:60:00
#SBATCH -A project_code
#SBATCH -p dav
#SBATCH -e job_name.err.%J
#SBATCH -o job_name.out.%J

export TMPDIR=/glade/scratch/$USER/temp
mkdir -p $TMPDIR

module purge
module load gnu/7.3.0 ncarenv ncarcompilers
module load openmpi

srun ./mpihello

Compiling your code

You will need to compile your code on Geyser or Caldera to run it on these nodes*.

CISL recommends using the default Intel or GNU compilers for parallel programs. Once you are on a Geyser or Caldera node, load the GNU or Intel compiler and then the openmpi module if you plan to use MPI. Then compile your code as you usually do. 
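
A minimal sketch of the MPI case is shown below; the source file name mpihello.c is illustrative, and the module versions available on the node may differ from those shown:

module purge
module load gnu/7.3.0 ncarenv ncarcompilers
module load openmpi
mpicc -o mpihello mpihello.c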

Serial programs can use any compiler.

 

* Some Caldera nodes use the hostname "pronghorn." Compiling on caldera and pronghorn hosts will generate equivalent executables.