Starting Geyser and Caldera jobs from Cheyenne

Interactive jobs | Batch jobs | Compiling your code

Updated 12/15/2017: The command module load slurm was removed from the example scripts because the module is now available by default and does not need to be loaded. 

Cheyenne HPC system users run jobs on the Geyser and Caldera clusters by submitting them with the open-source Slurm Workload Manager.

Procedures for starting both interactive jobs and batch jobs are described below. 

Begin by logging in on Cheyenne.

Compiling code

You will need to compile your code on Geyser or Caldera to run it on these nodes*. See Compiling your code below.


Interactive jobs

Using execgy and execca

The execgy and execca commands execute scripts that start interactive sessions on Geyser and Caldera, respectively. A session started with one of these commands uses a single core and has a wall-clock limit of 6 hours. Use execdav (see below) if you need to specify different resources.

Example with output

cheyenne6:~> execgy
mem =
amount of memory is default
Submitting interactive job to slurm using account SCSG0001 ...
submit cmd is
salloc --constraint=geyser -n 1 -t 30:00 -p dav --account=SCSG0001 srun --pty env -i HOME=/glade/u/home/username PATH=/bin:/usr/bin TERM=xterm SHELL=/bin/bash /bin/bash -l
salloc: Granted job allocation 884
salloc: Waiting for resource configuration
salloc: Nodes geyser13 are ready for job
username@geyser13:~>

To end the session, run exit.

Run execgy -help or execca -help for additional information.

* * *

Using execdav

Run the execdav command to start an interactive job. Invoking it without arguments starts an interactive shell on the first available Geyser/Caldera data analysis and visualization (DAV) node*. The default wall-clock time is 6 hours.

The execdav command has these optional arguments:

  • -a project_code (defaults to the value of DAV_PROJECT)
  • -n number of cores (defaults to 1 core)
  • -t time (minutes:seconds or hours:minutes:seconds; defaults to 6 hours)
  • -m nG (n is the amount of memory needed, from 1 to 900 GB; defaults to 1.8G per core requested)
  • -g gpu_type (gpu_type can be k20, k5000, any, or none; defaults to none)

To specify which project code to charge for your CPU time, set the DAV_PROJECT environment variable before invoking execdav. For example, DAV_PROJECT=UABC0001.
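As an illustrative sketch, a bash user might set the variable and then request four cores, two hours of wall-clock time, and 16 GB of memory. The project code UABC0001 is hypothetical, and the other values should be replaced with your own.

export DAV_PROJECT=UABC0001
execdav -n 4 -t 2:00:00 -m 16G

Tcsh users would set the variable with setenv DAV_PROJECT UABC0001 instead.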

* * *

Using exechpss

The exechpss command is used to initiate HSI and HTAR file transfers. See examples in Managing files with HSI and Using HTAR to transfer files.

* * *

See https://slurm.schedmd.com/documentation.html for in-depth Slurm documentation.


Batch jobs

Prepare a batch script by following one of the examples below. The system does not import your Cheyenne environment, so be sure your script loads the software modules that you will need to run the job.

The slurm module is loaded by default on Cheyenne so that you can run Slurm commands. If you use a customized environment and will be submitting Geyser or Caldera jobs, make sure the slurm module is included in it.
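For example, if you customize your environment with a saved module collection (an assumption; adjust to however you manage your environment), a minimal sketch of keeping the slurm module in it might look like this:

module load slurm
module save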

Basic Slurm commands

When your script is ready, run sbatch to submit the job.

sbatch script_name

To check on your job's progress, run squeue.

squeue -u username

To get a detailed status report, run scontrol show job followed by the job number.

scontrol show job nnn

To kill a job, run scancel with the job number.

scancel nnn

-C option to specify node type

The example batch scripts below can be customized further with the -C option to set constraints on which DAV nodes the job can use. 

To restrict the job to a specific type of node, include one of these lines:

#SBATCH -C geyser
#SBATCH -C caldera
#SBATCH -C pronghorn

You can also combine constraints to specify that the job can run on either Caldera or Pronghorn nodes. Pronghorn nodes do not have GPUs but are otherwise equivalent to Caldera nodes.

#SBATCH -C caldera|pronghorn

In general, minimize resource constraints when possible to decrease the length of time your job waits in the queue.
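As a sketch of where the option fits, the combined constraint can be added to the directive block of either script example below; every other directive shown here is taken from those examples.

#SBATCH -J job_name
#SBATCH -n 8
#SBATCH -t 60:00
#SBATCH -A project_code
#SBATCH -p dav
#SBATCH -C caldera|pronghorn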

Script examples

These examples show how to create a script for running an MPI job.

tcsh script

Insert your own project code where indicated and customize other settings as needed for your own job.

#!/bin/tcsh
#SBATCH -J job_name
#SBATCH -n 8
#SBATCH --ntasks-per-node=4
#SBATCH -t 60:00
#SBATCH -A project_code
#SBATCH -p dav
#SBATCH -e job_name.err.%J
#SBATCH -o job_name.out.%J

### Initialize the Slurm environment
source /glade/u/apps/opt/slurm_init/init.csh

### Set TMPDIR to a directory in your own scratch space
mkdir -p /glade/scratch/username/temp
setenv TMPDIR /glade/scratch/username/temp

### Load the modules the job needs
module purge
module load gnu/6.1.0 ncarenv ncarbinlibs ncarcompilers
module load openmpi-slurm

### Run the MPI executable
mpiexec ./mpihello

bash script

Insert your own project code where indicated and customize other settings as needed for your own job.

#!/bin/bash
#SBATCH -J job_name
#SBATCH -n 8
#SBATCH --ntasks-per-node=4
#SBATCH -t 60:00
#SBATCH -A project_code
#SBATCH -p dav
#SBATCH -e job_name.err.%J
#SBATCH -o job_name.out.%J

### Initialize the Slurm environment
source /glade/u/apps/opt/slurm_init/init.sh

### Set TMPDIR to a directory in your own scratch space
mkdir -p /glade/scratch/username/temp
export TMPDIR=/glade/scratch/username/temp

### Load the modules the job needs
module purge
module load gnu/6.1.0 ncarenv ncarbinlibs ncarcompilers
module load openmpi-slurm

### Run the MPI executable
mpiexec ./mpihello

Compiling your code

You will need to compile your code on Geyser or Caldera to run it on these nodes*.

CISL recommends using intel/16.0.3 or later, or gnu/6.1.0 or later, for parallel programs. Once you are on a Geyser or Caldera node, load the GNU or Intel compiler and then the openmpi-slurm module if you plan to use MPI. Then compile your code as you usually do.
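For example, building the mpihello program used in the batch scripts above with the GNU compiler might look like the following sketch; the source file name mpihello.c is hypothetical and stands in for your own code.

module purge
module load gnu/6.1.0 ncarenv ncarbinlibs ncarcompilers
module load openmpi-slurm
mpicc -o mpihello mpihello.c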

Serial programs can use any compiler.

 

* Some Caldera nodes use the hostname "pronghorn." Compiling on caldera and pronghorn hosts will generate equivalent executables.