Users schedule their jobs to run on the Yellowstone, Geyser, and Caldera clusters by submitting them through Platform LSF.
Most production computing jobs run in batch queues on the 1.5-petaflops Yellowstone high-performance computing (HPC) system. Shared-node batch jobs and some exclusive-use batch jobs also may run on the Geyser and Caldera clusters. Interactive queues are available on both Geyser and Caldera.
Users can run short, non-memory-intensive processes interactively on the system's login nodes. These include tasks such as text editing or running small serial scripts or programs.
You can compile models and programs. The number of simultaneously executing compilation process threads may not exceed eight (8). Typically this is controlled by an argument following the “-j” option for your GNU make command.
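As a minimal sketch of staying within the eight-thread limit, the helper below caps a requested job count before it is passed to make. The name capped_jobs is hypothetical and is not a site-provided command.

```shell
# Cap the number of parallel make jobs at the login-node limit of eight.
# capped_jobs is a hypothetical helper written for this example.
capped_jobs() {
    _requested=$1
    _limit=8
    if [ "$_requested" -gt "$_limit" ]; then
        echo "$_limit"
    else
        echo "$_requested"
    fi
}

# Example use on a login node (left as a comment so the snippet has no side effects):
#   make -j "$(capped_jobs 32)"
```

With a request of 32, the helper returns 8, so the make command runs with -j 8.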
All tasks that you run on login nodes are run “at risk.” If any task or multiple concurrent tasks being run by an individual user consumes excessive resources, the task or tasks will be killed and you will be notified.
Do not run programs or models that consume excessive amounts of CPU time, more than a few GB of memory, or excessive I/O resources. Instead, use the Yellowstone batch nodes or the shared nodes on the Geyser and Caldera clusters. Many tasks can be performed easily on Geyser and Caldera by using the execgy and execca scripts.
Select the most appropriate queue for each job and provide accurate wall-clock times in your job script. This will help us fit your job into the earliest possible run opportunity.
Check for backfill windows; you may be able to adjust your wall-clock estimate and have your job fit an available window.
Note the system's usable memory and configure your job script to maximize performance.
Parallel jobs usually run most efficiently on the Yellowstone supercomputing cluster if they use all 16 cores on each node, as specified by the #BSUB -R "span[ptile=16]" job directive that is used in the sample batch script below.
If job memory or configuration requirements prevent you from using all 16 cores, specify the number of cores best suited to the job's efficiency. See Checking memory use to determine your program’s memory requirements and do some test runs with ptile settings ranging from 1 to 16.
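When planning test runs at different ptile settings, it can help to see how many nodes each setting requires for a fixed number of tasks. The sketch below does that arithmetic; nodes_needed is a hypothetical helper, and the 64-task job is only an example.

```shell
# Number of nodes a job occupies: tasks divided by tasks-per-node, rounded up.
# nodes_needed is a hypothetical helper written for this example.
nodes_needed() {
    _tasks=$1
    _ptile=$2
    echo $(( (_tasks + _ptile - 1) / _ptile ))
}

# Show the node count for a 64-task job at several ptile settings.
for p in 16 8 4; do
    echo "ptile=$p -> $(nodes_needed 64 "$p") nodes"
done
```

Lowering ptile from 16 to 8 doubles the number of nodes the same job occupies, which is worth weighing against any memory or performance gain.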
Some jobs may benefit from hyper-threading, where you would run from 17 to as many as 32 tasks per node. See Hyper-threading on Yellowstone for sample scripts that show how to set environment variables that you will need.
Serial batch jobs, which use only one core of a node, are best run on the Geyser cluster’s shared nodes rather than Yellowstone’s exclusive nodes, where you are charged for the full use of a node while using only one of its 16 cores.
To run a single, serial batch job on one Geyser core, include these job directives in your job script in addition to others that your job requires:
#BSUB -n 1
#BSUB -R "span[ptile=1]"
#BSUB -q geyser
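For context, here is one way those three directives might sit in a complete serial job script. This is a sketch only: the project code, the 10-minute wall-clock limit, the job and file names, and ./my_serial_program are all placeholders.

```shell
# Write a minimal serial job script for Geyser to a file.
# project_code, the 00:10 limit, serial_job, and ./my_serial_program are placeholders.
cat > serial_job.lsf <<'EOF'
#!/bin/bash
#BSUB -P project_code          # project code
#BSUB -W 00:10                 # wall-clock time (hrs:mins)
#BSUB -n 1                     # one task
#BSUB -R "span[ptile=1]"       # one task per node
#BSUB -q geyser                # Geyser queue
#BSUB -J serial_job            # job name
#BSUB -o serial_job.%J.out     # output file; %J is replaced by the job ID
#BSUB -e serial_job.%J.err     # error file; %J is replaced by the job ID

./my_serial_program
EOF

# Submit it from a login node with:
#   bsub < serial_job.lsf
```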
However, if your workflow requires running numerous independent serial jobs—tens or even hundreds—you may benefit by running those jobs in parallel on Yellowstone. See Using command files in batch jobs for a discussion of running independent jobs in parallel. Example 1 on that page includes a sample script.
To submit a simple MPI batch job, follow the instructions below.
To start an interactive job on Yellowstone, follow the example here.
To start an interactive job on Geyser or Caldera, see Running interactive applications.
See Platform LSF examples for additional sample scripts.
To submit a batch job, use the command bsub with the redirect sign (<) and the name of your batch script file.
bsub < script_name
We recommend passing the options to bsub in a batch script file rather than with numerous individual commands.
Include these options in your script:
Use the same name for your output and error files if you want the data stored in a single file rather than separately.
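For example, a fragment like the following should send both streams to a single file. The file name job_name.%J.out is a placeholder.

```shell
#BSUB -o job_name.%J.out    # output file; %J is replaced by the job ID
#BSUB -e job_name.%J.out    # same name as -o, so errors are stored in the same file
```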
Users sometimes need to execute module commands from within a batch job—to load an application, for example, or to load or remove other modules.
To ensure that the module commands are available, insert the following in your batch script if you need to include module commands.
In a tcsh script:
In a bash script:
Once that is included, you can add the module purge command if you need to and then load just the modules that are needed to establish the software environment that your job requires.
Here is a batch script example for a job that will use four nodes (16 MPI tasks per node) for six minutes on Yellowstone. Insert your own project code, job name and executable, and specify a queue.
#!/bin/tcsh
#
# LSF batch script to run an MPI application
#
#BSUB -P project_code        # project code
#BSUB -W 00:06               # wall-clock time (hrs:mins)
#BSUB -n 64                  # number of tasks in job
#BSUB -R "span[ptile=16]"    # run 16 MPI tasks per node
#BSUB -J job_name            # job name
#BSUB -o job_name.%J.out     # output file name in which %J is replaced by the job ID
#BSUB -e job_name.%J.err     # error file name in which %J is replaced by the job ID
#BSUB -q queue_name          # queue

#run the executable
mpirun.lsf ./myjob.exe
The bjobs command provides information on unfinished jobs. The following examples show some commonly used options and arguments. To learn more, log in to Yellowstone and refer to the man pages.
Run bjobs by itself for the status of your own jobs.
For information about your unfinished jobs in an individual queue, use -q and the queue name.
bjobs -q queue_name
To get information regarding unfinished jobs for a user group, add -u and the group name.
To list all unfinished jobs, use all as shown.
bjobs -u all
JOBID   USER      STAT  QUEUE    FROM_HOST    EXEC_HOST     JOB_NAME    SUBMIT_TIME
354029  ehaskell  RUN   regular  yslogin1-ib  32*ys0101-ib  job_113     May 30 10:03
354032  ehaskell  RUN   regular  yslogin1-ib  32*ys0104-ib  job_114     May 30 10:22
354038  jmathers  RUN   regular  yslogin5-ib  32*ys0118-ib  *p.cpp.exe  May 30 10:35
354039  jmathers  RUN   regular  yslogin5-ib  32*ys0119-ib  *p.cpp.exe  May 30 10:54
When large jobs are running, the output identifies each individual node that is in use. To suppress those lines, you can pipe the output through grep as follows.
bjobs -u all | grep -v "^ "
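To see what that filter does without access to LSF, the sketch below applies it to illustrative text. The indented lines stand in for the extra per-node lines of a multi-node job; their host names and the sample_output helper are made up for this example.

```shell
# Illustrative bjobs-style text. The indented lines imitate the additional
# per-node lines printed for a multi-node job (host names are invented).
sample_output() {
    cat <<'EOF'
JOBID   USER      STAT  QUEUE    FROM_HOST    EXEC_HOST     JOB_NAME  SUBMIT_TIME
354029  ehaskell  RUN   regular  yslogin1-ib  32*ys0101-ib  job_113   May 30 10:03
                                              32*ys0102-ib
                                              32*ys0103-ib
354032  ehaskell  RUN   regular  yslogin1-ib  32*ys0104-ib  job_114   May 30 10:22
EOF
}

# Drop every line that begins with a space, leaving the header and one line per job.
sample_output | grep -v "^ "
```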
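To choose which fields bjobs reports, use -o with a quoted list of field names and a delimiter.
bjobs -o "jobid user project queue stat submit_time mem delimiter=','" -u $USER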
192896,bbaggins,P8675309,geyser,RUN,Nov 13 09:40,79 Mbytes
192899,bbaggins,P8675309,geyser,RUN,Nov 13 09:40,61 Mbytes
192902,bbaggins,P8675309,geyser,RUN,Nov 13 09:41,51 Mbytes
To report the same fields for all users' jobs, use -u all.
bjobs -o "jobid user project queue stat submit_time mem delimiter=','" -u all
187483,ttritt,UABC0003,regular,RUN,Nov 12 22:56,2.6 Gbytes
187484,ttritt,UABC0003,regular,RUN,Nov 12 22:56,2.7 Gbytes
187486,ttritt,UABC0003,regular,RUN,Nov 12 22:56,2.4 Gbytes
187670,jdenver,P12345678,regular,RUN,Nov 12 23:46,15.8 Gbyte
187900,jccash,P87654321,regular,RUN,Nov 13 00:15,21 Gbytes
187902,jccash,P87654321,regular,RUN,Nov 13 00:15,21 Gbytes
187964,jccash,P87654321,regular,RUN,Nov 13 00:26,21 Gbytes
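Comma-delimited output like this is convenient to post-process. As a sketch, the awk filter below counts jobs per user, using the sample lines above as input; sample_csv and jobs_per_user are hypothetical helper names.

```shell
# Sample lines copied from the bjobs output shown above.
sample_csv() {
    cat <<'EOF'
187483,ttritt,UABC0003,regular,RUN,Nov 12 22:56,2.6 Gbytes
187484,ttritt,UABC0003,regular,RUN,Nov 12 22:56,2.7 Gbytes
187486,ttritt,UABC0003,regular,RUN,Nov 12 22:56,2.4 Gbytes
187670,jdenver,P12345678,regular,RUN,Nov 12 23:46,15.8 Gbyte
187900,jccash,P87654321,regular,RUN,Nov 13 00:15,21 Gbytes
187902,jccash,P87654321,regular,RUN,Nov 13 00:15,21 Gbytes
187964,jccash,P87654321,regular,RUN,Nov 13 00:26,21 Gbytes
EOF
}

# Count jobs per user (field 2 of the comma-delimited output), sorted by user name.
jobs_per_user() {
    awk -F',' '{ count[$2]++ } END { for (u in count) print u, count[u] }' | sort
}

sample_csv | jobs_per_user
```

On a login node, the same filter could be fed directly from the bjobs command instead of the sample text. If your output begins with a header line, filter it out first so it is not counted as a user.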
Use bhist with no arguments to get a report on the status of your running, pending, and suspended jobs.
Summary of time in seconds spent in various states:
JOBID   USER      JOB_NAME  PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
354029  jmathers  job_113   2     0      55   0      0      0      57
354032  jmathers  job_114   2     0      55   0      0      0      57
Use bhist and the Job ID if you want information about an individual job.
Use the following example to get a detailed report on a specified number of recent event files (in this case, 10) and save the output to a file.
bhist -a -l -n 10 > file.report
bpeek – Allows you to watch the error and output files of a running batch job. This is particularly useful for monitoring a long-running job; if the job isn't running as you expected, you may want to kill it to preserve computing time and resources.
bkill – Removes a queued job from LSF, or stops and removes a running job. Use it with the Job ID, which you can get from the output of bjobs.
tail – When used with the -f option to monitor a log file, this enables you to view the log as it changes. To use it in the Yellowstone environment, also disable inotify as shown in this example to ensure that your screen output gets updated properly.
tail ---disable-inotify -f /glade/scratch/username/filename.log
Run the bfill command before submitting your job to see if backfill windows are available. With that information, you may be able to adjust your job script wall-clock estimate and have your job start more quickly than it might otherwise.
The bfill command parses and reformats output from the native LSF bslots command.
For Yellowstone, as shown in the sample output below, bfill reports the backfill window's duration and how many nodes are available.
For Geyser and Caldera, where jobs most typically run on shared nodes, bfill indicates:
When system use is high, few large backfill windows are likely to be available on Yellowstone. Some large windows might become available as the system is drained prior to jobs launching in the capability queue or prior to system maintenance downtimes.
Use the bfill information as general guidance. The nodes and slots are NOT guaranteed to be available at job submission.
----- Current backfill windows -----
Yellowstone: 00:16:18 - 1 nodes
Yellowstone: Unlimited - 16 nodes
Geyser: Unlimited - 0 entire nodes, plus 1140 slots
Caldera: Unlimited - 12 entire nodes, plus 122 slots
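To fit a job into a timed window such as the first Yellowstone line, the wall-clock request must not exceed the window. The sketch below converts a window into an hrs:mins value for #BSUB -W by dropping the seconds. It assumes the window is printed as HH:MM:SS, as in the sample output; window_to_wallclock is a hypothetical helper, and it does not handle "Unlimited" windows.

```shell
# Convert an HH:MM:SS backfill window into an hrs:mins value for #BSUB -W.
# Seconds are dropped (rounding down) so the request never exceeds the window.
# window_to_wallclock is a hypothetical helper; "Unlimited" is not handled.
window_to_wallclock() {
    echo "$1" | awk -F':' '{ printf "%02d:%02d\n", $1, $2 }'
}

window_to_wallclock "00:16:18"
```

For the 00:16:18 window, this yields 00:16, which could then be used as the -W value in a job script. Because the windows are not guaranteed, treat the result as guidance rather than a promise that the job will start immediately.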