CISL Help Desk and Consulting Services
 

Running jobs

  • Queues and charging
  • Scheduling system
  • Submit, monitor, and delete jobs
  • Tips for performance

Bluefire queue names are the familiar ones used on our previous supercomputers, with equivalent queues for the regular memory (64 GB) and large memory (128 GB) nodes. There are 69 regular memory nodes and 48 large memory nodes on the full system.


Click here to learn about scheduling, queues, and charges in the new Yellowstone environment


The Bluefire queue structure is described here:

Queue Name                                   Queue Charging Factor  Run Limit  Memory Limit
capability (by special permission only)      1                      12 hours   64GB per node
debug (see usage guidelines below)           1                      6 hours    64GB per node
dedicated                                    1                      6 hours    64GB per node
economy                                      0.5                    6 hours    64GB per node
hold                                         0.33                   6 hours    64GB per node
lrg_capability (by special permission only)  1                      12 hours   128GB per node
lrg_economy                                  0.5                    6 hours    128GB per node
lrg_hold                                     0.33                   6 hours    128GB per node
lrg_premium                                  1.5                    6 hours    128GB per node
lrg_regular                                  1                      6 hours    128GB per node
lrg_standby                                  0.1                    6 hours    128GB per node
premium                                      1.5                    6 hours    64GB per node
regular                                      1                      6 hours    64GB per node
share (see usage guidelines below)           1                      12 hours   256GB per node (shared)
special                                      1                      6 hours    64GB per node
standby                                      0.1                    6 hours    64GB per node

Here are some definitions used in the charging formulas below:

  • The "computer factor" is a multiplier that equalizes the way GAUs are consumed on different computing platforms. Faster computers have higher computer factors. The computer charging factor for Bluefire is 1.4.
  • The "queue charging factor" is a multiplier that reflects the priority given to jobs in a queue: higher-priority jobs are charged more.
  • The "number of processors in that node" is 32.

Jobs run in any queue except share or debug are charged for full use of the nodes by this formula:

 Non-debug, non-share queue GAU charges = wallclock hours used * number of nodes reserved * number of processors in that node * computer factor * queue charging factor 

Jobs run in the share and debug queues are charged only for CPU time by this formula:

 Debug and share queue GAU charges = CPU time * computer factor * queue charging factor 
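The two formulas above can be tried out with a short shell sketch. The function names are hypothetical, the computer factor (1.4) and processor count (32) come from the definitions above, and CPU time is assumed to be expressed in hours; treat the results as estimates rather than the charges the accounting system will post.

```shell
#!/bin/sh
# Sketch of the two Bluefire charging formulas; function names are hypothetical.

COMPUTER_FACTOR=1.4      # Bluefire computer factor (from the definitions above)
PROCS_PER_NODE=32        # processors in each Bluefire node

# Full-node charge: wallclock hours * nodes reserved * processors per node
#                   * computer factor * queue charging factor
gau_full_node() {
    # $1 = wallclock hours, $2 = nodes reserved, $3 = queue charging factor
    awk -v h="$1" -v n="$2" -v q="$3" \
        -v p="$PROCS_PER_NODE" -v c="$COMPUTER_FACTOR" \
        'BEGIN { printf "%.2f\n", h * n * p * c * q }'
}

# Share/debug charge: CPU time * computer factor * queue charging factor
gau_shared() {
    # $1 = CPU time (assumed to be in hours), $2 = queue charging factor
    awk -v t="$1" -v q="$2" -v c="$COMPUTER_FACTOR" \
        'BEGIN { printf "%.2f\n", t * c * q }'
}

gau_full_node 6 2 1      # 6 hours on 2 nodes in regular: prints 537.60
gau_full_node 6 2 0.5    # the same job in economy:       prints 268.80
gau_shared 3 1           # 3 CPU hours in share:          prints 4.20
```

Comparing the first two calls shows how the queue charging factor alone halves the cost of an otherwise identical job.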

Guidelines for using the login nodes

Tasks such as code development and short checkout runs may be appropriate to run on the login nodes. We prefer that you limit jobs on the login nodes to 30 minutes wallclock or less. Longer-duration work should be run in one of the batch queues. We also recommend running jobs that utilize large memory in batch rather than on the login nodes. The reason for this recommendation is to provide fast response for all users on these nodes.

Guidelines for using the share and debug queues

The share queue and the debug queue each have two nodes, intended to run multiple jobs simultaneously. Jobs running in these queues should use fewer than 32 processors per node (the fewer processors, the better), and should not monopolize any resource of the nodes, e.g. memory. Additionally:

  • Debug queue jobs must declare 30 minutes or less wallclock time.
  • Good candidates for the share queue are jobs archiving files to the HPSS, or pre- and post-processing jobs.
  • No user should submit more than five jobs to the share or debug queues at any one time.

If these restrictions on the debug and share queues are unacceptable to your job requirements, please contact cislhelp on the tab shown at the top of this page to ask for help with your job workflow.

CISL staff may kill jobs for a variety of reasons, including inappropriate queue domination. A killed job will produce output, but its abrupt termination may be puzzling. Users should look for KILL messages in the job output, and should contact cislhelp about any puzzling messages or output.
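A quick way to scan a job's output for such messages is sketched below. The function name is hypothetical, and the search simply matches the literal string "KILL"; the exact wording of the messages is not specified here, so treat any match as a starting point for a closer look.

```shell
#!/bin/sh
# find_kill_messages FILE: print every line of FILE that contains "KILL",
# prefixed with its line number. (Hypothetical helper name.)
find_kill_messages() {
    grep -n 'KILL' "$1"
}

# Example use on a job's output file (file name follows the -o pattern
# myjob.%J.out used in the batch script examples):
#   find_kill_messages myjob.12345.out
```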

Checking and managing charges

You can check your GAU charges via the CISL user portal at:

    No longer available since Bluefire decommissioning.

Log in using your userid and one-time password from your CryptoCard/Yubikey.

If you are a new portal user, you will need to set up a GAU tab. Go to the "Manage Tabs" tab, select "GAU" as a tab to display, and "Save". A tab labelled "GAU" should now be available where you can select reporting options of your resource usage by date and job number. See the online help document for details.

Note that charges are not posted until the day after a run.

If you are concerned about your GAU usage rate, contact CISL as shown at the top of this page for guidance and suggestions on how to run more efficiently and conserve your allocation. Sometimes jobs can be configured to make better use of the processors, and you may be able to save GAUs by running in a less expensive queue. Seek help from us whenever you notice anything is amiss with your allocation or GAU consumption rate.

Note: The charging formula described in the CISL Portal gives different names to these variables, and it does not make a distinction between dedicated-node charging and shared-node charging. The following table helps prevent confusion caused by the terminology used in the CISL Portal:

Charging formula                                          CISL Portal terminology
GAUs charged                                              GAU calculation
wallclock hours used                                      wallclock hours
number of nodes used * number of processors in that node  CPUs reserved
computer factor                                           system multiplier
queue charging factor                                     queue multiplier

Note: CPUs (processors) must be reserved in whole multiples of the number of processors in a node unless your job runs in the share queue (see share queue formula above).

When you understand the different terminology used for the portal, you can see that both charging formulas are equivalent.

Exceeding allocation threshold limits

Jobs from NCAR divisions or CSL proposal groups that have exceeded either the 30-day or 90-day usage threshold will be placed in the hold queue and run at a priority below jobs in the economy queue. Affected jobs will be charged at 1/3 the rate they would have been charged if they had been run in a regular queue ("rg").

Jobs from NCAR divisions or CSL proposal groups that have exceeded both the 30-day and 90-day usage threshold will be rejected, and users will receive an email suggesting that they submit their jobs to a standby queue.

The 30-day threshold is typically 120% of the 30-day allocation and the 90-day threshold is typically 105% of the 90-day allocation. The 30-day allocation is the same as the monthly allocation.
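The threshold rules above can be summarized in a small sketch. The function name and the "normal", "hold", and "reject" labels are hypothetical, and the 120% and 105% figures are the typical values quoted above, not necessarily the ones in force for a given group.

```shell
#!/bin/sh
# allocation_status USED30 ALLOC30 USED90 ALLOC90
# Prints how new jobs would be treated under the typical thresholds:
#   normal - neither threshold exceeded
#   hold   - exactly one threshold exceeded (jobs go to the hold queue)
#   reject - both thresholds exceeded (jobs are rejected)
allocation_status() {
    awk -v u30="$1" -v a30="$2" -v u90="$3" -v a90="$4" 'BEGIN {
        over30 = (u30 > 1.20 * a30)   # typical 30-day threshold: 120%
        over90 = (u90 > 1.05 * a90)   # typical 90-day threshold: 105%
        if (over30 && over90)      print "reject"
        else if (over30 || over90) print "hold"
        else                       print "normal"
    }'
}

allocation_status 1000 1000 2500 3000   # within both thresholds
allocation_status 1300 1000 2500 3000   # over the 30-day threshold only
allocation_status 1300 1000 3200 3000   # over both thresholds
```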

 

Load Sharing Facility (LSF) from Platform, Inc., is the basic scheduling system on Bluefire.

A simple MPI batch job submission example is given below. For more examples, see the /usr/local/examples directory.

#!/bin/csh 
## LSF batch script to run an MPI application
#BSUB -P 12345678                   # project number (required)
#BSUB -W 1:00                       # wall clock time (hours:minutes)
#BSUB -n 64                         # number of MPI tasks
#BSUB -R "span[ptile=64]"           # run 64 tasks per node
#BSUB -J matadd_mpi                 # job name
#BSUB -o matadd_mpi.%J.out          # output filename
#BSUB -e matadd_mpi.%J.err          # error filename 
#BSUB -q regular                    # queue

mpxlf_r -qfree=f90 -o matadd_mpi.exe matadd_mpi.f 
## edit exec header to enable using 64K pages
ldedit -bdatapsize=64K -bstackpsize=64K -btextpsize=64K matadd_mpi.exe
## set this env for launch as default processor binding
setenv TARGET_CPU_LIST "-1"
mpirun.lsf /usr/local/bin/launch ./matadd_mpi.exe

Note that charging in all queues is for exclusive use of the full node. The exceptions are the share and debug queues, which charge for individual processor use.

Submit jobs with LSF

To submit a job for execution, use the command "bsub":

bsub < script

where "script" is the batch script file. The redirect sign ("<") is required: bsub reads the #BSUB directives only when the script arrives on standard input.

The most common options are shown as follows.

  • -J job_name
  • -P project_name
  • -W [hour:]minute
  • -e error_file
  • -o output_file
  • -q queue_name
  • -w dependency_expression
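Because the -W option takes its value as [hour:]minute, it is easy to misread a limit such as 6:00. The sketch below converts a -W value to total minutes so you can check what you are requesting; the function name is hypothetical.

```shell
#!/bin/sh
# wall_minutes "[hour:]minute": print the total number of minutes that a
# -W value represents. (Hypothetical helper name.)
wall_minutes() {
    # awk treats "08" as decimal 8, which avoids octal surprises in shell
    # arithmetic.
    echo "$1" | awk -F: '{ if (NF == 2) print $1 * 60 + $2; else print $1 + 0 }'
}

wall_minutes 6:00   # prints 360 (6 hours)
wall_minutes 30     # prints 30  (30 minutes)
```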

Passing the options to bsub in the batch script file is recommended. Here is a batch script example that requests 64 tasks on 1 node (64 SMT virtual processors) for 6 hours in the regular queue.

#!/bin/csh
#
# LSF batch script to run an MPI application
#
#BSUB -P 99999999                   # project number
#BSUB -W 6:00                       # wall clock time (hours:minutes)
#BSUB -n 64                         # number of tasks
#BSUB -R "span[ptile=64]"           # run 64 tasks per node
#BSUB -J myjob                      # job name
#BSUB -o myjob.%J.out               # output filename
#BSUB -e myjob.%J.err               # error filename
#BSUB -q regular                    # queue

#compile the code
mpxlf_r -qfree=f90 -o myjob.exe myjob.f

#run the executable
mpirun.lsf /usr/local/bin/launch ./myjob.exe

Monitor jobs in the queue

To display information about your unfinished jobs, use the command "bjobs":

bjobs

Use "-u" to specify a user or user group. The following command displays information about the unfinished jobs of all users:

bjobs -u all

Use "-q" to specify a queue. The following command displays information about your own unfinished jobs in the regular queue:

bjobs -q regular

Command "bpeek" allows you to watch the error and output files of your running batch job. This is particularly useful when your job is long-running and you may need to kill the job if it hangs or shows aberrant behavior. Learn about "bpeek" via man pages.

All of the above "b" commands (bsub, bjobs, etc.) are provided in the LSF suite of user commands.

Delete jobs in the queue and also delete running jobs  

To stop and remove a job, use the command "bkill":

bkill jobid

where jobid is the job ID reported by bsub when the job was submitted; it can also be obtained through bjobs. More details can be found in the bkill man page.
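If you submit jobs from a script, you can capture the job ID at submission time and reuse it later. The sketch below assumes bsub prints its usual confirmation line of the form "Job <ID> is submitted to queue <name>."; the function name is hypothetical, and the LSF commands appear only in comments because they need a system with LSF installed.

```shell
#!/bin/sh
# extract_jobid: read bsub's confirmation line on standard input and print
# the numeric job ID. (Hypothetical helper; assumes the usual message format.)
extract_jobid() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Typical use on a system with LSF (not run here):
#   jobid=$(bsub < script | extract_jobid)
#   bjobs "$jobid"     # check the job's status
#   bkill "$jobid"     # remove the job if it hangs or misbehaves

# Demonstration with a sample confirmation line:
echo 'Job <12345> is submitted to queue <regular>.' | extract_jobid   # prints 12345
```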

Historical job information   

To get a summary of batch jobs already run, use the command "bhist". For more information, see the man page.

To get the best performance from your jobs, we recommend what we refer to as the "Big 3": SMT, large pages, and processor binding. See here for an example of using those "Big 3" on WRF

Using Simultaneous Multi-Threading (SMT)

Simultaneous Multi-Threading (SMT) is a feature that became available under AIX 5.3 for POWER5- and POWER6-based systems. To use SMT, no source code changes are required in your application's Fortran, C, or C++ code, but we recommend some simple modifications to your job scripts described in the paragraphs immediately below. By making these changes, you may be able to boost performance by 20% or more on some applications.

Under SMT, the POWER6 doubles the number of active threads on a processor by implementing a second, on-board "virtual" processor that is enabled by the CPU architecture. The basic concept of SMT is that no single process uses all processor execution units at the same time, so a second thread can utilize unused cycles.

Bluefire has 32 cpus/cores per node. Since the nodes have Simultaneous Multi-Threading (SMT) enabled, it will appear that there are 64 virtual cpus in each node. Typically 64 tasks/threads per node is most efficient; however, we recommend that you compare performance for your application for 32 and 64 tasks per node.

Pure MPI jobs

To take advantage of SMT on Bluefire, double the value of the ptile parameter, i.e. ptile=64 instead of ptile=32. This establishes 64 virtual processors on the Bluefire node, instead of just 32 physical processors.

An MPI-only non-SMT job that is submitted to run on 4 32-way nodes (that is, -n 128 and ptile=32) can be modified to use SMT in two ways. Specifying -n 128 and ptile=64 runs the same number of tasks on 2 32-way nodes. Specifying -n 256 and ptile=64 continues to use 4 nodes and takes advantage of SMT, assuming the job scales up; this method might be preferable if wallclock time is the primary consideration.

The relative benefit of each of these approaches can then be examined by comparing LSF's report of "Resource usage summary" that is included in the file specified by the -o bsub option.
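The node counts quoted for the pure MPI cases follow from dividing the task count by ptile and rounding up. A minimal sketch of that arithmetic, with a hypothetical function name:

```shell
#!/bin/sh
# nodes_needed TASKS PTILE: number of nodes required to place TASKS MPI
# tasks at PTILE tasks per node (ceiling division).
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 128 32   # non-SMT baseline:           prints 4
nodes_needed 128 64   # SMT, same task count:       prints 2
nodes_needed 256 64   # SMT, doubled task count:    prints 4
```

Since most queues charge for full nodes, the node count is also the quantity that drives the GAU charge for a given wallclock time.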

Hybrid jobs

A non-SMT job that runs 32 MPI tasks across 4 32-way Bluefire nodes with each MPI task spawning 4 OpenMP threads would specify -n 32, ptile=8 and OMP_NUM_THREADS=4. The same job can be run with SMT by keeping -n 32 and OMP_NUM_THREADS=4 but switching to ptile=16 and would then use half the number of 32-way nodes. Alternatively, keeping the node count the same (4 nodes) would be configured by -n 64, ptile=16, and OMP_NUM_THREADS=4.

Note for hybrid jobs: Under AIX 5.3, there is a known defect that causes performance problems in hybrid applications when the application reads stdin as redirected from a file, e.g., cam < namelist. The workaround is to set MP_STDINMODE=0 in the environment. This may be important for getting best performance under SMT.

Examples of job scripts using SMT are on Bluefire under the /usr/local/examples/lsf_batch directory.
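For hybrid jobs, the number of threads placed on each node is ptile multiplied by OMP_NUM_THREADS, and that product shows whether a configuration relies on SMT. The sketch below works through the hybrid configurations described above; the function name and the mode labels are hypothetical.

```shell
#!/bin/sh
# hybrid_layout TASKS PTILE THREADS: print the node count and the number of
# threads per node for a hybrid MPI/OpenMP job on 32-way nodes with SMT
# (64 virtual processors per node).
hybrid_layout() {
    tasks=$1; ptile=$2; threads=$3
    nodes=$(( (tasks + ptile - 1) / ptile ))   # ceiling division
    per_node=$(( ptile * threads ))            # threads sharing one node
    if [ "$per_node" -le 32 ]; then
        mode="physical"                        # fits on the 32 physical processors
    elif [ "$per_node" -le 64 ]; then
        mode="SMT"                             # needs the 64 virtual processors
    else
        mode="oversubscribed"                  # more threads than virtual processors
    fi
    echo "nodes=$nodes threads_per_node=$per_node mode=$mode"
}

hybrid_layout 32 8 4    # non-SMT baseline
hybrid_layout 32 16 4   # SMT on half the nodes
hybrid_layout 64 16 4   # SMT on the same number of nodes
```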

Pure OpenMP jobs

A pure OpenMP job is usually submitted with -n 1 and ptile=1 with the environment variable OMP_NUM_THREADS set to the requested number of OpenMP threads (usually 32). Simply setting this environment variable to 64 will exploit SMT for pure OpenMP jobs that scale up to 64 threads.

MPMD jobs

To run an MPMD program such as CCSM using SMT, an MPI job with 80 tasks can fit on two Bluefire nodes instead of four with just the simple changes below. (In this case, each of the 16 atm tasks has 4 threads, so a total of 128 processors is used.)

  1. Modify ptile setting (maximum number of tasks per node) in LSF:
    #BSUB -R "span[ptile=64]"    #Bluefire default without SMT is 32
    
  2. The number of tasks your job requests remains the same:
    #BSUB -n 80    # number of tasks
    
  3. If your job uses task geometry, modify the LSB_PJL_TASK_GEOMETRY environment variable as if the node had 64 processors rather than 32, for example:

    Old task geometry (uses 4 nodes):

    export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,\
    21,22,23,24,25,26,27,28,29,30,31) (32,33,34,35,36,37,38,39,\
    40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61)\
    (62,63,64,65,66,67,68,69)(70,71,72,73,74,75,76,77,78,79)}" 
    

    Note: A backslash (\) at the end of a line indicates that the line continues on the next line.

    New task geometry (uses 2 nodes):

    export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,\
    21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,\
    40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63) \
    (64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79)}"
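Typing long task lists by hand is error-prone, so it can help to generate the geometry string. The sketch below builds a value in the same format as the examples above from a list of group sizes, assigning consecutive task IDs to each group; the function name is hypothetical, and it covers only geometries in which task IDs are consecutive within each group.

```shell
#!/bin/sh
# task_geometry SIZE...: print an LSB_PJL_TASK_GEOMETRY value that places
# consecutive task IDs into groups of the given sizes (one group per node).
task_geometry() {
    awk -v sizes="$*" 'BEGIN {
        n = split(sizes, size, " ")
        id = 0
        out = "{"
        for (g = 1; g <= n; g++) {
            out = out "("
            for (i = 0; i < size[g]; i++) {
                if (i > 0) out = out ","
                out = out id
                id++
            }
            out = out ")"
        }
        print out "}"
    }'
}

task_geometry 2 3     # prints {(0,1)(2,3,4)}

# The new two-node geometry shown above (64 tasks, then 16 tasks):
LSB_PJL_TASK_GEOMETRY="$(task_geometry 64 16)"
export LSB_PJL_TASK_GEOMETRY
```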

SMT should aid in getting better throughput of your jobs and better performance for your GAU charges. Below are suggestions for testing whether SMT usage will benefit your applications:

Instructions for using SMT with the Community Climate System Model (CCSM) run scripts are given in the document "Taking advantage of Simultaneous Multi-Threading on bluevista when running CCSM" (Note: You will need to adjust for Bluefire's larger nodes).

Multiple page size support

64-KB pages

The default page size is 4 KB. On POWER6 systems, AIX 5L Version 5.3 supports a new 64-KB page size when running the 64-bit kernel. 64-KB pages are intended to be general-purpose. They are easy to use, and it is expected that many applications will see performance benefits when using 64-KB pages rather than 4-KB pages. IBM has reported performance improvements on a variety of workloads ranging from 1% to 13% when compared to the default 4-KB pages.

A user can specify a different page size to use for each of the three regions of a process's address space (data, stack, and text). The ldedit command may be used to set these page size options in an existing executable:

ldedit -btextpsize=64K -bdatapsize=64K -bstackpsize=64K a.out

A user can also set a process's preferred page sizes via the LDR_CNTRL environment variable. The following example will cause a.out to use 64-KB pages for its data, text, and stack:

Korn shell:

export LDR_CNTRL=DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K

This will override any page size settings in an executable's XCOFF header.
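If you want to experiment with different combinations, the LDR_CNTRL value can be assembled from the three region sizes. The helper below is a hypothetical sketch; it accepts only 4K and 64K, on the assumption that these are the sizes ordinary users can select without special authorization.

```shell
#!/bin/sh
# ldr_cntrl_value DATA TEXT STACK: print an LDR_CNTRL value for the given
# page sizes. Only 4K and 64K are accepted (an assumption of this sketch).
ldr_cntrl_value() {
    for size in "$1" "$2" "$3"; do
        case "$size" in
            4K|64K) ;;
            *) echo "unsupported page size: $size" >&2; return 1 ;;
        esac
    done
    echo "DATAPSIZE=$1@TEXTPSIZE=$2@STACKPSIZE=$3"
}

# Korn or Bourne shell: 64-KB pages for all three regions
LDR_CNTRL="$(ldr_cntrl_value 64K 64K 64K)"
export LDR_CNTRL
echo "$LDR_CNTRL"   # prints DATAPSIZE=64K@TEXTPSIZE=64K@STACKPSIZE=64K
```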

Caveat: Using 64-KB pages rather than 4 KB pages for a multithreaded process's data may reduce the maximum number of threads a process can create due to alignment requirements for stack guard pages. If you encounter this limit, you may disable stack guard pages by setting the environment variable AIXTHREAD_GUARDPAGES to 0.

Page sizes for very high performance environments

AIX 5.3 also supports large pages (16 MB) and "huge" pages (16 GB). However, these must be configured by the system administrator and the system rebooted. Users must be specifically authorized to use large pages. For further information on special requests for use of large pages, contact the CISL Consulting Office by any of the methods in our CISL Customer Support.

Further details are discussed in the IBM Whitepaper, "Guide to Multiple Page Size Support on AIX 5L Version 5.3".

PS: Using the "-g" option with the xlf compiler may fail to provide useful information when large page size is used.

Processor binding is mandatory

We highly recommend using processor binding for all parallel jobs as explained below:

Pure MPI

To use processor binding, set (Korn or Bourne shell syntax):

export TARGET_CPU_LIST="-1"   #Korn shell syntax - use setenv for C shell
mpirun.lsf /usr/local/bin/launch ./wrf.exe

Instead of "-1", you may provide a comma-separated list of CPU IDs.

OpenMP-MPI hybrids

For hybrid programs, use:

export TARGET_CPU_RANGE="-1"
mpirun.lsf /usr/local/bin/hybrid_launch ./wrf.exe

along with your OMP_NUM_THREADS environment variable setting.

Important: All parallel jobs should begin using one of the launch scripts mentioned above with their mpirun.lsf command.

Pure OpenMP

To have more control, we recommend using the XLSMPOPTS environment variable, setting it as follows:

# for ksh and bash
export XLSMPOPTS="startproc=0:stride=n:stack=128000000"

or

# for csh
setenv XLSMPOPTS "startproc=0:stride=n:stack=128000000"

To maximize performance, n should be the largest stride that still allows the requested number of threads to run on the 64 available virtual processors. For example, for OMP_NUM_THREADS=64 the stride must be 1, for OMP_NUM_THREADS=32 it should be 2, for OMP_NUM_THREADS=16 it should be 4, and so on.
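The stride values listed above are 64 divided by the thread count. The sketch below applies that rule with integer division to build the XLSMPOPTS value; the function name is hypothetical, and for thread counts that do not divide 64 evenly the result is a reasonable even spacing rather than a guaranteed optimum.

```shell
#!/bin/sh
# xlsmpopts_for THREADS: print an XLSMPOPTS value whose stride spreads
# THREADS OpenMP threads across the 64 virtual processors of a node.
xlsmpopts_for() {
    threads=$1
    if [ "$threads" -lt 1 ] || [ "$threads" -gt 64 ]; then
        echo "thread count must be between 1 and 64" >&2
        return 1
    fi
    stride=$(( 64 / threads ))   # 64 -> 1, 32 -> 2, 16 -> 4
    echo "startproc=0:stride=$stride:stack=128000000"
}

# ksh and bash usage for a 32-thread job:
OMP_NUM_THREADS=32
XLSMPOPTS="$(xlsmpopts_for "$OMP_NUM_THREADS")"
export OMP_NUM_THREADS XLSMPOPTS
echo "$XLSMPOPTS"   # prints startproc=0:stride=2:stack=128000000
```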