Managing and monitoring jobs

qdel | qhist | qpeek | qstat

Updated 9/16/2020: The qhist example output was revised to show that walltime is now displayed in hours rather than seconds.

Here are some of the most useful commands for managing and monitoring Cheyenne jobs.

⇒ qdel

Run qdel with the job ID to kill a pending or running job.

qdel jobID

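To kill all of your own pending or running jobs, pass the output of qselect to qdel. (Be sure to use backticks as shown; they substitute the job IDs that qselect prints into the qdel command line.)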

qdel `qselect`
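If you want to delete only the jobs that are still waiting in the queue, the same idea can be narrowed with qselect options. The following is a minimal sketch rather than a CISL-provided tool: the function name delete_queued_jobs is hypothetical, and it assumes the standard PBS qselect options -u (user) and -s (job state). It also checks for an empty list first, because qdel needs at least one job ID.

```shell
# Hypothetical helper: delete only your queued (not yet running) jobs.
delete_queued_jobs() {
    # -u limits the selection to one user; -s Q limits it to queued jobs.
    ids=$(qselect -u "$USER" -s Q)

    # qdel needs at least one job ID, so stop here if nothing matched.
    if [ -z "$ids" ]; then
        echo "No queued jobs to delete."
        return 0
    fi

    # Left unquoted on purpose so each job ID becomes a separate argument.
    qdel $ids
}
```

After defining the function in your shell, run delete_queued_jobs with no arguments. Running jobs are left untouched.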

⇒ qhist

Run qhist for information on finished jobs.

qhist -u $USER

Your output will include jobs that finished on the current day unless you specify the number (N) of days to include.

qhist -u $USER -d N

Your output will be similar to this, with Mem(GB) and CPU(%) indicating approximate average memory and CPU usage per node:

Job ID   User   Queue  Nodes  Submit   Start    Finish  Mem(GB) CPU(%) Wall(h)
4151227  arashe arashe     3  15-2344  15-2344  16-0409    16.1  100.6    4.43
4148980  arashe arashe     1  15-2146  15-2147  16-0408     7.2   97.2    6.36
4148979  arashe arashe    22  15-2146  15-2147  16-0356    36.2  100.7    6.16
4153964  arashe arashe     1  16-0300  16-0300  16-0348     4.4    2.7    0.79
4154334  arashe arashe     1  16-0340  16-0340  16-0340     0.0    0.9    0.00
4153997  arashe arashe     4  16-0311  16-0311  16-0313    15.6   50.8    0.02
4150854  arashe arashe     1  15-2320  15-2320  16-0313     1.4    2.6    3.89
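The CPU(%) column makes it easy to spot jobs that left their nodes mostly idle. The following is a minimal sketch, not part of qhist itself: it assumes the column layout shown above (CPU(%) is the ninth whitespace-separated field of each job row), uses a hypothetical 50% threshold, and reads a few sample rows modeled on the output above, with walltime in hours as the header indicates, rather than live qhist output.

```shell
# Sample rows modeled on the qhist output above (walltime in hours).
# On the system you would capture live output instead, for example:
#   qhist_output=$(qhist -u "$USER" -d 2)
qhist_output='Job ID   User   Queue  Nodes  Submit   Start    Finish  Mem(GB) CPU(%) Wall(h)
4151227  arashe arashe     3  15-2344  15-2344  16-0409    16.1  100.6    4.43
4153964  arashe arashe     1  16-0300  16-0300  16-0348     4.4    2.7    0.79
4153997  arashe arashe     4  16-0311  16-0311  16-0313    15.6   50.8    0.02'

# Hypothetical threshold: flag jobs whose average CPU use is below 50%.
threshold=50

# Skip the header line (NR > 1), then compare field 9, CPU(%), to the threshold.
low_cpu_jobs=$(printf '%s\n' "$qhist_output" |
    awk -v limit="$threshold" 'NR > 1 && $9 + 0 < limit { print $1, $9 }')

# Print the job ID and CPU(%) of each flagged job.
echo "$low_cpu_jobs"
```

With the sample rows, only job 4153964 (2.7% CPU) falls below the threshold. Adjust the threshold to suit your workload; a serial job on a multi-core node will naturally report low utilization.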

The following variation will generate a list of jobs that finished with non-zero exit codes to help you identify jobs that failed.

qhist -u $USER -f

⇒ qpeek

Use qpeek to inspect the stdout log for a job that is running on Cheyenne. The qpeek script will examine one of your running jobs at random if you do not specify a jobID.

qpeek jobID

To monitor the stderr log, use the --error option.

qpeek --error jobID

To examine both stdout and stderr while a job is running, join the output and error logs by including this PBS directive in your job script:

#PBS -j oe
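For context, here is a sketch of where that directive sits in a job script. Everything other than the -j oe line is a placeholder chosen for illustration: the job name, PROJECT_CODE, queue, and resource request are assumptions, not recommendations, so substitute the values your own job needs.

```shell
#!/bin/bash
# Minimal PBS job script sketch. The job name, project code, queue, and
# resource request below are placeholders, not recommendations.
#PBS -N peek_demo
#PBS -A PROJECT_CODE
#PBS -q regular
#PBS -l walltime=00:10:00
#PBS -l select=1:ncpus=36:mpiprocs=36
# Join stderr into stdout so both streams end up in a single log.
#PBS -j oe

echo "progress message written to stdout"
echo "warning message written to stderr" >&2
```

With the logs joined this way, running qpeek jobID while the job is active should show both messages in one stream.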

⇒ qstat

Run qstat with your username to see the status of all of your own unfinished jobs.

qstat -u $USER

Your output will be similar to the sample shown below. Most column headings are self-explanatory: NDS for nodes, TSK for tasks, and so on.

In the status (S) column, most jobs are either queued (Q) or running (R). Sometimes jobs are held (H), which might mean they are dependent on the completion of another job. If you have a job that is held and is not dependent on another job, CISL recommends killing and resubmitting the job.

                                                       Req'd  Req'd   Elap
Job ID         Username Queue   Jobname SessID NDS TSK Memory Time  S Time
------         -------- -----   ------- ------ --- --- ------ ----- - ---- 
657237.chadmin apatelsm economy ens603   46100 60  216   --   02:30 R 01:24 
657238.chadmin apatelsm regular ens605     --   1   36   --   00:05 H   -- 
657466.chadmin apatelsm economy ens701    5189 60  216   --   02:30 R 00:46 
657467.chadmin apatelsm regular ens703     --   1   36   --   00:10 H   --
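When you have many jobs, a quick tally by status can be easier to read than the full listing. The following is a minimal sketch that assumes the column layout shown above (the S column is the tenth whitespace-separated field of each job row). It uses the sample rows above rather than live output, and the variable names are only illustrative.

```shell
# Sample rows copied from the qstat output above.
# On the system you would capture live output instead, for example:
#   qstat_output=$(qstat -u "$USER")
qstat_output='657237.chadmin apatelsm economy ens603   46100 60  216   --   02:30 R 01:24
657238.chadmin apatelsm regular ens605     --   1   36   --   00:05 H   --
657466.chadmin apatelsm economy ens701    5189 60  216   --   02:30 R 00:46
657467.chadmin apatelsm regular ens703     --   1   36   --   00:10 H   --'

# Keep only lines that begin with a numeric job ID (this skips the header
# lines in live output), then tally field 10, the S column.
state_counts=$(printf '%s\n' "$qstat_output" |
    awk '$1 ~ /^[0-9]+\./ { count[$10]++ }
         END { for (s in count) print s, count[s] }' | sort)

# List the IDs of held jobs so you can check whether each has a dependency.
held_jobs=$(printf '%s\n' "$qstat_output" |
    awk '$1 ~ /^[0-9]+\./ && $10 == "H" { print $1 }')

echo "$state_counts"
echo "$held_jobs"
```

For the sample rows this reports two held jobs and two running jobs, followed by the IDs of the held jobs (657238.chadmin and 657467.chadmin).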

Following are examples of qstat with some other commonly used options and arguments.

Get a long-form summary of the status of an unfinished job. (Use this only sparingly; it places a high load on PBS.)

qstat -f jobID

Get a single-line summary of the status of an unfinished or recently completed job (within 72 hours).

qstat -x jobID

Get information about unfinished jobs in a specified queue.

qstat -q queue_name

See job activity by queue, shown as the number of jobs in each state (e.g., pending, running).

qstat -Q

Display information for all of your pending, running, and finished jobs.

qstat -xu $USER