Managing and monitoring jobs

qdel | qhist | qpeek | qstat

Here are some of the most useful commands for managing and monitoring Cheyenne jobs.

⇒ qdel

Run qdel with the job ID to kill a pending or running job.

qdel jobID

Kill all of your own pending or running jobs. (Be sure to use backticks as shown.)

qdel `qselect`
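
If you want to remove only the jobs that have not started yet, the same idea can be narrowed with qselect filters. The following is a minimal sketch, assuming your PBS installation's qselect accepts -u (job owner) and -s (job state); the function name kill_queued_jobs is hypothetical and not part of Cheyenne's tooling.

```shell
# Delete only queued (state Q) jobs owned by the current user.
# Assumption: qselect supports -u (owner) and -s (state), as in PBS Pro.
kill_queued_jobs() {
    ids=$(qselect -u "${USER:-$(id -un)}" -s Q)
    if [ -z "$ids" ]; then
        echo "No queued jobs to delete."
        return 0
    fi
    # Word splitting is intentional here: each job ID becomes one argument.
    qdel $ids
}
```

Call kill_queued_jobs from your login shell after defining it. Running jobs are left alone because only IDs in state Q are passed to qdel.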

⇒ qhist

Run qhist for information on finished jobs.

qhist -u $USER

Your output will include jobs that finished on the current day unless you specify the number (N) of days to include.

qhist -u $USER -d N

Your output will be similar to this:

Job ID User   Queue   Nodes Submit  Start   Finish  Mem(GB)  CPU(%) Wall(s)
936169 sarahk economy  18   06-1230 06-1236 06-1236     0.0    25.0      10
936168 sarahk economy  18   06-1230 06-1233 06-1236    21.4  2197.0     191
904154 sarahk economy  18   05-1230 05-1231 05-1232     8.9  1106.0      33
904153 sarahk economy  18   04-1230 04-1231 04-1231     0.6   190.0      13
822187 sarahk economy  18   03-1230 03-1230 03-1231     3.7  2506.0      31
822174 sarahk economy  18   02-1230 02-1230 02-1230     0.8   100.0      14
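
Output in this layout can be summarized with standard text tools. The sketch below adds up the Wall(s) column of the sample rows shown above; the helper name total_wall_seconds is hypothetical, and it assumes that the first line is the header and that Wall(s) is the last field on every job row.

```shell
# Sum the Wall(s) column of qhist-style output read from standard input.
total_wall_seconds() {
    # Skip the header line; Wall(s) is the last field on each job row.
    awk 'NR > 1 && NF > 0 { sum += $NF } END { print sum + 0 }'
}

# Sample rows copied from the output shown above.
total_wall_seconds <<'EOF'
Job ID User   Queue   Nodes Submit  Start   Finish  Mem(GB)  CPU(%) Wall(s)
936169 sarahk economy  18   06-1230 06-1236 06-1236     0.0    25.0      10
936168 sarahk economy  18   06-1230 06-1233 06-1236    21.4  2197.0     191
904154 sarahk economy  18   05-1230 05-1231 05-1232     8.9  1106.0      33
904153 sarahk economy  18   04-1230 04-1231 04-1231     0.6   190.0      13
822187 sarahk economy  18   03-1230 03-1230 03-1231     3.7  2506.0      31
822174 sarahk economy  18   02-1230 02-1230 02-1230     0.8   100.0      14
EOF
# prints 292
```

With live data, the same helper could be fed from a pipe, for example qhist -u $USER -d 7 | total_wall_seconds, provided the real output keeps the same column order.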

This variation will generate a list of jobs that finished with non-zero exit codes to help you identify jobs that failed.

qhist -u $USER -f

⇒ qpeek

Use qpeek to inspect the stdout log for a job that is running on Cheyenne. The qpeek script will examine one of your running jobs at random if you do not specify a jobID.

qpeek jobID

To monitor the stderr log, use the --error option.

qpeek --error jobID

To examine both stdout and stderr while a job is running, join the output and error logs by including this PBS directive in your job script:

#PBS -j oe
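
To show where the directive sits, here is a minimal job script sketch. The job name, the PROJECT_CODE placeholder, the wall-clock limit, and the resource request are illustrative assumptions, not values taken from this page; replace them with your own.

```shell
#!/bin/bash
#PBS -N peek_demo
#PBS -A PROJECT_CODE
#PBS -q regular
#PBS -l walltime=00:10:00
#PBS -l select=1:ncpus=36:mpiprocs=36
#PBS -j oe

# Write one line to stdout and one to stderr.
# With "-j oe", both streams land in the same log.
log_step() {
    echo "stdout: $1"
    echo "stderr: $1" >&2
}

log_step "starting"
```

While a job like this is running, qpeek jobID should then show both lines without the --error option, because the error stream has been merged into the output log.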

⇒ qstat

Running qstat by itself is the same as qstat -u $USER. Either will give you the status of all of your own unfinished jobs. (Use this command only sparingly.)

Your output will be similar to what is shown just below. Most column headings are self-explanatory: NDS is the number of nodes, TSK is the number of tasks, and so on.

In the status (S) column, most jobs are either queued (Q) or running (R). Sometimes jobs are held (H), which might mean they are dependent on the completion of another job. If you have a job that is held and is not dependent on another job, CISL recommends killing and resubmitting the job.

                                                       Req'd  Req'd   Elap
Job ID         Username Queue   Jobname SessID NDS TSK Memory Time  S Time
------         -------- -----   ------- ------ --- --- ------ ----- - ---- 
657237.chadmin apatelsm economy ens603   46100 60  216   --   02:30 R 01:24 
657238.chadmin apatelsm regular ens605     --   1   36   --   00:05 H   -- 
657466.chadmin apatelsm economy ens701    5189 60  216   --   02:30 R 00:46 
657467.chadmin apatelsm regular ens703     --   1   36   --   00:10 H   --
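
A listing like this can be reduced to a count of jobs per state, which makes held jobs easy to spot. The sketch below runs on the sample rows above; the helper name count_states is hypothetical, and it assumes that job rows begin with a numeric job ID followed by a dot and that the state is the second-to-last field.

```shell
# Count jobs per state (Q, R, H, ...) in qstat-style output on standard input.
count_states() {
    # Only job rows start with "<digits>."; header and rule lines are skipped.
    awk '$1 ~ /^[0-9]+\./ { count[$(NF-1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Sample rows copied from the output shown above.
count_states <<'EOF'
                                                       Req'd  Req'd   Elap
Job ID         Username Queue   Jobname SessID NDS TSK Memory Time  S Time
------         -------- -----   ------- ------ --- --- ------ ----- - ----
657237.chadmin apatelsm economy ens603   46100 60  216   --   02:30 R 01:24
657238.chadmin apatelsm regular ens605     --   1   36   --   00:05 H   --
657466.chadmin apatelsm economy ens701    5189 60  216   --   02:30 R 00:46
657467.chadmin apatelsm regular ens703     --   1   36   --   00:10 H   --
EOF
# prints:
# H 2
# R 2
```

With live data, the helper could read from a pipe instead, for example qstat -u $USER | count_states, as long as the column layout matches the sample.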

Following are examples of qstat with some commonly used options and arguments.

Get a status report on an individual job. Use -f for the full listing, or -x to include a job that has already finished.

qstat -f jobID
qstat -x jobID
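
Because qstat should be used sparingly, a script that waits on a single job is better off checking at a long interval than in a tight loop. The following is a minimal sketch; the function name wait_for_job and the 300-second default are hypothetical, and it assumes that qstat jobID exits with a non-zero status once the job has finished or is unknown to the server.

```shell
# Wait for one job to leave the queue, polling qstat at a long interval.
# Assumption: qstat exits non-zero once the job is finished or unknown.
wait_for_job() {
    job_id=$1
    interval=${2:-300}   # seconds between checks; hypothetical default
    while qstat "$job_id" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "Job $job_id is no longer queued or running."
}
```

For example, wait_for_job jobID 600 would check once every ten minutes. After it returns, qstat -x jobID can be used to look at the finished job.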

Get information about unfinished jobs in a specified queue.

qstat -q queue_name

See job activity by queue (e.g., pending, running) in terms of numbers of jobs.

qstat -Q

Display information for all of your pending, running, and finished jobs.

qstat -xu $USER