Common causes of job failures

These are some of the most common causes of job failures on the Cheyenne system, along with tips on how to avoid them. Watch the CISL Daily Bulletin for other tips and for change notices that you need to be aware of.

  1. Running close to the node memory limit.
    If you experience erratic job failures on Cheyenne, you may need to specify higher-memory nodes or spread the job across more nodes. See Checking memory use to determine how much memory your program requires to run.
  2. Home or other directory filling up.
    This is another possible cause of erratic job failures. Run the gladequota command and clean up any GLADE spaces that are at or near 100% full.
  3. Specifying version-specific modules in your dotfiles.
    The problem with specifying version-specific modules in your dotfiles (.bashrc and .tcshrc, for example) is that the version you specify will eventually be removed. You'll look at your batch job, see nothing wrong with it, and be puzzled, not realizing the problem is rooted in your dotfiles. In fact, it is best not to specify any modules in your dotfiles at all. Instead, set up any unique environments you need as described in our Customized environments documentation, or load the necessary modules in your batch script.
  4. Failure to clean directories when remaking executables and binaries.
    This failure can be puzzling because it looks like everything built correctly. When you run your application, however, it fails in a way that does not point to the root cause: the build used one or more binaries left over from previous, incompatible builds.
  5. Filling up temporary file space.
    Using /tmp, /var/tmp, or similar shared directories to hold temporary files increases the risk that your programs, and other users' programs, will fail when no more space is available. Set TMPDIR to point to your GLADE scratch space in every batch job script. See Storing temporary files with TMPDIR.
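
For tip 1, a minimal sketch of a PBS job header that requests higher-memory nodes and spreads the work across more of them. The project code, program name, and exact memory value are illustrative; check the current node memory limits in the CISL documentation before relying on them.

```shell
#!/bin/bash
#PBS -N bigmem_job
#PBS -A PROJECT_CODE            # illustrative; use your own project code
#PBS -q regular
#PBS -l walltime=01:00:00
### Request 4 higher-memory nodes with fewer MPI ranks per node,
### rather than packing the job onto 2 standard nodes.
### (The mem value is illustrative; verify the current limit.)
#PBS -l select=4:ncpus=36:mpiprocs=18:mem=109GB

mpiexec_mpt ./my_program        # hypothetical executable name
```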
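For tip 2, once gladequota shows which GLADE space is nearly full, a quick way to see what is consuming it is to rank the top-level items by size. This sketch assumes a POSIX shell and checks the home directory; point du at whichever space gladequota flagged.

```shell
# Rank the largest items at the top level of $HOME, smallest to
# largest, with human-readable sizes; hidden entries are included.
du -sh "$HOME"/* "$HOME"/.[!.]* 2>/dev/null | sort -h | tail -10
```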
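For tip 3, a sketch of loading modules inside the batch script itself rather than in dotfiles, so the job records exactly what it needs and keeps working when your dotfiles change. The module names and resource values are illustrative; run module avail to see what is currently installed.

```shell
#!/bin/bash
#PBS -N myjob
#PBS -A PROJECT_CODE                # illustrative project code
#PBS -l select=1:ncpus=36:mpiprocs=36
#PBS -l walltime=00:30:00

# Start from a known module state, then load exactly what this job needs.
module purge
module load ncarenv intel netcdf    # illustrative module names

./my_program                        # hypothetical executable name
```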
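For tip 4, the safest habit is to remove stale build artifacts before every rebuild. This is a fragment, not a complete script; it assumes your Makefile has a clean target, and the fallback file names are illustrative.

```shell
# Remove previously built objects and executables so the new build
# cannot silently link against incompatible leftovers.
make clean && make

# If the Makefile has no clean target, remove artifacts directly, e.g.:
# rm -f *.o *.mod my_program
```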
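For tip 5, the recommendation can be sketched as two lines near the top of every batch script. The scratch path shown is illustrative; substitute your own GLADE scratch space.

```shell
# Point TMPDIR at GLADE scratch instead of the shared /tmp,
# and create the directory before the job needs it.
export TMPDIR=/glade/scratch/${USER}/temp
mkdir -p "$TMPDIR"
```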