Checking memory use


Updated 10/2/2018

If a job requires more memory than is usable on the nodes it is running on, it may be killed by the system monitor, drop a large core file in your directory, or simply die, sometimes with no explanation. To determine how much memory your program uses, without having to monitor it continuously in real time, check it with the peak_memusage tool.

The peak_memusage tool identifies the peak memory your program used and prints it to stderr when the job terminates. The result is expressed in binary multiples (for example, GiB rather than GB) for precision.

The tool cannot run properly if the program being checked requires more memory than is available. See Identifying and resolving memory issues below for how to address this issue and others that may arise.

If your job approaches the usable memory per node shown in the table below, you may experience unexpected issues or job failures. Leave a margin of 2 or 3 percent.

System      Usable memory/node
Cheyenne    45 GB (3,168 nodes)
            109 GB (864 nodes)
Casper      365 GB (20 nodes)
            738 GB (2 nodes)
            1,115 GB (2 nodes)
Geyser      1,000 GB
Caldera     62 GB

Running peak_memusage on Cheyenne

The recommended way to determine your application’s memory usage on Cheyenne is to run a PBS batch job using peak_memusage in a non-shared queue such as "regular" or "economy."

See these links for sample batch scripts that you can customize:

If you already have a working batch job and want to check its memory usage, the examples can show you how to instrument your job with peak_memusage.
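
For reference, here is a minimal sketch of an instrumented Cheyenne batch job. The peak_memusage module name and the peak_memusage.exe wrapper invocation are assumptions based on typical usage; confirm them against the sample scripts and the output of "module avail" on your system, and substitute your own project code and executable.

#!/bin/bash
#PBS -N memcheck
#PBS -A project_code
#PBS -q regular
#PBS -l walltime=00:30:00
#PBS -l select=2:ncpus=36:mpiprocs=36
#PBS -j oe

### Load the memory-usage tool (module name is an assumption; check "module avail")
module load peak_memusage

### Prefix your executable with the wrapper (name assumed) so peak memory use
### is reported to stderr when the job ends
mpiexec_mpt peak_memusage.exe ./my_program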


Running peak_memusage on DAV nodes

Use the peak_memusage tool in a Slurm batch script to determine how much memory your program uses when running on the Casper, Geyser, or Caldera clusters. The sample scripts provided at the links below use constraints to select the types of nodes to be used (for example, -C geyser and --mem=100G). Depending on your job's needs, you might instead specify caldera or pronghorn nodes or other constraints.

See these links for sample batch scripts that you can customize:

If you already have a working batch job and want to check its memory usage, the examples can show you how to instrument your job with peak_memusage.
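
For reference, here is a minimal sketch of an instrumented Slurm job for the DAV clusters. The partition name, module name, and peak_memusage.exe wrapper shown here are assumptions; adjust them, along with the constraints and memory request, to match the sample scripts linked above and your own job's needs.

#!/bin/bash -l
#SBATCH -J memcheck
#SBATCH -A project_code
#SBATCH -t 00:30:00
#SBATCH -n 1
#SBATCH -p dav
#SBATCH -C geyser
#SBATCH --mem=100G
#SBATCH -o memcheck.%j.out

### Load the memory-usage tool (module name is an assumption; check "module avail")
module load peak_memusage

### Run the program under the wrapper (name assumed); peak memory use is written to stderr
srun peak_memusage.exe ./my_program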


Identifying and resolving memory issues

Issue 1. Your Cheyenne job requires large-memory nodes. Modify your select statement to include mem=109GB, as in this example.

#PBS -l select=2:ncpus=36:mpiprocs=36:mem=109GB

Issue 2. Your peak_memusage batch job returns zero-length output and error files, the output provides partial or no results about memory use, or the output contains the string "Killed" with no further information.

This indicates the job terminated abnormally. The tool cannot provide meaningful information when the executable being measured does not run properly. Check your output to see if the job is failing before any memory tool output appears. If it is, correct the problem and rerun the job. If a parallel job provides only partial results or no results about memory use, run the job again using more nodes, as in the example below. If the parallel job still fails when run across more nodes, consider using the Allinea DDT advanced memory debugger to isolate the problem.
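
For example, starting from the 72-rank job shown under Issue 1, you might spread the same number of MPI ranks across four large-memory nodes instead of two so that each rank has more memory available. The node and rank counts here are illustrative only; choose values that suit your application.

#PBS -l select=4:ncpus=36:mpiprocs=18:mem=109GB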

Issue 3. Your batch job crashes shortly after execution and the output provides no explanation.

If you suspect memory problems, it makes sense to examine recent changes to your code or runtime environment.

When did you last run the job successfully? Since then, was the model's resolution increased? Did you port your code from a computer having more memory? Did you add new arrays or change array dimensions? Have you modified the batch job script?

If the answer to those questions is “no,” your job might be failing because you have exceeded your disk quota rather than because of a memory issue. To check, follow these steps:

  • Run the gladequota command on Cheyenne (see the example commands below).
  • Check the "% Full" column in the command's output.
  • Clean up any GLADE spaces that are at or near 100% full.
  • Look for core files or other large files that recent jobs may have created.
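
Here is a sketch of those checks from a Cheyenne login session. The scratch path is an example only; point the find command at the GLADE spaces your jobs actually write to.

### Check how full your GLADE spaces are
gladequota

### Look for large core files or other files that recent jobs may have left behind
### (the path is an example; adjust it to your own directories)
find /glade/scratch/$USER -name 'core*' -size +100M -mtime -7 -ls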

Exceeding your disk quota at runtime can produce symptoms similar to those of memory problems. If running gladequota does not indicate a problem, consider using the Allinea DDT advanced memory debugger to isolate the problem.

If you have tried all of the above and still suspect that your job is exceeding usable memory, contact CISL Consulting with the relevant Cheyenne job number(s). The consultant on duty can search the job logs for more information.