Checking memory use


If a job requires more usable memory than is available on any of the nodes in use, the system monitor may kill it, it may drop a large core file in your directory, or it may die, sometimes with no explanation. To determine how much memory your program uses without monitoring it continuously in real time, check it with one of the system-specific tools that CISL provides:

  • The peak_memusage tool is for Cheyenne jobs.
  • The job_memusage tool is for Geyser and Caldera jobs.

Both tools identify the peak memory used and print it to stderr when the job terminates. The result is expressed in binary multiples (such as MiB or GiB) for precision.

These tools cannot run properly if the program they are checking requires more memory than is available. The usage instructions below are followed by recommendations for resolving this and other problems that may arise when you are using these tools.

If your job approaches the usable memory per node threshold shown in this table, you may experience unexpected issues or job failures. Leave a margin of 2 or 3 percent.

System      Usable memory/node
Cheyenne    43 GB (3,168 nodes)
            86 GB (864 nodes)
Geyser      1000 GB
Caldera     62 GB
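
The margin arithmetic can be sketched in a few lines of shell. This is only an illustration: the function name safe_limit is hypothetical and is not part of any CISL tool, and the inputs are the usable-memory figures from the table and the suggested margin.

```shell
#!/bin/sh
# Hypothetical helper (not a CISL tool): print a per-node memory target in GB.
# $1 = usable memory per node in GB, taken from the table above
# $2 = safety margin in percent (the text suggests 2 or 3)
safe_limit() {
    awk -v usable="$1" -v margin="$2" \
        'BEGIN { printf "%.2f\n", usable * (1 - margin / 100) }'
}

safe_limit 43 3   # Cheyenne standard nodes
safe_limit 86 3   # Cheyenne large-memory nodes
```

With a 3 percent margin, this gives targets of 41.71 GB and 83.42 GB per node for the two Cheyenne node types.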

Running peak_memusage

The recommended way to determine your application’s memory usage on Cheyenne is to run a PBS batch job using peak_memusage in a non-shared queue such as "regular" or "economy." Trying to run this on a login node or using the "share" queue increases the risk that your job will fail.

See these links for sample batch scripts that you can customize:

Running job_memusage

This section is in development.

Identifying and resolving memory issues

Issue 1. Your Cheyenne job requires large-memory nodes. Modify your select statement to include mem=109GB, as in this example.

    #PBS -l select=2:ncpus=36:mpiprocs=36:mem=109GB
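
As a rough way to decide whether Issue 1 applies, you might compare a measured per-node peak against the usable-memory figures in the table. The sketch below assumes the 43 GB and 86 GB Cheyenne values and a default 3 percent margin; the function name node_type and its output labels are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not a CISL tool): suggest a Cheyenne node type.
# $1 = measured peak memory per node in GB
# $2 = safety margin in percent (optional, defaults to 3)
node_type() {
    awk -v peak="$1" -v margin="${2:-3}" 'BEGIN {
        std = 43 * (1 - margin / 100)   # standard nodes, minus margin
        big = 86 * (1 - margin / 100)   # large-memory nodes, minus margin
        if (peak <= std)      print "standard"
        else if (peak <= big) print "large-memory"
        else                  print "use-more-nodes"
    }'
}

node_type 60
```

A peak of 60 GB per node falls between the two limits, so the sketch prints large-memory. A result of use-more-nodes means the per-node peak exceeds even the large-memory nodes, so the work would need to be spread across more nodes.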

Issue 2. Your peak_memusage or job_memusage batch job returns zero-length batch output and error files, or the output provides partial or no results about memory use, or the job output contains the string “Killed” with no further information.

This indicates the job terminated abnormally. The tools cannot provide meaningful information when the executable being measured does not execute properly. Check your output to see if the job is failing before the memory tool writes its results. If it is, correct the problem and rerun the job. If a parallel job provides only partial results or no results about memory use, run the job again using more nodes. If the parallel job fails when run across more nodes, consider using the Allinea DDT advanced memory debugger to isolate the problem.
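
The first checks described for Issue 2 (zero-length files and the string “Killed”) can be scripted. This is a minimal sketch, not a CISL-provided tool: the function name check_job_output is hypothetical, and it only classifies a single file that you pass to it.

```shell
#!/bin/sh
# Hypothetical helper: classify one batch output or error file.
# Prints "empty"  if the file is missing or zero length,
#        "killed" if it contains the string "Killed",
#        "ok"     otherwise.
check_job_output() {
    file="$1"
    if [ ! -s "$file" ]; then
        echo "empty"
    elif grep -q "Killed" "$file"; then
        echo "killed"
    else
        echo "ok"
    fi
}
```

You would call it with the path to your job's output or error file, for example check_job_output myjob.o1234567, where that file name is only a placeholder. A result of "ok" does not confirm the memory report is complete; it only rules out the two symptoms listed above.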

Issue 3. Your batch job crashes shortly after execution and the output provides no explanation.

If you suspect memory problems, it makes sense to examine recent changes to your code or runtime environment.

When did you last run the job successfully? Since then, was the model's resolution increased? Did you port your code from a computer having more memory? Did you add new arrays or change array dimensions? Have you modified the batch job script?

If the answer to those questions is “no,” your job might be failing because you have exceeded your disk quota rather than because of a memory issue. To check, follow these steps:

  • Run the gladequota command on Cheyenne.
  • Check the "% Full" column in the command's output.
  • Clean up any GLADE spaces that are at or near 100% full.
  • Look for core files or other large files that recent jobs may have created.

Exceeding disk quota at runtime can result in symptoms that are similar to those resulting from memory problems. If running gladequota doesn't indicate a problem, consider using the Allinea DDT advanced memory debugger to isolate the problem.
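
The search for core files or other large files in the steps above might be sketched with the standard find command. The function name find_suspect_files, the core file name patterns, and the 1G default size threshold are all assumptions chosen for illustration; adjust them to match your own jobs and directories.

```shell
#!/bin/sh
# Hypothetical helper: list files that look like core dumps, or that are
# larger than a size threshold, under a directory.
# $1 = directory to search (defaults to the current directory)
# $2 = size threshold in find's -size syntax (defaults to 1G)
find_suspect_files() {
    dir="${1:-.}"
    min_size="${2:-1G}"
    find "$dir" -type f \
        \( -name 'core' -o -name 'core.*' -o -size "+$min_size" \) \
        -print
}
```

Run it against the GLADE space that gladequota reports as nearly full, passing that directory as the first argument, and review the listed files before deleting anything.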

If you have tried all of the above and are still troubled that your job is exceeding usable memory, contact CISL Consulting with the relevant Cheyenne job number(s). The consultant on duty can search the job logs for information.