Transition from Bluefire

The Yellowstone environment differs in some important ways from earlier HPC, analysis, and visualization systems, even beyond the hardware specifications.


System software

The Yellowstone operating system is Red Hat Enterprise Linux, while Bluefire ran IBM's AIX. You will notice few if any differences as a result of this change.

Some additional notes on system software:

Compilers: The most notable difference is that compilers from Intel, PathScale, PGI, GNU, and NVIDIA are provided rather than the XL compilers that were used on Bluefire. You may need to change your makefiles and scripts as a result, and you will need to recompile your source code, but having several compilers will give you more flexibility for your project. (Most climate and weather models, notably CESM and WRF, have been ported to the Linux x86 architecture with at least one of these supported compilers.) You can find documentation for these compilers in our Compiling code documentation.

Environment modules: Using environment modules is significantly more important on the new systems than it was on Bluefire, now that many more compilers are available. The modules help ensure that you load compatible compilers, libraries, and other packages to configure your environment properly. Expert users can still customize their own environments to suit their unique purposes.
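
As a minimal sketch of a typical session (the specific module names, versions, and the compile command shown here are examples and may not match what is installed on the system):

    module avail                 # list the modules available on the system
    module load intel            # load the Intel compiler suite
    module load netcdf           # load a NetCDF build that matches the loaded compiler
    module list                  # confirm which modules are currently loaded
    ifort -o model model.f90     # compile with the compiler the modules configured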

Shells: The four most widely used shells are supported on login and compute nodes: tcsh, csh, bash, and ksh. The default is tcsh.


Single login to centralized environment

You may be accustomed to logging in to Bluefire, Mirage, and Storm by specifying a unique host name for each machine, such as bluefire.ucar.edu or mirage0.ucar.edu.

In the new environment, the login nodes are common to the Yellowstone HPC system and the analysis and visualization clusters, Geyser and Caldera. A single host name—yellowstone.ucar.edu—gives you access to one of the six login nodes.

Once you have logged in, you can schedule jobs to run through any of the queues described below or start interactive sessions on the Geyser and Caldera nodes. 


Centralizing the login nodes in this way provides significant benefits by making workflow more efficient. For example, you can submit a set of dependent, chained batch jobs to run without your intervention: one on Geyser that reads and preprocesses a data set from our Research Data Archive; then a large-scale simulation on Yellowstone using that input data; followed by post-processing on Geyser; then generating a set of visualizations on Caldera.
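
A sketch of how such a chain can be expressed with LSF job dependencies is shown below; the job names, script files, and queue choices are illustrative only:

    bsub -J prep   -q geyser  < preprocess.lsf                      # preprocess input data on Geyser
    bsub -J model  -q regular -w "done(prep)"  < run_model.lsf      # run the simulation on Yellowstone when prep finishes
    bsub -J post   -q geyser  -w "done(model)" < postprocess.lsf    # post-process model output on Geyser
    bsub -J plots  -q caldera -w "done(post)"  < visualize.lsf      # generate visualizations on Caldera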


Scheduling and charges

Platform LSF is used to schedule jobs on the Yellowstone, Geyser, and Caldera clusters. Bluefire users are familiar with LSF, but it was not used to schedule jobs on the Mirage and Storm systems, so the process may be new to some.

Charges are calculated in terms of core-hours rather than General Accounting Units (GAUs). In practice, the only difference is the omission of the "machine factor" constant from the charge calculation.

Jobs run in any queue other than the shared "geyser" and "caldera" queues shown in the tables below are charged for exclusive use of the nodes according to this formula:

wall-clock hours × nodes used × cores per node × queue factor

Charges for jobs run on the shared nodes are calculated by multiplying core-hours by the queue factor:

core-seconds/3600 × queue factor
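
For example, a job that uses four nodes for two wall-clock hours in the regular queue (queue factor 1.0) is charged 2 × 4 × 16 × 1.0 = 128 core-hours, while a shared-node job that consumes 7,200 core-seconds in the geyser queue is charged 7,200/3600 × 1.0 = 2 core-hours.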

Submitting jobs

Batch scripts that you used for submitting Bluefire jobs need only minor changes to run on the Yellowstone system.

Key changes to make:

  • Check your scripts for instances of the environment variable TMPDIR and make sure it is set to /glade/scratch/username. Do not let your applications write to /tmp, as some users did on Bluefire.
  • Revise your scripts as needed to account for the difference in the number of cores per node: Yellowstone has 16, while Bluefire had 32. We recommend running no more than 16 tasks per node. To see whether your code benefits from the hyper-threading support on Yellowstone, however, you can experiment with up to 32 tasks per node.
  • You may have used an 8-digit number to identify your project on Bluefire. Prefix that number with the letter "P" if the same project is still active in the Yellowstone environment. For example, project number 35071234 on Bluefire becomes P35071234. Also use the "P"-prefixed form of your project code for HPSS, though the unprefixed number will likely work for most HSI commands.

You can find sample scripts in our Yellowstone Running jobs documentation.
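
As a minimal sketch of what such a script might look like (the project code, job name, executable, and resource requests are placeholders, and mpirun.lsf is shown as one common way to launch an MPI executable under LSF):

    #!/bin/bash
    #BSUB -P P35071234              # project code with the "P" prefix (example number from above)
    #BSUB -J my_job                 # job name
    #BSUB -q regular                # queue to submit to
    #BSUB -W 2:00                   # wall-clock limit (hours:minutes)
    #BSUB -n 64                     # total number of tasks
    #BSUB -R "span[ptile=16]"       # 16 tasks per node; try up to 32 to test hyper-threading
    #BSUB -o my_job.%J.out          # standard output file (%J is the job ID)
    #BSUB -e my_job.%J.err          # standard error file

    export TMPDIR=/glade/scratch/$USER   # keep temporary files out of /tmp
    mpirun.lsf ./my_model                 # launch the executable

Submit the script with bsub < script_name, then monitor it with bjobs.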


Queues

Yellowstone: There are fewer scheduling queues than there were on Bluefire, and the standard wall-clock limit has been extended from six hours to 12 hours. (One exception is the "small" queue, where the limit is two hours.)

There are no “debug” and “share” queues as you may have used on Bluefire. Use the "small," "geyser," or "caldera" queues for debugging; and use the "geyser" and "caldera" queues in lieu of a designated share queue.

You can check current queue activity on our web site or by running bsumm on your command line after logging in.

Queue       Wall clock   Job size (cores)   Priority   Queue factor   Notes
capability  12 hours     16,385-65,536      2          1.0            Execution window: midnight Friday to 6 a.m. Monday
regular     12 hours     Up to 16,384       2          1.0
premium     12 hours     Up to 16,384       1          1.5
economy     12 hours     Up to 16,384       3          0.7
small       2 hours      Up to 4,096        1.5        1.5            Interactive and batch use for debugging, profiling, and testing; no production workloads
standby     12 hours     Up to 16,384       4          0.0            Accessible only to projects that have exceeded their allocation limits or 30- or 90-day usage thresholds

Jobs are charged for all 16 cores on each node used, regardless of how many cores per node are actually used.

Geyser and Caldera: Jobs on the Geyser and Caldera systems must be submitted through the scheduler. As shown in the table below, each cluster has one queue for interactive and shared batch use and one for exclusive batch use; a separate hpss queue is provided for HPSS and external data transfers.

You can check current queue activity on our web site or by running bsumm on your command line after logging in.

Queue     Wall clock   Job size (cores)   Priority   Queue factor   Notes
geyser    24 hours     1-39               2          1.0            Interactive and batch use; shared nodes
bigmem    6 hours      1-640              2          1.0            Interactive and batch use, exclusive; jobs charged for all 40 cores on each node used; daytime limit of four nodes
caldera   24 hours     1-15               2          1.0            Interactive and batch use; shared nodes
gpgpu     6 hours      1-256              2          1.0            Interactive and batch use, exclusive; jobs charged for all 16 cores on each node used; daytime limit of four nodes
hpss      24 hours     1                  1          0              For HPSS and external data transfer only

The geyser and bigmem queues use nodes in the Geyser cluster. The caldera and gpgpu queues use nodes in the Caldera cluster.


GLADE

The Bluefire GLADE resource was retired on April 1, 2013. A new and much larger GLADE system serves as the central disk resource for the Yellowstone, Geyser, and Caldera clusters.

The new GLADE system has file spaces and policies that are similar to those used previously. For example, Yellowstone users have a file space identified as /glade/u/home/username rather than /glade/home/username.


HPSS

You can continue to follow the same recommended procedures for saving files to, and retrieving files from, our High Performance Storage System.

There is a new backup area in which you can save a second copy of critical files. This user-controlled, second-copy strategy offers better protection than was possible with the class of service (COS) method for creating two copies of a file. For example, it ensures that you can't overwrite or delete both your primary copy and the backup copy of a file with a single erroneous HSI command. Since backup copies are stored separately, they won't be affected by a command that deletes or overwrites the primary file.
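
As a rough illustration of the idea only (the backup area path shown below is hypothetical; see the HPSS Multiple copies documentation referenced at the end of this section for the actual procedure):

    hsi cput analysis.tar : analysis.tar                     # primary copy in your HPSS home directory
    hsi cput analysis.tar : /BACKUP/username/analysis.tar    # second copy written to a separate, hypothetical backup area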

CISL discontinued support for the dual-copy COS on Sept. 4, 2012, and all data files that were stored in the dual-copy COS are being converted to single-copy files.

Instructions for creating backup copies are in our HPSS Multiple copies documentation.