The new Yellowstone environment differs in some important ways from earlier HPC, analysis, and visualization systems, even beyond the hardware specifications.
Yellowstone is the petascale computing resource in the NCAR-Wyoming Supercomputing Center (NWSC) in Cheyenne, Wyoming.
System software
The new operating system is Red Hat Enterprise Linux, while the Bluefire system used UNIX/AIX. You will notice few if any differences as a result of this change.
Some additional notes on system software:
Compilers: The most notable difference is that compilers from Intel, PathScale, PGI, GNU, and NVIDIA are provided rather than the XL compilers that were used on Bluefire. You may need to change your makefiles and scripts as a result, and you will need to recompile your source code, but having several compilers will give you more flexibility for your project. (Most climate and weather models, notably CESM and WRF, have already been ported to the Linux x86 architecture with at least one of these supported compilers.) You can find documentation about these compilers here: Compiling code.
Environment modules: Using these is significantly more important on the new systems than it was previously, now that many more compilers are available. The environment modules will help you ensure that you are loading compatible compilers, libraries, and other packages to properly configure your environment. Expert users still will be able to customize their own environment to suit their unique purposes.
Shells: The four most widely used shells are supported on login and compute nodes: tcsh, csh, bash, and ksh. The default is tcsh.
Single login to centralized environment
You probably are accustomed to logging in to Bluefire, Mirage, and Storm by specifying a unique host name for each machine, such as bluefire.ucar.edu or mirage0.ucar.edu.
That is not the case in the new environment, in which the login nodes are common to the Yellowstone HPC system and the analysis and visualization clusters, Geyser and Caldera. A single host name - yellowstone.ucar.edu - will give you access to one of the six login nodes, from which you can schedule jobs to run through any of the queues described below or start interactive sessions on the Geyser and Caldera nodes.
Centralizing the login nodes in this environment provides significant benefits by making workflow more efficient. For example, you will be able to submit a set of dependent, chained batch jobs to run without your intervention: one on Geyser that reads and preprocesses a data set from our Research Data Archive; then a large-scale simulation on Yellowstone using that input data; followed by post-processing on Geyser; then generating a set of visualizations on Caldera.
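As a rough illustration of such a chain, the sketch below builds one `bsub` command line per stage, with each stage waiting on the one before it. It assumes the standard LSF options `-J` (job name), `-q` (queue), and `-w` (dependency expression); the job names and script file names are hypothetical placeholders, and "regular" is used only as an example Yellowstone queue. Check the Running jobs documentation for the forms actually recommended on this system.

```python
# Sketch: build bsub command lines for a four-stage dependent workflow.
# Job names and script file names are hypothetical placeholders.

STAGES = [
    # (job name, queue, batch script)
    ("preprocess", "geyser", "preprocess.lsf"),
    ("simulate", "regular", "simulate.lsf"),
    ("postprocess", "geyser", "postprocess.lsf"),
    ("visualize", "caldera", "visualize.lsf"),
]


def chained_bsub_commands(stages):
    """Return one bsub command per stage; each stage waits on the previous one."""
    commands = []
    previous = None
    for name, queue, script in stages:
        parts = ["bsub", "-J", name, "-q", queue]
        if previous is not None:
            # LSF dependency expression: start only after the previous job is done.
            parts += ["-w", f"'done({previous})'"]
        parts += ["<", script]
        commands.append(" ".join(parts))
        previous = name
    return commands


if __name__ == "__main__":
    # Print the commands rather than running them, so they can be reviewed first.
    for command in chained_bsub_commands(STAGES):
        print(command)
```

The sketch only prints the commands; submitting them in order from a login node would queue all four stages at once, and the scheduler would release each one as its predecessor finishes.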
Scheduling and charges
Platform LSF is used to schedule jobs on the Yellowstone, Geyser, and Caldera clusters. Bluefire users are familiar with LSF, but it was not used to schedule jobs on the Mirage and Storm systems, so the process may be new to you.
Charges are calculated in terms of core-hours rather than General Accounting Units (GAUs). In practice, the only difference is omission of the "machine factor" constant in calculating charges.
Jobs run in any queue other than the shared "geyser" and "caldera" queues (see the charts below) are charged for exclusive use of the nodes according to this formula:
(# nodes x 16) x wall-clock time (hours) x queue factor
Charges for jobs run on shared nodes are calculated using this formula:
core-seconds/3600 x queue factor
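The two formulas can be sketched as small functions for estimating charges before you submit a job. The function names are illustrative, not part of any CISL tool. The `cores_per_node` parameter is an assumption added so the same function can cover the bigmem queue, which the queue chart says is charged for 40 cores per node.

```python
# Sketch of the two charging formulas; function names are illustrative only.

CORES_PER_NODE = 16  # value used in the exclusive-use formula


def exclusive_charge(nodes, wall_clock_hours, queue_factor,
                     cores_per_node=CORES_PER_NODE):
    """Core-hours charged for exclusive use of whole nodes."""
    return nodes * cores_per_node * wall_clock_hours * queue_factor


def shared_charge(core_seconds, queue_factor):
    """Core-hours charged for a job on shared nodes."""
    return core_seconds / 3600 * queue_factor


if __name__ == "__main__":
    # 64 nodes for 3 hours in the premium queue (queue factor 1.5)
    print(exclusive_charge(64, 3, 1.5))   # 4608.0
    # 4 cores busy for 30 minutes (7,200 core-seconds), queue factor 1.0
    print(shared_charge(7200, 1.0))       # 2.0
```

Note that the exclusive-use charge depends on the number of nodes, not on the number of tasks: a job that runs only 8 tasks on each of 64 nodes is charged the same as one that uses all 16 cores per node.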
Submitting jobs
Batch scripts that you have used for submitting Bluefire jobs will need only minor changes to run in the new environment.
Key changes to make:
- Check your scripts for instances of environment variable TMPDIR and make sure it is set to /glade/scratch/username. Do not allow your applications to write to /tmp as some users have done on Bluefire.
- Revise your scripts as needed to account for the difference in the number of cores per node: Yellowstone has 16, while Bluefire had 32. We recommend running no more than 16 tasks per node. To see if your code benefits from the hyper-threading support on Yellowstone, however, you can experiment with up to 32 tasks per node.
- You may have used an 8-digit number to identify your project on Bluefire. Prefix that number with the letter "P" if that project is still active in the Yellowstone environment. For example, project number 35071234 on Bluefire becomes P35071234. Also use the "P"-prefixed form of your project code for HPSS, though the project number likely will work for most HSI commands.
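The three changes above involve simple arithmetic and string handling, which the following sketch illustrates. The helper names are hypothetical and the code is only a convenience for checking values by hand; it does not modify batch scripts.

```python
# Sketch: helpers for the script changes listed above. Names are hypothetical.
import math


def scratch_tmpdir(username):
    """Recommended TMPDIR value for a given user name."""
    return f"/glade/scratch/{username}"


def nodes_needed(tasks, tasks_per_node=16):
    """Nodes required for a task count; 16 tasks per node is the recommendation."""
    return math.ceil(tasks / tasks_per_node)


def yellowstone_project_code(bluefire_project):
    """Prefix an 8-digit Bluefire project number with the letter 'P'."""
    code = str(bluefire_project).strip()
    # Accept a code that already carries the prefix.
    digits = code[1:] if code[:1] in ("P", "p") else code
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"expected an 8-digit project number, got {code!r}")
    return "P" + digits


if __name__ == "__main__":
    print(scratch_tmpdir("username"))            # /glade/scratch/username
    print(nodes_needed(512))                     # 32
    print(nodes_needed(512, tasks_per_node=32))  # 16
    print(yellowstone_project_code("35071234"))  # P35071234
```

A 512-task job that filled 16 Bluefire nodes at 32 tasks per node therefore needs 32 Yellowstone nodes at the recommended 16 tasks per node.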
You can find sample scripts in our Yellowstone Running jobs documentation.
Queues
HPC resources: There are fewer scheduling queues overall and the standard wall-clock limit has been extended from six hours to 12 hours. (One exception is the “small” queue, where the limit is two hours.)
There are no "debug" and "share" queues like those you may have used on Bluefire. Use the "small," "geyser," or "caldera" queues for debugging, and use the "geyser" and "caldera" queues in lieu of a designated share queue.
Yellowstone queues
| Queue | Wall clock | Job size (# of cores) | Priority | Queue factor | Notes |
| --- | --- | --- | --- | --- | --- |
| capability | 12 hours | 16,385-65,536 | 2 | 1.0 | Execution window: Noon Friday to 6 a.m. Monday |
| regular | 12 hours | Up to 16,384 | 2 | 1.0 | |
| premium | 12 hours | Up to 16,384 | 1 | 1.5 | |
| economy | 12 hours | Up to 16,384 | 3 | 0.7 | |
| small | 2 hours | Up to 4,096 | 1.5 | 1.0 | 8 a.m. to 5 p.m. only |
| standby | 12 hours | Up to 16,384 | 4 | 0.0 | Accessible only to projects that have exceeded their allocation limits or 30- or 90-day usage thresholds. |
| hpss | 24 hours | 1 | 1 | N/A | For HPSS and external data transfer only |

Jobs are charged for all 16 cores on each node used, regardless of how many cores per node actually are used. This does not apply to the hpss queue.
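To see which of the general-purpose Yellowstone queues can accept a given job, the size and wall-clock limits in the chart can be encoded as data, as in the sketch below. It is an illustration only: it assumes a minimum of one core where the chart says "Up to," omits the special-purpose standby and hpss queues, and does not model execution windows such as the weekend-only capability window or the daytime-only small queue.

```python
# Sketch: check a job request against the Yellowstone queue limits in the chart.
# Execution windows and the standby and hpss queues are deliberately not modeled.

QUEUES = {
    # name: (max wall-clock hours, min cores, max cores, queue factor)
    "capability": (12, 16385, 65536, 1.0),
    "regular": (12, 1, 16384, 1.0),
    "premium": (12, 1, 16384, 1.5),
    "economy": (12, 1, 16384, 0.7),
    "small": (2, 1, 4096, 1.0),
}


def eligible_queues(cores, wall_clock_hours):
    """Return the names of queues whose limits admit the requested job."""
    return [
        name
        for name, (max_hours, min_cores, max_cores, _factor) in QUEUES.items()
        if min_cores <= cores <= max_cores and wall_clock_hours <= max_hours
    ]


if __name__ == "__main__":
    print(eligible_queues(2048, 1.5))   # ['regular', 'premium', 'economy', 'small']
    print(eligible_queues(8192, 6))     # ['regular', 'premium', 'economy']
    print(eligible_queues(32768, 12))   # ['capability']
```

Among the eligible queues, the queue factor and priority columns then determine the trade-off between cost and turnaround.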
Analysis and visualization resources: Jobs on the Geyser and Caldera systems must be submitted for scheduling. As shown in the chart, there are four queues – one for interactive and shared batch use on each cluster and one for exclusive batch job use on each cluster.
Geyser and Caldera queues
| Queue | Wall clock | Job size (# cores) | Priority | Queue factor | Notes |
| --- | --- | --- | --- | --- | --- |
| geyser | 24 hours | 1-39 | 2 | 1.0 | Interactive and batch use; shared nodes |
| bigmem | 6 hours | 1-640 | 2 | 1.0 | Interactive and batch use, exclusive; jobs charged for all 40 cores on each node used; daytime limit of four nodes |
| caldera | 24 hours | 1-15 | 2 | 1.0 | Interactive and batch use; shared nodes |
| gpgpu | 6 hours | 1-256 | 2 | 1.0 | Interactive and batch use, exclusive; jobs charged for all 16 cores on each node used; daytime limit of four nodes |

The geyser and bigmem queues use nodes in the Geyser cluster. The caldera and gpgpu queues use nodes in the Caldera cluster.
GLADE
The Bluefire GLADE resource was retired on April 1, 2013. A new and much larger GLADE system serves as the central disk resource for the Yellowstone, Geyser, and Caldera clusters.
The new GLADE system has file spaces and policies that are similar to those used previously, although some paths have changed. For example, Yellowstone users have a home file space identified as /glade/u/home/username rather than /glade/home/username.
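If your scripts contain hard-coded references to the old home space, the change of prefix can be applied mechanically. The sketch below is a minimal illustration covering only the home path named above; the function name is hypothetical, and other file spaces may follow different rules.

```python
# Sketch: rewrite an old-style GLADE home path to the new prefix.
# Only the home space mentioned in the text is handled.

OLD_HOME = "/glade/home/"
NEW_HOME = "/glade/u/home/"


def update_home_path(path):
    """Return the path with the old home prefix replaced; other paths are unchanged."""
    if path.startswith(OLD_HOME):
        return NEW_HOME + path[len(OLD_HOME):]
    return path


if __name__ == "__main__":
    print(update_home_path("/glade/home/username/run.sh"))
    # /glade/u/home/username/run.sh
```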
Files that remained in the old /glade/home space when it was retired were backed up and will be retained off-line for six months. Contact cislhelp@ucar.edu to access those files.
HPSS
You can continue to follow the same recommended procedures for saving files to, and retrieving files from, our High Performance Storage System. The physical location of the new tape storage system has no impact on how you use the system.
CISL created a new backup area in which you can save a second copy of critical files. The new user-controlled, second-copy strategy offers better protection than is possible with the class of service (COS) method for creating two copies of a file. For example, the new method ensures that you can't overwrite or delete both your primary copy and the backup copy of a file with a single erroneous HSI command. Since backup copies are stored separately, they won't be affected by a command that deletes or overwrites the primary file.
CISL discontinued support for the dual-copy COS on Sept. 4, 2012, and all data files that were stored in the dual-copy COS are being converted to single-copy files.
You can find instructions for using the new backup area in our HPSS Multiple copies documentation.