CISL Annual Report: FY2020
The NCAR Computational & Information Systems Lab achieved numerous goals in fiscal year 2020, a task that became especially challenging with the arrival of the COVID-19 pandemic. From mid-March through the end of the fiscal year, most staff members worked remotely rather than at their usual workplaces in Boulder, Colorado, and Cheyenne, Wyoming. Despite significant changes in on-site schedules and procedures, CISL completed several important computing and data-storage upgrades while managing a major procurement.
Operations at the NCAR-Wyoming Supercomputing Center (NWSC) in particular changed dramatically, and getting projects done demanded an extra measure of trust, innovation, and effort. In addition to wearing masks, following other health and safety guidelines, and quarantining new equipment upon delivery, the staff presented and recorded virtual site visits and walk-through videos for vendors interested in submitting proposals for NCAR’s next-generation high-performance computing (HPC) system, which is known for now as NWSC-3.
Staff throughout CISL adapted as needed to continue delivering world-class support to the Earth system science community in FY2020. Beginning in March, for example, the High Performance Computing Division and Enterprise Systems and Services Division came together to complete two Campaign Storage capacity expansions and other projects, with vendor support often provided only through remote consultation.
The Cheyenne supercomputer, NCAR’s current HPC environment, supported more than 1,800 unique users at more than 300 universities and other institutions during the fiscal year. In our most recent survey, the university user community reported more than 550 publications and nearly 70 dissertations from FY20 that resulted from the use of NCAR resources and services.
Daily Cheyenne node utilization regularly exceeded 95% by the end of the fiscal year, although system software and facility upgrades combined with power-related incidents reduced the average utilization for the year to just over 80%.
The NWSC-3 request for proposals was released on April 2. CISL conducted a virtual site visit of the NWSC on April 24 to address prospective offeror questions and also completed two separate Independent Government Cost Estimates for this procurement. The NWSC-3 technical and business evaluation teams completed their evaluations of the initial NWSC-3 proposals that were tendered and submitted their independent recommendations to the CISL Council on July 22. Subsequently, UCAR issued CISL’s guidance and requests for the down-selected vendors to solicit their Final Revised Proposals (FRPs). After the technical and business team review of the FRPs, CISL Council met and formulated a recommendation, which was accepted by the NCAR Director. Finally, CISL concluded the subcontract negotiations with the selected vendor and worked with UCAR contracts on submitting the approval package to NSF.
In addition to preparing for a new HPC system, CISL successfully deployed new cyberinfrastructure that included a data analysis and visualization (DAV) resource named Casper and a storage resource called Campaign Storage. The deployment of Casper, which replaced the now-decommissioned Geyser and Caldera clusters, allowed CISL to transition some of the smaller jobs and high-throughput jobs away from Cheyenne, which could substantially improve the performance of the job scheduler and also potentially improve the queue wait time for user jobs. The Campaign Storage system, which now has a usable capacity of 65 PB, is facilitating our pivot away from the aging High Performance Storage System (HPSS). With CISL staff helping users transition their workflows from HPSS and migrating data to other storage resources, use of the system was reduced significantly over the course of the year, and the system will be decommissioned in October 2021.
Other successful deployments included the new Quasar tape archive system, which was made available to early users. Quasar allows CISL to use less expensive tape storage as a complementary and integrated tape-based storage tier for data sets that are suitable for long-term preservation or permanent archiving.
CISL also deployed, for testing purposes, a new on-premise cloud environment called Stratus. The 5-PB Western Digital ActiveScale / Quantum object-storage environment provides the rudiments of essential cloud characteristics for storage as defined in the National Institute of Standards and Technology (NIST) cloud model: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. The Research Data Archive (RDA) is currently in the early use and evaluation stage of this service. Other groups will follow. Results of their evaluations will be used to determine follow-on phases for on-premise cloud deployment.
CISL staff also:
- Developed and tested bursting software that allows the SLURM scheduler to spin up commercial cloud instances. Tests included successful completion of numerous Antarctic Mesoscale Prediction System (AMPS) daily forecast runs on the commercial cloud. Such test and research activities previously required deploying costly emerging technology resources to the on-premise HPC R&D Futures Lab.
- Successfully deployed a Grand Unified File Index (GUFI) platform to generate easy-to-use weekly reports that have proven to be invaluable to CISL, file owners, and project leads in understanding and managing the remaining data holdings across GLADE, Campaign Storage, Quasar, and HPSS. This will allow CISL and the user community to more effectively manage data across the different storage tiers.
- Deployed a robust, comprehensive framework that is used after the completion of system downtimes and outages to ensure that all system resources are healthy before user jobs are released to Cheyenne and Casper. The same framework is often used to spot-check system health in order to identify rogue nodes and failing resources before a user job lands on them. This work enhanced the availability and usability of all CISL-provided HPC compute and storage resources.
- Worked with NCAR modeling groups to identify opportunities to improve model performance and workflow efficiency.
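The post-downtime health-check idea described in the list above can be illustrated with a minimal sketch. This is not CISL's framework: the `NodeStatus` fields, the check names, and the 32 GB memory threshold are all hypothetical, chosen only to show how a set of checks might gate which nodes are released to user jobs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one compute node's state."""
    name: str
    memory_free_gb: float
    filesystem_mounted: bool
    interconnect_up: bool


# Each check returns True when the node passes.
# Check names and the memory threshold are illustrative assumptions.
CHECKS: Dict[str, Callable[[NodeStatus], bool]] = {
    "memory": lambda n: n.memory_free_gb >= 32.0,
    "filesystem": lambda n: n.filesystem_mounted,
    "interconnect": lambda n: n.interconnect_up,
}


def failed_checks(node: NodeStatus) -> List[str]:
    """Return the names of the checks this node fails, in definition order."""
    return [name for name, check in CHECKS.items() if not check(node)]


def triage(nodes: List[NodeStatus]) -> Dict[str, List[str]]:
    """Map each unhealthy node name to the list of checks it failed."""
    report: Dict[str, List[str]] = {}
    for node in nodes:
        failures = failed_checks(node)
        if failures:
            report[node.name] = failures
    return report


def nodes_to_release(nodes: List[NodeStatus]) -> List[str]:
    """Names of nodes that passed every check and can accept user jobs."""
    unhealthy = triage(nodes)
    return [n.name for n in nodes if n.name not in unhealthy]
```

In this sketch the same `triage` function serves both uses mentioned above: run over every node after an outage, or over a sample of nodes as a periodic spot check.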
The data services CISL provides – from curation, distribution, and access to analysis and user training – are essential to NCAR’s support of the scientific community’s research efforts. The Data Stewardship Engineering Team (DSET), for example, leads NCAR’s efforts toward increased coordination around digital data discovery and access. Additionally, an ongoing DSET-driven activity is the operation of the Digital Asset Services Hub (DASH), which provides support and guidance for writing and executing data management plans, consulting services for data management, and integrated access to ongoing DSET activities.
In FY20, CISL began several efforts related to its “Science at Scale” strategy for Big Data. A selected subset of the 500 TB CESM Large Ensemble (LENS) data, a data set of wide interest, was converted to Zarr format and uploaded to the Amazon Web Services (AWS) Simple Storage Service (S3). CISL obtained an allocation of 100 TB of free storage from the AWS Public Datasets Program to serve this data. The effort also included development of documentation and a Jupyter Notebook for data exploration. The NCAR CESM LENS cloud-hosted subset was released publicly on October 9, 2020, and is now listed in the AWS Registry of Open Data at https://registry.opendata.aws/ncar-cesm-lens/.
CISL continued to improve the content of and access to its data resources in the Research Data Archive (RDA) and provided data-sharing services through the Climate Data Gateway, the Globus data-transfer facility, and the CMIP Analysis Platform. The latter uses in-house compute capabilities co-located with high-performance storage to provide an analysis environment for petabyte-sized data collections, which have become too large to be moved and stored efficiently at user locations. The co-located compute and storage services create the technology ecosystem that is required to support CMIP-related national and international research initiatives.
In one FY20 highlight, 49 data sets were archived in the DASH Repository, which was deployed for operational use in FY19. One result is that NCAR data sets that did not previously have an established repository are now curated and publicly accessible, achieving compliance with U.S. Open Data policies. Other highlights included:
- GeoCAT implementation of significant computational and visualization functionalities from NCL. More than 10 NCL computational functions were implemented under the GeoCAT computational component (GeoCAT-comp). In addition, more than 90 visualization scripts were implemented in the GeoCAT plotting gallery. The successful GeoCAT implementation of those functions that were originally NCL methods demonstrated the feasibility of the pivot-to-Python strategy.
- Continued growth in the Visualization and Analysis Platform for Ocean, Atmosphere, and Solar Research (VAPOR) user community. The VAPOR data and visualization software package, used for efficient exploration of very large or complex 3D data sets, was cited in research publications more than 20 times in FY20. The latest stable release, v2.6, was downloaded more than 1,500 times during the fiscal year.
- RDA extracted the Integrated Surface Pressure Databank version 3 (ISPD V3) data collection from its native HDF5 format, and transformed and loaded it into a relational database structure to support more efficient data query and subset processing capabilities.
- Completion of univariate bias-correction of the NA-CORDEX data set for all high-priority variables and evaluation of multivariate bias correction.
- Major release of VAPOR 3.2 on February 3, 2020, incorporating a modern flow renderer and model renderer. Numerous other improvements were made regarding software performance and ease-of-use for end users. They included performance improvements enabled by modern C++ features, modern OpenGL capabilities, and a more user-friendly interface enabled by modern usage of the Qt GUI toolkit. VAPOR 3.2 accumulated 3,942 downloads.
- Completion of VAPOR code-base refactoring and integration of the remaining major version 2 features (for example, volume rendering, flow visualization, and Python processing) into version 3.1.
- New releases of all parallel Python tools in response to user issues, requests, and CMIP6 experiences.
- Pangeo deployment of an on-site JupyterHub instance with access to the GLADE and Cheyenne systems, deployment on Amazon Web Services, and the start of benchmarking efforts to measure performance and explore platform optimization.
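The ISPD item in the list above describes loading extracted observations into a relational structure so that queries and subsets become more efficient. The following is a minimal sketch of that general pattern, using Python's built-in `sqlite3` module. The table name, column names, and sample records are hypothetical and do not reflect the actual ISPD schema or the database system RDA uses; the in-memory rows stand in for records that would be read from the native HDF5 files.

```python
import sqlite3

# Illustrative records standing in for observations extracted from HDF5.
# Columns: station id, observation time (ISO 8601), latitude, longitude,
# surface pressure in hPa. All values are made up for this example.
OBSERVATIONS = [
    ("STN001", "2001-01-01T00:00", 40.0, -105.3, 835.2),
    ("STN002", "2001-01-01T06:00", 41.1, -104.8, 812.6),
    ("STN003", "2001-01-02T00:00", -33.9, 151.2, 1012.4),
    ("STN001", "2001-01-02T00:00", 40.0, -105.3, 833.9),
]


def build_database(rows):
    """Load observation rows into an in-memory table indexed by time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE surface_pressure (
            station_id   TEXT NOT NULL,
            obs_time     TEXT NOT NULL,
            latitude     REAL,
            longitude    REAL,
            pressure_hpa REAL
        )
        """
    )
    # An index on the time column lets time-range subsets avoid a full scan.
    conn.execute("CREATE INDEX idx_obs_time ON surface_pressure (obs_time)")
    conn.executemany(
        "INSERT INTO surface_pressure VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return conn


def subset(conn, start, end, lat_min, lat_max):
    """Return observations in [start, end) within a latitude band."""
    cursor = conn.execute(
        """
        SELECT station_id, obs_time, pressure_hpa
        FROM surface_pressure
        WHERE obs_time >= ? AND obs_time < ?
          AND latitude BETWEEN ? AND ?
        ORDER BY obs_time, station_id
        """,
        (start, end, lat_min, lat_max),
    )
    return cursor.fetchall()
```

Because ISO 8601 timestamps sort lexicographically, the time-range filter works on plain text columns; a production system would more likely use a native timestamp type.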
CISL continued to play an important role in NCAR’s ongoing commitment to the education and training of early-career scientists, engineers, and technicians with its successful Summer Internships in Parallel Computational Science (SIParCS) program. Despite the pandemic, SIParCS moved online successfully, giving 14 graduate and undergraduate students the opportunity to learn from and work with 27 staff mentors throughout the lab. The CISL Outreach, Diversity and Education (CODE) team decided in mid-April to go virtual. Instead of traveling from places like Idaho, Oregon, Wyoming, and Puerto Rico to spend several weeks in Boulder, the interns “arrived” on each other’s computer screens on Monday, May 18, to start their internships. Moving online required significant effort by the CODE team and others.