Computational researchers speed up the analysis of climate model data

By Brian Bevirt
02/04/2015 - 12:00am

Climate models running on today’s supercomputers are producing so much data that the methods scientists use to analyze model data have begun restricting the pace of scientific discovery. At NCAR, the Community Earth System Model (CESM) running on the Yellowstone supercomputer generates data so quickly that the traditional post-processing tools now require more time to run than the experiments themselves. Three years ago, when the fifth Coupled Model Intercomparison Project (CMIP5) was in production, model data post-processing required about as much time as the model runs. Those runs produced a total of 170 terabytes of CESM data, and it took 15 months just to transpose that data into the required file format. Supercomputers that will run the upcoming CMIP6 experiments will generate multiple petabytes of data, far more than the existing post-processing tools can handle in a useful amount of time. In practical terms, CMIP6 will be unable to produce new results in a timely way unless the post-processing bottleneck is resolved.

Kevin Paul and Sheri Mickelson are computational researchers in CISL collaborating with members of the CESM Software Engineering Group to address this “big data” problem. In 2014 they created parallel post-processing tools that dramatically reduce the time required for model data analysis. This work will help researchers prepare for the World Climate Research Programme’s sixth Coupled Model Intercomparison Project (CMIP6). This worldwide research effort will standardize and coordinate scientifically important climate model experiments to advance the scope and accuracy of climate prediction science. (See sidebar: “About CMIP.”) CMIP6 is scheduled to begin production model runs in 2017, so the scientific and technological methods it requires are now being finalized by leading scientists in the climate simulation community.

Kevin Paul and Sheri Mickelson

Kevin Paul and Sheri Mickelson work in CISL’s Application Scalability and Performance Group, which performs research and develops solutions to address real-world computing issues that arise in computationally demanding Earth System science.

There are four post-processing tasks required for CMIP6 data, and CISL’s two new tools provide dramatic speedups for the first two steps:

  1. Transpose the raw files that are output by the model into a file type more suitable for analyzing the trends required by CMIP. (See sidebar: “Time-slice files and time-series files.”)
  2. Perform diagnostics on the transposed data from the model run to verify that it is producing correct results.
  3. Convert the data files to use standardized names, formats, grids, and units for comparison with data from other climate models from around the world.
  4. Submit data to the Earth System Grid for use by CMIP researchers.
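To make the first step concrete, here is a minimal, hypothetical sketch of the “transpose” it describes: per-time-step records (each holding every variable, like a time-slice file) are regrouped into per-variable records spanning all time steps (like time-series files). Plain Python dicts stand in for the netCDF files the real tools handle; the variable names are illustrative only.

```python
# Hypothetical sketch of step 1's "transpose": time-slice records
# (one dict per time step holding every variable) are regrouped into
# time-series records (one list per variable spanning all time steps).
# Real CESM output is netCDF; plain dicts stand in for files here.

def slices_to_series(time_slices):
    """Regroup per-time-step records into per-variable records."""
    series = {}
    for step in time_slices:          # one "file" per simulated time
        for name, value in step.items():
            series.setdefault(name, []).append(value)
    return series

# Two time slices, each holding the same two (made-up) variables:
slices = [{"TS": 288.1, "PSL": 1013.2},
          {"TS": 288.4, "PSL": 1012.8}]

print(slices_to_series(slices))
# {'TS': [288.1, 288.4], 'PSL': [1013.2, 1012.8]}
```

The real task is dominated by I/O rather than this regrouping logic, which is why the file formats and the parallel strategy matter so much.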

Members of CISL’s Application Scalability and Performance Group (ASAP) created functional prototypes of parallel-processing tools for the first two post-processing tasks using Python, a programming language well suited for rapid application development. Based on the timing results of the prototypes, computational scientist Kevin Paul and software engineer Sheri Mickelson jointly developed robust parallel solutions for these two steps. Kevin became the primary author of the PyReshaper tool, which efficiently converts the climate model’s time-slice output files to time-series files. And Sheri became the primary author of the PyAverager tool, which quickly extracts climate trends from the time-series files for visualization, allowing scientists to determine if the model is running accurately.

The PyReshaper is a parallel-processing tool that currently handles the first post-processing task 7 to 20 times faster than the previous method. Its speedup depends on characteristics of the component model, such as the number and type of variables being calculated, the model resolution, and the length of time being simulated. The PyReshaper uses the Message Passing Interface (MPI) library to parallelize execution over the time-series variables. It is written in Python, using the MPI4Py module and the NCL-based I/O module PyNIO. Version 0.9.1 of the PyReshaper was released to the community for use and evaluation in October 2014, and Kevin is currently incorporating suggested improvements to the tool and its user interface.
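Parallelizing over the time-series variables means each MPI rank can transpose its own subset of variables independently, with little inter-rank communication. As a hedged illustration (not the PyReshaper’s actual code), a round-robin assignment of variables to ranks might look like the following; plain integers stand in for mpi4py’s rank and size so the sketch runs anywhere:

```python
# Illustrative round-robin distribution of output variables over MPI
# ranks: each rank takes every size-th variable and writes its own
# time-series files, so the ranks need not communicate while writing.
# (A real tool would get rank/size from mpi4py's COMM_WORLD.)

def variables_for_rank(variables, rank, size):
    """Return the subset of variables assigned to one MPI rank."""
    return [v for i, v in enumerate(variables) if i % size == rank]

all_vars = ["TS", "PSL", "U", "V", "Q", "T"]   # made-up variable names
for rank in range(3):                          # pretend 3 MPI ranks
    print(rank, variables_for_rank(all_vars, rank, 3))
# 0 ['TS', 'V']
# 1 ['PSL', 'Q']
# 2 ['U', 'T']
```

Because each variable’s output file is independent, this kind of decomposition scales with the number of variables rather than requiring any shared-file coordination.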

The PyAverager is a parallel-processing tool that currently handles the second post-processing task 8 to 400 times faster than the previous method. It works on data in both time-slice and time-series file formats to compute averages of data values for use in visualizations and model diagnostics, producing multiple types of climatologically important temporal averages (monthly, seasonal, and annual). The PyAverager is written in Python using a task-parallelism approach. (See sidebar: “Parallel processing.”) It uses the NCL-based I/O module PyNIO to read the data files and the MPI4Py module to pass messages across supercomputer nodes. Version 1.0 of the PyAverager will be released to the community in early 2015 so more researchers can benefit from its speed and suggest ideas for further development.
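As an illustration of the kind of climatological average involved (a sketch, not the PyAverager’s API), the monthly climatology of a multi-year series is the mean of all Januaries, all Februaries, and so on. Plain floats stand in for the gridded fields a real diagnostic would average:

```python
# Sketch of one climatologically important average: the monthly
# climatology of a multi-year monthly time series. Each of the 12
# outputs is the mean of that calendar month across all years.

def monthly_climatology(series):
    """series: values ordered Jan..Dec, repeating once per year."""
    months = [[] for _ in range(12)]
    for i, value in enumerate(series):
        months[i % 12].append(value)       # bucket by calendar month
    return [sum(m) / len(m) for m in months]

# Two years of a fake monthly variable (year 2 is year 1 plus 2):
two_years = list(range(12)) + [v + 2 for v in range(12)]
print(monthly_climatology(two_years))
# [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
```

Each of these 12 averages is independent of the others, which is what makes the task-parallel approach described above effective.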

Both of these new parallel tools are being introduced to the climate modeling community through posters, presentations, and meetings at conferences. Sheri and Kevin also participate in model-developer working groups and in CMIP6 planning. Both the PyReshaper and the PyAverager will be available to the entire community by mid-2016, prior to the beginning of CMIP6 production.

The last two post-processing tasks also have the potential for significantly improved efficiency via parallel techniques and automation. Currently, the third task is performed by serial code that uses the Climate Model Output Rewriter (CMOR) library; as with the first two tasks, a parallel tool could significantly speed up this conversion step. The fourth post-processing step, publishing model runs on the Earth System Grid science gateway, is often performed manually, and Python scripts have great potential for automating it.
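As a speculative sketch of how the third task might be parallelized: each variable’s CMOR conversion is independent, so a worker pool can run the conversions concurrently. Here `convert_one` is a placeholder for a real call into the CMOR library, and a thread pool stands in for the MPI processes a production tool would likely use:

```python
# Speculative sketch of task-parallelizing the CMOR conversion step.
# convert_one is a placeholder; a real tool would rewrite one
# variable's netCDF file through the CMOR library. Because the
# per-variable conversions are independent, a pool of workers can
# process them concurrently with no coordination beyond the task list.

from concurrent.futures import ThreadPoolExecutor

def convert_one(variable):
    """Placeholder for a CMOR rewrite of one variable's file."""
    return f"{variable}: standardized"

def convert_all(variables, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(convert_one, variables))

print(convert_all(["tas", "psl", "ua", "va"]))
# ['tas: standardized', 'psl: standardized', 'ua: standardized', 'va: standardized']
```

For CPU-bound conversions a process pool or MPI ranks would be the natural choice; the point here is only that the task decomposition mirrors the one already proven out in the PyReshaper and PyAverager.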

Traditionally, CISL’s ASAP group has focused on decreasing the time that Earth System models spend running on supercomputers. However, as the data storage components of high-performance systems become more of a bottleneck, the group has begun focusing on accelerating the post-processing workflow as well.

Kevin and Sheri’s work has shown that significantly reducing the time spent waiting for model diagnostics could transform the way scientists use supercomputers and conduct their research. During CMIP5, the post-processing for some model runs required almost an entire day before researchers could begin analyzing the model output, so their next model run had to start long before they could adjust it using insights from their analysis of the preceding run. A model run that can be analyzed within minutes of its completion could turn modeling into a practically interactive environment for scientific inquiry. Minimizing post-processing time with parallel techniques offers researchers a new path toward breakthrough discoveries via simulation.

This visualization of global aerosol optical depth was produced from a high-resolution CESM simulation that captures planetary-scale climate modes and small-scale features simultaneously, including their mutual interaction. The CESM’s atmosphere component (CAM5) includes new schemes for fully interactive aerosols to represent many characteristics of clouds. Simulating cloud physics in a climate model was not practical several years ago because supercomputers did not have enough processing capacity for such fine resolution. By comparison, all of the CESM runs for the multi-year CMIP5 produced a total of only 170 terabytes of data. The experiment that produced this visualization simulated 100 years of global climate, consumed 25 million CPU hours on 23,404 computing cores of the Yellowstone supercomputer, and generated approximately one-half terabyte of data per simulated day. Parallel post-processing tools were used for this experiment because serial tools are impractical for analyzing 18,250 terabytes of data in time-slice files.

Kevin Paul used the PyReshaper to generate time-series files, then Sheri Mickelson used the PyAverager on those files to create monthly means of individual variables for model diagnostics and visualizations. Tim Scheitlin and Matt Rehme of CISL’s Visualization Lab created this plot of the August mean aerosol optical depth from the time-series files. Principal Investigator R. Justin Small (NCAR CGD) published a paper about this research, “A new synoptic scale resolving global climate simulation using the Community Earth System Model,” in the December 2014 Journal of Advances in Modeling Earth Systems (Vol. 6, No. 4, pp. 1065-1094). Data generated by this study are accessible from the Earth System Grid, and a high-resolution animation of hourly latent heat flux over sea surface temperature is published on YouTube.

For more information about the PyReshaper and PyAverager tools, or to become a friendly user in 2015, visit the Parallel Python Tools website.