Earth System Grid: Easy Web access to terascale climate data

By Staff
08/24/2006 - 12:00am


  The Earth System Grid
  A partnership spanning NSF, DOE, and the university community has created the Earth System Grid, a collaborative environment that links distributed centers, users, models, and data.

Just a few years ago, accessing the climate data in deep storage at various U.S. research centers was a daunting task for the global geosciences community. Different institutions formatted, organized, and served data in different ways. Authentication procedures and transfer protocols varied from site to site. Output from a single climate simulation was often archived in thousands of files. Data retrieval was complex, inefficient, and tedious.

“Climate models were running on supercomputers at many sites, putting out enormous quantities of data,” says Don Middleton of the Scientific Computing Division at the National Center for Atmospheric Research (NCAR). “One or two specialists at each site would know where the data were―and even they weren’t always sure. Researchers would contact them and they’d go off and find the data. That worked, but it didn’t scale very far at all. At the same time, there was a general sense that these data were of value to people all over the world interested in climate change research and environmental-impacts assessments.”

Since 2001, a partnership sponsored by the Department of Energy’s Office of Science under the auspices of the Scientific Discovery through Advanced Computing Program (SciDAC) has been working to make these data more generally available. Collaborators in the partnership, which spans DOE, the National Science Foundation, and the university community, include:

  • Argonne National Laboratory (ANL)
  • Lawrence Berkeley National Laboratory/National Energy Research Scientific Computing Center (LBNL/NERSC)
  • Lawrence Livermore National Laboratory/Program for Climate Model Diagnosis and Intercomparison (LLNL/PCMDI)
  • NCAR (Scientific Computing Division, High Altitude Observatory, and Climate & Global Dynamics Division)
  • Los Alamos National Laboratory (LANL)
  • Oak Ridge National Laboratory (ORNL)
  • The University of Southern California/Information Sciences Institute (USC/ISI)

Principal investigators Ian Foster of ANL, Don Middleton of NCAR, and Dean Williams of LLNL lead a team of nearly two dozen computer scientists, application developers, modelers, and Grid computing experts who are tackling the problem of distributed terascale data.

ESG: A large DataGrid

The result of their efforts is the innovative Earth System Grid (ESG), a virtual collaborative environment that links distributed centers, users, models, and data.

“ESG is a large DataGrid that provides data from computational and storage resources that are geographically distributed and not under centralized control,” says Middleton. “That’s the core of what Grid computing is all about: harnessing a collection of heterogeneous resources across different system administration groups and even across agencies, across various security boundaries and institutional policies.”

ESG makes terascale climate data as easy to access as Web pages. Its main entry points are two Web portals: one for general climate research data and another dedicated to the activities of the U.S. Intergovernmental Panel on Climate Change (IPCC). Through these portals, modelers and data managers can publish their datasets. Users can register, search, browse, and acquire the data they need.

The measure of success

“Early on, we didn’t know how many people would be interested in these data—we were hoping maybe a few hundred,” notes Middleton. “But we found out that there’s a big audience out there, quite a large group who are interested, and for many different reasons.”

Overall, ESG now has more than 3,200 registered users worldwide, ranging from climate scientists and university researchers to private companies and K-12 educators.

Since it went live in 2004, ESG has become recognized as a leading infrastructure for accessing and distributing climate model data:

  • ESG’s general portal provides access to more than 130 terabytes of climate data from the Community Climate System Model (CCSM), the Parallel Climate Model (PCM), and the Parallel Ocean Program (POP). It also provides access to model source code, initialization datasets, and tools for data publishing, analysis, and visualization. ESG data holdings are distributed across multiple sites including LANL, LBNL, NCAR, and ORNL. The portal has over 2,600 registered users who have downloaded over 25TB of data.
  • ESG’s IPCC portal, hosted at LLNL/PCMDI, indexes data from 23 climate models, while published IPCC model runs have totaled in excess of 30 terabytes of data. More than 600 scientific subprojects have registered to receive data for analysis, nearly 100TB of data has been downloaded, and an estimated 200 scientific papers have been authored focusing on the analysis of these data.

ESG collaborators from laboratories and research centers around the country connect remotely via the AccessGrid for weekly meetings. Here, members of NCAR's ESG team, located at two different facilities in Boulder, use the AccessGrid for some technological problem solving. L to R: Dave Brown, Don Middleton, Jose Garcia, Patrick West, Peter Fox, Gary Strand, Luca Cinquini, and Rob Markel.

Integrating many technologies

Accessing climate data via a Web page may look easy—but making it possible was far from easy. Developing and deploying the system required serious problem solving, says Luca Cinquini, a software engineer in NCAR's Scientific Computing Division who led ESG portal development.

“We had to integrate many pieces of Grid technology, such as the Globus toolkit, with technologies that are common in the business world, like Java, Tomcat, and Web portals,” says Cinquini. “We learned that sometimes it wasn’t as easy as might have been expected to apply new technologies to real-world applications. In some cases, we found scalability problems—things would be fine when we were serving small amounts of data, but when we started publishing more data and the size of our databases increased, we'd run into performance issues.”

Today, having overcome numerous technical challenges, ESG is at the forefront of Grid technology:

  • ESG allows remote access to multiple mass storage systems via high-performance networks.
  • Users can retrieve many files at once; they can also request an assembly of specific subsets of experiments from thousands of files, making it unnecessary for users to cope with complex details.
  • Extensive catalogs document climate data holdings and offer standard-format, searchable metadata.
  • A unique replica location service keeps track of original and duplicate files.
  • Grid security tools enable user registration, authentication, and authorization.
  • Behind the scenes, metrics log user activity and a monitoring infrastructure keeps track of resources across multiple institutions.
  • Popular climate analysis applications such as the Climate Data Analysis Tools (CDAT), the NCAR Command Language (NCL), and the Python interface to the NCL Graphics Library (PyNGL) will soon allow data analysis and visualization of ESG data from remote servers and personal computers.
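
As a rough illustration of the replica-tracking idea in the list above, the following minimal sketch maps a logical file name to the physical copies held at different sites. It is not ESG's implementation or the API of any Grid toolkit: the class name ReplicaCatalog, its methods, the file name, and the example.org URLs are all hypothetical, chosen only to show the concept.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Toy catalog (hypothetical, not ESG code): logical file name -> physical copies."""

    def __init__(self):
        # Each logical name maps to a list of (site, url) pairs.
        self._locations = defaultdict(list)

    def register(self, logical_name, site, url):
        """Record that `site` holds a copy of `logical_name` at `url`."""
        entry = (site, url)
        if entry not in self._locations[logical_name]:
            self._locations[logical_name].append(entry)

    def lookup(self, logical_name, preferred_site=None):
        """Return all known copies, listing the preferred site's copy first if it has one."""
        copies = list(self._locations.get(logical_name, []))
        # Stable sort: entries at the preferred site move to the front,
        # all other entries keep their registration order.
        copies.sort(key=lambda entry: entry[0] != preferred_site)
        return copies


# Illustrative usage with made-up names and URLs.
catalog = ReplicaCatalog()
catalog.register("ccsm/run1/tas.nc", "NCAR", "https://ncar.example.org/tas.nc")
catalog.register("ccsm/run1/tas.nc", "ORNL", "https://ornl.example.org/tas.nc")
copies = catalog.lookup("ccsm/run1/tas.nc", preferred_site="ORNL")
```

A user-facing portal built on such a catalog could hand out whichever copy is closest or least loaded, so the user never needs to know where the original file lives.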

Effectively delivering data to the community

“My view of ESG is that it’s been an immensely successful project for a number of reasons,” says Peter Fox, chief computational scientist of NCAR’s High Altitude Observatory and a member of the ESG development team. “ESG is a significant implementation of a highly distributed collaboration with access to large amounts of climate data. It’s not a prototype environment, it’s not a development environment―it’s a real production infrastructure, delivering a lot of data to the community in a very effective way. It allows scientists to focus on science rather than on the excruciating details of how data is organized, formatted, or transferred.”

“I’m tremendously pleased with ESG,” adds Gary Strand, a software engineer and data manager in NCAR’s Climate and Global Dynamics Division (CGD), who acts as a liaison between climate modelers and ESG developers. “I think back to a few years ago to when we had fewer data holdings and dealt with a small community; even then it was stretching our limits to fill data requests. These days, we’re making a lot of data available to people more efficiently than before. Researchers can get into archival systems without having an account at those sites. ESG is a groundbreaking first attempt to get these literally millions of files and terabytes of data out to the wide world; it’s a resounding success.”

Managing enormous datasets

  William Collins, a scientist in NCAR's Climate and Global Dynamics Division, serves on the CCSM Scientific Steering Committee.

William Collins, a CGD scientist who developed one of the first techniques for integrating aerosol data into global climate models, serves on the CCSM Scientific Steering Committee. He notes that the dataset generated by CCSM, about 100 terabytes of simulation output in total, is the largest ever generated by a community model from NCAR. ESG is playing an increasingly important role in helping researchers find patterns and meaning in these data.

“Traditional tools for managing datasets break down once you reach the volume of model simulations that we now attain,” Collins says. “We’re working with ESG software engineers and computer scientists to figure out how best to provide an entry point and a hierarchical method for exploring these enormous datasets. This will make it much easier for us to intercompare features in different simulations—for example, to look at how the physics of the climate respond to different climate change scenarios.

“ESG capability will become even more critical once we create our first-generation Earth System model, otherwise known as CCSM4. The output generated by that model will be so large and rich that we will need to partner closely with ESG to exploit the data ourselves, as well as to provide the data as a resource to the wider community.”

ESG and other scientific projects

Other projects are benefiting from ESG innovations, says Fox, whose team in NCAR's High Altitude Observatory helped to develop OPeNDAP-g, a Grid extension to OPeNDAP, for ESG. (OPeNDAP, which stands for the Open-source Project for a Network Data Access Protocol, is a widely used protocol for scientific data networking.)

“DOE’s investment in ESG has had a scientific impact, a technical impact, and a broader impact, allowing us to build production software that we can actually use in other efforts,” he says. “The work we’ve done with OPeNDAP-g has been reintegrated back into the community release of OPeNDAP, so the entire community outside ESG will reap the benefits of the improvements. That’s everyone from the ocean sciences and other atmospheric sciences to space sciences. It’s being used by a large number of researchers all over the world, as far as Australia, Europe, and Japan.”

The Virtual Solar-Terrestrial Observatory, a National Science Foundation–funded joint project of NCAR and McGuinness Associates, is also utilizing OPeNDAP-g as well as the ESG catalog and portal infrastructures, as are the Semantically Enabled Science Data Integration (SESDI) and the Sun-Earth Connection Distributed Data Service (SECDDS) projects from NASA. The Portal User Registration Service (PURSe), an NSF middleware initiative, has adopted elements of ESG’s security design.

ESG has had productive collaborations and interactions with a number of national and international groups, including the University Corporation for Atmospheric Research’s Unidata program, the Earth System Modeling Framework (ESMF), the Global Organization for Earth System Science Portals (GO-ESSP), the National Operational Model Archive and Distribution System (NOMADS) of the National Oceanic and Atmospheric Administration (NOAA), NOAA’s Geophysical Fluid Dynamics Laboratory, the British Atmospheric Data Centre (BADC), NSF’s Linked Environments for Atmospheric Discovery (LEAD) project, and the Geosciences Network (GEON).

In addition, ESG partners ORNL and NCAR are both now members of the NSF TeraGrid effort. ESG will be exploring opportunities with the TeraGrid community to collaborate and expand capabilities even more.

Making research easier, better, and faster

Networked knowledge increases the rate of scientific discovery. By fostering multidisciplinary partnerships, ESG encourages the study of complex problems such as climate variability from many perspectives. ESG’s system of interlinked data and resources helps researchers do their work more easily, better, and faster.

“Originally, we just wanted to build a system to make climate model data available to the world,” says Middleton. “But what happened was that we have built the beginnings of a science gateway, where models, source code, initialization datasets, post-processing applications, and analysis and visualization tools are all in one place, for use by a common user community. And all with access control and metrics. It’s a delightful outcome, and not particularly expected.”

That could be just the beginning.

“Right now, we’re looking at what the scientific landscape is going to be in 2012,” Middleton says. “We’re planning a comprehensive, next-generation cyberinfrastructure that spans data management, access, analysis, and visualization in a distributed knowledge environment.”

Such an infrastructure could extend the frontiers of climate research and lead to the solution of some of today’s most compelling scientific mysteries.

Photos: Lynda Lester, NCAR/CISL