IMAGe summer conferences focus on Big Data for research

By Brian Bevirt
10/07/2015 - 12:00am

In 2015, CISL’s Institute for Mathematics Applied to Geosciences (IMAGe) offered a variety of conferences designed to help Earth science researchers cope with the ever-increasing challenges of “Big Data.” In May, the IMAGe-STATMOS Summer School in Data Assimilation was part of a series designed to help train the next generation of researchers working in data-rich disciplines. It brought together graduate students, early-career scientists, and senior scientists in environmental statistics and related fields to explore contemporary topics in applied environmental data modeling. During their four days at the workshop, participants received an introduction to data assimilation methods and their applications, as well as hands-on training in the use of IMAGe’s Data Assimilation Research Testbed (DART).

Then in June, IMAGe presented a week-long Data Analytics Bootcamp for High School Students, an opportunity for 10 Boulder high school sophomores and juniors to learn about being a data scientist. Demand for data scientists continues to increase as the Big Data era produces data in varieties and volumes far exceeding anything scientists and engineers have ever had to manage before. The bootcamp’s curriculum was an engaging hands-on experience for the students as they performed exercises using authentic data to analyze and solve real-life problems.

IMAGe hosted three more conferences in July, August, and September to continue developing researchers’ skills in Environmental Data Analytics, Ensemble Data Assimilation, and Climate Data Informatics. These conferences support the research communities’ need to extract scientific knowledge from the petabytes of data being produced by today’s instruments and computers. Each is described in the sections below.

Data analytics is the discipline of interpreting data to discover useful information and patterns with the goal of answering specific questions, gaining scientific insight, or making more effective decisions. This field also places value on eliciting questions that can be answered by specific data sets and communicating results in ways that are understandable to a nontechnical audience. Data analytics uses methods and algorithms drawn from statistics and computer science to help researchers explore the ever-increasing volume of data that supports science, engineering, medicine, commerce, and many other aspects of society. For NCAR researchers, effective data analytics reveals more scientific information from both observations and numerical simulations, and it often produces graphics to communicate results visually.

Data assimilation refers to methods that combine data from observations and the output of numerical models to provide improved estimates and better prediction of real systems. A familiar application of data assimilation in the geosciences is weather forecasting, where a large set of weather observations is combined with the output of a numerical weather model to make forecasts. At NCAR, data assimilation is also used to improve climate models and check physical models against observations.
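As a minimal sketch of this idea, assume a single scalar quantity and independent Gaussian errors with known variances. A model forecast and an observation can then be blended by weighting each according to how uncertain it is. The function name and the numbers below are illustrative assumptions, not taken from any NCAR system.

```python
def assimilate_scalar(forecast, forecast_var, obs, obs_var):
    """Combine a model forecast and an observation of the same scalar quantity.

    Assumes independent, unbiased Gaussian errors with known variances.
    Returns the updated estimate (the "analysis") and its error variance.
    """
    # The gain is the fraction of the forecast-observation gap to close.
    # It approaches 1 when the observation is far more precise than the forecast.
    gain = forecast_var / (forecast_var + obs_var)
    analysis = forecast + gain * (obs - forecast)
    # The combined estimate is never less certain than either input.
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


# Hypothetical example: a model predicts 20.0 degrees C (error variance 4.0)
# and a thermometer reads 22.0 degrees C (error variance 1.0).
analysis, analysis_var = assimilate_scalar(20.0, 4.0, 22.0, 1.0)
print(analysis, analysis_var)
```

With these made-up inputs the analysis lands close to 21.6, nearer the more precise observation, and its variance of about 0.8 is smaller than that of either the forecast or the observation alone.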

Data informatics is a discipline for examining large data sets to find patterns and structure that can help in understanding the relationship between different variables or in making predictions. Climate data informatics broadly refers to any research combining climate science with approaches from statistics, machine learning, and data mining. Conferences that bring together researchers from all of these areas stimulate the discussion of new ideas, foster new collaborations, grow the climate informatics community, and accelerate discovery across disciplinary boundaries.

To continue making progress on the grand challenges of numerically modeling the Earth System, CISL must efficiently manage data in its many forms to advance the supporting sciences. These forms range from the Big Data problems associated with model output and remotely sensed observations to the wide range of small but vital historical data sets that document past climate and important geophysical processes. Historical data is used to study past variability, for insight into processes, and for reanalysis projects. Geophysical models depend on data for initial fields and forcing variables, and these models typically generate substantial and complex data objects that require interpretation. These data objects allow scientists to produce predictions and reanalysis products for past weather, as well as to diagnose model shortcomings.

The volume and diversity of climate data from satellites, environmental sensors, and climate models have greatly increased, improving our understanding of the climate system. However, these increases can make traditional analysis tools impractical and necessitate new disciplines and tools to discover knowledge in the data. To meet the varied needs of our research communities, IMAGe research takes an interdisciplinary approach where collaboration with scientific teams within and outside NCAR helps to motivate new software tools and analysis methods. CISL’s data-centric view has helped to integrate research on several different aspects of computational and mathematical science.

Second Annual Graduate Workshop on Environmental Data Analytics

The Second Annual Graduate Workshop on Environmental Data Analytics was held at NCAR’s Mesa Lab campus on 27-31 July 2015. This workshop was part of an ongoing series designed to prepare the next generation of researchers and practitioners to work within and contribute to the data-rich era. Each workshop brings together researchers from graduate students to senior scientists in environmental statistics and related fields to explore contemporary topics in applied environmental data modeling.

Data analytics workshop
Students participating in the 27 July tutorial, “Introduction to Bayesian statistics and modeling for environmental and ecological data,” presented by Alix Gitelman of the Department of Statistics at Oregon State University. (Photo by Brian Bevirt, CISL)

Across scientific fields, researchers face challenges coupling data with imperfect models to better understand variability in their system of interest. Inferences garnered through these analyses support decisions with important economic, ecological, and social implications. Increasingly, the bottleneck for researchers is not access to data; rather, it is the need to identify and apply appropriate statistical methods using efficient software.

This second annual workshop offered hands-on computing and modeling tutorials, presentations from graduate student participants, and invited talks from early-career and established leaders in environmental data modeling. Tutorials and invited talks addressed useful ideas and tools that are directly applicable to student participants’ current and future research. Working group breakout sessions were convened multiple times each day to generate and synthesize new ideas. Seven of the 29 participants came from EPSCoR states. The participants:

  • Developed new modeling and computing skills through hands-on analyses and lectures led by quantitative scientists.
  • Shared research findings and explored open questions in the environmental, ecological, climatic, and statistical sciences.
  • Learned about NCAR and National Ecological Observatory Network (NEON) data resources that can facilitate scientific discovery.

Andrew Finley
Andrew Finley of the Department of Forestry at Michigan State University presented his tutorial, “Hierarchical Bayesian spatial-temporal models and software,” on the fourth day of the workshop. (Photo by Brian Bevirt, CISL)

The tutorials included:

  • Introduction to Bayesian statistics and modeling for environmental and ecological data, by Alix Gitelman, Department of Statistics, Oregon State University.
  • Climate data analytics, by Doug Nychka, IMAGe, NCAR.
  • Hierarchical Bayesian spatial-temporal models and software, by Andrew Finley, Department of Forestry, Michigan State University.

The workshop featured a tour of the NSF’s National Ecological Observatory Network (NEON) headquarters in Boulder, where participants interacted with NEON scientists and engineers. NEON is a continental-scale research instrument consisting of geographically distributed infrastructure that is networked via cybertechnology into an integrated research platform for regional- to continental-scale ecological research.

Frontiers in Ensemble Data Assimilation for Geoscience Applications

Data assimilation combines observed data with numerical models to improve predictions. An ensemble assimilation uses a sample of states of the system where the variation among the ensemble members quantifies the uncertainty in the state. The use of this technique in geoscience applications was the topic of IMAGe’s Theme of the Year conference for 2015. It was held at NCAR’s Mesa Lab campus on 3-7 August. Indicating the international appeal of ensemble data assimilation, 13 of the 27 participants came from non-U.S. universities. Two participants came from EPSCoR states.
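The ensemble idea described above can be illustrated with a small sketch. It assumes a scalar state that is observed directly, Gaussian errors, and a deterministic update that shifts the ensemble mean and shrinks the spread. The function name and values are hypothetical, and this is not presented as DART’s implementation.

```python
import statistics


def ensemble_update(members, obs, obs_var):
    """Adjust a scalar ensemble toward one observation of the same quantity.

    The ensemble mean is the state estimate and the ensemble variance is
    its uncertainty. Assumes Gaussian errors, a direct observation, and an
    ensemble of at least two members with nonzero spread.
    """
    prior_mean = statistics.fmean(members)
    prior_var = statistics.variance(members)  # sample variance (n - 1)

    # Product of two Gaussians: precisions add, means are precision-weighted.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)

    # Shift every member to the new mean and scale its deviation so the
    # updated ensemble has the posterior variance.
    shrink = (post_var / prior_var) ** 0.5
    return [post_mean + shrink * (x - prior_mean) for x in members]


# Hypothetical five-member ensemble of temperature forecasts (degrees C).
prior = [18.0, 19.0, 20.0, 21.0, 22.0]
posterior = ensemble_update(prior, obs=22.0, obs_var=1.0)
print(statistics.fmean(posterior), statistics.variance(posterior))
```

In this example the ensemble mean moves from 20.0 toward the observation and the ensemble variance drops below its prior value of 2.5, which is how the reduced spread expresses increased confidence in the state.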

Data assimilation conference participants
IMAGe’s 2015 Theme of the Year conference drew participants from around the world to learn about “Frontiers in Ensemble Data Assimilation for Geoscience Applications.” (Photo by Brian Bevirt, CISL)

Each Theme-of-the-Year (TOY) presented by IMAGe is a series of programs – typically with multiple events throughout the year – with each year’s theme focused on a specific aspect of mathematics applied to the geosciences. Each TOY is designed to advance research, education, and collaboration between the mathematics and geosciences communities. The TOY programs focus on potentially rewarding research activities and encourage contributions from talented young investigators in a variety of disciplines.

Frontiers in Ensemble Data Assimilation for Geoscience Applications focused on (1) ensemble data assimilation for atmosphere, ocean, land, and coupled Earth System models, and (2) hybrid ensemble variational assimilation techniques. Participants explored current techniques and applications of data assimilation in the geosciences.

The conference was preceded by a graduate student tutorial that prepared graduate students interested in data assimilation to conduct research using DART with a variety of geoscience models and observations. It covered:

  • Fundamentals of ensemble data assimilation using idealized models, by Jeff Anderson, NCAR IMAGe.
  • Using the Data Assimilation Research Testbed (DART) community software facility, and how to apply useful diagnostics. IMAGe instructors Tim Hoar, Nancy Collins, and Jeff Anderson then assisted participants during their lab exercises.
  • Using DART with the Weather Research and Forecasting (WRF) model.
  • Using DART with the Community Earth System Model (CESM).

Kevin Raeder
Kevin Raeder (NCAR IMAGe) presented a talk on “Examples of Research Tools Enabled by CESM Atmospheric Models and DART.” (Photo by Brian Bevirt, CISL)

The conference’s extensive program featured 13 presentations by data assimilation (DA) practitioners and researchers from national research labs, NCAR, and U.S. and international universities. Topics included hybrid ensemble-variational DA methods, automated estimation of localization for DA, ensemble-based DA for tropical cyclones, using DA to improve reanalysis data, research tools enabled by DA, challenges and opportunities in global land DA, DA with the Community Land Model, ensemble sensitivity analysis, operational DA for numerical ocean predictions, challenges in global ocean DA, implementing regional DA in the Northwest Atlantic, and coupled DA.

The conference also included a two-hour student poster and information exchange session and a concluding panel discussion on data assimilation and related topics.

Fifth International Workshop on Climate Informatics

The Fifth International Workshop on Climate Informatics was held 24-26 September 2015 at NCAR’s Mesa Lab campus. Most of the 86 participants came from U.S. research universities, with 13 from international universities, 10 from corporations, and 10 from other research laboratories. This workshop series was co-founded by Claire Monteleoni (George Washington University) and Gavin Schmidt (NASA Goddard Institute for Space Studies), and the first workshop was held in 2011 at the New York Academy of Sciences. Monteleoni is the Principal Investigator of the workshop series’ multi-year NSF grant, and a variety of other sponsors help fund the series. Participants in the first workshop produced a book chapter titled “Climate Informatics” in Computational Intelligent Data Analysis for Sustainable Development; Data Mining and Knowledge Discovery Series, CRC Press, Taylor & Francis Group.

Climate informatics workshop participants
The 86 participants in the Fifth International Workshop on Climate Informatics interact during the “Knowledge discovery in climate science” presentation by Imme Ebert-Uphoff of Colorado State University. (Photo by Brian Bevirt, CISL)

The fifth workshop’s poster session and reception on the first night featured more than 40 posters. Also on the program were two panel discussions, “Deep Learning for Climate Science” and “Encoding climate knowledge into climate learning,” that were designed to generate new ideas across research disciplines.

Climate informatics poster session
About half of the conference participants presented posters during the two-hour poster session and reception on the first evening, which prompted numerous discussions about their work. (Photo by AJ Lauer, CISL)

In addition to numerous other presentations, the seven invited speakers delivered a rich diversity of thought-provoking talks:

  • Opportunities and Challenges in the Analysis of Multi-model Ensemble Output, by Claudia Tebaldi, NCAR CGD.
  • Recent Machine Learning Methods and Their Potential for CI, by Lawrence Carin, Duke University.
  • Dealing with Dirty Data: A Blueprint for Analyzing Climate Variability, by Andrew Rhines, Harvard University.
  • Hazardous convective weather risk: Big and small data problems, by Mike Tippett, Columbia University.
  • Do Deep Nets Really Need To Be Deep?, by Rich Caruana, Microsoft Research.
  • Knowledge discovery in climate science, by Imme Ebert-Uphoff, Colorado State University.
  • Intelligent Systems for Climate Research: When Will Deep Learners Meet Deep Knowledge?, by Yolanda Gil, University of Southern California.

Climate Informatics Workshop principals
This photo from the poster session shows four of the workshop’s organizers. From left to right: Imme Ebert-Uphoff (Colorado State University), steering committee and program committee; Yan Liu (University of Southern California), workshop co-chair; Claire Monteleoni (George Washington University), co-founder of the workshop series, steering committee, and co-organizer of the hackathon; and Doug Nychka (NCAR), steering committee, host, and Director of IMAGe. (Photo by AJ Lauer, CISL)

The workshop organizers describe the series as follows. “We have greatly increased the volume and diversity of climate data from satellites, environmental sensors and climate models in order to improve our understanding of the climate system. However, this very increase in volume and diversity can make the use of traditional analysis tools impractical and necessitate the need to carry out knowledge discovery from data. Machine learning has made significant impacts in fields ranging from web search to bioinformatics, and the impact of machine learning on climate science could be as profound. However, because the goal of machine learning in climate science is to improve our understanding of the climate system, it is necessary to employ techniques that go beyond simply taking advantage of co-occurrence, and, instead, enable increased understanding.

“The Climate Informatics workshop series seeks to build collaborative relationships between researchers from statistics, machine learning and data mining and researchers in climate science. Because climate models and observed datasets are increasing in complexity and volume, and because the nature of our changing climate is an urgent area of discovery, there are many opportunities for such partnerships.”

The format of the workshop emphasized communication among all the various fields, with a strong emphasis on brainstorming during the breakout sessions and panel discussions. The Climate Informatics website provides a place for interested researchers to interact, share data sets, access materials from past workshops, and learn about upcoming events.

An extra full day was added on the Saturday after this workshop for NCAR’s first data science “hackathon,” where participants were given a challenge problem in climate informatics. Small teams were formed to implement machine-learning and data-mining algorithms using the Python programming language. More information about this hackathon will be published in a future article.