New workshop prepares high school students to be Data Scientists

By Brian Bevirt
07/12/2015 - 12:00am

Simply stated, data science and data analytics extract knowledge from data. CISL’s first “Data Analytics Bootcamp for High School Students” gave 10 sophomores and juniors from the Boulder Valley School District an opportunity to learn about being a data scientist, a new kind of job that has grown with the internet and digital information. The five-day bootcamp was held 22–26 June 2015 at NCAR’s Mesa Lab facility in Boulder.

Dorit Hammerling guides two students
Dorit Hammerling (standing), lead organizer and developer of the workshop, stated, “Data analytics might sound scary at first to young people, but by approaching the field using data sets and problems that are both interesting and practical, this effort is designed to attract new talent to the discipline by providing hands-on experience for young people interested in pursuing careers involving scientific data analysis.” Photo by Brian Bevirt, CISL

Demand for data scientists continues to increase as the Big Data era produces data in varieties and volumes far exceeding anything scientists and engineers have ever had to manage before. Effective data analysis – using data to answer practical questions – underpins decision making in many fields and is the power behind many of the most successful web enterprises, including Google, Facebook, Amazon, and Orbitz. For NCAR researchers, effective data analysis also promises to unlock more scientific information from observations and numerical simulations in the geosciences. This bootcamp introduced data analysis concepts through exercises that applied real data to real-life situations. Some of the examples covered concepts in climate, while others were just fun, such as analyzing the performances of basketball players and pricing used cars.

Data analytics is the discipline of interpreting raw data to discover useful information and draw conclusions that help people make more effective decisions. Data analytics uses many tools to help people explore the ever-increasing volume of data that drives science, medicine, commerce, and so many other aspects of contemporary life. One of these tools is R, an open-source language for statistical computing that the workshop participants learned and used for their data analysis projects.

This workshop is a new outreach effort sponsored by CISL’s Institute for Mathematics Applied to the Geosciences (IMAGe) and was provided at no cost to the students. Organizer Dorit Hammerling (IMAGe Project Scientist II) and sponsor/co-organizer Doug Nychka (IMAGe Director) designed the curriculum to be a hands-on and engaging experience for high school students. Several instructors presented a sequence of 15-minute lessons, each consisting of 5 minutes of teaching followed by 10 minutes of hands-on exercises in which students applied their new knowledge. This 1/3 learning – 2/3 doing format kept students continually engaged, and exercises that used authentic data to analyze real-life problems sustained their interest. By connecting with self-motivated young people as early in their lives as possible, Hammerling and Nychka want to stimulate their interest in using data analysis to solve real problems. This work is important because the scientific and technological workforce of today and tomorrow requires more well-trained people who can extract meaning and value from very large data sets. A distinctive feature of the material was its integration of math and statistics with computing and programming, the way they are typically combined to analyze a data set. With the focus on computation, a highlight of the bootcamp was a field trip to NCAR’s new supercomputing center in Wyoming.

Another highlight of the workshop was a two-part session about the melting of the Earth's polar ice. Students first learned how to create movies in R, then used an ice data set to visually explore how ice has been receding in recent years. Next, the students, instructors, and several other interested NCAR employees watched the Chasing Ice documentary, which tied together the movies the students created with a powerful illustration of the Earth's changing climate. After the movie, students asked questions of Kevin Schaefer and Rachel McCrary, two Boulder-area scientists who shared their knowledge about snow and ice. At the end of the workshop, several students expressed how much they enjoyed these sessions, and some even began making their own movies on the following day!

Students practice new skills
Each pair of students received guidance from one expert during the workshop exercises. The support staff shown in this photo includes, from left to right, Colette Smirniotis, Dorit Hammerling, Lee Richardson, and Nathan Lenssen. The 10-minute exercise being conducted here followed five minutes of instruction in a new concept. This format was designed to sustain student interest during the intensive training and ensure that each participant had immediate, supported practice applying their new skills. Photo by Brian Bevirt, CISL

Session leaders also provided individual support for the students, with at least five instructors available during all sessions to ensure a student-teacher ratio of 2:1 or better. These educators included:

  • Dorit Hammerling, Project Scientist II in IMAGe, investigates spatio-temporal statistical methods applied to the geosciences with a focus on massive data sets from models and observations.
  • Doug Nychka, NCAR Senior Scientist in IMAGe, applies spatial statistics and Bayesian statistics to large data sets and is the main developer of two R packages for data analysis: LatticeKrig and fields.
  • Amanda Hering, Assistant Professor of Applied Mathematics and Statistics at Colorado School of Mines, researches spatial and space-time modeling, wind speed forecasting, and model validation.
  • William Kleiber, Assistant Professor in the Department of Applied Mathematics at the University of Colorado at Boulder (and a former post-graduate scientist in IMAGe), researches multivariate process modeling, geophysical computer model calibration and emulation, and stochastic modeling of physical systems.
  • Kevin Schaefer, Research Scientist III at the National Snow and Ice Data Center (guest presenter and discussion leader), specializes in permafrost carbon feedback, modeling the terrestrial biosphere, and biogeochemistry.
  • Randy Russell, Educational and Instructional Designer II at the UCAR Center for Science Education, develops educational technologies including computer-based games, interactive simulations, and virtual labs for science education.
  • John Paige, Research Associate at Lawrence Berkeley National Laboratory and IMAGe visitor who is about to start his graduate studies in statistics at the University of Washington, researches uncertainty quantification in climate models and efficient, parallel computing tools for spatial models.
  • Rachel McCrary, Postdoctoral Fellow in IMAGe’s Regional Integrated Science Collective, investigates climate, climate modeling, downscaling, and uncertainty quantification.
  • Dan Milroy, a graduate student in computer science at the University of Colorado at Boulder and a systems administrator at CU's Research Computing division, is co-advised by Dorit Hammerling and Allison Baker (CISL Technology Development Division).
  • Colette Smirniotis is a graduate student in computational statistics at San Diego State University visiting IMAGe as a SIParCS intern.
  • Lee Richardson is a graduate student in statistics from Carnegie Mellon University visiting IMAGe as a SIParCS intern.
  • Vinay Ramakrishnaiah is a graduate student in computer, electronic, and electrical engineering from the University of Wyoming visiting IMAGe as a SIParCS intern.
  • Nathan Lenssen is a graduate student in statistics from Columbia University and a former IMAGe intern.

This eclectic group of educators collaborated to create and deliver the intensive curriculum that taught practical skills using R for data analysis while covering six fundamental concepts in data analytics:

  • Fundamentals of statistics and data types.
  • Exploratory data analysis and visualization.
  • Multivariate linear regression.
  • Categorical data analysis.
  • Data collection and survey analysis.
  • High performance computing and its role in data analysis.

Hammerling summarized this new workshop’s outcome: “All the students learned about data analysis and developed skills using the R statistical programming environment to solve problems. They left the workshop with R skills that they can readily apply in internships or other employment opportunities. And IMAGe hopes to hire some of these freshly trained people as student assistants to help advance our current research projects.” She concluded with, “On the last day, the students did an analysis by themselves – coding it from scratch – to solve a real-world question. One student then presented his findings and code development to the group. It was impressive to witness how many skills this group of talented students acquired in one week.”