Machine learning bootcamp for early-career scientists draws outsized interest and participation at NCAR
by Shira Feldman
Early-career scientists at the National Center for Atmospheric Research (NCAR) recently had the opportunity to attend a two-day “Introduction to Machine Learning” workshop, which took place November 7–8 in NCAR’s Mesa Laboratory Library.
This first-of-its-kind event was sponsored by the Education, Engagement & Early-Career Development (EdEC) department. It was also hosted and supported by the Machine Integration and Learning for Earth Systems (MILES) machine learning group, part of the Computational and Information Systems Lab (CISL).
The introductory-level workshop offered professional development to early-career, NCAR-affiliated scientists who had some interest in machine learning but little or no previous experience. The session took place entirely in person, with organizers deliberately keeping enrollment low to fit the Mesa Lab library space. Although organizers capped the event at 34 participants, more than 50 potential attendees expressed interest.
Thanks to this unexpected level of interest, planners may repeat the introductory-level course. In the meantime, interested learners can access the course materials on GitHub. An intermediate-level course, following the same framework as the first, is planned for late winter.
Five primary instructors taught the course: Charlie Becker, Associate Data Scientist II in MILES; John Schreck, Machine Learning Scientist I in MILES; Will Chapman, Project Scientist I in the Climate & Global Dynamics Lab (CGD); Kirsten Mayer, Project Scientist I in CGD; and Thomas Martin, a Software Engineer at Unidata.
This set of instructors, spanning multiple labs and disciplines at NCAR, reflects the MILES group’s academic diversity: MILES features a core group of four scientists, but it also boasts a large affiliated membership affectionately known as “MILES Plus.”
The course’s prevailing philosophy was hands-on: instructors ran the bootcamp in a Jupyter notebook format, with no direct split between lecture and coding exercises. “The idea was to integrate both for an interactive environment of lecturing and the live coding experience,” said Becker, who led the workshop along with Unidata’s Martin. “Half of the second day was an open forum where people were encouraged to come with their own data and their own questions about machine learning, where we had a variety of instructors there to help guide them through and answer any remaining open questions they had specific to their use cases.”
Becker spoke to the advantages of an event for early-career professionals: “One, it's a strategic priority within the NCAR organization moving forward,” he emphasized. And secondly, “we’re increasing the collaboration among folks who are interested in machine learning, both amongst themselves and with the machine learning group here in CISL.” Becker was pleased to witness the participants’ increased engagement and cooperation as they worked through course problems in person and in real time.
Becker added: “I think a lot of people were surprised to see some of the use cases for machine learning that they totally didn’t realize existed and some of the cool work we’re doing.”
Use cases drew on both climate and weather data. “We used normalized sea surface temperature (SST) data to predict El Niño with various lead times,” said Becker. “And on the weather side, we used atmospheric profiles from the Rapid Refresh Model to predict frozen precipitation type.”
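For readers unfamiliar with the idea of predicting at a lead time, the short sketch below shows one way such a setup might look: normalize a series, then pair each value with the value several steps later, so a model learns to map "now" to "later." It is an illustration only, not the workshop's code. The SST values are invented, and the helper names `normalize` and `make_lead_pairs` are hypothetical.

```python
from statistics import mean, pstdev


def normalize(series):
    """Standardize a series to zero mean and unit variance."""
    mu = mean(series)
    sigma = pstdev(series)
    return [(x - mu) / sigma for x in series]


def make_lead_pairs(series, lead):
    """Pair each value at time t with the value `lead` steps later."""
    if lead < 1 or lead >= len(series):
        raise ValueError("lead must be between 1 and len(series) - 1")
    return list(zip(series[:-lead], series[lead:]))


# Hypothetical monthly SST values in degrees Celsius, invented for illustration.
sst = [26.1, 26.4, 27.0, 27.8, 28.3, 27.9, 27.2, 26.5]

sst_norm = normalize(sst)
# Each pair is (input at month t, target at month t + 3).
pairs = make_lead_pairs(sst_norm, lead=3)
```

A longer lead leaves fewer training pairs and usually makes the prediction harder, which is one reason to compare several lead times.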
One unique aspect of the frozen precipitation dataset was its crowdsourced status. Workshop participants got their data from an app called mPING that crowdsources weather observations. “Anyone can download a particular app and essentially just grab their smartphone and look out their window and say ‘yeah that looks like freezing rain,’ and mark their location via GPS, at any given time,” said Becker.
Crowdsourced resources can provide new information that fills longstanding gaps for the machine learning field, but they also introduce new quandaries. “It makes for a very interesting problem, as there's many biases within the dataset, but it's also useful because precipitation type observations are very rare, very sparse,” explained Becker. “Machine learning is very data-dependent and is driven by the quality of the data that we feed the algorithms. Precipitation is a certain type of phenomenon that we don't have a lot of good observational data at the right scales and resolutions we're looking for. So that’s one motivation to use crowdsourced data.”
“For example,” continued Becker, “airports have good observations, but there aren’t very many airports across the entire U.S. So training a model with data from airport observations only likely won’t capture very much of the phenomena we’re interested in. The crowdsourced data is much more granular.”
Issues like this demonstrated to learners both the advantages and disadvantages of machine learning. “Are there advantages of being able to do things in the machine-modeling space that we can't necessarily do with physical-based modeling? Sure,” said Becker. “So computationally, for example, we might be able to characterize uncertainty better by being able to run really large ensembles.”
Becker described the overarching missions and themes of the workshop, and what he hopes to teach past and future participants (as well as independent learners who access the online course material). “The broad goal was learning end-to-end. The machine learning field is a vast field, not only of model architectures and techniques, but in that a lot of thought and consideration need to go into the data curation. The model’s only as good as the data you give it, right?”
He continued: “We need to carefully curate and think about the data and what problems we can potentially solve with it, and then all the pre-processing that needs to happen, and then the modeling phase, and then perhaps unique evaluation of these models too. In the course, we exposed people to the end-process of machine learning: conceptualizing the problem, collecting the data that's appropriate for the problem, pre-processing it, training a model, and then evaluating it. The essential goal was to expose people to this broad pipeline, and then, at the end of the day, allow them to take these techniques to solve their own problems.”
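The pipeline Becker describes can be sketched in miniature. The example below is a toy under stated assumptions, not the course material: the records are invented, the "model" is a single temperature threshold rather than anything used in the workshop, and the names `fit_threshold` and `predict` are hypothetical. It walks through the same stages: collect data, pre-process it, train a model, and evaluate it.

```python
from statistics import mean

# 1. Collect: hypothetical (surface temperature in C, reported type) records,
#    invented for illustration.
records = [
    (-5.0, "snow"), (-3.2, "snow"), (-1.0, "snow"), (-0.5, "snow"),
    (1.5, "rain"), (2.0, "rain"), (4.1, "rain"), (6.3, "rain"),
    (-2.0, "snow"), (3.0, "rain"),
]

# 2. Pre-process: hold out the last two records for evaluation.
train, test = records[:8], records[8:]


# 3. Train: place the threshold halfway between the two class means.
def fit_threshold(rows):
    snow = [temp for temp, label in rows if label == "snow"]
    rain = [temp for temp, label in rows if label == "rain"]
    return (mean(snow) + mean(rain)) / 2


def predict(threshold, temp):
    """Classify as snow below the threshold, rain otherwise."""
    return "snow" if temp < threshold else "rain"


threshold = fit_threshold(train)

# 4. Evaluate: fraction of held-out records classified correctly.
correct = sum(predict(threshold, temp) == label for temp, label in test)
accuracy = correct / len(test)
```

Evaluating on records the model never saw during training is the key design choice here; scoring on the training data alone would overstate how well the model generalizes.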
Given the unanticipated demand for the course, organizers are carefully considering the next step. “We’re still thinking about what this will look like, given that it got more interest than we were able to fulfill,” said Becker. “Do we do another iteration of this ground-zero base course? Which is still an open question. I do want to highlight that we are going to release all the materials to be worked through independently.”
In the future, organizers also plan to place the course material on Project Pythia for anyone to access.