SIParCS technical projects 2023

The following 14 projects are being pursued by 2023 SIParCS interns. Selected from more than 150 applicants and representing 14 colleges and universities, the 16 SIParCS participants began their 11-week internships on May 22 and will conclude August 4 with their project presentations. This year, all the interns chose to pursue their projects in person in Boulder. Unless otherwise noted, all mentors are from CISL.

Group photo of SIParCS interns for 2023 — (Back row, L-R) Ameya Patil, Julius Owusu Afriyie, Yuta Norden, Hayden Outlaw, Si Chen, Dhamma Kimpara, Reuben Alter, Ian Franda, Kenton Wu.

Project 1. Applying search techniques for scientific data discovery and exploration

Intern: Emily Mc Nett (University of Wisconsin, Stout)

Mentors: Nathan Hook, Eric Nienhouse

NCAR’s diverse scientific data holdings have historically been difficult for external scientists and users to search across to find the data they need for their science. While NCAR currently has a search system that aggregates these data holdings, we are experimenting with a simpler approach. This project is focused on continued development of a new Java-based web application for scientific data search. This summer we’re enhancing our current software with improvements including, but not limited to, the following: adding search facets, improving the usability of our web front end, making the front end more responsive on both large- and small-screen internet-enabled devices, storing and reading more metadata from our Solr search platform, validating scientific metadata, and sending email notifications when incomplete metadata is found.
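
As a rough illustration of what a faceted search request to Solr can look like, the sketch below builds the query string for a select request using Solr's standard `facet`, `facet.field`, and `facet.mincount` parameters. It is written in Python for brevity rather than in the project's Java, and the field names (`keyword`, `resource_type`) and the helper `build_facet_query` are hypothetical, not taken from the project.

```python
from urllib.parse import urlencode


def build_facet_query(text, facet_fields, rows=10):
    """Return the query string for a Solr select request with field faceting.

    The facet field names passed in are placeholders; a real schema
    would define its own fields.
    """
    params = [
        ("q", text),              # free-text search terms
        ("rows", rows),           # number of documents to return
        ("facet", "true"),        # turn faceting on
        ("facet.mincount", 1),    # hide facet values with zero matches
    ]
    # Solr accepts one facet.field parameter per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


query = build_facet_query("sea surface temperature", ["keyword", "resource_type"])
print(query)
```

The resulting string would be appended to a collection's select endpoint; the response then carries per-field counts that a front end can render as clickable filters.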

Project 2. Python data analysis and visualization for unstructured grids data

Intern: Ian Franda (University of Wisconsin, Madison)

Mentors: Orhan Eroglu, Anissa Zacharias

Project Raijin was funded through an NSF EarthCube award to develop community-owned, sustainable, scalable tools for data analysis and visualization that can operate on unstructured climate and global weather data at global storm-resolving resolutions. The development of Project Raijin leverages the scientific Python ecosystem, particularly the Xarray and Dask packages, and the Pangeo community. During the SIParCS 2023 summer internship, the intern will have the opportunity to work on a novel research and development project by helping implement various data analysis and visualization workflows for unstructured grids, such as example plotting scripts for different unstructured grid datasets, Jupyter Notebook-based training modules, UXarray usage examples, and computational functions. The student will explore and learn about data visualization and analysis in geosciences using commonly used Python tools such as Matplotlib, Cartopy, and Holoviews. The student will also learn high performance computing (HPC) principles through use of NCAR’s HPC clusters as well as parallelization and optimization packages such as Dask, Numba, and Datashader. Most or all of the student’s work will be made publicly available through our open-development model, which will in turn help the intern create a strong Python portfolio in data analysis and visualization.

Project 3. Investigating holographic images of clouds with machine learning

Intern: Hayden Outlaw (Tulane University)

Mentors: John Schreck, Matthew Hayman (EOL), Gabrielle Gantos

This project aims to improve the performance of a neural network processor for holographic images of cloud particles obtained using the HOLODEC instrument, an airborne cloud particle imager developed at NCAR that captures holographic images of liquid and ice cloud particles. A “U-Net” style neural network is used to recognize particles in the holograms after they have been computationally refocused. The objective of this project is to modify the neural net to reduce over-prediction and reduce data preprocessing requirements. Over the summer, the student will work with scientists in CISL and the Earth Observing Laboratory (EOL) toward developing a new training dataset that utilizes multiple depth layers as additional “input channels.” Upon successful generation of the training dataset, the student will then modify the neural net to leverage this new data and train it. Depending on time, there may be opportunities to explore other potential solutions, including mixed recurrent/computer vision modeling approaches, as well as improving processing performance on the HPC systems at NCAR.
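
The idea of using multiple depth layers as input channels can be sketched as follows. This is only an illustration under stated assumptions: the function name `stack_depth_channels`, the symmetric window around a center plane, and the edge clamping are all hypothetical choices, and the project's actual channel layout is not described here.

```python
def stack_depth_channels(planes, center, half_width):
    """Collect the refocused planes around `center` as input channels.

    `planes` is a list of 2D images (lists of rows), ordered by depth.
    Indices that fall outside the stack are clamped to the nearest valid
    plane, so every sample has the same number of channels.
    """
    last = len(planes) - 1
    channels = []
    for offset in range(-half_width, half_width + 1):
        index = min(max(center + offset, 0), last)
        channels.append(planes[index])
    return channels


# Four tiny 2x2 "planes", each filled with its own depth index.
planes = [[[z, z], [z, z]] for z in range(4)]

# A three-channel sample centered on the first plane; the out-of-range
# neighbor below it is replaced by the first plane itself.
sample = stack_depth_channels(planes, center=0, half_width=1)
print(len(sample), [channel[0][0] for channel in sample])
```

A network that previously took one refocused plane per sample would then take `2 * half_width + 1` channels, giving it some context from neighboring depths.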

Project 4. Improving the accessibility of Open IoTwx for enhanced community engagement

Intern: Reuben Alter (Colorado College)

Mentors: Agbeli Ameko, Keith Maull (NCAR Library)

The goal of the Open IoTwx project is to transform equitable accessibility to open-source Internet-of-Things (IoT) instrumentation across diverse communities. The aim is to lower barriers and broaden community access to low-cost observational instrumentation networks and to empower a diverse community of citizen scientists to co-design sensornets that best meet their needs. The student will modify, assemble, and deploy Open IoTwx in both “low data” and “big data” configurations. Low data configurations include standard measurement nodes such as digital rain, wind, and air. The big data configuration includes a lidar that can produce 100,000 point-cloud measurements per second and high-resolution images from various camera options. These data will be stored locally, and a subset of the data will be transmitted via station communications protocols. This project will focus on open extensibility protocols to allow for a wide range of sensor options. The student will expand and develop documentation to allow for the growth of an open-source community around the project. In the process of developing this documentation system, the student will apply their hands-on expertise (Arduino, 3D CAD, data communications protocols) to deploy a local station and improve the accessibility of the entire build and deployment process. This specifically involves recommendations and design changes to create standard extensible protocols for hardware and IoT software components to promote interoperability while reducing costs and increasing accessibility.

Project 5. An interactive website to support the co-design process

Intern: Anupriya Dixit (Berea College)

Mentors: John Dennis, Cena Brown

At NCAR, we perform world-class research in support of our goal of advancing our understanding of our Earth system. A powerful tool that supports our scientific objectives is high-performance computing. We have observed that the uptake of innovative computing technologies has traditionally been slow. We believe that a fundamental reason behind the slow uptake is that it typically takes expert knowledge to determine if a scientific objective is amenable to new technology. We have created a set of questions that supports this evaluation, but these questions are not particularly accessible. We therefore intend to develop an interactive website that makes this kind of evaluation accessible to a broader audience. The goal of this project is to develop an interactive learning website that will simplify evaluating the suitability of a new technology, in this case Graphics Processing Unit (GPU) computing, for a particular science objective. The student’s primary focus will be developing the website and testing it on suitable volunteers.

Project 6. Using machine learning uncertainty estimates to aid scientific analysis

Interns: Belen Saavedra (Berea College), Dhamma Kimpara (University of Colorado, Boulder)

Mentors: David John Gagne, Charlie Becker, Gabrielle Gantos, John Schreck

Both weather forecasters and researchers want to know when they should trust the guidance from their models as well as when not to. Being able to understand the sources of uncertainty in their guidance can enable them to convey to decision makers when to wait for further updates versus taking immediate protective actions. However, traditional machine learning models can only provide limited estimates of uncertainty within the realm of their training experience, so they tend to be overconfident in their predictions, especially when being applied to unseen data. A new class of machine learning models called evidential models can estimate the total uncertainty of a single prediction by making strong prior assumptions about how the data and model are expected to behave. In this project, the intern will work with CISL’s machine learning group to develop new ways to analyze and visualize uncertainty estimates and explanations of the uncertainty from these evidential models for multiple weather forecasting use cases, including winter precipitation type and severe storm hazard (e.g., tornadoes, hail) prediction.
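
To make the idea of per-prediction uncertainty more concrete, here is a minimal sketch of how uncertainty can be read off an evidential regression output. It assumes the commonly used formulation in which the network predicts the four parameters of a Normal-Inverse-Gamma distribution (`gamma`, `nu`, `alpha`, `beta`); whether the project uses this exact formulation is an assumption, and the function name is hypothetical. Classification problems such as precipitation type would use a different evidential distribution.

```python
def evidential_uncertainty(gamma, nu, alpha, beta):
    """Split the predictive variance of a Normal-Inverse-Gamma output.

    gamma: predicted mean
    nu, alpha, beta: evidence parameters (nu > 0, alpha > 1, beta > 0)
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1, beta > 0")
    aleatoric = beta / (alpha - 1)           # expected noise variance in the data
    epistemic = beta / (nu * (alpha - 1))    # variance of the predicted mean
    return {
        "prediction": gamma,
        "aleatoric": aleatoric,
        "epistemic": epistemic,
        "total": aleatoric + epistemic,
    }


result = evidential_uncertainty(gamma=2.0, nu=4.0, alpha=3.0, beta=1.0)
print(result)
```

In this formulation a small `nu` (little evidence) inflates the epistemic term, which is the kind of signal that could tell a forecaster the model is operating outside its training experience.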

Project 7. Reproducible and scalable analysis of remote sensing data in the cloud using Xarray

Intern: Yuta Norden (University of Hawai'i, Mānoa)

Mentors: Julia Kent, Deepak Cherian (CGD), Scott Henderson & Jessica Scheick (External)

Science today requires software that enables expressive and easily parallelized workflows on gigabyte- to petabyte-sized datasets. Xarray is an actively developed open-source library that provides scientists with a powerful interface for parallelized computation with multi-dimensional raster datasets (e.g., image stacks), which are prevalent today across all scientific domains. Modern workflows can leverage Xarray to analyze massive cloud-hosted archives such as NASA’s Earth observation archive. Over the summer, the student will learn skills that are key to practicing open science and that are transferable to both academic and non-academic career paths. The student will collaborate with a team of research scientists, data scientists, and software developers to produce publicly accessible tutorials that leverage Xarray for scientific analysis of cloud-hosted remote sensing data; contribute to multiple open source geoscientific Python projects (particularly Xarray and RioXarray) as well as general open source tools such as JupyterBook; use cloud-based datasets and computational resources; gain experience with collaborative software development workflows via GitHub; and learn about the technical components of reproducible computational workflows, including testing, continuous integration, and sharing data and results.

Project 8. Optimization of an ocean biogeochemistry model

Intern: Robin Armstrong (Cornell University)

Mentors: Moha Gharamti, Dan Amrhein & Matt Long (CGD)

Ocean ecosystems sustain marine fisheries and mediate fluxes of carbon that are important for maintaining the ocean’s vast inventory of carbon dioxide. Earth system models represent these ocean ecosystems using numerical models based on empirical constraints from ocean observations and understanding of fundamental biogeochemical processes. However, these models include many parameters specifying, e.g., interactions among trophic levels and between physical and biological processes. In many cases, the “correct” values of these parameters are poorly understood, and these uncertainties translate into errors in model representations of ocean ecosystems and make it difficult to compare simulations to real-world observations and make useful ocean forecasts. In this project, the student will help develop a parameter optimization framework for the Marine Biogeochemistry Library (MARBL), which is the ocean ecosystem model coupled to the Community Earth System Model, using the Data Assimilation Research Testbed (DART). DART is a sequential ensemble data assimilation tool that has been heavily tested and used for weather forecasting, ocean prediction, climate projections, flood prediction, parameter estimation, and other applications. Here we will exploit simplified ocean model configurations to enable rapid prototyping and iteration and consider optimizing a collection of these across a set of testbed sites sampling different oceanographic settings. The student will join a team of researchers with diverse expertise in oceanography and data assimilation; the student will have the opportunity to learn about ocean biogeochemistry and ecology, Earth system modeling, data assimilation, and high performance computing.
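
The general idea of estimating a model parameter with an ensemble filter can be sketched in a few lines. The example below is a scalar, two-step ensemble adjustment update: the ensemble of predicted observations is first adjusted toward the observation, and the resulting increments are then regressed onto the parameter. This is a simplified illustration in plain Python, not DART's Fortran implementation or its API; the function name and the toy numbers are hypothetical.

```python
from statistics import mean, variance


def update_parameter(params, predicted_obs, obs_value, obs_error_var):
    """Update a parameter ensemble from one scalar observation.

    params: prior ensemble of the uncertain parameter
    predicted_obs: what each member predicts for the observed quantity
    obs_value, obs_error_var: the observation and its error variance
    Assumes the predicted observations are not all identical.
    """
    y_mean = mean(predicted_obs)
    y_var = variance(predicted_obs)  # sample variance of the prior

    # Step 1: combine prior and observation in observation space.
    post_var = 1.0 / (1.0 / y_var + 1.0 / obs_error_var)
    post_mean = post_var * (y_mean / y_var + obs_value / obs_error_var)
    shrink = (post_var / y_var) ** 0.5
    increments = [post_mean + shrink * (y - y_mean) - y for y in predicted_obs]

    # Step 2: regress the observation increments onto the parameter.
    p_mean = mean(params)
    cov = sum(
        (p - p_mean) * (y - y_mean) for p, y in zip(params, predicted_obs)
    ) / (len(params) - 1)
    slope = cov / y_var
    return [p + slope * dy for p, dy in zip(params, increments)]


prior = [2.0, 4.0, 6.0]          # toy parameter ensemble
predicted = [1.0, 2.0, 3.0]      # toy predicted observations
posterior = update_parameter(prior, predicted, obs_value=4.0, obs_error_var=1.0)
print(posterior)
```

Because the observation lies above every member's prediction, the posterior ensemble shifts upward and its spread shrinks, which is the qualitative behavior one would expect from assimilating an informative observation.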

Project 9. Just-in-time compilation of a chemistry solver for GPU

Intern: Qina Tan (Colorado School of Mines)

Mentors: Jian Sun, John Dennis, Matthew Dawson (ACOM)

At NCAR, we perform world-class research in Earth system science with a particular focus on the interaction between the atmosphere and other components like the oceans and land surface. One of the many critical pieces in our understanding of the Earth system is the way in which chemical reactions impact the atmosphere. Atmospheric chemistry is typically solved by a numerical solver in a weather or climate model. Currently a researcher executes a “preprocessor” to generate a specialized version of the source code for a specific chemical mechanism. This preprocessing step degrades the user experience and increases software maintenance costs. These problems can potentially be overcome using a just-in-time (JIT) compilation approach. JIT compilation can build the necessary chemistry solver at runtime. This means that once a researcher changes the chemical mechanism, the appropriate chemistry solver is generated automatically. However, several open questions remain: (1) Does the JITed code perform competitively with the code generated by the “preprocessor”? (2) Is the JITed code portable between different platforms such as CPUs and Graphics Processing Units (GPUs)? Addressing these questions is likely to greatly enhance the appeal and adoption of JIT compilation in this research community. The goal of this 2023 summer internship is to develop a GPU version of an existing JIT-based chemistry solver written in C++. The student’s primary focus will be developing the JITed chemistry solver for GPU and documenting the procedures, successes, and known issues. The student will also run various chemistry solvers on different Linux clusters at NCAR to verify their accuracy and evaluate their performance.

Project 10. Interactive visualizations of climate data

Intern: Pritam Das (University of Washington)

Mentors: Negin Sobhani, Deepak Cherian (CGD)

Effective visualizations of climate model outputs and climate data can help communicate climate change issues to the general public. Furthermore, advanced and interactive visualizations enable climate scientists to detect patterns, time-evolving features, and trends in complex datasets and model outputs that might not be obvious from looking at the raw data alone. In this project, we are going to create a user-friendly dashboard for reading and visualizing large climate datasets. Next, we are going to host this application on a commercial platform and study the performance of Xarray/Dask for reading large volumes of climate data on different architectures. These data visualization dashboards will be used to communicate scientific findings to domain experts, policy makers, and the general public. Over the summer, the student will have opportunities to: (1) Collaborate with a team of research scientists, data scientists, and software developers to produce publicly accessible interactive dashboards that leverage Xarray/Dask, the scientific Python stack, and interactive visualization libraries such as Bokeh and Holoviews for visualization and analysis of climate data. (2) Learn about and contribute to open-source geoscientific Python projects and open source tools. (3) Gain experience in effectively accessing and reading large datasets on commercial platforms. (4) Gain hands-on experience with version control software for collaborative software development via Git and GitHub. (5) Investigate the performance of GPU-native analytics with Xarray.

Project 11. Interactive visualization of uncertainty in high-resolution ensembles

Intern: Ameya Patil (University of Washington)

Mentors: Helen Kershaw, Moha Gharamti, Marlee Smith

Quantifying uncertainty through analysis of ensemble forecasts in numerical weather prediction or ocean modeling remains a challenge. The massive number of observations, together with the high dimensionality of atmospheric and ocean models, makes it difficult to properly assess the quality and uncertainty of the prediction. Consequently, deriving risk measures and informative solutions can become arduous, if not nearly impossible. Ensemble data assimilation (DA) provides a flexible ensemble framework to estimate the state of an Earth system. The main goal of this project is to interface the ensemble visualization package OVIS with the Data Assimilation Research Testbed (DART) at NCAR. OVIS is an interactive visualization framework that allows for efficient and easy analysis of ocean forecasts and their uncertainties. By processing data on the fly, OVIS can help users dive into the data, change parameters, select subsets of the ensemble, and instantly visualize the results. Various risk measures can also be computed based on the statistics of the ensemble. While DART is written in Fortran, OVIS is implemented in Objective-C and OpenGL. The project also entails extending the scope of OVIS to support atmospheric and other Earth system models and potentially exploring CISL’s own VAPOR for depicting uncertainty. Overall, the objective is to make it possible for DART’s large user base to have access to a state-of-the-art diagnostic package that is modern, highly informative, and easy to use.
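
As a simple illustration of risk measures derived from ensemble statistics, the sketch below computes the ensemble mean, the spread, and the fraction of members exceeding a threshold at a single location. It is not OVIS or DART code; the function name, the threshold, and the toy values are hypothetical.

```python
from statistics import mean, stdev


def ensemble_summary(members, threshold):
    """Summarize one forecast quantity across ensemble members.

    Returns the ensemble mean, the spread (sample standard deviation),
    and the fraction of members above `threshold`, which serves as a
    simple exceedance-probability estimate.
    """
    exceed = sum(1 for value in members if value > threshold) / len(members)
    return {
        "mean": mean(members),
        "spread": stdev(members),
        "p_exceed": exceed,
    }


# Toy five-member ensemble of, say, sea surface height anomaly.
members = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = ensemble_summary(members, threshold=3.5)
print(summary)
```

An interactive tool would recompute such statistics whenever the user changes the threshold or selects a subset of members.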

Project 12. Creating bias-corrected global model output to support regional climate research

Intern: Kenton Wu (University of Texas at Austin)

Mentors: Thomas Cram, Cindy Bruyère (CPAESS), Riley Conroy

The goal of this project is to work with an NCAR scientist to update and improve existing software to create a new version of an existing global bias-corrected climate dataset that is built from the NCAR Community Earth System Model (CESM) output. The previous version of this dataset was produced under phase 5 of the Coupled Model Intercomparison Project (CMIP5). The updated dataset produced in this internship will be built from CESM output generated as part of the newer CMIP6 initiative, which supports the Intergovernmental Panel on Climate Change Sixth Assessment Report (IPCC AR6). Since all global climate models contain regional-scale biases due to insufficient spatial resolution and limited representation of some physical processes, it is common to bias correct the climate model output before using it as input to regional-scale models. This new dataset will provide a valuable community resource for regional climate researchers to run numerical simulation experiments based on the most recent climate model predictions and to use the results to determine the expected regional and local impacts from a range of future climate change scenarios. Additionally, the intern will gain experience in data curation, management, and preservation by archiving the new dataset in the NCAR Research Data Archive.
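
The simplest form of bias correction can be sketched as a mean shift: estimate the model's average offset from a reference dataset over a historical period, then remove that offset from the model output. The method actually used for this dataset is not described here, so the approach, the function name, and the toy temperatures below are illustrative assumptions only.

```python
from statistics import mean


def mean_bias_correct(model_hist, reference_hist, model_future):
    """Remove the historical mean bias from future model values.

    model_hist and reference_hist cover the same historical period;
    their mean difference is treated as a constant model bias.
    """
    bias = mean(model_hist) - mean(reference_hist)
    return [value - bias for value in model_future]


# Toy regional temperatures in degrees Celsius.
model_hist = [16.0, 17.0, 18.0]       # model over the historical period
reference_hist = [15.0, 16.0, 17.0]   # reference data, same period
model_future = [19.0, 20.0]           # raw future model output

corrected = mean_bias_correct(model_hist, reference_hist, model_future)
print(corrected)
```

Here the model runs one degree warm over the historical period, so each future value is lowered by one degree while the model's projected change is preserved.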

Project 13. Continuous Integration for CPU and GPU applications

Intern: Haniye Kashgarani (University of Wyoming)

Mentor: Supreeth Suresh

An open source and community-driven weather and climate modeling code receives regular updates from scientists and software engineers all over the world. These are often related to science, portability, and performance updates. These updates are often submitted as a pull request to the repository and the admins/owners of the repository usually perform the code review for syntax, standard, verification, code norms, etc. before accepting the pull request. This process is even more time consuming when you have to check using multiple hardware architectures, compilers, and systems. Code updates, changes in the software stack, compilers, and even changes in the system introduce bugs regularly. It is very important to identify these bugs before committing the changes to the code. Continuous integration is a development practice that integrates tests, syntax checks, and so on into source code changes as part of a CI/CD pipeline or DevOps process. This project aims to develop a process to make use of CI tools to develop automated code testing for an ASAP application on both CPU and GPU.

Project 14. Containerization of simulation applications for frequently re-run configurations

Intern: Si Chen (Emory University)

Mentors: Haiying Xu, Sheri Mickelson, Jian Sun

Container technology is rapidly developing, and we wonder if it can help to reduce data storage. With containers, we can run a scientific application with the required OS, software stacks, configurations, initialization/grid files, and input data on current supercomputers. We can then rebuild the containers with an older version of MPI (Message Passing Interface) or the OS to check whether they still run successfully. In this way, we can validate the long-term re-run capability of containers. If this workflow works, we can use this method to re-run an old simulation whenever we need it. Thus, when scientists want to save large amounts of data from simulations, we can urge them to use this container strategy to re-run their applications at any time and save only a small amount of data instead. The project will focus on using Singularity containers to automate the compilation process of the scientific simulation code CM1 with various MPI versions. The student will first manually build CM1, then create container images to automatically build it and test it on Casper, and finally change MPI versions to see if the container images are still able to run on Casper. The final step will confirm the re-run capability of the containers.

CISL Outreach, Diversity, and Education (CODE) Intern

Intern: Julius Owusu Afriyie (University of Nebraska - Lincoln)

Mentors: Virginia Do, AJ Lauer, Agbeli Ameko

The CODE Intern will provide administrative support to the SIParCS program office and affiliated programs and assist with planning and preparation for education and outreach programs to occur during the 2023-2024 school year. Responsibilities for student intern support include being an active participant on the SIParCS team to provide support and mentoring for students; living at the suite-style apartments with the interns; planning and participating in after-hours team building activities; and keeping program leadership informed of any issues that arise. The CODE intern may assist students/participants with special needs; travel to assist with intern recruitment during fall months; and attend the Rocky Mountain Advanced Computing Consortium (RMACC) with the SIParCS program. During summer, the student will assist with program support, including planning and running events such as orientation, professional development workshops, field trips, and other learning opportunities for interns. The student will also assist with apartment move-in and move-out logistics, help write and edit the SIParCS Annual Report, and update SIParCS program alumni tracking documents for program assessment and evaluation purposes.