SIParCS 2019 Projects

Technical projects for summer 2019
If you are interested in the CISL Outreach, Diversity, and Education (CODE) Intern position (non-technical), please visit the CODE internship page.

* U, G denotes availability for undergraduate and/or graduate applicants.  Please see Eligibility for clarification.

  1. Applying Machine Learning to Maximize Information Extraction from Imperfect Climate Models *U, G
  2. Building a Historical Data Image Archive to Support Climate Research and Retrospective Research of Hand-Written Documents *G
  3. Climate Model Sonification *U, G
  4. Cloud-based Deployment & Analytics for the Community Earth System Model 2 (CESM2) Climate Model *U, G
  5. Deploying File System Performance Metrics Through Extreme Science and Engineering Discovery Environment (XSEDE) Metrics on Demand (XDMoD) *U, G
  6. Implementing an Observation Support System for Data Assimilation Research Testbed (DART) *U, G
  7. In-Situ visualization for Model for Prediction Across Scale (MPAS) *G
  8. Interactive Science News in Augmented Reality *G
  9. Jupyter Notebooks for High Performance Computing *U
  10. Machine Learning-Based Method for Wind Resource Assessment *U, G
  11. Testing Modernization of Scientific Software *U
  12. Using Cloud-Friendly Data Format in Earth System Models *U, G

Please apply to no more than two (2) SIParCS projects.


  1. Applying Machine Learning to Maximize Information Extraction from Imperfect Climate Models *U, G
    Areas of interest: Data science, Geostatistics

    To study the Earth’s current and future climate, scientists use physically based computational models to represent the earth system.  One challenge scientists face is that all such models are imperfect.  Often that imperfection results in coherent biases in space and time that a human can readily interpret, but humans have limited mental bandwidth. Modern machine learning algorithms provide an opportunity to make better use of existing models by retrieving more of the information available in these imperfect models in an automated fashion. This problem is particularly important for regional climate studies because the enormous computational cost of running higher resolution models means that regional climate models are either lower resolution than desired (with larger potential for errors) or have greater simplifications in their physical representation (with larger potential for errors).
     
    This project will have a student apply machine learning techniques to an archive of regional climate model simulations to improve our understanding of local scale changes in climate that are critical for end users.  By leveraging existing simulations performed for historical time periods, a student will be able to explore the trade-offs associated with different algorithms and different regional climate modeling approaches in comparison to observations.  This work will also provide an opportunity to use and learn modern data science parallel analysis platforms (e.g. xarray, dask, and tensorflow) on the NCAR supercomputer (Cheyenne) and associated GPU accelerated analysis platform (Casper).
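As a toy illustration of the idea (not the project's actual method), a simple linear bias correction can be fit to hypothetical paired model/observation samples; in the project itself this would be replaced by machine-learning models built with tools such as xarray, dask, and tensorflow:

```python
# Minimal sketch: learn a linear correction for an imperfect model.
# 'model' and 'truth' below are synthetic, hypothetical paired samples.

def fit_linear_correction(model, obs):
    """Least-squares fit of obs ~ a * model + b."""
    n = len(model)
    mx = sum(model) / n
    my = sum(obs) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(model, obs))
    var = sum((x - mx) ** 2 for x in model)
    a = cov / var
    b = my - a * mx
    return a, b

def apply_correction(model, a, b):
    return [a * x + b for x in model]

# Synthetic example: the "model" runs 2 degrees too warm with a 10% scale error.
truth = [10.0, 12.0, 15.0, 18.0, 20.0]
model = [1.1 * t + 2.0 for t in truth]
a, b = fit_linear_correction(model, truth)
corrected = apply_correction(model, a, b)
```

Because the synthetic error here is exactly linear, the fit recovers the truth; real model biases are messier, which is what motivates the machine-learning approach.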

    Students - This project is open to undergraduate and graduate students.

    Skills and Qualifications:  
    Basic programming skills required, strong preference for familiarity with unix and python (or R).  Applicants should have an interest in earth science, and ideally have taken a statistics course. 

    Undergraduates Apply ; Graduates Apply



  2. Building a Historical Data Image Archive to Support Climate Research and Retrospective Research of Hand-Written Documents *G
    Areas of interest: Digital Asset Management

It is estimated that more than 50% of pre-1960 ocean and atmosphere observations are not in digital form.  Logbooks from ships and land observation stations hold valuable numerical data dating back two centuries and more.  Making these data accessible for climate and weather research has two major steps: first, photographic images are made from the paper pages; second, the numerical data are keyed in by citizen-science volunteers and dedicated professionals.  This project addresses the preservation of and access to the images (or "documents," as they are called in the library sciences), both for validating information coming out of the key-entry process and for the rich textual comments that add information about historical weather and life events over time.  We will research the available underlying technologies to design and build a prototype repository containing the digital images/documents, catalog them, and provide user access to review and copy them.  The project will be scaled appropriately to the applicant's strengths and educational background.

    Students - The project is open to graduate students only.

    Skills and Qualifications:  
    Basic understanding of programming languages such as Python.  Basic understanding of controlled vocabularies and metadata schemas.  Basic understanding of XML and HTML markup languages.  Experience with database query languages such as SQL.  Familiarity with digital provenance concepts.  Ability to interact with mentors and peers in a manner that supports collaboration and inquiry.  Ability to work with diverse staff.  Good problem solving skills.  Good oral and written communication skills.  Willingness to learn and use computing tools and programs.  Overt curiosity to explore new things. 

    Graduates Apply



  3. Climate Model Sonification *U, G
    Areas of Interest: Data Science, Visualization, Music
     

Develop an app that sonifies past and future climate change simulated by the NCAR climate model. This project would turn an existing exhibit at NCAR into a mobile app so that climate change can be communicated to the general public through music.
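As a minimal sketch of the sonification idea, here is a hypothetical mapping from climate anomalies to pitches on a pentatonic scale (the anomaly values are made up; a real app would read model output):

```python
# Minimal sonification sketch: map climate anomalies to musical pitches.
# Colder values map to lower notes, warmer values to higher notes.

PENTATONIC = [60, 62, 64, 67, 69, 72, 74, 76, 79, 81]  # C major pentatonic, MIDI numbers

def sonify(anomalies):
    """Linearly map each anomaly onto the scale's range of notes."""
    lo, hi = min(anomalies), max(anomalies)
    span = (hi - lo) or 1.0
    notes = []
    for a in anomalies:
        idx = round((a - lo) / span * (len(PENTATONIC) - 1))
        notes.append(PENTATONIC[idx])
    return notes

anomalies = [-0.2, -0.1, 0.0, 0.3, 0.6, 1.0]  # hypothetical warming trend (deg C)
notes = sonify(anomalies)
```

A warming trend thus becomes a rising melody; the app would then render the notes with a synthesizer or audio library.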

    Students - This project is open to undergraduate and graduate students.

    Skills and Qualifications
    Technical mastery of mobile app software; interest in music; interest in climate change.

    Undergraduates Apply ; Graduates Apply



  4. Cloud-based Deployment & Analytics for the Community Earth System Model 2 (CESM2) Climate Model *U, G
    Areas of Interest: Software Engineering, cloud technology
     

    The installation, configuration and running of a modern climate model like the Community Earth System Model 2 (CESM2) is a complicated task, typically done on large supercomputers and clusters designed for these types of applications.  This presents a challenge to scientists who might not have access to such systems, or the expertise to configure and build the software.  Cloud providers like Amazon Web Services (AWS) can help with the first challenge, as they offer on-demand hardware to anyone, but the complexity of their environment is yet another hurdle for scientists who would like to focus on their research, and not the technical know-how of instances, gateways, libraries and more.

    This project seeks to greatly reduce these barriers to entry by providing a pre-configured, publicly available CESM2 Amazon Machine Image that uses the AWS API to integrate with resource acquisition for simulations, data storage for output, notifications for run status and more.  It will also use either Linux tools or the AWS API for logging usage metrics to a central server to provide data on how scientists are using the image.  The internship will make heavy use of the Linux command-line and scripts, and will include seeking feedback from NCAR scientists on ease-of-use improvements and engaging with AWS engineers on technical issues.
     

    Students - This project is open to undergraduate and graduate students.

    Skills and Qualifications
    Good Linux skills (including command-line use / scripting), any experience with AWS or APIs is a plus but not necessary.

    Undergraduates Apply ; Graduates Apply



  5. Deploying File System Performance Metrics Through Extreme Science and Engineering Discovery Environment (XSEDE) Metrics on Demand (XDMoD) *U, G
    Areas of interest: Data science, software engineering, supercomputer systems operations

NCAR supercomputing resources generate and process large volumes of scientific data. It is critical to our operational success that we monitor and manage our file system performance efficiently. Several groups in CISL have developed tools for this purpose: some monitor file system performance at the hardware level, while others monitor CPU and memory performance at the job level. In this project, we will deploy a new software product, XDMoD, that allows us to monitor file system performance at the job level. The product of this project will provide weekly reports for our center.  For more information, see https://www.xsede.org.

    Students - This project is open to undergraduate and graduate students.

    Skills and Qualifications
    Python, Perl, Linux OS Knowledge

    Undergraduates Apply ; Graduates Apply



  6. Implementing an Observation Support System for Data Assimilation Research Testbed (DART) *U, G
    Areas of interest: Data science, Software engineering, visualization

    The Data Assimilation Research Testbed (DART) is a community software facility for data assimilation. One application of data assimilation is making numerical weather predictions (NWP). For NWP, a forecast from an atmospheric prediction model is statistically combined with measurements of things like temperatures or winds. The measurements may also be from much more sophisticated instruments like radars or satellite radiometers. 

    In order to work with observations, DART must have a data structure and associated file structure that can represent all the information that is known about the observations and their associated instruments. At present, DART uses a fairly limited linked list data structure with a flat binary file representation. In 2019, this will be updated to a more sophisticated data structure that will allow new capabilities for DART, most importantly the ability to do special computations on subsets of the observations. DART will also move to a Network Common Data Form (NetCDF) file format. Because of the unstructured nature of observations in both space and time, there are a number of interesting challenges in implementing both the data structure and the files. 
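As a loose illustration only (this is not DART's actual data structure), an observation sequence that carries space/time metadata and supports computations on selected subsets might look like:

```python
# Hypothetical sketch of an observation sequence with metadata, showing the
# kind of subset selection the new DART data structure would enable.
from dataclasses import dataclass

@dataclass
class Obs:
    kind: str       # e.g. "temperature", "wind_speed"
    value: float
    lat: float
    lon: float
    time: float     # hours since some epoch

obs_seq = [
    Obs("temperature", 287.1, 40.0, -105.3, 0.0),
    Obs("temperature", 289.4, 39.5, -104.9, 0.5),
    Obs("wind_speed",    7.2, 40.1, -105.0, 0.5),
]

def subset(seq, kind=None, t0=None, t1=None):
    """Select observations by type and/or time window."""
    out = []
    for o in seq:
        if kind is not None and o.kind != kind:
            continue
        if t0 is not None and o.time < t0:
            continue
        if t1 is not None and o.time > t1:
            continue
        out.append(o)
    return out

temps = subset(obs_seq, kind="temperature")
mean_temp = sum(o.value for o in temps) / len(temps)
```

The real implementation would serialize such records to NetCDF and cope with far richer instrument metadata.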

    The intern will work with the lead scientist and lead software engineer of the DART project to assist with developing software tools to implement and/or visualize the new observation files depending on the intern’s interests. The new files will be implemented and tested in data assimilation experiments. Tools that can reproduce existing DART observation diagnostics will be implemented. New diagnostic methods that are able to work with and visualize features of identified subsets of observations will be designed and implemented. The intern will have a chance to learn more about all aspects of the project while focusing on one of the components. No prior knowledge of data assimilation or NetCDF is necessary.

    Students - This project is open to undergraduate and graduate students.

    Skills and Qualifications
    Experience in any compiled programming language (e.g. Fortran, Java, C, C++).  Experience with any scripting language (python, bash/csh, perl).  Interest in designing data structures and tools.

    Undergraduates Apply ; Graduates Apply



  7. In-Situ visualization for Model for Prediction Across Scale (MPAS) *G
    Areas of interest: Application optimization/parallelization, data science, visualization

In-situ visualization is a technique in which data are visualized and analyzed in real time, as they are being generated by the simulation, thereby minimizing a critical bottleneck: data storage and input/output (I/O). The approach is gaining popularity every day because of the ever-growing need for real-time visualization and analysis and the limitations of data storage and I/O.
     
I/O operations have long been one of the main bottlenecks in weather and climate modeling applications, which tend to produce high-resolution output. This 2019 summer internship will focus on developing an in-situ adapter for a selected weather/climate application using ParaView Catalyst or a similar framework. ParaView Catalyst is a set of Application Programming Interfaces (APIs) that brings the scalable capabilities of the Visualization Toolkit (VTK) together with ParaView. In-situ visualization is one possible way to reduce the I/O burden on the system. The primary focus will be on developing the adapter for an application running on CPUs; a secondary focus is developing the adapter for the application running on general-purpose graphics processing units (GPGPUs). The student intern will explore a diverse set of in-situ visualization frameworks and the feasibility of integrating them with one selected application.
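The in-situ pattern itself can be sketched in a framework-agnostic way: the simulation periodically hands its in-memory state to an analysis callback instead of writing every step to disk. (A real Catalyst adapter would pass VTK data structures; everything below is a simplified stand-in.)

```python
# Framework-agnostic sketch of in-situ coprocessing: the simulation calls an
# analysis adapter on its live state at a chosen interval, avoiding file I/O.

def coprocess(step, field):
    """Stand-in for a Catalyst-style adapter: summarize data in place."""
    return {"step": step, "min": min(field), "max": max(field)}

def run_simulation(nsteps, output_interval, adapter):
    field = [0.0] * 8            # toy 1-D "model state"
    summaries = []
    for step in range(1, nsteps + 1):
        # Fake "physics": each cell grows at a different rate.
        field = [x + 0.1 * (i + 1) for i, x in enumerate(field)]
        if step % output_interval == 0:
            summaries.append(adapter(step, field))  # in-situ, no disk write
    return summaries

summaries = run_simulation(nsteps=10, output_interval=5, adapter=coprocess)
```

In the actual project the adapter would wrap MPAS fields as VTK arrays and hand them to Catalyst for rendering and analysis.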

    Students - This project is open to graduate students only.

    Skills and Qualifications:
Strong programming skills in at least one of C, C++, or Fortran are required.  Familiarity with Python programming.  Familiarity with ParaView, VTK, or similar tools is preferred.  Understanding of parallel programming paradigms or knowledge of GPGPU is considered a bonus.

    Graduates Apply



  8. Interactive Science News in Augmented Reality *G
    Areas of interest: Software Engineering, visualization, augmented reality

    This project will offer the intern valuable experience in designing and developing a mobile augmented reality (AR) application to present science articles and news features in innovative and engaging ways.  Inspiring, educating, and informing the public about NCAR research and about the wonder and relevance of science is a primary mission of our organization. Implementing new technologies that enhance our storytelling capabilities and engage our audiences plays a key role in making that possible. 

    The intern will design and develop a mobile AR application to access NCAR/UCAR news article(s) enhanced with relevant AR objects. As an example, the prototype might embed a hurricane model into a story about tropical cyclones and enable the reader to not only read about hurricanes, but also use their mobile device as a window to walk around a virtual, 3D hurricane in their room or office.

    Students - This project is open to graduate students only.

    Skills and Qualifications
Student in Computer Science, Electrical Engineering, or a related field; intermediate to advanced computer skills. Experience developing iOS/Android AR applications in Unity. Experience with C# and familiarity with writing simple custom shaders in Unity.  Experience/working knowledge of the ARKit and/or ARCore SDKs.  Excellent written and verbal communication skills. Creativity, imagination, and problem-solving skills.  Experience/familiarity with AWS S3 and the AWS Mobile SDK for Unity is desired but not required.  Experience developing and publishing apps to the Apple App Store and/or Google Play is desired but not required.

    Graduates Apply



  9. Jupyter Notebooks for High Performance Computing *U
    Areas of interest: Software Engineering, user education

    Jupyter Notebooks are becoming the default “computing environment” in a variety of programming languages. Python is certainly the most popular language for Jupyter Notebooks, but the list is very long, and includes the likes of R and Lua, and surprising entries such as C, Fortran, and even bash! Meanwhile, the daily tasks of high performance computing (HPC) users - shell scripting, job scheduling, resource management, and data movement, to name a few - are often archaic and not very user friendly in comparison. Many individuals self-teach these techniques, but novel methods of interactive learning can make this process more efficient, rewarding and perhaps even fun. 

The objective of this project is to develop Jupyter Notebooks that demonstrate NCAR’s HPC functionality, and integrate such Notebooks into the CISL documentation. Concepts covered will depend on the interest and expertise of the student; potential topics include using supercomputer workload managers such as the Portable Batch System (PBS) and Slurm job schedulers, customizing the user environment, running large-scale parallel jobs, understanding permissions on disk and tape, and managing cron jobs. The student will then work with User Support staff to integrate these Notebooks into our live NCAR documentation. If time allows, the student may investigate other self-guided teaching technologies, contribution of work to the HPC Carpentry project, or integration with JupyterHub.
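As a sketch of the kind of notebook content involved, a small helper could generate a PBS batch script from parameters. The account, queue, and resource strings below are illustrative placeholders, not site-specific values:

```python
# Hypothetical notebook helper: build a PBS batch script from parameters.
# Directive names follow standard PBS usage; actual queues, accounts, and
# resource syntax depend on the system's configuration.

def pbs_script(name, account, queue, walltime, ncpus, command):
    lines = [
        "#!/bin/bash",
        f"#PBS -N {name}",
        f"#PBS -A {account}",
        f"#PBS -q {queue}",
        f"#PBS -l walltime={walltime}",
        f"#PBS -l select=1:ncpus={ncpus}",
        "",
        command,
    ]
    return "\n".join(lines)

script = pbs_script("demo_job", "PROJECT0001", "regular",
                    "00:30:00", 4, "echo hello from the compute node")
```

A notebook cell could then display the generated script, explain each directive, and submit it with the scheduler's command-line tools.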

Students - The project is open to undergraduate students only.

    Skills and Qualifications:
    Applicants should be comfortable with the Unix computing environment and have experience with shell and Python scripting. An ideal candidate will also have knowledge of Jupyter and exposure to HPC resources. Some experience with technical writing would be also be beneficial.

    Undergraduates Apply



  10. Machine Learning-Based Method for Wind Resource Assessment *U, G
    Areas of interest: Application Optimization/Parallelization, Data Science, Software Engineering

Analog ensemble (AnEn) techniques have been used successfully for short-term weather prediction. In the context of wind resource assessment, the analog-ensemble method draws on the information contained in historical data for multiple physical quantities over the period in which these data overlap with observations (known as the training period; typically 365 days) of the quantity of interest (known as the predictand; wind speed in this study). The relationships derived within the training period are then applied to reconstruct the on-site wind speed over the period for which there are no observations (hereafter the "reconstructed period", e.g., the 20 years before the measurement campaign started).  To apply this simple but effective algorithm, the entire historical data set must be kept in memory.  This requirement is a significant limitation on scalability and performance.

    This project seeks to dramatically reduce memory usage, computational resources, and energy needs of the AnEn code by replacing the dataset of historical data with a representative model generated using machine learning techniques.
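The analog search at the heart of AnEn can be sketched with synthetic data (the predictors and wind speeds below are made up for illustration):

```python
# Toy analog-ensemble (AnEn) sketch: estimate an unobserved wind speed by
# averaging the observed speeds at the k most similar historical times.

def analog_ensemble(target, predictors, predictands, k=3):
    """predictors: historical predictor vectors; predictands: matching obs."""
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, target)) ** 0.5
    ranked = sorted(range(len(predictors)), key=lambda i: dist(predictors[i]))
    best = ranked[:k]
    return sum(predictands[i] for i in best) / k

# Hypothetical predictors: (pressure anomaly, temperature anomaly);
# predictand: observed wind speed (m/s) at the same historical time.
hist_pred = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.2)]
hist_wind = [4.0, 9.0, 4.4, 15.0, 4.2]
estimate = analog_ensemble((0.05, 0.1), hist_pred, hist_wind, k=3)
```

This brute-force search is exactly what forces the full historical data set into memory; the project would replace it with a compact learned model of the predictor-predictand relationship.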

Students - This project is open to undergraduate and graduate students.

    Skills and Qualifications:
    Desire to use machine learning and technology for the public good.  Ability to work in a Unix or Unix-like environment, using makefiles, scripts, compilers, plotting tools, etc. as needed. The successful candidate must be very organized and highly motivated.  Familiarity with Fortran and Object Oriented programming techniques.

    Undergraduates Apply ; Graduates Apply



  11. Testing Modernization of Scientific Software *U
    Areas of interest: Numerical methods, software engineering, supercomputer systems operations

Over the last several years, SIParCS interns have developed several tools for modernizing and standardizing Fortran source codes.  Much legacy code remains, and the scientists who own it would be greatly helped by software that assists with the process of modernization and standardization.

Other software, called K-Gen and written in Python, has been developed at NCAR to assist in extracting kernels from scientific codes for optimization.  The same need for before-and-after testing arises when modernizing and standardizing older codes.

We seek an intern to use, and potentially modify, the K-Gen software to help scientists modernize and standardize their code.  This includes contacting scientists and learning their requirements.

    Students - The project is open to undergraduate students only.

    Skills and Qualifications:  
Experience testing scientific software.

    Undergraduates Apply 



  12. Using Cloud-Friendly Data Format in Earth System Models *U, G
    Areas of interest: Application optimization/parallelization, software engineering

Earth system model data volumes are increasing rapidly every year, thanks to advances in computing efficiency and increases in model resolution and complexity.  To produce these data volumes efficiently, parallel input/output (I/O) is a necessity, and to help manage them, compression is also a necessity. Unfortunately, the Network Common Data Form (NetCDF) format commonly used in Earth system models does not support parallel I/O with simultaneous compression without significant performance loss.  Zarr, by contrast, is a cloud-friendly data format that implements parallel (multi-thread or multi-process) I/O for chunked, compressed, N-dimensional arrays.   Experience with the Zarr format could provide insight into whether it can improve parallel read/write/compression performance in Earth system models.
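The core idea behind chunked, compressed stores can be sketched in plain Python (this is not the Zarr API; it only illustrates why independently compressed chunks allow parallel writes):

```python
# Sketch of the idea behind chunked, compressed formats such as Zarr: each
# chunk is compressed independently, so separate workers can compress and
# store chunks in parallel without coordinating with one another.
import zlib
from concurrent.futures import ThreadPoolExecutor

def split_chunks(data, chunk_size):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def write_chunk(chunk):
    """Compress one chunk; a real store would write it to its own object/file."""
    return zlib.compress(bytes(chunk))

data = list(range(256)) * 4          # toy 1-D array of byte values
chunks = split_chunks(data, 256)
with ThreadPoolExecutor(max_workers=4) as pool:
    stored = list(pool.map(write_chunk, chunks))

# Reading back decompresses each chunk independently, too.
restored = []
for blob in stored:
    restored.extend(zlib.decompress(blob))
```

Classic NetCDF compression, in contrast, serializes writers through a single compressed file, which is the performance loss the project aims to measure.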

Over this summer internship, the student will explore and learn about I/O development with the Zarr library.  The student will study how to use both the Zarr and NetCDF libraries, and will perform parallel benchmarks with both on NCAR’s Cheyenne supercomputer.

    Students - This project is open to undergraduate and graduate students.

    Skills and Qualifications:
    Familiarity with Linux or Unix.  Experience with Python and C/C++ programming.  Ability and willingness to work with a team.  Good communication and writing skills.  Optional: Familiarity with parallel computing, familiarity with scientific data formats, and experience with NumPy. 

    Undergraduates Apply ; Graduates Apply
