DS - Data Science

Projects for summer 2016

  1. A cloud-based storage system for the real-time collection and display of atmospheric data

    Note: this is a team project that will accept both intern and extern applicants. Up to two interns can be accommodated at NCAR in a traditional 11-week SIParCS internship. In the case of an externship, the extern will spend three weeks of the summer at NCAR and seven weeks at their home institution, returning to NCAR in the last week to give their final presentation. For externs, a summer faculty mentor at the home institution must be identified.

    The research topic of this summer internship will be to develop the concept of using cloud storage to support an end-to-end system providing public access to real-time atmospheric data. The students will conduct research on: a) the web service, user interface and database technologies required; b) scalable cloud-based storage solutions; c) optimization of performance, cost and resilience of the system; d) sensor data formats; e) data integrity; and f) system security.
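
    As an illustration of topics d) and e), the sketch below wraps one sensor reading in a JSON message and attaches a SHA-256 digest so the receiving side can detect corruption in transit or in storage. It is only a minimal sketch: the field names (`station_id`, `timestamp`, `values`), the station name, and the measurement values are hypothetical and are not taken from the project.

```python
import hashlib
import json


def pack_reading(station_id, timestamp, values):
    """Serialize one reading and attach a SHA-256 digest of the payload."""
    # Hypothetical record layout; a real sensor format would be chosen by the project.
    record = {"station_id": station_id, "timestamp": timestamp, "values": values}
    # sort_keys and fixed separators give a stable byte representation to hash.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return json.dumps({"payload": payload, "sha256": digest})


def unpack_reading(message):
    """Verify the digest and return the decoded reading, or raise ValueError."""
    envelope = json.loads(message)
    payload = envelope["payload"]
    actual = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if actual != envelope["sha256"]:
        raise ValueError("checksum mismatch: reading was altered or corrupted")
    return json.loads(payload)


# Made-up example values, for illustration only.
message = pack_reading(
    "demo-station-01",
    "2016-06-01T12:00:00Z",
    {"temperature_c": 21.4, "pressure_hpa": 835.2},
)
reading = unpack_reading(message)
```

    A plain hash like this only guards against accidental corruption. Protecting against deliberate tampering, which falls under topic f), would need a keyed scheme such as an HMAC or a digital signature.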

    The student interns will implement the end-to-end project on a commercial cloud platform and, if possible, also investigate the feasibility of porting the system software components onto a Raspberry Pi (R-Pi) bramble (a cluster of R-Pi boards). The R-Pi is an ideal platform for teaching high-performance computing to students because of its low cost and small size. Its versatility lets students explore several parallel computing features as part of their learning experience.

    Skills/Qualifications: Familiarity with UNIX operating system commands; knowledge of the basic principles of web application and web service development, cloud services, and database design; familiarity with file structures and attributes; basic knowledge of file transfer protocols and web service data representations (for example, JSON); basic programming skills in at least one of Python, Java, or C.


  2. Accelerating statistical analysis through parallel computations

    Statistical analysis often involves solving standard linear algebra problems. Specifically, for the statistical analysis of large spatial data, such as climate model output or satellite data, the limiting numerical steps are 1) solving a positive definite linear system and 2) finding the determinant of a positive definite matrix. Making these two steps faster directly increases the size of the data sets that can be handled and has the potential to create a breakthrough in the analysis of large (a.k.a. “big data”) data sets. Previous work has shown that these steps can be greatly accelerated using GPUs and parallel CPUs. This project will further explore the use of modern high-performance computing architectures for spatial statistical applications in the geosciences. A key goal is to provide user-friendly, high-level Matlab or R functionality accessible to users less familiar with parallel computing.
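
    Both limiting steps can be served by a single Cholesky factorization A = L L^T, which is one common approach and not necessarily the one the project will use. The serial, pure-Python sketch below is meant only to show the structure of the computation that a GPU or parallel implementation would accelerate. The 3x3 matrix and right-hand side are made-up values, and the function names are hypothetical.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L L^T (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_spd(low, b):
    """Solve A x = b from the Cholesky factor of A."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def log_det_spd(low):
    """Return log det(A) = 2 * sum(log L_ii), which avoids overflow for large A."""
    return 2.0 * sum(math.log(low[i][i]) for i in range(len(low)))


# Small covariance-like matrix with illustrative values only.
A = [[4.0, 2.0, 0.6],
     [2.0, 5.0, 1.0],
     [0.6, 1.0, 3.0]]
b = [1.0, 2.0, 3.0]

L = cholesky(A)
x = solve_spd(L, b)
log_det = log_det_spd(L)
```

    The factorization costs on the order of n^3 operations for an n x n matrix, while the two triangular solves and the log-determinant are much cheaper once L is known, so the factorization itself is the natural target for acceleration.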

    The details of the implementation are flexible depending on the candidates’ skills and interests, but will likely involve some of the following: writing Matlab, R, C, and CUDA code, and using MPI, OpenMP, and packages/libraries to interface between high-level languages and C. The implementation will take place on NCAR’s supercomputing environment, Yellowstone, and/or early-stage testing machines, using large and scientifically meaningful data sets from climate models and satellites.

    The project is open to two graduate students.

    Skills/Experience: The required skills for a candidate include a selection of the following: Matlab, R, C, a Linux/UNIX background, familiarity with parallel architecture and MPI, and preferably experience in CUDA C as well as experience with linear algebra packages such as LAPACK. Required mathematical skills include linear algebra and basic statistics. Desired skills include more advanced statistical knowledge.
