AOP - Application Optimization/Parallelization

Projects for summer 2016

  1. Accelerating statistical analysis through parallel computations

    Statistical analysis often involves solving standard linear algebra problems. For the statistical analysis of large spatial data, such as climate model output or satellite data, the limiting numerical steps are 1) solving a linear system with a positive definite matrix and 2) computing the determinant of a positive definite matrix. Making these two steps faster directly enables working with larger data sets and has the potential to create a breakthrough in the analysis of large (“big data”) data sets. Previous work has shown that these steps can be greatly accelerated using GPUs and parallel CPUs. This project will further explore the use of modern high-performance computing architectures for spatial statistical applications in the geosciences. A key goal is to provide user-friendly, high-level Matlab or R functionality accessible to users less familiar with parallel computing.

    The details of the implementation are flexible depending on the candidates’ skills and interests, but will likely involve some of the following: writing Matlab, R, C, and CUDA code; using MPI and OpenMP; and using packages/libraries to interface between high-level languages and C. The implementation will take place on NCAR’s supercomputing environment, Yellowstone, and/or early-stage testing machines, using large and scientifically meaningful data sets from climate models and satellites.

    The project is open to two graduate students.

    Skills/Experience: The required skills for a candidate include a selection of the following: Matlab, R, C, a Linux/UNIX background, familiarity with parallel architectures and MPI, and preferably experience with CUDA C and with linear algebra packages such as LAPACK. Required mathematical skills include linear algebra and basic statistics; more advanced statistical knowledge is desirable.



  2. Implementation and design of parallel atmospheric PDE solvers using radial basis functions on many-core CPUs and GPUs

    Numerical methods based on radial basis functions (RBFs) for solving partial differential equations (PDEs) are gaining in popularity because of their simplicity and their inherently “grid-free” nature. Good performance results have been demonstrated using RBF finite difference (RBF-FD) methods to solve the 2D shallow water equations on both CPUs and GPUs, despite the unstructured memory access patterns and large stencil sizes required by the method.

    This 2016 summer internship will focus on extending these 2D results to the 3D primitive equations of the atmosphere. The project could develop along one or more of the following three research axes, depending on the qualifications and interests of the successful student applicant(s): 1) optimizing and/or parallelizing the 3D primitive equations on one or more many-core architectures; 2) validating, evaluating, and improving the numerical results for standard test cases such as the Held-Suarez or baroclinic instability problems; 3) exploring different RBF-based numerical strategies (such as comparing RBFs based on Gaussian functions with those based on polyharmonic splines) to efficiently implement RBF-based PDE solvers on the sphere or in limited-area domains for climate and weather problems.

    In the case of option 1), the project would focus on achieving good performance on novel architectures, such as the latest Intel Xeon/Xeon Phi processors or NVIDIA GPUs. Beginning with an RBF kernel for the primitive equations on CPUs, the three key software design issues to tackle would be: a) developing an optimal GPU and Xeon Phi implementation of the primitive equations, together with a multi-card/multi-node distributed-memory implementation that minimizes communication overhead, possibly using techniques such as overlapping communication and computation; b) benchmarking and optimizing the node performance through application profiling; and c) understanding the techniques and prospects for achieving performance portability between CPUs and GPUs in a single code base. The student intern will explore, implement, and gather profiled measurements for the different strategies involved, and select an optimal strategy for testing scalability. The student will present results along with their analysis at the conclusion of the project.

    Skills/Qualifications: Strong programming skills in Matlab and in at least one of C, C++, or Fortran are required. Familiarity with techniques for solving PDEs numerically is required, and familiarity with RBFs is desirable. Familiarity with at least one parallel programming paradigm, such as the Message Passing Interface (MPI), thread parallelization using pragmas, or SIMD programming (e.g., with CUDA or OpenCL), is also required.



  3. Multi-node, multi-GPU implementation and design of parallel atmospheric PDE solvers using the discontinuous Galerkin method

    Non-hydrostatic (NH) models based on the Euler system of equations are used for high-resolution atmospheric modeling. Recently, element-based Galerkin methods such as the discontinuous Galerkin (DG) method have become increasingly popular for NH modeling due to their high-order accuracy, geometric flexibility, and excellent parallel efficiency. A 2D prototype (one vertical plus one horizontal dimension) NH model based on the DG method (DG-NH model) has been developed for research purposes. This MPI-parallel Euler solver has multiple time-stepping options and is used as a framework for testing new algorithms for spatial and temporal discretization.

    This 2016 summer internship will focus on extending the 2D MPI-parallel implementation of the DG-NH model to GPUs and achieving good performance on NVIDIA GPUs. Beginning with a single-threaded DG primitive equations kernel on CPUs, the three key software design issues to tackle will be: a) developing an optimal GPU implementation of the primitive equations, together with a multi-card/multi-node distributed-memory implementation that minimizes communication overhead, possibly using techniques such as overlapping communication and computation; b) benchmarking and optimizing the node performance through application profiling; and c) understanding the techniques and prospects for achieving performance portability between CPUs and GPUs in a single code base. The student intern will explore, implement, and gather profiled measurements for the different strategies involved, and select an optimal strategy for testing scalability. The student will present results along with their analysis at the conclusion of the project.

    Skills/Qualifications: Strong programming skills in Matlab, C, C++, or Fortran are required. Familiarity with techniques for solving partial differential equations numerically is required, as is familiarity with at least one parallel programming paradigm, such as the Message Passing Interface (MPI), thread parallelization using pragmas, or SIMD programming (e.g., with CUDA or OpenCL).
