SIParCS 2017 - Pranay Reddy Kommera

Pranay Reddy Kommera, University of Wyoming

Implementation of a Discontinuous Galerkin 3D Euler Solver on Many-Core CPUs and GPUs

Recorded Talk

Non-hydrostatic (NH) models based on the compressible Euler system of equations are used for simulating the atmosphere. The discontinuous Galerkin (DG) method has several advantages on contemporary multi-core and many-core processors: nearest-neighbor communication, spectral accuracy, and high computational intensity. For these reasons, it is attractive as a spatial discretization scheme for NH models. A 3D prototype of an NH-DG model has been developed in an (x, y, z) Cartesian domain, with z being the vertical direction and x and y being the horizontal directions.
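For orientation, the nearest-neighbor property can be seen in a generic DG weak form. The sketch below assumes the Euler system is written in flux form with state vector U, flux F, and source S. The symbols, the test function, and the numerical flux are illustrative assumptions, not the prototype's exact formulation.

```latex
% Generic conservation (flux) form, assumed here for illustration:
\frac{\partial U}{\partial t} + \nabla \cdot F(U) = S(U)

% DG weak form on one element \Omega_e, with test function \varphi,
% approximate solution U_h, outward normal n, and numerical flux \hat{F}:
\int_{\Omega_e} \frac{\partial U_h}{\partial t}\,\varphi \, d\Omega
  - \int_{\Omega_e} F(U_h) \cdot \nabla \varphi \, d\Omega
  + \oint_{\partial \Omega_e} \hat{F} \cdot n \,\varphi \, dS
  = \int_{\Omega_e} S(U_h)\,\varphi \, d\Omega
```

Only the surface integral couples an element to others, and only through the faces it shares with adjacent elements. The volume integrals are local to the element, which is where the high computational intensity comes from.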

In this summer project, the 3D NH-DG prototype was efficiently implemented on a suite of current HPC architectures. First, a single-threaded version of the prototype was optimized on Intel Xeon CPUs and subsequently adapted for parallel execution on three architectures: Intel Xeon and Xeon Phi processors, and NVIDIA GPUs. Two different organizations of the element-wise loops of the NH-DG algorithm were studied, and for each, three different memory layouts for the element data were evaluated to see which combination gave the best cross-platform performance. Thread parallelization of the 3D NH-DG prototype is primarily implemented using directive-based programming models, namely OpenMP and OpenACC, which emphasize portability. The tradeoff between portability and readability was studied by developing a SIMD implementation using NVIDIA’s CUDA programming language on the GPU, and its performance was compared with that of the other versions.
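The idea of pairing an element-wise loop with a choice of memory layout can be sketched in C as follows. This is a minimal illustration and not the project's code: the sizes `NVAR` and `NPTS`, the two index functions, and the `update` kernel are all hypothetical, and the abstract does not say which layouts were actually evaluated.

```c
#include <stddef.h>

/* Hypothetical sizes, chosen only for illustration:
 * 5 prognostic variables and 4x4x4 = 64 nodes per element. */
enum { NVAR = 5, NPTS = 64 };

/* Layout A (element-major): all data of one element is contiguous. */
static inline size_t idx_elem_major(size_t e, size_t v, size_t p, size_t nelem) {
    (void)nelem;  /* not needed for this layout */
    return (e * NVAR + v) * NPTS + p;
}

/* Layout B (variable-major): one variable is contiguous across all elements. */
static inline size_t idx_var_major(size_t e, size_t v, size_t p, size_t nelem) {
    return (v * nelem + e) * NPTS + p;
}

typedef size_t (*index_fn)(size_t e, size_t v, size_t p, size_t nelem);

/* Element-wise update u += dt * rhs. The outer element loop is threaded
 * with an OpenMP directive; without OpenMP enabled the pragma is ignored
 * and the loop runs serially. */
void update(double *u, const double *rhs, double dt, size_t nelem, index_fn idx) {
    #pragma omp parallel for
    for (size_t e = 0; e < nelem; ++e) {
        for (size_t v = 0; v < NVAR; ++v) {
            for (size_t p = 0; p < NPTS; ++p) {
                size_t i = idx(e, v, p, nelem);
                u[i] += dt * rhs[i];
            }
        }
    }
}
```

Calling `update(u, rhs, dt, nelem, idx_elem_major)` or `update(u, rhs, dt, nelem, idx_var_major)` runs the same kernel over either layout. In a tuned code the index computation would normally be fixed at compile time rather than passed as a function pointer, since the indirect call can inhibit vectorization.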

The best-performing single-node implementations for CPUs and GPUs were extended to distributed-memory execution via MPI, with direct GPU-GPU communication enabled on the NVIDIA GPUs. The versions were benchmarked on Intel Xeon E5-2697v4 (Broadwell), Intel Xeon Phi 7250 (Knights Landing), and NVIDIA Tesla P100 (Pascal) systems, and the results are used to demonstrate scalability and performance portability.

Mentors: Ram Nair, Raghu Raj, Rich Loft