A Ramdisk Provisioning Service for High Performance Data Analysis

07/29/2011 - 1:00pm to 1:25pm
Main Seminar Room - ML

Allan Espinosa, University of Chicago

Abstract: Data-intensive postprocessing analysis is an important component of climate simulation science. For example, a high-resolution CESM atmosphere simulation workflow produces vast datasets at computing centers on TeraGrid. As the simulation progresses, scientists need to transfer data back to their home institutions to perform data-intensive diagnostics and analysis. Doing so, however, incurs significant disk I/O overhead in transferring and reading the data. Previous work demonstrated that running the analysis with the dataset on a RAM-based file system, rather than a spinning-disk file system, significantly decreases analysis time. We build on this result by adding storage provisioning and data transfer steps to the analysis workflow and integrating these concepts as services on a Linux cluster. We configured the Torque resource manager and the Maui scheduler to provide a special queue that allocates RAM-disk space for users. The analysis cluster’s resource manager can then be used to manage temporary allocations, data transfer, and postprocessing tasks. The final result is an end-to-end data postprocessing pipeline that orchestrates data transfer from TeraGrid supercomputers to NCAR, executes the standard diagnostic, and transfers the data to the tape archive system, all without placing the data on spinning disk.

Presented on July 29, 2011