SIParCS 2016 - Tao Zhao

Tao Zhao, University of Oklahoma

Analyzing Yellowstone System Monitoring Data to Improve Reliability and Streamline Supercomputer Administration

(Slides) (Recorded Talk)

Modern scientific research and industrial applications rely heavily on supercomputers. As computational power and scale grow, supercomputer systems become more complex, which unavoidably makes them more vulnerable to failure. Although each individual component in a modern supercomputer is very reliable, the probability of a failure somewhere in the system remains high because of the sheer number of components. Effectively identifying, and even predicting, system failures can therefore greatly benefit supercomputer system administration. In this study, we used machine learning techniques to analyze system monitoring data acquired on the Yellowstone supercomputer and developed a utility to predict compute node failures on that system. In our experiments, the utility predicted up to 50% of node failures up to 30 days ahead, with a false alarm rate below 1%.
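
The reliability argument in the abstract can be illustrated with a short calculation. The sketch below is not part of the study: it assumes that components fail independently and with the same probability, and the per-component failure probability and component counts are hypothetical values chosen only for illustration. Under those assumptions, the chance that at least one of n components fails is 1 - (1 - p)^n.

```python
def system_failure_probability(p_component: float, n_components: int) -> float:
    """Probability that at least one of n components fails.

    Assumes independent, identically reliable components, which is a
    simplification and not a claim about the Yellowstone system.
    """
    if not 0.0 <= p_component <= 1.0:
        raise ValueError("p_component must be between 0 and 1")
    if n_components < 0:
        raise ValueError("n_components must be non-negative")
    # The system survives only if every component survives.
    return 1.0 - (1.0 - p_component) ** n_components


if __name__ == "__main__":
    p = 1e-4  # hypothetical per-component failure probability in some time window
    for n in (1, 1_000, 10_000, 100_000):  # hypothetical component counts
        prob = system_failure_probability(p, n)
        print(f"{n:>7} components: P(at least one failure) = {prob:.4f}")
```

Even with a very small per-component probability, the system-level probability approaches 1 as the component count grows, which is the motivation for predicting failures rather than relying on component reliability alone.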

Mentors: Ben Matthews, Irfan Elahi, CISL