SYS - Supercomputer Systems Operations

Projects for summer 2016

  1. Analyzing Yellowstone System Monitoring Data to Improve Reliability and Streamline Supercomputer Administration

    Supercomputing users have come to expect a high degree of reliability and consistency from HPC centers. As compute resources grow towards exascale, the likelihood that some portion of the system is not functioning properly increases dramatically. To keep providing a consistent and stable environment, it is necessary to process the ever-increasing deluge of diagnostic data produced by HPC systems and to identify not only broken components but also those which are simply under-performing. Currently, problem discovery is done largely by manually reviewing logs with rudimentary UNIX string processing tools. We would like to investigate the possibility of using modern statistical and parallel programming techniques to separate the useful data from the noise.

    This position will allow the student to investigate and apply statistical, machine learning and/or “big data” techniques to the problem of finding “bad nodes”, that is, nodes which are not performing optimally, within the Yellowstone supercomputer. SSG and other groups at NCAR already collect a substantial amount of data about the system including (but not limited to) temperature, power consumption, software-generated logs, hardware event logs and failure statistics. The project is to take this data, along with any other data that can be collected, and develop an efficient tool and process for analyzing it. This analysis should be automated where possible to reduce or eliminate time-consuming and painful manual work. Depending on the applicant’s specific interests, this could mean, for example, analyzing logs to separate the real problems from the noise, or using environmental data to try to predict hardware failure. This is an open-ended research project: a successful result is one which allows the Yellowstone administrators either to discover system problems which might otherwise have been missed, or to discover and/or predict system problems more easily.
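
    As one illustration of the kind of statistical screening described above, the sketch below flags nodes whose reading for a single metric lies far from the rest of the machine, using a modified z-score based on the median and the median absolute deviation. It is only a starting point under stated assumptions: the function name, the node names, the temperature values and the 3.5 threshold are all hypothetical and are not taken from Yellowstone data or tooling.

```python
from statistics import median


def flag_outlier_nodes(readings, threshold=3.5):
    """Return the names of nodes whose reading is a robust outlier.

    `readings` maps a node name to one numeric measurement (for example a
    temperature). The modified z-score uses the median and the median
    absolute deviation (MAD), so a single bad node does not distort the
    baseline the way it would with a mean and standard deviation.
    """
    values = list(readings.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the readings are identical; no usable scale.
        return []
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores.
    return sorted(
        node
        for node, value in readings.items()
        if abs(0.6745 * (value - med) / mad) > threshold
    )


# Synthetic readings, invented purely for illustration.
sample = {
    "n001": 61.0,
    "n002": 62.5,
    "n003": 60.8,
    "n004": 61.7,
    "n005": 88.0,
    "n006": 62.1,
}
suspects = flag_outlier_nodes(sample)
print(suspects)
```

    A real tool would combine many metrics and a time dimension; the per-metric screen is shown only because it is easy to automate and to run over every node in parallel.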

    Skills/Qualifications: This project can be adjusted in scope to be suitable for either an undergraduate or graduate student with an interest in data analysis and parallel computing. At least some programming experience is necessary. Experience with a UNIX-like command line environment would help the student get started quickly. Knowledge of statistical analysis, “big data” and/or machine learning would also be beneficial.


  2. User environment for supercomputing at NCAR

    The petascale environment for supercomputing at NCAR needs to be designed so that a wide range of users, from people who may have barely seen a UNIX machine to experienced climate model developers, can accomplish their work as conveniently and efficiently as possible. Among other things, the environment needs to support a large number of tools, libraries and programs, and each of them often needs to be available in a variety of versions and configurations.

    On the current supercomputer, a combination of three techniques is used: the library search path is encoded in the binaries (as rpath); the compilers and linkers are not called directly, but through shell wrappers that use environment variables to find the right libraries and link them into the binaries; and the Lmod module utility is used to set these environment variables to point to the appropriate libraries.
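
    To make the wrapper idea concrete, here is a minimal sketch of how a wrapper could turn an environment variable into link flags that also encode the rpath. It is written in Python for readability, whereas the wrappers described above are shell scripts. The variable name EXAMPLE_LIB_DIRS, the function name and the paths are assumptions made for illustration; they are not the names used on the actual system. The -L and -Wl,-rpath, flags are the usual GCC-style options for adding a link-time search directory and for recording a run-time search path in the binary.

```python
import os


def build_link_command(compiler, args, env=None):
    """Return the command line a wrapper would pass to the real compiler.

    For every directory listed in the (hypothetical) colon-separated
    EXAMPLE_LIB_DIRS variable, add a -L flag so the linker finds the
    library and a -Wl,-rpath flag so the resulting binary records where
    to find it at run time.
    """
    env = os.environ if env is None else env
    lib_dirs = [d for d in env.get("EXAMPLE_LIB_DIRS", "").split(":") if d]
    extra = []
    for directory in lib_dirs:
        extra.append(f"-L{directory}")
        extra.append(f"-Wl,-rpath,{directory}")
    return [compiler, *args, *extra]


# Illustrative environment, as a module load might have set it up.
fake_env = {"EXAMPLE_LIB_DIRS": "/opt/netcdf/lib:/opt/hdf5/lib"}
command = build_link_command("gcc", ["main.o", "-o", "main"], fake_env)
print(" ".join(command))
```

    In this arrangement the module system only has to set the variable; users keep typing the compiler name they already know, and the binary still finds its libraries after the module is unloaded.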

    The setup described in the previous paragraph works adequately, but CISL is always interested in improving existing workflows and exploring alternative ones, so we are hiring an intern to do any or all of the following:

    • Improve the testing infrastructure of the current environment setup, using unittest in Python and autoconf, possibly taking advantage of the all-pairs combinatorial testing of
    • Explore alternatives for the user environment for NWSC-2, for example moving from the current modules and RPATH setup to, e.g., Docker/Shifter
    • Focus on the Python environment by integrating or replacing the current module-based environment with a virtualenv-based one
    • Develop a program to automatically generate module files from a template and a choice of compiler and MPI version
    • Evaluate XALT, a tool that collects job-level information about the libraries and executables that end users access during their jobs, as well as information about module usage
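
    As an example of the module file generator mentioned in the list above, the sketch below fills a template for an Lmod (Lua) module file from a package name, a version, and a choice of compiler and MPI. It is a rough starting point, not an existing NCAR tool: the directory layout, the function name, the default prefix and the version strings are hypothetical. The template uses only the standard Lmod functions whatis, prepend_path and pathJoin.

```python
from string import Template

# Hypothetical layout: <prefix>/<name>/<version>/<compiler>/<mpi>
MODULE_TEMPLATE = Template(
    """\
-- Generated module file (illustrative layout only)
whatis("$name $version built with $compiler and $mpi")
local base = "$prefix/$name/$version/$compiler/$mpi"
prepend_path("PATH", pathJoin(base, "bin"))
prepend_path("MANPATH", pathJoin(base, "share/man"))
"""
)


def render_modulefile(name, version, compiler, mpi, prefix="/opt/sw"):
    """Return the text of a module file for one build of a package.

    Template.substitute raises KeyError if a placeholder is left
    unfilled, so an incomplete template fails loudly instead of
    producing a broken module file.
    """
    return MODULE_TEMPLATE.substitute(
        name=name, version=version, compiler=compiler, mpi=mpi, prefix=prefix
    )


# Example values are invented for illustration.
text = render_modulefile("netcdf", "4.3.0", "intel-15.0", "mpt-2.11")
print(text)
```

    Looping this function over every compiler and MPI combination would produce the full set of module files, which is also a natural place to attach unittest checks on the generated text.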

    Skills/qualifications: Open to undergraduates. Experience with Linux or another UNIX-like environment and with software testing, as well as fluency in Python and shell scripting, is needed. Knowledge of autoconf and virtualenv is highly desirable.
