SIParCS 2021 - Jordan DuBeau
Designing Machine Learning Models to Conserve the RDA's Computing Resources
When users request a subset of one of the Research Data Archive (RDA)'s datasets, the request is carried out on Casper via a batch job that previously relied on simple heuristic estimates for wall time and maximum memory usage. We improved these estimates substantially using machine learning. The project involved collecting data on more than fifty thousand requests, converting that data into a format suitable for machine learning, selecting a model and a prediction strategy, training and testing the model, and finally integrating it into the request process on the RDA website. In the end, our best estimates for memory usage and wall time came from Random Forest classification models, which predict the probability that each request falls into a given range of resource usage (0-50 MB, 50-100 MB, etc.). We used the probabilities output by the model to generate safer estimates: a job is allocated a larger amount of resources whenever there is a nontrivial chance it will need them, even if it is more likely to require only a small amount. As a bonus, we also trained a less generous but more accurate regression model that predicts the time to complete each job, so that this estimate can be shown to users on the RDA website.
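The core idea, picking a conservative resource bucket from a classifier's probability output rather than taking the single most likely class, can be sketched as follows. This is a minimal illustration with made-up features, bucket edges, and a hypothetical `risk` threshold, not the project's actual code; only the use of scikit-learn's `RandomForestClassifier` and `predict_proba` mirrors the approach described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Upper edge (in MB) of each usage range; class label i means
# "memory usage fell in bucket i". Values here are illustrative.
BUCKETS_MB = [50, 100, 150, 200]

# Stand-in training data: random request features and bucket labels.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, len(BUCKETS_MB), 200)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def safe_estimate_mb(features, risk=0.05):
    """Return the smallest bucket such that the predicted chance of the
    job needing MORE than that bucket is at most `risk`."""
    proba = clf.predict_proba(np.asarray(features).reshape(1, -1))[0]
    # predict_proba columns follow clf.classes_ (sorted), which here
    # matches the bucket order, so a cumulative sum gives
    # P(usage <= bucket i) for each i.
    cum = np.cumsum(proba)
    idx = int(np.searchsorted(cum, 1.0 - risk))
    return BUCKETS_MB[min(idx, len(BUCKETS_MB) - 1)]
```

Lowering `risk` makes the estimate more generous: even if a small bucket is the single most probable outcome, a nontrivial tail probability on the larger buckets pushes the allocation upward, which is the safety behavior the paragraph describes.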
Mentors: Riley Conroy & Brian Vanderwende
Slides and poster