Raspberry Pi Hadoop Cluster

07/31/2014 - 10:35am
Mesa Lab Main Seminar Room

Lauren Patterson, CISL Summer Extern
(Hampton University)

The Hadoop MapReduce paradigm has recently become popular because of its ability to process large data sets. Faced with the problem of finding the interconnect path between any two nodes in the Yellowstone supercomputer in order to analyze job performance, we decided to use a low-cost miniature cluster of Raspberry Pi computers running Hadoop MapReduce. Hadoop was chosen for this project because it has a built-in distributed file system, HDFS, which automatically replicates data across the nodes, making the data easier to retrieve and more efficient to process. Hadoop also comes with a parallel data processing framework, MapReduce; Apache Pig, a data flow language, runs on top of it and executes scripts as MapReduce jobs. This environment was deployed on both the Raspberry Pi cluster and a virtual machine running on a MacBook. A Pig program was developed to trace the path accurately using two log files acquired from Yellowstone. Hadoop benchmarks were run on both systems to find ways to optimize both the Pig script and Hadoop itself. We found Hadoop to be less efficient at finding the path than an equivalent script written in Python. This presentation will describe in more detail the tests that led to this conclusion.

Video Presentation