Performance comparison by running benchmarks on Hadoop, Spark, and HAMR

Date
2015
Publisher
University of Delaware
Abstract
Today, Big Data is a hot topic in both industry and academia. Hadoop was developed as a solution for Big Data. It provides reliable, scalable, fault-tolerant, and efficient service for large-scale data processing, based on HDFS and MapReduce. HDFS, the Hadoop Distributed File System, provides distributed storage for the system, and MapReduce provides distributed processing. However, MapReduce is not suitable for all classes of applications. One alternative that overcomes this limitation of Hadoop is the new class of in-memory runtime systems such as Spark, which is designed to support applications that reuse a working set of data across multiple parallel operations [31]. The weakness of Spark is that its performance is restricted by available memory. HAMR is a new technology that runs faster than Hadoop and Spark while consuming less memory and CPU. At the time I started this thesis, CAPSL did not have a platform that gave students an environment for testing big data applications. The purpose of this thesis is not to conduct extensive research but to build a main ecosystem in which Hadoop and Spark run under the same working conditions. In addition, HAMR has been installed as a test platform in the research ecosystem. I also selected a set of big data benchmarks and ran preliminary tests on all three ecosystems.
To stress different aspects of the three big data runtimes, we selected and ran the PageRank, WordCount, Sort, TeraSort, K-means, and Naive Bayes benchmarks on the Hadoop and Spark runtime systems, and ran PageRank and WordCount on the HAMR runtime system. We measured running time, maximum and average memory and CPU usage, and throughput to compare the performance of these platforms across the six benchmarks. As a result, we found that Spark performs outstandingly on machine learning applications, including K-means and Naive Bayes. For PageRank, Spark runs faster with small input sizes. Spark is faster on WordCount. For Sort and TeraSort, Spark runs faster with large inputs. However, Spark consumes more memory, and its performance is restricted by the memory available. HAMR is faster than Hadoop on both of the benchmarks it ran, PageRank and WordCount, with improved CPU and memory usage.