Comparison of Performance of Big Data Applications in Different Environments

Authors

Motwani, Devang

Abstract

Virtualization is routinely used by technology companies in software product development and deployment. Containers have generated considerable hype following the success of Docker, a container tool that made software development and deployment easier as well as more modular. Before containers, virtual machines were almost the only form of virtualization in use that was considered stable. As these technologies have emerged, much research has been published comparing the performance of host, virtual machine, and container environments. However, little work has examined big data applications, such as microbenchmarks, graph applications, search applications, and machine learning, running in these environments over frameworks such as Apache Hadoop, Spark, and Flink, in a way that shows how different components of the computer architecture are affected. This thesis compares the performance of these standard big data applications on host, virtual machines, and containers (specifically Docker) by running the applications and collecting hardware counter values such as the C0 state, L3/L2 cache hits, and CPU power, which are then used to create a pictorial representation of the comparison. The comparison shows that, with some trade-offs, graph, search, and machine learning applications execute faster and more efficiently on Apache Flink than on Apache Spark and Hadoop MapReduce. Microbenchmark applications, however, run faster and more efficiently on Apache Spark than on Hadoop MapReduce or Apache Flink.
Previous research has stated that host and container environments have almost the same performance, while applications running on virtual machines incur some overhead. The results of this work, however, indicate that no general conclusion can be drawn about performance across environments: different applications running on different frameworks perform differently, and these differences are discussed in detail in this work.

Keywords

Virtualization performance, Virtual machines and Docker, Intel Performance Counter Monitor, Big data applications performance, Container performance, Apache Hadoop, Spark and Flink
