{"title":"Performance evaluation of Apache Hadoop and Apache Spark for parallelization of compute-intensive tasks","authors":"Alexander Döschl, Max-Emanuel Keller, P. Mandl","doi":"10.1145/3428757.3429121","DOIUrl":null,"url":null,"abstract":"There have been numerous studies that have examined the performance of distribution frameworks. Most of these studies deal with the processing of large amounts of data. This work compares two of these frameworks for their ability to implement CPU-intensive distributed algorithms. As a case study for our experiments we used a simple but computationally intensive puzzle. To find all solutions using brute-force search, 15! permutations had to be calculated and tested against the solution rules. Our experimental application was implemented in the Java programming language using a simple algorithm and having two distributed solutions with the paradigms MapReduce (Apache Hadoop) and RDD (Apache Spark). The implementations were benchmarked in Amazon-EC2/EMR clusters for performance and scalability measurements, where the processing time of both solutions scaled approximately linearly. However, according to our experiments, the number of tasks, hardware utilization and other aspects should also be taken into consideration when assessing scalability. The comparison of the solutions with MapReduce (Apache Hadoop) and RDD (Apache Spark) under Amazon EMR showed that the processing time measured in CPU minutes with Spark was up to 30 % lower, while the performance of Spark especially benefits from an increasing number of tasks. Considering the efficiency of using the EC2 resources, the implementation via Apache Spark was even more powerful than a comparable multithreaded Java solution.","PeriodicalId":212557,"journal":{"name":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3428757.3429121","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Numerous studies have examined the performance of distributed computing frameworks, most of them focusing on the processing of large amounts of data. This work instead compares two of these frameworks with respect to their suitability for CPU-intensive distributed algorithms. As a case study for our experiments we used a simple but computationally intensive puzzle: to find all solutions by brute-force search, 15! permutations had to be generated and tested against the solution rules. Our experimental application was implemented in Java using a simple brute-force algorithm, with two distributed solutions based on the MapReduce (Apache Hadoop) and RDD (Apache Spark) paradigms. The implementations were benchmarked on Amazon EC2/EMR clusters to measure performance and scalability, and the processing time of both solutions scaled approximately linearly. According to our experiments, however, the number of tasks, hardware utilization, and other aspects should also be taken into consideration when assessing scalability. The comparison of the MapReduce (Apache Hadoop) and RDD (Apache Spark) solutions on Amazon EMR showed that the processing time with Spark, measured in CPU minutes, was up to 30% lower, and that Spark in particular benefits from an increasing number of tasks. In terms of efficient use of EC2 resources, the Apache Spark implementation even outperformed a comparable multithreaded Java solution.
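The paper's source code is not included here, but the approach the abstract describes can be illustrated with a minimal sketch in Java using Spark's RDD API: the 15! permutation indices are split into contiguous blocks, each task decodes and tests its block, and the per-task counts are summed. The class name PermutationSearch, the fixed task count of 1024, the block-partitioning scheme, and the checkSolution stub are all illustrative assumptions, not the authors' implementation; the abstract does not state the puzzle's actual solution rules.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;

public class PermutationSearch {

    static long factorial(int n) {
        long f = 1;
        for (int k = 2; k <= n; k++) f *= k;
        return f;
    }

    // Decodes an index in [0, n!) into the corresponding permutation of
    // {1, ..., n} via the factorial number system (Lehmer code), so that
    // workers can enumerate disjoint ranges without shipping permutations.
    static int[] permutationFromIndex(long index, int n) {
        List<Integer> pool = new ArrayList<>();
        for (int i = 1; i <= n; i++) pool.add(i);
        int[] perm = new int[n];
        long remaining = index;
        for (int i = n; i >= 1; i--) {
            long f = factorial(i - 1);
            perm[n - i] = pool.remove((int) (remaining / f));
            remaining %= f;
        }
        return perm;
    }

    // Hypothetical placeholder: the abstract does not give the puzzle's
    // solution rules, so the actual rule check must be plugged in here.
    static boolean checkSolution(int[] perm) {
        return false;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PermutationSearch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            final int n = 15;
            final long total = factorial(n);   // 15! = 1,307,674,368,000
            final int numTasks = 1024;         // illustrative task count
            final long blockSize = total / numTasks;

            List<Integer> taskIds = new ArrayList<>();
            for (int t = 0; t < numTasks; t++) taskIds.add(t);

            // Each task enumerates and tests one contiguous block of the
            // permutation index space; the counts are summed on the driver.
            JavaRDD<Integer> tasks = sc.parallelize(taskIds, numTasks);
            long solutions = tasks.map(taskId -> {
                long start = taskId * blockSize;
                long end = (taskId == numTasks - 1) ? total : start + blockSize;
                long hits = 0;
                for (long i = start; i < end; i++) {
                    if (checkSolution(permutationFromIndex(i, n))) hits++;
                }
                return hits;
            }).reduce(Long::sum);

            System.out.println("Solutions found: " + solutions);
        }
    }
}
```

Partitioning by permutation index keeps each task's input to two long values, so the number of tasks can be raised cheaply; this fits the abstract's observation that Spark's performance especially benefits from an increasing number of tasks.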