Accelerating big data applications using lightweight virtualization framework on enterprise cloud

2017 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2017-09-01. DOI: 10.1109/HPEC.2017.8091086

J. Bhimani, Zhengyu Yang, M. Leeser, N. Mi


Citations: 57

Abstract

Hypervisor-based virtualization technology has been used successfully to deploy high-performance, scalable infrastructure for Hadoop and, more recently, Spark applications. Container-based virtualization is becoming an important alternative and is increasingly used because it is lightweight and scales better than Virtual Machines (VMs). With containerization techniques such as Docker maturing and promising better performance, Docker can be used to speed up big data applications. However, because applications differ in behavior and resource requirements, it is important to analyze and compare the performance of applications running in the cloud on VMs and on Docker containers before replacing traditional hypervisor-based virtual machines with Docker. VM-based virtualization provides distributed resource management, with each virtual machine running on its own allocated resources, while Docker relies on a pool of resources shared among all containers. Here, we investigate the performance of different Apache Spark applications using both VMs and Docker containers. While others have looked at Docker's performance, this is the first study that compares these virtualization frameworks for a big data enterprise cloud environment using Apache Spark. In addition to makespan and execution time, we analyze the utilization of different resources (CPU, disk, memory, etc.) by Spark applications. Our results show that Spark using Docker can obtain a speed-up of more than 10 times compared to using VMs. However, we observe that this may not apply to all applications, because workload patterns differ and virtual machines and containers use different resource management schemes. Our work can guide application developers, system administrators, and researchers in designing and deploying big data applications on their platforms to improve overall performance.
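The two headline metrics named in the abstract, makespan and speed-up, can be illustrated with a minimal sketch. It assumes the usual definitions (makespan as the time from the first job's start to the last job's finish, speed-up as the ratio of baseline time to candidate time); the abstract does not spell out the authors' exact measurement procedure. The names `Job`, `makespan`, and `speedup` are hypothetical helpers introduced for illustration, and no timings from the paper are used.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Job:
    """Start and end timestamps of one application run, in seconds (hypothetical record type)."""
    start: float
    end: float


def makespan(jobs: Iterable[Job]) -> float:
    """Time from the earliest start to the latest finish across all jobs."""
    jobs = list(jobs)
    if not jobs:
        raise ValueError("makespan is undefined for an empty set of jobs")
    return max(j.end for j in jobs) - min(j.start for j in jobs)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Ratio of baseline time (e.g. VM) to candidate time (e.g. Docker)."""
    if candidate_seconds <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_seconds / candidate_seconds
```

A speed-up above 1 means the candidate configuration finished sooner than the baseline. Comparing makespan as well as per-application execution time matters when several applications run concurrently, since shared resources can shorten one run while lengthening another.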



