{"title":"提高使用混合存储处理大数据的Shuffle I/O性能","authors":"X. Ruan, Haiquan Chen","doi":"10.1109/ICCNC.2017.7876175","DOIUrl":null,"url":null,"abstract":"Nowadays big data analytics have been widely used in many domains, e.g., weather forecast, social network analysis, scientific computing, and bioinformatics. As indispensable part of big data analytics, MapReduce has become the de facto standard model of the distributed computing framework. With the growing complexity of software and hardware components, big data analytics systems face the challenge of performance bottleneck when handling the increasing size of computing workloads. In our study, we reveal that the existing Shuffle mechanism in the current Spark implementation is still the performance bottleneck due to the Shuffle I/O latency. We demonstrate that the Shuffle stage causes performance degradation among MapReduce jobs. By observing that the high-end Solid State Disks (SSDs) are capable of handling random writes well due to efficient flash translation layer algorithms and larger on-board I/O cache, we present a hybrid storage system-based solution that uses hard drive disks (HDDs) for large datasets storage and SSDs for improving Shuffle I/O performance to mitigate this performance degradation issue. Our extensive experiments using both real-world and synthetic workloads show that our hybrid storage system-based approach achieves performance improvement in the Shuffle stage compared with the original HDD-based Spark implementation.","PeriodicalId":135028,"journal":{"name":"2017 International Conference on Computing, Networking and Communications (ICNC)","volume":"105 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Improving Shuffle I/O performance for big data processing using hybrid storage\",\"authors\":\"X. 
Ruan, Haiquan Chen\",\"doi\":\"10.1109/ICCNC.2017.7876175\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Nowadays big data analytics have been widely used in many domains, e.g., weather forecast, social network analysis, scientific computing, and bioinformatics. As indispensable part of big data analytics, MapReduce has become the de facto standard model of the distributed computing framework. With the growing complexity of software and hardware components, big data analytics systems face the challenge of performance bottleneck when handling the increasing size of computing workloads. In our study, we reveal that the existing Shuffle mechanism in the current Spark implementation is still the performance bottleneck due to the Shuffle I/O latency. We demonstrate that the Shuffle stage causes performance degradation among MapReduce jobs. By observing that the high-end Solid State Disks (SSDs) are capable of handling random writes well due to efficient flash translation layer algorithms and larger on-board I/O cache, we present a hybrid storage system-based solution that uses hard drive disks (HDDs) for large datasets storage and SSDs for improving Shuffle I/O performance to mitigate this performance degradation issue. 
Our extensive experiments using both real-world and synthetic workloads show that our hybrid storage system-based approach achieves performance improvement in the Shuffle stage compared with the original HDD-based Spark implementation.\",\"PeriodicalId\":135028,\"journal\":{\"name\":\"2017 International Conference on Computing, Networking and Communications (ICNC)\",\"volume\":\"105 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 International Conference on Computing, Networking and Communications (ICNC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCNC.2017.7876175\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 International Conference on Computing, Networking and Communications (ICNC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCNC.2017.7876175","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Improving Shuffle I/O performance for big data processing using hybrid storage
Nowadays, big data analytics is widely used in many domains, e.g., weather forecasting, social network analysis, scientific computing, and bioinformatics. As an indispensable part of big data analytics, MapReduce has become the de facto standard model for distributed computing frameworks. With the growing complexity of software and hardware components, big data analytics systems face performance bottlenecks when handling computing workloads of increasing size. In our study, we reveal that the Shuffle mechanism in the current Spark implementation remains a performance bottleneck due to Shuffle I/O latency, and we demonstrate that the Shuffle stage causes performance degradation in MapReduce jobs. Observing that high-end Solid State Drives (SSDs) handle random writes well, owing to efficient flash translation layer algorithms and larger on-board I/O caches, we present a hybrid storage solution that uses hard disk drives (HDDs) to store large datasets and SSDs to improve Shuffle I/O performance, mitigating this degradation. Our extensive experiments with both real-world and synthetic workloads show that our hybrid storage approach improves Shuffle-stage performance compared with the original HDD-based Spark implementation.
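The split the abstract describes — bulk dataset storage on HDDs, Shuffle I/O on SSDs — can be approximated in a stock Spark deployment through configuration alone: Spark's `spark.local.dir` property controls where shuffle map outputs and spill files are written, while HDFS's `dfs.datanode.data.dir` controls where block data lives. A minimal sketch, with hypothetical mount points (`/mnt/ssd`, `/mnt/hdd*` are assumptions, not paths from the paper):

```
# spark-defaults.conf
# Direct shuffle and spill files to the SSD mount (hypothetical path).
# Multiple comma-separated directories may be listed to stripe I/O.
spark.local.dir    /mnt/ssd/spark-local
```

```xml
<!-- hdfs-site.xml: keep large dataset blocks on the HDDs (hypothetical paths) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/hdd1/hdfs/data,/mnt/hdd2/hdfs/data</value>
</property>
```

This configuration-level sketch only places the files; the paper's actual system may manage the hybrid tiering internally rather than relying on static mount assignment.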