Exploiting MapReduce and data compression for data-intensive applications

Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery. Pub Date: 2013-07-22. DOI: 10.1145/2484762.2484785

Guangchen Ruan, Hui Zhang, Beth Plale


Citations: 8

Abstract

HPC platforms handle predominantly compute-intensive jobs well; however, data-intensive jobs still struggle on them, because large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assumes that data are resident on local disks and that computation is scheduled where the data are located. On an HPC machine, however, data must be staged from a broader file system (such as Lustre) to HDFS, where they can be accessed; this staging can add a substantial delay to processing. In this paper we examine the effect of data compression on the bandwidth needed to get data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications: a 3D time-series caries lesion assessment on a large-scale medical image dataset, and an HTRC word-counting task for large-scale text analysis running on XSEDE resources. Our extensive experimental results demonstrate significant improvements in storage space, data stage-in time, and job execution time.
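The idea of staging compressed data and decompressing it only inside the map step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes gzip as the codec, uses in-memory byte strings in place of files staged to HDFS, and the function names (`compress_chunk`, `map_word_count`, `reduce_counts`) are hypothetical. It only shows that fewer bytes need to be moved while the word counts are unchanged.

```python
import gzip
from collections import Counter
from functools import reduce


def compress_chunk(text: str) -> bytes:
    """Compress one input split before it is staged (gzip is an assumed codec)."""
    return gzip.compress(text.encode("utf-8"))


def map_word_count(compressed: bytes) -> Counter:
    """Map step: decompress the split locally, then count its words."""
    text = gzip.decompress(compressed).decode("utf-8")
    return Counter(text.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial word counts."""
    return left + right


def word_count(compressed_chunks) -> Counter:
    """Run the map step on every split and fold the partial results together."""
    return reduce(reduce_counts, map(map_word_count, compressed_chunks), Counter())


# Toy input splits standing in for large text files (illustrative data only).
chunks = ["the quick brown fox " * 200, "the lazy dog and the fox " * 200]
compressed = [compress_chunk(c) for c in chunks]

raw_bytes = sum(len(c.encode("utf-8")) for c in chunks)
staged_bytes = sum(len(c) for c in compressed)
totals = word_count(compressed)

print(f"raw bytes: {raw_bytes}, staged bytes: {staged_bytes}")
print(f"count of 'the': {totals['the']}, count of 'fox': {totals['fox']}")
```

The toy input is highly repetitive, so it compresses far better than real text or image data would; the actual savings in stage-in time depend on the dataset and codec, and decompression adds CPU cost in each map task.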


