Exploiting MapReduce and data compression for data-intensive applications

Guangchen Ruan, Hui Zhang, Beth Plale
{"title":"Exploiting MapReduce and data compression for data-intensive applications","authors":"Guangchen Ruan, Hui Zhang, Beth Plale","doi":"10.1145/2484762.2484785","DOIUrl":null,"url":null,"abstract":"HPC platform shows good success for predominantly compute-intensive jobs, however, data intensive jobs still struggle on HPC platform as large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the \"moving computation to data\" paradigm for many pleasingly parallel applications, assumes that data are resident on local disks and computation is scheduled where the data are located. However, on an HPC machine data must be staged from a broader file system (such as Luster), to HDFS where it can be accessed; this staging can represent a substantial delay in processing. In this paper we look at data compression's effect on reducing bandwidth needs of getting data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications, a 3D-time series caries lesion assessment focusing on large scale medical image dataset, and a HTRC word counting task concerning large scale text analysis running on XSEDE resources. Our extensive experimental results demonstrate significant performance improvement in terms of storage space, data stage-in time, and job execution time.","PeriodicalId":426819,"journal":{"name":"Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery","volume":"69 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2484762.2484785","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

HPC platform shows good success for predominantly compute-intensive jobs, however, data intensive jobs still struggle on HPC platform as large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assumes that data are resident on local disks and computation is scheduled where the data are located. However, on an HPC machine data must be staged from a broader file system (such as Luster), to HDFS where it can be accessed; this staging can represent a substantial delay in processing. In this paper we look at data compression's effect on reducing bandwidth needs of getting data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications, a 3D-time series caries lesion assessment focusing on large scale medical image dataset, and a HTRC word counting task concerning large scale text analysis running on XSEDE resources. Our extensive experimental results demonstrate significant performance improvement in terms of storage space, data stage-in time, and job execution time.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
为数据密集型应用程序开发MapReduce和数据压缩
HPC平台在以计算密集型为主的作业方面取得了良好的成功,但是,数据密集型作业在HPC平台上仍然很挣扎,因为从I/O节点到计算节点的大量并发数据移动很容易使网络链路饱和。MapReduce是许多令人愉快的并行应用程序的“将计算移动到数据”范例,它假设数据驻留在本地磁盘上,计算在数据所在的位置进行调度。然而,在HPC机器上,数据必须从一个更广泛的文件系统(如Luster)暂放到HDFS,在那里它可以被访问;这种分段可以表示处理过程中的大量延迟。在本文中,我们将研究数据压缩对减少向应用程序获取数据的带宽需求的影响,以及它对数据密集型应用程序的总体性能的影响。我们的研究考察了两种类型的应用,一种是专注于大规模医学图像数据集的3d时间序列龋齿损伤评估,另一种是在XSEDE资源上运行的涉及大规模文本分析的HTRC单词计数任务。我们广泛的实验结果表明,在存储空间、数据阶段导入时间和作业执行时间方面,性能有了显著提高。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Optimizing utilization across XSEDE platforms Adaptive latency-aware parallel resource mapping: task graph scheduling onto heterogeneous network topology Optimizing the PCIT algorithm on stampede's Xeon and Xeon Phi processors for faster discovery of biological networks Training, education, and outreach: raising the bar Preliminary experiences with the uintah framework on Intel Xeon Phi and stampede
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1