Distributed Diversification of Large Datasets

M. Hasan, A. Mueen, V. Tsotras
{"title":"Distributed Diversification of Large Datasets","authors":"M. Hasan, A. Mueen, V. Tsotras","doi":"10.1109/IC2E.2014.19","DOIUrl":null,"url":null,"abstract":"Diversification has been recently proposed as an approach to allow a user to better grasp a large result set without having to look through all relevant results. In this paper, we expand the use of diversification as an analytical tool to explore large datasets dispersed over many nodes. The diversification problem is in general NP-complete and existing uniprocessor algorithms are unfortunately not suitable for the distributed setting of our environment. Using the MapReduce framework we consider two distinct approaches to solve the distributed diversification problem, one that focuses on optimizing disk I/O and one that optimizes for network I/O. Our approaches are iterative in nature, allowing the user to continue refining the diversification process if more time is available. Moreover, we prove that (i) this iteration process converges and (ii) it produces a 2-approximate diversified result set when compared to the optimal solution. We also develop a cost model to predict the run-time for both approaches based on the network and disk characteristics. We implemented our approaches on a cluster of 40 cores and showed that they are scalable and produce the same quality results as the state-of-the-art uniprocessor algorithms.","PeriodicalId":273902,"journal":{"name":"2014 IEEE International Conference on Cloud Engineering","volume":"24 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE International Conference on Cloud Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IC2E.2014.19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Diversification has been recently proposed as an approach to allow a user to better grasp a large result set without having to look through all relevant results. In this paper, we expand the use of diversification as an analytical tool to explore large datasets dispersed over many nodes. The diversification problem is in general NP-complete and existing uniprocessor algorithms are unfortunately not suitable for the distributed setting of our environment. Using the MapReduce framework we consider two distinct approaches to solve the distributed diversification problem, one that focuses on optimizing disk I/O and one that optimizes for network I/O. Our approaches are iterative in nature, allowing the user to continue refining the diversification process if more time is available. Moreover, we prove that (i) this iteration process converges and (ii) it produces a 2-approximate diversified result set when compared to the optimal solution. We also develop a cost model to predict the run-time for both approaches based on the network and disk characteristics. We implemented our approaches on a cluster of 40 cores and showed that they are scalable and produce the same quality results as the state-of-the-art uniprocessor algorithms.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
大型数据集的分布式多样化
多样化最近被提议作为一种方法,允许用户更好地掌握一个大的结果集,而不必查看所有相关的结果。在本文中,我们扩展了多样化作为分析工具的使用,以探索分散在许多节点上的大型数据集。多样化问题一般是np完全的,现有的单处理器算法不适合我们环境的分布式设置。使用MapReduce框架,我们考虑了两种不同的方法来解决分布式多样化问题,一种侧重于优化磁盘I/O,另一种侧重于优化网络I/O。我们的方法本质上是迭代的,如果有更多的时间,允许用户继续改进多样化的过程。此外,我们证明了(i)该迭代过程是收敛的,(ii)与最优解相比,它产生了一个2-近似的多样化结果集。我们还开发了一个成本模型来预测基于网络和磁盘特性的两种方法的运行时间。我们在40核的集群上实现了我们的方法,并证明了它们是可扩展的,并且产生了与最先进的单处理器算法相同的质量结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Combining Declarative and Imperative Cloud Application Provisioning Based on TOSCA Splicing MPLS and OpenFlow Tunnels Based on SDN Paradigm CoMoT -- A Platform-as-a-Service for Elasticity in the Cloud A Verification Platform for SDN-Enabled Applications Extraction of Bridges from High Resolution Remote Sensing Image Based on Topology Modeling
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1