Distributed Diversification of Large Datasets

2014 IEEE International Conference on Cloud Engineering · Pub Date: 2014-03-11 · DOI: 10.1109/IC2E.2014.19

M. Hasan, A. Mueen, V. Tsotras

Citations: 3

Abstract

Diversification has recently been proposed as a way to let a user grasp a large result set without looking through all relevant results. In this paper, we extend diversification into an analytical tool for exploring large datasets dispersed over many nodes. The diversification problem is in general NP-complete, and existing uniprocessor algorithms are not suitable for a distributed setting. Using the MapReduce framework, we consider two distinct approaches to the distributed diversification problem: one that optimizes disk I/O and one that optimizes network I/O. Both approaches are iterative, so the user can continue refining the diversified result if more time is available. Moreover, we prove that (i) the iteration converges and (ii) it produces a 2-approximate diversified result set relative to the optimal solution. We also develop a cost model that predicts the run time of both approaches from the network and disk characteristics. We implemented our approaches on a cluster of 40 cores and showed that they scale and produce results of the same quality as state-of-the-art uniprocessor algorithms.
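To make the notion of diversification concrete, the following is a minimal Python sketch of one common formulation: pick k points so that the smallest pairwise distance among them is large, using a greedy farthest-first selection. It is an illustration only and is not the algorithm from the paper. The abstract does not state the objective function, the distance measure, or the iteration scheme, so the Euclidean distance, the function names `farthest_first` and `diversify_partitions`, and the two-stage "select locally, then select again" structure are all assumptions made here to mimic a map step followed by a reduce step.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def farthest_first(points: Sequence[Point], k: int) -> List[Point]:
    """Greedy selection (assumed formulation, not the paper's method).

    Starts from the first point, then repeatedly adds the point whose
    distance to its nearest already-chosen point is largest.
    """
    if k <= 0 or not points:
        return []
    chosen = [0]
    # nearest[i] = distance from points[i] to its closest chosen point;
    # chosen indices are marked with -1 so they are never picked again.
    nearest = [dist(p, points[0]) for p in points]
    nearest[0] = -1.0
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, dist(p, points[nxt])) for d, p in zip(nearest, points)]
        nearest[nxt] = -1.0
    return [points[i] for i in chosen]


def diversify_partitions(partitions: Sequence[Sequence[Point]], k: int) -> List[Point]:
    """Hypothetical two-stage scheme over data split across nodes."""
    # "Map" step: every partition proposes k local candidates.
    candidates = [p for part in partitions for p in farthest_first(part, k)]
    # "Reduce" step: one node selects the final k from all candidates.
    return farthest_first(candidates, k)


if __name__ == "__main__":
    parts = [[(0.0,), (1.0,)], [(9.0,), (10.0,)]]
    print(diversify_partitions(parts, 2))
```

In this sketch only k candidates per partition cross the network, which hints at why network I/O and disk I/O can be traded off against each other; the paper's own approaches and cost model would need to be consulted for the actual design.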
