Distributed Diversification of Large Datasets
M. Hasan, A. Mueen, V. Tsotras
2014 IEEE International Conference on Cloud Engineering, published 2014-03-11
DOI: 10.1109/IC2E.2014.19 (https://doi.org/10.1109/IC2E.2014.19)

Diversification has recently been proposed as a way to help a user grasp a large result set without having to look through all relevant results. In this paper, we extend diversification into an analytical tool for exploring large datasets dispersed over many nodes. The diversification problem is in general NP-complete, and existing uniprocessor algorithms are not suitable for our distributed setting. Using the MapReduce framework, we consider two distinct approaches to the distributed diversification problem: one that optimizes disk I/O and one that optimizes network I/O. Both approaches are iterative, allowing the user to continue refining the diversified result if more time is available. Moreover, we prove that (i) the iteration process converges and (ii) it produces a 2-approximate diversified result set relative to the optimal solution. We also develop a cost model that predicts the run time of both approaches from the network and disk characteristics. We implemented our approaches on a cluster of 40 cores and showed that they are scalable and produce results of the same quality as state-of-the-art uniprocessor algorithms.
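
To make the idea of diversifying partitioned data concrete, the following is a minimal sketch and not the algorithms of the paper. It assumes a max-min objective (maximize the smallest pairwise distance among the k selected items), Euclidean distance, and a single map/reduce-style round simulated in memory; the abstract does not state the objective, the distance function, or the details of either iterative approach. The function names `greedy_diversify`, `distributed_diversify`, and `min_pairwise_distance` are hypothetical. Each partition first picks k local candidates with a farthest-first greedy heuristic (the "map" step), and the same heuristic is then applied to the union of the candidates (the "reduce" step).

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def min_pairwise_distance(points: Sequence[Point]) -> float:
    """Assumed diversity score: the smallest distance between any two points."""
    return min(
        (math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]),
        default=math.inf,
    )


def greedy_diversify(points: Sequence[Point], k: int) -> List[Point]:
    """Farthest-first greedy selection of up to k points (illustrative only)."""
    if k <= 0 or not points:
        return []
    selected = [points[0]]  # arbitrary starting point
    # nearest[i] = distance from points[i] to its closest selected point
    nearest = [math.dist(p, selected[0]) for p in points]
    while len(selected) < min(k, len(points)):
        # Pick the point that is farthest from everything selected so far.
        idx = max(range(len(points)), key=nearest.__getitem__)
        selected.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return selected


def distributed_diversify(partitions: Sequence[Sequence[Point]], k: int) -> List[Point]:
    """Simulate one map/reduce round over in-memory partitions (an assumption)."""
    # "Map": every partition proposes k locally diverse candidates.
    candidates = [p for part in partitions for p in greedy_diversify(part, k)]
    # "Reduce": diversify the much smaller union of candidates.
    return greedy_diversify(candidates, k)


if __name__ == "__main__":
    parts = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (11.0, 0.0)]]
    result = distributed_diversify(parts, 2)
    print(result, min_pairwise_distance(result))
```

In this sketch only k candidates per partition reach the reduce step, which keeps the data shuffled between nodes small. An iterative scheme such as the one the abstract describes would repeat refinement rounds rather than stop after one pass.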