模糊连接使用MapReduce

2012 IEEE 28th International Conference on Data Engineering Pub Date : 2012-04-01 DOI:10.1109/ICDE.2012.66

F. Afrati, A. Sarma, David Menestrina, Aditya G. Parameswaran, J. Ullman

{"title":"模糊连接使用MapReduce","authors":"F. Afrati, A. Sarma, David Menestrina, Aditya G. Parameswaran, J. Ullman","doi":"10.1109/ICDE.2012.66","DOIUrl":null,"url":null,"abstract":"Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so a given output pair is produced by only one task, for many algorithms, satisfying this condition is one of the biggest challenges. We break the cost of an algorithm into three components: the execution cost of the mappers, the execution cost of the reducers, and the communication cost from the mappers to reducers. The algorithms are presented first in terms of Hamming distance, but extensions to edit distance and Jaccard distance are shown as well. We find that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered. Our cost analyses enable applications to pick the optimal algorithm based on their communication, memory, and cluster requirements.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"106","resultStr":"{\"title\":\"Fuzzy Joins Using MapReduce\",\"authors\":\"F. Afrati, A. Sarma, David Menestrina, Aditya G. Parameswaran, J. Ullman\",\"doi\":\"10.1109/ICDE.2012.66\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so a given output pair is produced by only one task, for many algorithms, satisfying this condition is one of the biggest challenges. We break the cost of an algorithm into three components: the execution cost of the mappers, the execution cost of the reducers, and the communication cost from the mappers to reducers. The algorithms are presented first in terms of Hamming distance, but extensions to edit distance and Jaccard distance are shown as well. We find that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered. Our cost analyses enable applications to pick the optimal algorithm based on their communication, memory, and cluster requirements.\",\"PeriodicalId\":321608,\"journal\":{\"name\":\"2012 IEEE 28th International Conference on Data Engineering\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"106\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE 28th International Conference on Data Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE.2012.66\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 28th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2012.66","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 106

摘要

模糊/相似连接在研究界得到了广泛的研究，并广泛应用于实际应用中。本文提出并评估了几种从满足相似性阈值的输入集合中寻找所有元素对的算法。计算模型是单个MapReduce作业。因为我们只允许一次MapReduce循环，所以Reduce函数必须被设计成只由一个任务产生给定的输出对，对于许多算法来说，满足这个条件是最大的挑战之一。我们将一个算法的成本分为三个部分:映射器的执行成本，简化器的执行成本，以及从映射器到简化器的通信成本。这些算法首先从汉明距离的角度提出，然后扩展到编辑距离和Jaccard距离。我们发现使用MapReduce有许多不同的方法来解决相似连接问题，当考虑通信和reducer成本时，没有一种方法优于其他方法。我们的成本分析使应用程序能够根据其通信、内存和集群需求选择最佳算法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Fuzzy Joins Using MapReduce

Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so a given output pair is produced by only one task, for many algorithms, satisfying this condition is one of the biggest challenges. We break the cost of an algorithm into three components: the execution cost of the mappers, the execution cost of the reducers, and the communication cost from the mappers to reducers. The algorithms are presented first in terms of Hamming distance, but extensions to edit distance and Jaccard distance are shown as well. We find that there are many different approaches to the similarity-join problem using MapReduce, and none dominates the others when both communication and reducer costs are considered. Our cost analyses enable applications to pick the optimal algorithm based on their communication, memory, and cluster requirements.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 IEEE 28th International Conference on Data Engineering

自引率

0.00%

发文量

期刊最新文献

Keyword Query Reformulation on Structured Data Accuracy-Aware Uncertain Stream Databases Extracting Analyzing and Visualizing Triangle K-Core Motifs within Networks Project Daytona: Data Analytics as a Cloud Service Automatic Extraction of Structured Web Data with Domain Knowledge