Similarity query processing for probabilistic sets

2013 IEEE 29th International Conference on Data Engineering (ICDE). Pub Date: 2013-04-08. DOI: 10.1109/ICDE.2013.6544885

Ming Gao, Cheqing Jin, Wei Wang, Xuemin Lin, Aoying Zhou


Citations: 8

Abstract

Evaluating similarity between sets is a fundamental task in computer science. However, there are many applications in which elements in a set may be uncertain due to various reasons. Existing work on modeling such probabilistic sets and computing their similarities suffers from huge model sizes or significant similarity evaluation cost, and hence is only applicable to small probabilistic sets. In this paper, we propose a simple yet expressive model that supports many applications where one probabilistic set may have thousands of elements. We define two types of similarities between two probabilistic sets using the possible world semantics; they complement each other in capturing the similarity distributions in the cross product of possible worlds. We design efficient dynamic programming-based algorithms to calculate both types of similarities. Novel individual and batch pruning techniques based on upper bounding the similarity values are also proposed. To accommodate extremely large probabilistic sets, we also design sampling-based approximate query processing methods with strong probabilistic guarantees. We have conducted extensive experiments using both synthetic and real datasets, and demonstrated the effectiveness and efficiency of our proposed methods.


