Similarity query processing for probabilistic sets

Ming Gao, Cheqing Jin, Wei Wang, Xuemin Lin, Aoying Zhou
{"title":"Similarity query processing for probabilistic sets","authors":"Ming Gao, Cheqing Jin, Wei Wang, Xuemin Lin, Aoying Zhou","doi":"10.1109/ICDE.2013.6544885","DOIUrl":null,"url":null,"abstract":"Evaluating similarity between sets is a fundamental task in computer science. However, there are many applications in which elements in a set may be uncertain due to various reasons. Existing work on modeling such probabilistic sets and computing their similarities suffers from huge model sizes or significant similarity evaluation cost, and hence is only applicable to small probabilistic sets. In this paper, we propose a simple yet expressive model that supports many applications where one probabilistic set may have thousands of elements. We define two types of similarities between two probabilistic sets using the possible world semantics; they complement each other in capturing the similarity distributions in the cross product of possible worlds. We design efficient dynamic programming-based algorithms to calculate both types of similarities. Novel individual and batch pruning techniques based on upper bounding the similarity values are also proposed. To accommodate extremely large probabilistic sets, we also design sampling-based approximate query processing methods with strong probabilistic guarantees. We have conducted extensive experiments using both synthetic and real datasets, and demonstrated the effectiveness and efficiency of our proposed methods.","PeriodicalId":399979,"journal":{"name":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 29th International Conference on Data Engineering (ICDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2013.6544885","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Evaluating similarity between sets is a fundamental task in computer science. However, there are many applications in which elements in a set may be uncertain due to various reasons. Existing work on modeling such probabilistic sets and computing their similarities suffers from huge model sizes or significant similarity evaluation cost, and hence is only applicable to small probabilistic sets. In this paper, we propose a simple yet expressive model that supports many applications where one probabilistic set may have thousands of elements. We define two types of similarities between two probabilistic sets using the possible world semantics; they complement each other in capturing the similarity distributions in the cross product of possible worlds. We design efficient dynamic programming-based algorithms to calculate both types of similarities. Novel individual and batch pruning techniques based on upper bounding the similarity values are also proposed. To accommodate extremely large probabilistic sets, we also design sampling-based approximate query processing methods with strong probabilistic guarantees. We have conducted extensive experiments using both synthetic and real datasets, and demonstrated the effectiveness and efficiency of our proposed methods.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
概率集的相似性查询处理
评估集合之间的相似性是计算机科学中的一项基本任务。然而,在许多应用中,由于各种原因,集合中的元素可能是不确定的。现有的对这类概率集进行建模和计算其相似度的工作,由于模型规模大或相似度评估成本高,因此只适用于小概率集。在本文中,我们提出了一个简单而富有表现力的模型,该模型支持许多应用,其中一个概率集可能有数千个元素。我们用可能世界语义定义了两个概率集之间的两种相似性;它们在获取可能世界叉积的相似性分布方面是互补的。我们设计了高效的基于动态规划的算法来计算这两种类型的相似度。提出了基于相似性值上界的单个和批量剪枝技术。为了适应非常大的概率集,我们还设计了具有强概率保证的基于抽样的近似查询处理方法。我们已经使用合成和真实数据集进行了大量的实验,并证明了我们提出的方法的有效性和效率。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Big data integration T-share: A large-scale dynamic taxi ridesharing service Coupled clustering ensemble: Incorporating coupling relationships both between base clusterings and objects The adaptive radix tree: ARTful indexing for main-memory databases Learning to rank from distant supervision: Exploiting noisy redundancy for relational entity search
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1