Attribute-Based Subsequence Matching and Mining

Yu Peng, R. C. Wong, Liangliang Ye, Philip S. Yu
{"title":"Attribute-Based Subsequence Matching and Mining","authors":"Yu Peng, R. C. Wong, Liangliang Ye, Philip S. Yu","doi":"10.1109/ICDE.2012.81","DOIUrl":null,"url":null,"abstract":"Sequence analysis is very important in our daily life. Typically, each sequence is associated with an ordered list of elements. For example, in a movie rental application, a customer's movie rental record containing an ordered list of movies is a sequence example. Most studies about sequence analysis focus on subsequence matching which finds all sequences stored in the database such that a given query sequence is a subsequence of each of these sequences. In many applications, elements are associated with properties or attributes. For example, each movie is associated with some attributes like \"Director\" and \"Actors\". Unfortunately, to the best of our knowledge, all existing studies about sequence analysis do not consider the attributes of elements. In this paper, we propose two problems. The first problem is: given a query sequence and a set of sequences, considering the attributes of elements, we want to find all sequences which are matched by this query sequence. This problem is called attribute-based subsequence matching (ASM). All existing applications for the traditional subsequence matching problem can also be applied to our new problem provided that we are given the attributes of elements. We propose an efficient algorithm for problem ASM. The key idea to the efficiency of this algorithm is to compress each whole sequence with potentially many associated attributes into just a triplet of numbers. By dealing with these very compressed representations, we greatly speed up the attribute-based subsequence matching. The second problem is to find all frequent attribute-based subsequence. We also adapt an existing efficient algorithm for this second problem to show we can use the algorithm developed for the first problem. Empirical studies show that our algorithms are scalable in large datasets. In particular, our algorithms run at least an order of magnitude faster than a straightforward method in most cases. This work can stimulate a number of existing data mining problems which are fundamentally based on subsequence matching such as sequence classification, frequent sequence mining, motif detection and sequence matching in bioinformatics.","PeriodicalId":321608,"journal":{"name":"2012 IEEE 28th International Conference on Data Engineering","volume":"84 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 28th International Conference on Data Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2012.81","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Sequence analysis is very important in our daily life. Typically, each sequence is associated with an ordered list of elements. For example, in a movie rental application, a customer's movie rental record containing an ordered list of movies is a sequence example. Most studies about sequence analysis focus on subsequence matching which finds all sequences stored in the database such that a given query sequence is a subsequence of each of these sequences. In many applications, elements are associated with properties or attributes. For example, each movie is associated with some attributes like "Director" and "Actors". Unfortunately, to the best of our knowledge, all existing studies about sequence analysis do not consider the attributes of elements. In this paper, we propose two problems. The first problem is: given a query sequence and a set of sequences, considering the attributes of elements, we want to find all sequences which are matched by this query sequence. This problem is called attribute-based subsequence matching (ASM). All existing applications for the traditional subsequence matching problem can also be applied to our new problem provided that we are given the attributes of elements. We propose an efficient algorithm for problem ASM. The key idea to the efficiency of this algorithm is to compress each whole sequence with potentially many associated attributes into just a triplet of numbers. By dealing with these very compressed representations, we greatly speed up the attribute-based subsequence matching. The second problem is to find all frequent attribute-based subsequence. We also adapt an existing efficient algorithm for this second problem to show we can use the algorithm developed for the first problem. Empirical studies show that our algorithms are scalable in large datasets. In particular, our algorithms run at least an order of magnitude faster than a straightforward method in most cases. This work can stimulate a number of existing data mining problems which are fundamentally based on subsequence matching such as sequence classification, frequent sequence mining, motif detection and sequence matching in bioinformatics.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于属性的子序列匹配与挖掘
序列分析在我们的日常生活中非常重要。通常,每个序列都与一个有序的元素列表相关联。例如,在电影租赁应用程序中,客户的电影租赁记录包含一个有序的电影列表,这是一个序列示例。序列分析的研究大多集中在子序列匹配上,即找到数据库中存储的所有序列,使给定的查询序列是这些序列中的每一个序列的子序列。在许多应用程序中,元素与属性或属性相关联。例如,每部电影都与一些属性相关联,如“导演”和“演员”。不幸的是,据我们所知,所有现有的序列分析研究都没有考虑元素的属性。在本文中,我们提出两个问题。第一个问题是:给定一个查询序列和一组序列,考虑到元素的属性,我们希望找到与该查询序列匹配的所有序列。这个问题被称为基于属性的子序列匹配(ASM)。所有传统子序列匹配问题的现有应用都可以应用于我们的新问题,只要我们给定元素的属性。提出了一种求解ASM问题的有效算法。该算法效率的关键思想是将每个具有潜在许多相关属性的整个序列压缩成一个数字三元组。通过处理这些非常压缩的表示,我们大大加快了基于属性的子序列匹配。第二个问题是找到所有频繁的基于属性的子序列。我们还对第二个问题采用了一个现有的高效算法,以表明我们可以使用为第一个问题开发的算法。实证研究表明,我们的算法在大型数据集中是可扩展的。特别是,在大多数情况下,我们的算法运行速度至少比直接方法快一个数量级。这项工作可以激发生物信息学中基于子序列匹配的序列分类、频繁序列挖掘、基序检测和序列匹配等现有数据挖掘问题。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Keyword Query Reformulation on Structured Data Accuracy-Aware Uncertain Stream Databases Extracting Analyzing and Visualizing Triangle K-Core Motifs within Networks Project Daytona: Data Analytics as a Cloud Service Automatic Extraction of Structured Web Data with Domain Knowledge
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1