Attribute-Based Subsequence Matching and Mining

2012 IEEE 28th International Conference on Data Engineering. Pub Date: 2012-04-01. DOI: 10.1109/ICDE.2012.81

Yu Peng, R. C. Wong, Liangliang Ye, Philip S. Yu


Citations: 6

Abstract

Sequence analysis is very important in our daily life. Typically, each sequence is an ordered list of elements. For example, in a movie rental application, a customer's rental record, which contains an ordered list of movies, is a sequence. Most studies of sequence analysis focus on subsequence matching, which finds all sequences stored in the database such that a given query sequence is a subsequence of each of them. In many applications, elements are associated with properties or attributes. For example, each movie is associated with attributes such as "Director" and "Actors". Unfortunately, to the best of our knowledge, no existing study of sequence analysis considers the attributes of elements. In this paper, we propose two problems. The first problem is: given a query sequence and a set of sequences, and taking the attributes of elements into account, find all sequences that are matched by the query sequence. This problem is called attribute-based subsequence matching (ASM). Every existing application of the traditional subsequence matching problem also applies to the new problem, provided that the attributes of elements are given. We propose an efficient algorithm for ASM. The key to its efficiency is to compress each whole sequence, with potentially many associated attributes, into just a triplet of numbers. Working with these highly compressed representations greatly speeds up attribute-based subsequence matching. The second problem is to find all frequent attribute-based subsequences. We adapt an existing efficient algorithm to this second problem to show that the algorithm developed for the first problem can be reused. Empirical studies show that our algorithms scale to large datasets. In particular, in most cases they run at least an order of magnitude faster than a straightforward method. This work can stimulate further study of a number of existing data mining problems that are fundamentally based on subsequence matching, such as sequence classification, frequent sequence mining, motif detection, and sequence matching in bioinformatics.
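To make the two notions of matching concrete, the following is a minimal Python sketch. The first function implements traditional subsequence matching as the abstract describes it. The second is only one plausible reading of attribute-based matching, since the abstract does not give the formal definition: it assumes each query element is a set of attribute constraints (for example, "Director" equals some value) and that a data element matches when its attributes satisfy all of them. All names here (`is_subsequence`, `attribute_match`, `asm_query`, the placeholder movie ids `m1`..`m3`, and the attribute values) are hypothetical. This is the straightforward scan, not the triplet-compression algorithm proposed in the paper.

```python
from typing import Dict, List, Sequence, Set

# Assumed layout: element id -> attribute name -> set of attribute values.
AttributeTable = Dict[str, Dict[str, Set[str]]]
# Assumed query element: attribute name -> required value.
Constraint = Dict[str, str]


def is_subsequence(query: Sequence[str], seq: Sequence[str]) -> bool:
    """Traditional matching: query elements occur in seq in order, gaps allowed."""
    i = 0
    for element in seq:
        if i < len(query) and element == query[i]:
            i += 1
    return i == len(query)


def element_matches(constraint: Constraint, element: str, attrs: AttributeTable) -> bool:
    """True if the element carries every (attribute, value) pair in the constraint."""
    props = attrs.get(element, {})
    return all(value in props.get(name, set()) for name, value in constraint.items())


def attribute_match(query: Sequence[Constraint], seq: Sequence[str], attrs: AttributeTable) -> bool:
    """Assumed attribute-based matching: greedy left-to-right scan over seq."""
    i = 0
    for element in seq:
        if i < len(query) and element_matches(query[i], element, attrs):
            i += 1
    return i == len(query)


def asm_query(query: Sequence[Constraint], database: Dict[str, List[str]], attrs: AttributeTable) -> List[str]:
    """Return ids of all database sequences matched by the query (linear scan)."""
    return [sid for sid, seq in database.items() if attribute_match(query, seq, attrs)]


# Placeholder data: three movies and three customers' rental records.
attrs: AttributeTable = {
    "m1": {"Director": {"d1"}, "Actors": {"a1", "a2"}},
    "m2": {"Director": {"d2"}, "Actors": {"a2"}},
    "m3": {"Director": {"d1"}, "Actors": {"a3"}},
}
database = {
    "c1": ["m1", "m2", "m3"],
    "c2": ["m3", "m2"],
    "c3": ["m2", "m1"],
}

# "A movie directed by d1, followed later by a movie featuring actor a2."
query = [{"Director": "d1"}, {"Actors": "a2"}]
matched = asm_query(query, database, attrs)
```

The scan visits every element of every sequence for each query, which is the kind of cost that the compressed representation described in the abstract is meant to avoid.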


