MACFP: Maximal Approximate Consecutive Frequent Pattern Mining under Edit Distance.

European Journal of Soil Science (IF 3.8; Zone 2, Agricultural and Forestry Sciences; JCR Q2, Soil Science). Pub Date: 2016-05-01. DOI: 10.1137/1.9781611974348.63

Jingbo Shang, Jian Peng, Jiawei Han

Pages: 558-566. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5292242/pdf/


Abstract

Consecutive pattern mining, which aims to find sequential patterns that are substrings, is a special case of frequent pattern mining and has played a crucial role in many real-world applications, especially biological sequence analysis, time series analysis, and network log mining. Approximations between strings, including insertions, deletions, and substitutions, are widely used in biological sequence comparisons. However, most existing string pattern mining methods consider only Hamming distance, which does not allow insertions or deletions (indels). Little attention has been paid to general approximate consecutive frequent pattern mining under edit distance, potentially because of its high computational complexity, particularly on DNA sequences with billions of base pairs. In this paper, we introduce an efficient solution to this problem. We first formulate the Maximal Approximate Consecutive Frequent Pattern Mining (MACFP) problem, which identifies substring patterns under edit distance in a long query sequence. Then, we propose a novel algorithm with linear time complexity to check whether the support of a substring pattern is above a predefined threshold in the query sequence, thus greatly reducing the computational complexity of MACFP. With this fast decision algorithm, we can efficiently solve the original pattern discovery problem using several indexing and searching techniques. Comprehensive experiments on sequence pattern analysis and a study of a cancer genomics application demonstrate the effectiveness and efficiency of our algorithm compared to several existing methods.
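To make the support-checking question concrete, here is a minimal sketch of a baseline. It is not the linear-time algorithm proposed in the paper, whose details the abstract does not give. It uses the classic Sellers dynamic program, which takes time proportional to the pattern length times the sequence length. The function names are hypothetical. So is the definition of support used here: the number of end positions in the query sequence at which some substring lies within edit distance `k` of the pattern. The paper may define support differently, and counting end positions can count one approximate occurrence several times. A Hamming-distance counterpart is included to show why indels matter.

```python
def approx_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """0-based end positions j where some substring of `text` ending at j
    is within edit distance k of `pattern` (Sellers' dynamic program)."""
    m = len(pattern)
    # prev[i]: minimum edit distance between pattern[:i] and any substring
    # of text ending just before the current character.
    prev = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        cur = [0] * (m + 1)  # cur[0] = 0: an occurrence may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from the text
            )
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


def hamming_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """Same idea with Hamming distance: equal-length windows, no indels."""
    m = len(pattern)
    return [
        s + m - 1
        for s in range(len(text) - m + 1)
        if sum(a != b for a, b in zip(pattern, text[s:s + m])) <= k
    ]


def support_at_least(pattern: str, text: str, k: int, min_support: int) -> bool:
    """Decision version (assumed definition): enough approximate end positions?"""
    return len(approx_end_positions(pattern, text, k)) >= min_support
```

For example, the query `"GGACTGG"` contains `"ACT"`, which is the pattern `"ACGT"` with one base deleted. `approx_end_positions("ACGT", "GGACTGG", 1)` reports a hit ending at index 4, while `hamming_end_positions` with the same arguments finds nothing, because no length-4 window differs from the pattern in fewer than two positions.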

