A generalized sequence pattern matching algorithm using complementary dual-seeding

Bing Ni, Leung-Yau Lo, K. Leung
2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), published 2010-12-01
DOI: 10.1109/BIBM.2010.5706593 (https://doi.org/10.1109/BIBM.2010.5706593)
Citations: 1

Abstract

In this work, we define generalized (sequence) patterns, which are motivated by several real biological problems, including transcription factors (TFs) binding to transcription factor binding sites (TFBSs), cis-regulatory modules, protein domain analysis, and alternative splicing. Simply put, a generalized pattern is composed of several substrings with a gap between each pair of consecutive substrings. We propose a generalized pattern matching algorithm that uses a complementary dual-seeding strategy, which is sensitive to errors (both mismatches and indels). We also develop a generalized pattern matching tool, which is, to our knowledge, the first developed specifically for generalized pattern matching. Rather than replacing existing general-purpose matching tools such as BLAST, BLAT, and PatternHunter, our tool provides an alternative and helps users solve real problems, especially those that can be modeled as generalized patterns. In experiments, we use data randomly sampled from reference sequences of the human genome (NCBI build v18) and hit 98.74% of the generalized patterns on average. The tool runs on both Linux and Windows platforms, and its peak memory usage is only slightly above 1 GB.
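To make the notion of a generalized pattern concrete, the following is a minimal sketch, not the paper's algorithm. It assumes each gap is given as a hypothetical (min, max) length range and uses a naive scan that tolerates mismatches only; it does not implement dual-seeding or indel handling. The names `GeneralizedPattern` and `find_matches` are illustrative and do not come from the paper or its tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GeneralizedPattern:
    # Hypothetical representation: k substrings and k-1 gap ranges.
    substrings: List[str]
    gaps: List[Tuple[int, int]]  # (min_gap, max_gap) between consecutive substrings


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_matches(text: str, pattern: GeneralizedPattern,
                 max_mismatches: int = 0) -> List[Tuple[int, ...]]:
    """Naive scan: return one tuple of start positions per occurrence.

    The mismatch budget is shared across all substrings of the pattern.
    Indels are NOT handled in this sketch.
    """
    results: List[Tuple[int, ...]] = []
    k = len(pattern.substrings)

    def extend(idx: int, lo: int, hi: int, budget: int, starts: List[int]) -> None:
        if idx == k:
            results.append(tuple(starts))
            return
        sub = pattern.substrings[idx]
        last_start = min(hi, len(text) - len(sub))
        for pos in range(lo, last_start + 1):
            d = hamming(text[pos:pos + len(sub)], sub)
            if d > budget:
                continue
            end = pos + len(sub)
            if idx + 1 < k:
                gap_min, gap_max = pattern.gaps[idx]
                extend(idx + 1, end + gap_min, end + gap_max, budget - d, starts + [pos])
            else:
                extend(idx + 1, 0, 0, budget - d, starts + [pos])

    extend(0, 0, len(text), max_mismatches, [])
    return results


pattern = GeneralizedPattern(substrings=["ACG", "GAC"], gaps=[(2, 4)])
exact_hits = find_matches("ACGTTTGACCAAGGT", pattern)
fuzzy_hits = find_matches("ACGTTTGTCC", pattern, max_mismatches=1)
```

Here the first call looks for "ACG" followed, after a gap of 2 to 4 characters, by "GAC"; the second call allows one mismatch in total. A seeding-based method such as the one the abstract describes would avoid scanning every position, which this sketch does not attempt.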