Acyclic Identification of Aptamer from Over-Represented Libraries Using Hash Functions

2013 39th Annual Northeast Bioengineering Conference. Pub Date: 2013-04-05. DOI: 10.1109/NEBEC.2013.2

Yiou Xiao, K. Mehrotra, C. Mohan, P. Borer, D. Allis



Acyclic Identification of Aptamer from Over-Represented Libraries Using Hash Functions

In recent years, with the advent of fast sequencing technology, genomic databases have grown rapidly. Researchers in the bioinformatics field need faster and more accurate tools to analyze these very large data sets effectively. In aptamer search, the goal is to find DNA sequences that are over-represented relative to random background libraries on the same chip. Hash functions are widely used in substring comparison, sequence alignment, and clustering tools. We have developed a lightweight tool that uses hash functions to reduce the size of the genomic data and conducts k-neighbor searches on the centroid sequence. This greatly improves search efficiency compared with the existing tool. Furthermore, calculating k-neighbor hash values decreases the overhead of searching for mutants. In a dataset of 1 million sequences, the program accurately counted the frequency of the human alpha-thrombin sequence and found the mutant versions of the target sequence in less than 40 seconds, whereas the existing method takes 8280 seconds (about 2 hours 18 minutes).
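
The abstract does not specify the hash function or define "k-neighbor" precisely, so the following Python sketch is only an illustration under stated assumptions: each read is hashed by packing it into an integer at 2 bits per base, and the k-neighbors of a target are taken to be all sequences within Hamming distance k (substitutions only). The names encode, neighbors, and count_target_and_mutants are hypothetical and are not taken from the authors' tool.

```python
from collections import Counter
from itertools import combinations, product

ALPHABET = "ACGT"
CODE = {base: i for i, base in enumerate(ALPHABET)}


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base (assumed hash)."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def neighbors(seq, k):
    """Yield seq and every sequence within Hamming distance k of it."""
    yield seq
    for d in range(1, k + 1):
        for positions in combinations(range(len(seq)), d):
            # At each chosen position, try the three bases that differ.
            choices = [[b for b in ALPHABET if b != seq[p]] for p in positions]
            for replacement in product(*choices):
                mutant = list(seq)
                for p, b in zip(positions, replacement):
                    mutant[p] = b
                yield "".join(mutant)


def count_target_and_mutants(reads, target, k):
    """Count exact and near matches of target among reads of the same length."""
    # One pass over the reads builds a hash table of encoded sequences.
    # Reads of another length or with non-ACGT symbols are skipped, which
    # keeps the 2-bit encoding collision-free.
    table = Counter(
        encode(r)
        for r in reads
        if len(r) == len(target) and set(r) <= set(ALPHABET)
    )
    hits = {}
    # Look up each neighbor's hash instead of comparing against every read.
    for candidate in neighbors(target, k):
        n = table.get(encode(candidate), 0)
        if n:
            hits[candidate] = n
    return hits


if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "ACGA", "TTTT", "ACG"]
    print(count_target_and_mutants(reads, "ACGT", 1))
```

In this sketch the cost of the mutant search depends on the number of neighbors of the target (the sum over d from 1 to k of C(n, d) * 3^d for a target of length n) rather than on the number of reads, which is one plausible reading of how precomputed k-neighbor hash values could reduce the search overhead.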
