Tandem repeats analysis in DNA sequences based on improved Burrows-Wheeler transform
Authors: P. Ochieng, Taufik Djatna, W. Kusuma
Published in: 2015 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2015-10-01
DOI: 10.1109/ICACSIS.2015.7415159 (https://doi.org/10.1109/ICACSIS.2015.7415159)
Citations: 1
Abstract
The enormous number of short reads generated by new DNA sequencing technologies calls for fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including Mapping and Assembly with Quality (MAQ), which is accurate, feature-rich, and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for aligning longer reads, where indels may occur frequently. The speed of MAQ is also a concern when alignment is scaled up to the resequencing of hundreds of individuals. We therefore carried out an in-depth performance analysis of BWA, a popular BWT-based aligner, and found that it performs significantly better than MAQ, although it has drawbacks in execution speed, time complexity, and accuracy. Based on those factors, we implemented an improved Burrows-Wheeler Alignment algorithm (BWA), a new read alignment package in which the original BWT is optimized with the Ziv-Lempel (LZ-77) sliding-window technique and prefix-trie string matching, to efficiently search for exact and inexact matches of tandem repeats against a large reference genome sequence. Our analysis shows that the improved BWA searches approximately 1.40× faster than MAQ-32 while achieving higher accuracy, with percent confidence of 96.7% and 93.0%. Moreover, it searches for exact and inexact matches more efficiently, with a percent error of 0.05% for single-end reads and 0.04% for paired-end reads, and it searches for left- and right-overlapping tandem repeats more effectively, at a percent confidence of 88.9%.
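To make the exact-match search mentioned in the abstract concrete, the following is a minimal sketch of textbook BWT backward search over a reference string. It is not the authors' improved BWA: it omits the LZ-77 sliding-window step and inexact matching, it builds the suffix array naively, and the names `bwt_index` and `find_exact` are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def bwt_index(reference):
    """Build a small BWT-based index (illustrative helper, not from the paper)."""
    text = reference + "$"  # '$' sentinel sorts before A, C, G, T
    # Naive suffix array by sorting suffixes; fine for short strings only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    # first[c]: number of characters in text that are strictly smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][k]: number of occurrences of c in bwt[:k].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for k, ch in enumerate(bwt):
        for c in occ:
            occ[c][k + 1] = occ[c][k] + (1 if c == ch else 0)
    return sa, first, occ


def find_exact(pattern, index):
    """Return sorted start positions of exact matches of pattern."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    # Backward search: extend the match one character at a time, right to left.
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


# A reference containing the tandem repeat ACG three times in a row.
index = bwt_index("ACGACGACGT")
print(find_exact("ACG", index))  # [0, 3, 6]
```

Each loop iteration narrows the suffix-array range using only the two lookup tables, so the search cost depends on the pattern length rather than on the reference length. Consecutive hits spaced exactly one pattern length apart, as in the example, are what a tandem repeat looks like in the output.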