Hu Lanyue, Chen Jianhua, Wang Rongshu, Luo Zhiwen, Hou Bin
A Long read hybrid error correction algorithm based on segmented pHMM
2020 5th International Conference on Mechanical, Control and Computer Engineering (ICMCCE), pp. 1501-1504. Published 2020-12-01.
DOI: 10.1109/ICMCCE51767.2020.00329 (https://doi.org/10.1109/ICMCCE51767.2020.00329)
Citations: 1
Abstract
Second-generation DNA sequencing technology offers high throughput and high accuracy, but its short read length prevents reads from spanning repeat regions, which makes analysis difficult when the data volume is large. Third-generation DNA sequencing technology produces longer reads and can make up for some of the weaknesses of second-generation sequencing, but its error rate also increases with read length. Researchers therefore usually combine the two methods, using short reads to correct long reads, so as to improve sequence accuracy while preserving as much of the read length as possible. This paper proposes an error correction algorithm based on a piecewise (segmented) profile Hidden Markov Model (pHMM): it leaves the matching parts of the long read untouched and uses the HMM to correct only the unmatched or noisier parts. Experimental results on E. coli and yeast data sets show that the running time is reduced by about a factor of 3 compared with the Hercules algorithm.
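The segmentation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how matched and unmatched regions are determined, so the sketch assumes a per-base `matched` flag is already available (for example, from aligning short reads to the long read). The names `split_segments`, `correct_read`, and `corrector` are hypothetical, and `corrector` merely stands in for the pHMM step.

```python
from itertools import groupby
from typing import Callable, List, Tuple

# (start, end_exclusive, is_matched); a hypothetical representation of a segment
Segment = Tuple[int, int, bool]


def split_segments(matched: List[bool]) -> List[Segment]:
    """Group consecutive positions that share the same matched flag."""
    segments: List[Segment] = []
    pos = 0
    for flag, run in groupby(matched):
        length = sum(1 for _ in run)
        segments.append((pos, pos + length, flag))
        pos += length
    return segments


def correct_read(
    read: str,
    matched: List[bool],
    corrector: Callable[[str], str],
) -> str:
    """Keep matched segments unchanged; pass unmatched ones to `corrector`.

    `corrector` is a placeholder for the expensive pHMM-based correction.
    """
    if len(read) != len(matched):
        raise ValueError("read and matched flags must have the same length")
    parts = []
    for start, end, flag in split_segments(matched):
        piece = read[start:end]
        parts.append(piece if flag else corrector(piece))
    return "".join(parts)


# Toy usage: the "corrector" just lowercases, to make the routed segment visible.
read = "ACGTTTGA"
flags = [True] * 3 + [False] * 3 + [True] * 2
result = correct_read(read, flags, corrector=str.lower)
```

In this toy run only the middle three bases reach the corrector. The sketch shows the general principle only: the costly model is applied to part of the read rather than to all of it, which is presumably where the reported running-time reduction comes from.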