[Regular Paper] Detection of Errors in Multi-genome Alignments Using Machine Learning Approaches
Jaspal Singh, R. Ramakrishnan, M. Blanchette
2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE), published 2018-10-01
DOI: https://doi.org/10.1109/BIBE.2018.00017
Abstract
Whole-genome multiple alignments are widely used in genomics and evolution, yet their accuracy is imperfect, due in part to the computational complexity of the task. Identifying the portions of these alignments that are likely to be incorrect would allow researchers either to improve them or to flag them for exclusion from downstream analyses. We introduce MSA-ED, a machine learning tool for detecting errors in whole-genome multiple alignments. MSA-ED uses random forests or artificial neural networks to identify and classify several types of alignment errors. It is trained on labeled data obtained by using an evolution simulator to generate synthetic orthologous sequences together with their correct alignment, and then comparing that correct alignment to the one produced by Multiz, a popular whole-genome aligner. Key to the success of MSA-ED is the engineering of several types of evolutionarily inspired features that boost prediction accuracy. MSA-ED is shown to detect certain types of errors with good accuracy. It is then applied to actual genomic alignments to identify putative alignment errors.
Availability: https://github.com/jaspal1329/MSA-ED
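To make the labeling step concrete, the following is a minimal sketch of how training labels could be derived by comparing a simulator's correct alignment with an aligner's output. It is not MSA-ED's actual code: the function names `column_homologies` and `label_columns` are hypothetical, the alignment is reduced to two sequences for brevity, and the labels are binary (error or not), whereas the tool itself distinguishes several error types.

```python
def column_homologies(row_a, row_b):
    """Describe each alignment column as a pair of residue indices.

    A gap ('-') is represented by None, so (3, None) means that residue 3
    of the first sequence is aligned to a gap in the second.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    columns = []
    i = j = 0  # running residue indices in the ungapped sequences
    for a, b in zip(row_a, row_b):
        index_a = index_b = None
        if a != "-":
            index_a = i
            i += 1
        if b != "-":
            index_b = j
            j += 1
        columns.append((index_a, index_b))
    return columns


def label_columns(true_rows, predicted_rows):
    """Return one boolean per predicted column: True marks an error.

    A predicted column counts as an error (an assumption of this sketch)
    when the same residue pairing does not occur in the true alignment.
    """
    true_columns = set(column_homologies(*true_rows))
    return [col not in true_columns
            for col in column_homologies(*predicted_rows)]


# Toy example: sequences ACGT and AGT, where the true alignment deletes C
# but the predicted alignment places the gap one column too late.
labels = label_columns(("ACGT", "A-GT"), ("ACGT", "AG-T"))
print(labels)
```

In this toy case the two middle columns of the predicted alignment are flagged, because neither the C/G pairing nor the G/gap pairing appears in the true alignment. Labels of this kind, paired with per-column features, are what a random forest or neural network classifier would then be trained on.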