Detecting retroviruses using reading frame information and side effect machines
W. Ashlock, S. Datta
2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology
Published: 2010-05-02
DOI: 10.1109/CIBCB.2010.5510699 (https://doi.org/10.1109/CIBCB.2010.5510699)
Citations: 10
Abstract
This paper addresses the problem of distinguishing retroviruses from non-coding DNA sequences. Retroviruses have a distinctive reading frame structure consisting of multiple reading frames that often overlap. Reading frame information generated by Fourier spectral analysis is used as input to Side Effect Machines (SEMs), which are evolved to create clusterings that separate the two types of sequences. The output of these SEMs is then used to train Support Vector Machines (SVMs) to perform the classification. On complete retroviral genomes, the best classifier out of 100 replicates achieves 100% accuracy and the average classifier achieves 85%. On endogenous retroviral data, which includes many mutations, the best classifier achieves 86% accuracy and the average 71%. The method was also able to distinguish lentiviruses from other types of retroviruses, with a best accuracy of 100% (average 93%). To better understand the evolved SEMs, classifiers trained on SEMs evolved using endogenous retroviral data were used to classify the complete, unmutated retroviral genomes, and vice versa. Regardless of which type of data was used to create the classifiers, their performance on the test data sets was similar. This suggests that SEMs are able to extract the distinctive retroviral reading frame structure from the Fourier spectra, but that some of the endogenous retroviruses in our data set contained too many mutations for this structure to be discernible using this method.
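
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm. The Fourier step is taken to be the commonly used period-3 power of the four base indicator sequences. The windowing and the "H"/"L" thresholding in `window_symbols` are hypothetical, added only to give the machine a discrete input. `TOY_SEM` is hand-written, whereas the paper evolves its machines. The SEM itself follows the usual description of a finite state machine whose state-visit counts form the feature vector; whether the start state is counted is a choice made here.

```python
import cmath
from typing import Dict, List, Sequence


def period3_power(seq: str) -> float:
    """Power at frequency 1/3, summed over the A, C, G, T indicator sequences.

    Dividing by the window length is an illustrative normalisation choice.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the 0/1 indicator sequence for this base,
        # evaluated at one cycle per three nucleotides.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * pos / 3)
            for pos, ch in enumerate(seq)
            if ch == base
        )
        total += abs(coeff) ** 2
    return total / n


def window_symbols(seq: str, window: int, threshold: float) -> str:
    """Hypothetical discretisation: 'H' if a window's period-3 power
    exceeds the threshold, otherwise 'L'. Windows do not overlap."""
    return "".join(
        "H" if period3_power(seq[i:i + window]) > threshold else "L"
        for i in range(0, len(seq) - window + 1, window)
    )


def run_sem(
    transitions: Sequence[Dict[str, int]], symbols: str, start: int = 0
) -> List[int]:
    """Drive a finite state machine with the symbols and count state visits.

    The start state is not counted until a transition enters it
    (an assumption of this sketch).
    """
    counts = [0] * len(transitions)
    state = start
    for symbol in symbols:
        state = transitions[state][symbol]
        counts[state] += 1
    return counts


# Toy two-state machine, written by hand for illustration only.
# transitions[state][symbol] gives the next state.
TOY_SEM = [{"H": 1, "L": 0}, {"H": 1, "L": 0}]

if __name__ == "__main__":
    # A strongly periodic stretch followed by a homopolymer stretch.
    dna = "ATG" * 6 + "A" * 18
    symbols = window_symbols(dna, window=9, threshold=1.0)
    print(symbols, run_sem(TOY_SEM, symbols))
```

In a pipeline of the kind the abstract describes, count vectors like the one returned by `run_sem` would serve as the features passed to an SVM. That step is omitted here because it needs a third-party library.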