Michael A. Zeller, Zebulun W. Arendsee, Gavin J.D. Smith, Tavis K. Anderson
{"title":"classLog: Logistic regression for the classification of genetic sequences","authors":"Michael A. Zeller, Zebulun W. Arendsee, Gavin J.D. Smith, Tavis K. Anderson","doi":"10.3389/fviro.2023.1215012","DOIUrl":null,"url":null,"abstract":"<sec><title>Introduction</title><p>Sequencing and phylogenetic classification have become a common task in human and animal diagnostic laboratories. It is routine to sequence pathogens to identify genetic variations of diagnostic significance and to use these data in realtime genomic contact tracing and surveillance. Under this paradigm, unprecedented volumes of data are generated that require rapid analysis to provide meaningful inference. </p></sec><sec><title>Methods</title><p>We present a machine learning logistic regression pipeline that can assign classifications to genetic sequence data. The pipeline implements an intuitive and customizable approach to developing a trained prediction model that runs in linear time complexity, generating accurate output rapidly, even with incomplete data. Our approach was benchmarked against porcine respiratory and reproductive syndrome virus (PRRSv) and swine H1 influenza A virus (IAV) datasets. Trained classifiers were tested against sequences and simulated datasets that artificially degraded sequence quality at 0, 10, 20, 30, and 40%. </p></sec><sec><title>Results</title><p>When applied to a poor-quality sequence data, the classifier achieved between >85% to 95% accuracy for the PRRSv and the swine H1 IAV HA dataset and this increased to near perfect accuracy when using the full dataset. The model also identifies amino acid positions used to determine genetic clade identity through a feature selection ranking within the model. These positions can be mapped onto a maximum-likelihood phylogenetic tree, allowing for the inference of clade defining mutations. </p></sec><sec><title>Discussion</title><p>Our approach is implemented as a python package with code available at <uri xlink:href=\"https://github.com/flu-crew/classLog\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">https://github.com/flu-crew/classLog</uri>.</p></sec>","PeriodicalId":73114,"journal":{"name":"Frontiers in virology","volume":"28 1","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2023-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in virology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fviro.2023.1215012","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"VIROLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Introduction
Sequencing and phylogenetic classification have become a common task in human and animal diagnostic laboratories. It is routine to sequence pathogens to identify genetic variations of diagnostic significance and to use these data in realtime genomic contact tracing and surveillance. Under this paradigm, unprecedented volumes of data are generated that require rapid analysis to provide meaningful inference.
Methods
We present a machine learning logistic regression pipeline that can assign classifications to genetic sequence data. The pipeline implements an intuitive and customizable approach to developing a trained prediction model that runs in linear time complexity, generating accurate output rapidly, even with incomplete data. Our approach was benchmarked against porcine respiratory and reproductive syndrome virus (PRRSv) and swine H1 influenza A virus (IAV) datasets. Trained classifiers were tested against sequences and simulated datasets that artificially degraded sequence quality at 0, 10, 20, 30, and 40%.
Results
When applied to a poor-quality sequence data, the classifier achieved between >85% to 95% accuracy for the PRRSv and the swine H1 IAV HA dataset and this increased to near perfect accuracy when using the full dataset. The model also identifies amino acid positions used to determine genetic clade identity through a feature selection ranking within the model. These positions can be mapped onto a maximum-likelihood phylogenetic tree, allowing for the inference of clade defining mutations.
Discussion
Our approach is implemented as a python package with code available at https://github.com/flu-crew/classLog.