classLog: Logistic regression for the classification of genetic sequences

Frontiers in Virology (IF 2.0, JCR Q4, Virology) | Pub Date: 2023-11-06 | DOI: 10.3389/fviro.2023.1215012
Michael A. Zeller, Zebulun W. Arendsee, Gavin J.D. Smith, Tavis K. Anderson

Abstract

classLog: Logistic regression for the classification of genetic sequences
Introduction

Sequencing and phylogenetic classification have become common tasks in human and animal diagnostic laboratories. It is now routine to sequence pathogens to identify genetic variation of diagnostic significance and to use these data in real-time genomic contact tracing and surveillance. Under this paradigm, unprecedented volumes of data are generated, and they require rapid analysis to provide meaningful inference.

Methods

We present a machine learning pipeline based on logistic regression that assigns classifications to genetic sequence data. The pipeline offers an intuitive and customizable way to train a prediction model that runs in linear time and generates accurate output rapidly, even with incomplete data. We benchmarked the approach on porcine reproductive and respiratory syndrome virus (PRRSv) and swine H1 influenza A virus (IAV) datasets. Trained classifiers were tested against sequences and against simulated datasets in which sequence quality was artificially degraded by 0, 10, 20, 30, and 40%.
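
The idea described above can be illustrated with a small, self-contained sketch. It is not the classLog implementation: the toy nucleotide alphabet, the two-clade dataset, the use of 'N' for masked positions, and every function name below are assumptions made for illustration, and a plain gradient-descent binary logistic regression stands in for whatever solver the package actually uses. Each aligned position is one-hot encoded, unknown characters encode as all zeros, and the trained model is then scored on copies of the sequences with 0 to 40% of positions masked.

```python
import math
import random

ALPHABET = "ACGT"  # toy nucleotide alphabet (an assumption for this sketch)


def one_hot(seq):
    """Encode an aligned sequence as a flat 0/1 vector, position by position.

    Characters outside ALPHABET (such as 'N') become all-zero blocks, so a
    missing position contributes nothing to the prediction.
    """
    vec = []
    for ch in seq:
        vec.extend(1.0 if ch == a else 0.0 for a in ALPHABET)
    return vec


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well behaved
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Predicted probability of class 1 for an encoded sequence x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train(seqs, labels, epochs=200, lr=0.5):
    """Fit binary logistic regression with batch gradient descent."""
    X = [one_hot(s) for s in seqs]
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        # Errors are computed once per epoch, before any weight is updated.
        errs = [score(w, b, x) - y for x, y in zip(X, labels)]
        for j in range(len(w)):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errs, X)) / n
        b -= lr * sum(errs) / n
    return w, b


def predict(w, b, seq):
    return 1 if score(w, b, one_hot(seq)) >= 0.5 else 0


def degrade(seq, fraction, rng):
    """Replace the given fraction of positions with 'N' to mimic poor quality."""
    k = round(len(seq) * fraction)
    masked = set(rng.sample(range(len(seq)), k))
    return "".join("N" if i in masked else ch for i, ch in enumerate(seq))


# Hypothetical two-clade toy data; real inputs would be aligned viral sequences.
train_seqs = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACA",
              "GGGGGGGGGG", "GGGGGGGGGT", "GGGGGGGGTG"]
train_labels = [0, 0, 0, 1, 1, 1]
w, b = train(train_seqs, train_labels)

# Toy evaluation that reuses the training sequences; a real benchmark would
# score held-out sequences instead.
rng = random.Random(0)
for fraction in (0.0, 0.1, 0.2, 0.3, 0.4):
    hits = sum(predict(w, b, degrade(s, fraction, rng)) == y
               for s, y in zip(train_seqs, train_labels))
    print(f"{round(fraction * 100):>2}% masked: accuracy {hits / len(train_seqs):.2f}")
```

Prediction is a single pass over the encoded positions, which is one way to read the linear-time claim. Encoding masked positions as zeros is a design choice of this sketch, picked so that incomplete sequences can still be scored from the positions that remain.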

Results

When applied to poor-quality sequence data, the classifier achieved more than 85% and up to 95% accuracy on the PRRSv and swine H1 IAV HA datasets, and accuracy rose to near-perfect when the full dataset was used. The model also identifies the amino acid positions used to determine genetic clade identity through a feature selection ranking within the model. These positions can be mapped onto a maximum-likelihood phylogenetic tree, allowing clade-defining mutations to be inferred.
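
How the feature ranking is computed is not stated here, so the following is only one plausible reading: with one-hot encoded positions, the absolute weights a logistic regression assigns to the residues at a position can be summed to give a per-position importance score. The function names, the four-residue toy alphabet, and the weight values are all hypothetical and exist only to make the sketch runnable.

```python
def rank_positions(weights, alphabet):
    """Rank 1-based alignment positions by summed absolute weight.

    `weights` is a flat list laid out position by position, with one value
    per residue in `alphabet` (the layout a one-hot encoding would produce).
    Ties are broken by position so the ordering is deterministic.
    """
    k = len(alphabet)
    if k == 0 or len(weights) % k != 0:
        raise ValueError("weights must hold one value per (position, residue)")
    scored = []
    for pos in range(len(weights) // k):
        block = weights[pos * k:(pos + 1) * k]
        scored.append((pos + 1, sum(abs(v) for v in block)))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


def favoured_residues(weights, alphabet, position):
    """Return the residues with the most positive and most negative weight.

    In a binary model the first residue pushes predictions toward class 1
    and the second toward class 0 at the given 1-based position.
    """
    k = len(alphabet)
    block = weights[(position - 1) * k:position * k]
    most_positive = max(range(k), key=lambda i: block[i])
    most_negative = min(range(k), key=lambda i: block[i])
    return alphabet[most_positive], alphabet[most_negative]


# Hypothetical weights for three positions over a four-residue toy alphabet.
toy_alphabet = "ACDE"
toy_weights = [0.0, 0.0, 0.0, 0.0,     # position 1: uninformative
               1.5, -1.5, 0.0, 0.0,    # position 2: A vs C separates the classes
               0.25, 0.0, -0.25, 0.0]  # position 3: weak signal

ranking = rank_positions(toy_weights, toy_alphabet)
print(ranking)
print(favoured_residues(toy_weights, toy_alphabet, ranking[0][0]))
```

The top-ranked positions, together with the residues favoured by each class, are the kind of output that could then be annotated on a phylogenetic tree as candidate clade-defining mutations.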

Discussion

Our approach is implemented as a Python package, with code available at https://github.com/flu-crew/classLog.
