Dive into Machine Learning Algorithms for Influenza Virus Host Prediction with Hemagglutinin Sequences

Journal: Bio-Algorithms and Med-Systems (IF 1.2, Q3, Computer Science) | Pub Date: 2022-07-28 | DOI: 10.48550/arXiv.2207.13842
Yanhua Xu, D. Wojtczak
Citations: 7

Abstract

Influenza viruses mutate rapidly and can pose a threat to public health, especially to those in vulnerable groups. Throughout history, influenza A viruses have caused pandemics between different species. It is important to identify the origin of a virus in order to prevent the spread of an outbreak. Recently, there has been increasing interest in using machine learning algorithms to provide fast and accurate predictions for viral sequences. In this study, real testing data sets and a variety of evaluation metrics were used to evaluate machine learning algorithms at different taxonomic levels. As hemagglutinin is the major protein in the immune response, only hemagglutinin sequences were used and represented by position-specific scoring matrix and word embedding. The results suggest that the 5-grams-transformer neural network is the most effective algorithm for predicting viral sequence origins, with approximately 99.54% AUCPR, 98.01% F1 score and 96.60% MCC at a higher classification level, and approximately 94.74% AUCPR, 87.41% F1 score and 80.79% MCC at a lower classification level.
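The abstract names two ingredients that are easy to illustrate: splitting a sequence into 5-grams before embedding, and scoring predictions with F1 and MCC. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes overlapping n-grams with stride 1 (the abstract does not say whether the n-grams overlap), it reduces the metrics to the binary case although the study evaluates several host classes, and the function names, host labels, and the short amino-acid string are hypothetical examples rather than data from the study.

```python
import math


def overlapping_ngrams(sequence: str, n: int = 5) -> list[str]:
    """Split a protein sequence into overlapping n-grams (stride 1).

    Assumption: stride 1. A sequence shorter than n yields an empty list.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def confusion_counts(y_true, y_pred, positive):
    """Count (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy amino-acid string, for illustration only.
    print(overlapping_ngrams("MKTIIALSY", n=5))

    # Hypothetical host labels and predictions.
    y_true = ["human", "human", "avian", "avian", "human"]
    y_pred = ["human", "avian", "avian", "avian", "human"]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="human")
    print(f1_score(tp, fp, fn), mcc(tp, fp, fn, tn))
```

MCC uses all four cells of the confusion matrix, which is one reason it is often reported next to F1 when classes are imbalanced. For the multi-class setting described in the abstract, an established library implementation would be the more practical choice.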
Source journal: Bio-Algorithms and Med-Systems (Mathematical & Computational Biology)
CiteScore: 3.80
Self-citation rate: 0.00%
Articles published: 3
About the journal: The journal Bio-Algorithms and Med-Systems (BAMS), edited by the Jagiellonian University Medical College, provides a forum for the exchange of information in the interdisciplinary fields of computational methods applied in medicine, presenting new algorithms and databases that allow progress in collaborations between medicine, informatics, physics, and biochemistry. Projects linking specialists representing these disciplines are welcome to be published in this journal. Articles in BAMS are published in English. Topics: bioinformatics, systems biology, telemedicine, e-learning in medicine, patient's electronic record, image processing, medical databases.