Feature selection for effective prediction of SARS-COV-2 using machine learning.

IF 1.7 | Zone 4, Biology | Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY | Genes & genomics | Pub Date: 2024-03-01 | Epub Date: 2023-11-20 | DOI: 10.1007/s13258-023-01467-6
Gagan Punacha, Rama Adiga
Genes & genomics, pages 341-354. Publication type: Journal Article.
Citations: 0

Abstract

Background: With the rise in SARS-CoV-2 variants, it is necessary to classify emerging SARS-CoV-2 strains for early detection and thereby reduce human transmission. Genomic and proteomic information has less frequently been used for classification in machine learning (ML) approaches to SARS-CoV-2 detection.

Objective: With this aim, we used nucleoprotein and viral proteomic evolutionary information of SARS-CoV-2, along with the charge and basicity distribution of amino acids from various SARS-CoV-2 strains, to generate an ML-based disease severity model.

Methods: All sequence and clinical data were obtained from GISAID. Proteomic-level calculations were added to complete the dataset. The training set was used for feature selection. The Select K-Best feature selection method was employed, cross-validated with the testing set, and its performance evaluated. DeLong's test was also performed. We also applied BIRCH clustering to cluster the SARS-CoV-2 strains.
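
As a rough illustration of the Select K-Best step described in the Methods, the sketch below scores each feature with a one-way ANOVA F-statistic and keeps the top k. The abstract does not name the scoring function, the library, or the feature columns, so the F-statistic, the helper names (`anova_f`, `select_k_best`), the feature names, and the toy values are all assumptions made for illustration, not the authors' pipeline.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across classes."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(rows, labels, names, k):
    """Return the k feature names with the highest F-scores, best first."""
    scores = {
        name: anova_f([row[j] for row in rows], labels)
        for j, name in enumerate(names)
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)[:k]


# Toy training data (made up): hypothetical feature columns and severity labels.
names = ["net_charge", "basicity", "noise"]
rows = [
    [1.0, 0.20, 5.0],
    [1.1, 0.90, 3.0],
    [3.0, 0.30, 4.0],
    [3.2, 0.80, 5.0],
    [5.1, 0.25, 3.0],
    [4.9, 0.85, 4.0],
]
labels = ["mild", "mild", "moderate", "moderate", "severe", "severe"]

selected = select_k_best(rows, labels, names, k=1)
print(selected)
```

In line with the Methods, the scores here would be computed on the training split only, and the same selected columns would then be applied to the test split.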

Results: Of six ML models, four were successfully trained and tested. The Extra Trees algorithm produced a micro-averaged F1-score of 74.2% and a weighted-average area under the receiver operating characteristic curve (AUC-ROC) of 73.7% with the multi-class option. With feature selection set to 5, the ROC AUC rose from 73.7% to 76.4%. The selected model achieved an accuracy of 86.9%.
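
To make the two reported metrics concrete, here is a minimal pure-Python sketch of a micro-averaged F1-score and a class-frequency-weighted one-vs-rest ROC AUC for a multi-class problem. The one-vs-rest scheme, the class labels, and the probabilities are assumptions for illustration; the abstract does not say how the multi-class AUC was computed, and none of the numbers below come from the paper.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)
            fp += (t != c and p == c)
            fn += (t == c and p != c)
    return 2 * tp / (2 * tp + fp + fn)


def binary_auc(is_pos, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, f in zip(scores, is_pos) if f]
    neg = [s for s, f in zip(scores, is_pos) if not f]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_ovr_auc(y_true, proba, classes):
    """One-vs-rest AUC per class, averaged with class-frequency weights."""
    total = 0.0
    for j, c in enumerate(classes):
        is_pos = [t == c for t in y_true]
        total += sum(is_pos) * binary_auc(is_pos, [row[j] for row in proba])
    return total / len(y_true)


# Toy example (made up): three hypothetical severity classes, six samples.
classes = ["mild", "moderate", "severe"]
y_true = ["mild", "mild", "mild", "moderate", "moderate", "severe"]
proba = [  # predicted class probabilities, columns ordered as in `classes`
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
]
# Predicted label = class with the highest probability.
y_pred = [classes[row.index(max(row))] for row in proba]

f1 = micro_f1(y_true, y_pred)
auc = weighted_ovr_auc(y_true, proba, classes)
print(round(f1, 3), round(auc, 3))
```

The weighted average lets frequent classes contribute more to the AUC, which matters when severity classes are imbalanced.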

Conclusion: The unique features identified by the ML approach were able to classify disease severity into classes and showed potential for predicting risk in newer variants.

Source journal
Genes & genomics
Category: Biology, Biochemistry & Molecular Biology
CiteScore: 3.70
Self-citation rate: 4.80%
Articles published: 131
Review time: 6-12 weeks
Journal description: Genes & Genomics is an official journal of the Korean Genetics Society (http://kgenetics.or.kr/), but membership of the Society is not required for contributors. It is a peer-reviewed international journal published in print (ISSN 1976-9571) and online (E-ISSN 2092-9293). It covers all disciplines of genetics and genomics, from prokaryotes to eukaryotes and from fundamental heredity to molecular aspects. Articles can be reviews, research articles, or short communications.