根据系统发育基因序列的 K-Mer 编码预测单链 RNA 病毒的病毒科和宿主

IF 2.6 4区生物学 Q2 BIOLOGY Computational Biology and Chemistry Pub Date : 2024-05-31 DOI:10.1016/j.compbiolchem.2024.108114

Bahar Çi̇ftçi̇ , Ramazan Teki̇n

{"title":"根据系统发育基因序列的 K-Mer 编码预测单链 RNA 病毒的病毒科和宿主","authors":"Bahar Çi̇ftçi̇ , Ramazan Teki̇n","doi":"10.1016/j.compbiolchem.2024.108114","DOIUrl":null,"url":null,"abstract":"<div><p>There are billions of virus species worldwide, and viruses, the smallest parasitic entities, pose a serious threat. Therefore, fighting associated disorders requires an understanding of the genetic structure of viruses. Considering the wide diversity and rapid evolution of viruses, there is a critical need to quickly and accurately classify viral species and their potential hosts to better understand transmission dynamics, facilitating the development of targeted therapies. Recognizing this, this study has investigated the classes of RNA viruses based on their genomic sequences using Machine Learning (ML) and Deep Learning (DL) models. The PhyVirus dataset, consisting of pathogenic Single-stranded RNA viruses of Baltimore group four (+ssRNA) and five (-ssRNA) with different hosts and species, was analyzed. The dataset containing viral gene sequences was analyzed using the K-Mer coding technique, which is based on base words of various lengths. The study used classical ML algorithms (Random Forest, Gradient Boosting and Extra Trees) and the Fully Connected Deep Neural Network, a Deep Learning algorithm, to predict viral families and hosts. Detailed analyses were performed on the classifier performance in scenarios with different train-test ratios and different word lengths (k-values) for K-Mer. The observed results show that Fully Connected Deep Neural Network has a high success rate of 99.60 % in predicting virus families. In predicting virus hosts, the Extra Trees classifier achieved the highest success rate of 81.53 %. This study is considered to be the first classification study in the literature on this dataset, which has a very large family and host diversity consisting of gene sequences of Single-stranded RNA viruses. Our detailed investigations on how varying word lengths based on K-Mer coding in gene sequences affect the classification into viral families and hosts make this study particularly valuable. This study shows that ML and DL methods have the potential to produce valuable results in phylogenetic studies. In addition, the results and high-performance values show that these methods can be successfully used in regenerative applications of gene sequences or in studies such as the elimination of losses in gene sequences.</p></div>","PeriodicalId":10616,"journal":{"name":"Computational Biology and Chemistry","volume":null,"pages":null},"PeriodicalIF":2.6000,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Prediction of viral families and hosts of single-stranded RNA viruses based on K-Mer coding from phylogenetic gene sequences\",\"authors\":\"Bahar Çi̇ftçi̇ , Ramazan Teki̇n\",\"doi\":\"10.1016/j.compbiolchem.2024.108114\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>There are billions of virus species worldwide, and viruses, the smallest parasitic entities, pose a serious threat. Therefore, fighting associated disorders requires an understanding of the genetic structure of viruses. Considering the wide diversity and rapid evolution of viruses, there is a critical need to quickly and accurately classify viral species and their potential hosts to better understand transmission dynamics, facilitating the development of targeted therapies. Recognizing this, this study has investigated the classes of RNA viruses based on their genomic sequences using Machine Learning (ML) and Deep Learning (DL) models. The PhyVirus dataset, consisting of pathogenic Single-stranded RNA viruses of Baltimore group four (+ssRNA) and five (-ssRNA) with different hosts and species, was analyzed. The dataset containing viral gene sequences was analyzed using the K-Mer coding technique, which is based on base words of various lengths. The study used classical ML algorithms (Random Forest, Gradient Boosting and Extra Trees) and the Fully Connected Deep Neural Network, a Deep Learning algorithm, to predict viral families and hosts. Detailed analyses were performed on the classifier performance in scenarios with different train-test ratios and different word lengths (k-values) for K-Mer. The observed results show that Fully Connected Deep Neural Network has a high success rate of 99.60 % in predicting virus families. In predicting virus hosts, the Extra Trees classifier achieved the highest success rate of 81.53 %. This study is considered to be the first classification study in the literature on this dataset, which has a very large family and host diversity consisting of gene sequences of Single-stranded RNA viruses. Our detailed investigations on how varying word lengths based on K-Mer coding in gene sequences affect the classification into viral families and hosts make this study particularly valuable. This study shows that ML and DL methods have the potential to produce valuable results in phylogenetic studies. In addition, the results and high-performance values show that these methods can be successfully used in regenerative applications of gene sequences or in studies such as the elimination of losses in gene sequences.</p></div>\",\"PeriodicalId\":10616,\"journal\":{\"name\":\"Computational Biology and Chemistry\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2024-05-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computational Biology and Chemistry\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1476927124001026\",\"RegionNum\":4,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Biology and Chemistry","FirstCategoryId":"99","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1476927124001026","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

全世界有数十亿种病毒，病毒作为最小的寄生实体，构成了严重威胁。因此，防治相关疾病需要了解病毒的基因结构。考虑到病毒的广泛多样性和快速进化，亟需快速准确地对病毒种类及其潜在宿主进行分类，以便更好地了解传播动态，促进靶向疗法的开发。有鉴于此，本研究利用机器学习（ML）和深度学习（DL）模型，根据 RNA 病毒的基因组序列研究了病毒的类别。研究分析了由不同宿主和物种的巴尔的摩第四组（+ssRNA）和第五组（-ssRNA）致病性单链 RNA 病毒组成的 PhyVirus 数据集。包含病毒基因序列的数据集采用 K-Mer 编码技术进行分析，该技术基于不同长度的碱基词。研究使用了经典的 ML 算法（随机森林、梯度提升和额外树）和全连接深度神经网络（一种深度学习算法）来预测病毒家族和宿主。对 K-Mer 不同训练测试比和不同词长（k 值）情况下的分类器性能进行了详细分析。观察结果表明，全连接深度神经网络预测病毒家族的成功率高达 99.60%。在预测病毒宿主方面，额外树分类器的成功率最高，达到 81.53%。该数据集由单链 RNA 病毒的基因序列组成，具有非常大的科属和宿主多样性。我们对基于基因序列中 K-Mer 编码的不同字长如何影响病毒科和宿主分类进行了详细调查，这使得这项研究尤为宝贵。这项研究表明，ML 和 DL 方法有可能在系统发育研究中产生有价值的结果。此外，研究结果和高性能值还表明，这些方法可以成功地用于基因序列的再生应用或消除基因序列损失等研究。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Prediction of viral families and hosts of single-stranded RNA viruses based on K-Mer coding from phylogenetic gene sequences

There are billions of virus species worldwide, and viruses, the smallest parasitic entities, pose a serious threat. Therefore, fighting associated disorders requires an understanding of the genetic structure of viruses. Considering the wide diversity and rapid evolution of viruses, there is a critical need to quickly and accurately classify viral species and their potential hosts to better understand transmission dynamics, facilitating the development of targeted therapies. Recognizing this, this study has investigated the classes of RNA viruses based on their genomic sequences using Machine Learning (ML) and Deep Learning (DL) models. The PhyVirus dataset, consisting of pathogenic Single-stranded RNA viruses of Baltimore group four (+ssRNA) and five (-ssRNA) with different hosts and species, was analyzed. The dataset containing viral gene sequences was analyzed using the K-Mer coding technique, which is based on base words of various lengths. The study used classical ML algorithms (Random Forest, Gradient Boosting and Extra Trees) and the Fully Connected Deep Neural Network, a Deep Learning algorithm, to predict viral families and hosts. Detailed analyses were performed on the classifier performance in scenarios with different train-test ratios and different word lengths (k-values) for K-Mer. The observed results show that Fully Connected Deep Neural Network has a high success rate of 99.60 % in predicting virus families. In predicting virus hosts, the Extra Trees classifier achieved the highest success rate of 81.53 %. This study is considered to be the first classification study in the literature on this dataset, which has a very large family and host diversity consisting of gene sequences of Single-stranded RNA viruses. Our detailed investigations on how varying word lengths based on K-Mer coding in gene sequences affect the classification into viral families and hosts make this study particularly valuable. This study shows that ML and DL methods have the potential to produce valuable results in phylogenetic studies. In addition, the results and high-performance values show that these methods can be successfully used in regenerative applications of gene sequences or in studies such as the elimination of losses in gene sequences.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computational Biology and Chemistry 生物-计算机：跨学科应用

CiteScore

6.10

自引率

3.20%

发文量

142

审稿时长

24 days

期刊介绍： Computational Biology and Chemistry publishes original research papers and review articles in all areas of computational life sciences. High quality research contributions with a major computational component in the areas of nucleic acid and protein sequence research, molecular evolution, molecular genetics (functional genomics and proteomics), theory and practice of either biology-specific or chemical-biology-specific modeling, and structural biology of nucleic acids and proteins are particularly welcome. Exceptionally high quality research work in bioinformatics, systems biology, ecology, computational pharmacology, metabolism, biomedical engineering, epidemiology, and statistical genetics will also be considered. Given their inherent uncertainty, protein modeling and molecular docking studies should be thoroughly validated. In the absence of experimental results for validation, the use of molecular dynamics simulations along with detailed free energy calculations, for example, should be used as complementary techniques to support the major conclusions. Submissions of premature modeling exercises without additional biological insights will not be considered. Review articles will generally be commissioned by the editors and should not be submitted to the journal without explicit invitation. However prospective authors are welcome to send a brief (one to three pages) synopsis, which will be evaluated by the editors.