ViraLM: Empowering Virus Discovery through the Genome Foundation Model.

Cheng Peng, Jiayu Shang, Jiaojiao Guan, Donglin Wang, Yanni Sun
Journal: Bioinformatics (Oxford, England)
Published: 2024-11-23
DOI: 10.1093/bioinformatics/btae704 (https://doi.org/10.1093/bioinformatics/btae704)

Abstract

Motivation: Viruses, with their ubiquitous presence and high diversity, play pivotal roles in ecological systems and public health. Accurate identification of viruses in various ecosystems is essential for comprehending their variety and assessing their ecological influence. Metagenomic sequencing has become a major strategy to survey the viruses in various ecosystems. However, accurate and comprehensive virus detection in metagenomic data remains difficult. Limited reference sequences prevent alignment-based methods from identifying novel viruses. Machine learning-based tools are more promising in novel virus detection but often miss short viral contigs, which are abundant in typical metagenomic data. The inconsistency in virus search results produced by available tools further highlights the urgent need for a more robust tool for virus identification.

Results: In this work, we develop ViraLM for identifying novel viral contigs in metagenomic data. By employing the latest genome foundation model as the backbone and training on a rigorously constructed dataset, the model is able to distinguish viruses from other organisms based on the learned genomic characteristics. We thoroughly tested ViraLM on multiple datasets and the experimental results show that ViraLM outperforms available tools in different scenarios. In particular, ViraLM improves the F1-score on short contigs by 22%.
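For readers unfamiliar with the metric cited above, the following is a minimal sketch of how an F1-score can be computed for a binary viral versus non-viral contig classification. The function name `f1_score`, the label encoding (1 = viral, 0 = non-viral), and the toy labels are assumptions made for illustration; they are not taken from ViraLM's code or from the paper's evaluation data.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the F1-score of the positive class (1 = viral, 0 = non-viral)."""
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No correct positive predictions: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration only (not data from the paper).
toy_truth = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]
toy_f1 = f1_score(toy_truth, toy_pred)
```

Because F1 is the harmonic mean of precision and recall, a tool that misses many short viral contigs (low recall) scores poorly even if its positive predictions are mostly correct.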

Availability: The source code of ViraLM is available via: https://github.com/ChengPENG-wolf/ViraLM.
