VirusPredictor: XGBoost-based software to predict virus-related sequences in human data.

IF 5.5 3区 材料科学 Q2 CHEMISTRY, PHYSICAL ACS Applied Energy Materials Pub Date : 2024-04-10 DOI:10.1093/bioinformatics/btae192
Guangchen Liu, Xun Chen, Yihui Luan, Dawei Li
{"title":"VirusPredictor: XGBoost-based software to predict virus-related sequences in human data.","authors":"Guangchen Liu, Xun Chen, Yihui Luan, Dawei Li","doi":"10.1093/bioinformatics/btae192","DOIUrl":null,"url":null,"abstract":"MOTIVATION\nDiscovering disease causative pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data.\n\n\nRESULTS\nWe developed a fast XGBoost method and software VirusPredictor leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, ie, 0.76, 0.93, and 0.98 for 150-350 (Illumina short reads), 850-950 (Sanger sequencing data), and 2,000-5,000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to > 0.98 when query sequences increased from 150-350 to > 850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g., ∼1,000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. This is also the first study combining virus sub-group predictions.\n\n\nAVAILABILITY\nwww.dllab.org/software/VirusPredictor.html.\n\n\nSUPPLEMENTARY INFORMATION\nSupplementary data are available at Bioinformatics online.","PeriodicalId":4,"journal":{"name":"ACS Applied Energy Materials","volume":"216 1‐2","pages":""},"PeriodicalIF":5.5000,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Energy Materials","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bioinformatics/btae192","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CHEMISTRY, PHYSICAL","Score":null,"Total":0}
引用次数: 0

Abstract

MOTIVATION Discovering disease causative pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data. RESULTS We developed a fast XGBoost method and software VirusPredictor leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, ie, 0.76, 0.93, and 0.98 for 150-350 (Illumina short reads), 850-950 (Sanger sequencing data), and 2,000-5,000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to > 0.98 when query sequences increased from 150-350 to > 850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g., ∼1,000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. This is also the first study combining virus sub-group predictions. AVAILABILITY www.dllab.org/software/VirusPredictor.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
VirusPredictor:基于 XGBoost 的软件,用于预测人类数据中的病毒相关序列。
动机发现致病病原体,尤其是没有参考基因组的病毒,是一项技术挑战,因为通过序列比对往往无法识别这些病原体。对无法与人类和病原体基因组比对的病人高通量序列进行机器学习预测,可能会发现源自未定性病毒的序列。目前,还缺乏专门用于准确预测人类数据中此类病毒序列的软件。结果我们利用内部病毒基因组数据库开发了一种快速 XGBoost 方法和软件 VirusPredictor。我们的两步 XGBoost 模型首先将每个查询序列分为三类:传染性病毒、内源性逆转录病毒 (ERV) 或非ERV 人类。序列越长,预测准确率越高,150-350(Illumina 短读数)、850-950(Sanger 测序数据)和 2,000-5,000 bp 序列的预测准确率分别为 0.76、0.93 和 0.98。当查询序列从 150-350 bp 增加到大于 850 bp 时,准确度从 0.92 增加到大于 0.98。结果表明,Illumina 短读数应尽可能在预测前从头组装成等体(例如,1000 bp 或更长)。我们将 VirusPredictor 应用于多个真实的基因组和元基因组数据集,并获得了很高的准确率。VirusPredictor 是一款用户友好的开源 Python 软件,可用于预测患者不可应用序列的来源。这项研究首次在传染性病毒序列预测中对 ERV 进行了分类。这也是第一项结合病毒亚群预测的研究。AVAILABILITYwww.dllab.org/software/VirusPredictor.html.SUPPLEMENTARY INFORMATIONS补充数据可在生物信息学网上获取。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
ACS Applied Energy Materials
ACS Applied Energy Materials Materials Science-Materials Chemistry
CiteScore
10.30
自引率
6.20%
发文量
1368
期刊介绍: ACS Applied Energy Materials is an interdisciplinary journal publishing original research covering all aspects of materials, engineering, chemistry, physics and biology relevant to energy conversion and storage. The journal is devoted to reports of new and original experimental and theoretical research of an applied nature that integrate knowledge in the areas of materials, engineering, physics, bioscience, and chemistry into important energy applications.
期刊最新文献
Issue Editorial Masthead Issue Publication Information PtFe Alloy Nanoparticles Supported on Polymeric Schiff Base-Derived N-Doped Carbon for Oxygen Reduction Reaction Improved Perovskite Solar Cells with an Environmentally Friendly Phthalocyanine Hole Extracting Interlayer Boosting MIL-101(V) as a Vanadium-Based Metal–Organic Framework via MoS2/Graphene Quantum Dot Nanocomposite in Electrochemical Hydrogen Storage
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1