VirusPredictor: XGBoost-based software to predict virus-related sequences in human data.

Bioinformatics Pub Date: 2024-04-10 DOI: 10.1093/bioinformatics/btae192

Guangchen Liu, Xun Chen, Yihui Luan, Dawei Li



Abstract




MOTIVATION: Discovering disease-causative pathogens, particularly viruses without reference genomes, poses a technical challenge because they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences that cannot be mapped to human or pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data.

RESULTS: We developed a fast XGBoost method and software, VirusPredictor, leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV), or non-ERV human. The prediction accuracies increased as the sequences became longer, i.e., 0.76, 0.93, and 0.98 for 150-350 bp (Illumina short reads), 850-950 bp (Sanger sequencing data), and 2,000-5,000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups; these accuracies increased from 0.92 to >0.98 as query length increased from 150-350 bp to >850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g., ∼1,000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. It is also the first study to combine virus subgroup predictions.

AVAILABILITY: www.dllab.org/software/VirusPredictor.html

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
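The two-step flow described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not VirusPredictor's actual code or interface: the function names, the `Classifier` callable type, and the result dictionary are all hypothetical, and the classifiers are passed in as plain callables because the abstract does not describe the features or model settings. Only the three step-1 group names and the reported step-1 accuracies per length range come from the abstract.

```python
from typing import Callable, Dict, Optional

# The three step-1 groups named in the abstract.
STEP1_GROUPS = ("infectious virus", "ERV", "non-ERV human")

# Step-1 accuracies reported in the abstract, keyed by length range in bp.
REPORTED_STEP1_ACCURACY = (
    ((150, 350), 0.76),    # Illumina short reads
    ((850, 950), 0.93),    # Sanger sequencing data
    ((2000, 5000), 0.98),  # longer sequences
)

# Hypothetical interface: a classifier maps a nucleotide string to a label.
Classifier = Callable[[str], str]


def reported_accuracy(length: int) -> Optional[float]:
    """Return the reported step-1 accuracy for a length, or None if the
    length falls outside the ranges quoted in the abstract."""
    for (low, high), accuracy in REPORTED_STEP1_ACCURACY:
        if low <= length <= high:
            return accuracy
    return None


def two_step_predict(seq: str, step1: Classifier, step2: Classifier) -> Dict[str, object]:
    """Run step 1 on every sequence; run step 2 only on sequences that
    step 1 assigns to the infectious-virus group."""
    group = step1(seq)
    if group not in STEP1_GROUPS:
        raise ValueError(f"unexpected step-1 label: {group!r}")
    subgroup = step2(seq) if group == "infectious virus" else None
    return {
        "group": group,
        "subgroup": subgroup,  # one of six taxonomic subgroups in the paper
        "reported_step1_accuracy": reported_accuracy(len(seq)),
    }
```

In practice `step1` and `step2` would wrap trained models; any toy stand-in (for example, a function that always returns a fixed label) is enough to exercise the routing logic. The lookup returning `None` for lengths between the quoted ranges is a deliberate choice here: the abstract gives no figures for those lengths, so the sketch does not interpolate.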

