TransGINmer: Identifying viral sequences from metagenomes with self-attention and Graph Isomorphism Network
Journal: Future Generation Computer Systems-The International Journal of Escience (impact factor 6.2, JCR Q1, Computer Science, Theory & Methods)
Published: 2024-07-16
DOI: 10.1016/j.future.2024.07.025
URL: https://www.sciencedirect.com/science/article/pii/S0167739X24003893
Citation count: 0
Abstract
Viruses are abundant across diverse environments, play pivotal roles in microbial ecosystems, and affect human health. Traditional virus studies are limited by their reliance on culture-based cultivation, a limitation that metagenomics mitigates by obtaining the nucleotide sequences of all microorganisms in an environmental sample through next-generation sequencing. This advance creates a need for efficient viral identification methods. To identify viruses accurately and quickly, we propose TransGINmer, a novel deep learning model that identifies viral sequences directly from metagenomes. It encodes sequences with a k-mer frequency embedding model, constructs graphs from significant codon token correlations, and classifies them using graph isomorphism neural networks. In comparative tests against the state-of-the-art methods DeepVirFinder, VirSorter2 and PhaMer on the testing dataset, the Amazon River dataset, the Sharon dataset and the CAMI Strain dataset, TransGINmer achieves higher accuracy, sensitivity, specificity, and AUC values, showing its potential as a robust tool for viral identification from metagenomes. TransGINmer is freely available on GitHub (https://github.com/xizhilangcc/TransGINmer).
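To make the k-mer frequency idea in the abstract concrete, here is a minimal Python sketch that counts overlapping k-mers in a nucleotide sequence and normalizes them to relative frequencies. It is not the TransGINmer implementation: the function name `kmer_frequencies`, the default `k = 3` (chosen only because the abstract mentions codon tokens), the use of overlapping windows, and the skipping of ambiguous bases are all assumptions made for illustration. The model described in the abstract uses a frequency embedding model on top of such tokens, which this sketch does not attempt to reproduce.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int = 3) -> dict:
    """Return the relative frequency of every possible k-mer over A, C, G, T.

    Hypothetical helper for illustration; not taken from the TransGINmer code.
    """
    sequence = sequence.upper()
    # Slide a window of width k across the sequence, one base at a time.
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Skip windows that contain ambiguous bases such as N (an assumption).
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    # Enumerate the full 4**k vocabulary so every sequence maps to a
    # vector of the same length, with zeros for unseen k-mers.
    vocabulary = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


# Toy sequence, not real data.
freqs = kmer_frequencies("ATGCGATGNA")
```

For `k = 3` the result always has 64 entries, so sequences of different lengths become fixed-size vectors that a downstream model could consume.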
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.