{"title":"Deepdefense: annotation of immune systems in prokaryotes using deep learning.","authors":"Sven Hauns, Omer S Alkhnbashi, Rolf Backofen","doi":"10.1093/gigascience/giae062","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Due to a constant evolutionary arms race, archaea and bacteria have evolved an abundance and diversity of immune responses to protect themselves against phages. Since the discovery and application of CRISPR-Cas adaptive immune systems, numerous novel candidates for immune systems have been identified. Previous approaches to identifying these new immune systems rely on hidden Markov model (HMM)-based homolog searches or use labor-intensive and costly wet-lab experiments. To aid in finding and classifying immune systems genomes, we use machine learning to classify already known immune system proteins and discover potential candidates in the genome. Neural networks have shown promising results in classifying and predicting protein functionality in recent years. However, these methods often operate under the closed-world assumption, where it is presumed that all potential outcomes or classes are already known and included in the training dataset. This assumption does not always hold true in real-world scenarios, such as in genomics, where new samples can emerge that were not previously accounted for in the training phase.</p><p><strong>Results: </strong>In this work, we explore neural networks for immune protein classification, deal with different methods for rejecting unrelated proteins in a genome-wide search, and establish a benchmark. Then, we optimize our approach for accuracy. Based on this, we develop an algorithm called Deepdefense to predict immune cassette classes based on a genome. This design facilitates the differentiation between immune system-related and unrelated proteins by analyzing variations in model-predicted confidence values, aiding in the identification of both known and potentially novel immune system proteins. Finally, we test our approach for detecting immune systems in the genome against an HMM-based method.</p><p><strong>Conclusions: </strong>Deepdefense can automatically detect genes and define cassette annotations and classifications using 2 model classifications. This is achieved by creating an optimized deep learning model to annotate immune systems, in combination with calibration methods, and a second model to enable the scanning of an entire genome.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8000,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giae062","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Due to a constant evolutionary arms race, archaea and bacteria have evolved an abundance and diversity of immune responses to protect themselves against phages. Since the discovery and application of CRISPR-Cas adaptive immune systems, numerous novel candidates for immune systems have been identified. Previous approaches to identifying these new immune systems rely on hidden Markov model (HMM)-based homolog searches or use labor-intensive and costly wet-lab experiments. To aid in finding and classifying immune systems genomes, we use machine learning to classify already known immune system proteins and discover potential candidates in the genome. Neural networks have shown promising results in classifying and predicting protein functionality in recent years. However, these methods often operate under the closed-world assumption, where it is presumed that all potential outcomes or classes are already known and included in the training dataset. This assumption does not always hold true in real-world scenarios, such as in genomics, where new samples can emerge that were not previously accounted for in the training phase.
Results: In this work, we explore neural networks for immune protein classification, deal with different methods for rejecting unrelated proteins in a genome-wide search, and establish a benchmark. Then, we optimize our approach for accuracy. Based on this, we develop an algorithm called Deepdefense to predict immune cassette classes based on a genome. This design facilitates the differentiation between immune system-related and unrelated proteins by analyzing variations in model-predicted confidence values, aiding in the identification of both known and potentially novel immune system proteins. Finally, we test our approach for detecting immune systems in the genome against an HMM-based method.
Conclusions: Deepdefense can automatically detect genes and define cassette annotations and classifications using 2 model classifications. This is achieved by creating an optimized deep learning model to annotate immune systems, in combination with calibration methods, and a second model to enable the scanning of an entire genome.
期刊介绍:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.