The human gut microbiota produces diverse, extensive metabolites that have the potential to affect host physiology. Despite significant efforts to identify metabolic pathways for producing these microbial metabolites, a comprehensive metabolic pathway database for the human gut microbiota is still lacking. Here, we present Enteropathway, a metabolic pathway database that integrates 3269 compounds, 3677 reactions, and 876 modules that were obtained from 1012 manually curated scientific literature. Notably, 698 modules of these modules are new entries and cannot be found in any other databases. The database is accessible from a web application (https://enteropathway.org) that offers a metabolic diagram for graphical visualization of metabolic pathways, a customization interface, and an enrichment analysis feature for highlighting enriched modules on the metabolic diagram. Overall, Enteropathway is a comprehensive reference database that can complement widely used databases, and a tool for visual and statistical analysis in human gut microbiota studies and was designed to help researchers pinpoint new insights into the complex interplay between microbiota and host metabolism.
{"title":"Enteropathway: the metabolic pathway database for the human gut microbiota.","authors":"Hirotsugu Shiroma, Youssef Darzi, Etsuko Terajima, Zenichi Nakagawa, Hirotaka Tsuchikura, Naoki Tsukuda, Yuki Moriya, Shujiro Okuda, Susumu Goto, Takuji Yamada","doi":"10.1093/bib/bbae419","DOIUrl":"10.1093/bib/bbae419","url":null,"abstract":"<p><p>The human gut microbiota produces diverse, extensive metabolites that have the potential to affect host physiology. Despite significant efforts to identify metabolic pathways for producing these microbial metabolites, a comprehensive metabolic pathway database for the human gut microbiota is still lacking. Here, we present Enteropathway, a metabolic pathway database that integrates 3269 compounds, 3677 reactions, and 876 modules that were obtained from 1012 manually curated scientific literature. Notably, 698 modules of these modules are new entries and cannot be found in any other databases. The database is accessible from a web application (https://enteropathway.org) that offers a metabolic diagram for graphical visualization of metabolic pathways, a customization interface, and an enrichment analysis feature for highlighting enriched modules on the metabolic diagram. Overall, Enteropathway is a comprehensive reference database that can complement widely used databases, and a tool for visual and statistical analysis in human gut microbiota studies and was designed to help researchers pinpoint new insights into the complex interplay between microbiota and host metabolism.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11367760/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142104463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Young Su Ko, Jonathan Parkinson, Cong Liu, Wei Wang
Protein-protein interactions (PPIs) are important for many biological processes, but predicting them from sequence data remains challenging. Existing deep learning models often cannot generalize to proteins not present in the training set and do not provide uncertainty estimates for their predictions. To address these limitations, we present TUnA, a Transformer-based uncertainty-aware model for PPI prediction. TUnA uses ESM-2 embeddings with Transformer encoders and incorporates a Spectral-normalized Neural Gaussian Process. TUnA achieves state-of-the-art performance and, importantly, evaluates uncertainty for unseen sequences. We demonstrate that TUnA's uncertainty estimates can effectively identify the most reliable predictions, significantly reducing false positives. This capability is crucial in bridging the gap between computational predictions and experimental validation.
{"title":"TUnA: an uncertainty-aware transformer model for sequence-based protein-protein interaction prediction.","authors":"Young Su Ko, Jonathan Parkinson, Cong Liu, Wei Wang","doi":"10.1093/bib/bbae359","DOIUrl":"10.1093/bib/bbae359","url":null,"abstract":"<p><p>Protein-protein interactions (PPIs) are important for many biological processes, but predicting them from sequence data remains challenging. Existing deep learning models often cannot generalize to proteins not present in the training set and do not provide uncertainty estimates for their predictions. To address these limitations, we present TUnA, a Transformer-based uncertainty-aware model for PPI prediction. TUnA uses ESM-2 embeddings with Transformer encoders and incorporates a Spectral-normalized Neural Gaussian Process. TUnA achieves state-of-the-art performance and, importantly, evaluates uncertainty for unseen sequences. We demonstrate that TUnA's uncertainty estimates can effectively identify the most reliable predictions, significantly reducing false positives. This capability is crucial in bridging the gap between computational predictions and experimental validation.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11269822/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141757276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xinze Liu, Jingxuan Shi, Yuanyuan Jiao, Jiaqi An, Jingwei Tian, Yue Yang, Li Zhuo
The development of omics technologies has driven a profound expansion in the scale of biological data and the increased complexity in internal dimensions, prompting the utilization of machine learning (ML) as a powerful toolkit for extracting knowledge and understanding underlying biological patterns. Kidney disease represents one of the major growing global health threats with intricate pathogenic mechanisms and a lack of precise molecular pathology-based therapeutic modalities. Accordingly, there is a need for advanced high-throughput approaches to capture implicit molecular features and complement current experiments and statistics. This review aims to delineate strategies for integrating multi-omics data with appropriate ML methods, highlighting key clinical translational scenarios, including predicting disease progression risks to improve medical decision-making, comprehensively understanding disease molecular mechanisms, and practical applications of image recognition in renal digital pathology. Examining the benefits and challenges of current integration efforts is expected to shed light on the complexity of kidney disease and advance clinical practice.
全息技术的发展推动了生物数据规模的大幅扩大和内部维度复杂性的增加,促使人们利用机器学习(ML)作为提取知识和理解潜在生物模式的强大工具包。肾脏疾病是日益增长的全球主要健康威胁之一,其致病机制错综复杂,但缺乏基于分子病理学的精确治疗方法。因此,需要先进的高通量方法来捕捉隐含的分子特征,并对当前的实验和统计进行补充。本综述旨在阐述将多组学数据与适当的 ML 方法相结合的策略,重点介绍关键的临床转化方案,包括预测疾病进展风险以改善医疗决策、全面了解疾病分子机制以及图像识别在肾脏数字病理学中的实际应用。研究当前整合工作的益处和挑战有望揭示肾脏疾病的复杂性并推动临床实践。
{"title":"Integrated multi-omics with machine learning to uncover the intricacies of kidney disease.","authors":"Xinze Liu, Jingxuan Shi, Yuanyuan Jiao, Jiaqi An, Jingwei Tian, Yue Yang, Li Zhuo","doi":"10.1093/bib/bbae364","DOIUrl":"10.1093/bib/bbae364","url":null,"abstract":"<p><p>The development of omics technologies has driven a profound expansion in the scale of biological data and the increased complexity in internal dimensions, prompting the utilization of machine learning (ML) as a powerful toolkit for extracting knowledge and understanding underlying biological patterns. Kidney disease represents one of the major growing global health threats with intricate pathogenic mechanisms and a lack of precise molecular pathology-based therapeutic modalities. Accordingly, there is a need for advanced high-throughput approaches to capture implicit molecular features and complement current experiments and statistics. This review aims to delineate strategies for integrating multi-omics data with appropriate ML methods, highlighting key clinical translational scenarios, including predicting disease progression risks to improve medical decision-making, comprehensively understanding disease molecular mechanisms, and practical applications of image recognition in renal digital pathology. Examining the benefits and challenges of current integration efforts is expected to shed light on the complexity of kidney disease and advance clinical practice.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11289682/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141854885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Constructing accurate gene regulatory network s (GRNs), which reflect the dynamic governing process between genes, is critical to understanding the diverse cellular process and unveiling the complexities in biological systems. With the development of computer sciences, computational-based approaches have been applied to the GRNs inference task. However, current methodologies face challenges in effectively utilizing existing topological information and prior knowledge of gene regulatory relationships, hindering the comprehensive understanding and accurate reconstruction of GRNs. In response, we propose a novel graph neural network (GNN)-based Multi-Task Learning framework for GRN reconstruction, namely MTLGRN. Specifically, we first encode the gene promoter sequences and the gene biological features and concatenate the corresponding feature representations. Then, we construct a multi-task learning framework including GRN reconstruction, Gene knockout predict, and Gene expression matrix reconstruction. With joint training, MTLGRN can optimize the gene latent representations by integrating gene knockout information, promoter characteristics, and other biological attributes. Extensive experimental results demonstrate superior performance compared with state-of-the-art baselines on the GRN reconstruction task, efficiently leveraging biological knowledge and comprehensively understanding the gene regulatory relationships. MTLGRN also pioneered attempts to simulate gene knockouts on bulk data by incorporating gene knockout information.
{"title":"Refining computational inference of gene regulatory networks: integrating knockout data within a multi-task framework.","authors":"Wentao Cui, Qingqing Long, Meng Xiao, Xuezhi Wang, Guihai Feng, Xin Li, Pengfei Wang, Yuanchun Zhou","doi":"10.1093/bib/bbae361","DOIUrl":"10.1093/bib/bbae361","url":null,"abstract":"<p><p>Constructing accurate gene regulatory network s (GRNs), which reflect the dynamic governing process between genes, is critical to understanding the diverse cellular process and unveiling the complexities in biological systems. With the development of computer sciences, computational-based approaches have been applied to the GRNs inference task. However, current methodologies face challenges in effectively utilizing existing topological information and prior knowledge of gene regulatory relationships, hindering the comprehensive understanding and accurate reconstruction of GRNs. In response, we propose a novel graph neural network (GNN)-based Multi-Task Learning framework for GRN reconstruction, namely MTLGRN. Specifically, we first encode the gene promoter sequences and the gene biological features and concatenate the corresponding feature representations. Then, we construct a multi-task learning framework including GRN reconstruction, Gene knockout predict, and Gene expression matrix reconstruction. With joint training, MTLGRN can optimize the gene latent representations by integrating gene knockout information, promoter characteristics, and other biological attributes. Extensive experimental results demonstrate superior performance compared with state-of-the-art baselines on the GRN reconstruction task, efficiently leveraging biological knowledge and comprehensively understanding the gene regulatory relationships. MTLGRN also pioneered attempts to simulate gene knockouts on bulk data by incorporating gene knockout information.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11289685/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141854887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
During the drug discovery and design process, the acid-base dissociation constant (pKa) of a molecule is critically emphasized due to its crucial role in influencing the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties and biological activity. However, the experimental determination of pKa values is often laborious and complex. Moreover, existing prediction methods exhibit limitations in both the quantity and quality of the training data, as well as in their capacity to handle the complex structural and physicochemical properties of compounds, consequently impeding accuracy and generalization. Therefore, developing a method that can quickly and accurately predict molecular pKa values will to some extent help the structural modification of molecules, and thus assist the development process of new drugs. In this study, we developed a cutting-edge pKa prediction model named GR-pKa (Graph Retention pKa), leveraging a message-passing neural network and employing a multi-fidelity learning strategy to accurately predict molecular pKa values. The GR-pKa model incorporates five quantum mechanical properties related to molecular thermodynamics and dynamics as key features to characterize molecules. Notably, we originally introduced the novel retention mechanism into the message-passing phase, which significantly improves the model's ability to capture and update molecular information. Our GR-pKa model outperforms several state-of-the-art models in predicting macro-pKa values, achieving impressive results with a low mean absolute error of 0.490 and root mean square error of 0.588, and a high R2 of 0.937 on the SAMPL7 dataset.
{"title":"GR-pKa: a message-passing neural network with retention mechanism for pKa prediction.","authors":"Runyu Miao, Danlin Liu, Liyun Mao, Xingyu Chen, Leihao Zhang, Zhen Yuan, Shanshan Shi, Honglin Li, Shiliang Li","doi":"10.1093/bib/bbae408","DOIUrl":"10.1093/bib/bbae408","url":null,"abstract":"<p><p>During the drug discovery and design process, the acid-base dissociation constant (pKa) of a molecule is critically emphasized due to its crucial role in influencing the ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties and biological activity. However, the experimental determination of pKa values is often laborious and complex. Moreover, existing prediction methods exhibit limitations in both the quantity and quality of the training data, as well as in their capacity to handle the complex structural and physicochemical properties of compounds, consequently impeding accuracy and generalization. Therefore, developing a method that can quickly and accurately predict molecular pKa values will to some extent help the structural modification of molecules, and thus assist the development process of new drugs. In this study, we developed a cutting-edge pKa prediction model named GR-pKa (Graph Retention pKa), leveraging a message-passing neural network and employing a multi-fidelity learning strategy to accurately predict molecular pKa values. The GR-pKa model incorporates five quantum mechanical properties related to molecular thermodynamics and dynamics as key features to characterize molecules. Notably, we originally introduced the novel retention mechanism into the message-passing phase, which significantly improves the model's ability to capture and update molecular information. Our GR-pKa model outperforms several state-of-the-art models in predicting macro-pKa values, achieving impressive results with a low mean absolute error of 0.490 and root mean square error of 0.588, and a high R2 of 0.937 on the SAMPL7 dataset.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11339865/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142016416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Microsatellite instability (MSI) is a phenomenon seen in several cancer types, which can be used as a biomarker to help guide immune checkpoint inhibitor treatment. To facilitate this, researchers have developed computational tools to categorize samples as having high microsatellite instability, or as being microsatellite stable using next-generation sequencing data. Most of these tools were published with unclear scope and usage, and they have yet to be independently benchmarked. To address these issues, we assessed the performance of eight leading MSI tools across several unique datasets that encompass a wide variety of sequencing methods. While we were able to replicate the original findings of each tool on whole exome sequencing data, most tools had worse receiver operating characteristic and precision-recall area under the curve values on whole genome sequencing data. We also found that they lacked agreement with one another and with commercial MSI software on gene panel data, and that optimal threshold cut-offs vary by sequencing type. Lastly, we tested tools made specifically for RNA sequencing data and found they were outperformed by tools designed for use with DNA sequencing data. Out of all, two tools (MSIsensor2, MANTIS) performed well across nearly all datasets, but when all datasets were combined, their precision decreased. Our results caution that MSI tools can have much lower performance on datasets other than those on which they were originally evaluated, and in the case of RNA sequencing tools, can even perform poorly on the type of data for which they were created.
{"title":"Performance assessment of computational tools to detect microsatellite instability.","authors":"Harrison Anthony, Cathal Seoighe","doi":"10.1093/bib/bbae390","DOIUrl":"10.1093/bib/bbae390","url":null,"abstract":"<p><p>Microsatellite instability (MSI) is a phenomenon seen in several cancer types, which can be used as a biomarker to help guide immune checkpoint inhibitor treatment. To facilitate this, researchers have developed computational tools to categorize samples as having high microsatellite instability, or as being microsatellite stable using next-generation sequencing data. Most of these tools were published with unclear scope and usage, and they have yet to be independently benchmarked. To address these issues, we assessed the performance of eight leading MSI tools across several unique datasets that encompass a wide variety of sequencing methods. While we were able to replicate the original findings of each tool on whole exome sequencing data, most tools had worse receiver operating characteristic and precision-recall area under the curve values on whole genome sequencing data. We also found that they lacked agreement with one another and with commercial MSI software on gene panel data, and that optimal threshold cut-offs vary by sequencing type. Lastly, we tested tools made specifically for RNA sequencing data and found they were outperformed by tools designed for use with DNA sequencing data. Out of all, two tools (MSIsensor2, MANTIS) performed well across nearly all datasets, but when all datasets were combined, their precision decreased. Our results caution that MSI tools can have much lower performance on datasets other than those on which they were originally evaluated, and in the case of RNA sequencing tools, can even perform poorly on the type of data for which they were created.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11317526/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141916103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Margarita Pertseva, Oceane Follonier, Daniele Scarcella, Sai T Reddy
Effective clustering of T-cell receptor (TCR) sequences could be used to predict their antigen-specificities. TCRs with highly dissimilar sequences can bind to the same antigen, thus making their clustering into a common antigen group a central challenge. Here, we develop TouCAN, a method that relies on contrastive learning and pretrained protein language models to perform TCR sequence clustering and antigen-specificity predictions. Following training, TouCAN demonstrates the ability to cluster highly dissimilar TCRs into common antigen groups. Additionally, TouCAN demonstrates TCR clustering performance and antigen-specificity predictions comparable to other leading methods in the field.
{"title":"TCR clustering by contrastive learning on antigen specificity.","authors":"Margarita Pertseva, Oceane Follonier, Daniele Scarcella, Sai T Reddy","doi":"10.1093/bib/bbae375","DOIUrl":"10.1093/bib/bbae375","url":null,"abstract":"<p><p>Effective clustering of T-cell receptor (TCR) sequences could be used to predict their antigen-specificities. TCRs with highly dissimilar sequences can bind to the same antigen, thus making their clustering into a common antigen group a central challenge. Here, we develop TouCAN, a method that relies on contrastive learning and pretrained protein language models to perform TCR sequence clustering and antigen-specificity predictions. Following training, TouCAN demonstrates the ability to cluster highly dissimilar TCRs into common antigen groups. Additionally, TouCAN demonstrates TCR clustering performance and antigen-specificity predictions comparable to other leading methods in the field.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11317525/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141916106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Katarzyna Nałęcz-Charkiewicz, Kamil Charkiewicz, Robert M Nowak
The field of quantum computing (QC) is expanding, with efforts being made to apply it to areas previously covered by classical algorithms and methods. Bioinformatics is one such domain that is developing in terms of QC. This article offers a broad mapping review of methods and algorithms of QC in bioinformatics, marking the first of its kind. It presents an overview of the domain and aids researchers in identifying further research directions in the early stages of this field of knowledge. The work presented here shows the current state-of-the-art solutions, focuses on general future directions, and highlights the limitations of current methods. The gathered data includes a comprehensive list of identified methods along with descriptions, classifications, and elaborations of their advantages and disadvantages. Results are presented not just in a descriptive table but also in an aggregated and visual format.
{"title":"Quantum computing in bioinformatics: a systematic review mapping.","authors":"Katarzyna Nałęcz-Charkiewicz, Kamil Charkiewicz, Robert M Nowak","doi":"10.1093/bib/bbae391","DOIUrl":"10.1093/bib/bbae391","url":null,"abstract":"<p><p>The field of quantum computing (QC) is expanding, with efforts being made to apply it to areas previously covered by classical algorithms and methods. Bioinformatics is one such domain that is developing in terms of QC. This article offers a broad mapping review of methods and algorithms of QC in bioinformatics, marking the first of its kind. It presents an overview of the domain and aids researchers in identifying further research directions in the early stages of this field of knowledge. The work presented here shows the current state-of-the-art solutions, focuses on general future directions, and highlights the limitations of current methods. The gathered data includes a comprehensive list of identified methods along with descriptions, classifications, and elaborations of their advantages and disadvantages. Results are presented not just in a descriptive table but also in an aggregated and visual format.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11323091/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141975078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Suronjit Kumar Roy, Mohammad Shahangir Biswas, Md Foyzur Raman, Rubait Hasan, Zahidur Rahmann, Md Moyen Uddin P K
Pseudomonas aeruginosa is a complex nosocomial infectious agent responsible for numerous illnesses, with its growing resistance variations complicating treatment development. Studies have emphasized the importance of virulence factors OprE and OprF in pathogenesis, highlighting their potential as vaccine candidates. In this study, B-cell, MHC-I, and MHC-II epitopes were identified, and molecular linkers were active to join these epitopes with an appropriate adjuvant to construct a vaccine. Computational tools were employed to forecast the tertiary framework, characteristics, and also to confirm the vaccine's composition. The potency was weighed through population coverage analysis and immune simulation. This project aims to create a multi-epitope vaccine to reduce P. aeruginosa-related illness and mortality using immunoinformatics resources. The ultimate complex has been determined to be stable, soluble, antigenic, and non-allergenic upon inspection of its physicochemical and immunological properties. Additionally, the protein exhibited acidic and hydrophilic characteristics. The Ramachandran plot, ProSA-web, ERRAT, and Verify3D were employed to ensure the final model's authenticity once the protein's three-dimensional structure had been established and refined. The vaccine model showed a significant binding score and stability when interacting with MHC receptors. Population coverage analysis indicated a global coverage rate of 83.40%, with the USA having the highest coverage rate, exceeding 90%. Moreover, the vaccine sequence underwent codon optimization before being cloned into the Escherichia coli plasmid vector pET-28a (+) at the EcoRI and EcoRV restriction sites. Our research has developed a vaccine against P. aeruginosa that has strong binding affinity and worldwide coverage, offering an acceptable way to mitigate nosocomial infections.
{"title":"A computational approach to developing a multi-epitope vaccine for combating Pseudomonas aeruginosa-induced pneumonia and sepsis.","authors":"Suronjit Kumar Roy, Mohammad Shahangir Biswas, Md Foyzur Raman, Rubait Hasan, Zahidur Rahmann, Md Moyen Uddin P K","doi":"10.1093/bib/bbae401","DOIUrl":"10.1093/bib/bbae401","url":null,"abstract":"<p><p>Pseudomonas aeruginosa is a complex nosocomial infectious agent responsible for numerous illnesses, with its growing resistance variations complicating treatment development. Studies have emphasized the importance of virulence factors OprE and OprF in pathogenesis, highlighting their potential as vaccine candidates. In this study, B-cell, MHC-I, and MHC-II epitopes were identified, and molecular linkers were active to join these epitopes with an appropriate adjuvant to construct a vaccine. Computational tools were employed to forecast the tertiary framework, characteristics, and also to confirm the vaccine's composition. The potency was weighed through population coverage analysis and immune simulation. This project aims to create a multi-epitope vaccine to reduce P. aeruginosa-related illness and mortality using immunoinformatics resources. The ultimate complex has been determined to be stable, soluble, antigenic, and non-allergenic upon inspection of its physicochemical and immunological properties. Additionally, the protein exhibited acidic and hydrophilic characteristics. The Ramachandran plot, ProSA-web, ERRAT, and Verify3D were employed to ensure the final model's authenticity once the protein's three-dimensional structure had been established and refined. The vaccine model showed a significant binding score and stability when interacting with MHC receptors. Population coverage analysis indicated a global coverage rate of 83.40%, with the USA having the highest coverage rate, exceeding 90%. Moreover, the vaccine sequence underwent codon optimization before being cloned into the Escherichia coli plasmid vector pET-28a (+) at the EcoRI and EcoRV restriction sites. Our research has developed a vaccine against P. aeruginosa that has strong binding affinity and worldwide coverage, offering an acceptable way to mitigate nosocomial infections.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11318047/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141916157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the genetic basis of disease is a fundamental aspect of medical research, as genes are the classic units of heredity and play a crucial role in biological function. Identifying associations between genes and diseases is critical for diagnosis, prevention, prognosis, and drug development. Genes that encode proteins with similar sequences are often implicated in related diseases, as proteins causing identical or similar diseases tend to show limited variation in their sequences. Predicting gene-disease association (GDA) requires time-consuming and expensive experiments on a large number of potential candidate genes. Although methods have been proposed to predict associations between genes and diseases using traditional machine learning algorithms and graph neural networks, these approaches struggle to capture the deep semantic information within the genes and diseases and are dependent on training data. To alleviate this issue, we propose a novel GDA prediction model named FusionGDA, which utilizes a pre-training phase with a fusion module to enrich the gene and disease semantic representations encoded by pre-trained language models. Multi-modal representations are generated by the fusion module, which includes rich semantic information about two heterogeneous biomedical entities: protein sequences and disease descriptions. Subsequently, the pooling aggregation strategy is adopted to compress the dimensions of the multi-modal representation. In addition, FusionGDA employs a pre-training phase leveraging a contrastive learning loss to extract potential gene and disease features by training on a large public GDA dataset. To rigorously evaluate the effectiveness of the FusionGDA model, we conduct comprehensive experiments on five datasets and compare our proposed model with five competitive baseline models on the DisGeNet-Eval dataset. Notably, our case study further demonstrates the ability of FusionGDA to discover hidden associations effectively. The complete code and datasets of our experiments are available at https://github.com/ZhaohanM/FusionGDA.
{"title":"Heterogeneous biomedical entity representation learning for gene-disease association prediction.","authors":"Zhaohan Meng, Siwei Liu, Shangsong Liang, Bhautesh Jani, Zaiqiao Meng","doi":"10.1093/bib/bbae380","DOIUrl":"10.1093/bib/bbae380","url":null,"abstract":"<p><p>Understanding the genetic basis of disease is a fundamental aspect of medical research, as genes are the classic units of heredity and play a crucial role in biological function. Identifying associations between genes and diseases is critical for diagnosis, prevention, prognosis, and drug development. Genes that encode proteins with similar sequences are often implicated in related diseases, as proteins causing identical or similar diseases tend to show limited variation in their sequences. Predicting gene-disease association (GDA) requires time-consuming and expensive experiments on a large number of potential candidate genes. Although methods have been proposed to predict associations between genes and diseases using traditional machine learning algorithms and graph neural networks, these approaches struggle to capture the deep semantic information within the genes and diseases and are dependent on training data. To alleviate this issue, we propose a novel GDA prediction model named FusionGDA, which utilizes a pre-training phase with a fusion module to enrich the gene and disease semantic representations encoded by pre-trained language models. Multi-modal representations are generated by the fusion module, which includes rich semantic information about two heterogeneous biomedical entities: protein sequences and disease descriptions. Subsequently, the pooling aggregation strategy is adopted to compress the dimensions of the multi-modal representation. In addition, FusionGDA employs a pre-training phase leveraging a contrastive learning loss to extract potential gene and disease features by training on a large public GDA dataset. To rigorously evaluate the effectiveness of the FusionGDA model, we conduct comprehensive experiments on five datasets and compare our proposed model with five competitive baseline models on the DisGeNet-Eval dataset. Notably, our case study further demonstrates the ability of FusionGDA to discover hidden associations effectively. The complete code and datasets of our experiments are available at https://github.com/ZhaohanM/FusionGDA.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":null,"pages":null},"PeriodicalIF":6.8,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11330343/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141995342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}