Machine learning has emerged as a transformative tool for elucidating cellular heterogeneity in single-cell RNA sequencing. However, a significant challenge lies in the "black box" nature of deep learning models, which obscures the decision-making process and limits interpretability in cell status annotation. In this study, we introduced scGO, a Gene Ontology (GO)-inspired deep learning framework designed to provide interpretable cell status annotation for scRNA-seq data. scGO employs sparse neural networks to leverage the intrinsic biological relationships among genes, transcription factors, and GO terms, significantly augmenting interpretability and reducing computational cost. scGO outperforms state-of-the-art methods in the precise characterization of cell subtypes across diverse datasets. Our extensive experimentation across a spectrum of scRNA-seq datasets underscored the remarkable efficacy of scGO in disease diagnosis, prediction of developmental stages, and evaluation of disease severity and cellular senescence status. Furthermore, we incorporated in silico individual gene manipulations into the scGO model, introducing an additional layer for discovering therapeutic targets. Our results provide an interpretable model for accurately annotating cell status, capturing latent biological knowledge, and informing clinical practice.
{"title":"scGO: interpretable deep neural network for cell status annotation and disease diagnosis.","authors":"You Wu, Pengfei Xu, Liyuan Wang, Shuai Liu, Yingnan Hou, Hui Lu, Peng Hu, Xiaofei Li, Xiang Yu","doi":"10.1093/bib/bbaf018","DOIUrl":"10.1093/bib/bbaf018","url":null,"abstract":"<p><p>Machine learning has emerged as a transformative tool for elucidating cellular heterogeneity in single-cell RNA sequencing. However, a significant challenge lies in the \"black box\" nature of deep learning models, which obscures the decision-making process and limits interpretability in cell status annotation. In this study, we introduced scGO, a Gene Ontology (GO)-inspired deep learning framework designed to provide interpretable cell status annotation for scRNA-seq data. scGO employs sparse neural networks to leverage the intrinsic biological relationships among genes, transcription factors, and GO terms, significantly augmenting interpretability and reducing computational cost. scGO outperforms state-of-the-art methods in the precise characterization of cell subtypes across diverse datasets. Our extensive experimentation across a spectrum of scRNA-seq datasets underscored the remarkable efficacy of scGO in disease diagnosis, prediction of developmental stages, and evaluation of disease severity and cellular senescence status. Furthermore, we incorporated in silico individual gene manipulations into the scGO model, introducing an additional layer for discovering therapeutic targets. 
Our results provide an interpretable model for accurately annotating cell status, capturing latent biological knowledge, and informing clinical practice.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11737892/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143000433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
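The sparse, ontology-guided connectivity described in the scGO abstract can be illustrated with a masked linear layer. This is a minimal sketch and not the authors' implementation: the gene list, the GO-term list, the membership mask, and the weights are all hypothetical. It only shows how a binary gene-to-GO mask removes connections that have no annotation support.

```python
# Minimal sketch of a GO-masked (sparse) linear layer.
# All names and numbers below are illustrative assumptions, not scGO's code.

GENES = ["TP53", "MDM2", "CDK1"]             # hypothetical input genes
GO_TERMS = ["GO:0006915", "GO:0007049"]      # hypothetical output nodes

# MASK[j][i] == 1 only if gene i is assumed to be annotated to GO term j.
MASK = [
    [1, 1, 0],
    [0, 0, 1],
]


def masked_forward(expression, weights, mask):
    """Compute GO-term activations using only the gene-to-term edges kept by the mask."""
    activations = []
    for w_row, m_row in zip(weights, mask):
        total = sum(w * m * x for w, m, x in zip(w_row, m_row, expression))
        activations.append(max(0.0, total))  # ReLU non-linearity
    return activations


weights = [[0.5, -0.2, 0.9], [0.3, 0.4, 1.0]]  # dense weights; the mask zeroes most of them
expression = [2.0, 1.0, 3.0]                   # one cell's expression for the three genes
go_activations = masked_forward(expression, weights, MASK)
```

Because each GO-term node receives input only from its annotated genes, every activation can be traced back to a named gene set, and masked weights never need to be stored or updated.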
Pathway analysis plays a critical role in bioinformatics, enabling researchers to identify biological pathways associated with various conditions by analyzing gene expression data. However, the rise of large, multi-center datasets has highlighted limitations in traditional methods like Over-Representation Analysis (ORA) and Functional Class Scoring (FCS), which struggle with low signal-to-noise ratios (SNR) and large sample sizes. To tackle these challenges, we use a deep learning-based classification method, Gene PointNet (GPNet), and a novel $P$-value computation approach leveraging the confusion matrix to address pathway analysis tasks. We validated our method's effectiveness through a comparative study using a simulated dataset and RNA-Seq data from The Cancer Genome Atlas breast cancer dataset. Our method was benchmarked against traditional techniques (ORA, FCS), shallow machine learning models (logistic regression, support vector machine), and deep learning approaches (DeepHisCom, PASNet). The results demonstrate that GPNet outperforms these methods in low-SNR, large-sample datasets, where it remains robust and reliable, significantly reducing Type I error and improving power. This makes our method well suited for pathway analysis in large, multi-center studies. The code can be found at https://github.com/haolu123/GPNet_pathway.
{"title":"Classification-based pathway analysis using GPNet with novel P-value computation.","authors":"Hao Lu, Mostafa Rezapour, Haseebullah Baha, Muhammad Khalid Khan Niazi, Aarthi Narayanan, Metin Nafi Gurcan","doi":"10.1093/bib/bbaf039","DOIUrl":"10.1093/bib/bbaf039","url":null,"abstract":"<p><p>Pathway analysis plays a critical role in bioinformatics, enabling researchers to identify biological pathways associated with various conditions by analyzing gene expression data. However, the rise of large, multi-center datasets has highlighted limitations in traditional methods like Over-Representation Analysis (ORA) and Functional Class Scoring (FCS), which struggle with low signal-to-noise ratios (SNR) and large sample sizes. To tackle these challenges, we use a deep learning-based classification method, Gene PointNet, and a novel $P$-value computation approach leveraging the confusion matrix to address pathway analysis tasks. We validated our method effectiveness through a comparative study using a simulated dataset and RNA-Seq data from The Cancer Genome Atlas breast cancer dataset. Our method was benchmarked against traditional techniques (ORA, FCS), shallow machine learning models (logistic regression, support vector machine), and deep learning approaches (DeepHisCom, PASNet). The results demonstrate that GPNet outperforms these methods in low-SNR, large-sample datasets, where it remains robust and reliable, significantly reducing both Type I error and improving power. This makes our method well suited for pathway analysis in large, multi-center studies. 
The code can be found at https://github.com/haolu123/GPNet_pathway.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11775473/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143063819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
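The GPNet abstract does not spell out how its $P$-value is derived from the confusion matrix. As a rough illustration of the general idea only, the sketch below applies a one-sided exact binomial test to the classifier's accuracy against a majority-class baseline. The choice of test, the function name `accuracy_p_value`, and the two example matrices are assumptions made here, not the authors' procedure.

```python
from math import comb


def accuracy_p_value(confusion, chance=None):
    """One-sided binomial P-value that the observed accuracy exceeds a chance level.

    confusion[i][j] counts samples with true class i predicted as class j.
    """
    n = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    if chance is None:
        # Majority-class baseline: the largest true-class proportion.
        chance = max(sum(row) for row in confusion) / n
    # P(X >= correct) for X ~ Binomial(n, chance)
    return sum(
        comb(n, k) * chance**k * (1 - chance) ** (n - k)
        for k in range(correct, n + 1)
    )


informative = [[40, 10], [12, 38]]    # hypothetical pathway whose genes separate the classes
uninformative = [[25, 25], [25, 25]]  # hypothetical pathway with chance-level predictions
p_signal = accuracy_p_value(informative)
p_null = accuracy_p_value(uninformative)
```

Under this assumed scheme, a pathway is flagged when a classifier trained on its genes predicts held-out labels better than chance; the test statistic depends only on the confusion matrix, not on per-gene effect sizes.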
{"title":"Correction to: HHOMR: a hybrid high-order moment residual model for miRNA-disease association prediction.","authors":"","doi":"10.1093/bib/bbae684","DOIUrl":"10.1093/bib/bbae684","url":null,"abstract":"","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11649758/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142833836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xizi Luo, Amadeus Song Yi Chi, Andre Huikai Lin, Tze Jet Ong, Limsoon Wong, Chowdhury Rafeed Rahman
Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control, and various cellular processes. In this paper, we conduct an unbiased benchmarking of 11 state-of-the-art computational tools as well as traditional tools such as ScanProsite, BLAST, and HMMER for identifying DBPs. We highlight the data leakage issue in conventional datasets leading to inflated performance. We introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques, and training methods, and recommend solutions regarding these issues. We show that combining the predictions of the two best computational tools with BLAST-based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are available at https://github.com/Rafeed-bot/DNA_BP_Benchmarking.
{"title":"Benchmarking recent computational tools for DNA-binding protein identification.","authors":"Xizi Luo, Amadeus Song Yi Chi, Andre Huikai Lin, Tze Jet Ong, Limsoon Wong, Chowdhury Rafeed Rahman","doi":"10.1093/bib/bbae634","DOIUrl":"10.1093/bib/bbae634","url":null,"abstract":"<p><p>Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control, and various cellular processes. In this paper, we conduct an unbiased benchmarking of 11 state-of-the-art computational tools as well as traditional tools such as ScanProsite, BLAST, and HMMER for identifying DBPs. We highlight the data leakage issue in conventional datasets leading to inflated performance. We introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques, and training methods, and recommend solutions regarding these issues. We show that combining the predictions of the two best computational tools with BLAST-based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are available at https://github.com/Rafeed-bot/DNA_BP_Benchmarking.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11630855/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142805994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
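The benchmarking abstract reports that combining the two best computational tools with BLAST-based prediction improves DBP identification, but it does not state the combination rule. The sketch below assumes a simple majority vote over three binary calls; the function name and the example predictions are hypothetical.

```python
def consensus_predict(tool_a, tool_b, blast_hit):
    """Majority vote over three binary calls per protein (True = predicted DNA-binding)."""
    return [a + b + c >= 2 for a, b, c in zip(tool_a, tool_b, blast_hit)]


# Hypothetical calls for four query proteins.
tool_a = [True, False, True, False]     # first learned predictor
tool_b = [True, True, False, False]     # second learned predictor
blast_hit = [False, True, True, False]  # True if BLAST finds a known DBP homolog
consensus = consensus_predict(tool_a, tool_b, blast_hit)
```

A vote of this kind lets a homology hit rescue a protein that one learned model misses, while a single spurious positive is outvoted.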
Boya Ji, Xiaoqi Wang, Xiang Wang, Liwen Xu, Shaoliang Peng
Cell-cell communications (CCCs) involve signaling from multiple sender cells that collectively impact downstream functional processes in receiver cells. Currently, computational methods are lacking for quantifying the contribution of pairwise combinations of cell types to specific functional processes in receiver cells (e.g. target gene expression or cell states). This limitation has impeded understanding of the underlying mechanisms of cancer progression and the identification of potential therapeutic targets. Here, we propose a deep learning-based method, scDCA, to decipher from single-cell RNA-seq data the dominant cell communication assembly (DCA) that has a higher impact on a particular functional event in receiver cells. Specifically, scDCA employs a multi-view graph convolution network to reconstruct the CCC landscape at single-cell resolution, and then identifies the DCA by interpreting the model with the attention mechanism. Taking samples from advanced renal cell carcinoma as a case study, scDCA was successfully applied and validated in revealing the DCA affecting crucial gene expression in immune cells. scDCA was also applied and validated in revealing the DCA responsible for the variation of 14 typical functional states of malignant cells. Furthermore, scDCA was applied and validated to explore the alteration of CCCs under clinical intervention by comparing the DCA for certain cytotoxic factors between patients with and without immunotherapy. In summary, scDCA provides a valuable and practical tool for deciphering the cell type combinations with the most dominant impact on a specific functional process of receiver cells, which is of great significance for precise cancer treatment. Our data and code are freely available at a public GitHub repository: https://github.com/pengsl-lab/scDCA.git.
{"title":"scDCA: deciphering the dominant cell communication assembly of downstream functional events from single-cell RNA-seq data.","authors":"Boya Ji, Xiaoqi Wang, Xiang Wang, Liwen Xu, Shaoliang Peng","doi":"10.1093/bib/bbae663","DOIUrl":"10.1093/bib/bbae663","url":null,"abstract":"<p><p>Cell-cell communications (CCCs) involve signaling from multiple sender cells that collectively impact downstream functional processes in receiver cells. Currently, computational methods are lacking for quantifying the contribution of pairwise combinations of cell types to specific functional processes in receiver cells (e.g. target gene expression or cell states). This limitation has impeded understanding the underlying mechanisms of cancer progression and identifying potential therapeutic targets. Here, we proposed a deep learning-based method, scDCA, to decipher the dominant cell communication assembly (DCA) that have a higher impact on a particular functional event in receiver cells from single-cell RNA-seq data. Specifically, scDCA employed a multi-view graph convolution network to reconstruct the CCCs landscape at single-cell resolution, and then identified DCA by interpreting the model with the attention mechanism. Taking the samples from advanced renal cell carcinoma as a case study, the scDCA was successfully applied and validated in revealing the DCA affecting the crucial gene expression in immune cells. The scDCA was also applied and validated in revealing the DCA responsible for the variation of 14 typical functional states of malignant cells. Furthermore, the scDCA was applied and validated to explore the alteration of CCCs under clinical intervention by comparing the DCA for certain cytotoxic factors between patients with and without immunotherapy. 
In summary, scDCA provides a valuable and practical tool for deciphering the cell type combinations with the most dominant impact on a specific functional process of receiver cells, which is of great significance for precise cancer treatment. Our data and code are free available at a public GitHub repository: https://github.com/pengsl-lab/scDCA.git.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11653571/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142852814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The selection of biomarker panels in omics data, challenged by numerous molecular features and limited samples, often requires the use of machine learning methods paired with wrapper feature selection techniques, like genetic algorithms. These techniques test various feature sets (candidate biomarker solutions) to fine-tune a machine learning model's performance for supervised tasks, such as classifying cancer subtypes. This optimization process uses validation sets to evaluate and identify the most effective feature combinations. Such evaluations carry a performance estimation error, measurable as the discrepancy between validation and test set performance, and when the selection involves many models, the performance of the best ones is almost certainly overestimated. This issue is also relevant in a multi-objective feature selection process where various characteristics of the biomarker panels are optimized, such as predictive performance and feature set size. Methods have been proposed to reduce the overestimation after a model has already been selected in single-objective problems, but no algorithm existed that could reduce the overestimation during the optimization, improve model selection, or be applied in the more general multi-objective domain. We propose Dual-stage Optimizer for Systematic overestimation Adjustment in Multi-Objective problems (DOSA-MO), a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation. DOSA-MO adjusts the expectation of the performance during the optimization, improving the composition of the solution set. We verify that DOSA-MO improves the performance of a state-of-the-art genetic algorithm on left-out or external sample sets, when predicting cancer subtypes and/or patient overall survival, using three transcriptomics datasets for kidney and breast cancer.
{"title":"Dual-stage optimizer for systematic overestimation adjustment applied to multi-objective genetic algorithms for biomarker selection.","authors":"Luca Cattelani, Vittorio Fortino","doi":"10.1093/bib/bbae674","DOIUrl":"10.1093/bib/bbae674","url":null,"abstract":"<p><p>The selection of biomarker panels in omics data, challenged by numerous molecular features and limited samples, often requires the use of machine learning methods paired with wrapper feature selection techniques, like genetic algorithms. They test various feature sets-potential biomarker solutions-to fine-tune a machine learning model's performance for supervised tasks, such as classifying cancer subtypes. This optimization process is undertaken using validation sets to evaluate and identify the most effective feature combinations. Evaluations have performance estimation error, measurable as discrepancy between validation and test set performance, and when the selection involves many models the best ones are almost certainly overestimated. This issue is also relevant in a multi-objective feature selection process where various characteristics of the biomarker panels are optimized, such as predictive performances and feature set size. Methods have been proposed to reduce the overestimation after a model has already been selected in single-objective problems, but no algorithm existed capable of reducing the overestimation during the optimization, improving model selection, or applied in the more general multi-objective domain. We propose Dual-stage Optimizer for Systematic overestimation Adjustment in Multi-Objective problems (DOSA-MO), a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation. DOSA-MO adjusts the expectation of the performance during the optimization, improving the composition of the solution set. 
We verify that DOSA-MO improves the performance of a state-of-the-art genetic algorithm on left-out or external sample sets, when predicting cancer subtypes and/or patient overall survival, using three transcriptomics datasets for kidney and breast cancer.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684899/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
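The DOSA-MO abstract says the algorithm learns how the original estimate, its variance, and the feature set size predict the overestimation. A simplified stand-in for that idea is sketched below with an ordinary least-squares fit in pure Python. The linear model, the helper names, and the synthetic numbers are assumptions for illustration and should not be read as the published algorithm.

```python
def solve(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_overestimation(features, overestimation):
    """Least-squares fit of overestimation on [1, estimate, variance, size]."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, overestimation)) for i in range(k)]
    return solve(xtx, xty)


def adjust(estimate, variance, size, coef):
    """Subtract the predicted overestimation from a validation estimate."""
    predicted = coef[0] + coef[1] * estimate + coef[2] * variance + coef[3] * size
    return estimate - predicted


def synthetic_overestimation(estimate, variance, size):
    """Made-up ground truth used only to generate toy training data."""
    return 0.01 + 0.05 * estimate + 1.0 * variance + 0.002 * size


# Toy solutions: (validation estimate, variance of the estimate, feature set size).
solutions = [
    (0.90, 0.030, 5), (0.85, 0.020, 30), (0.80, 0.005, 12),
    (0.95, 0.010, 25), (0.70, 0.015, 18), (0.88, 0.025, 40),
]
observed = [synthetic_overestimation(*s) for s in solutions]
coef = fit_overestimation(solutions, observed)
adjusted = adjust(0.92, 0.020, 15, coef)
```

In a wrapper optimizer, an adjusted value of this kind would replace the raw validation score when candidate feature sets are compared, so that large, high-variance solutions are no longer favored by optimistic estimates.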
The diffusion generative model has achieved remarkable performance across various research fields. In this study, we propose a transferable graph attention diffusion model, GADIFF, for a molecular conformation generation task. Adopting multiple equivariant networks in the Markov chain, GADIFF adds GIN (Graph Isomorphism Network) to acquire local information of subgraphs with different edge types (atomic bonds, bond angle interactions, torsion angle interactions, long-range interactions) and applies MSA (Multi-head Self-attention) as a noise attention mechanism to capture global molecular information, which improves the representativeness of the features. In addition, we utilize MSA to calculate dynamic noise weights to boost molecular conformation noise prediction. With these improvements, GADIFF achieves competitive performance compared with recently reported state-of-the-art models in terms of generation diversity (COV-R, COV-P), accuracy (MAT-R, MAT-P), and property prediction for the GEOM-QM9 and GEOM-Drugs datasets. In particular, on the GEOM-Drugs dataset, the average COV-R is improved by 3.75% compared with the best baseline model at a threshold of 1.25 Å. Furthermore, a transfer model named GADIFF-NCI based on GADIFF is developed to generate conformations for noncovalent interaction (NCI) molecular systems. It takes GADIFF trained on the GEOM-QM9 dataset as a pre-trained model, and incorporates a graph encoder for learning molecular vectors at the NCI molecular level. The resulting NCI molecular conformations are reasonable, as assessed by the evaluation of conformation and property predictions. This suggests that the proposed transferable model may hold noteworthy value for the study of multi-molecular conformations. The code and data of GADIFF can be freely downloaded from https://github.com/WangDHg/GADIFF.
{"title":"GADIFF: a transferable graph attention diffusion model for generating molecular conformations.","authors":"Donghan Wang, Xu Dong, Xueyou Zhang, LiHong Hu","doi":"10.1093/bib/bbae676","DOIUrl":"10.1093/bib/bbae676","url":null,"abstract":"<p><p>The diffusion generative model has achieved remarkable performance across various research fields. In this study, we propose a transferable graph attention diffusion model, GADIFF, for a molecular conformation generation task. With adopting multiple equivariant networks in the Markov chain, GADIFF adds GIN (Graph Isomorphism Network) to acquire local information of subgraphs with different edge types (atomic bonds, bond angle interactions, torsion angle interactions, long-range interactions) and applies MSA (Multi-head Self-attention) as noise attention mechanism to capture global molecular information, which improves the representative of features. In addition, we utilize MSA to calculate dynamic noise weights to boost molecular conformation noise prediction. Upon the improvements, GADIFF achieves competitive performance compared with recently reported state-of-the-art models in terms of generation diversity(COV-R, COV-P), accuracy (MAT-R, MAT-P), and property prediction for GEOM-QM9 and GEOM-Drugs datasets. In particular, on the GEOM-Drugs dataset, the average COV-R is improved by 3.75% compared with the best baseline model at a threshold (1.25 Å). Furthermore, a transfer model named GADIFF-NCI based on GADIFF is developed to generate conformations for noncovalent interaction (NCI) molecular systems. It takes GADIFF with GEOM-QM9 dataset as a pre-trained model, and incorporates a graph encoder for learning molecular vectors at the NCI molecular level. The resulting NCI molecular conformations are reasonable, as assessed by the evaluation of conformation and property predictions. This suggests that the proposed transferable model may hold noteworthy value for the study of multi-molecular conformations. 
The code and data of GADIFF is freely downloaded from https://github.com/WangDHg/GADIFF.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684900/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
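The GADIFF abstract reports COV-R and MAT-R without defining them. The sketch below uses the definitions commonly adopted in conformation-generation benchmarks: coverage recall is the fraction of reference conformers matched by at least one generated conformer within an RMSD threshold, and matching recall is the mean of the best RMSD per reference. The RMSD matrix is invented for illustration, and the exact conventions used in the paper are an assumption.

```python
def cov_mat_recall(rmsd, threshold):
    """Return (COV-R, MAT-R) from a reference-by-generated RMSD matrix in angstroms.

    rmsd[i][j] is the RMSD between reference conformer i and generated conformer j.
    """
    best = [min(row) for row in rmsd]  # closest generated conformer per reference
    cov_r = sum(b <= threshold for b in best) / len(best)
    mat_r = sum(best) / len(best)
    return cov_r, mat_r


# Hypothetical RMSD values for 4 reference and 3 generated conformers.
rmsd = [
    [0.8, 1.9, 2.4],
    [1.6, 1.3, 2.0],
    [2.2, 1.1, 0.9],
    [1.5, 1.7, 1.0],
]
cov_r, mat_r = cov_mat_recall(rmsd, threshold=1.25)
```

The precision variants (COV-P, MAT-P) follow from the same function applied to the transposed matrix, so that each generated conformer is matched against the reference set.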
Chelsea Chen Yuge, Ee Soon Hang, Madasamy Ravi Nadar Mamtha, Shashikant Vishwakarma, Sijia Wang, Cheng Wang, Nguyen Quoc Khanh Le
Accurate prediction of RNA modifications holds profound implications for elucidating RNA function and mechanism, with potential applications in drug development. Here, RNA-ModX presents a highly precise predictive model designed to forecast post-transcriptional RNA modifications, complemented by a user-friendly web application tailored for seamless use by future researchers. To achieve exceptional accuracy, RNA-ModX systematically explored a range of machine learning models, including Long Short-Term Memory (LSTM), Gated Recurrent Unit, and Transformer-based architectures. The model underwent rigorous testing using a dataset comprising RNA sequences containing the four fundamental nucleotides (A, C, G, U) and spanning 12 prevalent modification classes (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um), with sequences of length 1001 nucleotides. Notably, the LSTM model, augmented with 3-mer encoding, demonstrated the highest accuracy. Furthermore, Local Interpretable Model-Agnostic Explanations were employed to facilitate result interpretation, enhancing the transparency and interpretability of the model's predictions. In conjunction with the model development, a user-friendly web application was built, featuring an intuitive interface through which researchers can upload RNA sequences. Upon submission, the model executes in the backend and generates predictions, which are presented to the user in a coherent manner. This integration of cutting-edge predictive modeling with a user-centric interface marks a significant step forward in facilitating the exploration and use of RNA modification prediction technologies by the broader research community.
{"title":"RNA-ModX: a multilabel prediction and interpretation framework for RNA modifications.","authors":"Chelsea Chen Yuge, Ee Soon Hang, Madasamy Ravi Nadar Mamtha, Shashikant Vishwakarma, Sijia Wang, Cheng Wang, Nguyen Quoc Khanh Le","doi":"10.1093/bib/bbae688","DOIUrl":"10.1093/bib/bbae688","url":null,"abstract":"<p><p>Accurate prediction of RNA modifications holds profound implications for elucidating RNA function and mechanism, with potential applications in drug development. Here, the RNA-ModX presents a highly precise predictive model designed to forecast post-transcriptional RNA modifications, complemented by a user-friendly web application tailored for seamless utilization by future researchers. To achieve exceptional accuracy, the RNA-ModX systematically explored a range of machine learning models, including Long Short-Term Memory (LSTM), Gated Recurrent Unit, and Transformer-based architectures. The model underwent rigorous testing using a dataset comprising RNA sequences containing the four fundamental nucleotides (A, C, G, U) and spanning 12 prevalent modification classes (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um), with sequences of length 1001 nucleotides. Notably, the LSTM model, augmented with 3-mer encoding, demonstrated the highest level of model accuracy. Furthermore, Local Interpretable Model-Agnostic Explanations were employed to facilitate result interpretation, enhancing the transparency and interpretability of the model's predictions. In conjunction with the model development, a user-friendly web application was meticulously crafted, featuring an intuitive interface for researchers to effortlessly upload RNA sequences. Upon submission, the model executes in the backend, generating predictions which are seamlessly presented to the user in a coherent manner. 
This integration of cutting-edge predictive modeling with a user-centric interface signifies a significant step forward in facilitating the exploration and utilization of RNA modification prediction technologies by the broader research community.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684893/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
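The 3-mer encoding mentioned in the RNA-ModX abstract can be illustrated as follows. This sketch assumes overlapping 3-mers with stride 1 and an alphabetical vocabulary of 64 tokens; the abstract does not specify either choice, and the function name is hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# Vocabulary of all 4**3 = 64 possible 3-mers, indexed in lexicographic order.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(NUCLEOTIDES, repeat=3))}


def encode_3mers(sequence):
    """Map an RNA sequence to the integer indices of its overlapping 3-mers (stride 1)."""
    sequence = sequence.upper()
    return [KMER_INDEX[sequence[i:i + 3]] for i in range(len(sequence) - 2)]


tokens = encode_3mers("ACGUA")  # 3-mers: ACG, CGU, GUA
```

Under these assumptions a 1001-nucleotide window becomes a sequence of 999 integer tokens, which could then be passed through an embedding layer before a recurrent model such as an LSTM.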
Dongping Liu, Dinghao Liu, Kewei Sheng, Zhenyong Cheng, Zixuan Liu, Yanling Qiao, Shangxuan Cai, Yulong Li, Jubo Wang, Hongyang Chen, Chi Hu, Peng Xu, Bin Di, Jun Liao
The supervision of novel psychoactive substances (NPSs) is a global problem, and the regulation of NPSs has relied heavily on identifying structural matches in established NPS databases. However, violators can circumvent legal oversight by altering the side-chain structure of recognized NPSs, and existing methods cannot overcome the inaccuracy and lag of supervision. In this study, we propose a scaffold- and transformer-based NPS generation and screening (STNGS) framework to systematically identify and evaluate potential NPSs. Our framework comprises a scaffold-based generative model and a four-part rank function. Chemical space analysis and property distribution analysis show that our generative model performs excellently in the design and optimization of general molecules and NPS-like molecules. The rank function includes a synthetic accessibility score and a frequency score, as well as a confidence score and an affinity score evaluated by a neural network, which enables the precise positioning of potential NPSs. Applying the STNGS framework together with molecular docking and a G protein-coupled receptor (GPCR) activation-based sensor (GRAB), we successfully identified three novel synthetic cannabinoids with activity. STNGS constrains the chemical space to generate a diverse and novel database of NPS-like molecules, which assists in the ex-ante regulation of NPSs.
{"title":"STNGS: a deep scaffold learning-driven generation and screening framework for discovering potential novel psychoactive substances.","authors":"Dongping Liu, Dinghao Liu, Kewei Sheng, Zhenyong Cheng, Zixuan Liu, Yanling Qiao, Shangxuan Cai, Yulong Li, Jubo Wang, Hongyang Chen, Chi Hu, Peng Xu, Bin Di, Jun Liao","doi":"10.1093/bib/bbae690","DOIUrl":"10.1093/bib/bbae690","url":null,"abstract":"<p><p>The supervision of novel psychoactive substances (NPSs) is a global problem, and the regulation of NPSs was heavily relied on identifying structural matches in established NPSs databases. However, violators could circumvent legal oversight by altering the side chain structure of recognized NPSs and the existing methods cannot overcome the inaccuracy and lag of supervision. In this study, we propose a scaffold and transformer-based NPS generation and Screening (STNGS) framework to systematically identify and evaluate potential NPSs. A scaffold-based generative model and a rank function with four parts are contained by our framework. Our generative model shows excellent performance in the design and optimization of general molecules and NPS-like molecules by chemical space analysis and property distribution analysis. The rank function includes synthetic accessibility score and frequency score, as well as confidence score and affinity score evaluated by a neural network, which enables the precise positioning of potential NPSs. Applied STNGS framework with molecular docking and a G protein-coupled receptor (GPCR) activation-based sensor (GRAB), we successfully identify three novel synthetic cannabinoids with activity. 
STNGS constrains the chemical space to generate NPS-like molecules database with diversity and novelty, which assists in the ex-ante regulation of NPSs.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684896/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142906353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few-shot learning is a crucial approach for classifying macromolecules in cryo-electron tomography (Cryo-ET) subvolumes, enabling rapid adaptation to novel tasks with a small support set of labeled data. However, existing few-shot classification methods for macromolecules in Cryo-ET consider only marginal distributions and overlook joint distributions, and therefore fail to fully capture feature dependencies. To address this issue, we propose a method for macromolecular few-shot classification using deep Brownian Distance Covariance (BDC). Our method models the joint distribution within a transfer learning framework, enhancing its modeling capability. We insert the BDC module after the feature extractor and train only the feature extractor during the training phase. We then enhance the model's generalization capability with self-distillation. In the adaptation phase, we fine-tune the classifier with minimal labeled data. We evaluate our method on publicly available SHREC datasets and a small-scale synthetic dataset. Results show that modeling the joint distribution improves classification performance.
{"title":"Few-shot classification of Cryo-ET subvolumes with deep Brownian distance covariance.","authors":"Xueshi Yu, Renmin Han, Haitao Jiao, Wenjia Meng","doi":"10.1093/bib/bbae643","DOIUrl":"10.1093/bib/bbae643","url":null,"abstract":"<p><p>Few-shot learning is a crucial approach for macromolecule classification of the cryo-electron tomography (Cryo-ET) subvolumes, enabling rapid adaptation to novel tasks with a small support set of labeled data. However, existing few-shot classification methods for macromolecules in Cryo-ET consider only marginal distributions and overlook joint distributions, failing to capture feature dependencies fully. To address this issue, we propose a method for macromolecular few-shot classification using deep Brownian Distance Covariance (BDC). Our method models the joint distribution within a transfer learning framework, enhancing the modeling capabilities. We insert the BDC module after the feature extractor and only train the feature extractor during the training phase. Then, we enhance the model's generalization capability with self-distillation techniques. In the adaptation phase, we fine-tune the classifier with minimal labeled data. We conduct experiments on publicly available SHREC datasets and a small-scale synthetic dataset to evaluate our method. 
Results show that our method improves the classification capabilities by introducing the joint distribution.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 1","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11637689/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142817166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}