Recent advances in high-throughput sequencing have led to an explosion of genomic and transcriptomic data, offering a wealth of protein sequence information. However, the functions of most proteins remain unannotated, and traditional experimental methods for annotating protein function are costly and time-consuming. Current deep learning methods typically rely on Graph Convolutional Networks to propagate features between protein residues. These methods fail to capture fine atomic-level geometric structural features and cannot directly compute or propagate structural features (such as distances, directions, and angles) when transmitting features, often simplifying them to scalars. Additionally, difficulties in capturing long-range dependencies limit the model's ability to identify key nodes (residues). To address these challenges, we propose a geometric graph network (GGN-GO) for predicting protein function that enriches feature extraction by capturing multi-scale geometric structural features at the atomic and residue levels. We use a geometric vector perceptron to convert these features into vector representations and aggregate them with node features for better understanding and propagation in the network. Moreover, we introduce a graph attention pooling layer that captures key node information by adaptively aggregating local functional motifs, while contrastive learning enhances graph representation discriminability through random noise and different views. The experimental results show that GGN-GO outperforms six comparative methods in tasks with the most labels for both experimentally validated and predicted protein structures. Furthermore, GGN-GO identifies functional residues corresponding to those experimentally confirmed, showcasing its interpretability and its ability to pinpoint key protein regions. The code and data are available at: https://github.com/MiJia-ID/GGN-GO.
{"title":"GGN-GO: geometric graph networks for predicting protein function by multi-scale structure features.","authors":"Jia Mi, Han Wang, Jing Li, Jinghong Sun, Chang Li, Jing Wan, Yuan Zeng, Jingyang Gao","doi":"10.1093/bib/bbae559","DOIUrl":"10.1093/bib/bbae559","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11530295/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anjana Kushwaha, Patrice Duroux, Véronique Giudicelli, Konstantin Todorov, Sofia Kossida
The accurate prediction of peptide-major histocompatibility complex (MHC) class I binding probabilities is a critical endeavor in immunoinformatics, with broad implications for vaccine development and immunotherapies. While recent deep neural network-based approaches have shown promise in peptide-MHC (pMHC) prediction, they have two shortcomings that limit their practicality: (i) they rely on hand-crafted pseudo-sequence extraction, and (ii) they do not generalize well to different datasets. While existing methods rely on a 34-amino-acid pseudo-sequence, our findings uncover the involvement of 147 positions in direct interactions between MHC and peptide. We further show that neural architectures can learn the intricacies of pMHC binding even from full sequences. To this end, we present PerceiverpMHC, which learns accurate representations of full sequences by leveraging efficient transformer-based architectures. Additionally, we propose IMGT/RobustpMHC, which harnesses unlabeled data to improve the robustness of pMHC binding predictions through a self-supervised learning strategy. We extensively evaluate RobustpMHC on eight different datasets and show an overall improvement of over 6% in binding prediction accuracy compared to state-of-the-art approaches. We compile CrystalIMGT, a crystallography-verified dataset that presents a challenge to existing approaches because of its significantly different pMHC distributions. Finally, to mitigate this distribution gap, we further develop a transfer learning pipeline.
{"title":"IMGT/RobustpMHC: robust training for class-I MHC peptide binding prediction.","authors":"Anjana Kushwaha, Patrice Duroux, Véronique Giudicelli, Konstantin Todorov, Sofia Kossida","doi":"10.1093/bib/bbae552","DOIUrl":"10.1093/bib/bbae552","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11540059/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142589993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sheng Liu, Hye Seung Nam, Ziyu Zeng, Xuehong Deng, Elnaz Pashaei, Yong Zang, Lei Yang, Chenglong Li, Jiaoti Huang, Michael K Wendt, Xin Lu, Rong Huang, Jun Wan
Prostate cancer (PCa) is the most prevalent cancer affecting American men. Castration-resistant prostate cancer (CRPC) can emerge during hormone therapy for PCa, manifesting with elevated serum prostate-specific antigen levels, continued disease progression, and/or metastasis to new sites, resulting in a poor prognosis. A subset of CRPC patients shows a neuroendocrine (NE) phenotype, signifying reduced or no reliance on androgen receptor signaling and a particularly unfavorable prognosis. In this study, we combined computational approaches based on both gene expression profiles and protein-protein interaction networks. We identified 500 potential marker genes, which are significantly enriched in cell cycle and neuronal processes. The top 40 candidates, collectively named CDHu40, demonstrated superior performance in distinguishing NE PCa (NEPC) from non-NEPC samples based on gene expression profiles. CDHu40 outperformed most of the other published marker sets, excelling particularly at the prognostic level. Notably, some marker genes in CDHu40 that are absent from the other marker sets, such as DDC, FOLH1, BEX1, MAST1, and CACNA1A, have been reported in the literature to be associated with NEPC. Importantly, elevated CDHu40 scores derived from our predictive model showed a robust correlation with unfavorable survival outcomes in patients, indicating the potential of the CDHu40 score as an indicator of survival prognosis for patients with the NE phenotype. Motif enrichment analysis of the top candidates suggests that REST and E2F6 may serve as key regulators in NEPC progression.
{"title":"CDHu40: a novel marker gene set of neuroendocrine prostate cancer.","authors":"Sheng Liu, Hye Seung Nam, Ziyu Zeng, Xuehong Deng, Elnaz Pashaei, Yong Zang, Lei Yang, Chenglong Li, Jiaoti Huang, Michael K Wendt, Xin Lu, Rong Huang, Jun Wan","doi":"10.1093/bib/bbae471","DOIUrl":"10.1093/bib/bbae471","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11422505/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harvard Wai Hann Hui, Weijia Kong, Wilson Wen Bin Goh
Batch effects introduce significant variability into high-dimensional data, complicating accurate analysis and leading to potentially misleading conclusions if not adequately addressed. Despite technological and algorithmic advancements in biomedical research, effectively managing batch effects remains a complex challenge requiring comprehensive consideration. This paper underscores the necessity of a flexible and holistic approach to selecting batch effect correction algorithms (BECAs), advocating for proper BECA evaluations and consideration of artificial intelligence-based strategies. We also discuss key challenges in batch effect correction, including the importance of uncovering hidden batch factors and understanding the impact of design imbalance, missing values, and aggressive correction. Our aim is to provide researchers with a robust framework for managing batch effects effectively and enhancing the reliability of high-dimensional data analyses.
{"title":"Thinking points for effective batch correction on biomedical data.","authors":"Harvard Wai Hann Hui, Weijia Kong, Wilson Wen Bin Goh","doi":"10.1093/bib/bbae515","DOIUrl":"https://doi.org/10.1093/bib/bbae515","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471903/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mediation analysis has been widely utilized to identify potential pathways connecting exposures and outcomes. However, there remains a lack of analytical methods for high-dimensional mediation analysis in longitudinal data. To address this gap, we propose a novel approach that combines variable selection with indirect effect (IE) assessment, based on both the linear mixed-effect model and the generalized estimating equation. Initially, we employ sure independence screening to reduce the dimension of candidate mediators. Subsequently, we implement the Sobel test with the Bonferroni correction for IE hypothesis testing. Through extensive simulation studies, we demonstrate that our proposed procedure achieves a higher $F_1$ score (0.8056 and 0.9983 at sample sizes of 150 and 500, respectively) than the linear method (0.7779 and 0.9642 at the same sample sizes), along with more accurate parameter estimation and a significantly lower false discovery rate. Moreover, we apply our methodology to explore the mediation mechanisms involving over 730 000 DNA methylation sites with potential effects between the paternal body mass index (BMI) and offspring growing BMI in the Shanghai sleeping birth cohort data, leading to the identification of two previously undiscovered mediating CpG sites.
{"title":"Mediation analysis in longitudinal study with high-dimensional methylation mediators.","authors":"Yidan Cui, Qingmin Lin, Xin Yuan, Fan Jiang, Shiyang Ma, Zhangsheng Yu","doi":"10.1093/bib/bbae496","DOIUrl":"https://doi.org/10.1093/bib/bbae496","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11479716/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142458376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
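The testing step named in the mediation abstract above (a Sobel test per candidate mediator, followed by a Bonferroni correction) can be illustrated with a minimal sketch. It assumes the path estimates and standard errors have already been obtained from some fitted model. The function names `sobel_test` and `bonferroni_select` are hypothetical, and the first-order Sobel standard error is used here as an assumption rather than as the authors' exact implementation.

```python
import math


def sobel_test(a, se_a, b, se_b):
    """First-order Sobel test for the indirect effect a*b.

    a, se_a: exposure -> mediator effect and its standard error.
    b, se_b: mediator -> outcome effect and its standard error.
    Returns (z, two-sided p-value under a standard normal).
    """
    se_ab = math.sqrt(a ** 2 * se_b ** 2 + b ** 2 * se_a ** 2)
    if se_ab == 0.0:
        raise ValueError("standard error of the indirect effect is zero")
    z = (a * b) / se_ab
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def bonferroni_select(p_values, alpha=0.05):
    """Indices of tests whose p-value is below alpha / (number of tests)."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / m]


# Illustrative (made-up) estimates for three candidate mediators,
# given as (a, se_a, b, se_b); they are not taken from the paper.
candidates = [(0.5, 0.1, 0.4, 0.1), (0.05, 0.1, 0.02, 0.1), (0.3, 0.05, 0.6, 0.1)]
p_values = [sobel_test(*c)[1] for c in candidates]
selected = bonferroni_select(p_values)
```

With these illustrative numbers, the first and third mediators pass the corrected threshold and the second does not.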
Noncoding RNAs (ncRNAs), including long noncoding RNAs (lncRNAs) and microRNAs (miRNAs), play crucial roles in gene expression regulation and are significant in disease associations and medical research. Accurate ncRNA-disease association prediction is essential for understanding disease mechanisms and developing treatments. Existing methods often focus on single tasks such as lncRNA-disease associations (LDAs), miRNA-disease associations (MDAs), or lncRNA-miRNA interactions (LMIs), and fail to exploit heterogeneous graph characteristics. We propose ACLNDA, an asymmetric graph contrastive learning framework for analyzing heterophilic ncRNA-disease associations. It constructs inter-layer adjacency matrices from the original lncRNA, miRNA, and disease associations, and uses a Top-K intra-layer similarity edge construction approach to form a triple-layer heterogeneous graph. Unlike previous work, ACLNDA accounts for both node attribute features (ncRNA/disease) and node preference features (association): it employs an asymmetric yet simple graph contrastive learning framework to maximize one-hop neighborhood context and two-hop similarity, extracting ncRNA-disease features without relying on graph augmentations or homophily assumptions, which reduces computational cost while preserving data integrity. Our framework can be applied to a broad range of potential LDA, MDA, and LMI association predictions. Experimental results demonstrate superior performance over existing state-of-the-art baseline methods, which shows its potential for providing insights into disease diagnosis and therapeutic target identification. The source code and data of ACLNDA are publicly available at https://github.com/AI4Bread/ACLNDA.
{"title":"ACLNDA: an asymmetric graph contrastive learning framework for predicting noncoding RNA-disease associations in heterogeneous graphs.","authors":"Laiyi Fu, ZhiYuan Yao, Yangyi Zhou, Qinke Peng, Hongqiang Lyu","doi":"10.1093/bib/bbae533","DOIUrl":"https://doi.org/10.1093/bib/bbae533","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11497849/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142495334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
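The Top-K intra-layer similarity edge construction mentioned in the ACLNDA abstract above can be sketched as follows. This is only an illustration under stated assumptions: the similarity matrix is taken as given, the helper name `top_k_edges` is hypothetical, and making the graph undirected by taking the union of each node's top-K choices is an assumption, not a detail confirmed by the abstract.

```python
def top_k_edges(sim, k):
    """Build undirected intra-layer edges from a square similarity matrix.

    sim: list of lists, sim[i][j] is the similarity between nodes i and j.
    k:   number of most similar neighbours kept per node.
    Returns a set of edges (i, j) with i < j.
    """
    n = len(sim)
    edges = set()
    for i in range(n):
        # Rank all other nodes by similarity to node i, most similar first.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )
        for j in ranked[:k]:
            # Store each undirected edge once, whichever endpoint selected it.
            edges.add((min(i, j), max(i, j)))
    return edges


# Toy similarity matrix (made-up values) with two obvious pairs of nodes.
toy_sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
toy_edges = top_k_edges(toy_sim, k=1)
```

Keeping only the K strongest neighbours per node keeps the intra-layer graph sparse, which is the usual motivation for this kind of construction.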
{"title":"Correction to: Structure prediction of linear and cyclic peptides using CABS-flex.","authors":"","doi":"10.1093/bib/bbae605","DOIUrl":"10.1093/bib/bbae605","url":null,"abstract":"","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11567773/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142643855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Zhang, Yaxin Xu, Xiaoying Zheng, Juan Shen, Yuanyuan Li
Single-cell RNA sequencing (scRNA-seq) technology is one of the most cost-effective and efficacious methods for revealing cellular heterogeneity and diversity. Precise identification of cell types is essential for establishing a robust foundation for downstream analyses and is a prerequisite for understanding heterogeneous mechanisms. However, the accuracy of existing methods warrants improvement, and highly accurate methods often impose stringent equipment requirements. Moreover, most unsupervised learning-based approaches are constrained by the need to input the number of cell types a priori, which limits their widespread application. In this paper, we propose a novel algorithm framework named WLGG. First, to capture the underlying nonlinear information, we introduce a weighted distance penalty term utilizing the Gaussian kernel function, which maps data from a low-dimensional nonlinear space to a high-dimensional linear space. We subsequently impose a Lasso constraint on the regularized Gaussian graphical model to enhance its ability to capture linear data characteristics. Additionally, we utilize the Eigengap strategy to predict the number of cell types and obtain predicted labels via spectral clustering. The experimental results on 14 test datasets demonstrate the superior clustering accuracy of the WLGG algorithm over 16 alternative methods. Furthermore, downstream analysis, including marker gene identification, pseudotime inference, and functional enrichment analysis based on the similarity matrix and predicted labels from the WLGG algorithm, substantiates the reliability of WLGG and offers valuable insights into dynamic biological processes and regulatory mechanisms.
{"title":"Identifying cell types by lasso-constraint regularized Gaussian graphical model based on weighted distance penalty.","authors":"Wei Zhang, Yaxin Xu, Xiaoying Zheng, Juan Shen, Yuanyuan Li","doi":"10.1093/bib/bbae572","DOIUrl":"10.1093/bib/bbae572","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562834/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142614904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
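Two ingredients named in the WLGG abstract above, the Gaussian kernel and the eigengap heuristic for choosing the number of clusters, can be sketched in a few lines. This is a generic illustration, not the WLGG implementation: the function names are hypothetical, the kernel bandwidth `sigma` is an assumed parameter, and the graph Laplacian eigenvalues are assumed to come from whichever eigensolver is already in use.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) similarity exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def eigengap_num_clusters(laplacian_eigenvalues, max_k=None):
    """Estimate the number of clusters with the eigengap heuristic.

    The eigenvalues of a graph Laplacian are sorted in ascending order and
    k is chosen so that the gap between the k-th and (k+1)-th smallest
    eigenvalues is the largest one considered.
    """
    values = sorted(laplacian_eigenvalues)
    upper = len(values) - 1 if max_k is None else min(max_k, len(values) - 1)
    if upper < 1:
        raise ValueError("need at least two eigenvalues")
    # gaps[i] is the gap after the (i + 1)-th smallest eigenvalue.
    gaps = [values[i + 1] - values[i] for i in range(upper)]
    return gaps.index(max(gaps)) + 1


# Made-up spectrum: three near-zero eigenvalues followed by a large jump.
toy_spectrum = [0.0, 0.01, 0.02, 0.9, 1.0, 1.1]
estimated_k = eigengap_num_clusters(toy_spectrum)
```

For the toy spectrum the largest gap follows the third eigenvalue, so the heuristic suggests three clusters, which would then be passed to spectral clustering.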
In this paper, we propose DGCL, a dual-graph neural network (GNN)-based contrastive learning (CL) method integrated with mixed molecular fingerprints (MFPs) for molecular property prediction. The DGCL-MFP method contains two stages. In the first, pretraining stage, we use two different GNNs as encoders to construct the CL task, rather than generating augmented graphs as in previous work. More precisely, DGCL aggregates and enhances features of the same molecule with the Graph Isomorphism Network and the Graph Attention Network, with representations extracted from the same molecule serving as positive samples and all others as negative samples. In the downstream training stage, features extracted from the two pretrained graph networks and the carefully selected MFPs are concatenated to predict molecular properties. Our experiments show that DGCL enhances the performance of existing GNNs, matching or surpassing state-of-the-art self-supervised learning models on multiple benchmark datasets. Specifically, DGCL increases the average performance on classification tasks by 3.73% and improves the performance on the regression task Lipo by 0.126. Through ablation studies, we validate the impact of network fusion strategies and MFPs on model performance. In addition, DGCL's predictive performance is further enhanced by weighting different molecular features based on the Extended Connectivity Fingerprint. The code and datasets of DGCL will be made publicly available.
{"title":"DGCL: dual-graph neural networks contrastive learning for molecular property prediction.","authors":"Xiuyu Jiang, Liqin Tan, Qingsong Zou","doi":"10.1093/bib/bbae474","DOIUrl":"https://doi.org/10.1093/bib/bbae474","url":null,"abstract":"<p><p>In this paper, we propose DGCL, a dual-graph neural networks (GNNs)-based contrastive learning (CL) integrated with mixed molecular fingerprints (MFPs) for molecular property prediction. The DGCL-MFP method contains two stages. In the first pretraining stage, we utilize two different GNNs as encoders to construct CL, rather than using the method of generating enhanced graphs as before. Precisely, DGCL aggregates and enhances features of the same molecule by the Graph Isomorphism Network and the Graph Attention Network, with representations extracted from the same molecule serving as positive samples, and others marked as negative ones. In the downstream tasks training stage, features extracted from the two above pretrained graph networks and the meticulously selected MFPs are concated together to predict molecular properties. Our experiments show that DGCL enhances the performance of existing GNNs by achieving or surpassing the state-of-the-art self-supervised learning models on multiple benchmark datasets. Specifically, DGCL increases the average performance of classification tasks by 3.73$%$ and improves the performance of regression task Lipo by 0.126. Through ablation studies, we validate the impact of network fusion strategies and MFPs on model performance. In addition, DGCL's predictive performance is further enhanced by weighting different molecular features based on the Extended Connectivity Fingerprint. 
The code and datasets of DGCL will be made publicly available.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11428321/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142341941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
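The dual-encoder pretraining idea in the DGCL abstract can be illustrated with a small sketch. The abstract does not state which contrastive loss is used, so the InfoNCE-style objective, the temperature value, the function name `cross_encoder_info_nce`, and the toy embeddings below are all assumptions made for illustration, not the authors' implementation. The only part taken from the abstract is the pairing rule: the two encoders' embeddings of the same molecule form the positive pair, and embeddings of other molecules act as negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_encoder_info_nce(z_a, z_b, temperature=0.5):
    """Mean InfoNCE-style loss (assumed form, not confirmed by the abstract).

    z_a[i] and z_b[i] are the two encoders' embeddings of molecule i
    (positive pair); z_b[j] for j != i serve as negatives.
    """
    n = len(z_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(z_a[i], z_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n

# Hypothetical toy embeddings standing in for the outputs of two encoders.
z_gin = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
z_gat_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
# Shuffling breaks the molecule-to-molecule correspondence.
z_gat_shuffled = [z_gat_aligned[1], z_gat_aligned[2], z_gat_aligned[0]]

loss_aligned = cross_encoder_info_nce(z_gin, z_gat_aligned)
loss_shuffled = cross_encoder_info_nce(z_gin, z_gat_shuffled)
print(loss_aligned < loss_shuffled)
```

When the two encoders agree on each molecule, the loss is lower than when the correspondence is shuffled, which is the signal such a pretraining objective would optimize.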
Single-cell cross-modal joint clustering has been widely used to investigate the tumor microenvironment. Although numerous approaches have been proposed, accurate clustering remains the main challenge. First, the gene expression matrix frequently contains many missing values due to measurement limitations. Most existing clustering methods treat it as a typical multi-modal dataset without further processing. Few methods perform recovery before clustering, and those that do fail to engage sufficiently with the underlying research, leading to suboptimal outcomes. Second, existing cross-modal information fusion strategies do not ensure consistency of representations across modalities, which can integrate conflicting information and degrade performance. To address these challenges, we propose the 'Recover then Aggregate' strategy and introduce the Unified Cross-Modal Deep Clustering model. Specifically, we develop a data augmentation technique based on neighborhood similarity that iteratively imposes rank constraints on the Laplacian matrix, thereby updating the similarity matrix and recovering dropout events. Concurrently, we integrate cross-modal features and employ contrastive learning to align modality-specific representations with consistent ones, improving the integration of information from different modalities. Comprehensive experiments on five real-world multi-modal datasets demonstrate the method's superior effectiveness in single-cell clustering tasks.
{"title":"Recover then aggregate: unified cross-modal deep clustering with global structural information for single-cell data.","authors":"Ziyi Wang, Peng Luo, Mingming Xiao, Boyang Wang, Tianyu Liu, Xiangyu Sun","doi":"10.1093/bib/bbae485","DOIUrl":"10.1093/bib/bbae485","url":null,"abstract":"<p><p>Single-cell cross-modal joint clustering has been widely used to investigate the tumor microenvironment. Although numerous approaches have been proposed, accurate clustering remains the main challenge. First, the gene expression matrix frequently contains many missing values due to measurement limitations. Most existing clustering methods treat it as a typical multi-modal dataset without further processing. Few methods perform recovery before clustering, and those that do fail to engage sufficiently with the underlying research, leading to suboptimal outcomes. Second, existing cross-modal information fusion strategies do not ensure consistency of representations across modalities, which can integrate conflicting information and degrade performance. To address these challenges, we propose the 'Recover then Aggregate' strategy and introduce the Unified Cross-Modal Deep Clustering model. Specifically, we develop a data augmentation technique based on neighborhood similarity that iteratively imposes rank constraints on the Laplacian matrix, thereby updating the similarity matrix and recovering dropout events. Concurrently, we integrate cross-modal features and employ contrastive learning to align modality-specific representations with consistent ones, improving the integration of information from different modalities. 
Comprehensive experiments on five real-world multi-modal datasets have demonstrated this method's superior effectiveness in single-cell clustering tasks.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11445907/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142361070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
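The abstract above mentions imposing rank constraints on the Laplacian matrix but does not describe the procedure. The sketch below does not reproduce the authors' iterative algorithm; it only checks the standard graph-theory fact that usually motivates such constraints: for a symmetric, non-negative similarity matrix over n cells, the Laplacian has rank n - c exactly when the similarity graph has c connected components. The toy matrix and all function names are hypothetical.

```python
def laplacian(sim):
    """Unnormalized graph Laplacian L = D - W (self-similarities ignored)."""
    n = len(sim)
    return [
        [(sum(sim[i]) - sim[i][i]) if i == j else -sim[i][j] for j in range(n)]
        for i in range(n)
    ]

def matrix_rank(mat, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in mat]
    rows, cols = len(m), len(m[0])
    rank = 0
    for c in range(cols):
        pivot = max(range(rank, rows), key=lambda r: abs(m[r][c]), default=None)
        if pivot is None or abs(m[pivot][c]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, rows):
            factor = m[r][c] / m[rank][c]
            for k in range(c, cols):
                m[r][k] -= factor * m[rank][k]
        rank += 1
    return rank

def count_components(sim):
    """Connected components of the graph with an edge wherever sim > 0."""
    n = len(sim)
    seen = [False] * n
    components = 0
    for start in range(n):
        if seen[start]:
            continue
        components += 1
        stack = [start]
        seen[start] = True
        while stack:
            i = stack.pop()
            for j in range(n):
                if sim[i][j] > 0 and not seen[j]:
                    seen[j] = True
                    stack.append(j)
    return components

# Hypothetical similarity matrix for five cells forming two groups.
sim = [
    [0.0, 1.0, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.8, 0.0, 0.0],
    [0.5, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]

rank = matrix_rank(laplacian(sim))
components = count_components(sim)
print(rank, components)
```

Under this identity, requiring the Laplacian to have rank n - c is one way of asking the learned similarity graph to split into exactly c groups, which is why a rank constraint can serve as a clustering prior.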