Drug-related cardiotoxicity, most notably arrhythmia, represents a major challenge in drug development. Inhibition of the hERG potassium channel by certain compounds has the potential to delay cardiac repolarization, manifested as QT interval prolongation, thereby elevating the risk of severe cardiac arrhythmias like torsades de pointes (TdP). Accurate assessment of compounds' impact on hERG channels is crucial. Traditional methods are costly and inefficient for large-scale screening. Therefore, developing efficient and accurate computational methods for hERG inhibition prediction is critical. In this study, we present a deep learning framework, named hERG-MFFGNN, aimed at accurately predicting hERG channel blockers while providing model interpretability. To improve both accuracy and generalizability, we implement a multi-feature fusion strategy that systematically integrates molecular structural information. Initially, multiple molecular fingerprint features and molecular descriptors are fused to construct an initial feature representation. Then, graph neural networks are used to extract molecular topological features. These two sets of features are weighted and fused using an attention mechanism to form the final compound representation, enabling a more comprehensive expression of molecular features. The performance of hERG-MFFGNN is assessed using fivefold cross-validation on the benchmark dataset and external validation datasets. The results demonstrate that hERG-MFFGNN achieves an AUROC of 0.909 and an accuracy (ACC) of 0.854, highlighting its robust predictive capabilities for hERG activity across diverse datasets. We believe that hERG-MFFGNN may serve as an effective tool for the early prediction of hERG channel blockers during drug discovery and development. The complete source code is publicly accessible at https://github.com/zhaoqi106/hERG-MFFGNN.
Title: hERG-MFFGNN: An Explainable Deep Learning Model for Predicting Cardiotoxicity Using Multi-feature Fusion and Graph Neural Networks. Authors: Bingyu Jin, Jiarun Wang, Xin Yang, Lijie Na, Qi Zhao. Journal: Interdisciplinary Sciences: Computational Life Sciences. Pub Date: 2025-09-22 DOI: 10.1007/s12539-025-00768-6
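The attention-weighted fusion of the two feature sets described in the hERG-MFFGNN abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `query` vector, the function names, and the toy feature values are assumptions made for illustration, and a trained model would learn the scoring parameters rather than fix them by hand.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(views, query):
    """Fuse equal-length feature vectors with attention weights.

    views: list of feature vectors, one per view.
    query: hypothetical scoring vector (learned in a real model,
           fixed by hand in this sketch).
    """
    # One scalar score per view: dot product with the query vector.
    scores = [sum(q * x for q, x in zip(query, v)) for v in views]
    weights = softmax(scores)
    dim = len(views[0])
    # Weighted sum of the views, coordinate by coordinate.
    fused = [sum(w * v[i] for w, v in zip(weights, views)) for i in range(dim)]
    return fused, weights

# Toy stand-ins for the two views named in the abstract (values are made up).
fp_desc_view = [0.2, 0.8, 0.1, 0.5]  # fingerprint + descriptor representation
gnn_view = [0.6, 0.1, 0.9, 0.3]      # graph (topological) representation

fused, weights = attention_fuse([fp_desc_view, gnn_view],
                                query=[1.0, 0.0, 1.0, 0.0])
print(weights, fused)
```

Because the weights come from a softmax, they are positive and sum to one, so the fused vector is always a convex combination of the two views.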
Pub Date: 2025-09-22 DOI: 10.1007/s12539-025-00755-x
Linconghua Wang, Ju Xiang, Zihao Guo, Kaixin Zeng, Min Li
Melanoma is a highly malignant skin cancer, and identifying its pathogenic genes is crucial for understanding its pathogenesis and developing treatment strategies. Network-based approaches effectively capture the synergistic interactions among genes and their products within biological systems, yet extracting functional insights from these complex networks remains challenging. Here, we propose a novel approach that combines multiple kernel learning and network impulsive dynamics (MKLNID) to predict melanoma-related pathogenic genes. Specifically, we construct similarity kernels of diseases and genes from the original disease-gene heterogeneous network and melanoma expression profiles. These kernels are integrated via multiple kernel learning to generate enhanced similarity networks for diseases and genes, respectively. Impulsive signals are then applied to specific nodes in the enhanced heterogeneous network, and the resulting dynamical response signatures are used to infer potential pathogenic genes. Comprehensive experiments and case analyses demonstrate the effectiveness of MKLNID in identifying melanoma-related genes. By deeply integrating heterogeneous disease networks with omics data and introducing network dynamics to simulate gene responses, MKLNID offers a new strategy for identifying melanoma-related genes, with potential implications for precision diagnosis and therapy.
Title: MKLNID: Identifying Melanoma-related Pathogenic Genes Through Multiple Kernel Learning and Network Impulsive Dynamics. Journal: Interdisciplinary Sciences: Computational Life Sciences.
Pub Date: 2025-09-17 DOI: 10.1007/s12539-025-00749-9
Zeming Li, Yu Xu, Debajyoti Chowdhury, Hip Fung Yip, Chonghao Wang, Lu Zhang
Traditional disease risk prediction models predominantly rely on statistical algorithms and often focus on genetic factors or a limited set of lifestyle factors to estimate the risk of disease onset. Recently, more comprehensive approaches have emerged that integrate genetic factors with additional lifestyle factors (e.g., alcohol intake) and physical features (e.g., body mass index, age) to increase predictive accuracy. Since the onset of complex diseases is often accompanied by the occurrence of comorbidities, incorporating medical history records is a critical yet underexplored avenue for improving risk prediction. In this study, we propose a novel framework, MIDRP (Multi-source Integration for Disease Risk Prediction), which incorporates genetic variants, lifestyle factors, physical attributes, and medical history records to achieve more robust and accurate predictions. At the heart of our approach lies a causal Transformer architecture, specifically designed to extract and interpret nuanced patterns from medical history records. In the experiments, we compared MIDRP with several baselines, including LDPred2, random forest, multilayer perceptron, logistic regression, AdaBoost, DiseaseCapsule, EIR, and Med-Bert, on three complex diseases (coronary artery disease, type 2 diabetes, and breast cancer) using data from the UK Biobank. Our method achieved state-of-the-art performance, with AUROC scores of 0.783, 0.841, and 0.784, respectively, demonstrating its potential in the field of complex disease risk prediction.
Title: Causal Transformer for Learning Embeddings from Structured Medical History Records and Multi-Source Data Integration for Complex Disease Risk Prediction. Journal: Interdisciplinary Sciences: Computational Life Sciences.
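The MIDRP abstract does not spell out what "causal" means for its Transformer. Assuming it refers to causal (autoregressive) masking, where each medical-history event may attend only to itself and earlier events, a single attention head can be sketched as below. The function name and the toy event vectors are illustrative assumptions, not the published architecture.

```python
import math

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position t attends only to positions 0..t, so the output for an
    event never depends on events that come after it.
    """
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores are computed only for positions <= t; later ones are masked.
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out = [sum(w * values[s][i] for s, w in enumerate(weights))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy sequence of three embedded history events (values are made up).
events = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(events, events, events)
print(out)
```

The mask is what makes the per-event embeddings usable for risk prediction at any point in a patient's timeline: replacing the last event leaves the outputs for the earlier events unchanged.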
Pub Date: 2025-09-17 DOI: 10.1007/s12539-025-00769-5
Yutao Wu, Yi Zhou, Wenjing Shi, Siyu Zhou, Min Jiang, Ke Shen, Xingyun Liu, Xiaoyu Li, Jiao Wang, Chi Zhang, Bairong Shen, Weidong Tian
The oral microbiome plays a crucial role in the development and progression of diseases. The complex interactions between the oral microbiome and diseases are challenging for clinicians in clinical decision-making and scientific research. To address this gap, we developed an oral microbiome knowledge database (ORMCKB), to provide evidence for personalized medicine and scientific research in the oral microbiome-disease axis. The current version of ORMCKB contains 11,554 data entries, encompassing 6941 oral microbe taxonomies, 234 diseases, 220 interventions, and 175 bacteriostats extracted from 818 publications. Compared to ChatGPT-4o, ORMCKB demonstrates superior performance in matching questions with responses (10 vs. 9.6), presenting research article details (10 vs. 5.80), and recommended scientific article authenticity ratio (100% vs. 33.63%). The system usability scale (SUS) and the net promoter score (NPS) were 86.07 and 85.71, respectively. As the first knowledge database focused on the oral microbiome-disease axis, ORMCKB provides a comprehensive, accurate, and user-friendly online resource for identifying key microbial players and their associations with oral diseases in personalized medicine. ORMCKB is set to sustain its prominence in cutting-edge research on the oral microbiome-disease axis, paving the way for future artificial intelligence applications in both scientific research and clinical practice. ORMCKB is publicly available at: http://sysbio.org.cn/ormckb.
Title: ORMCKB: A Knowledge Database for Personalized Medicine in Deciphering the Oral Microbiome-Disease Axis. Journal: Interdisciplinary Sciences: Computational Life Sciences.
Recent advancements in spatial transcriptomics (ST) have revolutionized our ability to simultaneously profile gene expression, spatial location, and tissue morphology, enabling the precise mapping of cell types and signaling pathways within their native tissue context. However, the high cost of sequencing remains a significant barrier to its widespread adoption. Although existing methods often leverage histopathological images to predict transcriptomic profiles and identify cellular heterogeneity, few approaches directly estimate cell-type abundance from these images. To address this gap, we propose HPCell, a deep learning framework for inferring cell-type abundance directly from H&E-stained histology images. HPCell comprises three key modules: a pathology foundation module, a hypergraph module, and a Transformer module. It begins by dividing whole-slide images (WSIs) into patches, which are processed by the pathology foundation module using a teacher-student framework to extract robust morphological features. These features are used to construct a hypergraph, where each patch (node) connects to its spatial neighbors to model complex many-to-many relationships. The Transformer module applies attention to the hypergraph features to capture long-range dependencies. Finally, features from all modules are integrated to estimate cell-type abundance. Extensive experiments show that HPCell consistently outperforms state-of-the-art methods across multiple spatial transcriptomics datasets, offering a scalable and cost-effective approach for investigating tissue structure and cellular interactions.
Title: Accurately Predicting Cell Type Abundance from Spatial Histology Image Through HPCell. Authors: Yongkang Zhao, Youyang Li, Weijiang Yu, Hongyu Zhang, Zheng Wang, Yuedong Yang, Yuansong Zeng. Journal: Interdisciplinary Sciences: Computational Life Sciences. Pub Date: 2025-09-03 DOI: 10.1007/s12539-025-00757-9
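The hypergraph step in the HPCell abstract, where each patch connects to its spatial neighbors, can be sketched as one hyperedge per patch. The neighborhood rule used here (Chebyshev distance on a regular grid with a `radius` parameter) and the function names are assumptions for illustration; the abstract does not state how neighbors are chosen.

```python
def build_hypergraph(coords, radius=1):
    """Return one hyperedge per patch: the patch plus its spatial neighbors.

    coords: list of (row, col) grid positions of patches.
    radius: Chebyshev distance defining the neighborhood (an assumption).
    """
    hyperedges = []
    for (r0, c0) in coords:
        members = [j for j, (r, c) in enumerate(coords)
                   if max(abs(r - r0), abs(c - c0)) <= radius]
        hyperedges.append(members)
    return hyperedges

def incidence_matrix(n_nodes, hyperedges):
    """H[v][e] = 1 if node v belongs to hyperedge e, else 0."""
    H = [[0] * len(hyperedges) for _ in range(n_nodes)]
    for e, members in enumerate(hyperedges):
        for v in members:
            H[v][e] = 1
    return H

# A 3x3 grid of patches, indexed row by row.
coords = [(r, c) for r in range(3) for c in range(3)]
edges = build_hypergraph(coords)
H = incidence_matrix(len(coords), edges)
print(edges[4], edges[0])
```

Each hyperedge groups several patches at once, and each patch belongs to several hyperedges, which is the many-to-many relationship that an ordinary pairwise graph cannot express directly.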
Protein-protein interactions (PPIs) are essential therapeutic targets, yet their large and relatively flat interfaces hinder the development of small-molecule inhibitors. Traditional computational approaches rely heavily on existing chemical libraries or expert heuristics, restricting exploration of novel chemical space. To address these challenges, we present Hot2Mol, a generative deep learning framework for the de novo design of target-specific and drug-like PPI inhibitors. Hot2Mol captures crucial pharmacophoric features from hot-spot residues, allowing precise targeting of PPI interfaces while eliminating the need for known bioactive ligands. The framework integrates three main components: a conditional transformer for pharmacophore-guided, property-constrained molecular generation; an E(n)-equivariant graph neural network to ensure accurate spatial alignment with PPI hot-spot pharmacophores; and a variational autoencoder to sample novel and diverse molecular structures. Comprehensive assessments demonstrate that Hot2Mol outperforms state-of-the-art models in binding affinity, drug-likeness, synthetic accessibility, novelty, and uniqueness. Molecular dynamics simulations further confirm the strong binding stability of generated compounds. Case studies underscore Hot2Mol's ability to design high-affinity and selective PPI inhibitors, highlighting its potential to accelerate rational PPI-targeted drug discovery.
Title: Hot-Spot-Guided Generative Deep Learning for Drug-Like PPI Inhibitor Design. Authors: Heqi Sun, Jiayi Li, Yufang Zhang, Shenggeng Lin, Junwei Chen, Hong Tan, Ruixuan Wang, Xueying Mao, Jianwei Zhao, Rongpei Li, Dong-Qing Wei. Journal: Interdisciplinary Sciences: Computational Life Sciences. Pub Date: 2025-09-02 DOI: 10.1007/s12539-025-00756-w
Pub Date: 2025-09-02 DOI: 10.1007/s12539-025-00762-y
Yue Yu, Wei Zhang, Xiaoying Zheng, Juan Shen, Yuanyuan Li
Single-cell RNA sequencing (scRNA-seq) offers significant opportunities to reveal cellular heterogeneity and diversity. Accurate cell type identification is critical for downstream analyses and understanding the mechanisms of heterogeneity. However, challenges arise from the high dimensionality, sparsity, and noise of scRNA-seq data. While various low-rank representation (LRR)-based clustering methods have been developed, many existing approaches may inaccurately capture relationships or conflate true patterns with noise. To address these limitations, we introduce a novel clustering algorithm that integrates low-rank matrix decomposition with local graph regularization (LRMGC). This approach applies a tri-decomposition strategy to the representation matrix to derive an aligned core matrix, and characterizes the "distance" between cells in a lower-dimensional space through a local manifold regularization term. Rather than relying on the nuclear norm of the representation matrix, the Schatten p-norm is applied to the core matrix to learn a similarity matrix that is robust to noise and outliers, while preserving the underlying subspace structure of the high-dimensional noisy data for accurate and robust clustering. Additionally, the final similarity matrix is obtained by applying the angular alignment strategy to the similarity matrix. Comprehensive experiments and comparisons with advanced methods on scRNA-seq datasets demonstrate LRMGC's superior performance and reliability in uncovering cell type composition. Furthermore, a variety of downstream analyses, such as marker gene identification, functional enrichment analysis, rare cell recognition, and cell-cell communication, also demonstrate the effectiveness of LRMGC.
Title: Clustering Single-Cell RNA-Seq Data with Low-Rank Matrix Factorization and Local Graph Regularization. Journal: Interdisciplinary Sciences: Computational Life Sciences.
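For readers unfamiliar with the norm named in the LRMGC abstract, the standard definition of the Schatten p-norm is given below. The symbols $X$, $\sigma_i$, $m$, and $n$ are introduced here for illustration, and the abstract does not state which value of $p$ the authors use.

```latex
% Schatten p-norm of a matrix X in R^{m x n} with singular values sigma_i(X)
\[
  \|X\|_{S_p} \;=\; \Bigl( \sum_{i=1}^{\min(m,n)} \sigma_i(X)^{p} \Bigr)^{1/p},
  \qquad p > 0 .
\]
% Special cases:
%   p = 1 : the nuclear (trace) norm, the sum of singular values
%   p = 2 : the Frobenius norm
% For 0 < p < 1 the expression is a quasi-norm rather than a norm.
\[
  \lim_{p \to 0^{+}} \|X\|_{S_p}^{p} \;=\; \operatorname{rank}(X)
\]
```

The limit shows why values of $p$ below 1 are often used as a closer surrogate for rank than the nuclear norm: as $p$ shrinks, every nonzero singular value contributes nearly the same amount, regardless of its size.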
Pub Date: 2025-09-01 Epub Date: 2024-12-23 DOI: 10.1007/s12539-024-00673-4
Lun Zhu, Zehua Chen, Sen Yang
Cell-Penetrating Peptides (CPPs) are crucial carriers for drug delivery. Since synthesizing new CPPs in the laboratory is both time- and resource-consuming, computational methods that predict potential CPPs can accelerate the development of CPP-based therapies. In this study, EnDM-CPP is proposed, which combines machine learning algorithms (SVM and CatBoost) with convolutional neural networks (CNN and TextCNN). For dataset construction, three previous CPP benchmark datasets, including CPPsite 2.0, MLCPP 2.0, and CPP924, are merged to improve diversity and reduce homology. For feature generation, two language model-based features obtained from the Transformer architecture, ProtT5 and ESM-2, are employed in CNN and TextCNN. Additionally, sequence features, such as CPRS, Hybrid PseAAC, KSC, etc., are input to SVM and CatBoost. Based on the result of each predictor, a logistic regression (LR) model is built to make the final decision. The experimental results indicate that the fused ProtT5 and ESM-2 features contribute significantly to CPP prediction and that the employed features and models work better in combination. On an independent test dataset, EnDM-CPP achieved an accuracy of 0.9495 and a Matthews correlation coefficient of 0.9008, improvements of 2.23%-9.48% and 4.32%-19.02%, respectively, over other state-of-the-art methods. Code and data are available at https://github.com/tudou1231/EnDM-CPP.git.
{"title":"EnDM-CPP: A Multi-view Explainable Framework Based on Deep Learning and Machine Learning for Identifying Cell-Penetrating Peptides with Transformers and Analyzing Sequence Information.","authors":"Lun Zhu, Zehua Chen, Sen Yang","doi":"10.1007/s12539-024-00673-4","DOIUrl":"10.1007/s12539-024-00673-4","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","pages":"744-769"},"PeriodicalIF":3.9,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"142876973","RegionNum":2,"RegionCategory":"Biology"}
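The final stage described in the EnDM-CPP abstract above, where a logistic regression combines the outputs of the base predictors, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the base-predictor probabilities are synthetic stand-ins for the SVM, CatBoost, CNN, and TextCNN outputs, and the names `fit_meta_learner`, `meta_predict`, and `demo_stacking` are hypothetical. The abstract does not say how the meta-learner's training inputs were produced; out-of-fold predictions are a common choice to limit leakage.

```python
import numpy as np

def _add_bias(X):
    # Append a constant column so the model learns an intercept.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_meta_learner(base_probs, y, lr=0.5, epochs=2000):
    """Fit logistic regression on base-predictor outputs by batch gradient descent."""
    X = _add_bias(base_probs)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / len(y)  # gradient of the mean log-loss
    return w

def meta_predict(w, base_probs):
    """Return the stacked probability of the positive (CPP) class."""
    return 1.0 / (1.0 + np.exp(-(_add_bias(base_probs) @ w)))

def demo_stacking(n=200, n_models=4, seed=0):
    # ASSUMPTION: synthetic labels and noisy "probabilities" replace real model outputs.
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n).astype(float)
    noisy = 0.2 + 0.6 * y[:, None] + rng.normal(0.0, 0.25, size=(n, n_models))
    base_probs = np.clip(noisy, 0.0, 1.0)
    w = fit_meta_learner(base_probs, y)
    final = meta_predict(w, base_probs)
    accuracy = float(np.mean((final >= 0.5) == (y == 1.0)))
    return final, accuracy

final_probs, train_accuracy = demo_stacking()
print(f"training accuracy of the stacked model: {train_accuracy:.3f}")
```

The meta-learner sees only one probability per base predictor, so it stays small and its learned weights indicate how much each predictor contributes to the final decision.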
The emergence of methicillin-resistant Staphylococcus aureus (MRSA) as a recognized cause of community-acquired and hospital infections has created a need for efficient and accurate identification of peptides with anti-MRSA properties in drug discovery and development pipelines. However, current experimental methods tend to be labor- and resource-intensive. Thus, there is an immediate need for practical computational solutions that identify anti-MRSA peptides from sequence. Recently, pre-trained protein language models (pLMs) have emerged as a remarkable advancement for encoding peptide sequences as discriminative feature embeddings, capturing rich protein-level information that can be repurposed for in silico peptide property prediction. In this study, we present pLM4MRSA, a pLM-based framework designed to enhance the accuracy of predicting anti-MRSA peptides. In this framework, we combine feature embeddings from various pLMs, such as ProtTrans and evolutionary-scale modeling (ESM-2), which provide complementary information for prediction. These embeddings are integrated to form hybrid feature embeddings. Next, we apply principal component analysis (PCA) to process these hybrid embeddings. The resulting PCA-transformed feature vectors are then used as inputs for constructing the predictive model. Experimental results on the independent test dataset showed that the proposed pLM4MRSA approach achieved a balanced accuracy and Matthews correlation coefficient of 0.983 and 0.980, respectively, improving on state-of-the-art methods by 2.53%-4.83% and 7.73%-13.23%, respectively. This indicates that pLM4MRSA is a high-performance prediction model with broad applicability. 
Additionally, comparison with well-known hand-crafted features demonstrated that the proposed hybrid feature embeddings complement each other effectively, capturing discriminative patterns for more accurate anti-MRSA peptide prediction. We anticipate that pLM4MRSA will serve as an effective solution for accurate and high-capacity prediction of anti-MRSA peptides from peptide sequences.
{"title":"Advancing the Accuracy of Anti-MRSA Peptide Prediction Through Integrating Multi-Source Protein Language Models.","authors":"Watshara Shoombuatong, Pakpoom Mookdarsanit, Lawankorn Mookdarsanit, Nalini Schaduangrat, Saeed Ahmed, Muhammad Kabir, Pramote Chumnanpuen","doi":"10.1007/s12539-025-00696-5","DOIUrl":"10.1007/s12539-025-00696-5","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","pages":"716-729"},"PeriodicalIF":3.9,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"143604811","RegionNum":2,"RegionCategory":"Biology"}
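The feature pipeline described in the pLM4MRSA abstract above (concatenate embeddings from several pLMs, then apply PCA) might look roughly like the sketch below. It rests on stated assumptions: the embedding matrices are random placeholders rather than real ProtTrans or ESM-2 outputs, the widths 1024 and 1280 and the choice of 10 components are illustrative only, and `fuse_and_reduce` is a hypothetical helper name. PCA is computed here through a singular value decomposition of the centered matrix.

```python
import numpy as np

def fuse_and_reduce(embeddings, n_components):
    """Concatenate per-peptide embeddings from several pLMs, then apply PCA."""
    hybrid = np.hstack(embeddings)            # shape: (n_peptides, sum of widths)
    centered = hybrid - hybrid.mean(axis=0)   # PCA requires zero-mean features
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    return reduced, explained

# ASSUMPTION: random matrices stand in for real per-peptide pLM embeddings.
rng = np.random.default_rng(0)
prot_trans_like = rng.normal(size=(50, 1024))
esm2_like = rng.normal(size=(50, 1280))

features, variance_ratio = fuse_and_reduce([prot_trans_like, esm2_like], n_components=10)
print(features.shape, round(variance_ratio, 3))
```

The reduced matrix would then be passed to whatever classifier is used downstream; the abstract does not name it, so none is shown here.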
Pub Date: 2025-09-01; Epub Date: 2025-04-03; DOI: 10.1007/s12539-025-00700-y
Qiguo Dai, Wuhao Liu, Xianhai Yu, Xiaodong Duan, Ziqiang Liu
Accurately identifying cell types in single-cell RNA sequencing data is critical for understanding cellular differentiation and pathological mechanisms in downstream analysis. As traditional biological approaches are laborious and time-intensive, it is imperative to develop computational methods for cell classification. However, existing methods still struggle to exploit the gene expression information contained in the vast amount of unlabeled cell data, which limits their classification and generalization performance. Therefore, we propose a novel self-supervised graph representation learning framework for single-cell classification, named scSSGC. Specifically, in the pre-training stage of self-supervised learning, multiple K-means clustering tasks conducted on unlabeled cell data are jointly employed for model training, thereby mitigating the issue of limited labeled data. To effectively capture the potential interactions among cells, we introduce a locally augmented graph neural network that enhances information aggregation for nodes with fewer neighbors in the cell graph. A range of benchmark experiments demonstrates that scSSGC outperforms existing state-of-the-art cell classification methods. More importantly, scSSGC performs stably in cross-dataset settings, indicating better generalization ability.
{"title":"Self-Supervised Graph Representation Learning for Single-Cell Classification.","authors":"Qiguo Dai, Wuhao Liu, Xianhai Yu, Xiaodong Duan, Ziqiang Liu","doi":"10.1007/s12539-025-00700-y","DOIUrl":"10.1007/s12539-025-00700-y","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","pages":"566-575"},"PeriodicalIF":3.9,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"143780053","RegionNum":2,"RegionCategory":"Biology"}
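The pre-training idea in the scSSGC abstract above, where several K-means clustering tasks on unlabeled cells supply training targets, can be illustrated with the sketch below. It is only a guess at the general shape of such a setup: the abstract does not state how the tasks differ, so varying the number of clusters K is an assumption, the values 3, 5, and 8 are arbitrary, the expression matrix is random, and `kmeans_labels` and `pseudo_label_tasks` are hypothetical names. The graph neural network that would be trained on these pseudo-labels is not shown.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per row of X."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies the rows, so updating centers leaves X untouched.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the previous center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def pseudo_label_tasks(X, ks=(3, 5, 8)):
    """Build one pseudo-label vector per clustering task (ASSUMPTION: tasks differ in K)."""
    return {k: kmeans_labels(X, k, seed=k) for k in ks}

# ASSUMPTION: a random matrix stands in for a cells-by-genes expression matrix.
rng = np.random.default_rng(0)
expression = rng.normal(size=(120, 30))
tasks = pseudo_label_tasks(expression)
for k, labels in tasks.items():
    print(k, np.bincount(labels, minlength=k))
```

Each label vector could serve as the target of a separate classification head sharing one encoder, so the encoder is trained jointly on all clustering tasks without needing annotated cell types.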