Lena Maria Hackl, Fabian Neuhaus, Sabine Ameling, Uwe Völker, Jan Baumbach, Olga Tsoy
Alternative splicing is a crucial mechanism of gene regulation that enables condition- and tissue-specific expression of gene isoforms. Its dysregulation plays a role in various diseases such as cancer, neurological disorders, and metabolic conditions. Despite its importance, accurate detection of alternative splicing events remains challenging. Comprehensive alternative splicing event detection typically requires deep sequencing with over 100 million reads; however, much of the publicly accessible RNA sequencing data is of lower sequencing depth. Recent advances, particularly deep learning models working with genomic sequences, offer new avenues for predicting alternative splicing without reliance on high sequencing depth data. Our study addresses the question: Can we utilize the vast repository of publicly available RNA sequencing data for comprehensive alternative splicing detection, despite the low sequencing depth? Our results demonstrate the potential of sequence-based deep learning tools such as AlphaGenome, SpliceAI and DeepSplice for initial hypothesis development and as additional filters in standard RNA sequencing pipelines, especially when sequencing depth is limited. Nonetheless, validation with higher sequencing depths remains essential for confirmation of splice events. Overall, our findings underscore the need for integrative methods combining genomic sequence data and RNA sequencing data for the prediction of tissue- and condition-specific alternative splicing in resource-limited settings.
{"title":"Detection of alternative splicing: deep sequencing or deep learning?","authors":"Lena Maria Hackl, Fabian Neuhaus, Sabine Ameling, Uwe Völker, Jan Baumbach, Olga Tsoy","doi":"10.1093/bib/bbaf705","journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790623/pdf/"}
Wenli Zhai, Lingyun Sun, Wenwei Fang, Yidan Dong, Chunxiao Cheng, Yuanjiao Liu, Yuan Zhou, Jiadong Ji, Lang Wu, An Pan, Eric R Gamazon, Xiong-Fei Pan, Dan Zhou
Genetics-informed proteome-wide association studies (PWASs) provide an effective way to uncover proteomic mechanisms underlying complex diseases. PWAS relies on an ancestry-matched reference panel to model the impact of genetically determined protein expression on phenotype. However, reference panels from underrepresented populations remain relatively limited. We developed a multi-ancestry framework to enhance protein prediction in these populations by integrating diverse information-sharing strategies into a Multi-Ancestry Best-performing Model (MABM). MABM improved prediction performance in both cross-validation and an external dataset. Leveraging BioBank Japan, we identified three times as many significant PWAS associations using MABM as using the Lasso model. Notably, 47.5% of the MABM-specific associations were reproduced in independent East Asian datasets with concordant effect sizes. Furthermore, MABM improved gene/protein prioritization for functional validation in complex traits by validating well-established associations and uncovering novel trait-related candidates. The benefits of MABM were further validated in additional ancestries and demonstrated in brain tissue-based PWAS, underscoring its broad applicability. Our findings close critical gaps in multi-omics research and facilitate trait-relevant protein discovery in underrepresented populations.
{"title":"Cross-ancestry information transfer framework improves protein abundance prediction and protein-trait association identification.","authors":"Wenli Zhai, Lingyun Sun, Wenwei Fang, Yidan Dong, Chunxiao Cheng, Yuanjiao Liu, Yuan Zhou, Jiadong Ji, Lang Wu, An Pan, Eric R Gamazon, Xiong-Fei Pan, Dan Zhou","doi":"10.1093/bib/bbaf707","journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777707/pdf/"}
Ruhai Chen, Jiekai Chen, Lingling Shi, Jiangping He
Chromatin topological structure is critical for gene regulation. Hi-C-based experiments have significantly advanced our understanding of chromatin organization. Numerous computational tools have been developed to identify various structural levels of chromatin, ranging from compartments to loops. However, there remains a lack of specialized tools for identifying non-homologous inter-chromatin contacts (NHCCs), which play important roles in chromosome territories. In this study, we present iceDP, a tool that leverages the Density Peaks clustering algorithm to identify local high-density regions within inter-chromatin contacts. These regions undergo two subsequent filtering steps to eliminate obvious false positives. When applied to three Hi-C datasets, iceDP accurately identified known NHCCs, including olfactory receptor genes in mature olfactory sensory neurons and Polycomb repressive complex-regulated developmental genes in mouse embryonic stem cells (mESCs). Notably, iceDP also uncovered previously unreported transcriptionally active NHCCs. Compared with diffHiC and FitHiC, iceDP exhibited superior performance, with the highest positive rate. Moreover, iceDP is compatible with a wide range of chromatin conformation capture techniques, including in-situ Hi-C, Micro-C, HiChIP, and BL-HiC, demonstrating its versatility and utility.
{"title":"iceDP: identifying inter-chromatin engagement via density peaks clustering algorithm.","authors":"Ruhai Chen, Jiekai Chen, Lingling Shi, Jiangping He","doi":"10.1093/bib/bbaf704","journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777978/pdf/"}
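The density peaks idea named in the iceDP abstract can be illustrated with a small, generic sketch. This is not the iceDP implementation: the function names, the toy coordinates (standing in for pairs of genomic bins), the cutoff `d_c`, and the thresholds are all assumptions chosen for illustration. For each point the sketch computes a local density `rho` (neighbours within `d_c`) and a separation `delta` (distance to the nearest point of higher density); points where both are large are candidate peaks.

```python
import math


def density_peaks(points, d_c):
    """Return (rho, delta) per point, following the generic density peaks scheme."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho: number of other points closer than the cutoff distance d_c
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Rank points by decreasing density; the index breaks ties deterministically
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbour: use its largest distance
            delta[i] = max(dist[i])
        else:
            # Distance to the nearest point that ranks higher in density
            delta[i] = min(dist[i][j] for j in order[:rank])
    return rho, delta


def pick_peaks(points, d_c, min_rho, min_delta):
    """Keep points that are both dense and far from any denser point."""
    rho, delta = density_peaks(points, d_c)
    return [p for p, r, d in zip(points, rho, delta) if r >= min_rho and d >= min_delta]


# Hypothetical toy data: two tight groups of contacts in a 2D bin-by-bin plane
contacts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
            (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
print(pick_peaks(contacts, d_c=0.9, min_rho=3, min_delta=5.0))
# -> [(0.5, 0.5), (10.5, 10.5)]
```

The two filtering steps mentioned in the abstract are not reproduced here, since the abstract does not describe them.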
Circular RNA (circRNA) represents a critical class of regulatory RNAs with distinctive structural and functional features. The functions of circRNAs are modulated by various RNA modifications. Here, we present CircRM, a nanopore direct RNA sequencing-based computational method for profiling RNA modifications in circRNAs at single-base and single-molecule resolution. By integrating circRNA detection, read-level modification detection, and quantitative assessment of methylation rates, CircRM identified 427 high-confidence circRNAs and enabled systematic characterization of three major modifications: m5C (AUC = 0.855), m6A (AUC = 0.817), and m1A (AUC = 0.769). It revealed modification patterns distinct from those of linear RNAs, highlighting RNA-type-specific regulation. We also identified key features of circRNA-specific modifications, such as enrichment near back-splice junctions. Cross-cell line analyses further demonstrated both conserved and cell-type-specific modification patterns. Together, these findings reveal, at the computational level, a unique epitranscriptomic landscape associated with circRNAs and establish CircRM as a powerful tool for advancing the study of RNA modifications in circular RNA biology. CircRM is freely accessible at: https://github.com/jiayiAnnie17/CircRM.
{"title":"CircRM: profiling circular RNA modifications from nanopore direct RNA sequencing.","authors":"Jiayi Li, Shenglun Chen, Zhixing Wu, Haozhe Wang, Rong Xia, Jia Meng, Yuxin Zhang","doi":"10.1093/bib/bbaf726","journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12798809/pdf/"}
Yu Zhang, Ming Li, David M Haas, C Noel Bairey Merz, Tsegaselassie Workalemahu, Kelli Ryckman, Janet M Catov, Lisa D Levine, Alexa Freedman, George R Saade, Jiaqi Hu, Hongyu Zhao, Xihao Li, Nianjun Liu, Qi Yan
Mendelian randomization (MR) has become an important technique for establishing causal relationships between risk factors and health outcomes. By using genetic variants as instrumental variables, it can mitigate bias due to confounding and reverse causation in observational studies. Current MR analyses have predominantly used common genetic variants as instruments, which represent only part of the genetic architecture of complex traits. Rare variants, which can have larger effect sizes and provide unique biological insights, have been understudied due to statistical and methodological challenges. We introduce MR-common and annotation-informed rare variants (MR-CARV), a novel framework integrating common and rare genetic variants in two-sample MR. This method leverages comprehensive genetic data made available by high-throughput sequencing technologies and large-scale consortia. Rare variants are aggregated into functional categories, such as gene-coding, gene-noncoding, and nongene regions, using variant annotations and biological impact as weights. The effects of rare variant sets are then estimated with STAARpipeline and combined with the estimated effects of common variants using existing MR methods. Simulation studies demonstrate that MR-CARV maintains robust type I error control and achieves higher statistical power, with up to a 66.3% relative increase compared with existing methods based only on common variants. Consistent with these findings, application to real data on high-density lipoprotein cholesterol (HDL-C) and preeclampsia showed that MR-CARV [inverse variance weighted (IVW)] yielded a more precise and statistically significant effect estimate (-0.020, SE = 0.0102, $P = .0470$) than IVW using only common variants (-0.023, SE = 0.0123, $P = .0659$).
{"title":"A novel two-sample Mendelian randomization framework integrating common and rare variants: application to assess the effect of HDL-C on preeclampsia risk.","authors":"Yu Zhang, Ming Li, David M Haas, C Noel Bairey Merz, Tsegaselassie Workalemahu, Kelli Ryckman, Janet M Catov, Lisa D Levine, Alexa Freedman, George R Saade, Jiaqi Hu, Hongyu Zhao, Xihao Li, Nianjun Liu, Qi Yan","doi":"10.1093/bib/bbaf649","journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"publicationDate":"2026-01-07","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777983/pdf/"}
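The inverse variance weighted (IVW) estimator mentioned in the MR-CARV abstract can be sketched with the standard fixed-effect formula. This is a minimal illustration, not the MR-CARV code: treating each aggregated rare-variant set as one extra instrument with its own summary statistics is an assumption made here, and all numbers below are hypothetical.

```python
import math


def ivw(instruments):
    """Fixed-effect IVW estimate from (beta_exposure, beta_outcome, se_outcome) triples."""
    num = sum(bx * by / se ** 2 for bx, by, se in instruments)
    den = sum(bx ** 2 / se ** 2 for bx, by, se in instruments)
    estimate = num / den
    std_err = math.sqrt(1.0 / den)
    z = estimate / std_err
    # Two-sided p-value under a standard normal reference distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return estimate, std_err, p_value


# Hypothetical summary statistics (not taken from the paper)
common = [(0.10, -0.0020, 0.010), (0.20, -0.0040, 0.010)]
rare_sets = [(0.50, -0.0100, 0.020)]  # assumed: one aggregated rare-variant set

est_common, se_common, p_common = ivw(common)
est_all, se_all, p_all = ivw(common + rare_sets)
print(est_common, se_common, p_common)
print(est_all, se_all, p_all)
```

In this toy setup every instrument implies the same causal ratio, so the point estimate is unchanged while the standard error shrinks once the rare-variant set is added. That is the kind of precision gain the abstract describes, shown here only on made-up inputs.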
Cancer is a highly heterogeneous disease characterized by complex molecular changes. Subtypes identified through multi-omics data hold significant promise for improving prognosis and facilitating personalized precision treatment. Recent multi-omics integration methods have mostly focused on capturing complementary information from different data types, often overlooking potential interactions between omics data. Here we develop a novel method named interactive multi-kernel learning (iMKL), which incorporates omics-omics interactions alongside heterogeneous data types under the unsupervised multi-kernel learning framework, to improve subtype identification. Using the sample-similarity kernel for each dataset, we propose a joint Hadamard product strategy to capture higher-order interactive effects from different omics data types. We applied iMKL to two renal cell carcinoma (RCC) datasets, clear cell renal cell carcinoma (ccRCC) and type II papillary renal cell carcinoma (type II pRCC), both including miRNA expression, mRNA expression, and DNA methylation data. Stability analysis through random sampling of patients or features demonstrated that iMKL exhibits strong robustness and accuracy in identifying patient subtypes. Both ccRCC and type II pRCC were classified into three distinct subtypes, which showed dramatic differences in patient survival. The findings in the real application highlight potential biomarkers associated with adverse patient outcomes and demonstrate substantial advancement in cancer subtype identification. The iMKL method effectively identifies tumor molecular subtypes that are strongly associated with clinical features and survival rates, providing valuable insights for accurate cancer subtyping, clinical decision-making, and the realization of personalized treatment strategies.
{"title":"Multi-omics data integration for enhanced cancer subtyping via interactive multi-kernel learning.","authors":"Hongyan Cao, Tong Wang, Zhaoyang Xu, Xin Zhao, Gaiqin Liu, Xiaoling Yang, Ruiling Fang, Yanhong Luo, Ping Zeng, Hongmei Yu, Yanbo Zhang, Yuehua Cui","doi":"10.1093/bib/bbaf687","journal":{"name":"Briefings in bioinformatics","volume":"26 6","pages":""},"publicationDate":"2025-11-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12710476/pdf/"}
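The Hadamard product strategy described in the iMKL abstract can be pictured as an element-wise product of per-omics sample-similarity kernels. The sketch below is an illustration only: the Gaussian kernel, the `gamma` value, and the toy feature vectors are assumptions, and how iMKL weights or combines the resulting kernels is not stated in the abstract.

```python
import math


def gaussian_kernel(samples, gamma=1.0):
    """Sample-by-sample similarity matrix exp(-gamma * squared Euclidean distance)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
             for y in samples] for x in samples]


def hadamard(*kernels):
    """Element-wise product of equally sized kernel matrices."""
    n = len(kernels[0])
    out = [[1.0] * n for _ in range(n)]
    for kernel in kernels:
        for i in range(n):
            for j in range(n):
                out[i][j] *= kernel[i][j]
    return out


# Hypothetical one-feature profiles for three patients in two omics layers
mrna = [[0.0], [1.0], [3.0]]
methylation = [[0.0], [0.0], [2.0]]

k_mrna = gaussian_kernel(mrna)
k_meth = gaussian_kernel(methylation)
k_interaction = hadamard(k_mrna, k_meth)
print(k_interaction[0][1])  # similarity of patients 0 and 1 across both layers
```

One reason an element-wise product is a natural choice is the Schur product theorem: the product of positive semidefinite kernels is again positive semidefinite, so the interaction matrix can be used wherever a valid kernel is required.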
Md Ali Hossain, Tania Akter Asa, Md Shofiqul Islam, Mohammad Zahidur Rahman, Mohammad Ali Moni
Ovarian cancer (OC) is a highly lethal malignancy worldwide, necessitating the identification of key genes to uncover its molecular mechanisms and improve diagnostic and therapeutic strategies. This study used statistical and machine learning approaches to identify key candidate genes for OC. Three microarray datasets were obtained from the Gene Expression Omnibus (GEO) database, and analysis began with normalization and differential gene expression analysis using the limma package. Highly discriminative differentially expressed genes (HDDEGs) were identified through a support vector machine-based approach, yielding 84 overlapping HDDEGs across the datasets. Enrichment analysis of HDDEGs was conducted using DAVID. A protein-protein interaction network was constructed via STRING, and central hub genes were pinpointed using CytoHubba metrics. Significant modules were analyzed with Molecular Complex Detection (MCODE). Together, these analyses identified 18 central hub genes, 11 hub module genes, and 54 meta-hub genes. The intersection of these three gene sets revealed eight shared key genes (FANCD2, BUB1B, BUB1, KIF4A, DTL, NCAPG, KIF20A, and UBE2C). Weighted gene co-expression network analysis identified key modules linked to clinical traits and confirmed that the eight key candidate genes group into a single cluster. These genes were validated using two independent datasets (GSE38666 and TCGA-OC), with area under the curve and survival analyses underscoring their predictive and prognostic significance in OC. This integrative approach advances understanding of OC's molecular basis, identifies potential biomarkers, and emphasizes the clinical relevance of the eight key candidate genes for OC diagnosis, prognosis, and treatment.
{"title":"Identification of key candidate genes for ovarian cancer using integrated statistical and machine learning approaches.","authors":"Md Ali Hossain, Tania Akter Asa, Md Shofiqul Islam, Mohammad Zahidur Rahman, Mohammad Ali Moni","doi":"10.1093/bib/bbaf602","journal":{"name":"Briefings in bioinformatics","volume":"26 6","pages":""},"publicationDate":"2025-11-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12710472/pdf/"}
The prediction of binding free energy changes ($\Delta\Delta G$) caused by mutations in protein complexes is crucial for understanding disease mechanisms and designing antibodies. Approximately 60% of pathogenic missense mutations lead to functional abnormalities by disrupting molecular interactions. However, although existing $\Delta\Delta G$ predictors exhibit strong performance in benchmarks, they suffer from inadequate generalization, a misalignment between evaluation metrics and practical needs, and poor adaptability to complex mutation scenarios. This study systematically assessed eight mainstream predictors, covering both physical energy function-based and machine learning-based methods, on a newly constructed independent evaluation set. It employed multi-dimensional metrics, including regression accuracy and classification capability, and analyzed how predictor performance varies across mutation types, stability categories, and microenvironments of protein mutation sites. The results indicate that >60% of predictors (5 out of 8) exhibit a systematic bias toward overestimating mutational instability. In the three-class classification task, predictors demonstrate a limited ability to identify stabilizing mutations ($\Delta\Delta G < -0.5$ kcal/mol), with recall rates <0.1 for this class, and overall predictive efficacy depends on the local protein structure. In summary, this study reveals the limitations of current $\Delta\Delta G$ predictors in terms of generalization and adaptability to complex scenarios, thus providing a reference for the optimization and practical application of $\Delta\Delta G$ prediction methods. It suggests that future breakthroughs can be achieved by constructing balanced and standardized datasets alongside developing local-global fusion algorithms.
{"title":"Systematic evaluation of predictors for binding free energy changes upon mutations in protein complexes.","authors":"Yu Zhang, Yunjiong Liu, Yulin Zhang, Ziyang Wang, Xiaoli Lu, Shengxiang Ge, Xiaoping Min","doi":"10.1093/bib/bbaf645","journal":{"name":"Briefings in bioinformatics","volume":"26 6","pages":""},"publicationDate":"2025-11-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12684732/pdf/"}
Yingxu Wang, Kunyu Zhang, Jiaxin Huang, Nan Yin, Siwei Liu, Eran Segal
Multimodal molecular representation learning jointly models molecular graphs and their textual descriptions. By integrating structural and semantic information, it improves the accuracy, robustness, and interpretability of predictions of drug toxicity, bioactivity, and physicochemical properties. However, existing multimodal methods suffer from two key limitations: (i) they typically perform cross-modal interaction only at the final encoder layer, thus overlooking hierarchical semantic dependencies; (ii) they lack a unified prototype space for robust alignment between modalities. To address these limitations, we propose ProtoMol, a prototype-guided multimodal framework that enables fine-grained integration and consistent semantic alignment between molecular graphs and textual descriptions. ProtoMol incorporates dual-branch hierarchical encoders, using Graph Neural Networks to process structured molecular graphs and Transformers to encode unstructured texts, which yields comprehensive layer-wise representations. ProtoMol then introduces a layer-wise bidirectional cross-modal attention mechanism that progressively aligns semantic features across layers. Furthermore, a shared prototype space with learnable, class-specific anchors is constructed to guide both modalities toward coherent and discriminative representations. Extensive experiments on multiple benchmark datasets demonstrate that ProtoMol consistently outperforms state-of-the-art baselines across a variety of molecular property prediction tasks. Our source code is available at: https://github.com/zky04/Protomol.
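The idea of bidirectional cross-modal attention at one layer can be sketched in a few lines of plain Python. This is not ProtoMol's implementation (see the repository above for that): it omits learned query/key/value projections, multiple heads, and normalization, and the names `cross_attend` and `bidirectional_layer` as well as the toy feature vectors are hypothetical. It only shows graph features attending over text features and vice versa, each combined with a residual connection.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, context):
    """Scaled dot-product attention: each query vector attends over all context vectors."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        attended.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(len(q))]
        )
    return attended


def bidirectional_layer(graph_feats, text_feats):
    """One layer of graph->text and text->graph attention with residual addition."""
    g_ctx = cross_attend(graph_feats, text_feats)  # graph nodes look at text tokens
    t_ctx = cross_attend(text_feats, graph_feats)  # text tokens look at graph nodes
    new_graph = [[a + b for a, b in zip(g, c)] for g, c in zip(graph_feats, g_ctx)]
    new_text = [[a + b for a, b in zip(t, c)] for t, c in zip(text_feats, t_ctx)]
    return new_graph, new_text


# Hypothetical toy features: two graph-node vectors and one text-token vector.
graph = [[1.0, 0.0], [0.0, 1.0]]
text = [[2.0, 2.0]]
new_graph, new_text = bidirectional_layer(graph, text)
print(new_graph, new_text)
```

Applying such a step to the encoder outputs of every layer, rather than only the last one, is the layer-wise aspect the abstract describes.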
{"title":"ProtoMol: enhancing molecular property prediction via prototype-guided multimodal learning.","authors":"Yingxu Wang, Kunyu Zhang, Jiaxin Huang, Nan Yin, Siwei Liu, Eran Segal","doi":"10.1093/bib/bbaf629","DOIUrl":"10.1093/bib/bbaf629","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 6","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12684735/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145707468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saloni Bhatia, Matt A Field, Lionel Hebbard, Ulf Schmitz
Alternative splicing (AS) plays a key role in regulating gene expression, and its dysregulation is implicated in numerous human diseases, including cancer. While bulk RNA sequencing has advanced our understanding of AS, it cannot capture cellular heterogeneity or reliably reconstruct full-length isoforms, both of which underpin disease mechanisms and therapeutic responses. Single-cell RNA sequencing (scRNA-seq) is an established and powerful approach to examine AS landscapes at single-cell resolution, enabling the identification of cell-specific aberrant splicing events that may contribute to disease. However, conventional scRNA-seq is limited by short read lengths, which often prevent accurate reconstruction of full-length transcript isoforms. This limitation is addressed by long-read RNA-seq (lrRNA-seq), which can sequence full-length RNA molecules, some exceeding 100 000 nucleotides in length. lrRNA-seq thereby enables more accurate characterization of isoform diversity, identification of novel splice variants, quantification of percent spliced-in values, and detection of fusion transcripts. The convergence of single-cell resolution and third-generation sequencing technologies has led to the development of single-cell long-read sequencing (SCLR-seq), a powerful approach that addresses the key constraints of bulk short-read RNA-seq by providing isoform-level resolution and cell-type specificity. This review explores the growing utility of SCLR-seq, highlighting recent developments in bioinformatics tools and pipelines designed for SCLR-seq data analysis. We discuss how this emerging technology is transforming our understanding of isoform regulation and aberrant splicing in human diseases, and its potential to uncover novel diagnostic and therapeutic targets.
{"title":"Bioinformatics frameworks for single-cell long-read sequencing: unlocking isoform-level resolution.","authors":"Saloni Bhatia, Matt A Field, Lionel Hebbard, Ulf Schmitz","doi":"10.1093/bib/bbaf655","DOIUrl":"10.1093/bib/bbaf655","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 6","pages":""},"PeriodicalIF":7.7,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12696714/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145720917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}