Bingyan Wang, Heng Hu, Runtian Gao, Guohua Wang, Tao Jiang
Gene fusions are critical oncogenic drivers and therapeutic targets in diverse cancers. Long-read ribonucleic acid sequencing (RNA-seq) offers an unprecedented opportunity to resolve the full-length structure of fusion isoforms, but its high intrinsic error rates pose significant challenges to the precise identification of true fusion events. Here, we developed GFSeeker, an innovative splicing-graph-based computational framework for accurate gene fusion detection from long-read RNA-seq. GFSeeker employs a unique pipeline based on a splicing graph reference and a dual re-alignment validation to effectively overcome data noise from high error rates. Benchmarking across simulated, non-tumor, and cancer cell line datasets demonstrated GFSeeker's state-of-the-art performance, achieving 6%-15% higher F1 score compared to existing methods. Notably, GFSeeker successfully identified the known fusion event, MATN2-POP1, in the MCF-7 cancer cell line, missed by other tools, highlighting its superior sensitivity in resolving complex fusion events. These results validate GFSeeker as a powerful and reliable tool for gene fusion discovery, heralding its significant potential to advance cancer research and precision diagnostics.
{"title":"GFSeeker: a splicing-graph-based approach for accurate gene fusion detection from long-read RNA sequencing data.","authors":"Bingyan Wang, Heng Hu, Runtian Gao, Guohua Wang, Tao Jiang","doi":"10.1093/bib/bbaf702","DOIUrl":"10.1093/bib/bbaf702","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12777712/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145917105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying cancer driver genes is essential for precision oncology, but existing computational methods are often limited by their reliance on single biological networks and their inability to capture long-range molecular dependencies. To address these challenges, we propose GRAFT, a Graph-Aware Fusion Transformer. This framework learns modality-specific features from protein-protein interactions, pathway co-occurrence, and gene semantic similarity using a multi-view graph encoder. These representations are further enriched with two auxiliary feature types: structural encodings derived from network topology and functional embeddings guided by curated gene sets. The integrated features are then processed by a transformer backbone, where a novel edge-attention bias makes the model explicitly sensitive to the underlying graph topologies, enabling the effective modeling of both local and global dependencies. Extensive evaluations demonstrate that GRAFT achieves competitive performance with leading state-of-the-art methods in pan-cancer analysis, while consistently delivering superior predictive accuracy across numerous specific cancer types. More importantly, a functional enrichment analysis of the novel candidate driver genes predicted by our model confirms their strong associations with key cancer-related processes, demonstrating the model's ability to make biologically plausible discoveries. By delivering a powerful and interpretable framework, our model not only advances the identification of cancer driver genes but also establishes a robust paradigm for multimodal data integration in systems biology. The source codes and datasets are publicly accessible at https://github.com/spcho-dev/GRAFT.
{"title":"GRAFT: a graph-aware fusion transformer for cancer driver gene prediction.","authors":"Sang-Pil Cho, Young-Rae Cho","doi":"10.1093/bib/bbaf706","DOIUrl":"10.1093/bib/bbaf706","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790624/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145948471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
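The edge-attention bias mentioned in the GRAFT abstract can be illustrated with a small sketch. The abstract does not say how the bias enters the attention computation, so the additive form below (a constant added to the attention logit wherever the graph has an edge) is an assumption, and the names `edge_biased_attention`, `adj`, and `bias` are hypothetical. It is not taken from the GRAFT code base.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def edge_biased_attention(q, k, v, adj, bias=1.0):
    """Scaled dot-product attention with an additive bias on graph edges.

    q, k, v: lists of per-node vectors (queries, keys, values).
    adj:     adjacency matrix; adj[i][j] == 1 if nodes i and j share an edge.
    bias:    hypothetical scalar added to the logit of connected pairs
             (assumed form; a learned per-edge-type bias is equally plausible).
    """
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + bias * adj[i][j]
            for j, kj in enumerate(k)
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return outputs, weights
```

With identical queries and keys, plain attention would weight every node equally; the bias shifts weight toward graph neighbours while still letting every node attend to every other, which is one way a transformer can model both local and global dependencies.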
Large language models have revolutionized natural language processing by effectively modeling complex semantics and capturing long-range contextual relationships. Inspired by these advancements, genome language models (gLMs) have recently emerged, conceptualizing DNA and RNA sequences as biological texts and enabling the identification of intricate genomic grammar and distant regulatory interactions. This review examines the need for gLMs, emphasizing their capacity to overcome the limitations of traditional deep learning approaches in genomic sequence characterization. We comprehensively survey contemporary gLM architectures, including Transformer models, Hyena convolutions, and state space models, as well as various sequence tokenization strategies, assessing their applicability and effectiveness across diverse genomic applications. Additionally, we discuss foundational pretraining strategies and provide an overview of genomic pretraining datasets spanning multiple species and functional domains. We critically analyze evaluation methodologies, including supervised, zero-shot, and few-shot learning paradigms, as well as fine-tuning approaches. An extensive taxonomy of downstream tasks is presented, alongside a summary of existing benchmarks and emerging trends. Finally, we contemplate key challenges such as data scarcity, interpretability, and the computational demands of genomic modeling, and propose a roadmap to guide future advances in genome language modeling.
{"title":"A comprehensive survey of genome language models in bioinformatics.","authors":"Liyuan Shu, Jiao Tang, Xiaoyu Guan, Daoqiang Zhang","doi":"10.1093/bib/bbaf724","DOIUrl":"10.1093/bib/bbaf724","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805252/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bosheng Song, Jiayi Zhang, Ying Liu, Yuansheng Liu, Jing Jiang, Sisi Yuan, Xia Zhen, Yiping Liu
Molecular representation learning (MRL) is a foundation for leveraging computational methods in drug discovery, enabling the transformation of molecular structures and properties into numerical vectors. These vectors serve as input for machine learning models and facilitate the prediction and analysis of molecular attributes, functions, and reactions. The advent of foundation models has introduced both new opportunities and challenges to MRL. These models offer improved generalizability and transferability when data are scarce. Through pretraining and fine-tuning, foundation models can be adapted to various domains. Their robust encoding and generative abilities also allow the transformation of molecular data into more expressive forms. This paper provides a detailed review of current mainstream molecular descriptors and datasets, focusing primarily on the representation of small molecules while excluding larger molecules such as proteins and peptides. It classifies foundation models into two primary categories based on the form of input: unimodal-based and multimodal-based models. For each category, representative models are identified and their advantages and disadvantages evaluated. Moreover, we systematically summarize four core pretraining strategies for MRL foundation models, analyzing their task designs, applicable scenarios, and impacts on downstream performance. In addition, the application of molecular representation foundation models in drug discovery and development is discussed, together with the current status of model interpretability. The paper concludes with insights into the future directions of MRL foundation models.
{"title":"A systematic review of molecular representation learning foundation models.","authors":"Bosheng Song, Jiayi Zhang, Ying Liu, Yuansheng Liu, Jing Jiang, Sisi Yuan, Xia Zhen, Yiping Liu","doi":"10.1093/bib/bbaf703","DOIUrl":"10.1093/bib/bbaf703","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12784970/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145932191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chenjie Feng, Xiaowen Sun, Xintao Song, Lei Bao, Weikang Gong, Renmin Han
Understanding RNA conformational dynamics is essential to understanding its roles in complex biological processes. While computational methods have revolutionized the prediction of static 3D RNA structures, predicting local flexibility directly from structure remains a significant challenge. We developed DeepRMSF, a deep learning-based method that leverages atomic-level descriptions of RNA to predict vibrational flexibility given a tertiary structure. Trained on MD-derived root-mean-square fluctuations (RMSF), DeepRMSF was benchmarked on 371 nonredundant RNAs, with 311 RNAs used for five-fold cross-validation (PCC = 0.7219-0.7464) and 60 RNAs as an independent test set (PCC = 0.734), ensuring minimal sequence/structural similarity between sets. DeepRMSF predicts the local flexibility of medium-sized RNAs (~75 nucleotides) in ~8.2 s, achieving >3000-fold speed-up over MD simulations while maintaining strong extrapolative accuracy. Rather than replacing MD, DeepRMSF offers a scalable and practical alternative for transcriptome-scale screening of RNA flexibility, facilitating studies on RNA structure-dynamics-function relationships and supporting computational modeling in RNA biology.
{"title":"DeepRMSF: a deep learning-based automated approach for predicting atomic-level flexibility in RNA structure.","authors":"Chenjie Feng, Xiaowen Sun, Xintao Song, Lei Bao, Weikang Gong, Renmin Han","doi":"10.1093/bib/bbaf720","DOIUrl":"10.1093/bib/bbaf720","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12798811/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145965339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
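The two quantities reported in the DeepRMSF abstract, RMSF and the Pearson correlation coefficient (PCC), have standard definitions that a short sketch can make concrete. The code below uses the textbook per-atom RMSF (root of the mean squared deviation from the time-averaged position) and the usual sample PCC. It assumes the trajectory frames are already superimposed on a reference; the function names and the `traj[t][i]` layout are hypothetical and not taken from the DeepRMSF pipeline.

```python
import math


def rmsf(traj):
    """Per-atom root-mean-square fluctuation.

    traj[t][i] is the (x, y, z) position of atom i in frame t.
    Frames are assumed to be pre-aligned to a common reference.
    """
    n_frames, n_atoms = len(traj), len(traj[0])
    result = []
    for i in range(n_atoms):
        # Time-averaged position of atom i.
        mean = [sum(traj[t][i][c] for t in range(n_frames)) / n_frames for c in range(3)]
        # Mean squared deviation from that average.
        msd = sum(
            sum((traj[t][i][c] - mean[c]) ** 2 for c in range(3))
            for t in range(n_frames)
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


def pearson(x, y):
    """Pearson correlation between predicted and reference values."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

In a benchmark of the kind the abstract describes, `pearson` would be applied to the predicted and MD-derived RMSF values; whether the reported PCC is computed per RNA or pooled is not stated in the abstract.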
Accurate prediction of ligand-induced activity for G-protein-coupled receptors (GPCRs) is a cornerstone of drug discovery, yet it is challenged by the need to model allosteric communication, the long-range signaling that links ligand binding to distal conformational changes. Prevailing sequence-based models often fail to capture these three-dimensional dynamics, a limitation frequently masked by averaged performance on simpler Class A targets. To address this, we introduce GPCRact, a novel framework that models the biophysical principles of allosteric modulation in GPCR activation. It first constructs a high-resolution, three-dimensional structure-aware graph from the heavy-atom coordinates of functionally critical residues at binding and allosteric sites. A dual attention architecture then captures the activation process: cross-attention encodes the initial ligand-protein interaction at the binding site, whereas self-attention learns the subsequent intra-protein signal propagation. This hierarchical architecture is built upon an E(n)-Equivariant Graph Neural Network (EGNN) to explicitly model conformational consequences of ligand binding, and is further refined with a tailored loss function and inference logic to mitigate error propagation. Underpinned by GPCRactDB, a comprehensive database we constructed for this study, GPCRact not only achieves state-of-the-art performance but also demonstrates robustly superior accuracy on a curated benchmark of allosterically complex receptors where existing models systematically underperform. Crucially, analysis of the learned attention weights confirms that the model identifies biologically validated allosteric pathways, offering a significant step toward resolving the black-box nature of previous methods. Thus, GPCRact provides a more accurate, interpretable, and mechanistically grounded solution to a long-standing challenge, paving the way for effective structure-guided drug discovery.
{"title":"GPCRact: a hierarchical framework for predicting ligand-induced GPCR activity via allosteric communication modeling.","authors":"Hyojin Son, Gwan-Su Yi","doi":"10.1093/bib/bbaf719","DOIUrl":"10.1093/bib/bbaf719","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805254/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inas Bosch, Barbara Gravel, Alexandre Renaux, Ann Nowé, Maris Laan, Tom Lenaerts
Identifying the potential oligogenic causes of rare diseases remains a challenge, notwithstanding the advancements made in the last decade. While a variety of predictive and ranking approaches have been proposed, their precision remains limited, as only a small number of high-quality training cases are available and it remains difficult to know which features may be most relevant for the design of new predictors. We hypothesize here that structured biological information, which provides an integration of various relevant biological networks and ontologies in a single heterogeneous knowledge graph, can make a difference as it allows for learning a relevant genetic representation through knowledge graph embedding (KGE) methods. An exhaustive benchmarking is performed here wherein we assess the performance of various state-of-the-art embedding models for the task of identifying potentially pathogenic gene pairs. The results obtained show that these KGE models provide highly accurate predictions, leading to an Area Under the Precision-Recall Curve of up to 0.93, which also represents a significant advancement over previous approaches for predicting gene pairs involved in oligogenic diseases. We show nonetheless that care needs to be taken in the cross-validation when using embeddings, as data leakage between folds in embedding space will yield overly optimistic results. Further evaluation of the methods on a holdout set as well as on a group of new male infertility cases shows that three Translational Distance models (TransE, MurE, and RotatE) and two of the Semantic Matching models (DistMult and QuatE) provide the best results. The analysis is concluded by comparing all known gene combinations for these top-ranking models, examining their similarities and differences. Overall, KGE models provide a predictive advancement, but new steps will need to be taken to generate explanations as to why the pairs are relevant for oligogenic diseases.
{"title":"Benchmarking knowledge graph embedding models for the prediction of oligogenic combinations.","authors":"Inas Bosch, Barbara Gravel, Alexandre Renaux, Ann Nowé, Maris Laan, Tom Lenaerts","doi":"10.1093/bib/bbaf712","DOIUrl":"10.1093/bib/bbaf712","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790627/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145948436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
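The two model families named in the oligogenic benchmarking abstract differ in how they score a candidate triple (head, relation, tail). As a minimal sketch, the functions below implement the standard published scoring functions for one representative of each family: TransE (translational distance, scoring a triple by how close head + relation lands to tail) and DistMult (semantic matching, a trilinear product). The toy vectors and the function names are illustrative assumptions; the benchmark's actual embeddings, dimensions, and training setup are not given in the abstract.

```python
import math


def transe_score(head, relation, tail):
    """TransE: negative L2 distance between (head + relation) and tail.

    Higher (closer to zero) means the triple is more plausible.
    """
    return -math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


def distmult_score(head, relation, tail):
    """DistMult: trilinear product sum_i head_i * relation_i * tail_i.

    Higher means the triple is more plausible. The score is symmetric
    in head and tail, which suits undirected gene-pair relations.
    """
    return sum(h * r * t for h, r, t in zip(head, relation, tail))
```

For gene-pair prediction, a candidate pair would be scored as a triple such as (gene A, hypothetical "oligogenic-with" relation, gene B) and ranked against other candidates; the relation name here is a placeholder, not one taken from the paper's knowledge graph.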
Mitotic checkpoints safeguard genomic integrity by orchestrating the precise segregation of chromosomes during cell division. Yet their complex, nonlinear dynamics have long defied full understanding through traditional experimental and computational approaches. In recent years, artificial intelligence (AI) has begun to transform this landscape. Machine learning and deep learning methods now achieve substantial accuracies in predicting cellular behaviors and uncovering novel regulatory mechanisms within checkpoint networks. Advances include transformer architectures capable of predicting spindle assembly checkpoint engagement with >95% accuracy, graph neural networks that decode kinetochore-microtubule dynamics at subpixel resolution, and hybrid AI-mechanistic models that reveal previously hidden feedback circuits. By integrating multi-omics data and bridging molecular mechanisms with clinical applications, AI-driven approaches are opening significant opportunities for precision medicine in cancer and other proliferative diseases. This review synthesizes emerging computational frameworks, highlights transformative AI-driven discoveries, and proposes a roadmap for developing predictive, personalized models of mitotic checkpoint control, charting a path from computational insight to clinical impact.
{"title":"Artificial intelligence in mitotic checkpoint modeling: transforming our understanding of cellular division through machine learning and predictive biology.","authors":"Bashar Ibrahim","doi":"10.1093/bib/bbaf729","DOIUrl":"10.1093/bib/bbaf729","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805251/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Somatic copy number aberrations (CNAs) represent a distinct class of genomic mutations associated with oncogenic effects. Over the past three decades, significant volumes of CNA data have been generated through molecular-cytogenetic and genome sequencing-based techniques. These data have been pivotal in identifying cancer-related genes and advancing research on the relationship between CNAs and histopathologically defined cancer types. However, comprehensive studies of CNA landscapes and disease parameters are challenging due to the vast diagnostic and genomic heterogeneity encountered in "pan-cancer" approaches. In this study, we introduce CNAttention, an attention-based deep multiple-instance learning method designed to comprehensively analyze CNAs across different cancers and uncover specific CNA patterns within integrated gene-level CNA profiles of 30 cancer types. CNAttention effectively learns CNA features unique to each cancer type and generates CNA signatures for 30 cancer types using attention mechanisms, highlighting the distinctiveness of their CNA landscapes. CNAttention demonstrates high accuracy and exhibits stable performance even with the incorporation of external datasets or parameter adjustments, underscoring its effectiveness in tumor identification. Expanding these signatures to cancer classification trees reveals common patterns not only among physiologically related cancer types but also among clinico-pathologically distant types, such as different cancers originating from neural crest-derived cells. The detected signatures also uncover genomic heterogeneity within individual cancer types, for instance in brain lower-grade glioma. Additional experiments with classification models underscore the efficacy of these signatures in representing various cancer types and their potential utility in clinical diagnosis.
{"title":"CNAttention: an attention-based deep multiple-instance method for uncovering copy number aberration signatures across cancers.","authors":"Ziying Yang, Michael Baudis","doi":"10.1093/bib/bbaf696","DOIUrl":"10.1093/bib/bbaf696","url":null,"abstract":"<p><p>Somatic copy number aberrations (CNAs) represent a distinct class of genomic mutations associated with oncogenic effects. Over the past three decades, significant volumes of CNA data have been generated through molecular-cytogenetic and genome sequencing-based techniques. These data have been pivotal in identifying cancer-related genes and advancing research on the relationship between CNAs and histopathologically defined cancer types. However, comprehensive studies of CNA landscapes and disease parameters are challenging due to the vast diagnostic and genomic heterogeneity encountered in \"pan-cancer\" approaches. In this study, we introduce CNAttention, an attention-based deep multiple-instance learning method designed to comprehensively analyze CNAs across different cancers and uncover specific CNA patterns within integrated gene-level CNA profiles of 30 cancer types. CNAttention effectively learns CNA features unique to each cancer type and generates CNA signatures for 30 cancer types using attention mechanisms, highlighting the distinctiveness of their CNA landscapes. CNAttention demonstrates high accuracy and exhibits stable performance even with the incorporation of external datasets or parameter adjustments, underscoring its effectiveness in tumor identification. Expanding these signatures to cancer classification trees reveals common patterns not only among physiologically related cancer types but also among clinico-pathologically distant types, such as different cancers originating from neural crest-derived cells. The detected signatures also uncover genomic heterogeneity within individual cancer types, for instance in brain lower-grade glioma. 
Additional experiments with classification models underscore the efficacy of these signatures in representing various cancer types and their potential utility in clinical diagnosis.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805253/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minhao Yao, Peixin Tian, Xihao Li, Shijia Bian, Gao Wang, Yian Gu, Ana Navas-Acien, Badri N Vardarajan, Daniel W Belsky, Gary W Miller, Andrea A Baccarelli, Zhonghua Liu
Causal mediation analysis investigates whether the effect of an exposure on an outcome operates through intermediate variables known as mediators. Although progress has been made in high-dimensional mediation analysis, current methods do not reliably control the false discovery rate (FDR) in finite samples, especially when mediators are moderately to highly correlated or follow non-Gaussian distributions. These challenges frequently arise in DNA methylation studies. We introduce CoxMDS, a multiple data splitting method that uses Cox proportional hazards models to identify putative causal mediators for survival outcomes. CoxMDS ensures finite-sample FDR control even in the presence of correlated or non-Gaussian mediators. Through simulations, CoxMDS is shown to maintain FDR control and achieve higher statistical power compared with existing approaches. In applications to DNA methylation data with survival outcomes, CoxMDS identified eight CpG sites in The Cancer Genome Atlas that are consistent with the hypothesis that DNA methylation may mediate the effect of smoking on lung cancer survival, and two CpG sites in the Alzheimer's Disease Neuroimaging Initiative that are consistent with the hypothesis that DNA methylation may mediate the effect of smoking on time to Alzheimer's disease conversion.
{"title":"CoxMDS: multiple data splitting for high-dimensional mediation analysis with survival outcomes in epigenome-wide studies.","authors":"Minhao Yao, Peixin Tian, Xihao Li, Shijia Bian, Gao Wang, Yian Gu, Ana Navas-Acien, Badri N Vardarajan, Daniel W Belsky, Gary W Miller, Andrea A Baccarelli, Zhonghua Liu","doi":"10.1093/bib/bbaf730","DOIUrl":"10.1093/bib/bbaf730","url":null,"abstract":"<p><p>Causal mediation analysis investigates whether the effect of an exposure on an outcome operates through intermediate variables known as mediators. Although progress has been made in high-dimensional mediation analysis, current methods do not reliably control the false discovery rate (FDR) in finite samples, especially when mediators are moderately to highly correlated or follow non-Gaussian distributions. These challenges frequently arise in DNA methylation studies. We introduce CoxMDS, a multiple data splitting method that uses Cox proportional hazards models to identify putative causal mediators for survival outcomes. CoxMDS ensures finite-sample FDR control even in the presence of correlated or non-Gaussian mediators. Through simulations, CoxMDS is shown to maintain FDR control and achieve higher statistical power compared with existing approaches. 
In applications to DNA methylation data with survival outcomes, CoxMDS identified eight CpG sites in The Cancer Genome Atlas that are consistent with the hypothesis that DNA methylation may mediate the effect of smoking on lung cancer survival, and two CpG sites in the Alzheimer's Disease Neuroimaging Initiative that are consistent with the hypothesis that DNA methylation may mediate the effect of smoking on time to Alzheimer's disease conversion.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805255/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}