Inas Bosch, Barbara Gravel, Alexandre Renaux, Ann Nowé, Maris Laan, Tom Lenaerts
Identifying the potential oligogenic causes of rare diseases remains a challenge, notwithstanding the advancements made in the last decade. While a variety of predictive and ranking approaches have been proposed, their precision remains limited, as only a small number of high-quality training cases are available and it remains difficult to know which features may be most relevant for the design of new predictors. We hypothesize here that structured biological information, which integrates various relevant biological networks and ontologies in a single heterogeneous knowledge graph, can make a difference, as it allows a relevant genetic representation to be learned through knowledge graph embedding (KGE) methods. We perform an exhaustive benchmark in which we assess the performance of various state-of-the-art embedding models for the task of identifying potentially pathogenic gene pairs. The results show that these KGE models provide highly accurate predictions, leading to an Area Under the Precision-Recall Curve of up to $0.93$, which also represents a significant advance over previous approaches for predicting gene pairs involved in oligogenic diseases. We show nonetheless that care needs to be taken in the cross-validation when using embeddings, as data leakage between folds in embedding space yields overly optimistic results. Further evaluation of the methods on a holdout set, as well as on a group of new male infertility cases, shows that three Translational Distance models (TransE, MurE, and RotatE) and two of the Semantic Matching models (DistMult and QuatE) provide the best results. The analysis is concluded by comparing all known gene combinations for these top-ranking models, examining their similarities and differences. Overall, KGE provide a predictive advancement, but new steps will need to be taken to generate explanations as to why the pairs are relevant for oligogenic diseases.
{"title":"Benchmarking knowledge graph embedding models for the prediction of oligogenic combinations.","authors":"Inas Bosch, Barbara Gravel, Alexandre Renaux, Ann Nowé, Maris Laan, Tom Lenaerts","doi":"10.1093/bib/bbaf712","DOIUrl":"10.1093/bib/bbaf712","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12790627/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145948436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
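The two model families named in the benchmarking abstract above can be illustrated with their textbook scoring functions. The following is a minimal sketch, not the authors' pipeline: the three-dimensional embeddings, the relation vector, and the names `gene_a`, `gene_b`, and `relation` are hypothetical values chosen for illustration. TransE (a Translational Distance model) scores a triple by how close head + relation lands to the tail, while DistMult (a Semantic Matching model) uses a trilinear product.

```python
import math


def transe_score(h, r, t):
    """TransE plausibility: negative Euclidean distance between h + r and t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def distmult_score(h, r, t):
    """DistMult plausibility: trilinear product sum_i h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


# Hypothetical embeddings (assumed values, not taken from the paper).
gene_a = [0.5, 0.0, 1.0]
gene_b = [1.0, 0.5, 1.0]
relation = [0.5, 0.5, 0.0]

# Higher scores mean a more plausible (head, relation, tail) triple.
print("TransE  :", transe_score(gene_a, relation, gene_b))
print("DistMult:", distmult_score(gene_a, relation, gene_b))
```

One design point worth noting: the DistMult score is unchanged when head and tail are swapped, which suits unordered gene pairs, whereas the TransE score generally depends on direction.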
Mitotic checkpoints safeguard genomic integrity by orchestrating the precise segregation of chromosomes during cell division. Yet their complex, nonlinear dynamics have long defied full understanding through traditional experimental and computational approaches. In recent years, artificial intelligence (AI) has begun to transform this landscape. Machine learning and deep learning methods now achieve substantial accuracies in predicting cellular behaviors and uncovering novel regulatory mechanisms within checkpoint networks. Advances include transformer architectures capable of predicting spindle assembly checkpoint engagement with >95% accuracy, graph neural networks that decode kinetochore-microtubule dynamics at subpixel resolution, and hybrid AI-mechanistic models that reveal previously hidden feedback circuits. By integrating multi-omics data and bridging molecular mechanisms with clinical applications, AI-driven approaches are opening significant opportunities for precision medicine in cancer and other proliferative diseases. This review synthesizes emerging computational frameworks, highlights transformative AI-driven discoveries, and proposes a roadmap for developing predictive, personalized models of mitotic checkpoint control, charting a path from computational insight to clinical impact.
{"title":"Artificial intelligence in mitotic checkpoint modeling: transforming our understanding of cellular division through machine learning and predictive biology.","authors":"Bashar Ibrahim","doi":"10.1093/bib/bbaf729","DOIUrl":"10.1093/bib/bbaf729","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805251/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Somatic copy number aberrations (CNAs) represent a distinct class of genomic mutations associated with oncogenetic effects. Over the past three decades, significant volumes of CNA data have been generated through molecular-cytogenetic and genome sequencing-based techniques. These data have been pivotal in identifying cancer-related genes and advancing research on the relationship between CNAs and histopathologically defined cancer types. However, comprehensive studies of CNA landscapes and disease parameters are challenging due to the vast diagnostic and genomic heterogeneity encountered in "pan-cancer" approaches. In this study, we introduce CNAttention, an attention-based deep multiple instance learning method designed to comprehensively analyze CNAs across different cancers and uncover specific CNA patterns within integrated gene-level CNA profiles of 30 cancer types. CNAttention effectively learns CNA features unique to each cancer type and generates CNA signatures for 30 cancer types using attention mechanisms, highlighting the distinctiveness of their CNA landscapes. CNAttention demonstrates high accuracy and exhibits stable performance even with the incorporation of external datasets or parameter adjustments, underscoring its effectiveness in tumor identification. Expanding these signatures to cancer classification trees reveals common patterns not only among physiologically related cancer types but also among clinico-pathologically distant types, such as different cancers originating from neural crest-derived cells. The detected signatures also uncover genomic heterogeneity within individual cancer types, for instance in brain lower grade glioma. Additional experiments with classification models underscore the efficacy of these signatures in representing various cancer types and their potential utility in clinical diagnosis.
{"title":"CNAttention: an attention-based deep multiple-instance method for uncovering copy number aberration signatures across cancers.","authors":"Ziying Yang, Michael Baudis","doi":"10.1093/bib/bbaf696","DOIUrl":"10.1093/bib/bbaf696","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805253/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
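The attention-based multiple instance learning idea in the CNAttention abstract above can be sketched with a commonly used attention pooling layer. This is an assumption about the general technique, not the CNAttention architecture itself: the abstract does not give the layer definition, and the bag contents, the matrix `V`, and the vector `w` below are hypothetical. Each instance receives a score w · tanh(V h), the scores are normalized with a softmax, and the bag embedding is the weighted sum of the instances. The weights are what make the result interpretable as a signature.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, V, w):
    """Attention pooling over one bag.

    instances: list of K feature vectors of length D
    V:         L x D projection matrix (list of rows), assumed learned
    w:         length-L scoring vector, assumed learned
    Returns (attention weights, pooled bag embedding).
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(instances[0])
    bag = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return weights, bag


# Hypothetical bag: three instances with two features each.
instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, -1.0]]  # hypothetical 1 x 2 projection
w = [2.0]          # hypothetical scoring vector

weights, bag = attention_pool(instances, V, w)
print("attention weights:", weights)
print("bag embedding    :", bag)
```

In a trained model `V` and `w` would be fitted jointly with the classifier; here they are fixed by hand only to show how the weights rank instances.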
Tandem repeats (TRs) play essential roles in a variety of biological functions, and their abnormal expansions are significantly implicated in phenotypic variation and cause >60 human diseases. However, long TR regions cannot be reliably detected using short-read sequencing, whereas long-read sequencing enables accurate genome-wide detection of TRs. In recent years, various computational tools have been developed to detect and genotype TRs from long-read data. In this survey, we systematically categorize and review 39 computational tools designed for TR detection, visualization and functional interpretation. We discuss their strengths and limitations for TR detection from long-read sequencing data, highlighting current challenges and future directions to advance long-read TR detection methodologies.
{"title":"Computational tools for tandem repeat detection using long-read sequencing.","authors":"Qian Liu, Jincheng Li","doi":"10.1093/bib/bbag031","DOIUrl":"10.1093/bib/bbag031","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12874885/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146123819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
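To make the notion of a tandem repeat in the survey abstract above concrete, the following toy scanner reports runs of exactly repeated units in a sequence string. It is a deliberately simplified sketch and not one of the 39 surveyed tools: it assumes error-free sequence and perfect copies, whereas real long-read methods must tolerate sequencing errors and interrupted repeats. The function name `find_tandem_repeats` and its parameters `max_unit` and `min_copies` are hypothetical.

```python
def find_tandem_repeats(seq, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats in seq.

    At each position, every unit length from 1 to max_unit is tried and
    the candidate covering the most bases is kept; the scan then jumps
    past that repeat.
    """
    hits = []
    i, n = 0, len(seq)
    while i < n:
        best = None
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # not enough sequence left for a unit of this length
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            span = copies * k
            if copies >= min_copies and (best is None or span > best[2] * len(best[1])):
                best = (i, unit, copies)
        if best is not None:
            hits.append(best)
            i += best[2] * len(best[1])
        else:
            i += 1
    return hits


# Hypothetical read fragment: a CAG repeat flanked by unique sequence.
example = "ACGT" + "CAG" * 5 + "TTGA"
print(find_tandem_repeats(example))
```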
The rapid advancement of artificial intelligence has positioned drug-target interaction (DTI) prediction as a promising approach in drug screening and drug discovery. Recent research has attempted to use pharmacological multimodal information to increase prediction accuracy. However, existing approaches are limited in fully utilizing more than three modalities, primarily due to information loss during the modality integration process. To overcome this challenge, we propose TriDTI, a novel framework that incorporates three modalities for both drugs and proteins. Specifically, TriDTI integrates structural, sequential, and relational modalities from both entities. To mitigate information loss during integration, we employ projection and cross-modal contrastive learning for modality alignment. Furthermore, we design a fusion strategy that combines soft attention and cross-attention to effectively integrate multimodal representations. Extensive experiments on three benchmark datasets demonstrate that TriDTI consistently achieves superior performance to existing state-of-the-art approaches in DTI prediction. Moreover, TriDTI exhibits a robust generalization ability across three challenging cold-start scenarios, effectively predicting interactions involving novel drugs, targets, and bindings. These results highlight the potential of TriDTI as a robust and practical framework for facilitating drug discovery. The source code and datasets are publicly accessible at https://github.com/knhc1234/TriDTI.
{"title":"TriDTI: tri-modal representation learning with cross-modal alignment for drug-target interaction prediction.","authors":"Gwang-Hyeon Yun, Jong-Hoon Park, Young-Rae Cho","doi":"10.1093/bib/bbag034","DOIUrl":"10.1093/bib/bbag034","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12874921/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146123843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ziqi Yang, Ziyang Song, Shadi Zabad, Marc-André Legault, Yue Li
Phenome-wide association studies rely on disease definitions derived from diagnostic codes, often failing to leverage the full richness of electronic health records (EHR). We present MixEHR-SAGE, a PheCode-guided multi-modal topic model that integrates diagnoses, procedures, and medications to enhance phenotyping from large-scale EHRs. By combining expert-informed priors with probabilistic inference, MixEHR-SAGE identifies over 1000 interpretable phenotype topics from UK Biobank data. Applied to 350 000 individuals with high-quality genetic data, MixEHR-SAGE-derived risk scores accurately predict incident type 2 diabetes (T2D) and leukemia diagnoses. Subsequent genome-wide association studies using these continuous risk scores uncovered novel disease-associated loci, including PPP1R15A for T2D and JMJD6/SRSF2 for leukemia, that were missed by traditional binary case definitions. These results highlight the potential of probabilistic phenotyping from multi-modal EHRs to improve genetic discovery. The MixEHR-SAGE software is publicly available at: https://github.com/li-lab-mcgill/MixEHR-SAGE.
{"title":"PheCode-guided multi-modal topic modeling of electronic health records improves disease incidence prediction and GWAS discovery from UK Biobank.","authors":"Ziqi Yang, Ziyang Song, Shadi Zabad, Marc-André Legault, Yue Li","doi":"10.1093/bib/bbag030","DOIUrl":"10.1093/bib/bbag030","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12862981/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146103854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rare cell types in single-cell RNA sequencing (scRNA-seq) data often encode essential biological signals, such as early disease markers or key immune regulators. With advancing technologies, large-scale scRNA-seq cohorts from multiple subjects now enable population-level analyses of the prevalence, heterogeneity, and disease associations of rare cell populations. However, existing methods for rare cell detection are typically limited to single datasets and cannot effectively leverage cross-subject information. To tackle this challenge, we present BayesRare, a hierarchical Bayesian framework for population-level rare cell discovery in multi-subject scRNA-seq data. The method augments a Bayesian mixture model with a rare cluster indicator, supporting joint cell-type clustering and rare-population identification. By explicitly characterizing the statistical properties of rare cell types, BayesRare integrates evidence across subjects, quantifies uncertainty via posterior probabilities, and enables inference of group-level differences (e.g. patients versus controls). Across synthetic and three real datasets, BayesRare achieves superior precision, reduces false positives, and uncovers biologically meaningful disease-specific rare subtypes. The R package of BayesRare is available at https://github.com/yinqiaoyan/BayesRare.
{"title":"BayesRare: Bayesian mixture model for population-level rare cell type detection in multi-subject single-cell RNA sequencing data.","authors":"Yinqiao Yan, Hao Wu","doi":"10.1093/bib/bbag024","DOIUrl":"10.1093/bib/bbag024","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12867491/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146112323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ali Forooghi, Shaghayegh Sadeghi, Luis Rueda, Alioune Ngom
Molecular representation is fundamental to the field of cheminformatics, facilitating accurate prediction and exploration of molecular properties. Since the nineteenth century, methods for representing molecules have evolved significantly, with recent advances in deep learning offering state-of-the-art performance across various tasks. Among these, contrastive learning (CL) has emerged as one of the most powerful techniques for training deep learning models. CL aims to optimize the representation of similar molecules by reducing the distance between their vector embeddings, while simultaneously increasing the distance between dissimilar ones. Driven by the growing success of CL in enhancing representation learning, this paper presents the first comprehensive review dedicated to CL methods for molecular representation. We begin by surveying existing literature in the field, providing context for the evolution of molecular representation. Next, we introduce the core principles of the CL framework and examine its application to molecular representation learning tasks. Finally, we highlight the key challenges faced by CL-based approaches and discuss potential future directions for advancing molecular representation with these methods.
{"title":"A survey of contrastive learning methods in molecular representation.","authors":"Ali Forooghi, Shaghayegh Sadeghi, Luis Rueda, Alioune Ngom","doi":"10.1093/bib/bbaf731","DOIUrl":"10.1093/bib/bbaf731","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12893218/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146164172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
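The contrastive learning objective described in the survey abstract above (pull similar molecules together in embedding space, push dissimilar ones apart) is often written as an InfoNCE-style loss. The sketch below assumes that formulation; the survey covers many variants, and the two-dimensional vectors, the temperature of 0.1, and the function names are hypothetical choices for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax of the positive's similarity.

    The loss is small when the anchor is much closer to its positive than
    to any negative, and grows as negatives become more similar.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the max for a stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Hypothetical embeddings: two views of one molecule plus two other molecules.
anchor = [1.0, 0.0]
aligned_view = [0.9, 0.1]
misaligned_view = [0.0, 1.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

print("loss with aligned positive   :", info_nce(anchor, aligned_view, negatives))
print("loss with misaligned positive:", info_nce(anchor, misaligned_view, negatives))
```

In molecular settings the positive is typically an augmented or alternative view of the same molecule (for example a different modality), which is an assumption of this sketch rather than a claim about any specific surveyed method.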
Minhao Yao, Peixin Tian, Xihao Li, Shijia Bian, Gao Wang, Yian Gu, Ana Navas-Acien, Badri N Vardarajan, Daniel W Belsky, Gary W Miller, Andrea A Baccarelli, Zhonghua Liu
Causal mediation analysis investigates whether the effect of an exposure on an outcome operates through intermediate variables known as mediators. Although progress has been made in high-dimensional mediation analysis, current methods do not reliably control the false discovery rate (FDR) in finite samples, especially when mediators are moderately to highly correlated or follow non-Gaussian distributions. These challenges frequently arise in DNA methylation studies. We introduce CoxMDS, a multiple data splitting method that uses Cox proportional hazards models to identify putative causal mediators for survival outcomes. CoxMDS ensures finite-sample FDR control even in the presence of correlated or non-Gaussian mediators. Through simulations, CoxMDS is shown to maintain FDR control and achieve higher statistical power compared with existing approaches. In applications to DNA methylation data with survival outcomes, CoxMDS identified eight CpG sites in The Cancer Genome Atlas that are consistent with the hypothesis that DNA methylation may mediate the effect of smoking on lung cancer survival, and two CpG sites in the Alzheimer's Disease Neuroimaging Initiative that are consistent with the hypothesis that DNA methylation may mediate the effect of smoking on time to Alzheimer's disease conversion.
{"title":"CoxMDS: multiple data splitting for high-dimensional mediation analysis with survival outcomes in epigenome-wide studies.","authors":"Minhao Yao, Peixin Tian, Xihao Li, Shijia Bian, Gao Wang, Yian Gu, Ana Navas-Acien, Badri N Vardarajan, Daniel W Belsky, Gary W Miller, Andrea A Baccarelli, Zhonghua Liu","doi":"10.1093/bib/bbaf730","DOIUrl":"10.1093/bib/bbaf730","url":null,"PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12805255/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145970588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stefano Marangoni, Federica Furia, Debora Charrance, Agata Fant, Salvatore Di Dio, Sara Trova, Giovanni Spirito, Francesco Musacchia, Alessandro Coppe, Stefano Gustincich, Manuela Vecchi, Fabio Landuzzi, Andrea Cavalli
Next-generation sequencing (NGS) has revolutionized genome biology by enabling rapid whole-genome sequencing (WGS) and driving its adoption in research and clinical settings. However, the high-throughput nature of NGS and the complexity of downstream analyses demand robust computational solutions. We present GeNePi, a modular bioinformatic pipeline for efficient and accurate analysis of short paired-end WGS reads. Built on the Nextflow framework, GeNePi integrates graphics processing unit (GPU)-accelerated algorithms from NVIDIA Clara Parabricks to enable high-performance variant discovery. The pipeline supports multiple workflow configurations and automates the detection of a broad range of genomic variants: single-nucleotide variants and small insertions/deletions via GPU-accelerated HaplotypeCaller, copy number variants (CNVs) using CNVkit, and structural variants through a consensus approach combining Manta, Lumpy, BreakDancer, and CNVnator. Additionally, GeNePi incorporates MELT for the detection of mobile element insertions, providing a comprehensive framework for variant discovery and characterization. Benchmarking on synthetic and real datasets demonstrates high accuracy and performance comparable to state-of-the-art tools such as the Genome Analysis ToolKit (GATK), establishing GeNePi as a scalable solution for comprehensive WGS analysis. These features make GeNePi a valuable instrument for large-scale analyses in both research and clinical contexts, representing a key step towards the establishment of National Centers for Computational and Technological Medicine.
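The abstract mentions a consensus approach across several structural-variant callers but does not describe the rule used. The sketch below shows one common way such a consensus can be formed: a call is kept only if calls from at least two different callers agree on chromosome and variant type and overlap reciprocally by at least 50%. The `SVCall` record, the thresholds, and the function names are hypothetical and chosen for illustration; they are not taken from GeNePi.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    """Minimal structural-variant record (hypothetical, for illustration)."""
    caller: str
    chrom: str
    start: int
    end: int
    svtype: str


def reciprocal_overlap(a: SVCall, b: SVCall) -> float:
    """Fraction of the longer-relative overlap shared by both calls (0.0 to 1.0)."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def consensus_calls(calls, min_callers=2, min_overlap=0.5):
    """Keep calls supported by at least `min_callers` distinct callers.

    Every call supports itself, so a call needs a sufficiently overlapping
    call from at least one other caller when min_callers is 2. Supporting
    calls are returned as-is; merging them into one record is left out.
    """
    kept = []
    for call in calls:
        supporters = {
            other.caller
            for other in calls
            if reciprocal_overlap(call, other) >= min_overlap
        }
        if len(supporters) >= min_callers:
            kept.append(call)
    return kept


# Made-up calls: two callers agree on a deletion, one duplication is unsupported.
calls = [
    SVCall("Manta", "chr1", 1000, 2000, "DEL"),
    SVCall("Lumpy", "chr1", 1050, 2050, "DEL"),
    SVCall("CNVnator", "chr2", 500, 900, "DUP"),
]
supported = consensus_calls(calls)
```

Requiring agreement between callers trades some sensitivity for fewer false positives; the overlap threshold and the minimum number of callers are the two knobs that control that trade-off in this sketch.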
GeNePi: a graphics processing unit enhanced next-generation bioinformatics pipeline for whole-genome sequencing analysis. Stefano Marangoni, Federica Furia, Debora Charrance, Agata Fant, Salvatore Di Dio, Sara Trova, Giovanni Spirito, Francesco Musacchia, Alessandro Coppe, Stefano Gustincich, Manuela Vecchi, Fabio Landuzzi, Andrea Cavalli. Briefings in bioinformatics 27(1), published 2026-01-07. doi:10.1093/bib/bbag001. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12832024/pdf/