Pub Date: 2026-02-28 | DOI: 10.1093/bioinformatics/btag032
Alex Rodriguez, Youngdae Kim, Tarak Nath Nandi, Karl Keat, Rachit Kumar, Mitchell Conery, Rohan Bhukar, Molei Liu, John Hessington, Ketan Maheshwari, Edmon Begoli, Georgia Tourassi, Sumitra Muralidhar, Pradeep Natarajan, Benjamin F Voight, Kelly Cho, John Michael Gaziano, Scott M Damrauer, Katherine P Liao, Wei Zhou, Jennifer E Huffman, Anurag Verma, Ravi K Madduri
Motivation: Genome-wide association studies (GWAS) at biobank scale are computationally intensive, especially for admixed populations requiring robust statistical models. SAIGE is a widely used method for generalized linear mixed-model GWAS but is limited by its CPU-based implementation, making phenome-wide association studies impractical for many research groups.
Results: We developed SAIGE-GPU, a GPU-accelerated version of SAIGE that replaces CPU-intensive matrix operations with GPU-optimized kernels. The core innovation is distributing genetic relationship matrix calculations across GPUs and communication layers. Applied to 2068 phenotypes from 635 969 participants in the Million Veteran Program, including diverse and admixed populations, SAIGE-GPU achieved a 5-fold speedup in mixed model fitting on supercomputing infrastructure and cloud platforms. We further optimized the variant association testing step through multi-core and multi-trait parallelization. Deployed on Google Cloud Platform and Azure, the method provided substantial cost and time savings.
Availability and implementation: Source code and binaries are available for download at https://github.com/saigegit/SAIGE/tree/SAIGE-GPU-1.3.3. A code snapshot is archived at Zenodo for reproducibility (DOI: 10.5281/zenodo.17642591). SAIGE-GPU is implemented in R/C++, runs on Linux systems, and is distributed in a containerized format for use across HPC and cloud environments.
Title: SAIGE-GPU: accelerating genome- and phenome-wide association studies using GPUs. Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12960912/pdf/
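The bottleneck SAIGE-GPU targets is dense linear algebra over the genetic relationship matrix (GRM). As a point of reference, here is a minimal pure-Python sketch of the standard GRM computation, GRM = Z Z^T / M over a column-standardized N x M genotype matrix Z. This illustrates the operation being offloaded to GPUs, not SAIGE-GPU's actual kernels; the toy genotype matrix is hypothetical.

```python
import math

def standardize(genotypes):
    """Column-standardize an N x M matrix of 0/1/2 genotype dosages."""
    n, m = len(genotypes), len(genotypes[0])
    cols = []
    for j in range(m):
        col = [genotypes[i][j] for i in range(n)]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        sd = math.sqrt(var) if var > 0 else 1.0  # guard monomorphic variants
        cols.append([(x - mean) / sd for x in col])
    # transpose back to N x M
    return [[cols[j][i] for j in range(m)] for i in range(n)]

def grm(genotypes):
    """GRM[i][k] = (1/M) * sum_j Z[i][j] * Z[k][j] -- the dense product
    a GPU BLAS kernel would compute in one call."""
    z = standardize(genotypes)
    n, m = len(z), len(z[0])
    return [[sum(z[i][j] * z[k][j] for j in range(m)) / m
             for k in range(n)] for i in range(n)]

# Hypothetical 3-sample x 3-variant toy input
G = [[0, 1, 2], [1, 1, 0], [2, 0, 1]]
K = grm(G)
```

In practice the same product is a single `gemm` call on a GPU, which is why distributing it across devices dominates the speedup.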
Pub Date: 2026-02-28 | DOI: 10.1093/bioinformatics/btag072
Sanjoy Dey, Zhaonan Sun, John Warner, Eileen Koski, Elif Eyigoz, Swati Sathe, Cristina Sampaio, Jianying Hu
Motivation: There are many diseases with established genetic factors, such as Huntington's disease (HD), that are characterized by variable rates of progression. However, beyond the contribution of the known genetic factors - in this case the Huntingtin (HTT) gene - the impact of the full human genome on the natural progression of such diseases throughout a patient's life remains largely unknown. The increased availability of genome-wide association (GWA) data in HD gene expansion carriers (HDGECs), combined with clinical assessment scores for the same set of patients, provides an ideal opportunity to assess the potentially broader genetic impact on the natural progression of HD.
Results: We present a genetics-driven, probabilistic disease progression model designed to identify and investigate the ways in which a range of genetic factors affect the natural progression of HD. When applied to a clinico-genomic HD dataset, our model identified several single nucleotide polymorphisms (SNPs) with previously unreported effects on disease progression that act at distinct stages and with varying magnitudes. This discovery may shed light on the potential mechanistic impact of previously unidentified genes on HD that may have implications for clinical management. As increasing amounts of GWA data become available more generally, we anticipate that this modeling framework will be broadly applicable to other diseases with strong genetic components.
Availability and implementation: The source code for IHDPM is available at https://github.com/BiomedSciAI/IHDPM.
Title: From genes to trajectories: mapping genetic influences on Huntington's disease progression. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13003314/pdf/
Pub Date: 2026-02-28 | DOI: 10.1093/bioinformatics/btag076
Tien-Cuong Bui, Injae Chung, Wonjun Lee, Junsu Ko, Juyong Lee
Motivation: Predicting immunoglobulin-antigen (Ig-Ag) binding remains a significant challenge due to the paucity of experimentally resolved complexes and the limited accuracy of de novo Ig structure prediction.
Results: We introduce IgPose, a generalizable framework for Ig-Ag pose identification and scoring, built on a generative data-augmentation pipeline. To mitigate data scarcity, we constructed the Structural Immunoglobulin Decoy Database (SIDD), a comprehensive repository of high-fidelity synthetic decoys. IgPose integrates equivariant graph neural networks, ESM-2 embeddings, and gated recurrent units to capture both geometric and evolutionary features. We implemented interface-focused k-hop sampling with biologically guided pooling to enhance generalization across diverse interfaces. The framework comprises two sub-networks, IgPoseClassifier for binding pose discrimination and IgPoseScore for DockQ score estimation, and achieves robust performance on curated internal test sets and the CASP-16 benchmark compared with physics-based and deep learning baselines. IgPose serves as a versatile computational tool for high-throughput antibody discovery pipelines by providing accurate pose filtering and ranking.
Availability and implementation: IgPose is available on GitHub (https://github.com/arontier/igpose).
Title: IgPose: a generative data-augmented pipeline for robust immunoglobulin-antigen binding prediction. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12989135/pdf/
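The interface-focused k-hop sampling described above can be illustrated with a small breadth-first traversal: starting from residues at the Ig-Ag interface, keep only graph nodes within k hops, so downstream layers see a subgraph centered on the binding site. This is a generic sketch of the idea, not the IgPose code; the adjacency dictionary and residue labels are hypothetical.

```python
from collections import deque

def k_hop_subgraph(adj, seeds, k):
    """Return the set of nodes within k hops of any seed node.

    adj: dict mapping node -> list of neighbor nodes
    seeds: iterable of interface residues to expand from
    """
    visited = {s: 0 for s in seeds}  # node -> hop distance
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if visited[node] == k:
            continue  # do not expand beyond k hops
        for nb in adj.get(node, []):
            if nb not in visited:
                visited[nb] = visited[node] + 1
                queue.append(nb)
    return set(visited)

# Hypothetical residue-contact graph: a chain A-B-C-D
contacts = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
sub = k_hop_subgraph(contacts, ["A"], k=2)
```

Sampling a fixed-radius neighborhood like this keeps the input size bounded regardless of antigen size, which is one plausible reason it helps generalization across diverse interfaces.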
Pub Date: 2026-02-28 | DOI: 10.1093/bioinformatics/btag038
Yanan Zhao, Ting-Fang Lee, Boyan Zhou, Chan Wang, Ann Marie Schmidt, Mengling Liu, Huilin Li, Jiyuan Hu
Motivation: Large-scale prospective cohort studies collect longitudinal biospecimens alongside time-to-event outcomes to investigate biomarker dynamics in relation to disease risk. The nested case-control (NCC) design provides a cost-effective alternative to full cohort biomarker studies while preserving statistical efficiency. Despite advances in joint modeling for longitudinal and time-to-event outcomes, few approaches address the unique challenges posed by NCC sampling, non-normally distributed biomarkers, and competing survival outcomes.
Results: Motivated by the TEDDY study, we propose JM-NCC, a joint modeling framework designed for NCC studies with competing events. It integrates a generalized linear mixed-effects model for potentially non-normally distributed biomarkers with a cause-specific hazard model for competing risks. Two estimation methods are developed: fJM-NCC leverages NCC sub-cohort longitudinal biomarker data together with full-cohort survival and clinical metadata, while wJM-NCC uses only NCC sub-cohort data. Both simulation studies and an application to the TEDDY microbiome dataset demonstrate the robustness and efficiency of the proposed methods.
Availability and implementation: Software is available at https://github.com/Zhaoyn-oss/JMNCC and archived on Zenodo at https://zenodo.org/records/18199759 (DOI: 10.5281/zenodo.18199759).
Title: Joint modeling of longitudinal biomarker and survival outcomes with the presence of competing risk in the nested case-control studies with application to the TEDDY microbiome dataset. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13005730/pdf/
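To make the competing-risks ingredient concrete, below is a minimal sketch of the nonparametric cause-specific cumulative hazard for event type e: H_e(t) = sum over event times s <= t of d_e(s) / n(s), where d_e(s) counts type-e events at s and n(s) is the risk-set size. The joint model in the paper couples regression hazards of this kind with a mixed model for the biomarker; this standalone estimator is for illustration only, and the toy data are hypothetical.

```python
def cause_specific_cum_hazard(times, events, cause, t):
    """Nelson-Aalen-style cause-specific cumulative hazard at time t.

    times:  observed event/censoring times
    events: 0 for censored, otherwise an integer cause label
    cause:  the event type whose hazard we accumulate
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    h = 0.0
    i = 0
    while i < len(data):
        s = data[i][0]
        if s > t:
            break
        # everyone leaving the risk set at time s, and type-`cause` events among them
        leaving = [e for (u, e) in data if u == s]
        d_cause = sum(1 for e in leaving if e == cause)
        h += d_cause / n_at_risk
        n_at_risk -= len(leaving)
        i += len(leaving)
    return h

# Hypothetical data: 4 subjects, causes 1 and 2 competing, one censoring (0)
H1 = cause_specific_cum_hazard([1, 2, 3, 4], [1, 2, 1, 0], cause=1, t=3)
```

Note how the cause-2 event at time 2 still shrinks the risk set for cause 1: that is exactly the competing-risks bookkeeping the cause-specific model formalizes.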
Pub Date: 2026-02-28 | DOI: 10.1093/bioinformatics/btag080
Shangjin Han, Dongsup Kim
Motivation: Understanding temporal gene expression is fundamental in the study of cellular development and differentiation. In practice, temporal single-cell datasets tend to contain only a limited number of measured time points, which are often unevenly spaced, resulting in irregular intervals between observations due to experimental constraints. Existing methods typically address these intervals by sequentially predicting one time point after another, yet lack mechanisms to explicitly model time intervals, leading to error accumulation.
Results: In this work, we introduce scMix, a language-model-based framework for predicting single-cell gene expression that enables prediction from multiple historical time points. We build scMix on the Receptance Weighted Key Value architecture and use its time-decay mechanism to model temporal dependencies. Moreover, scMix introduces a delta-time mechanism that allows the model to bypass unmeasured time points, reducing error accumulation and improving robustness. In addition, we incorporate a trend regularization strategy to enhance the temporal coherence of predicted gene expression trajectories. scMix demonstrates state-of-the-art performance in predicting gene expression at unmeasured time points, surpassing existing methods, and also achieves strong results on downstream tasks.
Availability and implementation: The code used for this study is available at https://doi.org/10.5281/zenodo.18287184.
Title: scMix: learning temporal dynamics of gene expression under irregular time intervals. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12970592/pdf/
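The delta-time idea can be sketched in a few lines: weight each historical observation by exp(-lambda * delta_t), so measurements taken longer ago, or separated by skipped unmeasured time points, contribute less to the prediction. The decay rate `lam` and the weighted-average predictor below are illustrative stand-ins for scMix's learned mechanism, not its implementation.

```python
import math

def delta_time_predict(history, target_t, lam=0.5):
    """Predict an expression value at target_t from irregularly
    spaced history = [(time, value), ...] via exponential time decay.

    The key property: the weight depends on the continuous gap
    (target_t - t), not on how many discrete steps lie in between.
    """
    weights = [math.exp(-lam * (target_t - t)) for t, _ in history]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, history)) / total

# Hypothetical gene measured at t=0 and t=2; predict at t=2:
# the recent point dominates, the old one still contributes.
p = delta_time_predict([(0, 1.0), (2, 3.0)], target_t=2)
```

Because the weight is a function of elapsed time rather than step count, there is no per-step prediction to chain through an unmeasured gap, which is how a delta-time formulation sidesteps error accumulation.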
Pub Date: 2026-02-28 | DOI: 10.1093/bioinformatics/btag074
Wenjing Song, Yesen Sun, Le Ou-Yang
Motivation: Accurate cancer subtyping is critically important for cancer treatment due to significant molecular heterogeneity. While existing methods with multi-omics integration have achieved some success in cancer subtype identification by leveraging the rich information provided by multi-omics data, most approaches remain limited by an overemphasis on cross-omics consistency at the expense of intra-omics specificity. Furthermore, a two-step scheme is often adopted to extract cluster structure from a consistency matrix or a continuous indicator matrix by k-means, which inevitably leads to information loss and unstable clusters.
Results: To overcome these issues, we propose seOMLR, a one-step multi-view latent representation method with self-weighted ensemble learning for cancer subtyping. Using relaxed exclusivity constraints and consistency regularization terms, seOMLR exploits the specificity and consistency of multi-omics data by building a sparse low-rank self-representation framework. Simultaneously, a self-weighted ensemble strategy adaptively incorporates prior subtyping information from other methods, indirectly promoting specificity and consistency learning. Moreover, the discrete clustering structure is extracted directly via spectral rotation, avoiding information loss and cluster instability. Through joint iterative optimization of fusion and clustering, seOMLR enhances subtyping accuracy. Experiments on both simulated datasets and eight real multi-omics cancer datasets from TCGA demonstrate that seOMLR outperforms competing methods, achieving efficient multi-omics data fusion and providing a computational framework to support cancer subtyping research.
Availability and implementation: Supplementary data are available at Bioinformatics online.
Title: SeOMLR: one-step multi-view latent representation with self-weighted ensemble learning for multi-omics cancer subtyping. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12980331/pdf/
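As a rough illustration of self-weighted ensembling, consider combining co-association matrices from several base subtyping methods, re-weighting each by its agreement with the current consensus and iterating. The inverse-squared-distance update below is a common self-weighting heuristic chosen here for concreteness; it is an assumption, not seOMLR's actual objective, and the toy matrices are hypothetical.

```python
def self_weighted_consensus(mats, iters=10):
    """mats: list of n x n co-association matrices (one per base method).
    Returns (weights, consensus): methods closer to the consensus get
    larger weights, and the consensus is their weighted average."""
    n = len(mats[0])
    k = len(mats)
    w = [1.0 / k] * k  # start from uniform weights
    for _ in range(iters):
        # consensus = weighted average of the base matrices
        c = [[sum(w[m] * mats[m][i][j] for m in range(k))
              for j in range(n)] for i in range(n)]
        # squared Frobenius distance of each base matrix to the consensus
        d = [sum((mats[m][i][j] - c[i][j]) ** 2
                 for i in range(n) for j in range(n)) for m in range(k)]
        # re-weight: smaller distance -> larger weight (normalized)
        inv = [1.0 / (x + 1e-9) for x in d]
        s = sum(inv)
        w = [x / s for x in inv]
    return w, c

# Two agreeing base methods plus one contrarian outlier
mats = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [1.0, 0.0]]]
weights, consensus = self_weighted_consensus(mats)
```

The point of self-weighting is that no weight hyperparameter is hand-tuned per method: disagreement with the emerging consensus automatically down-weights an unreliable base result.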
Pub Date: 2026-02-28 | DOI: 10.1093/bioinformatics/btag102
Marie Brinkmann, Michael Bonelli, Anela Tosevska
Motivation: Enrichment analysis across multiple databases often results in a high level of redundancy due to overlapping terms, complicating the interpretation of biological data. To address this, we developed SummArIzeR, an R package that clusters and annotates enrichment results across multiple databases, enabling fast, intuitive interpretation and comparison across multiple conditions. SummArIzeR clusters enrichment results based on shared genes, calculates a pooled P-value for each cluster, and facilitates cluster annotation using large language models. It further allows an easily interpretable visualization of the results.
Results: Compared to existing tools, SummArIzeR provides unbiased and fast cluster annotation using large language models. We demonstrate that SummArIzeR achieves clustering comparable to manual curation while offering superior grouping based on shared underlying genes.
Availability and implementation: The SummArIzeR package is available as an open-source R package, with a comprehensive user manual provided in its GitHub repository: https://github.com/bonellilab/SummArIzeR.
Title: SummArIzeR: simplifying cross-database enrichment result clustering and annotation via large language models. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13005729/pdf/
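The two core operations, grouping enriched terms by shared genes and pooling each cluster's P-values, can be sketched as follows. The single-linkage clustering on Jaccard gene overlap and the use of Fisher's method (statistic -2 * sum(ln p)) are assumed choices for illustration; SummArIzeR's exact clustering and pooling rules may differ, and the term/gene inputs are hypothetical.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two gene collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cluster_terms(term_genes, threshold=0.3):
    """Single-linkage clustering of enrichment terms: merge two clusters
    whenever any cross-pair of terms shares >= threshold Jaccard overlap."""
    clusters = [{name} for name in term_genes]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(jaccard(term_genes[x], term_genes[y]) >= threshold
                       for x in clusters[i] for y in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

def fisher_statistic(pvals):
    """Fisher's combined statistic -2 * sum(ln p); under the null it is
    chi-squared with 2k degrees of freedom for k P-values."""
    return -2.0 * sum(math.log(p) for p in pvals)

# Hypothetical enrichment results: t1/t2 overlap heavily, t3 is unrelated
terms = {"t1": ["a", "b"], "t2": ["a", "b", "c"], "t3": ["x"]}
groups = cluster_terms(terms)
stat = fisher_statistic([0.05, 0.05])
```

Clustering on shared genes rather than term names is what lets near-duplicate pathways from different databases collapse into one annotated group.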
Motivation: N6-methyladenine (6mA) is an important epigenetic modification of DNA that regulates biological processes such as gene expression, transcription, replication, DNA repair, and the cell cycle without altering the DNA sequence. It also plays a key role in many diseases, including cancer and autoimmune diseases. Although experimental approaches such as SMRT sequencing and methylated DNA immunoprecipitation can identify 6mA sites, they suffer from drawbacks including suboptimal sequencing quality, low signal-to-noise ratios, high costs, and time-consuming procedures. In recent years, deep learning approaches have demonstrated significant advantages in predicting 6mA sites; however, their generalization ability still requires further improvement.
Results: Inspired by the state space model Mamba, we propose a novel model for 6mA site prediction, named Mamba6mA. In Mamba6mA, we design position-specific linear layers that replace traditional convolutional layers to better capture position-specific information. Meanwhile, we construct a multi-scale feature extraction module that integrates features captured by sliding windows of different scales and feeds them into the classifier for prediction. Experimental results show that Mamba6mA achieves the best MCC on 9 out of 11 species datasets, surpassing existing state-of-the-art models. Ablation studies confirm that the position-specific linear layers and the multi-scale fusion module contribute MCC performance gains of 2.36% and 2.31%, respectively. Feature visualization analysis further reveals that the model effectively captures sequence patterns upstream and downstream of 6mA sites, providing a new technical approach for studying epigenetic modification mechanisms.
Availability and implementation: The source code for Mamba6mA is available at: https://github.com/XploreAI-Lab/Mamba6mA.
{"title":"Mamba6mA: a Mamba-based DNA N6-methyladenine site prediction model.","authors":"Qi Zhao, Zhen Zhang, Tingwei Chen, Qian Mao, Haoxuan Shi, Jingjing Chen, Zheng Zhao, Xiaoya Fan","doi":"10.1093/bioinformatics/btag060","DOIUrl":"10.1093/bioinformatics/btag060","url":null,"abstract":"<p><strong>Motivation: </strong>N6-methyladenine (6 mA) is an important epigenetic modification of DNA that regulates biological processes such as gene expression, transcription, replication, DNA repair, and cell cycle without altering the DNA sequence. It also plays a key role in many diseases including cancer and autoimmune diseases. Although experimental approaches such as SMRT sequencing and methylated DNA immunoprecipitation can identify 6 mA sites, they suffer from drawbacks including suboptimal sequencing quality, low signal-to-noise ratios, high costs, and time-consuming procedures. In recent years, deep learning approaches have demonstrated significant advantages in predicting 6 mA sites; however, their generalization ability still requires further improvement.</p><p><strong>Results: </strong>Inspired by the state space model Mamba, we propose a novel model for 6 mA site prediction, named Mamba6mA. In the Mamba6mA model, we design position-specific linear layers to replace traditional convolutional layers to facilitate capture specific positional information. Meanwhile, we construct a multi-scale feature extraction module and integrate features captured by sliding windows of different scales, feeding them into the classifier for prediction. Experimental results show that Mamba6mA achieves the best MCC on 9 out of 11 species datasets, surpassing existing state-of-the-art models. Ablation studies confirm that the position-specific linear layers and the multi-scale fusion module contribute MCC performance gains of 2.36% and 2.31%, respectively. 
Feature visualization analysis further reveals that the model effectively captures sequence patterns upstream and downstream of 6 mA sites providing a new technical approach for studying epigenetic modification mechanisms.</p><p><strong>Availability and implementation: </strong>The source code for Mamba6mA is available at: https://github.com/XploreAI-Lab/Mamba6mA.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12960908/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146127706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
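The two architectural ideas in the Mamba6mA abstract, position-specific linear layers and multi-scale sliding-window feature fusion, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the window length (41 nt), dimensions, and all function names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(seq):
    """One-hot encode a DNA sequence into a (len, 4) array."""
    lut = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        x[i, lut[b]] = 1.0
    return x

class PositionSpecificLinear:
    """Unlike a convolution, which shares one kernel across all positions,
    each position i gets its own weight matrix W[i], so positional
    information is encoded directly in the parameters."""
    def __init__(self, seq_len, in_dim, out_dim, rng):
        self.W = rng.normal(0, 0.1, (seq_len, in_dim, out_dim))
        self.b = np.zeros((seq_len, out_dim))

    def __call__(self, x):  # x: (seq_len, in_dim)
        return np.einsum("li,lio->lo", x, self.W) + self.b

def multi_scale_features(h, scales=(3, 5, 7)):
    """Average-pool per-position features with sliding windows of several
    widths, summarize each scale, and concatenate the results."""
    feats = []
    for w in scales:
        pooled = np.stack([h[i:i + w].mean(axis=0)
                           for i in range(len(h) - w + 1)])
        feats.append(pooled.mean(axis=0))  # one summary vector per scale
    return np.concatenate(feats)

seq = "ACGTA" * 8 + "A"          # 41-nt toy window centred on a candidate A
x = one_hot(seq)                 # (41, 4)
layer = PositionSpecificLinear(seq_len=41, in_dim=4, out_dim=8, rng=rng)
h = layer(x)                     # (41, 8) position-aware embeddings
f = multi_scale_features(h)      # (24,) fused multi-scale descriptor
```

In a full model, a descriptor like `f` would feed a downstream classifier; here it only demonstrates how per-position weights and multi-scale pooling combine.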
Motivation: Accurate detection of copy number variations (CNVs) from targeted panel sequencing remains challenging due to limited genomic coverage and pronounced sample-specific biases. Existing normalization strategies, including baseline-cohort, matched-control, and single-sample approaches, often struggle to balance noise suppression with adaptability, leading to inconsistent performance across heterogeneous samples.
Results: We present PScnv, a personalized self-normalizing framework for robust CNV detection from panel sequencing data. PScnv integrates a pre-built panel-of-normals (PoN) with sample-intrinsic stable chromosomes through ridge-regression normalization to generate individualized log2 ratio profiles with reduced systematic variation. CNVs are then identified using a hierarchical multi-phase segmentation pipeline incorporating z-score pre-partitioning, kernel-based correction, and circular binary segmentation. In 139 clinical tumor samples with orthogonal FISH validation at MET, ERBB2, and MTAP, PScnv showed improved accuracy and robustness over existing methods that do not require patient-matched normal samples, provided that a pre-built PoN cohort is available.
Availability: Source code is available for academic use at https://github.com/lvws/PScnv.
{"title":"PScnv: personalized self-normalizing CNV detection with a hierarchical multi-phase framework.","authors":"Xuwen Wang, Zhili Chang, Wansheng Lv, Akhatov Akmal, Xamidov Munis, Xunbiao Liu, Shenjie Wang, Xiaoyan Zhu, Chong Du, Shuqun Zhang, Jiayin Wang","doi":"10.1093/bioinformatics/btag099","DOIUrl":"10.1093/bioinformatics/btag099","url":null,"abstract":"<p><strong>Motivation: </strong>Accurate detection of copy number variations (CNVs) from targeted panel sequencing remains challenging due to limited genomic coverage and pronounced sample-specific biases. Existing normalization strategies, including baseline-cohort, matched-control, and single-sample approaches, often struggle to balance noise suppression with adaptability, leading to inconsistent performance across heterogeneous samples.</p><p><strong>Results: </strong>We present PScnv, a personalized self-normalizing framework for robust CNV detection from panel sequencing data. PScnv integrates a pre-built panel-of-normals (PoN) with sample-intrinsic stable chromosomes through ridge-regression normalization to generate individualized log2 ratio profiles with reduced systematic variation. CNVs are then identified using a hierarchical multi-phase segmentation pipeline incorporating z-score pre-partitioning, kernel-based correction, and circular binary segmentation. 
In 139 clinical tumor samples with orthogonal FISH validation at MET, ERBB2, and MTAP, PScnv showed improved accuracy and robustness over existing methods that do not require patient-matched normal samples, provided that a pre-built PoN cohort is available.</p><p><strong>Availability: </strong>Source code is available for academic use at https://github.com/lvws/PScnv.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13005925/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
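The normalization step described in the PScnv abstract, regressing a sample's coverage on a panel-of-normals using only its stable regions and then forming log2 ratios, can be sketched as follows. All sizes, the penalty value, and the choice of "stable" targets are invented for this toy example and do not reflect PScnv's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy panel: 200 targets, PoN of 20 normal samples (sizes are illustrative).
n_targets, n_normals = 200, 20
pon = rng.lognormal(mean=5.0, sigma=0.2, size=(n_normals, n_targets))
baseline = pon.mean(axis=0)

# Tumour sample: baseline-like coverage plus a gain on targets 50-69.
tumour = baseline * rng.lognormal(0, 0.1, n_targets)
tumour[50:70] *= 1.5

# "Stable" targets assumed copy-neutral in this sample (in PScnv these come
# from sample-intrinsic stable chromosomes); here we exclude the gained region.
stable = np.r_[0:50, 70:200]

# Ridge-regression normalization: express the sample's stable-target coverage
# as a penalized linear combination of PoN profiles.
lam = 1.0
A = pon[:, stable].T                                       # (n_stable, n_normals)
y = tumour[stable]
w = np.linalg.solve(A.T @ A + lam * np.eye(n_normals), A.T @ y)
expected = pon.T @ w                                       # personalized expectation

# Individualized log2 ratio profile: ~0 on stable targets, elevated on the gain.
log2_ratio = np.log2(tumour / expected)
```

A segmentation stage (e.g. circular binary segmentation, as the abstract describes) would then partition `log2_ratio` into copy-number segments.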
Pub Date : 2026-02-28DOI: 10.1093/bioinformatics/btag065
Robert A McDonald, Helen M Byrne, Heather A Harrington, Thomas Thorne, Bernadette J Stolz
Motivation: Comparing mathematical models offers a means to evaluate competing scientific theories. However, exact methods of model calibration are not applicable to many probabilistic models which simulate high-dimensional spatio-temporal data. Approximate Bayesian Computation is a widely used method for parameter inference and model selection in such scenarios, and it may be combined with Topological Data Analysis to study models which simulate data with fine spatial structure.
Results: We develop a flexible pipeline for parameter inference and model selection in spatio-temporal models. Our pipeline identifies topological summary statistics which quantify spatio-temporal data and uses them to approximate parameter and model posterior distributions. We validate our pipeline on models of tumour-induced angiogenesis, inferring four parameters in three established models and identifying the correct model in synthetic test-cases.
Availability and implementation: Simulation code for all models, data analyses, parameter inference and model selection is available online at https://github.com/rmcdomaths/tms/ and archived at https://doi.org/10.5281/zenodo.17392787.
{"title":"Topological model selection: a case-study in tumour-induced angiogenesis.","authors":"Robert A McDonald, Helen M Byrne, Heather A Harrington, Thomas Thorne, Bernadette J Stolz","doi":"10.1093/bioinformatics/btag065","DOIUrl":"10.1093/bioinformatics/btag065","url":null,"abstract":"<p><strong>Motivation: </strong>Comparing mathematical models offers a means to evaluate competing scientific theories. However, exact methods of model calibration are not applicable to many probabilistic models which simulate high-dimensional spatio-temporal data. Approximate Bayesian Computation is a widely used method for parameter inference and model selection in such scenarios, and it may be combined with Topological Data Analysis to study models which simulate data with fine spatial structure.</p><p><strong>Results: </strong>We develop a flexible pipeline for parameter inference and model selection in spatio-temporal models. Our pipeline identifies topological summary statistics which quantify spatio-temporal data and uses them to approximate parameter and model posterior distributions. 
We validate our pipeline on models of tumour-induced angiogenesis, inferring four parameters in three established models and identifying the correct model in synthetic test-cases.</p><p><strong>Availability and implementation: </strong>Simulation code for all models, data analyses, parameter inference and model selection is available online at https://github.com/rmcdomaths/tms/ and archived at https://doi.org/10.5281/zenodo.17392787.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147446297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
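The core loop of the pipeline, Approximate Bayesian Computation with a summary statistic computed on simulated spatial data, can be sketched with rejection ABC. For brevity this toy uses a scalar nearest-neighbour summary in place of the paper's topological summary statistics, and the spatial model, prior, and tolerance are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta, n=100):
    """Toy spatial model: points scattered around 5 random centres with
    cluster spread theta (a stand-in for a spatio-temporal simulator)."""
    centres = rng.uniform(0, 10, (5, 2))
    return centres[rng.integers(0, 5, n)] + rng.normal(0, theta, (n, 2))

def summary(pts):
    """Mean nearest-neighbour distance, standing in for the topological
    summaries (e.g. persistence statistics) used in the actual pipeline."""
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

# "Observed" data generated at a known theta we then try to recover.
theta_true = 0.5
s_obs = summary(simulate(theta_true))

# Rejection ABC: draw theta from the prior, simulate, and keep draws whose
# summary lands within tolerance eps of the observed summary.
eps = 0.05
accepted = []
for _ in range(2000):
    theta = rng.uniform(0.05, 2.0)
    if abs(summary(simulate(theta)) - s_obs) < eps:
        accepted.append(theta)

posterior_mean = float(np.mean(accepted))
```

Model selection works the same way: the model index is sampled alongside the parameters, and the accepted indices approximate the model posterior.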