Pub Date : 2026-02-09 DOI: 10.1093/bioinformatics/btag056
Jiaying Hu, Yihang Du, Suyang Hou, Yueyang Ding, Jinyan Li, Hao Wu, Xiaobo Sun
Motivation: Spatial clustering is a critical analytical task in spatial transcriptomics (ST) that helps uncover the spatial molecular mechanisms underlying biological phenotypes. As spatial clustering methods proliferate, an effective metric for evaluating their performance is urgently needed. An ideal metric should consider three factors: label agreement, spatial organization, and error severity. However, existing evaluation metrics focus solely on either label agreement or spatial organization, which leads to biased and misleading evaluations.
Results: To fill this gap, we propose CEMUSA, a novel graph-based metric that integrates these factors into a unified evaluation framework. Extensive testing on both simulated and real datasets demonstrates CEMUSA's superiority over conventional metrics in differentiating clustering results with subtle differences in topology and error severity, while maintaining computational efficiency.
Availability and implementation: The source code and data are freely available at https://github.com/YihDu/CEMUSA. CEMUSA is implemented as an R package at https://yihdu.github.io/CEMUSA.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: CEMUSA: A Graph-based Integrative Metric for Evaluating Clusters in Spatial Transcriptomics. Journal: Bioinformatics (Oxford, England).
Motivation: N6-methyladenine (6mA) is an important epigenetic modification of DNA that regulates biological processes such as gene expression, transcription, replication, DNA repair, and the cell cycle without altering the DNA sequence. It also plays a key role in many diseases, including cancer and autoimmune diseases. Although experimental approaches such as SMRT sequencing and methylated DNA immunoprecipitation can identify 6mA sites, they suffer from drawbacks including suboptimal sequencing quality, low signal-to-noise ratios, high costs, and time-consuming procedures. In recent years, deep learning approaches have demonstrated significant advantages in predicting 6mA sites; however, their generalization ability still requires further improvement.
Results: Inspired by the state space model Mamba, we propose a novel model for 6mA site prediction, named Mamba6mA. In the Mamba6mA model, we design position-specific linear layers to replace traditional convolutional layers and thereby capture position-specific information. Meanwhile, we construct a multi-scale feature extraction module and integrate features captured by sliding windows of different scales, feeding them into the classifier for prediction. Experimental results show that Mamba6mA achieves the best MCC on 9 out of 11 species datasets, surpassing existing state-of-the-art models. Ablation studies confirm that the position-specific linear layers and the multi-scale fusion module contribute MCC performance gains of 2.36% and 2.31%, respectively. Feature visualization analysis further reveals that the model effectively captures sequence patterns upstream and downstream of 6mA sites, providing a new technical approach for studying epigenetic modification mechanisms.
Availability and implementation: The source code for Mamba6mA is available at: https://github.com/XploreAI-Lab/Mamba6mA.
Contact: Xiaoya Fan (xiaoyafan@dlut.edu.cn), Zheng Zhao (zhaozheng@dlmu.edu.cn).
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Mamba6mA: A Mamba-based DNA N6-methyladenine Site Prediction Model. Journal: Bioinformatics (Oxford, England).
Authors: Qi Zhao, Zhen Zhang, Tingwei Chen, Qian Mao, Haoxuan Shi, Jingjing Chen, Zheng Zhao, Xiaoya Fan
Pub Date : 2026-02-05 DOI: 10.1093/bioinformatics/btag060
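The contrast between a shared convolution kernel and position-specific linear layers described in the Mamba6mA abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar-per-position encoding, the function names, and the use of mean pooling as the multi-scale feature are all assumptions made for illustration only.

```python
def conv1d_shared(x, kernel):
    """Slide ONE shared kernel over the sequence (valid padding)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


def position_specific_linear(x, weights):
    """Apply a DIFFERENT weight vector to each window position.

    weights[i] is the weight vector used only for the window starting at i,
    so the layer can treat the same pattern differently depending on where
    it occurs in the sequence.
    """
    k = len(weights[0])
    n_windows = len(x) - k + 1
    if len(weights) != n_windows:
        raise ValueError("need exactly one weight vector per window position")
    return [sum(weights[i][j] * x[i + j] for j in range(k))
            for i in range(n_windows)]


def multi_scale_features(x, window_sizes):
    """Concatenate mean-pooled features from sliding windows of several sizes.

    Mean pooling is a stand-in (assumption) for whatever per-window feature
    extractor a real model would use.
    """
    features = []
    for k in window_sizes:
        features.extend(sum(x[i:i + k]) / k for i in range(len(x) - k + 1))
    return features


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]  # toy numeric encoding of a 4-nt sequence
    print(conv1d_shared(x, [1.0, 1.0]))
    print(position_specific_linear(x, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
    print(multi_scale_features(x, [2, 4]))
```

The shared kernel has `k` parameters regardless of sequence length, whereas the position-specific variant has `k` parameters per window, which is the trade-off such a design accepts in exchange for positional sensitivity around a fixed-length site window.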
Pub Date : 2026-02-05 DOI: 10.1093/bioinformatics/btag059
Weiqiang Lin, Xinyi Xiao, Chuan Qiu, Hui Shen, Hongwen Deng
Motivation: Understanding spatial organization, intercellular interactions and regulatory networks within the spatial context of tissues is crucial for uncovering complex biological processes and disease mechanisms. Spatial transcriptomics technologies have revolutionized this field by enabling the spatially resolved profiling of gene expression. 10X Genomics Visium has emerged as the predominant spatial technology, but its low resolution and the complexity of integrating multimodal datasets present significant analytical challenges, particularly for researchers with limited computational and statistical expertise. Current spatial transcriptomics analysis platforms generally fall short of effectively integrating multimodal data and maximizing the utility of spatial information, such as uncovering complex cellular spatial dependencies, multimodal gradient patterns, and spatial co-expression of ligand-receptor pairs and regulatory networks related to disease or biological states. This limits their ability to provide comprehensive end-to-end analytical workflows for 10X Genomics Visium data.
Results: To address these limitations, we developed transFusion, a novel web-based platform specializing in comprehensive integrative analysis of scRNA-seq and 10X Visium spatial transcriptomics data. transFusion offers 12 key functions, from basic visualization to advanced analyses, including intercellular dependency analysis, ligand-receptor co-expression identification and visualization, and spatial multimodal gradient variation patterns. Two case studies demonstrate transFusion's capabilities in exploring tissue architecture, intercellular communication, dependency networks and multimodal gradient variation patterns while requiring minimal computational and statistical expertise. transFusion provides a flexible and powerful framework for multimodal data integration analysis.
Availability: transFusion is freely available at https://github.com/WQLin8/transFusion.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: transFusion: a Novel Comprehensive Platform for integration Analysis of Single-Cell and Spatial Transcriptomics. Journal: Bioinformatics (Oxford, England).
Pub Date : 2026-02-04 DOI: 10.1093/bioinformatics/btag057
Chengye Li, Hongwei Ma, Mingyang Ren
Motivation: Heterogeneity is a hallmark of both macroscopic complex diseases and microscopic single-cell distributions. Heterogeneity analysis based on Gaussian graphical models (GGMs) plays an important role in capturing the essential characteristics of biological regulatory networks, but it becomes unstable when samples from rare subgroups are scarce. Transfer learning offers promise by leveraging auxiliary data, yet existing approaches rely on an unrealistic assumption of overall similarity between domains, requiring the same number of subgroups and similar parameters. Many biological problems instead call for local similarity, where only some subgroups share statistical structures.
Results: In this article, we propose LtransHeteroGGM, a novel local transfer learning framework for GGM-based heterogeneity analysis. It can achieve powerful subgroup-level local knowledge transfer between target and informative auxiliary domains, despite unknown subgroup structures and numbers, while mitigating the negative interference of non-informative domains. The effectiveness and robustness of the proposed approach are demonstrated through comprehensive numerical simulations and real-world T cell heterogeneity analysis.
Availability and implementation: The R implementation of LtransHeteroGGM is available at https://github.com/Ren-Mingyang/LtransHeteroGGM.
Title: LtransHeteroGGM: Local transfer learning for Gaussian graphical model-based heterogeneity analysis. Journal: Bioinformatics (Oxford, England).
Pub Date : 2026-02-04 DOI: 10.1093/bioinformatics/btaf679
Nguyen Khoa Tran, My Ky Huynh, Alexander D Kotman, Martin Jürgens, Thomas Kurz, Sascha Dietrich, Gunnar W Klau, Nan Qin
Motivation: Live-cell imaging-based drug screening increases the likelihood of identifying effective and safe drugs by providing dynamic, high-content, and physiologically relevant data. As a result, it improves the success rate of drug development and facilitates the translation of benchside discoveries to bedside applications. Despite these advantages, no comprehensive metrics currently exist to evaluate dose-time-dependent drug responses. To address this gap, we established a systematic framework to assess drug effects across a range of concentrations and exposure durations simultaneously. This metric enables more accurate evaluation of drug responses measured by live-cell imaging.
Results: We employed treatment concentrations ranging from 0 to 10 μM and performed live-cell imaging-based measurements over a 120-hour incubation period. To analyze the experimental data, we developed VUScope, a new mathematical model combining the 4-parameter logistic curve and a logistic function to characterize dose-time-dependent responses. This enabled us to calculate the Growth Rate Inhibition Volume Under the dose-time-response Surface (GRIVUS), which serves as a critical metric for assessing dynamic drug responses. Furthermore, our mathematical model allowed us to predict long-term treatment responses based on short-term drug responses. We validated the predictive capabilities of our model using independent datasets and observed that VUScope enhances prediction accuracy and offers deeper insights into drug effects than previously possible. By integrating VUScope into high-throughput drug screening platforms, we can further improve the efficacy of drug development and treatment selection.
Availability and implementation: We have made VUScope more accessible to users conducting pharmacological studies by uploading a detailed description, example datasets, and the source code to vuscope.albi.hhu.de, https://github.com/AlBi-HHU/VUScope, and https://doi.org/10.5281/zenodo.17610533.
Supplementary information: A: Time-independence of GR metrics; B: Key resources; C, D, E: Supplementary figures; F: Modeling choices.
Title: VUScope: a mathematical model for evaluating image-based drug response measurements and predicting long-term incubation outcomes. Journal: Bioinformatics (Oxford, England).
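To make the idea of a volume under a dose-time-response surface concrete, the sketch below combines a 4-parameter logistic (4PL) dose curve with a logistic function of time and integrates the resulting surface with the trapezoid rule. The way the two functions are combined in `response`, all parameter names, and the linear integration grid are assumptions chosen for illustration; the abstract does not specify the VUScope equations, and this is not a computation of GRIVUS itself.

```python
import math


def four_pl(conc, top, bottom, ec50, hill):
    """Standard 4-parameter logistic dose-response curve."""
    if conc <= 0:
        return top  # no drug: response stays at the upper plateau
    return bottom + (top - bottom) / (1.0 + (conc / ec50) ** hill)


def time_logistic(t, rate, midpoint):
    """Logistic function of time: fraction of the full drug effect reached at t."""
    return 1.0 / (1.0 + math.exp(-rate * (t - midpoint)))


def response(conc, t, p):
    """Hypothetical dose-time surface (assumption, not the published model):
    the 4PL effect at this dose is scaled by how far the effect has developed."""
    full = four_pl(conc, p["top"], p["bottom"], p["ec50"], p["hill"])
    return p["top"] - (p["top"] - full) * time_logistic(t, p["rate"], p["midpoint"])


def trapz(ys, xs):
    """Trapezoid-rule integral of samples ys taken at positions xs."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))


def volume_under_surface(concs, times, p):
    """Integrate over dose for each time point, then integrate over time."""
    per_time = [trapz([response(c, t, p) for c in concs], concs) for t in times]
    return trapz(per_time, times)


if __name__ == "__main__":
    params = {"top": 1.0, "bottom": 0.0, "ec50": 1.0, "hill": 1.0,
              "rate": 0.1, "midpoint": 48.0}
    concs = [0.0, 0.1, 0.5, 1.0, 5.0, 10.0]   # micromolar, as in the 0-10 range above
    times = [0.0, 24.0, 48.0, 72.0, 96.0, 120.0]  # hours
    print(volume_under_surface(concs, times, params))
```

A single scalar of this kind summarizes both potency and kinetics: a drug that acts earlier or at lower doses carves more out of the surface and yields a smaller volume. In practice, doses are often integrated on a log scale; the linear grid here is only for brevity.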
Pub Date : 2026-02-04 DOI: 10.1093/bioinformatics/btag043
R C M Kuin, A T Julian, J Chander, S Lee, G J P van Westen
Motivation: Protein sequence variation analysis is a topic of broad interest in drug discovery and protein engineering to support modulation of protein function for diverse biotechnological and therapeutic applications. To assist in the analysis of multiple sequence alignments (MSAs) and identify residues that account for protein function specificity, computational tools have been developed. Yet, existing programs often omit consideration of amino acid properties, flexibility beyond fixed webserver interfaces, accessible source code, or compatibility with small MSAs.
Results: To address these limitations, we present PyTEA-O, a Python implementation of Two-Entropies Analysis that has been developed to be easy to use for the analysis of protein sequence variation. To help users analyse the MSA and screen for residues of interest, we generate modifiable and intuitive visualizations. These visualizations, together with a scoring approach for identifying alignment positions with (dis-)similar physicochemical properties, present a powerful tool for sequence variability analysis. To demonstrate its capabilities, we present a case study based on the deubiquitinase OTUD7B (Cezanne) where we identify a crucial position that modulates its affinity for its substrate.
Availability: PyTEA-O is available at https://github.com/CDDLeiden/PyTEA-O/ and archived via Zenodo (https://doi.org/10.5281/zenodo.15914598).
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: PyTEA-O: a Python implementation of Two-Entropies Analysis for protein sequence variation analysis. Journal: Bioinformatics (Oxford, England).
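The core idea behind a two-entropies analysis can be sketched as follows: for each alignment column, compute the Shannon entropy over all sequences and the average entropy within predefined subfamilies. A column that varies globally but is conserved within each subfamily is a candidate specificity-determining position. The code below is a minimal illustration under stated assumptions; the function names and input layout are hypothetical and do not reflect the PyTEA-O API, gaps are treated as an ordinary symbol, and the physicochemical scoring mentioned in the abstract is not covered.

```python
import math
from collections import Counter


def shannon_entropy(residues):
    """Shannon entropy (in bits) of the residue distribution in one column."""
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def two_entropies(alignment, groups):
    """Return (global entropy, mean within-group entropy) per alignment column.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    groups:    dict mapping subfamily name -> list of sequence names
    """
    length = len(next(iter(alignment.values())))
    results = []
    for pos in range(length):
        column = [seq[pos] for seq in alignment.values()]
        global_h = shannon_entropy(column)
        within = [shannon_entropy([alignment[name][pos] for name in names])
                  for names in groups.values()]
        results.append((global_h, sum(within) / len(within)))
    return results


if __name__ == "__main__":
    msa = {"a1": "AK", "a2": "AK", "b1": "AR", "b2": "AR"}
    subfamilies = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
    for pos, (global_h, within_h) in enumerate(two_entropies(msa, subfamilies)):
        print(pos, global_h, within_h)
```

In the toy alignment, the first column is fully conserved (both entropies are zero), while the second column differs between the two subfamilies but not within them, which is the pattern of interest.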
Pub Date : 2026-02-03 DOI: 10.1093/bioinformatics/btag058
Joan Segura, Ruben Sanchez-Garcia, Sebastian Bittrich, Yana Rose, Stephen K Burley, Jose M Duarte
Motivation: The rapid expansion of three-dimensional (3D) biomolecular structure information, driven by breakthroughs in artificial intelligence/deep learning (AI/DL)-based structure predictions, has created an urgent need for scalable and efficient structure similarity search methods. Traditional alignment-based approaches, such as structural superposition tools, are computationally expensive and challenging to scale with the vast number of available macromolecular structures.
Results: Herein, we present a scalable structure similarity search strategy designed to navigate extensive repositories of experimentally determined structures and computed structure models predicted using AI/DL methods. Our approach leverages protein language models and a deep neural network architecture to transform 3D structures into fixed-length vectors, enabling efficient large-scale comparisons. Although trained to predict TM-scores between single-domain structures, our model generalizes beyond the domain level, accurately identifying 3D similarity for full-length polypeptide chains and multimeric assemblies. By integrating vector databases, our method facilitates efficient large-scale structure retrieval, addressing the growing challenges posed by the expanding volume of 3D biostructure information.
Availability: Source code available at https://github.com/bioinsilico/rcsb-embedding-search. Source code DOI: https://doi.org/10.6084/m9.figshare.30546698.v1. Benchmark datasets DOI: https://doi.org/10.6084/m9.figshare.30546650.v1. Web server prototype available at: http://embedding-search.rcsb.org/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Multi-scale structural similarity embedding search across entire proteomes. Journal: Bioinformatics (Oxford, England).
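Once structures are represented as fixed-length vectors, similarity search reduces to a nearest-neighbour lookup in vector space. The sketch below shows a brute-force version using cosine similarity. Both choices are assumptions for illustration: the abstract does not state which similarity function the model uses, and a vector database would normally replace the linear scan with an approximate index. The structure identifiers and embeddings are made up.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (assumes the vector is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_top_k(query, index, k=2):
    """Return the k (similarity, name) pairs most similar to the query.

    index: dict mapping structure name -> embedding of the same length as query
    """
    q = normalize(query)
    scored = [(sum(a * b for a, b in zip(q, normalize(vec))), name)
              for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical 3-dimensional embeddings; real embeddings are much longer.
    index = {
        "struct_1": [1.0, 0.0, 0.0],
        "struct_2": [0.9, 0.1, 0.0],
        "struct_3": [0.0, 1.0, 0.0],
    }
    for score, name in cosine_top_k([1.0, 0.0, 0.0], index, k=2):
        print(name, round(score, 3))
```

The cost of each comparison is independent of structure size, which is what makes this approach scale where pairwise structural superposition does not.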
Pub Date : 2026-02-02 DOI: 10.1093/bioinformatics/btag051
Yingtong Liu, Aaron G Baugh, Evanthia T Roussos Torres, Adam L MacLean
Motivation: Differential expression and marker gene selection methods for single-cell RNA sequencing (scRNA-seq) data can struggle to identify small sets of informative genes, especially for subtle differences between cell states, such as those induced by disease or treatment.
Results: We present iterative logistic regression (iLR) for the identification of small sets of informative marker genes. iLR applies logistic regression iteratively, using Pareto front optimization to balance gene set size against classification performance. We benchmark iLR on in silico datasets, demonstrating performance comparable to the state of the art in single-cell classification while using only a fraction of the genes. We test iLR on its ability to distinguish neuronal cell subtypes in healthy individuals vs. autism spectrum disorder patients and find that it achieves high accuracy with small sets of disease-relevant genes. We apply iLR to investigate immunotherapeutic effects in cell types from different tumor microenvironments and find that iLR infers informative genes that translate across organs and even across species (mouse to human). We predict via iLR that entinostat acts in part through the modulation of myeloid cell differentiation routes in the lung microenvironment. Overall, iLR provides a means to infer interpretable transcriptional signatures with prognostic or therapeutic potential from complex datasets.
Availability and implementation: iLR is freely available at GitHub https://github.com/maclean-lab/iLR and Zenodo https://zenodo.org/records/17728797.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Inference of marker genes of subtle cell state changes via iLR: iterative logistic regression.","authors":"Yingtong Liu, Aaron G Baugh, Evanthia T Roussos Torres, Adam L MacLean","doi":"10.1093/bioinformatics/btag051","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag051","url":null,"abstract":"<p><strong>Motivation: </strong>Differential expression and marker gene selection methods for single-cell RNA sequencing (scRNA-seq) data can struggle to identify small sets of informative genes, especially for subtle differences between cell states, as can be induced by disease or treatment.</p><p><strong>Results: </strong>We present iterative logistic regression (iLR) for the identification of small sets of informative marker genes. iLR applied logistic regression iteratively with a Pareto front optimization to balance gene set size with classification performance. We benchmark iLR on in silico datasets, demonstrating comparable performance to the state-of-the-art at single-cell classification using only a fraction of the genes. We test iLR on its ability to distinguish neuronal cell subtypes in healthy vs. autism spectrum disorder patients and find that it achieves high accuracy with small sets of disease-relevant genes. We apply iLR to investigate immunotherapeutic effects in cell types from different tumor microenvironments and find that iLR infers informative genes that translate across organs and even species (mouse-to-human) comparison. We predicted via iLR that entinostat acts in part through the modulation of myeloid cell differentiation routes in the lung microenvironment. 
Overall, iLR provides means to infer interpretable transcriptional signatures from complex datasets with prognostic or therapeutic potential.</p><p><strong>Availability and implementation: </strong>iLR is freely available at GitHub https://github.com/maclean-lab/iLR and Zenodo https://zenodo.org/records/17728797.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
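The trade-off between gene set size and classification performance described in the iLR abstract can be illustrated with a minimal sketch of Pareto front selection. This is not the authors' implementation: the function name `pareto_front` and the candidate values are hypothetical, and the sketch covers only the selection of non-dominated (gene count, accuracy) pairs, not the logistic regression fits that would produce them.

```python
def pareto_front(candidates):
    """Return the non-dominated (n_genes, accuracy) pairs.

    A candidate is dominated when another candidate uses no more genes
    and reaches at least the same accuracy, with a strict improvement
    in at least one of the two.
    """
    front = []
    # Sort by gene count (ascending), breaking ties by accuracy (descending).
    for n_genes, acc in sorted(candidates, key=lambda c: (c[0], -c[1])):
        # Keep a candidate only if it is more accurate than every
        # smaller (or equally sized) gene set already on the front.
        if not front or acc > front[-1][1]:
            front.append((n_genes, acc))
    return front


# Hypothetical (gene set size, classification accuracy) results.
candidates = [(5, 0.80), (10, 0.86), (10, 0.84), (20, 0.85), (50, 0.91)]
front = pareto_front(candidates)
print(front)
```

In this toy input, the 20-gene set is dropped because a 10-gene set already achieves higher accuracy, which is the kind of trade-off a Pareto front makes explicit.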
Pub Date : 2026-02-02 DOI: 10.1093/bioinformatics/btag054
Georgios A Manios, Sophia Nteli, Panagiota I Kontou, Pantelis G Bagos
Motivation: Genome-wide association study (GWAS) meta-analysis tools are essential for integrating summary statistics across multiple cohorts, thereby increasing statistical power and validating genetic associations. Widely cited tools, such as METAL, PLINK, and GWAMA, have facilitated numerous significant discoveries in the field of GWAS. Nevertheless, these tools offer a limited set of meta-analysis methods and typically require prior experience with command-line tools.
Results: We present PYRAMA, an open-source tool for meta-analysis of genome-wide association studies. This easy-to-use software package includes several meta-analysis methods that are absent from similar packages. PYRAMA is faster than other tools; supports robust methods for analysis and meta-analysis as well as fixed-effects, random-effects, and Bayesian meta-analysis; and is currently the only tool that supports meta-analysis with imputation of summary statistics. It is available both as a standalone tool and as a free web server.
{"title":"PYRAMA: An open-source tool for advanced meta-analysis of genome wide association studies.","authors":"Georgios A Manios, Sophia Nteli, Panagiota I Kontou, Pantelis G Bagos","doi":"10.1093/bioinformatics/btag054","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag054","url":null,"abstract":"<p><strong>Motivation: </strong>Genome-wide association study (GWAS) meta-analysis tools are essential for integrating summary statistics across multiple cohorts, thereby increasing statistical power and validating genetic associations. Widely cited tools, such as METAL, PLINK, and GWAMA, have facilitated numerous significant discoveries in the field of GWAS. Nevertheless, these tools offer a limited set of meta-analysis methods and typically require users to have prior experience with command-line tools to be executed.</p><p><strong>Results: </strong>We present here PYRAMA, an open-source tool which is designed for meta-analysis of genome wide association studies. This work introduces an easy-to-use software package that includes several meta-analysis methods that are absent in similar software packages. PYRAMA is faster compared to other tools, supports robust methods for analysis and meta-analysis, fixed-effects, random-effects and Bayesian meta-analysis and it is currently the only tool that supports meta-analysis with imputation of summary statistics. 
It is available both as a standalone tool and as a freely available web server.</p><p><strong>Availability: </strong>https://github.com/pbagos/PYRAMA, https://doi.org/10.5281/zenodo.17830449.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
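As an illustration of the simplest method named in the PYRAMA abstract, the following is a minimal sketch of standard inverse-variance fixed-effects meta-analysis for a single variant. It does not use PYRAMA's interface or code; the function name `fixed_effects_meta` and the per-cohort effect sizes and standard errors are hypothetical.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted fixed-effects meta-analysis.

    betas: per-cohort effect estimates for one variant
    ses:   matching standard errors
    Returns the pooled effect, its standard error, the z-score,
    and a two-sided p-value under a normal approximation.
    """
    weights = [1.0 / se ** 2 for se in ses]  # w_i = 1 / SE_i^2
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, z, p


# Hypothetical summary statistics from three cohorts.
beta, se, z, p = fixed_effects_meta([0.10, 0.20, 0.15], [0.1, 0.1, 0.1])
print(beta, se, z, p)
```

With equal standard errors the pooled estimate reduces to the plain mean of the cohort effects, and the pooled standard error shrinks by the square root of the number of cohorts.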
Pub Date : 2026-02-02 DOI: 10.1093/bioinformatics/btag053
Grace S Brown, James Wengler, Aaron Joyce S Fabelico, Abigail Muir, Anna Tubbs, Amanda Warren, Alexandra N Millett, Xinrui Xiang Yu, Paul Pavlidis, Sanja Rogic, Stephen R Piccolo
Motivation: Millions of high-throughput molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, a major challenge is finding relevant datasets due to the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge corresponds to findability, the first of the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing hundreds of thousands of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing the results of these queries is time-consuming and tedious, and the queries often miss relevant datasets.
Results: We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO's search engine. The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, likely in combination with existing search tools.
Availability: Our analysis code and a Web-based tool that enables others to use our methodology are available from https://github.com/srp33/GEO_NLP and https://github.com/srp33/GEOfinder3.0, respectively.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"Using semantic search to find publicly available gene-expression datasets.","authors":"Grace S Brown, James Wengler, Aaron Joyce S Fabelico, Abigail Muir, Anna Tubbs, Amanda Warren, Alexandra N Millett, Xinrui Xiang Yu, Paul Pavlidis, Sanja Rogic, Stephen R Piccolo","doi":"10.1093/bioinformatics/btag053","DOIUrl":"10.1093/bioinformatics/btag053","url":null,"abstract":"<p><strong>Motivation: </strong>Millions of high-throughput, molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, a major challenge is finding relevant datasets due to the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge is first among the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing 100,000 s of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing these results is time-consuming and tedious, and it often misses relevant datasets.</p><p><strong>Results: </strong>We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO's search engine. 
The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, likely in combination with existing search tools.</p><p><strong>Availability: </strong>Our analysis code and a Web-based tool that enables others to use our methodology are available from https://github.com/srp33/GEO_NLP and https://github.com/srp33/GEOfinder3.0, respectively.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
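The embedding-based retrieval described in the GEO dataset discovery abstract can be sketched as ranking candidate datasets by similarity to datasets already known to be relevant. The sketch below rests on assumptions not stated in the abstract: cosine similarity as the measure, averaging the known embeddings into a centroid as the query, and toy three-dimensional vectors in place of real language-model embeddings. The names `rank_candidates`, `series_A`, `series_B`, and `series_C` are hypothetical and are not GEO accessions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(known, candidates):
    """Rank candidate names by similarity to the centroid of known embeddings.

    known:      list of embedding vectors for datasets already judged relevant
    candidates: dict mapping a dataset name to its embedding vector
    """
    dim = len(known[0])
    # Assumption: the query is the element-wise mean of the known embeddings.
    centroid = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    scored = [(cosine_similarity(centroid, vec), name)
              for name, vec in candidates.items()]
    # Highest similarity first.
    return [name for _, name in sorted(scored, reverse=True)]


# Toy embeddings standing in for language-model output.
known = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
candidates = {
    "series_A": [0.9, 0.1, 0.0],
    "series_B": [0.0, 1.0, 0.0],
    "series_C": [0.5, 0.5, 0.0],
}
ranking = rank_candidates(known, candidates)
print(ranking)
```

A real pipeline would replace the toy vectors with embeddings of dataset descriptions and would likely use a vectorized library for the similarity computation.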