Bioinformatics (Oxford, England): latest articles

CEMUSA: A Graph-based Integrative Metric for Evaluating Clusters in Spatial Transcriptomics.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-09 DOI: 10.1093/bioinformatics/btag056

Jiaying Hu, Yihang Du, Suyang Hou, Yueyang Ding, Jinyan Li, Hao Wu, Xiaobo Sun

Motivation: Spatial clustering is a critical analytical task in spatial transcriptomics (ST) that aids in uncovering the spatial molecular mechanisms underlying biological phenotypes. Along with the numerous spatial clustering methods, there comes the imperative need for an effective metric to evaluate their performance. An ideal metric should consider three factors: label agreement, spatial organization, and error severity. However, existing evaluation metrics focus solely on either label agreement or spatial organization, leading to biased and misleading evaluations.

Results: To fill this gap, we propose CEMUSA, a novel graph-based metric that integrates these factors into a unified evaluation framework. Extensive testing on both simulated and real datasets demonstrates CEMUSA's superiority over conventional metrics in differentiating clustering results with subtle differences in topology and error severity, while maintaining computational efficiency.

Availability and implementation: The source code and data are freely available at https://github.com/YihDu/CEMUSA. CEMUSA is implemented as an R package at https://yihdu.github.io/CEMUSA.

Supplementary information: Supplementary data are available at Bioinformatics online.



Citations: 0

Mamba6mA: A Mamba-based DNA N6-methyladenine Site Prediction Model.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-05 DOI: 10.1093/bioinformatics/btag060

Qi Zhao, Zhen Zhang, Tingwei Chen, Qian Mao, Haoxuan Shi, Jingjing Chen, Zheng Zhao, Xiaoya Fan

Motivation: N6-methyladenine (6mA) is an important epigenetic modification of DNA that regulates biological processes such as gene expression, transcription, replication, DNA repair, and the cell cycle without altering the DNA sequence. It also plays a key role in many diseases, including cancer and autoimmune diseases. Although experimental approaches such as SMRT sequencing and methylated DNA immunoprecipitation can identify 6mA sites, they suffer from drawbacks including suboptimal sequencing quality, low signal-to-noise ratios, high costs, and time-consuming procedures. In recent years, deep learning approaches have demonstrated significant advantages in predicting 6mA sites; however, their generalization ability still requires further improvement.

Results: Inspired by the state space model Mamba, we propose a novel model for 6mA site prediction, named Mamba6mA. In the Mamba6mA model, we design position-specific linear layers to replace traditional convolutional layers and thereby facilitate the capture of position-specific information. Meanwhile, we construct a multi-scale feature extraction module that integrates features captured by sliding windows of different scales and feeds them into the classifier for prediction. Experimental results show that Mamba6mA achieves the best MCC on 9 out of 11 species datasets, surpassing existing state-of-the-art models. Ablation studies confirm that the position-specific linear layers and the multi-scale fusion module contribute MCC performance gains of 2.36% and 2.31%, respectively. Feature visualization analysis further reveals that the model effectively captures sequence patterns upstream and downstream of 6mA sites, providing a new technical approach for studying epigenetic modification mechanisms.
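The results above are reported as MCC (Matthews correlation coefficient). As a reminder of what that number measures, here is a minimal sketch of the standard binary MCC formula in plain Python. The function name and the toy labels are illustrative assumptions; this is not code from the Mamba6mA repository.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC; labels are 1 (6mA site) or 0 (non-site)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is set to 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels (made up for illustration): 3 TP, 3 TN, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(matthews_corrcoef(y_true, y_pred))  # 0.5
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when positive and negative sites are imbalanced, which is presumably why it is the headline metric here.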

Availability and implementation: The source code for Mamba6mA is available at: https://github.com/XploreAI-Lab/Mamba6mA.

Contact: Xiaoya Fan (xiaoyafan@dlut.edu.cn), Zheng Zhao (zhaozheng@dlmu.edu.cn).

Supplementary information: Supplementary information is available at Bioinformatics online.



Citations: 0

transFusion: a Novel Comprehensive Platform for integration Analysis of Single-Cell and Spatial Transcriptomics.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-05 DOI: 10.1093/bioinformatics/btag059

Weiqiang Lin, Xinyi Xiao, Chuan Qiu, Hui Shen, Hongwen Deng

Motivation: Understanding spatial organization, intercellular interactions and regulatory networks within the spatial context of tissues is crucial for uncovering complex biological processes and disease mechanisms. Spatial transcriptomics technologies have revolutionized this field by enabling the spatially resolved profiling of gene expression. 10X Genomics Visium has emerged as the predominant spatial technology, but its low resolution and the complexity of integrating multimodal datasets present significant analytical challenges, particularly for researchers with limited computational and statistical expertise. Current spatial transcriptomics analysis platforms generally fall short of effectively integrating multi-modal data and maximizing the utility of spatial information, for example by uncovering complex cellular spatial dependencies, multimodal gradient patterns, and the spatial co-expression of ligand-receptor pairs and regulatory networks related to disease or biological states. This limits their ability to provide comprehensive end-to-end analytical workflows for 10X Genomics Visium data.

Results: To address these limitations, we developed transFusion, a novel, advanced web-based platform specializing in the most comprehensive and effective integration analysis of scRNA-seq and 10X Visium spatial transcriptomics data. transFusion offers 12 key functions, from basic visualization to advanced analyses, including intercellular dependency analysis, ligand-receptor co-expression identification and visualization, and spatial multimodal gradient variation patterns. Two case studies were used to demonstrate transFusion's capabilities in exploring tissue architecture, intercellular communication, dependency networks and multimodal gradient variation patterns with minimal computational skills and statistical expertise. transFusion provides a flexible and powerful framework for multi-modal data integration analysis.

Availability: transFusion is freely available at https://github.com/WQLin8/transFusion.

Supplementary information: Supplementary data are available at Bioinformatics online.



Citations: 0

LtransHeteroGGM: Local transfer learning for Gaussian graphical model-based heterogeneity analysis.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-04 DOI: 10.1093/bioinformatics/btag057

Chengye Li, Hongwei Ma, Mingyang Ren

Motivation: Heterogeneity is a hallmark of both macroscopic complex diseases and microscopic single-cell distributions. Heterogeneity analysis based on Gaussian graphical models (GGMs) plays an important role in capturing the essential characteristics of biological regulatory networks, but it becomes unstable when samples from rare subgroups are scarce. Transfer learning offers promise by leveraging auxiliary data, yet existing approaches rely on an unrealistic assumption of overall similarity between domains, requiring the same number of subgroups and similar parameters. Numerous biological problems instead call for local similarity, where only some subgroups share statistical structures.

Results: In this article, we propose LtransHeteroGGM, a novel local transfer learning framework for GGM-based heterogeneity analysis. It can achieve powerful subgroup-level local knowledge transfer between target and informative auxiliary domains, despite unknown subgroup structures and numbers, while mitigating the negative interference of non-informative domains. The effectiveness and robustness of the proposed approach are demonstrated through comprehensive numerical simulations and real-world T cell heterogeneity analysis.

Availability and implementation: The R implementation of LtransHeteroGGM is available at https://github.com/Ren-Mingyang/LtransHeteroGGM.



Citations: 0

VUScope: a mathematical model for evaluating image-based drug response measurements and predicting long-term incubation outcomes.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-04 DOI: 10.1093/bioinformatics/btaf679

Nguyen Khoa Tran, My Ky Huynh, Alexander D Kotman, Martin Jürgens, Thomas Kurz, Sascha Dietrich, Gunnar W Klau, Nan Qin

Motivation: Live-cell imaging-based drug screening increases the likelihood of identifying effective and safe drugs by providing dynamic, high-content, and physiologically relevant data. As a result, it improves the success rate of drug development and facilitates the translation of benchside discoveries to bedside applications. Despite these advantages, no comprehensive metrics currently exist to evaluate dose-time-dependent drug responses. To address this gap, we established a systematic framework to assess drug effects across a range of concentrations and exposure durations simultaneously. This metric enables more accurate evaluation of drug responses measured by live-cell imaging.

Results: We employed treatment concentrations ranging from 0 to 10 μM and performed live-cell imaging-based measurements over a 120-hour incubation period. To analyze the experimental data, we developed VUScope, a new mathematical model combining the 4-parameter logistic curve and a logistic function to characterize dose-time-dependent responses. This enabled us to calculate the Growth Rate Inhibition Volume Under the dose-time-response Surface (GRIVUS), which serves as a critical metric for assessing dynamic drug responses. Furthermore, our mathematical model allowed us to predict long-term treatment responses based on short-term drug responses. We validated the predictive capabilities of our model using independent datasets and observed that VUScope enhances prediction accuracy and offers deeper insights into drug effects than previously possible. By integrating VUScope into high-throughput drug screening platforms, we can further improve the efficacy of drug development and treatment selection.
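To make the idea of a volume under a dose-time-response surface concrete, the following sketch combines a standard 4-parameter logistic (4PL) dose-response curve with a logistic function of time and integrates the resulting surface with the trapezoidal rule. The way the two functions are combined here (the lower asymptote deepening logistically over time), all parameter values, the grid, and integration on a linear concentration axis are assumptions made for illustration; the abstract does not specify VUScope's actual parameterization, and this is not the GRIVUS implementation.

```python
import math


def four_pl(conc, top, bottom, ec50, hill):
    """Standard 4-parameter logistic dose-response curve."""
    if conc <= 0:
        return top  # no drug: response stays at the upper asymptote
    return bottom + (top - bottom) / (1.0 + (conc / ec50) ** hill)


def logistic(t, midpoint, rate):
    """Logistic function of time, rising from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-rate * (t - midpoint)))


def surface(conc, t):
    """Hypothetical dose-time response: the maximal effect develops over time."""
    onset = logistic(t, midpoint=48.0, rate=0.1)
    return four_pl(conc, top=1.0, bottom=1.0 - 0.8 * onset, ec50=1.0, hill=1.5)


def trapz(xs, ys):
    """Trapezoidal rule for samples ys taken at positions xs."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))


def volume_under_surface(concs, times, f):
    """Integrate f(conc, t) over concentration first, then over time."""
    inner = [trapz(concs, [f(c, t) for c in concs]) for t in times]
    return trapz(times, inner)


# Grid matching the ranges quoted above: 0 to 10 uM and 0 to 120 hours.
concs = [0.0, 0.01, 0.1, 1.0, 10.0]
times = [0.0, 24.0, 48.0, 72.0, 96.0, 120.0]
# Normalise by the volume of a flat surface at response 1 (no drug effect).
normalised = volume_under_surface(concs, times, surface) / (10.0 * 120.0)
print(round(normalised, 3))
```

A normalised volume close to 1 would indicate little inhibition across the whole dose-time grid, while smaller values indicate a stronger or earlier response. In practice concentrations are often integrated on a log scale, which would change the numbers but not the idea.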

Availability and implementation: We have made VUScope more accessible to users conducting pharmacological studies by uploading a detailed description, example datasets, and the source code to vuscope.albi.hhu.de, https://github.com/AlBi-HHU/VUScope, and https://doi.org/10.5281/zenodo.17610533.

Supplementary information: A: Time-independence of GR metrics; B: Key resources; C, D, E: Supplementary figures; F: Modeling choices.



Citations: 0

PyTEA-O: a Python implementation of Two-Entropies Analysis for protein sequence variation analysis.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-04 DOI: 10.1093/bioinformatics/btag043

R C M Kuin, A T Julian, J Chander, S Lee, G J P van Westen

Motivation: Protein sequence variation analysis is a topic of broad interest in drug discovery and protein engineering to support modulation of protein function for diverse biotechnological and therapeutic applications. To assist in the analysis of multiple sequence alignments (MSAs) and identify residues that account for protein function specificity, computational tools have been developed. Yet, existing programs often omit consideration of amino acid properties, flexibility beyond fixed webserver interfaces, accessible source code, or compatibility with small MSAs.

Results: To address these limitations, we present PyTEA-O, a Python implementation of Two-Entropies Analysis that has been developed to be easy to use for the analysis of protein sequence variation. To help users analyse the MSA and screen for residues of interest, we generate modifiable and intuitive visualizations. These visualizations, together with a scoring approach for identifying alignment positions with (dis-)similar physicochemical properties, present a powerful tool for sequence variability analysis. To demonstrate its capabilities, we present a case study based on the deubiquitinase OTUD7B (Cezanne) where we identify a crucial position that modulates its affinity for its substrate.
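The core quantity behind a two-entropies style analysis can be sketched in a few lines: for each alignment column, compute the Shannon entropy over all sequences and the average Shannon entropy within predefined subfamilies. The sketch below assumes that reading of the method; the function names, the toy alignment, and the subfamily labels are hypothetical and do not reflect PyTEA-O's API, and the physicochemical scoring mentioned above is not included.

```python
import math
from collections import Counter


def shannon_entropy(residues):
    """Shannon entropy (bits) of the residue distribution in one column."""
    counts = Counter(residues)
    n = len(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def two_entropies(msa, subfamilies):
    """Per column: (entropy over all sequences, mean entropy within subfamilies).

    msa         -- list of equal-length aligned sequences
    subfamilies -- one subfamily label per sequence (assumed to be given)
    """
    groups = {}
    for seq, label in zip(msa, subfamilies):
        groups.setdefault(label, []).append(seq)
    results = []
    for col in range(len(msa[0])):
        total = shannon_entropy([s[col] for s in msa])
        within = sum(shannon_entropy([s[col] for s in g])
                     for g in groups.values()) / len(groups)
        results.append((total, within))
    return results


# Toy alignment: two subfamilies of two sequences each.
msa = ["ACD", "ACE", "GCD", "GCF"]
labels = ["a", "a", "b", "b"]
entropies = two_entropies(msa, labels)
for col, (total, within) in enumerate(entropies):
    print(col, round(total, 3), round(within, 3))
```

In the toy alignment, column 0 is variable overall but conserved within each subfamily (high total entropy, zero within-subfamily entropy), which is the pattern usually read as a candidate specificity-determining position. Column 1 is fully conserved, and column 2 varies even within subfamilies. Gap characters, if present, are simply treated as another symbol in this sketch.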

Availability: PyTEA-O is available at https://github.com/CDDLeiden/PyTEA-O/ and archived via Zenodo (https://doi.org/10.5281/zenodo.15914598).

Supplementary information: Supplementary data are available at Bioinformatics online.



Citations: 0

Multi-scale structural similarity embedding search across entire proteomes.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-03 DOI: 10.1093/bioinformatics/btag058

Joan Segura, Ruben Sanchez-Garcia, Sebastian Bittrich, Yana Rose, Stephen K Burley, Jose M Duarte

Motivation: The rapid expansion of three-dimensional (3D) biomolecular structure information, driven by breakthroughs in artificial intelligence/deep learning (AI/DL)-based structure predictions, has created an urgent need for scalable and efficient structure similarity search methods. Traditional alignment-based approaches, such as structural superposition tools, are computationally expensive and challenging to scale with the vast number of available macromolecular structures.

Results: Herein, we present a scalable structure similarity search strategy designed to navigate extensive repositories of experimentally determined structures and computed structure models predicted using AI/DL methods. Our approach leverages protein language models and a deep neural network architecture to transform 3D structures into fixed-length vectors, enabling efficient large-scale comparisons. Although trained to predict TM-scores between single-domain structures, our model generalizes beyond the domain level, accurately identifying 3D similarity for full-length polypeptide chains and multimeric assemblies. By integrating vector databases, our method facilitates efficient large-scale structure retrieval, addressing the growing challenges posed by the expanding volume of 3D biostructure information.
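The retrieval step described in the Results above, comparing fixed-length vectors instead of aligning structures, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the 4-dimensional vectors, the names `chain_A` to `chain_C`, and the use of cosine similarity are hypothetical, and the brute-force loop stands in for the vector database that the abstract mentions. It is not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, index, k=2):
    """Return the k (name, score) pairs in `index` most similar to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real structure embeddings would be far longer.
index = {
    "chain_A": [0.9, 0.1, 0.0, 0.2],
    "chain_B": [0.8, 0.2, 0.1, 0.1],
    "chain_C": [0.0, 0.9, 0.4, 0.0],
}
query = [1.0, 0.1, 0.0, 0.2]

hits = top_k(query, index, k=2)
for name, score in hits:
    print(f"{name}\t{score:.3f}")
```

Because every structure is reduced to a vector once, a query costs one vector comparison per stored entry (or less with an approximate nearest-neighbor index), rather than one structural superposition per entry.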

Availability: Source code available at https://github.com/bioinsilico/rcsb-embedding-search. Source code DOI: https://doi.org/10.6084/m9.figshare.30546698.v1. Benchmark datasets DOI: https://doi.org/10.6084/m9.figshare.30546650.v1. Web server prototype available at: http://embedding-search.rcsb.org/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0

Inference of marker genes of subtle cell state changes via iLR: iterative logistic regression.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-02 DOI: 10.1093/bioinformatics/btag051

Yingtong Liu, Aaron G Baugh, Evanthia T Roussos Torres, Adam L MacLean

Motivation: Differential expression and marker gene selection methods for single-cell RNA sequencing (scRNA-seq) data can struggle to identify small sets of informative genes, especially for subtle differences between cell states, as can be induced by disease or treatment.

Results: We present iterative logistic regression (iLR) for the identification of small sets of informative marker genes. iLR applies logistic regression iteratively with a Pareto front optimization to balance gene set size against classification performance. We benchmark iLR on in silico datasets, demonstrating single-cell classification performance comparable to the state of the art while using only a fraction of the genes. We test iLR on its ability to distinguish neuronal cell subtypes in healthy individuals vs. patients with autism spectrum disorder and find that it achieves high accuracy with small sets of disease-relevant genes. We apply iLR to investigate immunotherapeutic effects in cell types from different tumor microenvironments and find that iLR infers informative genes that translate across organs and even across species (mouse to human). We predict via iLR that entinostat acts in part through the modulation of myeloid cell differentiation routes in the lung microenvironment. Overall, iLR provides a means to infer interpretable transcriptional signatures with prognostic or therapeutic potential from complex datasets.
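The trade-off named in the Results above, gene set size against classification performance, can be made concrete with a small Pareto front sketch. The candidate list is hypothetical: each tuple stands for a gene set size and an accuracy that some classifier might have reached, and no logistic regression is fitted here. The function `pareto_front` is an illustrative name, not part of the iLR package.

```python
def pareto_front(candidates):
    """Return the non-dominated (n_genes, accuracy) pairs, sorted by n_genes.

    A candidate is dominated when another candidate uses no more genes and
    reaches no lower accuracy, and is strictly better on at least one of the two.
    """
    front = []
    for i, (n_i, acc_i) in enumerate(candidates):
        dominated = any(
            (n_j <= n_i and acc_j >= acc_i) and (n_j < n_i or acc_j > acc_i)
            for j, (n_j, acc_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((n_i, acc_i))
    return sorted(front)


# Hypothetical (gene set size, accuracy) results from successive model fits.
candidates = [
    (200, 0.95),
    (50, 0.94),
    (50, 0.90),
    (10, 0.88),
    (20, 0.85),
    (5, 0.70),
]

front = pareto_front(candidates)
for n_genes, accuracy in front:
    print(f"{n_genes:>4} genes\taccuracy {accuracy:.2f}")
```

Points on the front are the candidates for which no alternative is both smaller and at least as accurate; choosing among them is then a question of how much accuracy one is willing to give up for a shorter marker list.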

Availability and implementation: iLR is freely available at GitHub https://github.com/maclean-lab/iLR and Zenodo https://zenodo.org/records/17728797.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0

PYRAMA: An open-source tool for advanced meta-analysis of genome wide association studies.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-02 DOI: 10.1093/bioinformatics/btag054

Georgios A Manios, Sophia Nteli, Panagiota I Kontou, Pantelis G Bagos

Motivation: Genome-wide association study (GWAS) meta-analysis tools are essential for integrating summary statistics across multiple cohorts, thereby increasing statistical power and validating genetic associations. Widely cited tools, such as METAL, PLINK, and GWAMA, have facilitated numerous significant discoveries in the field of GWAS. Nevertheless, these tools offer a limited set of meta-analysis methods and typically require users to have prior experience with command-line tools.

Results: We present PYRAMA, an open-source tool designed for meta-analysis of genome-wide association studies. This easy-to-use software package includes several meta-analysis methods that are absent from similar packages. PYRAMA is faster than other tools; supports robust methods for analysis and meta-analysis as well as fixed-effects, random-effects, and Bayesian meta-analysis; and is currently the only tool that supports meta-analysis with imputation of summary statistics. It is available both as a standalone tool and as a freely available web server.
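For readers unfamiliar with the fixed-effects and random-effects models named above, the following is a minimal sketch of the textbook calculations for a single variant. It assumes inverse-variance weighting for the fixed-effects estimate and the DerSimonian-Laird estimator for the between-study variance; the abstract does not say which estimators PYRAMA uses, and the function names and the two-cohort input are hypothetical.

```python
import math


def fixed_effects(betas, ses):
    """Inverse-variance weighted pooled effect and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, math.sqrt(1.0 / total)


def random_effects(betas, ses):
    """DerSimonian-Laird pooled effect, standard error, and tau^2."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta_fe, _ = fixed_effects(betas, ses)
    # Cochran's Q measures heterogeneity around the fixed-effects estimate.
    q = sum(w * (b - beta_fe) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    c = total - sum(w ** 2 for w in weights) / total
    tau2 = max(0.0, (q - df) / c)
    # Between-study variance tau^2 is added to each study's own variance.
    re_weights = [1.0 / (se ** 2 + tau2) for se in ses]
    re_total = sum(re_weights)
    beta = sum(w * b for w, b in zip(re_weights, betas)) / re_total
    return beta, math.sqrt(1.0 / re_total), tau2


# Hypothetical per-cohort effect sizes and standard errors for one variant.
betas = [0.0, 0.3]
ses = [0.1, 0.1]

fe_beta, fe_se = fixed_effects(betas, ses)
re_beta, re_se, tau2 = random_effects(betas, ses)
print(f"fixed:  beta={fe_beta:.3f} se={fe_se:.4f}")
print(f"random: beta={re_beta:.3f} se={re_se:.4f} tau2={tau2:.4f}")
```

When the cohorts disagree, as in this toy input, the random-effects standard error is wider than the fixed-effects one because the between-study variance is added to every study's weight.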

Availability: https://github.com/pbagos/PYRAMA, https://doi.org/10.5281/zenodo.17830449.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0

Using semantic search to find publicly available gene-expression datasets.

IF 5.4

Bioinformatics (Oxford, England)

Pub Date : 2026-02-02 DOI: 10.1093/bioinformatics/btag053

Grace S Brown, James Wengler, Aaron Joyce S Fabelico, Abigail Muir, Anna Tubbs, Amanda Warren, Alexandra N Millett, Xinrui Xiang Yu, Paul Pavlidis, Sanja Rogic, Stephen R Piccolo

Motivation: Millions of high-throughput molecular datasets have been shared in public repositories. Researchers can reuse such data to validate their own findings and explore novel questions. A frequent goal is to find multiple datasets that address similar research topics and to either combine them directly or integrate inferences from them. However, finding relevant datasets is a major challenge because of the vast number of candidates, inconsistencies in their descriptions, and a lack of semantic annotations. This challenge, findability, is the first of the FAIR principles for scientific data. Here we focus on dataset discovery within Gene Expression Omnibus (GEO), a repository containing hundreds of thousands of data series. GEO supports queries based on keywords, ontology terms, and other annotations. However, reviewing the results of such queries is time-consuming and tedious, and the queries often miss relevant datasets.

Results: We hypothesized that language models could address this problem by summarizing dataset descriptions as numeric representations (embeddings). Assuming a researcher has previously found some relevant datasets, we evaluated the potential to find additional relevant datasets. For six human medical conditions, we used 30 models to generate embeddings for datasets that human curators had previously associated with the conditions and identified other datasets with the most similar descriptions. This approach was often, but not always, more effective than GEO's search engine. The top-performing models were trained on general corpora, used contrastive-learning strategies, and used relatively large embeddings. Our findings suggest that language models have the potential to improve dataset discovery, likely in combination with existing search tools.
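The seed-based search described above can be sketched as follows: given embeddings of a few datasets already known to be relevant, score every other dataset by how similar its embedding is to the seeds. The three-dimensional vectors and the names `dataset_x` to `dataset_z` are hypothetical, no language model is called, and averaging cosine similarity over the seeds is an assumed scoring rule rather than the one the authors report.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_seed_similarity(seeds, candidates):
    """Rank candidate names by mean cosine similarity to the seed embeddings."""
    scores = {
        name: sum(cosine(vec, seed) for seed in seeds) / len(seeds)
        for name, vec in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings of dataset descriptions (real ones have many more dimensions).
seeds = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
]
candidates = {
    "dataset_x": [0.9, 0.3, 0.0],
    "dataset_y": [0.0, 0.0, 1.0],
    "dataset_z": [0.0, 1.0, 0.0],
}

ranking = rank_by_seed_similarity(seeds, candidates)
print(ranking)
```

A researcher would then review the top of the ranking by hand, which is where the comparison with keyword-based search in the abstract applies.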

Availability: Our analysis code and a Web-based tool that enables others to use our methodology are available from https://github.com/srp33/GEO_NLP and https://github.com/srp33/GEOfinder3.0, respectively.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0