
Bioinformatics (Oxford, England): Latest Articles

NanoSSL: attention mechanism-based self-supervised learning method for protein identification using nanopores.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf657
Yong Xie, Jindong Li, Ziyan Zhang, Bin Meng, Shuaijian Dai, Yuchen Zhou, Eamonn Kennedy, Niandong Jiao, Haobin Chen, Zhuxin Dong

Motivation: Nanopores are cutting-edge interdisciplinary tools that can analyze biomolecules at the single-molecule level for many applications, e.g. DNA sequencing. Efforts are underway to extend nanopores to proteomics, including the development of machine learning algorithms for protein sequencing and identification. However, single-molecule data are intrinsically noisy and hard to process. Moreover, the development and performance of machine learning for nanopores are hampered by data scarcity. Self-supervised learning is an emerging method that may yield advantages in nanopore scenarios.

Results: We propose and experimentally validate Nanopore analysis using Self-Supervised Learning (NanoSSL), a generative self-supervised learning framework based on attention mechanisms for the identification of protein signals from nanopores. Leveraging a two-step approach consisting of self-supervised pre-training and supervised fine-tuning, NanoSSL learns useful feature representations from empirical data to facilitate downstream classification tasks. Inspired by the concept of fragmentation in conventional protein sequencing technologies, during pre-training each translocation event is split into multiple non-overlapping fragments of equal size, some of which are randomly masked and reconstructed using a masked autoencoder. Learning the feature representations of the reconstructed nanopore events facilitates molecular identification in fine-tuning. In this study, we retested a publicly available nanopore multiplexed protein sensing dataset for model iteration, and subsequently measured the Alzheimer's disease biomarker Aβ1-42 using homemade solid-state nanopores. Empirical results indicated that NanoSSL achieved unprecedented performance across four metrics (accuracy, precision, recall, and F1 score) when classifying two Aβ1-42 mutants, E22G and G37R. The self-supervised learning and attention mechanism were verified as the source of the performance gains.
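
The fragmentation and masking step described above can be illustrated with a minimal sketch. It is not the NanoSSL implementation: the function names, the fragment size of 10 samples, the 50% mask ratio, and the choice to drop trailing samples are all assumptions made for illustration, and the event is synthetic noise rather than a real current trace.

```python
import random

def split_into_fragments(signal, fragment_size):
    """Split a 1-D signal into non-overlapping fragments of equal size.

    Trailing samples that do not fill a whole fragment are dropped
    (an assumption of this sketch, not something the abstract specifies).
    """
    num_fragments = len(signal) // fragment_size
    return [signal[i * fragment_size:(i + 1) * fragment_size]
            for i in range(num_fragments)]

def random_mask(num_fragments, mask_ratio, rng):
    """Return the sorted indices of the fragments to hide from the encoder."""
    num_masked = int(round(num_fragments * mask_ratio))
    return sorted(rng.sample(range(num_fragments), num_masked))

rng = random.Random(0)
# Synthetic stand-in for one translocation event (103 current samples).
event = [rng.gauss(0.0, 1.0) for _ in range(103)]

fragments = split_into_fragments(event, fragment_size=10)
masked = random_mask(len(fragments), mask_ratio=0.5, rng=rng)
visible = [i for i in range(len(fragments)) if i not in masked]

# A masked autoencoder would encode only the visible fragments and be
# trained to reconstruct the masked ones; that model is omitted here.
print("fragments:", len(fragments), "masked:", masked, "visible:", visible)
```

In a masked-autoencoder setup, the reconstruction loss is typically computed on the masked fragments only, so the encoder has to infer hidden signal from its visible context.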

Availability and implementation: The main program is available at https://doi.org/10.5281/zenodo.17172822.

Citations: 0
MegaPlantTF: a machine learning framework for comprehensive identification and classification of plant transcription factors.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf678
Genereux Akotenou, Asmaa H Hassan, Morad M Mokhtar, Achraf El Allali

Motivation: Understanding the role of transcription factors (TFs) in plants is essential for the study of gene regulation and various biological processes. However, both TF detection and classification remain challenging due to the great diversity and complexity of these proteins. Conventional approaches, such as BLAST, often suffer from high computational complexity and limited performance on less common TF families.

Results: We introduce MegaPlantTF, the first comprehensive machine learning and deep learning framework for the prediction (TF versus non-TF) and classification (family-level) of plant TFs. Our method employs k-mer-based protein representations and a two-stage architecture combining a deep feed-forward neural network with a stacking ensemble classifier. To ensure robust performance assessment, we report micro-, macro-, and weighted-average performance metrics, providing a holistic evaluation of both frequent and underrepresented TF families. Additionally, we employ threshold-based evaluation to calibrate confidence in TF detection. The results show that MegaPlantTF achieves strong accuracy and precision, particularly with a k-mer size of 3 and a classification threshold of 0.5, and maintains stable performance even under stringent thresholds. In addition to the standard cross-validation tests, a use case study on Sorghum bicolor confirms that our method performs strongly in genome-wide analysis, making it highly suitable for large-scale TF identification and classification tasks. MegaPlantTF represents a novel contribution by integrating k-mer encoding, binary family-specific classifiers, and a two-stage stacking ensemble into a unified, reproducible framework for large-scale plant TF identification and classification.
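
To make the k-mer-based protein representation concrete, here is a small sketch of overlapping k-mer counting with k = 3, the size the abstract reports as working well. The function names, the frequency normalization, and the toy sequence are assumptions for illustration; the actual MegaPlantTF encoding may differ.

```python
from collections import Counter

def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a protein sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def kmer_vector(sequence, vocabulary, k=3):
    """Turn a sequence into k-mer frequencies ordered by a fixed vocabulary."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]

toy_protein = "MKVLAMKV"  # made-up sequence, not a real transcription factor
counts = kmer_counts(toy_protein)
vocabulary = sorted(counts)  # in practice the vocabulary comes from the training set
vector = kmer_vector(toy_protein, vocabulary)

print(counts.most_common(1))  # the most frequent 3-mer and its count
print(dict(zip(vocabulary, vector)))
```

A fixed vocabulary keeps every protein's vector the same length, which is what a feed-forward network needs as input.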

Availability and implementation: MegaPlantTF is freely accessible through a public web server available at https://bioinformatics.um6p.ma/MegaPlantTF. The complete source code, including pretrained models and example datasets, is available at https://github.com/Bioinformatics-UM6P/MegaPlantTF.

Contacts and supplementary information: Supplementary data are available online. Any correspondence should be addressed to the authors via email or by opening an issue on the MegaPlantTF GitHub page.
Citations: 0
BPSS: a Nextflow pipeline for Bacterial Peptide Sequence Selection to detect protein diversity.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf677
Sylvère Bastien, Pauline François, Sara Moussadeq, Jérôme Lemoine, Karen Moreau, François Vandenesch

Motivation: Sequence variability can be extremely high, particularly in bacteria due to the rapid accumulation of mutations linked to their high replication rate and environmental selection pressure, which often favors diversifying selection. For most species, there are no automated, computationally efficient tools available for constructing a nonredundant database covering the allelic variability of target proteins.

Results: We have thus developed Bacterial Peptide Sequence Selection (BPSS), a Nextflow pipeline that defines a minimal list of peptide sequences for detecting all variants of a protein of interest.
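
The abstract does not say how the minimal peptide list is computed. One way to think about the problem is as a set cover: every variant must contain at least one selected peptide. The sketch below uses a greedy heuristic under that assumption; it is not the BPSS algorithm, the variant and peptide names are invented, and a greedy cover is small but not guaranteed to be minimal.

```python
def greedy_peptide_cover(variant_peptides):
    """Pick peptides so that every variant contains at least one of them.

    variant_peptides maps a variant name to the set of peptides it contains.
    Variants with no peptides cannot be detected and are skipped.
    """
    uncovered = {v for v, peptides in variant_peptides.items() if peptides}
    chosen = []
    while uncovered:
        # Count how many still-uncovered variants each peptide would detect.
        gain = {}
        for variant in uncovered:
            for peptide in variant_peptides[variant]:
                gain[peptide] = gain.get(peptide, 0) + 1
        # Largest gain wins; ties are broken alphabetically for determinism.
        best = min(gain, key=lambda p: (-gain[p], p))
        chosen.append(best)
        uncovered = {v for v in uncovered if best not in variant_peptides[v]}
    return chosen

# Hypothetical alleles of one protein and the peptides each would yield.
variants = {
    "allele_1": {"AAK", "GLR"},
    "allele_2": {"AAK", "TTK"},
    "allele_3": {"GLR", "SSR"},
    "allele_4": {"SSR"},
}
selected = greedy_peptide_cover(variants)
print(selected)  # ['AAK', 'SSR']
```

Here two peptides are enough to detect all four alleles, since each allele contains either AAK or SSR.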

Availability and implementation: All the code and containers used are freely available under the GPLv3 open-source license on GitLab (https://gitbio.ens-lyon.fr/ciri/stapath/bpss) or Zenodo (10.5281/zenodo.16894981), and on the DockerHub platform (https://hub.docker.com/u/stapath).

Citations: 0
Hi-Enhancer: a two-stage framework for prediction and localization of enhancers based on Blending-KAN and Stacking-Auto models.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf441
Aimin Li, Haotian Zhou, Rong Fei, Juntao Zou, Xiguo Yuan, Yajun Liu, Saurav Mallik, Xinhong Hei, Lei Wang

Motivation: Gene expression plays a crucial role in cell function, and enhancers can regulate gene expression precisely. Therefore, accurate prediction of enhancers is particularly critical. However, existing prediction methods have low accuracy or rely on fixed multiple epigenetic signals, which may not always be available.

Results: We propose a two-stage framework that accurately predicts enhancers by flexibly combining multiple epigenetic signals. In the first stage, we designed a Blending-KAN model, which integrates the results of various base classifiers and employs Kolmogorov-Arnold Networks (KAN) as a meta-classifier to predict enhancers based on flexible combinations of multiple epigenetic signals. In the second stage, we developed a Stacking-Auto model, which extracts sequence features using DNABERT-2 and locates the enhancers based on the Stacking strategy and the AutoGluon framework. The accuracy of the Blending-KAN model reached 99.69 ± 0.11% when five epigenetic signals were used. In cross-cell line prediction, the accuracy was greater than or equal to 93.72%. With Gaussian noise added, the model still maintained an accuracy of 98.74 ± 0.03%. In the second stage, the accuracy of the Stacking-Auto model was 80.50%, which is better than that of 17 existing methods. The results show that our models can be flexibly used to predict and locate enhancers utilizing a combination of multiple epigenetic signals.
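
The blending idea in the first stage can be sketched as follows. This is only an illustration of blending, not the Hi-Enhancer code: the base classifiers are simple per-signal thresholds, the meta-classifier is a plain logistic regression standing in for the KAN, the data are synthetic Gaussian "signals", and every function name is hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_threshold(values, labels):
    """Base classifier for one signal: midpoint between the two class means."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def fit_logistic(features, labels, epochs=500, lr=0.5):
    """Meta-classifier: logistic regression trained by batch gradient descent."""
    weights, bias, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def make_samples(n, rng, num_signals=3):
    """Synthetic data: label 1 shifts every signal up by two units."""
    return [([rng.gauss(2.0 * (i % 2), 1.0) for _ in range(num_signals)], i % 2)
            for i in range(n)]

def blend(signal_ids, rng):
    """Train base models, then a meta-model on held-out base outputs."""
    base_set, blend_set, test_set = (make_samples(200, rng) for _ in range(3))
    # Step 1: one base classifier per selected signal, fitted on base_set.
    thresholds = {s: fit_threshold([x[s] for x, _ in base_set],
                                   [y for _, y in base_set]) for s in signal_ids}

    def base_outputs(x):
        return [sigmoid(x[s] - thresholds[s]) for s in signal_ids]

    # Step 2: the meta-classifier sees only base outputs on the held-out split.
    weights, bias = fit_logistic([base_outputs(x) for x, _ in blend_set],
                                 [y for _, y in blend_set])
    # Evaluate on data that neither step has seen.
    correct = sum(
        int((bias + sum(w * f for w, f in zip(weights, base_outputs(x))) > 0.0) == (y == 1))
        for x, y in test_set)
    return correct / len(test_set)

accuracy_all = blend([0, 1, 2], random.Random(1))  # all three signals available
accuracy_two = blend([0, 2], random.Random(1))     # only two signals available
print(accuracy_all, accuracy_two)
```

Because the meta-classifier is fitted on whichever base outputs are passed in, the same code runs for any subset of signals, which is the kind of flexibility the abstract emphasizes.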

Availability and implementation: The source code is available at https://github.com/emanlee/Hi-Enhancer and https://doi.org/10.6084/m9.figshare.29262158.v1.

Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
Spider: a flexible and unified framework for simulating spatial transcriptomics data.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf562
Jiyuan Yang, Nana Wei, Yang Qu, Congcong Hu, Weiwei Zhang, Lin Liu, Hua-Jun Wu, Xiaoqi Zheng

Motivation: Spatial transcriptomics (ST) technologies provide valuable insights into cellular heterogeneity by simultaneously acquiring both gene expression profiles and cellular location information. However, the limited diversity and accuracy of "gold standard" datasets hinder effective and fair benchmarking of the rapidly growing set of ST analysis tools.

Results: To address this issue, we propose Spider, a flexible and comprehensive framework for simulating ST data without requiring real ST data as a reference. By characterizing spatial patterns using cell type proportions and a transition matrix between adjacent cells, Spider can produce more realistic and diverse simulated data and offers enhanced modeling flexibility compared to existing simulation methods. Additionally, Spider provides interactive features for customizing the spatial domain, such as zone segmentation and integration of histology imaging data. Benchmark analyses demonstrate that Spider outperforms other simulation tools in preserving the spatial characteristics of real ST data and facilitating the evaluation of downstream analysis methods. Spider is implemented in Python and available at https://github.com/YANG-ERA/Spider.
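
As a rough illustration of how cell type proportions and a transition matrix between adjacent cells can define a spatial pattern, the sketch below samples each grid row as a Markov chain. This is a deliberately simplified assumption, not Spider's algorithm: it only conditions on the left-hand neighbor, the proportions only seed the first column, and the cell types and probabilities are invented.

```python
import random

def simulate_grid(rows, cols, proportions, transition, rng):
    """Assign a cell type to every position of a rows x cols grid.

    The first cell of each row is drawn from the global proportions; every
    other cell is drawn from the transition row of its left-hand neighbor.
    """
    types = list(proportions)
    grid = []
    for _ in range(rows):
        row = []
        for c in range(cols):
            if c == 0:
                weights = [proportions[t] for t in types]
            else:
                weights = [transition[row[c - 1]][t] for t in types]
            row.append(rng.choices(types, weights=weights, k=1)[0])
        grid.append(row)
    return grid

# Hypothetical two-type tissue; high self-transition gives contiguous patches.
proportions = {"tumor": 0.3, "stroma": 0.7}
transition = {
    "tumor": {"tumor": 0.9, "stroma": 0.1},
    "stroma": {"tumor": 0.1, "stroma": 0.9},
}
grid = simulate_grid(5, 8, proportions, transition, random.Random(0))
for row in grid:
    print(" ".join(t[0].upper() for t in row))
```

Raising the self-transition probabilities produces larger homogeneous patches, while values near the global proportions produce a salt-and-pepper pattern. Matching the proportions and the transition matrix jointly over a 2-D neighborhood is a harder problem that this sketch does not attempt.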

Availability and implementation: All code and simulated ST data in this paper are publicly available at https://github.com/YANG-ERA/Spider.

Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
CholeraSeq: a comprehensive genomic pipeline for cholera surveillance and near real-time outbreak investigation.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf665
Massimiliano S Tagliamonte, Abhinav Sharma, Alberto Riva, Monika Moir, Marco Salemi, Cheryl Baxter, Tulio de Oliveira, Carla N Mavian, Eduan Wilkinson

Summary: Next-generation sequencing is widely deployed in cholera-endemic regions, yet an end-to-end reproducible pipeline that unifies read QC, filtering, reference mapping, variant calling/annotation, recombination screening, extraction of parsimony-informative sites/variant codons, and phylogenetic inference for downstream phylodynamic and epidemiological analyses has been lacking, slowing outbreak investigation and public health response. CholeraSeq is a high-throughput genomics pipeline for cholera genomic surveillance. It ingests consensus genomes, short-read sequence data, and draft assemblies, and scales seamlessly from local to cloud environments. To accelerate epidemiological context placement of new outbreak strains, we provide a curated, ready-to-use core genome alignment compiled from public data, enabling flexible, fast integration of new samples for outbreak investigations.
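
One of the steps listed above, extraction of parsimony-informative sites, has a simple standard definition: an alignment column is parsimony-informative when at least two different character states each occur in at least two sequences. The sketch below applies that definition to a toy alignment. It is not taken from CholeraSeq; the function name and the decision to ignore gaps and N are assumptions.

```python
from collections import Counter

def parsimony_informative_sites(alignment, ignore="-N"):
    """Return 0-based column indices that are parsimony-informative.

    A column qualifies when at least two distinct states each appear in at
    least two sequences. Characters listed in `ignore` are not counted.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences in an alignment must have equal length")
    sites = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment if seq[pos] not in ignore)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(pos)
    return sites

# Toy alignment of four sequences (not real Vibrio cholerae data).
alignment = [
    "ACGTA",
    "ACGTC",
    "ATGAC",
    "ATG-A",
]
informative = parsimony_informative_sites(alignment)
print(informative)  # [1, 4]
```

Column 3 varies but is not informative, because only one state (T) is shared by two or more sequences.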

Availability and implementation: CholeraSeq is freely available on the GitHub platform https://github.com/CERI-KRISP/CholeraSeq. CholeraSeq is implemented in Nextflow with a modular design building upon the nf-core community standards.

Supplementary information: Ready-to-use reference core alignment and associated metadata: https://doi.org/10.5281/zenodo.16909942.
Citations: 0
Building multiscale Markov state models by systematic mapping of temporal communities.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf585
Nir Nitskansky, Kessem Clein, Barak Raveh

Motivation: Biomolecules undergo dynamic transitions among metastable states to carry out their biological functions. Markov State Models (MSMs) effectively capture these metastable states and transitions at a defined temporal scale. However, biomolecular dynamics typically span multiple temporal scales, ranging from fast atomic vibrations to slower conformational changes and folding events.

Results: We introduce multiscale Markov State Models (mMSMs), which capture biomolecular dynamics across multiple temporal resolutions simultaneously via a hierarchy of MSMs, and mMSM-explore, an unsupervised algorithm for generating mMSMs through multiscale adaptive sampling with on-the-fly identification of temporally metastable states. We benchmark our method on a toy system with nested energy minima; on alanine dipeptide, first with and then without assuming prior knowledge of its two reaction coordinates; and finally, on a fast-folding 35-residue miniprotein, where we map folding pathways across scales. We demonstrate efficient mapping of energy landscapes, correct representation of multiscale hierarchies and transition states, accurate inference of stationary probabilities and transition kinetics, as well as de novo identification of underlying slow, intermediate, and fast reaction coordinates. mMSMs reveal how dynamic processes at different scales contribute collectively to the functional mechanisms of biomolecular machines.
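
For readers unfamiliar with MSMs, the single-scale building block can be sketched in a few lines: count transitions between discrete states at a chosen lag time, row-normalize the counts, and obtain stationary probabilities by power iteration. This is a generic textbook-style sketch, not the mMSM-explore algorithm; the function names, the two-state trajectory, and the uniform fallback for unvisited states are assumptions.

```python
def estimate_transition_matrix(trajectory, num_states, lag=1):
    """Row-normalized transition counts between states separated by `lag` steps."""
    counts = [[0] * num_states for _ in range(num_states)]
    for a, b in zip(trajectory[:-lag], trajectory[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States never left in the data get a uniform row (an arbitrary choice).
        matrix.append([c / total for c in row] if total
                      else [1.0 / num_states] * num_states)
    return matrix

def stationary_distribution(matrix, iterations=1000):
    """Power iteration: repeatedly propagate a distribution through the matrix."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi

# Made-up discrete trajectory hopping between two metastable states.
trajectory = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
T = estimate_transition_matrix(trajectory, num_states=2)
pi = stationary_distribution(T)
print(T)   # rows close to [0.6, 0.4] and [0.4, 0.6]
print(pi)  # close to [0.5, 0.5]
```

Re-estimating the matrix with a larger `lag` on coarser states gives a model of slower dynamics; a hierarchy of such models is one way to picture the multiscale idea, although the abstract does not spell out how mMSMs are constructed.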

Availability and implementation: Python code and instructions are available at https://github.com/ravehlab/mMSM.

动机:生物分子通过亚稳态之间的动态转变来实现其生物学功能。马尔可夫状态模型(mmsm)有效地捕获了这些亚稳态和在定义的时间尺度上的转变。然而,实际的动力学通常跨越多个时间尺度,从快速的原子振动到较慢的构象变化和折叠事件。结果:我们引入了多尺度马尔可夫状态模型(mmsm),该模型通过msm层次结构同时代表了多个时间分辨率的生物分子动力学,以及mMSM-explore,这是一种无监督算法,用于通过多尺度自适应采样生成mmsm,并实时识别时间亚稳态。我们在一个具有嵌套能量最小值的玩具系统上对我们的方法进行基准测试;在丙氨酸二肽上,先知道然后不知道它的两个反应坐标;最后,我们绘制了一个快速折叠的35个残基微型蛋白的折叠路径。我们展示了能量景观的有效映射,多尺度层次和过渡状态的正确表示,平稳概率和过渡动力学的准确推断,以及潜在的慢、中、快速反应坐标的从头识别。mmms揭示了不同尺度的动态过程如何共同促进生物分子机器的功能机制。可用性:Python代码和说明可在https://github.com/ravehlab/mMSM.Supplementary上获得:信息:补充数据可在Bioinformatics在线获得。
Citations: 0
Reconstructing and comparing signal transduction networks from single-cell protein quantification data.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf675
Tim Stohn, Roderick A P M van Eijl, Klaas W Mulder, Lodewyk F A Wessels, Evert Bosdriesz

Motivation: Signal transduction networks regulate many essential biological processes and are frequently altered in diseases such as cancer. A mechanistic understanding of such networks, and of how they differ between cell populations, is essential for designing effective treatment strategies. Typically, such networks are reconstructed computationally from systematic perturbation experiments followed by quantification of signaling protein activity. Recent technological advances now allow the activity of many (signaling) proteins to be quantified simultaneously in single cells. This makes it feasible to reconstruct or quantify signaling networks without performing systematic perturbations.

Results: Here, we introduce single-cell modular response analysis (scMRA) and single-cell comparative network reconstruction (scCNR) to derive signal transduction networks by exploiting the heterogeneity of single-cell (phospho-)protein measurements. The methods treat stochastic variation in total protein abundances as natural perturbation experiments, whose effects propagate through the network and hence facilitate the reconstruction and quantification of the underlying signaling network. scCNR reconstructs cell population-specific networks, in which cells from different populations share the same underlying topology but interaction strengths can differ between populations. We extensively validated scMRA and scCNR on simulated data, and applied them to previously unpublished (phospho-)protein measurements of EGFR-inhibitor-treated keratinocytes to recover signaling differences downstream of EGFR. scCNR will help to unravel the mechanistic signaling differences between cell populations, and will subsequently guide the development of well-informed treatment strategies.

Availability and implementation: The code used for scCNR in this study has been deposited on Zenodo (https://doi.org/10.5281/zenodo.17600937) and is also available as a Python module at https://github.com/ibivu/scmra. Additionally, data and code to reproduce all figures are available at https://github.com/tstohn/scmra_analysis.
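
The idea of using total-protein variation as a natural perturbation can be illustrated with a toy example. The sketch below assumes a linear steady-state model, x = R x + s * t, where x holds per-cell deviations in active protein, t holds deviations in total protein, R is the matrix of local response coefficients (zero diagonal), and s is a per-node sensitivity. This model, the ordinary least-squares fit, and the function names are all assumptions made for illustration; they are not the formulation or API of the scmra package.

```python
import numpy as np

def simulate_cells(R, s, n_cells, rng):
    """Simulate noise-free single-cell data under x = R x + s * t (assumed model)."""
    n = R.shape[0]
    T = rng.normal(size=(n_cells, n))      # total-protein deviations, one row per cell
    A = np.linalg.inv(np.eye(n) - R)       # x = (I - R)^-1 (s * t)
    X = (T * s) @ A.T                      # active-protein deviations, one row per cell
    return X, T

def fit_local_responses(X, T):
    """Regress each node on all other nodes plus its own total-protein deviation."""
    n = X.shape[1]
    R_hat = np.zeros((n, n))
    s_hat = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        design = np.column_stack([X[:, others], T[:, i]])
        coef, *_ = np.linalg.lstsq(design, X[:, i], rcond=None)
        R_hat[i, others] = coef[:-1]       # incoming interaction strengths for node i
        s_hat[i] = coef[-1]                # sensitivity of node i to its own abundance
    return R_hat, s_hat
```

In this toy setting every possible edge is fitted. Fitting one `R_hat` per cell population and comparing them would be a crude analogue of the population-specific interaction strengths described in the abstract.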

Citations: 0
PanForest: predicting genes in genomes using random forests.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btag005
Alan J S Beavan, Maria Rosa Domingo-Sananes, James O McInerney

Motivation: The presence or absence of some genes in a genome can influence whether other genes are likely to be present or absent. Understanding these gene co-occurrence and avoidance patterns reveals fundamental principles of genome organization, with applications ranging from evolutionary reconstruction to rational design of synthetic genomes.

Results: PanForest, presented here, uses random forest classifiers to predict the presence and absence of genes in genomes from the set of other genes present. Performance statistics output by PanForest reveal how predictable each gene's presence or absence is, based on the presence or absence of other genes in the genome. Further, PanForest produces statistics indicating the importance of each gene in predicting the presence or absence of each other gene. The PanForest software can run serially or in parallel, thereby facilitating the analysis of pangenomes at Network of Life scale. A pangenome of 12 741 accessory genes in 1000 Escherichia coli genomes was analysed in around 5 h using eight processors. To demonstrate PanForest's utility, we present a case study and show that certain genes associated with resistance to antimicrobial drugs reliably predict the presence or absence of other genes associated with resistance to the same drug. Further, we highlight several associations between those genes and others not known to be associated with antimicrobial resistance (AMR), or associated with resistance to other drugs. We envisage PanForest's use in studies from multiple disciplines concerning the dynamics of gene distributions in pangenomes, ranging from biomedical science and synthetic biology to molecular ecology.

Availability and implementation: The software is freely available with a full manual at www.github.com/alanbeavan/PanForest (DOI: https://doi.org/10.5281/zenodo.17865482).
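
The per-gene prediction task described above can be sketched with scikit-learn. This is an illustration under stated assumptions, not PanForest's own code or command-line interface: the input is assumed to be a genomes-by-genes 0/1 matrix, and `fit_gene_forest` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def fit_gene_forest(presence, target, n_trees=200, seed=0, n_jobs=1):
    """Predict one gene's presence/absence from all other genes.

    presence: (genomes x genes) array of 0/1 values (assumed input layout).
    target:   column index of the gene to predict.
    Returns held-out accuracy and a dict mapping predictor gene index -> importance.
    """
    y = presence[:, target]
    X = np.delete(presence, target, axis=1)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    forest = RandomForestClassifier(
        n_estimators=n_trees, random_state=seed, n_jobs=n_jobs
    )
    forest.fit(X_tr, y_tr)
    accuracy = forest.score(X_te, y_te)
    # Map importances back to the original column indices of the predictor genes.
    predictors = np.delete(np.arange(presence.shape[1]), target)
    importances = dict(zip(predictors.tolist(), forest.feature_importances_.tolist()))
    return accuracy, importances
```

Looping `target` over every accessory gene would give one accuracy score per gene and a gene-by-gene importance table, which is the kind of output the abstract describes. Raising `n_jobs` is scikit-learn's way of using several processors.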

Citations: 0
PLM-eXplain: divide and conquer the protein embedding space.
IF 5.4 Pub Date : 2026-01-02 DOI: 10.1093/bioinformatics/btaf631
Jan van Eck, Dea Gogishvili, Wilson Silva, Sanne Abeln

Motivation: Protein language models (PLMs) have revolutionized computational biology through their ability to generate powerful sequence representations for diverse prediction tasks. However, their black-box nature limits biological interpretation and translation to actionable insights. Bridging this gap requires approaches that maintain predictive performance while providing interpretable explanations of model behaviour.

Results: We present PLM-eXplain (PLM-X), an explainable adapter layer that bridges this gap by factoring PLM embeddings into two complementary components: an interpretable subspace based on established biochemical features, and a residual subspace that retains predictive, non-interpretable information. Using embeddings from ESM2 and ProtBert, PLM-X incorporates well-established properties, including secondary structure and hydropathy, while maintaining high predictive performance. We demonstrate the effectiveness of our approach across three biologically relevant classification tasks: extracellular vesicle association, transmembrane helix prediction, and aggregation propensity prediction. PLM-X enables biological interpretation of model decisions without sacrificing accuracy, offering a generalizable solution for enhancing PLM interpretability across various downstream applications.

Availability and implementation: Source code and models are available at https://github.com/AIT4LIFE-UU/PLM-eXplain/.
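
To make the idea of splitting an embedding into an interpretable part and a residual part concrete, here is a purely linear sketch. It swaps in an orthogonal projection for the adapter layer the abstract describes, so it should be read as an analogy rather than the PLM-X architecture. `E` (embeddings), `F` (known per-residue features such as hydropathy), and `split_embedding` are hypothetical names.

```python
import numpy as np

def split_embedding(E, F):
    """Split embeddings into a feature-aligned part and an orthogonal residual.

    E: (n_samples x d) embedding matrix.
    F: (n_samples x k) matrix of known biochemical features, with k < d.
    """
    # Least-squares map from embeddings to features: F is approximated by E @ W.
    W, *_ = np.linalg.lstsq(E, F, rcond=None)
    # Orthonormal basis for the embedding directions that carry those features.
    Q, _ = np.linalg.qr(W)
    interpretable = E @ Q @ Q.T   # component lying in the feature-aligned subspace
    residual = E - interpretable  # everything else, orthogonal to that subspace
    return interpretable, residual, Q
```

By construction the two parts sum back to the original embedding, so a downstream classifier that receives both loses no information, while the `interpretable` part alone determines the linear feature predictions.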

Citations: 0