Bioinformatics (Oxford, England): Latest Articles

SA2E: spatial-aware auto-encoder for cell type deconvolution of spatial transcriptomics data.
IF 5.4 Pub Date : 2026-03-20 DOI: 10.1093/bioinformatics/btag133
Yaxiong Ma, Zengfa Dou, Yuhong Zha, Xiaoke Ma

Motivation: Spatial transcriptomics (ST) technologies measure gene expression together with spatial locations, but each spot typically contains a mixture of cell types, posing a challenge for downstream analysis. Cell-type deconvolution aims to infer spot-wise cell-type proportions by integrating single-cell RNA-seq (scRNA-seq) and ST data. Many existing methods construct cell-type signatures from predefined marker genes, which can limit performance when marker information is incomplete or unavailable.

Results: To address this limitation, we propose a spatial-aware auto-encoder framework (SA2E) for cell-type deconvolution without requiring predefined cell-type biomarkers. SA2E learns latent spot representations using a spatially regularized auto-encoder that preserves the local topology of the spot spatial graph. Based on these representations, SA2E learns cell-type signatures by enforcing them to reconstruct ST expression. In our framework, simulated ST data with known proportions are used for supervised pretraining, while real ST data are optimized using the reconstruction objective. Extensive experiments on simulated and real ST datasets demonstrate that SA2E outperforms state-of-the-art deconvolution baselines.
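The objective described in the Results paragraph above can be illustrated with a small NumPy sketch. This is not the SA2E implementation: the encoder and decoder networks are omitted, and the function names (`softmax`, `knn_laplacian`, `deconvolution_loss`), the graph-Laplacian smoothness penalty, and its weight `lam` are all assumptions chosen to show one common way of combining a reconstruction term with a spatial regularizer.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, so each spot's proportions are non-negative and sum to 1."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def knn_laplacian(coords, k=4):
    """Unnormalized Laplacian of a symmetrized k-nearest-neighbour spot graph (hypothetical helper)."""
    n = coords.shape[0]
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a spot is not its own neighbour
    nbrs = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    adj = np.maximum(adj, adj.T)              # make the graph undirected
    return np.diag(adj.sum(axis=1)) - adj

def deconvolution_loss(logits, signatures, expr, laplacian, lam=0.1):
    """Reconstruction error plus a spatial smoothness penalty (illustrative only)."""
    props = softmax(logits)                   # spots x cell types
    recon = props @ signatures                # spots x genes
    recon_err = np.mean((expr - recon) ** 2)
    smooth = np.trace(props.T @ laplacian @ props) / props.shape[0]
    return recon_err + lam * smooth, props

# Toy data: 30 spots, 3 cell types, 50 genes (all values synthetic).
rng = np.random.default_rng(0)
coords = rng.uniform(size=(30, 2))
signatures = rng.gamma(2.0, 1.0, size=(3, 50))
logits = rng.normal(size=(30, 3))
expr = softmax(logits) @ signatures           # expression generated without noise
lap = knn_laplacian(coords, k=4)
loss, props = deconvolution_loss(logits, signatures, expr, lap, lam=0.0)
```

In this toy setting the expression is generated from the same proportions and signatures, so the reconstruction term vanishes; with `lam > 0` the penalty pulls neighbouring spots toward similar proportions.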

Availability and implementation: The code of SA2E is available on GitHub (https://github.com/xkmaxidian/SA2E) and Zenodo (DOI: 10.5281/zenodo.18765467).

Citations: 0
HaDeX2: multi-dimensional analysis of Hydrogen-Deuterium Exchange Mass Spectrometry data.
IF 5.4 Pub Date : 2026-03-16 DOI: 10.1093/bioinformatics/btag128
Weronika Puchała, Krystyna Grzesiak, Dominik Rafacz, Michał Kistowski, Jochem H Smit, Julien Marcoux, Michał Dadlez, Michał Burdukiewicz

Summary: Hydrogen-Deuterium Exchange Mass Spectrometry (HDX-MS) monitors deuterium uptake at the peptide level in a time-dependent manner. It produces complex, multi-dimensional data that must be interpreted at least at both the temporal and sequence levels. Specialized tools are therefore essential to preprocess, integrate, and analyze HDX-MS data and translate it into meaningful biological insights. HaDeX2 provides statistical inferences and their visualizations across five dimensions of HDX-MS data: protein sequence, time, biological states, peptide charge, and experimental replicates.

Availability and implementation: HaDeX2 is freely available as an R package (https://github.com/hadexversum/HaDeX2; https://doi.org/10.5281/zenodo.18543703) and web server (https://hadex2.mslab-ibb.pl/). To run the GUI locally, users should install a dedicated companion package (https://github.com/hadexversum/HaDeXGUI).

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
Identification of autosomal and sex chromosome aneuploidies using next generation sequencing.
IF 5.4 Pub Date : 2026-03-16 DOI: 10.1093/bioinformatics/btag104
Nidia Barco-Armengol, Dèlia Yubero, Clara Xiol, Núria Catasús, Laura Martí-Sánchez, Judith Armstrong, Francesc Palau, Guerau Fernandez

Motivation: Numerical chromosomal abnormalities, referred to as aneuploidies, occur in approximately 0.3% of live births. While the majority of aneuploidies in humans are incompatible with life, well-characterized exceptions include Down syndrome (47,+21), Patau syndrome (47,+13), Edwards syndrome (47,+18), Turner syndrome (45, X0), Klinefelter syndrome (47, XXY), and triple X syndrome (47, XXX). These chromosomal alterations disrupt gene expression and cellular function, leading to genetic and developmental disorders. With the increasing adoption of next-generation sequencing (NGS) in clinical diagnostics, this study aims to explore the potential use of NGS for aneuploidy detection.

Results: Using data derived from clinical exome sequencing (CES) and whole-exome sequencing (WES), we were able to detect autosomal as well as sex chromosome aneuploidies with high specificity. Moreover, we were also able to identify mosaic aneuploidies, demonstrating the high sensitivity of this methodological approach. Thus, we present NGS as a cost-effective first-line approach to detect chromosomal aneuploidies in routine diagnostic practice.
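The abstract does not describe the detection algorithm, so the following is only a generic coverage-ratio sketch of how chromosome copy number can be estimated from exome read counts. The function names, the median normalization over autosomes, the euploid reference panel, and the `tolerance` threshold are all assumptions and are not taken from the authors' scripts.

```python
from statistics import median

def estimate_copy_numbers(sample_counts, reference_counts, reference_copies=None):
    """Estimate copies per chromosome from read counts relative to a euploid reference.

    reference_copies gives the known copy number of each chromosome in the
    reference (defaults to 2, i.e. a 46,XX reference is assumed here).
    """
    reference_copies = reference_copies or {}
    ratios = {c: sample_counts[c] / reference_counts[c] for c in sample_counts}
    autosomes = [c for c in ratios if c not in ("chrX", "chrY")]
    # The median autosomal ratio absorbs differences in sequencing depth and is
    # robust to a single gained or lost chromosome.
    scale = median(ratios[c] for c in autosomes)
    return {c: reference_copies.get(c, 2) * ratios[c] / scale for c in ratios}

def flag_aneuploid(copies, expected=None, tolerance=0.3):
    """Return chromosomes whose estimate deviates from the expected copy number."""
    expected = expected or {}
    return sorted(c for c, n in copies.items() if abs(n - expected.get(c, 2)) > tolerance)

# Toy counts (synthetic): the sample was sequenced at half the reference depth
# and carries one extra copy of chr21.
reference = {"chr1": 1000, "chr2": 900, "chr21": 200, "chrX": 400}
sample = {"chr1": 500, "chr2": 450, "chr21": 150, "chrX": 200}
copies = estimate_copy_numbers(sample, reference)
flagged = flag_aneuploid(copies)
```

Under this kind of model a mosaic aneuploidy would appear as a fractional estimate between two integer copy numbers, which is why the choice of threshold governs the trade-off between sensitivity and specificity.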

Availability: Scripts are available at https://github.com/B-R-I-D-G-E/AneuploidiesStudies.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
Tractor Workflow: A Scalable Nextflow Framework for Local Ancestry-Aware Genome-Wide Association Studies.
IF 5.4 Pub Date : 2026-03-16 DOI: 10.1093/bioinformatics/btag124
Nirav N Shah, Taotao Tan, Jessica Honorato-Mauer, Yi-Sian Lin, Adam X Maihofer, Clement C Zai, Marcos Santoro, Caroline M Nievergelt, Elizabeth G Atkinson

Motivation: The routine exclusion of admixed individuals from traditional Genome-Wide Association Studies (GWAS) due to concerns about spurious associations has limited multi-ancestry genetic discovery. Tractor addresses this issue by incorporating local ancestry into association testing, enabling the identification of ancestry-enriched signals and generating ancestry-specific summary statistics. However, adoption has been constrained by the complexity of prerequisite steps, including phasing and local ancestry inference, which require substantial bioinformatics expertise and introduce key analytical decision points.

Results: We developed a scalable, automated Nextflow workflow that integrates phasing, local ancestry inference, and Tractor association testing into a reproducible end-to-end pipeline. To demonstrate its utility, we applied the workflow to 32 blood biomarkers in 6,245 two-way African-European admixed individuals from the UK Biobank. This pipeline performed efficiently at scale, replicating known associations and uncovering key ancestry-specific loci. These associations were largely driven by variants present on African ancestral tracts but absent from European tracts, underscoring the value of local ancestry-aware methods in uncovering previously masked genetic signals.
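To make the idea of a local ancestry-aware association test concrete, here is a minimal single-variant sketch for a quantitative trait in a two-way admixed cohort. It is not the Tractor code or the workflow described above: the function name, the choice of covariates (number of haplotypes from one ancestry plus ancestry-specific allele dosages), and the use of ordinary least squares are assumptions made for illustration.

```python
import numpy as np

def ancestry_aware_fit(y, anc0_haplotypes, dosage_anc0, dosage_anc1):
    """Least-squares fit with one effect estimate per ancestry (illustrative).

    anc0_haplotypes: number of haplotypes (0, 1, or 2) of ancestry 0 at the locus
    dosage_anc0/1:   copies of the tested allele carried on each ancestry background
    Returns [intercept, local-ancestry effect, ancestry-0 effect, ancestry-1 effect].
    """
    design = np.column_stack([
        np.ones(len(y)), anc0_haplotypes, dosage_anc0, dosage_anc1
    ])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Synthetic, noise-free example: the allele only has an effect on ancestry 0.
rng = np.random.default_rng(1)
n = 500
hap = rng.integers(0, 3, size=n)            # ancestry-0 haplotype count
d0 = rng.binomial(hap, 0.3)                 # allele copies on ancestry-0 tracts
d1 = rng.binomial(2 - hap, 0.3)             # allele copies on ancestry-1 tracts
y = 1.0 + 0.2 * hap + 0.5 * d0 + 0.0 * d1
beta = ancestry_aware_fit(y, hap, d0, d1)
```

Separating the dosages by ancestry background is what allows an effect carried on one ancestry's tracts to be estimated on its own rather than averaged with the other.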

Availability and implementation: The workflow is modular, customizable, and compatible with commonly used phasing and local ancestry tools, minimizing manual intervention while preserving analytical flexibility. By lowering technical barriers to implementation, this framework facilitates broader adoption of local ancestry-aware GWAS, paving the way for expanded genetic discovery.

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
Competing Subclones and Fitness Diversity Shape Tumor Evolution Across Cancer Types.
IF 5.4 Pub Date : 2026-03-12 DOI: 10.1093/bioinformatics/btag127
Hai Chen, Jingmin Shu, Rekha Mudappathi, Elaine Li, Panwen Wang, Leif Bergsagel, Ping Yang, Zhifu Sun, Logan Zhao, Changxin Shi, Jeffrey P Townsend, Carlo Maley, Li Liu

Motivation: Intratumor heterogeneity arises from ongoing somatic evolution and complicates cancer diagnosis, prognosis, and treatment. Reconstructing evolutionary dynamics typically requires spatiotemporal samples, which are often unavailable in clinical settings. Computational approaches that can infer tumor evolutionary history from single-timepoint bulk sequencing data remain limited.

Results: We present TEATIME (estimating evolutionary events through single-timepoint sequencing), a novel computational framework that models tumors as mixtures of two competing cell populations: an ancestral clone with baseline fitness and a derived subclone with elevated fitness. Using cross-sectional bulk sequencing data, TEATIME estimates mutation rates, timing of subclone emergence, relative fitness, and number of generations of growth. To quantify intratumor fitness asymmetries, we introduce a novel metric, fitness diversity, which captures the imbalance between competing cell populations and serves as a measure of functional intratumor heterogeneity. Applying TEATIME to 33 tumor types from The Cancer Genome Atlas, we revealed divergent as well as convergent evolutionary patterns. Notably, we found that immune-hot microenvironments constrain subclonal expansion and limit fitness diversity. Moreover, we detected temporal dependencies in mutation acquisition, where early driver mutations in ancestral clones epistatically shape the fitness landscape, predisposing specific subclones to selective advantages. These findings underscore the importance of intratumor competition and tumor-microenvironment interactions in shaping evolutionary trajectories and driving intratumor heterogeneity. Lastly, we demonstrate that TEATIME-derived evolutionary parameters and fitness diversity offer novel prognostic insights across multiple cancer types.

Availability: R implementation of TEATIME is available on GitHub (https://github.com/liliulab/TEATIME) and Zenodo (https://zenodo.org/records/17422174).

Supplementary information: Supplementary data are available at Bioinformatics online.

Citations: 0
Scalable analysis of whole slide spatial proteomics with Harpy.
IF 5.4 Pub Date : 2026-03-12 DOI: 10.1093/bioinformatics/btag122
Benjamin Rombaut, Arne Defauw, Frank Vernaillen, Julien Mortier, Evelien Van Hamme, Sofie Van Gassen, Ruth Seurinck, Yvan Saeys

Motivation: Current spatial proteomics data analysis workflows are limited in efficiency and scalability when applied to gigapixel-sized datasets. Moreover, they often lack extensive quality control tools and exhibit limited interoperability with existing spatial omics analysis ecosystems.

Results: We introduce Harpy, a new Python workflow capable of accelerated processing of large spatial proteomics datasets. We demonstrate the utility of Harpy on four datasets and show that it can rapidly apply state-of-the-art segmentation and feature extraction via parallel processing. Each analysis step is accompanied by appropriate quality control steps. Scalable clustering of cells and pixels allows identification of cell types, processed up to 27 times faster than previously reported. Processing and visualization can be performed locally or on high-performance computing servers. Additionally, Harpy integrates well with existing spatial single-cell analysis tools in the Python and R software ecosystem.

Availability and implementation: Harpy is available on GitHub at https://github.com/saeyslab/harpy and archived on Zenodo at https://doi.org/10.5281/zenodo.15546703.

Supplementary information: Supplementary data are available online.

Citations: 0
UPDhmm: detecting Uniparental Disomy from NGS trio data.
IF 5.4 Pub Date : 2026-03-12 DOI: 10.1093/bioinformatics/btag062
Marta Sevilla-Porras, Carlos Ruiz-Arenas, Luis A Pérez-Jurado

Summary: Uniparental disomies (UPDs) are copy-neutral chromosomal alterations that occur when both copies of a chromosome pair (entire or segmental) come from one parent. UPDs, including isodisomies (identical parental chromosome) and heterodisomies (two different homologs from the same parent), reflect meiotic and/or mitotic aberrations of chromosomal segregation that can be associated with congenital or acquired disease. Despite their relevance, current methods to detect UPDs using sequence data (exomes or genomes) have limited sensitivity for small events, cannot precisely determine the UPD sub-type or coordinates, and perform poorly when including individuals or populations with consanguinity. We present UPDhmm, a novel tool that uses trio-based sequence data (proband and parents) and models inheritance patterns. UPDhmm predicts the most likely inheritance scenario, normal Mendelian inheritance vs UPD event, based on genotype combinations using a Hidden Markov Model (HMM). We validated the method using simulations on exome and genome data from the 1000 Genomes Project. UPDhmm outperformed currently available methods in detecting simulated UPD events in both data types. We applied UPDhmm to a collection of nearly 2400 families with a proband with autism spectrum disorder (Simons Simplex Collection Project) and identified UPD events in two affected individuals, one of them previously unreported. These two events, a paternal isodisomy of chr8 and a maternal heterodisomy of chr22, can be genetic causes of the disease, demonstrating the clinical utility of UPDhmm. Thus, UPDhmm can facilitate the incorporation of UPD detection into clinical pipelines of genomic analysis.
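The HMM decoding step mentioned in the summary can be sketched with a generic Viterbi algorithm. This is not the UPDhmm implementation: the two hidden states, the binary observation encoding (0 for a Mendelian-consistent trio genotype, 1 for an inconsistent one), and every probability below are illustrative assumptions. The summary indicates that the actual tool distinguishes UPD sub-types, which would require more states and genotype-combination emissions.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for a sequence of observations."""
    # best[t][s] holds the log-probability of the best path ending in state s at step t.
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            row[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            ptr[s] = prev
        best.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

states = ["normal", "upd"]
log_start = {"normal": math.log(0.99), "upd": math.log(0.01)}
log_trans = logs({"normal": {"normal": 0.99, "upd": 0.01},
                  "upd": {"normal": 0.01, "upd": 0.99}})
# Hypothetical emissions: inconsistent genotypes are rare under normal inheritance.
log_emit = logs({"normal": {0: 0.99, 1: 0.01},
                 "upd": {0: 0.60, 1: 0.40}})

# 12 consistent variants, a run of 6 inconsistent ones, then 12 consistent again.
observations = [0] * 12 + [1] * 6 + [0] * 12
path = viterbi(observations, states, log_start, log_trans, log_emit)
```

Because switching states is penalized, isolated genotyping errors stay in the normal state, while a sustained run of inconsistent genotypes is decoded as a segment with start and end coordinates.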

Availability and implementation: UPDhmm is implemented in R and is available in the Bioconductor package (version 1.5.0): https://www.bioconductor.org/packages/release/bioc/html/UPDhmm.html. The source code can be found at https://github.com/martasevilla/UPDhmm under the MIT license.

Supplementary information: Supplementary data, including additional figures and datasets, are available online at the journal's website.

Citations: 0
Beyond Blacklists: A Critical Assessment of Exclusion Set Generation Strategies and Alternative Approaches.
IF 5.4 Pub Date : 2026-03-12 DOI: 10.1093/bioinformatics/btag110
Brydon P G Wall, Jonathan D Ogata, My Nguyen, Amy L Olex, Konstantinos V Floros, Anthony C Faber, Joseph L McClay, J Chuck Harrell, Mikhail G Dozmorov

Motivation: Short-read sequencing data can be affected by alignment artifacts in certain genomic regions. Removing reads that overlap these exclusion regions, previously known as Blacklists, can improve biological signal. Alternatively, "sponge" or decoy sequences have been proposed to reduce alignment artifacts.

Results: We examined the widely used Blacklist software and found that pre-generated exclusion sets were difficult to reproduce due to sensitivity to input data, aligner choice, and read length. We further explored the use of "sponge" sequences (unassembled genomic regions such as satellite DNA, ribosomal DNA, and mitochondrial DNA) as an alternative approach. We additionally investigated the effect of the T2T-CHM13 genome assembly on improving biological signals. Aligning reads to a genome that includes sponge sequences reduced signal correlation in ChIP-seq data comparably to Blacklist-derived exclusion sets while preserving biological signal. Sponge-based alignment also had minimal impact on RNA-seq gene counts, suggesting broader applicability beyond chromatin profiling. These results highlight the limitations of fixed exclusion sets, and we recommend the T2T-CHM13 assembly or, for the hg38 genome assembly, "sponge" sequences as an alignment-guided strategy for reducing artifacts and improving functional genomics analyses.

Supplementary information: Supplementary data are available at Bioinformatics online.
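The exclusion-set filtering step mentioned in the abstract (dropping reads that overlap listed regions) can be sketched in a few lines of Python. This is an illustrative sketch only, not the Blacklist software or the authors' pipeline: the function names, the `(chrom, start, end)` tuple layout, and the half-open coordinate convention (as in BED files) are assumptions made for the example.

```python
from bisect import bisect_right
from collections import defaultdict

def build_index(regions):
    """Group half-open (chrom, start, end) regions by chromosome, sorted and merged."""
    by_chrom = defaultdict(list)
    for chrom, start, end in sorted(regions):
        merged = by_chrom[chrom]
        if merged and start <= merged[-1][1]:
            # Overlapping or touching region: extend the previous interval.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return by_chrom

def overlaps(index, chrom, start, end):
    """True if the half-open read [start, end) overlaps any indexed region."""
    intervals = index.get(chrom, [])
    # Find the last merged interval that starts before the read ends.
    i = bisect_right(intervals, (end - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][1] > start

def filter_reads(reads, regions):
    """Keep only the reads that do not overlap an exclusion region."""
    index = build_index(regions)
    return [read for read in reads if not overlaps(index, *read)]

exclusion = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 0, 50)]
reads = [("chr1", 50, 100), ("chr1", 90, 101), ("chr1", 299, 350),
         ("chr1", 300, 350), ("chr2", 10, 20), ("chr3", 10, 20)]
kept = filter_reads(reads, exclusion)
print(kept)
```

Merging the regions first means each chromosome holds disjoint, sorted intervals, so a single binary search per read is enough. In practice this step is usually done on alignment files with established interval tools; the sketch only shows the overlap logic.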

Citations: 0
Mixtum: a graphical tool for two-way admixture analysis in population genetics based on f-statistics.
IF 5.4 Pub Date : 2026-03-12 DOI: 10.1093/bioinformatics/btag123
José-María Castelo, José-Angel Oteo, Gonzalo Oteo-García

Summary: Mixtum is a Python-based tool that estimates ancestry contributions in a two-way admixture process from bi-allelic genotype data. Its results derive from the geometric interpretation of the f-statistics formalism. Designed with user-friendliness as a priority, Mixtum lets users interactively choose from a menu of user-supplied populations to build different mixture models, in conjunction with the set of auxiliary populations that the framework requires. The results are presented graphically and numerically. Importantly, Mixtum provides a novel index (an angle) that assesses the quality of the ancestral reconstruction of the model under scrutiny. The use and interpretation of Mixtum's results are explained and illustrated with case studies.

Availability and implementation: The open source code is available on GitHub at https://github.com/jmcastelo/mixtum and on Zenodo at https://doi.org/10.5281/zenodo.17789375. Mixtum is implemented in Python and runs on Linux, Windows and macOS.
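For readers unfamiliar with f-statistics, the following Python sketch shows how a two-way admixture proportion can be read off allele frequencies. It uses the classical f4-ratio estimator, which is a stand-in chosen for illustration and not Mixtum's geometric method or its angle index. The population labels (A, O, B, C, X), the simulated frequencies, and the function names are all assumptions made for the example.

```python
import random

def f4(a, b, c, d):
    """f4(A, B; C, D): mean over SNPs of (a - b) * (c - d) on allele frequencies."""
    return sum((w - x) * (y - z) for w, x, y, z in zip(a, b, c, d)) / len(a)

def f4_ratio(a, o, x, b, c):
    """Share of X's ancestry related to B, as f4(A, O; X, C) / f4(A, O; B, C)."""
    return f4(a, o, x, c) / f4(a, o, b, c)

# Simulated allele frequencies (assumed data, for illustration only).
rng = random.Random(0)
n_snps = 5000
b = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]  # source population B
c = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]  # source population C
o = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]  # outgroup O
# A is a close relative of B: B's frequencies plus a little drift, clipped to [0, 1].
a = [min(1.0, max(0.0, p + rng.gauss(0, 0.02))) for p in b]
# X is an exact 30% / 70% mixture of B and C.
x = [0.3 * p + 0.7 * q for p, q in zip(b, c)]

alpha = f4_ratio(a, o, x, b, c)
print(round(alpha, 6))
```

Because X is built as an exact linear mixture, the numerator equals 0.3 times the denominator term by term, so the ratio recovers the mixing proportion up to floating-point rounding. With real genotype data the estimate is noisy and depends on how well the auxiliary populations fit the assumed tree.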

Citations: 0
ET-Pfam: Ensemble transfer learning for protein family prediction.
IF 5.4 Pub Date : 2026-03-12 DOI: 10.1093/bioinformatics/btag121
Sofia A Duarte, Rosario Vitale, Sofia Escudero, Emilio Fenoy, Leandro Bugnon, Diego H Milone, Georgina Stegmayer

Motivation: Sequence generation has grown so rapidly that it has outpaced expert curators' ability to review and annotate sequences manually, so the computational annotation of proteins remains a significant challenge in bioinformatics. The Pfam database contains a large collection of proteins annotated with domain families through profile Hidden Markov models (pHMMs). One HMM is trained independently for each family, using the aligned sequences of that curated family; this misses the opportunity to learn patterns across families, that is, from a complete view of the whole dataset. As an alternative, some deep learning (DL) models have recently been proposed, but they use simple input representations and yield only moderate performance improvements.

Results: In this work we present ET-Pfam, a novel approach based on transfer learning and ensembles of multiple DL classifiers to predict functional families in the Pfam database. Several base DL models are first trained using learned representations from protein large language models. Then, the base models are integrated using classical ensemble strategies and novel voting approaches that learn weights for each model and for each Pfam family. Results demonstrate that the proposed ET-Pfam method consistently reduces error rates compared to individual DL models, boosting prediction performance. Among the novel ensemble strategies presented here, voting with weights learned per family achieved the best performance, with the lowest error rate (7.00%), significantly lower than the error of the best individual base model (12.91%) and of state-of-the-art competitors.

Availability: Data and source code are available at https://github.com/sinc-lab/ET-Pfam.

Supplementary information: Supplementary data are available at Bioinformatics online.
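The per-family weighted voting described in the Results can be pictured with a small Python sketch. It is not ET-Pfam's implementation: in the paper the weights are learned, whereas here they are fixed, made-up numbers, and the function name, the score matrices, and the three-family setup are all hypothetical.

```python
def weighted_vote(probs, weights):
    """Combine base-model scores with one weight per (model, family) pair.

    probs[m][f] is model m's score for family f; weights[m][f] is the weight
    given to model m when it votes for family f.
    Returns (index of the winning family, combined scores).
    """
    n_families = len(probs[0])
    combined = [sum(w[f] * p[f] for p, w in zip(probs, weights))
                for f in range(n_families)]
    winner = max(range(n_families), key=combined.__getitem__)
    return winner, combined

# Scores from three hypothetical base models over three families.
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.7, 0.1],
         [0.5, 0.4, 0.1]]

# Plain averaging: every model counts equally for every family.
uniform = [[1.0, 1.0, 1.0] for _ in probs]

# Made-up per-family weights: models 0 and 2 are distrusted on family 1,
# and model 1 is partly distrusted on family 0.
per_family = [[1.0, 0.2, 1.0],
              [0.5, 1.0, 1.0],
              [1.0, 0.2, 1.0]]

uniform_winner, _ = weighted_vote(probs, uniform)
family_winner, _ = weighted_vote(probs, per_family)
print(uniform_winner, family_winner)
```

With uniform weights the ensemble picks family 1, while the per-family weights shift the decision to family 0, which shows how family-specific trust in each base model can change the outcome. How such weights are fitted is outside the scope of this sketch.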

Citations: 0