Pub Date: 2026-03-20 DOI: 10.1093/bioinformatics/btag133
Yaxiong Ma, Zengfa Dou, Yuhong Zha, Xiaoke Ma
Motivation: Spatial transcriptomics (ST) technologies measure gene expression together with spatial locations, but each spot typically contains a mixture of cell types, posing a challenge for downstream analysis. Cell-type deconvolution aims to infer spot-wise cell-type proportions by integrating single-cell RNA-seq (scRNA-seq) and ST data. Many existing methods construct cell-type signatures from predefined marker genes, which can limit performance when marker information is incomplete or unavailable.
Results: To address this limitation, we propose a spatial-aware auto-encoder framework (SA2E) for cell-type deconvolution without requiring predefined cell-type biomarkers. SA2E learns latent spot representations using a spatially regularized auto-encoder that preserves the local topology of the spot spatial graph. Based on these representations, SA2E learns cell-type signatures by enforcing them to reconstruct ST expression. In our framework, simulated ST data with known proportions are used for supervised pretraining, while real ST data are optimized using the reconstruction objective. Extensive experiments on simulated and real ST datasets demonstrate that SA2E outperforms state-of-the-art deconvolution baselines.
Availability and implementation: The SA2E code is available on GitHub (https://github.com/xkmaxidian/SA2E) and Zenodo (DOI: 10.5281/zenodo.18765467).
Title: SA2E: spatial-aware auto-encoder for cell type deconvolution of spatial transcriptomics data. Journal: Bioinformatics (Oxford, England).
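The SA2E abstract describes simulated ST spots with known proportions and a reconstruction objective in which cell-type signatures must reproduce spot expression. The following is a minimal sketch of that mixture idea only, not the SA2E implementation: the toy signature matrix, the function names, and the mean-squared-error objective are all assumptions made for illustration.

```python
import random

def simulate_spot(signatures, rng, noise=0.0):
    """Mix cell-type signatures with random proportions to build one pseudo-spot."""
    weights = [rng.random() for _ in signatures]
    total = sum(weights)
    proportions = [w / total for w in weights]  # non-negative, sums to 1
    n_genes = len(signatures[0])
    expression = [
        sum(p * sig[g] for p, sig in zip(proportions, signatures))
        + rng.gauss(0.0, noise)
        for g in range(n_genes)
    ]
    return proportions, expression

def reconstruction_error(proportions, signatures, expression):
    """Mean squared error between observed expression and proportions x signatures."""
    n_genes = len(expression)
    recon = [
        sum(p * sig[g] for p, sig in zip(proportions, signatures))
        for g in range(n_genes)
    ]
    return sum((r - e) ** 2 for r, e in zip(recon, expression)) / n_genes

# Hypothetical toy signatures: 3 cell types x 4 genes (values are made up).
SIGNATURES = [
    [10.0, 0.0, 2.0, 1.0],
    [0.0, 8.0, 1.0, 3.0],
    [1.0, 1.0, 9.0, 0.0],
]

rng = random.Random(0)
props, expr = simulate_spot(SIGNATURES, rng)
true_error = reconstruction_error(props, SIGNATURES, expr)
wrong_error = reconstruction_error([1.0, 0.0, 0.0], SIGNATURES, expr)
```

With noise set to zero, the true proportions reconstruct the spot exactly, while an incorrect guess (here, a spot made of a single cell type) leaves a positive error. A deconvolution method would search for the proportions that minimise this kind of error.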
Pub Date: 2026-03-16 DOI: 10.1093/bioinformatics/btag128
Weronika Puchała, Krystyna Grzesiak, Dominik Rafacz, Michał Kistowski, Jochem H Smit, Julien Marcoux, Michał Dadlez, Michał Burdukiewicz
Summary: Hydrogen-Deuterium Exchange Mass Spectrometry (HDX-MS) monitors deuterium uptake at the peptide level in a time-dependent manner. It produces complex, multi-dimensional data that must be interpreted, at a minimum, at both the temporal and sequence levels. Specialized tools are therefore essential to preprocess, integrate, and analyze HDX-MS data and translate it into meaningful biological insights. HaDeX2 provides statistical inferences and their visualizations across five dimensions of HDX-MS data: protein sequence, time, biological states, peptide charge and experimental replicates.
Availability and implementation: HaDeX2 is freely available as an R package (https://github.com/hadexversum/HaDeX2; https://doi.org/10.5281/zenodo.18543703) and web server (https://hadex2.mslab-ibb.pl/). To run the GUI locally, users should install a dedicated companion package (https://github.com/hadexversum/HaDeXGUI).
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: HaDeX2: multi-dimensional analysis of Hydrogen-Deuterium Exchange Mass Spectrometry data.
Pub Date: 2026-03-16 DOI: 10.1093/bioinformatics/btag104
Nidia Barco-Armengol, Dèlia Yubero, Clara Xiol, Núria Catasús, Laura Martí-Sánchez, Judith Armstrong, Francesc Palau, Guerau Fernandez
Motivation: Chromosomal abnormalities, referred to as aneuploidies, occur in approximately 0.3% of live births. While the majority of aneuploidies in humans are incompatible with life, well-characterized exceptions include Down syndrome (47,+21), Patau syndrome (47,+13), Edwards syndrome (47,+18), Turner syndrome (45, X0), Klinefelter syndrome (47, XXY), and triple X syndrome (47, XXX). These chromosomal alterations disrupt gene expression and cellular function, leading to genetic and developmental disorders. With the increasing adoption of next-generation sequencing (NGS) in clinical diagnostics, this study aims to explore the potential use of NGS for aneuploidy detection.
Results: Using data derived from clinical exome sequencing (CES) and whole exome sequencing (WES), we were able to detect autosomal as well as sex chromosome aneuploidies with high specificity. Moreover, we were also able to identify mosaic aneuploidies, demonstrating the high sensitivity of this methodological approach. Thus, we present NGS as a cost-effective first-line approach to detect chromosomal aneuploidies in routine diagnostic practice.
Availability: Scripts are available at https://github.com/B-R-I-D-G-E/AneuploidiesStudies.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Identification of autosomal and sex chromosome aneuploidies using next generation sequencing.
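The aneuploidy abstract does not state how chromosome gains or losses are called from exome data. One common idea is to compare each chromosome's share of mapped reads against euploid controls; the sketch below illustrates that idea under stated assumptions and is not the authors' scripts. The read counts, the function names, and the 0.3 tolerance are all hypothetical.

```python
from statistics import median

def chromosome_fractions(counts):
    """Convert per-chromosome read counts into fractions of the total."""
    total = sum(counts.values())
    return {chrom: n / total for chrom, n in counts.items()}

def estimate_copy_number(sample_counts, control_counts, expected=2.0):
    """Scale the expected autosomal copy number by the sample/control fraction ratio."""
    sample = chromosome_fractions(sample_counts)
    controls = [chromosome_fractions(c) for c in control_counts]
    return {
        chrom: expected * frac / median(c[chrom] for c in controls)
        for chrom, frac in sample.items()
    }

# Hypothetical read counts for three autosomes (made-up numbers).
CONTROLS = [
    {"chr1": 1000, "chr13": 400, "chr21": 200},
    {"chr1": 1100, "chr13": 440, "chr21": 220},
    {"chr1": 900, "chr13": 360, "chr21": 180},
]
SAMPLE = {"chr1": 1000, "chr13": 400, "chr21": 300}  # 1.5x the reads on chr21

estimates = estimate_copy_number(SAMPLE, CONTROLS)
# Assumed tolerance: flag chromosomes more than 0.3 copies away from 2.
flagged = {c: cn for c, cn in estimates.items() if abs(cn - 2.0) > 0.3}
```

In this toy example only chr21 is flagged, with an estimate near 2.8 rather than 3, because the extra reads also inflate the total used for normalisation. A mosaic aneuploidy would be expected to give an intermediate value, and sex chromosomes would need a sex-specific expected copy number.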
Pub Date: 2026-03-16 DOI: 10.1093/bioinformatics/btag124
Nirav N Shah, Taotao Tan, Jessica Honorato-Mauer, Yi-Sian Lin, Adam X Maihofer, Clement C Zai, Marcos Santoro, Caroline M Nievergelt, Elizabeth G Atkinson
Motivation: The routine exclusion of admixed individuals from traditional Genome-Wide Association Studies (GWAS) due to concerns about spurious associations has limited multi-ancestry genetic discovery. Tractor addresses this issue by incorporating local ancestry into association testing, enabling the identification of ancestry-enriched signals and generating ancestry-specific summary statistics. However, adoption has been constrained by the complexity of prerequisite steps, including phasing and local ancestry inference, which require substantial bioinformatics expertise and introduce key analytical decision points.
Results: We developed a scalable, automated Nextflow workflow that integrates phasing, local ancestry inference, and Tractor association testing into a reproducible end-to-end pipeline. To demonstrate its utility, we applied the workflow to 32 blood biomarkers in 6,245 two-way African-European admixed individuals from the UK Biobank. This pipeline performed efficiently at scale, replicating known associations and uncovering key ancestry-specific loci. These associations were largely driven by variants present on African ancestral tracts but absent from European tracts, underscoring the value of local ancestry-aware methods in uncovering previously masked genetic signals.
Availability and implementation: The workflow is modular, customizable, and compatible with commonly used phasing and local ancestry tools, minimizing manual intervention while preserving analytical flexibility. By lowering technical barriers to implementation, this framework facilitates broader adoption of local ancestry-aware GWAS, paving the way for expanded genetic discovery.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Tractor Workflow: A Scalable Nextflow Framework for Local Ancestry-Aware Genome-Wide Association Studies.
Pub Date: 2026-03-12 DOI: 10.1093/bioinformatics/btag127
Hai Chen, Jingmin Shu, Rekha Mudappathi, Elaine Li, Panwen Wang, Leif Bergsagel, Ping Yang, Zhifu Sun, Logan Zhao, Changxin Shi, Jeffrey P Townsend, Carlo Maley, Li Liu
Motivation: Intratumor heterogeneity arises from ongoing somatic evolution and complicates cancer diagnosis, prognosis, and treatment. Reconstructing evolutionary dynamics typically requires spatiotemporal samples, which are often unavailable in clinical settings. Computational approaches that can infer tumor evolutionary history from single-timepoint bulk sequencing data remain limited.
Results: We present TEATIME (estimating evolutionary events through single-timepoint sequencing), a novel computational framework that models tumors as mixtures of two competing cell populations: an ancestral clone with baseline fitness and a derived subclone with elevated fitness. Using cross-sectional bulk sequencing data, TEATIME estimates mutation rates, timing of subclone emergence, relative fitness, and number of generations of growth. To quantify intratumor fitness asymmetries, we introduce a novel metric, fitness diversity, which captures the imbalance between competing cell populations and serves as a measure of functional intratumor heterogeneity. Applying TEATIME to 33 tumor types from The Cancer Genome Atlas, we revealed divergent as well as convergent evolutionary patterns. Notably, we found that immune-hot microenvironments constrain subclonal expansion and limit fitness diversity. Moreover, we detected temporal dependencies in mutation acquisition, where early driver mutations in ancestral clones epistatically shape the fitness landscape, predisposing specific subclones to selective advantages. These findings underscore the importance of intratumor competition and tumor-microenvironment interactions in shaping evolutionary trajectories and driving intratumor heterogeneity. Lastly, we demonstrate that TEATIME-derived evolutionary parameters and fitness diversity offer novel prognostic insights across multiple cancer types.
Availability: R implementation of TEATIME is available on GitHub (https://github.com/liliulab/TEATIME) and Zenodo (https://zenodo.org/records/17422174).
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Competing Subclones and Fitness Diversity Shape Tumor Evolution Across Cancer Types.
Pub Date: 2026-03-12 DOI: 10.1093/bioinformatics/btag122
Benjamin Rombaut, Arne Defauw, Frank Vernaillen, Julien Mortier, Evelien Van Hamme, Sofie Van Gassen, Ruth Seurinck, Yvan Saeys
Motivation: Current spatial proteomics data analysis workflows are limited in efficiency and scalability when applied to gigapixel-sized datasets. Moreover, they often lack extensive quality control tools and exhibit limited interoperability with existing spatial omics analysis ecosystems.
Results: We introduce Harpy, a new Python workflow capable of accelerated processing of large spatial proteomics datasets. We demonstrate the utility of Harpy on four datasets and show that it can rapidly apply state-of-the-art segmentation and feature extraction via parallel processing. Each analysis step is accompanied by appropriate quality control steps. Scalable clustering of cells and pixels allows identification of cell types, processed up to 27 times faster than previously reported. Processing and visualization can be performed locally or on high-performance computing servers. Additionally, Harpy integrates well with existing spatial single-cell analysis tools in the Python and R software ecosystem.
Availability and implementation: Harpy is available on GitHub at https://github.com/saeyslab/harpy and archived on Zenodo at https://doi.org/10.5281/zenodo.15546703.
Supplementary information: Supplementary data are available online.
Title: Scalable analysis of whole slide spatial proteomics with Harpy.
Pub Date: 2026-03-12 DOI: 10.1093/bioinformatics/btag062
Marta Sevilla-Porras, Carlos Ruiz-Arenas, Luis A Pérez-Jurado
Summary: Uniparental disomies (UPDs) are copy-neutral chromosomal alterations that occur when both copies of a chromosome pair (entire or segmental) come from one parent. UPDs, including isodisomies (identical parental chromosome) and heterodisomies (two different homologs from the same parent), reflect meiotic and/or mitotic aberrations of chromosomal segregation that can be associated with congenital or acquired disease. Despite their relevance, current methods to detect UPDs using sequence data (exomes or genomes) have limited sensitivity for small events, cannot precisely determine the UPD sub-type or coordinates, and perform poorly when including individuals or populations with consanguinity. We present UPDhmm, a novel tool that uses trio-based sequence data (proband and parents) and models inheritance patterns. UPDhmm predicts the most likely inheritance scenario, normal Mendelian inheritance vs UPD event, based on genotype combinations using a Hidden Markov Model (HMM). We validated the method using simulations on exome and genome data from the 1000 Genomes Project. UPDhmm outperformed currently available methods in detecting simulated UPD events in both data types. We applied UPDhmm to a collection of nearly 2400 families with a proband with autism spectrum disorder (Simons Simplex Collection Project) and identified UPD events in two affected individuals, one of them previously unreported. These two events, a paternal isodisomy of chr8 and a maternal heterodisomy of chr22, can be genetic causes of the disease, demonstrating the clinical utility of UPDhmm. Thus, UPDhmm can facilitate the incorporation of UPD detection into clinical pipelines of genomic analysis.
Availability and implementation: UPDhmm is implemented in R and is available in the Bioconductor package (version 1.5.0): https://www.bioconductor.org/packages/release/bioc/html/UPDhmm.html. The source code can be found at https://github.com/martasevilla/UPDhmm under the MIT license.
Supplementary information: Supplementary data, including additional figures and datasets, are available online at the journal's website.
Title: UPDhmm: detecting Uniparental Disomy from NGS trio data.
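The UPDhmm abstract describes choosing between normal Mendelian inheritance and a UPD event from trio genotype combinations using an HMM. The sketch below shows how such a decoder could look in principle; it is not the UPDhmm model. The three states (heterodisomy is left out), the emission probabilities, the 1% genotype error rate, and the 0.99 stay probability are all assumptions chosen for illustration.

```python
import math

STATES = ("normal", "pat_isodisomy", "mat_isodisomy")

def emission(state, father, mother, child, error=0.01):
    """P(child genotype | parents, state); genotypes are alt-allele counts 0, 1, 2."""
    pf, pm = father / 2.0, mother / 2.0  # chance each parent transmits the alt allele
    if state == "normal":
        dist = [(1 - pf) * (1 - pm), pf * (1 - pm) + (1 - pf) * pm, pf * pm]
    else:
        # Isodisomy: two copies of one allele from a single parent, never heterozygous.
        p = pf if state == "pat_isodisomy" else pm
        dist = [1 - p, 0.0, p]
    return (1 - error) * dist[child] + error / 3.0  # assumed genotyping error

def viterbi(trios, stay=0.99):
    """Most likely state path for a list of (father, mother, child) genotypes."""
    switch = (1 - stay) / (len(STATES) - 1)
    log_trans = {
        (a, b): math.log(stay if a == b else switch)
        for a in STATES for b in STATES
    }
    scores = {
        s: math.log(1 / len(STATES)) + math.log(emission(s, *trios[0]))
        for s in STATES
    }
    back = []
    for obs in trios[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: scores[r] + log_trans[(r, s)])
            pointers[s] = prev
            new_scores[s] = (
                scores[prev] + log_trans[(prev, s)] + math.log(emission(s, *obs))
            )
        scores = new_scores
        back.append(pointers)
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):  # trace back from the last site
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical variants along one chromosome: (father, mother, child).
TRIOS = [(0, 2, 1)] * 5 + [(2, 0, 2)] * 5
path = viterbi(TRIOS)
```

The first five sites are only consistent with biparental inheritance, while the last five show the child homozygous for an allele that only the father carries, so the decoded path switches from `normal` to `pat_isodisomy` at the sixth site. The high stay probability is what keeps a single genotyping error from being called as an event.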
Pub Date: 2026-03-12 DOI: 10.1093/bioinformatics/btag110
Brydon P G Wall, Jonathan D Ogata, My Nguyen, Amy L Olex, Konstantinos V Floros, Anthony C Faber, Joseph L McClay, J Chuck Harrell, Mikhail G Dozmorov
Motivation: Short-read sequencing data can be affected by alignment artifacts in certain genomic regions. Removing reads that overlap these exclusion regions, previously known as Blacklists, can help improve biological signal. Alternatively, "sponge" or decoy sequences have been proposed to reduce alignment artifacts.
Results: We examined the widely used Blacklist software and found that pre-generated exclusion sets were difficult to reproduce due to sensitivity to input data, aligner choice, and read length. We further explored the use of "sponge" sequences (unassembled genomic regions such as satellite DNA, ribosomal DNA, and mitochondrial DNA) as an alternative approach. We additionally investigated the effect of the T2T-CHM13 genome assembly on improving biological signals. Aligning reads to a genome that includes sponge sequences reduced signal correlation in ChIP-seq data comparably to Blacklist-derived exclusion sets while preserving biological signal. Sponge-based alignment also had minimal impact on RNA-seq gene counts, suggesting broader applicability beyond chromatin profiling. These results highlight the limitations of fixed exclusion sets, and we recommend the use of the T2T-CHM13 assembly or, for the hg38 genome assembly, "sponge" sequences as an alignment-guided strategy for reducing artifacts and improving functional genomics analyses.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Beyond Blacklists: A Critical Assessment of Exclusion Set Generation Strategies and Alternative Approaches.
Summary: Mixtum is a Python-based tool that estimates ancestry contributions in a two-way admixture process from bi-allelic genotype data. Its results derive from the geometric interpretation of the f-statistics formalism. Designed with user-friendliness as a priority, Mixtum lets users interactively select from a menu of user-supplied populations to build different mixture models, together with the set of auxiliary populations required by the framework. The results are presented graphically and numerically. Importantly, Mixtum provides a novel index (an angle) that assesses the quality of the ancestral reconstruction of the model under scrutiny. The use and interpretation of Mixtum's outputs are explained and illustrated with case studies.
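To make the idea of estimating a two-way admixture proportion from f-statistics concrete, here is a minimal sketch of the standard f4-ratio estimator on simulated allele frequencies. This is not Mixtum's geometric method or its code; the function names, the population roles (A, O, B, C, X), and the simulated data are assumptions chosen for illustration.

```python
import numpy as np

def f4(p_a, p_b, p_c, p_d):
    """f4(A, B; C, D): mean over SNPs of (a - b) * (c - d) for allele frequencies."""
    return float(np.mean((p_a - p_b) * (p_c - p_d)))

def f4_ratio(p_a, p_o, p_x, p_b, p_c):
    """Estimate the fraction of X's ancestry derived from B (the rest from C).

    Uses alpha = f4(A, O; X, C) / f4(A, O; B, C), where A is a population
    related to B and O is an outgroup (the auxiliary populations).
    """
    return f4(p_a, p_o, p_x, p_c) / f4(p_a, p_o, p_b, p_c)

# Simulated allele frequencies for 5000 bi-allelic SNPs (hypothetical data).
rng = np.random.default_rng(0)
n_snps = 5000
p_o = rng.uniform(0.05, 0.95, n_snps)                        # outgroup O
p_b = rng.uniform(0.05, 0.95, n_snps)                        # source B
p_c = rng.uniform(0.05, 0.95, n_snps)                        # source C
p_a = np.clip(p_b + rng.normal(0.0, 0.02, n_snps), 0.0, 1.0) # A, close to B

true_alpha = 0.3
p_x = true_alpha * p_b + (1.0 - true_alpha) * p_c            # admixed population X

alpha_hat = f4_ratio(p_a, p_o, p_x, p_b, p_c)
```

In this idealized setting X is an exact linear mixture, so the ratio recovers the mixing proportion up to floating-point error. With real genotype data, sampling noise and drift make the estimate approximate, which is where a quality index for the fitted model becomes useful.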
Availability and implementation: The open source code is available on GitHub at https://github.com/jmcastelo/mixtum and on Zenodo at https://doi.org/10.5281/zenodo.17789375. Mixtum is implemented in Python and runs on Linux, Windows and macOS.
{"title":"Mixtum: a graphical tool for two-way admixture analysis in population genetics based on f-statistics.","authors":"José-María Castelo, José-Angel Oteo, Gonzalo Oteo-García","doi":"10.1093/bioinformatics/btag123","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag123","url":null,"PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147461406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-03-12DOI: 10.1093/bioinformatics/btag121
Sofia A Duarte, Rosario Vitale, Sofia Escudero, Emilio Fenoy, Leandro Bugnon, Diego H Milone, Georgina Stegmayer
Motivation: Sequence generation has grown so rapidly that it has surpassed expert curators' ability to manually review and annotate sequences, so the computational annotation of proteins remains a significant challenge in bioinformatics. The Pfam database contains a large collection of proteins annotated with domain families through profile Hidden Markov models (pHMMs). One HMM is trained independently for each family from the aligned sequences of that curated family, which misses the opportunity to learn patterns across families, that is, from a complete view of the whole dataset. As an alternative, some deep learning (DL) models have recently been proposed, but they use simple input representations and yield only moderate performance improvements.
Results: In this work we present ET-Pfam, a novel approach based on transfer learning and ensembles of multiple DL classifiers to predict functional families in the Pfam database. Several base DL models are first trained using learned representations from protein large language models. The base models are then integrated using classical ensemble strategies and novel voting approaches that learn weights for each model and for each Pfam family. Results demonstrate that the proposed ET-Pfam method consistently reduces error rates compared to individual DL models, boosting prediction performance. Among the novel ensemble strategies presented here, voting with weights learned per family achieved the best performance, with the lowest error rate (7.00%), significantly improving on the best individual base model error (12.91%) and on state-of-the-art competitors.
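The idea of voting with one weight per model and per family can be sketched as follows. This is an illustrative toy, not the ET-Pfam implementation: the function name, array shapes, and the hand-picked weights (which ET-Pfam learns from data) are assumptions.

```python
import numpy as np

def weighted_vote(probs, weights):
    """Combine base-model outputs with one weight per (model, family).

    probs:   array of shape (n_models, n_samples, n_families) with class scores
    weights: array of shape (n_models, n_families)
    Returns the index of the winning family for each sample.
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Broadcast each model's per-family weights over the sample axis, then sum over models.
    scores = (probs * weights[:, None, :]).sum(axis=0)
    return scores.argmax(axis=1)

# Two hypothetical base models, two samples, two families.
probs = np.array([
    [[0.7, 0.3], [0.9, 0.1]],  # model 0
    [[0.4, 0.6], [0.8, 0.2]],  # model 1
])

uniform = np.ones((2, 2))                      # plain (unweighted) soft voting
per_family = np.array([[1.0, 1.0],             # model 0: neutral on both families
                       [1.0, 3.0]])            # model 1: trusted more on family 1

pred_uniform = weighted_vote(probs, uniform)
pred_weighted = weighted_vote(probs, per_family)
```

With uniform weights both samples are assigned to family 0. Up-weighting model 1 on family 1 flips the first sample to family 1, which shows how per-family weights let the ensemble defer to whichever base model is more reliable for a given family.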
Availability: Data and source code are available at https://github.com/sinc-lab/ET-Pfam.
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"ET-Pfam: Ensemble transfer learning for protein family prediction.","authors":"Sofia A Duarte, Rosario Vitale, Sofia Escudero, Emilio Fenoy, Leandro Bugnon, Diego H Milone, Georgina Stegmayer","doi":"10.1093/bioinformatics/btag121","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag121","url":null,"PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147461371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}