Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag097
Eli Piliper, Stephanie Goya, Alexander L Greninger
Motivation: Unique molecular identifiers (UMIs) are widely used in next-generation sequencing to enable accurate molecular counting and error correction. However, challenges remain in accurately collapsing UMI clusters, especially when read counts are low or sparse read clusters arise from barcode sequencing errors.
Results: We present RUMINA, a Rust-based pipeline for UMI-aware deduplication and error correction, optimized for both amplicon and shotgun sequencing. RUMINA supports multiple UMI clustering strategies, majority-rule read selection independent of mapping quality, separate handling of clusters containing only 1-2 reads, paired-end merging, and read-length stratification. Benchmarking using simulated HIV population sequencing data and real-world iCLIP and TCR datasets showed that RUMINA improves ultra-low frequency SNV detection (0.01%-1%), reduces false positives, enhances reproducibility, and processes sequencing data up to 10-fold faster than existing tools. By integrating UMI- and sequence-level correction in a high-performance framework, RUMINA offers a fast, scalable, and robust solution for UMI-enabled sequencing workflows.
Availability and implementation: RUMINA is implemented in Rust and distributed as open-source code and precompiled binaries. Source code and installation instructions are available at https://github.com/greninger-lab/rumina. Documentation associated with this manuscript is available at https://github.com/greninger-lab/rumina_paper.
Title: RUMINA: high-throughput deduplication of unique molecular identifiers for amplicon and whole-genome sequencing with enhanced error correction.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12975283/pdf/
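The UMI collapsing and majority-rule read selection described in the RUMINA abstract can be illustrated with a small sketch. This is not RUMINA's implementation (which is written in Rust); the function names, the Hamming-distance threshold, and the abundance rule for merging a rare UMI into a more abundant neighbour are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(reads, max_dist=1):
    """Merge near-identical UMIs and keep one majority-rule sequence per cluster.

    `reads` is an iterable of (umi, sequence) pairs for one mapping position.
    Hypothetical rule: a UMI with n reads joins an existing seed UMI if the two
    differ by at most `max_dist` bases and the seed has at least 2n - 1 reads.
    """
    by_umi = defaultdict(list)
    for umi, seq in reads:
        by_umi[umi].append(seq)

    # Visit UMIs from most to least abundant so likely errors fold into their parent.
    ordered = sorted(by_umi, key=lambda u: (-len(by_umi[u]), u))
    clusters = {}      # seed UMI -> pooled sequences
    seed_counts = {}   # seed UMI -> read count of the seed itself
    for umi in ordered:
        n = len(by_umi[umi])
        target = next(
            (s for s in clusters
             if len(s) == len(umi)
             and hamming(s, umi) <= max_dist
             and seed_counts[s] >= 2 * n - 1),
            None,
        )
        if target is None:
            clusters[umi] = list(by_umi[umi])
            seed_counts[umi] = n
        else:
            clusters[target].extend(by_umi[umi])

    # Majority rule: the most frequent sequence represents the cluster.
    return {seed: Counter(seqs).most_common(1)[0][0] for seed, seqs in clusters.items()}


example_reads = [("AAAA", "ACGT")] * 3 + [("AAAT", "ACGT"), ("AAAA", "ACGA")] + [("CCCC", "TTTT")] * 2
print(collapse_umis(example_reads))
```

In this toy input the single-read UMI `AAAT` is treated as a sequencing error of `AAAA`, and the one discordant read inside that cluster is outvoted by the majority sequence.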
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag090
Husam B R Alabed, Dorotea Frongia Mancini, Martina Pergola, Luigina Romani, Sabata Martino, Albert Koulman, Roberto Maria Pellegrino
Motivation: Understanding the functional roles of lipids is essential for interpreting metabolic phenotypes in health, disease, and dietary interventions. However, lipidomic analyses typically focus on individual lipid species, making it difficult to extract mechanistic and systems-level insights. We therefore asked how quantitative lipidomic data can be translated into biologically structured and function-oriented interpretations.
Results: Here, we present a major update to LipidOne (lipidone.eu) that introduces a novel analytical module, Functional Lipid Analysis (FLA). FLA computes 42 indices describing lipid functions related to membrane structure, energy storage, and signaling. Indices are derived from lipid classes and fatty acyl-, alkyl-, and alkenyl-chain composition, statistically compared across experimental groups, and explored using multivariate and visualization tools. Each index is semantically annotated and linked to predicted protein mediators, enabling pathway- and network-based interpretation. Application to published datasets confirmed previous conclusions while uncovering additional biologically coherent functional insights.
Availability and implementation: The new FLA module is freely available through the LipidOne.eu web platform. The LipidOne FLA core R script (v1.0.0) is archived on Zenodo (DOI: 10.5281/zenodo.18468230). The LipidOne web platform is available at https://lipidone.eu.
Title: Functional lipid analysis via index-based lipidomics profile: a new computational module in LipidOne.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12970596/pdf/
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag086
Felix Lenner, Anders Jemt, Lucia Peña Pérez, Ramprasad Neethiraj, Peter Pruisscher, Daniel Schmitz, Annick Renevey, Pádraic Corcoran, Daniel Nilsson, Jesper Eisfeldt, Anna Lindstrand, Valtteri Wirta, Adam Ameur, Lars Feuk
Motivation: Long-read sequencing (LRS) is increasingly used for human medical research and clinical diagnostics due to its capacity to generate complete genome information. However, there is a lack of robust and easy-to-use pipelines for comprehensive LRS data analysis.
Results: Here we present Nallo, a Nextflow pipeline for analysis of PacBio and Oxford Nanopore data, with additional support for rare disease research projects. The pipeline detects a wide range of genetic variants, performs genome assembly, and reports CpG methylation. It also enables annotation and ranking of variants based on their predicted functional consequences.
Availability and implementation: Nallo is available from GitHub: https://github.com/genomic-medicine-sweden/nallo.
Title: Nallo: a Nextflow pipeline for comprehensive human long-read genome analysis.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12988770/pdf/
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag040
Emine Beyza Çandır, Halil İbrahim Kuru, Magnus Rattray, A Ercüment Çiçek, Oznur Tastan
Motivation: Combinatorial drug therapy holds great promise for tackling complex diseases, but the vast number of possible drug combinations makes exhaustive experimental testing infeasible. Computational models have been developed to guide experimental screens by assigning synergy scores to drug pair-cell line combinations, taking as input structural and chemical information on the drugs and molecular features of the cell lines. The premise of these models is that they leverage this biological and chemical information to predict synergy measurements.
Results: In this study, we demonstrate that replacing drug and cell line representations with simple one-hot encodings results in comparable or even slightly improved performance across diverse published drug combination models. This unexpected finding suggests that current models use these representations primarily as identifiers and exploit covariation in the synergy labels. Our synthetic data experiments show that models can learn from the true features; however, when drugs and cell lines recur across drug-drug-cell triplets, this repeating structure impairs feature-based learning. While the current synergy prediction models can aid in prioritizing drug pairs within a panel of tested drugs and cell lines, our results highlight the need for better strategies to learn from intended features and to generalize to unseen drugs and cell lines.
Availability and implementation: The scripts to run the experiments are available at: https://github.com/tastanlab/ohe.
Title: One-hot news: drug synergy models shortcut molecular features.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13005728/pdf/
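The substitution described in the abstract above, replacing drug and cell line features with one-hot encodings, can be sketched as follows. This is a minimal illustration, not the authors' scripts; the helper names and the drug and cell line labels are hypothetical.

```python
def one_hot_lookup(identifiers):
    """Map each distinct identifier to a one-hot vector (a list of 0/1 values)."""
    vocab = sorted(set(identifiers))
    index = {name: i for i, name in enumerate(vocab)}
    return {
        name: [1 if i == index[name] else 0 for i in range(len(vocab))]
        for name in vocab
    }


def encode_triplet(drug_a, drug_b, cell_line, drug_codes, cell_codes):
    """Concatenate the one-hot codes of a drug-drug-cell triplet into one input vector."""
    return drug_codes[drug_a] + drug_codes[drug_b] + cell_codes[cell_line]


# Hypothetical screening panel: the encoding carries identity only, no chemistry or biology.
triplets = [("drugA", "drugB", "cell1"), ("drugA", "drugC", "cell2")]
drug_codes = one_hot_lookup([d for a, b, _ in triplets for d in (a, b)])
cell_codes = one_hot_lookup([c for _, _, c in triplets])
features = [encode_triplet(a, b, c, drug_codes, cell_codes) for a, b, c in triplets]
print(features)
```

Because such vectors act purely as identifiers, a model fed with them cannot say anything about a drug or cell line absent from the panel, which is the generalization limit the abstract points to.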
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag064
Yojana Gadiya, Javier Millán Acosta, Ammar Ammar, Alejandro Adriaque Lozano, Delano Wetstede, Dominik Martinát, Ana Claudia Sima, Hailiang Mei, Egon Willighagen, Tooba Abbassi-Daloii
Motivation: Integrating omics data analysis with publicly available databases is crucial for unravelling complex biological mechanisms. However, this integration process is often intricate and time-consuming due to the diversity and complexity of the data involved. Achieving consistent harmonization across data types is challenging when managing disparate formats and sources. To address these issues, we introduce pyBiodatafuse, a query-based Python tool designed to integrate biomedical databases. This tool establishes a modular framework that simplifies data wrangling, enabling the creation of context-specific knowledge graphs (KGs) while supporting graph-based analyses.
Results: We developed a pipeline for generating context-specific knowledge graphs dynamically, allowing users to create KGs on the fly from a set of gene or metabolite identifiers. pyBiodatafuse features a user-friendly interface that streamlines this process, making it accessible even to researchers without extensive computational expertise. Additionally, the tool offers plugins for widely used platforms such as Cytoscape, Neo4j, and GraphDB, enabling local hosting of resulting property and RDF graphs. This versatility ensures that generated KGs can be efficiently utilized within diverse research workflows. To demonstrate its potential, we used pyBiodatafuse to create a graph for post-COVID syndrome using differential gene expression data, showcasing its ability to build adaptable and context-specific knowledge representations. Thus, pyBiodatafuse sets the stage for streamlined data integration, empowering researchers to focus on discovery and analysis without being hindered by data management complexities.
Availability and implementation: pyBiodatafuse is open-source, with its source code and PyPI package available at https://github.com/BioDataFuse/pyBiodatafuse and https://pypi.org/project/pyBiodatafuse/. The user interface can be accessed at https://biodatafuse.org/. Additionally, a release has been made on Zenodo at https://doi.org/10.5281/zenodo.18468942.
Title: pyBiodatafuse: extending interoperability of data using modular queries across biomedical resources.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12949461/pdf/
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag061
Pengcheng Li, Kai Zhang, Xiaozhong Liu, Xuhong Zhang
Motivation: Citation is central to scholarly communication, enabling researchers to navigate rapidly expanding literature and identify relevant prior work. Yet the 'reasoning' behind why a particular paper is cited is often implicit or opaque. Although academic search engines and literature tools rank candidate papers for a query, the motivations underlying these rankings are rarely transparent, making it difficult for scholars to interpret and act on retrieved results, especially in biomedical research where domain knowledge is essential.
Results: We propose an encoder-decoder framework that leverages curated biomedical knowledge to generate 'explanations of citation motivation' in a structured bio-triplet format. We evaluate the approach against recent families of pre-trained language models for text generation, including BERT-style (and variants) and GPT-style (and variants) models. In cancer-focused experiments using PubMed Central, we annotate over 10 000 citation relations with bio-triplets grounded in curated knowledge from multiple biomedical databases. Trained on these annotations, our model outperforms strong sequence-generation baselines, improving precision, recall, and F1 for citation-motivation generation.
Availability and implementation: Code and data are available at Zenodo (archival DOI: 10.5281/zenodo.14893445) and GitHub: https://github.com/zhongxiangboy/Knowledge-based-Citation-Reasoning-for-Biomedical-Domain.
Title: Knowledge-based citation reasoning for biomedical domain.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12987763/pdf/
Motivation: Incomplete gene models negatively impact single-cell gene expression quantification. This is particularly true in non-model species, where gene 3' ends are often inaccurately annotated, while most scRNA-seq methods capture only the 3' transcript region. As a result, many genes are incorrectly quantified or not detected.
Results: GeneExt leverages scRNA-seq data to refine gene annotations. We exemplify GeneExt usage and its impact on the gene expression quantification of eight non-model organism single-cell atlases. By extending and homogenizing gene annotations, our tool will help improve biological interpretation and cross-species comparisons of cell type expression atlases.
Availability: GeneExt is available at https://github.com/sebepedroslab/GeneExt (DOI: https://doi.org/10.5281/zenodo.18712940) under a GNU General Public License, together with test data and usage instructions.
Title: GeneExt: a gene model extension tool for enhanced single-cell RNA-seq analysis.
Authors: Grygoriy Zolotarov, Xavier Grau-Bové, Arnau Sebé-Pedrós
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag094
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12970594/pdf/
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btaf264
Kristen Schneider, Simon Walker, Chris Gignoux, Ryan Layer
Motivation: Genome-wide association studies (GWAS) are widely used to investigate the role of genetics in disease traits, but the resulting file sizes from these studies are large, posing barriers to efficient storage, sharing, and querying. This issue is especially important for biobanks like the UK Biobank that publish GWAS for thousands of traits, increasing the volume of data that must be effectively managed. Current compression and query methods reduce file sizes and allow for quick genomic position-based queries but do not provide utility for quickly finding loci based on their summary statistics. For example, finding all SNVs in a particular p-value range would require decompressing and scanning the whole file. We propose a new tool, STABIX, which introduces summary-statistic-based queries and improves upon the standard bgzip compression and Tabix query tool in both compression ratio and decompression speed.
Results: When applied to 10 GWAS files from PanUKBB, STABIX produced smaller compressed data and indexes than Tabix for every file: the bgzip and tbi files were, on average, 1.2 times the size of the STABIX compressed files and indexes. Across the same 10 files, STABIX per-gene decompression was, on average, 7× faster than Tabix per-gene decompression, and it was faster for over 99% of nearly 20,000 genes.
Availability and implementation: Software freely available for download at GitHub: https://github.com/kristen-schneider/stabix/.
Title: STABIX: summary-statistic-based GWAS indexing and compression.
Journal: Bioinformatics (Oxford, England)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12980329/pdf/
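The idea of a summary-statistic-based query, as motivated in the STABIX abstract, can be sketched with a toy block index. The abstract does not describe STABIX's file format or index layout, so everything here is an assumption for illustration: records are compressed in fixed-size blocks, each block stores its minimum and maximum p-value, and a query decompresses only the blocks whose range can contain a hit.

```python
import json
import zlib


def build_blocks(records, block_size=2):
    """Compress records in fixed-size blocks and note each block's p-value range."""
    blocks = []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        pvals = [r["p"] for r in chunk]
        blocks.append({
            "min_p": min(pvals),
            "max_p": max(pvals),
            "payload": zlib.compress(json.dumps(chunk).encode("utf-8")),
        })
    return blocks


def query_p_range(blocks, lo, hi):
    """Return (hits, opened): records with lo <= p <= hi and the number of blocks decompressed."""
    hits, opened = [], 0
    for block in blocks:
        # Skip blocks whose p-value range cannot overlap the query range.
        if block["max_p"] < lo or block["min_p"] > hi:
            continue
        opened += 1
        chunk = json.loads(zlib.decompress(block["payload"]).decode("utf-8"))
        hits.extend(r for r in chunk if lo <= r["p"] <= hi)
    return hits, opened


# Hypothetical summary statistics; identifiers and values are made up.
gwas = [
    {"snv": "snv1", "p": 0.5}, {"snv": "snv2", "p": 0.2},
    {"snv": "snv3", "p": 1e-9}, {"snv": "snv4", "p": 3e-8},
    {"snv": "snv5", "p": 0.04}, {"snv": "snv6", "p": 0.9},
]
hits, opened = query_p_range(build_blocks(gwas), 0.0, 5e-8)
print([r["snv"] for r in hits], opened)
```

Without the per-block ranges, the same query would have to decompress and scan every block, which is the whole-file scan the abstract describes for position-only indexes.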
Pub Date: 2026-02-28; DOI: 10.1093/bioinformatics/btag085
Lucas Goiriz, Guillermo Rodrigo
Summary: We present PyEvoMotion, an open-source Python tool for inferring molecular clock models with time-dependent Gaussian noise from high-throughput genomic datasets. PyEvoMotion features a command-line interface and a modular architecture, allowing seamless integration into larger bioinformatic pipelines. The tool supports customizable filtering, temporal discretization definition, and mutation classification, making it adaptable to diverse research needs. While traditional phylogenetic methods may encounter computational challenges with large datasets, PyEvoMotion can process thousands to millions of sequences to compute statistical parameters associated with a stochastic differential equation model, thereby weighting the genetic variation within the population. Using viral genomic data, we demonstrate its capability to infer evolutionary rates and detect non-Brownian evolutionary motions with subdiffusive behavior. PyEvoMotion shows potential to provide overlooked insights into genome evolution in different contexts.
Availability and implementation: The open-source software is available on GitHub at https://github.com/luksgrin/PyEvoMotion and on SourceForge at https://sourceforge.net/projects/pyevomotion.
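The statistics described in the summary can be illustrated with a minimal sketch. This is not PyEvoMotion's API: the function names `linear_fit` and `clock_statistics` are hypothetical, and the power-law form for the variance, var(t) ≈ D·t^α, is an assumption borrowed from the usual definition of anomalous diffusion, where α < 1 indicates subdiffusive behavior and α = 1 corresponds to Brownian motion. The sketch groups per-sequence mutation counts by time bin, fits the mean linearly to estimate an evolutionary rate, and fits the variance on log-log axes to estimate α.

```python
import math
from collections import defaultdict
from statistics import fmean, pvariance


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def clock_statistics(samples):
    """Estimate (rate, diffusion_coefficient, alpha) from mutation counts.

    samples: iterable of (time_bin, mutation_count) pairs with time_bin > 0.
    Assumed model (hypothetical, for illustration only):
        mean(t) ~ rate * t + const
        var(t)  ~ diffusion_coefficient * t ** alpha
    """
    by_time = defaultdict(list)
    for t, n in samples:
        by_time[t].append(n)

    times = sorted(by_time)
    means = [fmean(by_time[t]) for t in times]
    variances = [pvariance(by_time[t]) for t in times]

    # The slope of the mean over time is the evolutionary rate.
    rate, _ = linear_fit(times, means)

    # Fit the power law on log-log axes; zero-variance bins are skipped.
    pts = [(math.log(t), math.log(v)) for t, v in zip(times, variances) if v > 0]
    alpha, log_d = linear_fit([p[0] for p in pts], [p[1] for p in pts])
    return rate, math.exp(log_d), alpha
```

Calling `clock_statistics` on a list of `(time_bin, mutation_count)` pairs returns the three estimates; an `alpha` clearly below 1 would be read as subdiffusive under the assumed model. At least two distinct time bins with non-zero variance are needed for the log-log fit.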
{"title":"PyEvoMotion: a Python tool for population-based time-course analysis of genome evolution.","authors":"Lucas Goiriz, Guillermo Rodrigo","doi":"10.1093/bioinformatics/btag085","DOIUrl":"10.1093/bioinformatics/btag085","url":null,"abstract":"<p><strong>Summary: </strong>We present PyEvoMotion, an open-source Python tool for inferring molecular clock models with time-dependent Gaussian noise from high-throughput genomic datasets. PyEvoMotion features a command-line interface and a modular architecture, allowing seamless integration into larger bioinformatic pipelines. The tool supports customizable filtering, temporal discretization definition, and mutation classification, making it adaptable to diverse research needs. While traditional phylogenetic methods may encounter computational challenges with large datasets, PyEvoMotion can process thousands to millions of sequences to compute statistical parameters associated with a stochastic differential equation model, thereby weighting the genetic variation within the population. Using viral genomic data, we demonstrate its capability to infer evolutionary rates and detect non-Brownian evolutionary motions with subdiffusive behavior. 
PyEvoMotion shows potential to provide overlooked insights into genome evolution in different contexts.</p><p><strong>Availability and implementation: </strong>The open source software is available on GitHub at https://github.com/luksgrin/PyEvoMotion and on SourceForge at https://sourceforge.net/projects/pyevomotion.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12960909/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147319227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motivation: The increasing prevalence of antibiotic-resistant bacteria has intensified the demand for novel antimicrobial agents. Antimicrobial peptides (AMPs) have emerged as promising alternatives, yet their identification and classification remain challenging due to the lack of multi-perspective information, insufficient feature representation learning, and reliance on a single data modality.
Results: We propose a dual diffusion model-based representation learning framework for classifying AMPs, which integrates both peptide sequence and structure information to address these limitations. Specifically, our approach uses a multi-view feature construction module that encodes peptide sequences and structures from distinct perspectives, deriving initial feature representations with enriched biological semantics. To enhance representation learning, the framework applies separate diffusion models to sequence and structure information to capture complex semantics from both modalities. In addition, both single-modal and dual-modal contrastive learning are used to further improve the learned representations. Comprehensive experiments demonstrate that our model outperforms existing methods for AMP classification, providing a feasible route to accelerating the discovery of novel antimicrobial agents.
Availability and implementation: The data and source code are available on GitHub at https://github.com/kww567upup/DDM.
{"title":"A dual diffusion model-based representation learning framework for antimicrobial peptides classification.","authors":"Wen Kong, Lingling Fu, Xingpeng Jiang, Weizhong Zhao","doi":"10.1093/bioinformatics/btag077","DOIUrl":"10.1093/bioinformatics/btag077","url":null,"abstract":"<p><strong>Motivation: </strong>The increasing prevalence of antibiotic-resistant bacteria has intensified the demand for novel antimicrobial agents. Antimicrobial peptides (AMPs) have emerged as promising alternatives, yet their identification or classification remains challenging due to the lack of multi-perspective information, insufficient feature representation learning, and monocular data modalities.</p><p><strong>Results: </strong>In this paper, we propose a dual diffusion model-based representation learning framework for classifying AMPs, which effectively integrates both peptide sequence and structure information to address existing issues for the task. Specifically, our approach utilizes a multi-view feature construction module, which encodes peptide sequences and structures from distinctive perspectives, deriving initial feature representations with enriched biological semantics. To enhance representation learning, the proposed framework leverages both diffusion models for sequence and structure information respectively to effectively capture complex semantics from dual modalities. In addition, both single-modal and dual-modal contrastive learning are used to further advance the representation learning. 
Results of comprehensive experiments demonstrate that our model outperforms existing methods for the task of AMPs classification, providing a feasible solution to accelerating the discovery of novel antimicrobial agents.</p><p><strong>Availability of implementation: </strong>The data and source codes are available in GitHub at https://github.com/kww567upup/DDM.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":5.4,"publicationDate":"2026-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12960902/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146204197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}