Pub Date: 2026-01-14; DOI: 10.1093/bioinformatics/btag023
Siera Martinez, Tushar Sharma, Luke Johnson, Allen Kim, Vania Ballesteros Prieto, Hovhannes Arestakesyan, Sunisha Harish, Jewel Dias, Joseph Goldfrank, Nathan Edwards, Anelia Horvath
Motivation: Accurately characterizing expressed genetic variation at the single-cell level is essential for understanding transcriptional heterogeneity, allelic regulation, and mutational dynamics within complex tissues. However, few tools enable comprehensive visualization and quantitative analysis of expressed variants across individual cells.
Results: scSNViz is an R package for the exploration, quantification, and visualization of expressed single-nucleotide variants (SNVs) from cell-barcoded single-cell RNA sequencing (scRNA-seq) data. The software supports estimation of variant allele fractions, clustering of SNV expression profiles, and 2D and 3D visualization of individual SNVs or user-defined SNV groups. Beyond visualization, scSNViz facilitates investigation of cell-, cluster-, or lineage-specific variant expression patterns, as well as allelic dynamics including imprinting, random allele inactivation, and transcriptional bursting. It interoperates with established single-cell frameworks (Seurat for clustering, Slingshot for trajectory inference, scType for cell-type annotation, and CopyKat for copy-number profiling), enabling integrative multi-omic analyses of expressed variation.
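As an illustration of the variant allele fraction mentioned above, the following is a minimal Python sketch, not scSNViz code (the package itself is written in R). It assumes per-cell read counts are already available as (barcode, SNV, reference reads, variant reads) tuples; the function name, the tuple layout, and the `min_reads` threshold are hypothetical choices made for this example.

```python
from collections import defaultdict


def variant_allele_fractions(read_counts, min_reads=1):
    """Compute VAF = n_var / (n_var + n_ref) per (cell barcode, SNV).

    read_counts: iterable of (barcode, snv_id, n_ref, n_var) tuples
    (a hypothetical input layout chosen for this sketch).
    Pairs covered by fewer than min_reads reads are skipped, because a
    fraction estimated from very few reads is unreliable.
    """
    vaf = defaultdict(dict)
    for barcode, snv_id, n_ref, n_var in read_counts:
        depth = n_ref + n_var
        if depth == 0 or depth < min_reads:
            continue
        vaf[barcode][snv_id] = n_var / depth
    return dict(vaf)


counts = [
    ("AAAC", "chr1:1000:A>G", 6, 2),
    ("AAAC", "chr2:500:C>T", 0, 4),
    ("TTTG", "chr1:1000:A>G", 1, 0),  # only one read, dropped below
]
result = variant_allele_fractions(counts, min_reads=2)
print(result)
```

The resulting nested mapping (cell to SNV to fraction) is the kind of table that could then be overlaid on a 2D or 3D embedding of the cells.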
Availability: scSNViz is implemented in R and freely available at https://github.com/HorvathLab/scSNViz (DOI: 10.5281/zenodo.17307516). The package includes comprehensive documentation and example workflows designed for users with limited bioinformatics experience.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: scSNViz: Visualization and analysis of Cell-Specific expressed SNVs.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2026-01-13; DOI: 10.1093/bioinformatics/btag021
Huda Ahmad, Hannah M Doherty, Sam Benedict, James Haycocks, Ge Zhou, Patrick Moynihan, Danesh Moradigaravand, Manuel Banzhaf
Motivation: Chemical genomics is a powerful high-throughput approach to systematically link phenotypes to genotypes. However, the vast datasets generated remain challenging to explore due to the lack of integrated, interactive tools for visualisation and analysis. Existing workflows often require multiple independent software tools, limiting data accessibility and collaboration. Therefore, we created a user-friendly platform that enables efficient exploration and sharing of chemical genomics data.
Results: We developed ChemGenXplore, a web-based Shiny application designed to streamline the visualisation and analysis of chemical genomic screens. It offers two primary functionalities: one for exploring pre-implemented datasets and another for analysing user-uploaded datasets. ChemGenXplore enables users to visualise phenotypic profiles, assess gene-gene and condition-condition correlations, perform GO and KEGG enrichment analysis, and generate customisable, interactive heatmaps. To further support collaborative research, ChemGenXplore also facilitates the comparative analysis of chemical genomic and other omics datasets. By consolidating these features into a single interactive and accessible tool, ChemGenXplore facilitates data sharing, enhances reproducibility, and promotes collaboration within the research community.
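The gene-gene correlation step described above can be sketched in a few lines of Python. This is an illustrative example only: it assumes each gene has one fitness score per condition and uses Pearson correlation, whereas the abstract does not say which correlation measure ChemGenXplore applies. All names and values below are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def gene_gene_correlations(scores):
    """Correlate every pair of genes across their shared conditions."""
    return {
        (g1, g2): pearson(scores[g1], scores[g2])
        for g1, g2 in combinations(sorted(scores), 2)
    }


# Hypothetical fitness scores: one value per condition for each gene.
fitness = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
corr = gene_gene_correlations(fitness)
for pair, r in corr.items():
    print(pair, round(r, 3))
```

Condition-condition correlations follow from the same function applied to the transposed table, with one profile per condition across genes.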
Availability: ChemGenXplore is freely accessible as a web application at https://chemgenxplore.kaust.edu.sa/. Source code and documentation, including instructions for local installation, are provided on GitHub (https://github.com/Hudaahmadd/ChemGenXplore). A Docker image is also available on DockerHub (https://hub.docker.com/r/hudaahmad/chemgenxplore) to ensure reproducibility and simplify installation.
Contact: example@example.org.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: ChemGenXplore: An Interactive Tool for Exploring and Analysing Chemical Genomic Data.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2026-01-13; DOI: 10.1093/bioinformatics/btag002
Hayden C Metsky, Katherine J Siddle, Christian B Matranga, Pardis C Sabeti
Title: Best practices when benchmarking CATCH for the design of genome enrichment probes.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2026-01-12; DOI: 10.1093/bioinformatics/btag006
Yitao Xu, Guanyun Wei, Jingying Zhou, Yuanhua Huang, Weichua Yu, Zhixiang Lin, Ran Liu, Xiaodan Fan
Motivation: Accurate in silico identification of B-cell epitope residues is crucial for antibody design and structure-guided vaccine development. Although recent protein language models and structure-aware methods can capture spatial information of tertiary structure when generating residue embeddings, most existing epitope predictors use these embeddings to perform classification for individual residues one by one, without enforcing spatial continuity for reported epitope residues. Such methods often result in biologically implausible predictions because B-cell epitope residues always cluster together on the antigen surface.
Results: We present RoBep, a region-oriented B-cell epitope predictor that explicitly models the spatial clustering of epitope residues. RoBep introduces a novel region constraint mechanism and combines the protein language model ESM-Cambrian with an equivariant graph neural network. Our method outperforms existing structure-based methods on the benchmark dataset, demonstrating improvements of 26%, 45%, 13%, and 43% in F1, MCC, AUPR, and AUROC0.1, respectively. In addition to residue-level predictions, RoBep can also provide antibody-antigen binding regions. Importantly, the predicted epitope residues are constrained to be spatially compact, enhancing biological plausibility and practical relevance for immunotherapeutic design.
Availability: A user-friendly website for using RoBep is provided at https://huggingface.co/spaces/NielTT/RoBep. All datasets, source code used in this work, and implementation instructions of the website are publicly available at https://github.com/YitaoXU/RoBep.
Title: RoBep: A Region-Oriented Deep Learning Model for B-Cell Epitope Prediction.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2026-01-12; DOI: 10.1093/bioinformatics/btag012
Kaylee D Rich, James D Wasmuth
Motivation: Molecular mimicry is used by pathogens to evade the host immune system and manipulate other host cellular processes. It is often mediated by short motifs in non-homologous proteins, whose detection challenges the sensitivity and specificity of existing bioinformatics tools.
Results: We present mimicDetector, a k-mer-based pipeline for identifying protein-level molecular mimicry between pathogens and their hosts. Applied to 17 globally important pathogens, mimicDetector identified a broad and biologically plausible set of mimicry candidates, including helminth proteins mimicking components of the human complement system and a Leishmania infantum mimic of Reticulon-4, a regulator of immune cell recruitment.
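To make the k-mer idea above concrete, here is a minimal Python sketch of finding short peptides shared between pathogen and host proteins. It is not the mimicDetector pipeline: it assumes exact k-mer matches only, and the function names, the value k = 5, and the toy sequences are all hypothetical. The abstract does not describe how the real pipeline scores or filters candidates.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmers(pathogen_proteins, host_proteins, k=5):
    """Map each (pathogen_id, host_id) pair to the k-mers they share exactly."""
    # Index host k-mers once so each pathogen k-mer is a dictionary lookup.
    host_index = {}
    for host_id, seq in host_proteins.items():
        for kmer in kmers(seq, k):
            host_index.setdefault(kmer, set()).add(host_id)

    hits = {}
    for pathogen_id, seq in pathogen_proteins.items():
        for kmer in kmers(seq, k):
            for host_id in host_index.get(kmer, ()):
                hits.setdefault((pathogen_id, host_id), set()).add(kmer)
    return hits


# Toy sequences invented for this example.
pathogen = {"p1": "MKTLLDEAGRW"}
host = {"h1": "AAALLDEAGPP", "h2": "MSTNPKPQRK"}
hits = shared_kmers(pathogen, host, k=5)
print(hits)
```

Exact matching is the simplest possible criterion; tolerating substitutions between similar amino acids would need a different lookup strategy.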
Availability: mimicDetector is freely available at https://github.com/kayleerich/mimicDetector/, implemented in Python and Snakemake, and compatible with Unix-based systems.
Supplementary information: Data related to the results are incorporated into the article and online supplementary material available at Bioinformatics online.
Title: mimicDetector: a pipeline for protein motif mimicry detection in host-pathogen interactions.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2026-01-11; DOI: 10.1093/bioinformatics/btag016
Joseph Thorpe, Nina Billows, Gabrielle C Ngwana-Joseph, Amy Ibrahim, Deborah Nolder, Colin J Sutherland, Nguyen Thi Hong Ngoc, Nguyen Thi Huong Binh, Nguyen Quang Thieu, Jamille G Dombrowski, Silvia Maria Di Santi, Claudio R F Marinho, Jody E Phelan, Tomasz Kurowski, Fady Mohareb, Susana Campino, Taane G Clark
Motivation: Malaria, caused by Plasmodium parasites, imposes a significant public health burden. While Plasmodium falciparum remains the primary target of elimination strategies due to its high mortality rate, lesser-known species such as P. malariae, P. vivax, and P. knowlesi continue to contribute to substantial human morbidity. Genomic approaches, including whole-genome sequencing, offer powerful tools for understanding the biology, transmission, and emerging drug resistance of these neglected Plasmodium species. However, there is an urgent need for informatic tools to summarise and visualise the high-dimensional and complex genomic data generated.
Results: We developed Malaria-GENOMAP, a user-friendly web-based tool that integrates genomic variant data, such as allele frequencies, with geographical maps and with views ranging from chromosome-wide to gene level for in-depth exploration. The tool includes variation from P. knowlesi (n = 139), P. malariae (n = 158), P. ovale curtisi (n = 36), P. ovale wallikeri (n = 47), P. simium (n = 38), and P. vivax (n = 1,359). It enables the investigation of population structure, geographic associations of mutations, and putative drug resistance markers, offering valuable insights for malaria control efforts.
Availability: Malaria-GENOMAP is available online at https://genomics.lshtm.ac.uk/malaria-genomaps/#/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Malaria-GENOMAP: A web-based tool for exploring genomic variation of malaria parasites.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2026-01-11; DOI: 10.1093/bioinformatics/btag015
Cameron S Movassaghi, Amanda Momenzadeh, Jesse G Meyer
Motivation: Scientific software packages impose persistent maintenance costs due to dependency churn, version incompatibilities, and bug triage, even when the underlying algorithms are stable and well described. At the same time, peer-reviewed publications already function as the canonical record of many computational methods, yet translating narrative method descriptions into usable code remains labor intensive and error prone. Recent advances in large language models (LLMs) raise the question of whether published articles alone can serve as sufficient specifications for on-demand code generation, potentially reducing reliance on continuously maintained libraries.
Results: We systematically evaluated state-of-the-art LLMs by tasking them with implementing core algorithms using only the original scientific publications as input. Across a diverse benchmark including random forests, batch correction methods, gene regulatory network inference, and gene set enrichment analysis, we show that modern LLMs can frequently reproduce package-level functionality with performance indistinguishable from established libraries. Failures and discrepancies primarily arose when manuscripts underspecified implementation details or data structures, rather than from limitations in model reasoning. These results demonstrate that literature-driven code generation is already feasible for many well-specified algorithms, while also exposing where current publication standards hinder reproducibility.
Availability: All prompts, generated code, evaluation scripts, and benchmark datasets are publicly available at https://github.com/xomicsdatascience/articles-to-code.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: From Articles to Code: On-Demand Generation of Core Algorithms from Scientific Publications.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2026-01-11; DOI: 10.1093/bioinformatics/btaf681
Anna Laddach, Fränze Progatzky, Vassilis Pachnis, Michael Shapiro
Summary: CatsCradle is an R package for single-cell analysis that exploits the duality between cells and the genes they express. Our package provides tools to cluster genes, visualise relationships between them, and to explore relationships between gene clusters (programmes) and cell clusters (cell types).
Availability and implementation: CatsCradle is available freely as an R Bioconductor package (https://bioconductor.org/packages/CatsCradle) and interfaces directly with Seurat and SingleCellExperiment data structures.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Capturing gene-cell duality in a cat's cradle.
Journal: Bioinformatics (Oxford, England)
Pub Date: 2026-01-11; DOI: 10.1093/bioinformatics/btag003
Evgeny Tankhilevich, Sergio Martinez Cuesta, Ian Barrett, Carolina Berg, Lovisa Holmberg Schiavone, Andrew R Leach
Motivation: Recombinant protein expression can be a limiting step in the production of protein reagents for drug discovery and other biotechnology applications. We introduce RP3Net (Recombinant Protein Production Prediction Network), an AI model of small-scale heterologous soluble protein expression in Escherichia coli. RP3Net utilizes the most recent protein and genomic foundational models. A curated dataset of internal experimental results from AstraZeneca (AZ) and publicly available data from the Structural Genomics Consortium (SGC) was used for training, validation and testing of RP3Net.
Results: RP3Net achieves an increase of 0.15 in area under the receiver operating characteristic curve (AUROC) compared to a baseline model. When experimentally validated on an independent, prospective, manually selected set of 97 constructs, RP3Net outperformed currently available models, with an AUROC of 0.83, delivering accurate predictions in 77% of the cases, and correctly identifying successfully expressing constructs in 92% of cases.
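For readers unfamiliar with the AUROC metric reported above, the following Python sketch computes it from its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are invented for illustration and are unrelated to the RP3Net evaluation data.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for a positive outcome (e.g. construct expressed), 0 otherwise.
    scores: model outputs, where higher means more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Invented example: six constructs with outcomes and predicted scores.
expressed = [1, 1, 0, 1, 0, 0]
predicted = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
value = auroc(expressed, predicted)
print(round(value, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for a reported AUROC of 0.83.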
Availability: The model, along with installation and running instructions, is available under an MIT license at https://github.com/RP3Net/RP3Net, DOI 10.5281/zenodo.17243498.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: RP3Net: a deep learning model for predicting recombinant protein production in Escherichia coli.
Journal: Bioinformatics (Oxford, England)
Pub Date : 2026-01-09 DOI: 10.1093/bioinformatics/btag005
Alan J S Beavan, Maria Rosa Domingo-Sananes, James O McInerney
Motivation: The presence or absence of some genes in a genome can influence whether other genes are likely to be present or absent. Understanding these gene co-occurrence and avoidance patterns reveals fundamental principles of genome organisation, with applications ranging from evolutionary reconstruction to rational design of synthetic genomes.
Implementations: PanForest, presented here, uses random forest classifiers to predict the presence and absence of genes in genomes from the set of other genes present. Performance statistics output by PanForest reveal how predictable each gene's presence or absence is, based on the presence or absence of other genes in the genome. Further, PanForest produces statistics indicating the importance of each gene in predicting the presence or absence of each other gene. The PanForest software can run serially or in parallel, thereby facilitating the analysis of pangenomes at Network of Life scale.
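The approach described above can be illustrated with a toy, from-scratch sketch: a small random forest predicts one gene's presence or absence from the other genes, then reports held-out accuracy (how predictable the gene is) and normalised importances (which genes drive the prediction). This is not PanForest's code or interface. Every function name, parameter and the synthetic presence/absence matrix below are assumptions made for illustration.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels):
    return 1 if 2 * sum(labels) > len(labels) else 0


def grow_tree(rows, labels, features, rng, importance, depth=0, max_depth=4):
    """Grow one tree on binary features.
    Returns a leaf (0 or 1) or a tuple (feature, subtree_absent, subtree_present)."""
    if depth >= max_depth or len(set(labels)) < 2:
        return majority(labels)
    parent, best = gini(labels), None
    k = max(1, int(len(features) ** 0.5))  # random feature subset per node
    for f in rng.sample(features, k):
        absent = [y for r, y in zip(rows, labels) if r[f] == 0]
        present = [y for r, y in zip(rows, labels) if r[f] == 1]
        if not absent or not present:
            continue
        child = (len(absent) * gini(absent) + len(present) * gini(present)) / len(labels)
        gain = parent - child
        if gain > 0 and (best is None or gain > best[0]):
            best = (gain, f)
    if best is None:
        return majority(labels)
    gain, f = best
    importance[f] += gain * len(labels)  # impurity decrease weighted by node size
    idx0 = [i for i, r in enumerate(rows) if r[f] == 0]
    idx1 = [i for i, r in enumerate(rows) if r[f] == 1]
    sub0 = grow_tree([rows[i] for i in idx0], [labels[i] for i in idx0],
                     features, rng, importance, depth + 1, max_depth)
    sub1 = grow_tree([rows[i] for i in idx1], [labels[i] for i in idx1],
                     features, rng, importance, depth + 1, max_depth)
    return (f, sub0, sub1)


def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, sub0, sub1 = tree
        tree = sub1 if row[f] == 1 else sub0
    return tree


def fit_forest(rows, labels, features, n_trees=30, seed=0):
    """Train n_trees trees on bootstrap samples; return trees and normalised importances."""
    rng = random.Random(seed)
    importance = {f: 0.0 for f in features}
    trees, n = [], len(rows)
    for _ in range(n_trees):
        picks = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of genomes
        trees.append(grow_tree([rows[i] for i in picks], [labels[i] for i in picks],
                               features, rng, importance))
    total = sum(importance.values())
    if total > 0:
        importance = {f: v / total for f, v in importance.items()}
    return trees, importance


def predict_forest(trees, row):
    votes = sum(predict_tree(t, row) for t in trees)
    return 1 if 2 * votes > len(trees) else 0


def analyse_gene(matrix, target, n_train, seed=0):
    """Predict gene `target` from all other genes; return held-out accuracy and importances."""
    features = [g for g in range(len(matrix[0])) if g != target]
    labels = [row[target] for row in matrix]
    trees, importance = fit_forest(matrix[:n_train], labels[:n_train], features, seed=seed)
    held_out = list(zip(matrix[n_train:], labels[n_train:]))
    hits = sum(predict_forest(trees, r) == y for r, y in held_out)
    return hits / len(held_out), importance


# Synthetic matrix (invented): rows are genomes, columns are genes.
# Gene 1 always co-occurs with gene 0; genes 2-4 are independent noise.
data_rng = random.Random(1)
matrix = []
for _ in range(80):
    a = data_rng.randint(0, 1)
    matrix.append([a, a] + [data_rng.randint(0, 1) for _ in range(3)])

accuracy, importance = analyse_gene(matrix, target=0, n_train=60)
print("held-out accuracy for gene 0:", accuracy)
print("importance of each other gene:", importance)
```

In this sketch the co-occurring gene receives most of the importance, while repeating `analyse_gene` with a noise gene as the target would be expected to give accuracy near chance. Because each target gene is analysed independently, the per-gene loop is the natural unit to distribute across processors.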
Results: A pangenome of 12,741 accessory genes in 1,000 Escherichia coli genomes was analysed in around 5 hours using 8 processors. To demonstrate PanForest's utility, we present a case study and show that certain genes associated with resistance to antimicrobial drugs reliably predict the presence or absence of other genes associated with resistance to the same drug. Further, we highlight several associations between those genes and others not known to be associated with antimicrobial resistance (AMR), or associated with resistance to other drugs. We envisage PanForest's use in studies from multiple disciplines concerning the dynamics of gene distributions in pangenomes ranging from biomedical science and synthetic biology to molecular ecology.
Availability: The software is freely available with a full manual at www.github.com/alanbeavan/PanForest (DOI: https://doi.org/10.5281/zenodo.17865482).
Supplementary information: Supplementary data are available at Bioinformatics online.
{"title":"PanForest: predicting genes in genomes using random forests.","authors":"Alan J S Beavan, Maria Rosa Domingo-Sananes, James O McInerney","doi":"10.1093/bioinformatics/btag005","DOIUrl":"https://doi.org/10.1093/bioinformatics/btag005","journal":{"name":"Bioinformatics (Oxford, England)"},"PeriodicalIF":5.4,"publicationDate":"2026-01-09","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"145946703"}