Pub Date: 2026-01-14 | eCollection Date: 2026-01-01 | DOI: 10.1093/bioadv/vbag010
Shanza Ayub, Jennifer L Gorman, Edward L Y Chen, Hartland W Jackson, Alina Selega, Kieran R Campbell
Motivation: Analysis workflows for highly multiplexed imaging technologies typically summarize each cell by its post-segmentation mean expression, but additional cellular information can be quantified, including cell morphology, sub-cellular expression patterns, and spatial cellular context, ultimately giving a multi-modal view of each cell. While deep learning models such as variational autoencoders are well established for other multi-modal single-cell assays, their ability to integrate these multiple views of a cell from highly multiplexed imaging data remains largely unknown.
Results: Here, we explore the ability of multi-modal variational autoencoders to learn unified latent cellular representations from multiple views of each single cell quantified from highly multiplexed imaging, including mean expression, morphology, sub-cellular protein co-localization, and spatial cellular context, while conditioning on technical and batch-specific effects. We show that the integrated multi-modal latent space is often more strongly associated with patient-specific clinical outcomes than a set of existing baselines. In addition, we perform ablation analyses to understand which input views contribute to model performance, and explore the ability of these models to learn cellular representations that align with cellular phenotypes and enable integration across divergent datasets.
Availability and implementation: hmiVAE is implemented as a Python package and is available at https://github.com/camlab-bioml/hmiVAE.
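The abstract above describes fusing several views of a cell (expression, morphology, spatial context) into one latent representation. One common way multi-modal VAEs combine view-specific Gaussian posteriors is a product of experts, i.e. precision-weighted averaging. The sketch below illustrates only that fusion step; it is not hmiVAE's actual architecture, and the view names in the example are hypothetical.

```python
import numpy as np

def product_of_experts(means, logvars):
    """Fuse per-view Gaussian posteriors q_v(z|x_v) into a single Gaussian
    by precision-weighted averaging (a common multi-modal VAE technique).
    means, logvars: array-likes of shape (n_views, latent_dim)."""
    precisions = np.exp(-np.asarray(logvars))          # 1 / sigma_v^2 per view
    fused_var = 1.0 / precisions.sum(axis=0)           # combined variance
    fused_mean = fused_var * (precisions * np.asarray(means)).sum(axis=0)
    return fused_mean, fused_var

# Toy example: three "views" (say expression, morphology, neighbourhood)
# each yield a 2-D posterior for one cell, all with unit variance.
means = [[0.0, 1.0], [2.0, 1.0], [1.0, 1.0]]
logvars = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
mu, var = product_of_experts(means, logvars)
print(mu)   # with equal unit variances this is the simple mean: [1. 1.]
print(var)  # combined variance shrinks to 1/3 per dimension
```

With unequal variances the fused mean is pulled toward the more confident views, which is why this fusion rule is popular for integrating modalities of varying quality.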
Title: Multi-view deep learning of highly multiplexed imaging data improves association of cell states with clinical outcomes. (Bioinformatics Advances, 6(1): vbag010)
Pub Date: 2026-01-12 | eCollection Date: 2026-01-01 | DOI: 10.1093/bioadv/vbag008
Amy Francis, Colin Campbell, Tom R Gaunt
Motivation: Missense variants (single nucleotide substitutions that result in an amino acid change in the encoded protein) play an important role in cancer. Distinguishing between recurrent and rare missense variants may reveal insights into selective pressures and functional consequences. While recurrent variants may undergo positive selection across patients, rare variants can also drive resistance or other phenotypes. However, most existing tools predict pathogenicity across broad populations and ignore tumour-specific contexts. Here, we present CanDrivR-CS, a suite of cancer-specific gradient boosting models designed to distinguish between rare and recurrent somatic missense variants.
Results: We curated data from the International Cancer Genome Consortium (ICGC) and trained 50 cancer-specific models. These significantly outperformed a pan-cancer baseline, achieving up to 90% F1 score in leave-one-group-out cross-validation (LOGO-CV) for skin melanoma. Notably, DNA shape features ranked among the most predictive across all cancers, with recurrent variants enriched in structurally complex DNA regions such as bends and rolls (potential mutational hotspots).
Availability and implementation: All code and data are available at the CanDrivR-CS GitHub repository, https://github.com/amyfrancis97/CanDrivR-CS, with further advice on the installation procedure in Section 1 of the Supplementary Materials.
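The evaluation protocol named above, leave-one-group-out cross-validation with a gradient boosting classifier, can be sketched with scikit-learn. The data here are synthetic stand-ins, not the ICGC features the paper uses; only the cross-validation mechanics match the description.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 8))                      # stand-in variant features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # recurrent (1) vs rare (0)
groups = rng.integers(0, 5, size=n)              # e.g. donor / cohort labels

# Each fold holds out every sample from one group, so the model is never
# tested on a group it has seen during training.
logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups):
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))
print(f"mean LOGO-CV F1 over {len(scores)} folds: {np.mean(scores):.2f}")
```

Grouping by donor or cohort rather than splitting randomly is what prevents leakage of patient-specific signal between training and test folds.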
Title: CanDrivR-CS: a cancer-specific machine learning framework for distinguishing recurrent and rare variants. (Bioinformatics Advances, 6(1): vbag008)
Pub Date: 2026-01-12 | eCollection Date: 2026-01-01 | DOI: 10.1093/bioadv/vbag006
Pramesh Shakya, Ardalan Naseri, Degui Zhi, Shaojie Zhang
Motivation: The availability of large-scale genetic data presents a unique opportunity to study the genetic ancestries of individuals, which requires an efficient and scalable method. Existing global ancestry methods are accurate, but they cannot scale to large genetic datasets. Identity-by-descent (IBD) segments are DNA segments shared by individuals such that they are inherited from a common recent ancestor without recombination. These IBD segments, which reflect co-ancestry, provide an efficient alternative for inferring genetic ancestry.
Results: We introduce a reference-based global ancestry inference method called FRAME (Fast Reference-based Ancestry Makeup Estimation). FRAME utilizes partial local ancestry information estimated through IBD segments. Instead of using sophisticated local ancestry inference methods designed to make the best call at each site, we employ an efficient IBD-detection method, yielding a faster, more space-efficient algorithm that is robust to genotyping errors. Additionally, we introduce a new method of panel refinement that can enrich the ancestral homogeneity of individual haplotypes in the reference panel, leading to more accurate ancestry composition estimates. We benchmarked the performance of our method on real and simulated data. FRAME consumes ∼10-100 times less memory while maintaining comparable accuracy.
Availability and implementation: Source code is available at https://github.com/ucfcbb/FRAME.
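The core idea above, using IBD sharing with a labelled reference panel as a proxy for ancestry, can be illustrated by summing shared segment lengths per ancestry label and normalizing. This is a deliberately simplified sketch with hypothetical identifiers; FRAME's actual algorithm (IBD detection, panel refinement) is considerably more involved.

```python
from collections import defaultdict

def ancestry_makeup(ibd_segments, panel_ancestry):
    """Estimate global ancestry fractions for one target individual by
    summing IBD segment lengths shared with labelled reference haplotypes.
    ibd_segments: list of (reference_id, segment_length_cM) tuples.
    panel_ancestry: dict mapping reference_id -> ancestry label.
    Illustrative only, not FRAME's published algorithm."""
    totals = defaultdict(float)
    for ref_id, length in ibd_segments:
        totals[panel_ancestry[ref_id]] += length
    grand_total = sum(totals.values())
    return {anc: length / grand_total for anc, length in totals.items()}

# Hypothetical panel and IBD calls for one target individual.
panel = {"ref1": "AFR", "ref2": "EUR", "ref3": "EUR"}
segments = [("ref1", 12.0), ("ref2", 30.0), ("ref3", 18.0)]
print(ancestry_makeup(segments, panel))  # {'AFR': 0.2, 'EUR': 0.8}
```

Because only segment lengths and labels are needed, this kind of summary is cheap to compute and store, which is consistent with the memory savings the abstract reports.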
Title: FRAME: fast reference-based ancestry makeup estimation tool. (Bioinformatics Advances, 6(1): vbag006)
Pub Date: 2026-01-09 | eCollection Date: 2026-01-01 | DOI: 10.1093/bioadv/vbag007
Sankarasubramanian Jagadesan, Chittibabu Guda
Summary: Visium HD Spatial Transcriptomics Data Analysis and Visualization (VST-DAVis) is an interactive R Shiny application, accessible through a web browser, designed for intuitive analysis of spatial transcriptomics data generated using the 10x Genomics Visium HD platform. This user-friendly tool empowers researchers, particularly those without programming expertise, to perform end-to-end spatial transcriptomics analysis through a streamlined graphical interface. The platform is capable of handling both single and multiple samples, enabling comparative analyses across diverse biological conditions or replicates. It accepts various input formats, including both H5 and matrix-based files from Space Ranger, and outputs high-quality graphics from various visualization tools. VST-DAVis integrates several widely used R packages, such as Seurat, Monocle3, CellChat, and hdWGCNA, to offer a robust and flexible analytical environment that supports a wide range of analytical tasks, including quality control, clustering, marker gene identification, subclustering, trajectory inference, pathway enrichment analysis, cell-cell communication modeling, co-expression analysis, and transcription factor network reconstruction. By combining its analytical depth with user-friendliness, VST-DAVis makes advanced analyses accessible to various research communities that utilize spatial transcriptomics data.
Availability and implementation: VST-DAVis is freely available at https://www.gudalab-rtools.net/VST-DAVis. It is implemented in R 4.5.2 and Bioconductor ≥ 3.22 using the Shiny framework and supports input from Space Ranger outputs. The source code and documentation are hosted on GitHub: https://github.com/GudaLab/VST-DAVis.
Title: VST-DAVis: an R Shiny application and web-browser for spatial transcriptomics data analysis and visualization. (Bioinformatics Advances, 6(1): vbag007)
Pub Date: 2026-01-09 | eCollection Date: 2026-01-01 | DOI: 10.1093/bioadv/vbag005
Elvira Toscano, Elena Cimmino, Angelo Boccia, Leandra Sepe, Giovanni Paolella
Motivation: Quality assessment and assembly comparison are essential steps when assembling new genomes. Many assembly-evaluation tools provide summary parameters representing assembly quality or overall features, while others produce long, detailed files in which it is not always easy to identify and visualize the regions of correspondence and difference among different chromosome assemblies.
Results: Here we present ChromoMapper, a new tool that scans the output from QUAST, as well as other similar alignment description files, to quickly identify and display similarities and differences between the compared assemblies. It uses the information provided about aligned blocks, combined with additional annotations, to represent the main alignment regions at chromosomal or sub-chromosomal scale, highlighting similarities and collinearity between compared sequences, points of inconsistency, discontinuities, repeated regions, and interruptions in the assembled sequences.
Availability and implementation: ChromoMapper is available at https://chromomapper.ceinge.unina.it/ and via Zenodo (https://doi.org/10.5281/zenodo.16778863).
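The block-scanning idea described above, reading aligned blocks and flagging discontinuities between assemblies, can be sketched as follows. The tuple format here is a hypothetical simplification, not QUAST's actual output schema, and the threshold is arbitrary.

```python
def find_breaks(blocks, max_gap=10_000):
    """Given aligned blocks as (ref_start, ref_end, query_start, query_end)
    tuples sorted by ref_start, report gaps on the reference larger than
    max_gap: candidate discontinuities between the two assemblies.
    A simplified sketch; ChromoMapper parses real QUAST alignment files."""
    breaks = []
    for (_, prev_end, _, _), (start, _, _, _) in zip(blocks, blocks[1:]):
        gap = start - prev_end
        if gap > max_gap:
            breaks.append((prev_end, start, gap))
    return breaks

# Hypothetical blocks: two near-contiguous alignments, then an 80 kb jump.
blocks = [(1, 50_000, 1, 50_200),
          (50_100, 120_000, 50_300, 120_500),
          (200_000, 260_000, 120_600, 180_000)]
print(find_breaks(blocks))  # [(120000, 200000, 80000)]
```

A real comparison would also track strand, query-side gaps, and overlapping blocks (evidence of repeats), but the pairwise scan over sorted blocks is the essential pattern.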
Title: ChromoMapper: a new tool to quickly compare large genome assemblies. (Bioinformatics Advances, 6(1): vbag005)
Pub Date: 2026-01-08 | eCollection Date: 2026-01-01 | DOI: 10.1093/bioadv/vbag004
Saeedeh Davoudi, Christopher S Henry, Christopher S Miller, Farnoush Banaei-Kashani
Motivation: Enzymes are proteins that catalyze specific biochemical reactions in cells. Enzyme Commission (EC) numbers are used to annotate enzymes in a four-level hierarchy that classifies enzymes based on the specific chemical reactions they catalyze. Accurate EC number prediction is essential for understanding enzyme functions. Despite the availability of numerous methods for predicting EC numbers from protein sequences, there is no unified framework for evaluating and studying such methods systematically. This gap limits the ability of the community to identify the most effective approaches for enzyme annotation.
Results: We introduce EC-Bench, a benchmark for EC number prediction, consisting of (i) an initial representative set of existing methods (including homology-based, deep learning, contrastive learning, and language model methods), (ii) existing and novel accuracy and efficiency performance metrics, and (iii) selected datasets to allow for comprehensive comparative study. EC-Bench is open-source and provides a framework for researchers to not only compare among existing methods objectively under uniform conditions, but also to introduce and effectively evaluate performance of new methods in a comparative framework. To demonstrate the utility of EC-Bench, we perform extensive experimentation to compare the existing EC number prediction methods and establish their advantages and disadvantages in a variety of prediction tasks, namely "exact EC number prediction," "EC number completion," and (partial or additional) "EC number recommendation." We find wide variation in the performance of different methods, but also subtle but potentially useful differences in the performance of different methods across tasks and for different parts of the EC hierarchy.
Availability and implementation: The benchmarking pipeline is available at https://github.com/dsaeedeh/EC-Bench.
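Because EC numbers form a four-level hierarchy, evaluation naturally distinguishes how deep a prediction agrees with the truth, which underlies tasks like "EC number completion". A minimal illustrative metric (not necessarily one of EC-Bench's own) is the matched depth:

```python
def ec_match_depth(true_ec, pred_ec):
    """Depth (0-4) to which a predicted EC number matches the true one,
    comparing the four hierarchy levels left to right and stopping at
    the first disagreement. Illustrative; EC-Bench defines its own metrics."""
    depth = 0
    for t, p in zip(true_ec.split("."), pred_ec.split(".")):
        if t != p:
            break
        depth += 1
    return depth

print(ec_match_depth("1.1.1.1", "1.1.1.1"))   # 4: exact EC number prediction
print(ec_match_depth("1.1.1.1", "1.1.3.15"))  # 2: agrees only at class/subclass
print(ec_match_depth("2.7.11.1", "3.4.21.4")) # 0: wrong top-level class
```

Averaging such depths over a test set rewards methods that get the coarse enzyme class right even when the serial number differs, which is exactly the kind of per-level behaviour the abstract says varies across methods.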
Title: EC-Bench: a benchmark for enzyme commission number prediction. (Bioinformatics Advances, 6(1): vbag004)
Pub Date: 2026-01-08 | eCollection Date: 2026-01-01 | DOI: 10.1093/bioadv/vbaf325
Bram Nap, Bronson Weston, Annette Brandt, Maximilian F Wodak, Ina Bergheim, Ines Thiele
Motivation: Nutrition is an important factor in human health, used to alleviate or prevent symptoms of various diseases. However, the effects of nutrition on the gut microbiome and human metabolism are not well understood. Whole-body metabolic models (WBMs) have been applied to study relationships between regional diets and human/microbiome metabolism. This method requires diets to be defined at the metabolite level, rather than the food item level, which has limited the application of personalized diets in WBMs.
Results: We developed the Nutrition Toolbox, which leverages open-source databases containing metabolite composition for over ten thousand food items to convert food items into their metabolic composition to create in silico diets. Additionally, when used with a previously published nutrition algorithm, minimal changes to a diet can be identified to achieve desirable shifts in human and microbiome metabolism. Taken together, we believe that the Nutrition Toolbox can help to understand the effects of nutrition on human metabolism and has the potential to contribute to personalized nutrition.
Availability and implementation: The Nutrition Toolbox is written in MATLAB. The code can be found at https://github.com/opencobra/cobratoolbox. A tutorial explaining the code is available in the COBRA toolbox and as a view-only supplementary tutorial. Details on installing the COBRA toolbox are available at https://opencobra.github.io/cobratoolbox/stable/installation.html.
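The conversion step the abstract describes, turning food items into a metabolite-level diet via a composition database, amounts to a weighted sum over a composition table. The sketch below uses a tiny hypothetical table (the toolbox itself is MATLAB and draws on open food databases covering thousands of items):

```python
def diet_to_metabolites(diet_grams, composition):
    """Convert a diet given as food items (grams eaten) into total
    metabolite amounts, using a per-100 g composition table.
    composition: food -> {metabolite: grams per 100 g of that food}.
    Hypothetical values, for illustration only."""
    totals = {}
    for food, grams in diet_grams.items():
        for metabolite, per_100g in composition[food].items():
            totals[metabolite] = totals.get(metabolite, 0.0) + grams / 100.0 * per_100g
    return totals

composition = {
    "apple":  {"glucose": 2.4, "fructose": 5.9},
    "banana": {"glucose": 5.0, "fructose": 4.9},
}
diet = {"apple": 150, "banana": 100}   # grams consumed per day
print(diet_to_metabolites(diet, composition))
```

The resulting metabolite totals are what a whole-body metabolic model can take as dietary input bounds, which is why this mapping unlocks personalized diets for WBMs.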
Title: The nutrition toolbox permits in silico generation, analysis, and optimization of personalized diets through metabolic modelling. (Bioinformatics Advances, 6(1): vbaf325)
Pub Date: 2026-01-08 | eCollection Date: 2026-01-01 | DOI: 10.1093/bioadv/vbaf328
Aryan Bhasin, Francesco Saccon, Callum Canavan, Andrew Robson, Joao Euko, Alexandra C Walls, Yunguan Fu
Summary: Since the emergence of SARS-CoV-2, numerous studies have investigated antibody interactions with viral variants in vitro, and several datasets have been curated to compile available protein structures and experimental measurements. However, existing data remain fragmented, limiting their utility for the development and validation of machine learning models for antibody-antigen interaction prediction. Here, we present CoV-UniBind, a unified database comprising over 75 000 entries of SARS-CoV-2 antibody-antigen sequence, binding, and structural data, integrated and standardized from three public sources and multiple peer-reviewed publications. To demonstrate its utility, we benchmarked multiple protein folding, inverse folding, and language models across tasks relevant to antibody design and vaccine development. We expect CoV-UniBind to facilitate future computational efforts in antibody and vaccine development against SARS-CoV-2.
Availability and implementation: The curated datasets, model scores and antibody synonyms are free to download at https://huggingface.co/datasets/InstaDeepAI/cov-unibind. Folded structures are available upon request.
Title: CoV-UniBind: a unified antibody binding database for SARS-CoV-2. (Bioinformatics Advances, 6(1): vbaf328)
Pub Date: 2026-01-02. eCollection Date: 2026-01-01. DOI: 10.1093/bioadv/vbaf326
Bruno Marques Silva, Fernanda de Jesus Trindade, Lucas Eduardo Costa Canesin, Giordano Souza, Alexandre Aleixo, Gisele Nunes, Renato Renison Moreira-Oliveira
Motivation: Although high-quality chromosome-scale genome assemblies are feasible, assembling large ones remains complex and resource-intensive. This demands reproducible and automated workflows that not only implement current best practices efficiently but also allow for improvement alongside future updates to those standards.
Results: We present Pipeasm, a Snakemake-based genome assembly pipeline containerized with Singularity. Pipeasm can use HiFi, ONT, and Hi-C data, automating read trimming, nuclear and mitogenome assembly, scaffolding, decontamination, and quality evaluation. Applied to four vertebrate species with distinct genomic characteristics, starting from a single command line and configuration file, it produced assemblies with scaffold L50 proportional to the expected chromosome and genome length, and up to 99.6% BUSCO completeness. Its output also includes detailed reports for each step, genome statistics, Hi-C maps, and files ready for curation.
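The scaffold L50 statistic reported above is computed from scaffold lengths alone: sort scaffolds from largest to smallest and count how many are needed to cover half the assembly. A minimal sketch (a standard definition, not Pipeasm's code; the toy lengths are invented):

```python
def l50_n50(lengths):
    """Return (L50, N50): the smallest number of scaffolds whose summed
    length reaches half the assembly, and the length of the scaffold at
    which that threshold is crossed."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return count, length
    raise ValueError("empty assembly")

# Toy assembly of total length 100: the two largest scaffolds (40 + 25)
# are the first to reach 50%, so L50 = 2 and N50 = 25.
print(l50_n50([40, 25, 20, 10, 5]))  # (2, 25)
```

A chromosome-scale assembly is one where L50 approaches the expected chromosome count, which is the property the benchmark above checks.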
Availability and implementation: Pipeasm is available at https://github.com/itvgenomics/pipeasm, implemented in Python/Snakemake with Singularity, and runs on Unix-based systems.
"Pipeasm: a tool for automated large chromosome-scale genome assembly and evaluation." Bruno Marques Silva, Fernanda de Jesus Trindade, Lucas Eduardo Costa Canesin, Giordano Souza, Alexandre Aleixo, Gisele Nunes, Renato Renison Moreira-Oliveira. Bioinformatics Advances 6(1):vbaf326. DOI: 10.1093/bioadv/vbaf326. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12800776/pdf/
Pub Date: 2026-01-02. DOI: 10.1093/bioadv/vbaf332
Zelin Wang, Yuanyuan Yin, Jien Wang, Haiyan Yan, Xuan Xie, Yiqing Zheng
Motivation: Creating publication-quality visualizations is essential for bioinformatics but remains a bottleneck for researchers with limited coding expertise. While Large Language Models (LLMs) are proficient at generating code, they often fail in practice due to library dependencies, dataset mismatches, or syntax errors. These issues require manual intervention, slowing data interpretation.
Results: We present ggplotAgent, a novel multi-modal, self-debugging artificial intelligence agent that automates publication-ready ggplot2 visualizations. It features a dual-layered framework that resolves code execution errors and uses a vision-enabled agent to verify aesthetic correctness. In benchmarks against the DeepSeek-V3 model, ggplotAgent achieved a 100% code executability rate (versus 85%) and a "Publication-Ready" score of 1.9 (versus 0.7). Surprisingly, it showcased the ability to act as an expert collaborator by intelligently enhancing plots beyond the user's literal prompt, achieving a positive Insight Score of +0.3 versus the baseline's -0.05. These results demonstrate its ability to reliably produce accurate, high-quality visualizations directly from natural language.
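The self-debugging layer described above follows a generic generate-execute-retry pattern: run the generated code, and on failure feed the error message back into the model for another attempt. The sketch below shows that pattern in the abstract; it is not ggplotAgent's implementation, and `generate`/`execute` are hypothetical stand-ins for an LLM call and a sandboxed interpreter.

```python
def self_debug(generate, execute, prompt, max_rounds=3):
    """Generic self-debugging loop: generate code, try to execute it,
    and on failure pass the error back to the generator and retry.
    generate(prompt, error) -> code; execute(code) -> (ok, error)."""
    error = None
    for _ in range(max_rounds):
        code = generate(prompt, error)
        ok, error = execute(code)
        if ok:
            return code
    raise RuntimeError(f"still failing after {max_rounds} rounds: {error}")

# Toy demonstration with a scripted "model" that fixes its bug on round 2.
attempts = iter(["print(undefined_var)", "print('plot saved')"])

def fake_generate(prompt, error):
    # A real agent would include `error` in the follow-up LLM prompt.
    return next(attempts)

def fake_execute(code):
    try:
        exec(code, {})
        return True, None
    except Exception as exc:
        return False, str(exc)

print(self_debug(fake_generate, fake_execute, "scatter plot of df"))
```

ggplotAgent's second, vision-enabled layer would add a further check after successful execution, judging the rendered image rather than the exit status.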
Availability and implementation: ggplotAgent is freely accessible as a public web application at https://ggplotagent.databio1.com/ and an offline Streamlit app. The source code is available on GitHub at https://github.com/charlin90/ggplotAgent. This software is distributed under the MIT License.
"ggplotAgent: a self-debugging multi-modal agent for robust and reproducible scientific visualization." Zelin Wang, Yuanyuan Yin, Jien Wang, Haiyan Yan, Xuan Xie, Yiqing Zheng. Bioinformatics Advances 6(1):vbaf332. DOI: 10.1093/bioadv/vbaf332. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12802885/pdf/