PgRC2: engineering the compression of sequencing reads
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf101
Tomasz M Kowalski, Szymon Grabowski
Summary: The FASTQ format remains at the heart of high-throughput sequencing. Despite advances, specialized FASTQ compressors still involve imperfect practical performance trade-offs. We present a multi-threaded version of the Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream based on approximating the shortest common superstring over high-quality reads. Redundancy in the obtained string is efficiently removed using a compact temporary representation. The current version, v2.0, preserves the compression ratio of its predecessor while reducing compression time by a factor of 8-9 and decompression time by a factor of 2-2.5 on a 14-core/28-thread machine.
Availability and implementation: PgRC 2.0 can be downloaded from https://github.com/kowallus/PgRC and https://zenodo.org/records/14882486 (10.5281/zenodo.14882486).
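The shortest-common-superstring idea at the core of PgRC can be illustrated with a toy greedy approximation. This is a didactic sketch only, not PgRC's engineered algorithm, which relies on far more efficient data structures and handles sequencing errors:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_scs(reads):
    """Greedy shortest-common-superstring approximation:
    repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, i, j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads[0]
```

For example, `greedy_scs(["ACGTAC", "GTACGG", "TACGGA"])` collapses three overlapping 6-mers into a single 9-character superstring containing all of them.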
SpectroPipeR: a streamlining post-Spectronaut® DIA-MS data analysis R package
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf086
Stephan Michalik, Elke Hammer, Leif Steil, Manuela Gesell Salazar, Christian Hentschker, Kristin Surmann, Larissa M Busch, Thomas Sura, Uwe Völker
Summary: Proteome studies frequently encounter challenges in downstream data analysis due to limited bioinformatics resources, rapid data generation, and variations in analytical methods. To address these issues, we developed SpectroPipeR, an R package designed to streamline data analysis tasks and provide a comprehensive, standardized pipeline for Spectronaut® DIA-MS data. The package automates various analytical processes, including XIC plots, ID rate summaries, normalization, batch and covariate adjustment, relative protein quantification, multivariate analysis, and statistical analysis, while generating interactive HTML reports suitable for, e.g., electronic lab notebook (ELN) systems.
Availability and implementation: The SpectroPipeR package (manual: https://stemicha.github.io/SpectroPipeR/) was written in R and is freely available on GitHub (https://github.com/stemicha/SpectroPipeR).
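As a flavor of one such pipeline step, here is a minimal Python sketch of median normalization across samples. SpectroPipeR itself is written in R and offers several normalization strategies; this toy function is only an illustration of the general idea:

```python
def median_normalize(matrix):
    """Scale each sample (column) so that its median intensity matches
    the global median of the per-sample medians. matrix: list of rows
    (features), each row holding one intensity per sample."""
    def median(xs):
        s = sorted(xs)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    cols = list(zip(*matrix))
    col_medians = [median(c) for c in cols]
    target = median(col_medians)
    factors = [target / m for m in col_medians]
    return [[v * f for v, f in zip(row, factors)] for row in matrix]
```

After normalization, every sample column has the same median, which removes gross loading differences before statistical testing.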
FlowPacker: protein side-chain packing with torsional flow matching
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf010
Jin Sub Lee, Philip M Kim
Motivation: Accurate prediction of protein side-chain conformations is necessary to understand protein folding and protein-protein interactions, and to facilitate de novo protein design.
Results: Here, we apply torsional flow matching and equivariant graph attention to develop FlowPacker, a fast and performant model that predicts protein side-chain conformations conditioned on the protein sequence and backbone. FlowPacker outperforms previous state-of-the-art baselines across most metrics with improved runtime. It can also inpaint missing side-chain coordinates, handles multimeric targets, and exhibits strong performance on a test set of antibody-antigen complexes.
Availability and implementation: Code is available at https://gitlab.com/mjslee0921/flowpacker.
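The torsional flow matching objective can be sketched for a single torsion angle: interpolate from a noise angle to the data angle along the shortest arc on the circle, and regress the constant velocity of that path. This is a generic conditional-flow-matching illustration, not FlowPacker's actual training code:

```python
import math

def wrap(angle):
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def torsional_fm_pair(x0, x1, t):
    """One conditional flow-matching training pair for a torsion angle:
    x0 is a noise sample, x1 the data angle, t in [0, 1] the flow time.
    Returns (x_t, v): the interpolated angle on the shortest arc and
    the constant target velocity a network would be trained to predict."""
    d = wrap(x1 - x0)        # shortest signed arc from x0 to x1
    x_t = wrap(x0 + t * d)   # point on the geodesic at time t
    v = d                    # velocity is constant along the arc
    return x_t, v
```

At sampling time, one would integrate the learned velocity field from noise angles to packed side-chain torsions.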
MUSET: set of utilities for constructing abundance unitig matrices from sequencing data
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf054
Riccardo Vicedomini, Francesco Andreace, Yoann Dufresne, Rayan Chikhi, Camila Duitama González
Summary: MUSET is a novel set of utilities designed to efficiently construct abundance unitig matrices from sequencing data. Unitig matrices extend the concept of k-mer matrices by merging overlapping k-mers that unambiguously belong to the same sequence. MUSET addresses the limitations of current software by integrating k-mer counting and unitig extraction to generate unitig matrices containing abundance values, as opposed to the presence-absence values of previous tools. These matrices preserve variation between samples while reducing disk space and row counts compared to k-mer matrices. We evaluated MUSET's performance on datasets derived from a 618-GB collection of ancient oral sequencing samples, producing a filtered unitig matrix that records abundances in under 10 h using 20 GB of memory.
Availability and implementation: MUSET is open source and publicly available under the AGPL-3.0 licence on GitHub at https://github.com/CamilaDuitama/muset. The source code is implemented in C++ and ships with kmat_tools, a collection of tools for processing k-mer matrices. Version v0.5.1 is available on Zenodo with DOI 10.5281/zenodo.14164801.
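The unitig idea (merging k-mers whose overlaps are unambiguous) can be sketched in a few lines. This is a simplified, strand-unaware illustration that ignores cycles; MUSET's C++ implementation additionally tracks per-sample abundances and handles reverse complements:

```python
from collections import defaultdict

def unitigs(kmers):
    """Merge k-mers into unitigs: extend while the next k-mer is the
    unique successor and that successor has a unique predecessor.
    Simplified sketch: strand-unaware, cycles and abundances ignored."""
    kmers = set(kmers)
    succ, pred = defaultdict(list), defaultdict(list)
    for km in kmers:
        for c in "ACGT":
            nxt = km[1:] + c
            if nxt in kmers:
                succ[km].append(nxt)
                pred[nxt].append(km)
    out = []
    for km in kmers:
        # skip k-mers that sit in the middle of a maximal unitig
        if len(pred[km]) == 1 and len(succ[pred[km][0]]) == 1:
            continue
        path = [km]
        while len(succ[path[-1]]) == 1 and len(pred[succ[path[-1]][0]]) == 1:
            path.append(succ[path[-1]][0])
        out.append(path[0] + "".join(p[-1] for p in path[1:]))
    return sorted(out)
```

A linear chain of k-mers collapses into one row of the matrix, while branching k-mers remain separate, which is why unitig matrices have far fewer rows than k-mer matrices.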
Generating multiple alignments on a pangenomic scale
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf104
Jannik Olbrich, Thomas Büchler, Enno Ohlebusch
Motivation: Since novel long-read sequencing technologies allow de novo assembly of many individuals of a species, high-quality assemblies are becoming widely available. For example, the recently published draft human pangenome reference was based on assemblies composed of contigs. There is an urgent need for a software tool that can generate a multiple alignment of genomes of the same species, because current multiple sequence alignment programs cannot handle such a volume of data.
Results: We show that combining a well-known anchor-based method with the technique of prefix-free parsing yields an approach that can generate multiple alignments on a pangenomic scale, provided that large-scale structural variants are rare. Furthermore, experiments with real-world data show that our software tool, PANgenomic Anchor-based Multiple Alignment (PANAMA), significantly outperforms current state-of-the-art programs.
Availability and implementation: Source code is available at: https://gitlab.com/qwerzuiop/panama, archived at swh:1:dir:e90c9f664995acca9063245cabdd97549cf39694.
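The anchor idea can be illustrated by finding k-mers that occur exactly once in every input sequence: such unique shared matches pin the alignment down, and only the gaps between anchors need full alignment. A toy sketch, unrelated to the tool's prefix-free parsing machinery:

```python
def anchors(seqs, k):
    """Return k-mers that occur exactly once in every sequence.
    Unique shared k-mers can serve as alignment anchors; the regions
    between consecutive anchors are then aligned independently."""
    def counts(s):
        c = {}
        for i in range(len(s) - k + 1):
            km = s[i:i + k]
            c[km] = c.get(km, 0) + 1
        return c
    per_seq = [counts(s) for s in seqs]
    shared = set(per_seq[0])
    for c in per_seq[1:]:
        shared &= set(c)
    return sorted(km for km in shared if all(c[km] == 1 for c in per_seq))
```

Restricting expensive alignment to inter-anchor segments is what makes anchor-based approaches scale to whole genomes.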
Embed-Search-Align: DNA sequence alignment using Transformer models
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf041
Pavan Holur, K C Enevoldsen, Shreyas Rajesh, Lajoyce Mboning, Thalia Georgiou, Louis-S Bouchard, Matteo Pellegrini, Vwani Roychowdhury
Motivation: DNA sequence alignment, an important genomic task, involves assigning short DNA reads to the most probable locations on an extensive reference genome. Conventional methods tackle this challenge in two steps: genome indexing followed by an efficient search to locate likely positions for given reads. Building on the success of Large Language Models in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have encoded DNA sequences into vectors using Transformers and have shown promising results on tasks involving classification of short DNA sequences. Performance on sequence classification, however, does not guarantee performance on sequence alignment, where a genome-wide search is needed to align every read successfully, a significantly longer-range task by comparison.
Results: We bridge this gap with an "Embed-Search-Align" (ESA) framework, in which a novel Reference-Free DNA Embedding (RDE) Transformer model embeds reads and reference fragments in a shared vector space; the read-fragment distance is then used as a surrogate for sequence similarity. ESA introduces (i) a contrastive loss for self-supervised training of DNA sequence representations, yielding rich reference-free, sequence-level embeddings, and (ii) a DNA vector store that enables search across fragments at a global scale. RDE is 99% accurate when aligning 250-length reads onto a (single-haploid) human reference genome of 3 gigabases, rivaling conventional algorithmic sequence alignment methods such as Bowtie and BWA-MEM. RDE far exceeds the performance of six recent DNA-Transformer baselines, such as Nucleotide Transformer and HyenaDNA, and shows task transfer across chromosomes and species.
Availability and implementation: Please see https://anonymous.4open.science/r/dna2vec-7E4E/readme.md.
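The search step, finding the reference fragment nearest to a read in embedding space, reduces to a similarity lookup. A dependency-free brute-force sketch using cosine similarity (production vector stores use approximate nearest-neighbour indexes instead):

```python
import math

def best_fragment(read_vec, fragment_vecs):
    """Index of the reference fragment whose embedding has the highest
    cosine similarity to the read embedding (brute-force search)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    sims = [cos(read_vec, f) for f in fragment_vecs]
    return sims.index(max(sims))
```

The returned fragment index narrows the candidate region, after which a conventional local alignment can fix the exact coordinates.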
ENACT: End-to-End Analysis of Visium High Definition (HD) Data
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf094
Mena Kamel, Yiwen Song, Ana Solbas, Sergio Villordo, Amrut Sarangi, Pavel Senin, Mathew Sunaal, Luis Cano Ayestas, Clement Levin, Seqian Wang, Marion Classe, Ziv Bar-Joseph, Albert Pla Planas
Motivation: Spatial transcriptomics (ST) enables the study of gene expression within its spatial context in histopathology samples. To date, a limiting factor has been the resolution of sequencing-based ST products. The introduction of the Visium High Definition (HD) technology opens the door to cell-resolution ST studies. However, challenges remain in accurately mapping transcripts to cells and in assigning cell types based on the transcript data.
Results: We developed ENACT, a self-contained pipeline that integrates advanced cell segmentation with Visium HD transcriptomics data to infer cell types across whole tissue sections. Our pipeline incorporates novel bin-to-cell assignment methods, enhancing the accuracy of single-cell transcript estimates. Validated on diverse synthetic and real datasets, our approach is both scalable to samples with hundreds of thousands of cells and effective, offering a robust solution for spatially resolved transcriptomics analysis.
Availability and implementation: ENACT source code is available at https://github.com/Sanofi-Public/enact-pipeline. Experimental data are available at https://zenodo.org/records/14748859.
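The simplest bin-to-cell assignment is a point-in-mask lookup: each expression bin is attributed to the cell whose segmentation mask contains the bin centre. The toy version below uses rectangular cell boxes for illustration; ENACT's assignment methods operate on real segmentation outputs and are considerably more refined:

```python
def assign_bins(bins, cells):
    """Assign each bin to the first cell whose bounding box contains the
    bin centre; unmatched bins stay unassigned (None).
    bins:  {bin_id: (x, y)} bin-centre coordinates.
    cells: {cell_id: (x0, y0, x1, y1)} axis-aligned cell boxes."""
    out = {}
    for bin_id, (x, y) in bins.items():
        out[bin_id] = None
        for cell_id, (x0, y0, x1, y1) in cells.items():
            if x0 <= x < x1 and y0 <= y < y1:
                out[bin_id] = cell_id
                break
    return out
```

Summing the transcript counts of all bins assigned to a cell then yields that cell's expression profile for downstream cell-type inference.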
MetAssimulo 2.0: a web app for simulating realistic 1D and 2D metabolomic 1H NMR spectra
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf045
Yan Yan, Beatriz Jiménez, Michael T Judge, Toby Athersuch, Maria De Iorio, Timothy M D Ebbels
Motivation: Metabolomics extensively utilizes nuclear magnetic resonance (NMR) spectroscopy due to its excellent reproducibility and high throughput. Both 1D and 2D NMR spectra provide crucial information for metabolite annotation and quantification, yet present complex overlapping patterns that may require sophisticated machine learning algorithms to decipher. Unfortunately, the limited availability of labeled spectra can hamper the application of machine learning, especially deep learning algorithms, which require large amounts of labeled data. In this context, simulation of spectral data becomes a tractable solution for algorithm development.
Results: Here, we introduce MetAssimulo 2.0, a comprehensive upgrade of the MetAssimulo 1.0 metabolomic 1H NMR simulation tool, reimplemented as a Python-based web application. Where MetAssimulo 1.0 only simulated 1D 1H spectra of human urine, MetAssimulo 2.0 extends functionality to urine, blood, and cerebrospinal fluid, enhancing the realism of blood spectra by incorporating a broad protein background. This enhancement enables a closer approximation to real blood spectra, achieving a Pearson correlation of approximately 0.82. The tool now also simulates 2D J-resolved (J-Res) and Correlation Spectroscopy spectra, significantly broadening its utility in complex mixture analysis. MetAssimulo 2.0 simulates both single spectra and groups of spectra with discrete (case-control, e.g. heart transplant versus healthy) or continuous (e.g. body mass index) outcomes, and includes inter-metabolite correlations. It thus supports a range of experimental designs and demonstrates associations between metabolite profiles and biomedical responses. By enhancing NMR spectral simulations, MetAssimulo 2.0 is well positioned to support and enhance research at the intersection of deep learning and metabolomics.
Availability and implementation: The code and a detailed tutorial for MetAssimulo 2.0 are available at https://github.com/yanyan5420/MetAssimulo_2.git. The relevant NMR spectra for metabolites are deposited in MetaboLights under accession number MTBLS12081.
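A 1D NMR spectrum is commonly modelled as a sum of Lorentzian lines, one per resonance. The sketch below shows only that basic building block; it is illustrative and not MetAssimulo's model, which also handles peak multiplets, the protein background, inter-metabolite correlations, and 2D experiments:

```python
def lorentzian_spectrum(ppm_axis, peaks):
    """Sum Lorentzian lines over a chemical-shift axis.
    peaks: list of (position_ppm, height, half_width_ppm) tuples.
    At x == position the line contributes exactly its height."""
    spectrum = []
    for x in ppm_axis:
        y = 0.0
        for pos, h, w in peaks:
            y += h * w ** 2 / ((x - pos) ** 2 + w ** 2)
        spectrum.append(y)
    return spectrum
```

Summing many such lines with metabolite-specific positions and concentration-scaled heights is the core of simulating a realistic mixture spectrum.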
AskBeacon: performing genomic data exchange and analytics with natural language
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf079
Anuradha Wickramarachchi, Shakila Tonni, Sonali Majumdar, Sarvnaz Karimi, Sulev Kõks, Brendan Hosking, Jordi Rambla, Natalie A Twine, Yatish Jain, Denis C Bauer
Motivation: Enabling clinicians and researchers to directly interact with global genomic data resources by removing technological barriers is vital for medical genomics. AskBeacon enables large language models (LLMs) to be applied to securely shared cohorts via the Global Alliance for Genomics and Health Beacon protocol. By simply "asking" Beacon, actionable insights can be gained, analyzed, and made publication-ready.
Results: In the Parkinson's Progression Markers Initiative (PPMI), we use natural language to ask whether the sex differences observed in Parkinson's disease are due to X-linked or autosomal markers. AskBeacon returns a publication-ready visualization showing that, for PPMI, the autosomal marker occurred 1.4 times more often in males with Parkinson's disease than in females, with no difference for the X-linked marker. We evaluate commercial and open-weight LLMs, as well as different architectures, to identify the best strategy for translating research questions into Beacon queries. AskBeacon implements extensive safety guardrails to ensure that genomic data are not exposed to the LLM directly and that generated code for data extraction, analysis, and visualization is sanitized and hallucination-resistant, so data cannot be leaked or falsified.
Availability and implementation: AskBeacon is available at https://github.com/aehrc/AskBeacon.
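One common guardrail pattern is never to forward LLM output verbatim: validate the parameters the model extracted, then build the query payload from a fixed template. The sketch below illustrates that pattern only; the field names are simplified and the filter-id scheme is hypothetical, so consult the GA4GH Beacon v2 specification for real payloads, and note this is not AskBeacon's code:

```python
import re

# Whitelist of characters permitted in extracted query terms.
ALLOWED = re.compile(r"^[A-Za-z0-9_ .:><=-]+$")

def build_beacon_query(gene, assembly="GRCh38"):
    """Build a Beacon-style query dict from a validated gene symbol.
    Rejects any input containing characters outside the whitelist,
    so LLM-extracted text cannot inject arbitrary content."""
    if not ALLOWED.match(gene):
        raise ValueError("rejected potentially unsafe input")
    return {
        "meta": {"apiVersion": "v2.0"},
        "query": {
            "filters": [{"id": f"gene:{gene}"}],
            "requestParameters": {"assemblyId": assembly},
        },
    }
```

Because the payload structure is fixed and only validated scalars are interpolated, the LLM can decide *what* to ask without ever controlling *how* the request is formed.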
{"title":"AskBeacon-performing genomic data exchange and analytics with natural language.","authors":"Anuradha Wickramarachchi, Shakila Tonni, Sonali Majumdar, Sarvnaz Karimi, Sulev Kõks, Brendan Hosking, Jordi Rambla, Natalie A Twine, Yatish Jain, Denis C Bauer","doi":"10.1093/bioinformatics/btaf079","DOIUrl":"10.1093/bioinformatics/btaf079","url":null,"abstract":"<p><strong>Motivation: </strong>Enabling clinicians and researchers to directly interact with global genomic data resources by removing technological barriers is vital for medical genomics. AskBeacon enables large language models (LLMs) to be applied to securely shared cohorts via the Global Alliance for Genomics and Health Beacon protocol. By simply \"asking\" Beacon, actionable insights can be gained, analyzed, and made publication-ready.</p><p><strong>Results: </strong>In the Parkinson's Progression Markers Initiative (PPMI), we use natural language to ask whether the sex differences observed in Parkinson's disease are due to X-linked or autosomal markers. AskBeacon returns a publication-ready visualization showing that in PPMI the autosomal marker occurred 1.4 times more often in males with Parkinson's disease than in females, with no difference for the X-linked marker. We evaluate commercial and open-weight LLMs, as well as different architectures, to identify the best strategy for translating research questions into Beacon queries. AskBeacon implements extensive safety guardrails to ensure that genomic data is not exposed to the LLM directly and that generated code for the data extraction, analysis, and visualization process is sanitized and hallucination-resistant, so data cannot be leaked or falsified.</p><p><strong>Availability and implementation: </strong>AskBeacon is available at https://github.com/aehrc/AskBeacon.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11889448/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143476922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
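The translation step described above — turning a research question into a Beacon query — can be pictured with a small sketch that assembles a GA4GH Beacon v2 request body counting individuals carrying a variant, filtered by sex. This is a minimal illustration assuming Beacon v2 conventions; the `genomicAlleleShortForm` parameter, the NCIT ontology terms for sex, and the function itself are assumptions for illustration, not details taken from the AskBeacon paper.

```python
import json

# Hypothetical NCIT ontology terms commonly used for sex filters
# (an assumption; a real Beacon may expect different filter IDs).
SEX_TERMS = {"male": "NCIT:C20197", "female": "NCIT:C16576"}

def build_beacon_query(variant: str, sex: str) -> dict:
    """Assemble a Beacon v2-style request body that asks for a count
    of individuals carrying `variant`, filtered by `sex`."""
    return {
        "meta": {"apiVersion": "2.0"},
        "query": {
            # Variant notation is illustrative (HGVS-like short form).
            "requestParameters": {"genomicAlleleShortForm": variant},
            "filters": [{"id": SEX_TERMS[sex]}],
            # Ask only for a count, not record-level data.
            "requestedGranularity": "count",
        },
    }

query = build_beacon_query("NC_000001.11:g.100A>T", "male")
print(json.dumps(query, indent=2))
```

In a real deployment the body would be POSTed to a Beacon endpoint (e.g. an `/individuals` route), and the LLM's job is only to fill in the parameters — which is what keeps the genomic data itself away from the model.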
Pub Date: 2025-03-04 | DOI: 10.1093/bioinformatics/btaf060
Rawan Shraim, Brian Mooney, Karina L Conkrite, Amber K Hamilton, Gregg B Morin, Poul H Sorensen, John M Maris, Sharon J Diskin, Ahmet Sacan
Motivation: Cancer remains a leading cause of mortality globally. Recent improvements in survival have been facilitated by the development of targeted and less toxic immunotherapies, such as chimeric antigen receptor (CAR)-T cells and antibody-drug conjugates (ADCs). These therapies, effective in treating both pediatric and adult patients with solid and hematological malignancies, rely on the identification of cancer-specific surface protein targets. While technologies like RNA sequencing and proteomics exist to survey these targets, identifying optimal targets for immunotherapies remains a challenge in the field.
Results: To address this challenge, we developed ImmunoTar, a novel computational tool designed to systematically prioritize candidate immunotherapeutic targets. ImmunoTar integrates user-provided RNA-sequencing or proteomics data with quantitative features from multiple public databases, selected based on predefined criteria, to generate a score representing each gene's suitability as an immunotherapeutic target. We validated ImmunoTar using three distinct cancer datasets, demonstrating its effectiveness in identifying both known and novel targets across various cancer phenotypes. By compiling diverse data into a unified platform, ImmunoTar enables comprehensive evaluation of surface proteins, streamlining target identification and empowering researchers to efficiently allocate resources, thereby accelerating the development of effective cancer immunotherapies.
Availability and implementation: Code and data to run and test ImmunoTar are available at https://github.com/sacanlab/immunotar.
{"title":"ImmunoTar-integrative prioritization of cell surface targets for cancer immunotherapy.","authors":"Rawan Shraim, Brian Mooney, Karina L Conkrite, Amber K Hamilton, Gregg B Morin, Poul H Sorensen, John M Maris, Sharon J Diskin, Ahmet Sacan","doi":"10.1093/bioinformatics/btaf060","DOIUrl":"10.1093/bioinformatics/btaf060","url":null,"abstract":"<p><strong>Motivation: </strong>Cancer remains a leading cause of mortality globally. Recent improvements in survival have been facilitated by the development of targeted and less toxic immunotherapies, such as chimeric antigen receptor (CAR)-T cells and antibody-drug conjugates (ADCs). These therapies, effective in treating both pediatric and adult patients with solid and hematological malignancies, rely on the identification of cancer-specific surface protein targets. While technologies like RNA sequencing and proteomics exist to survey these targets, identifying optimal targets for immunotherapies remains a challenge in the field.</p><p><strong>Results: </strong>To address this challenge, we developed ImmunoTar, a novel computational tool designed to systematically prioritize candidate immunotherapeutic targets. ImmunoTar integrates user-provided RNA-sequencing or proteomics data with quantitative features from multiple public databases, selected based on predefined criteria, to generate a score representing each gene's suitability as an immunotherapeutic target. We validated ImmunoTar using three distinct cancer datasets, demonstrating its effectiveness in identifying both known and novel targets across various cancer phenotypes. By compiling diverse data into a unified platform, ImmunoTar enables comprehensive evaluation of surface proteins, streamlining target identification and empowering researchers to efficiently allocate resources, thereby accelerating the development of effective cancer immunotherapies.</p><p><strong>Availability and implementation: </strong>Code and data to run and test ImmunoTar are available at https://github.com/sacanlab/immunotar.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11904301/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143392728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
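The scoring idea the ImmunoTar abstract describes — combining user expression data with quantitative public-database features into a single per-gene suitability score — can be sketched as a weighted aggregation of normalized features. The feature names and weights below are illustrative assumptions; ImmunoTar's actual feature set, normalization, and scoring function are defined in the paper and repository.

```python
# Hedged sketch of per-gene target scoring: a weighted mean of features,
# each assumed to be pre-normalized to [0, 1]. Higher = more suitable.
# All feature names and weights are hypothetical, for illustration only.

def score_gene(features: dict, weights: dict) -> float:
    """Weighted mean of the weighted features; missing features count as 0."""
    total_weight = sum(weights.values())
    return sum(w * features.get(name, 0.0) for name, w in weights.items()) / total_weight

# Example: a hypothetical surface protein with strong tumor expression,
# solid membrane-localization evidence, and low normal-tissue expression.
features = {
    "tumor_expression": 0.9,      # normalized RNA-seq / proteomics level in tumor
    "surface_evidence": 1.0,      # evidence of cell-surface localization
    "normal_tissue_absence": 0.8, # 1 - normalized expression in healthy tissue
}
weights = {"tumor_expression": 2.0, "surface_evidence": 1.0, "normal_tissue_absence": 2.0}

print(round(score_gene(features, weights), 3))
```

Treating missing features as zero is one simple design choice for genes absent from a public database; a real prioritizer might instead impute or down-weight such genes.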