Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf051
Constance Creux, Farida Zehraoui, François Radvanyi, Fariza Tahi
Motivation: As the biological roles and disease implications of non-coding RNAs continue to emerge, the need to thoroughly characterize previously unexplored non-coding RNAs becomes increasingly urgent. These molecules hold potential as biomarkers and therapeutic targets. However, the vast and complex nature of non-coding RNA data presents a challenge. We introduce MMnc, an interpretable deep-learning approach designed to classify non-coding RNAs into functional groups. MMnc leverages multiple data sources (such as sequence, secondary structure, and expression) using attention-based multi-modal data integration. This ensures the learning of meaningful representations while accounting for missing sources in some samples.
Results: Our findings demonstrate that MMnc achieves high classification accuracy across diverse non-coding RNA classes. The method's modular architecture allows multiple modalities to be considered, whereas other tools consider at most one or two. MMnc is resilient to missing data, ensuring that all available information is effectively utilized. Importantly, the generated attention scores offer interpretable insights into the underlying patterns of the different non-coding RNA classes, potentially driving future non-coding RNA research and applications.
Availability and implementation: Data and source code can be found at EvryRNA.ibisc.univ-evry.fr/EvryRNA/MMnc.
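The attention-based integration with missing sources described above can be illustrated with a minimal sketch. It is not the MMnc implementation: the function name, the fixed attention scores, and the toy embeddings are all hypothetical, and in a trained model the scores would be learned. The sketch only shows the general idea of restricting a softmax to the modalities that are present, so that a sample lacking one source still gets a fused representation.

```python
import math


def masked_attention_fusion(embeddings, scores):
    """Fuse per-modality embeddings with softmax attention over available ones.

    embeddings: dict mapping modality name -> list of floats, or None if missing.
    scores: dict mapping modality name -> raw attention score (hypothetical,
            fixed here; a real model would learn these).
    Returns (fused_vector, attention_weights).
    """
    available = [name for name, emb in embeddings.items() if emb is not None]
    if not available:
        raise ValueError("at least one modality must be present")

    # Numerically stable softmax restricted to the available modalities.
    max_score = max(scores[name] for name in available)
    exp_scores = {name: math.exp(scores[name] - max_score) for name in available}
    total = sum(exp_scores.values())
    weights = {name: exp_scores[name] / total for name in available}

    # Weighted sum of the available embeddings, dimension by dimension.
    dim = len(embeddings[available[0]])
    fused = [
        sum(weights[name] * embeddings[name][i] for name in available)
        for i in range(dim)
    ]
    return fused, weights


# Toy sample in which the expression modality is missing.
sample = {"sequence": [1.0, 0.0], "structure": [0.0, 1.0], "expression": None}
raw_scores = {"sequence": 0.0, "structure": 0.0, "expression": 0.0}
fused_vector, attention = masked_attention_fusion(sample, raw_scores)
```

Because the missing modality is excluded before normalization, the remaining weights still sum to one, and they can be read as the relative contribution of each available source.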
Title: MMnc: multi-modal interpretable representation for non-coding RNA classification and class annotation. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11890286/pdf/
Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf076
Máté Balajti, Rohan Kandhari, Boris Jurič, Mihaela Zavolan, Alexander Kanitz
Summary: The Sequence Read Archive is one of the largest and fastest-growing repositories of sequencing data, containing tens of petabytes of sequenced reads. Its data are used by a wide scientific community, often beyond the primary study that generated them. Such analyses rely on accurate metadata concerning the type of experiment and library, as well as the organism from which the sequenced reads were derived. These metadata are typically entered manually by contributors in an error-prone process and are frequently incomplete. In addition, easy-to-use computational tools that verify the consistency and completeness of library metadata, and thereby facilitate data reuse, are largely unavailable. Here, we introduce HTSinfer, a Python-based tool to infer metadata directly and solely from bulk RNA-sequencing data generated on Illumina platforms. HTSinfer leverages genome sequence information and diagnostic genes to rapidly and accurately infer the library source and library type, as well as the relative read orientation, 3' adapter sequence, and read length statistics. HTSinfer is written in a modular manner, is published under a permissive free and open-source license, and encourages contributions by the community, enabling easy addition of new functionalities, e.g. the inference of additional metrics or support for different experiment types or sequencing platforms.
Availability and implementation: HTSinfer is released under the Apache License 2.0. The latest code is available via GitHub at https://github.com/zavolanlab/htsinfer, while releases are published on Bioconda. A snapshot of the HTSinfer version described in this article was deposited at Zenodo (doi: 10.5281/zenodo.13985958).
Title: HTSinfer: inferring metadata from bulk Illumina RNA-Seq libraries. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11889452/pdf/
Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf026
Rui Liu, Zhengwu Zhang, Hyejung Won, J S Marron
Motivation: Hi-C technology has been developed to profile genome-wide chromosome conformation. So far, Hi-C data have been generated from a large compendium of cell types and tissue types. Among the different chromatin conformation units, chromatin loops were found to play a key role in gene regulation across cell types. While many loop calling algorithms have been developed, most loop callers identify shared loops rather than cell-type-specific loops.
Results: We propose SSSHiC, a new loop calling algorithm based on significance in scale space, which can be used to understand data at different levels of resolution. By applying SSSHiC to neuronal and glial Hi-C data, we detected more loops that are potentially engaged in cell-type-specific gene regulation. Compared with other loop callers, such as Mustache, these loops were more frequently anchored to gene promoters of cellular marker genes and had better APA scores. Therefore, our results suggest that SSSHiC can effectively capture loops that contain more gene regulatory information.
Availability and implementation: The Hi-C data used in this study can be accessed through the PsychENCODE Knowledge Portal at https://www.synapse.org/#!Synapse:syn21760712. The code for Curvature SSS cited in this study is available at https://github.com/jsmarron/MarronMatlabSoftware/blob/master/Matlab9/Matlab9Combined.zip. All custom code used in this research can be found in the GitHub repository https://github.com/jerryliu01998/HiC. The code has also been submitted to Code Ocean (doi: 10.24433/CO.1912913.v1).
Title: Significance in scale space for Hi-C data. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11879645/pdf/
Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf011
Sizhe Liu, Yuchen Liu, Haofeng Xu, Jun Xia, Stan Z Li
Motivation: Drug-target interaction (DTI) prediction is crucial for drug discovery, significantly reducing costs and time in experimental searches across vast drug compound spaces. While deep learning has advanced DTI prediction accuracy, challenges remain: (i) existing methods often lack generalizability, with performance dropping significantly on unseen proteins and cross-domain settings; and (ii) current molecular relational learning often overlooks subpocket-level interactions, which are vital for a detailed understanding of binding sites.
Results: We introduce SP-DTI, a subpocket-informed transformer model designed to address these challenges through: (i) detailed subpocket analysis using the Cavity Identification and Analysis Routine for interaction modeling at both global and local levels, and (ii) integration of pre-trained language models into graph neural networks to encode drugs and proteins, enhancing generalizability to unlabeled data. Benchmark evaluations show that SP-DTI consistently outperforms state-of-the-art models, achieving an area under the receiver operating characteristic curve of 0.873 in unseen protein settings, an 11% improvement over the best baseline.
Availability and implementation: The model scripts are available at https://github.com/Steven51516/SP-DTI.
Title: SP-DTI: subpocket-informed transformer for drug-target interaction prediction. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11886779/pdf/
Motivation: Identifying bacteriophages (phages) within metagenomic sequences is essential for understanding microbial community dynamics. Transformer-based foundation models have been successfully employed to address various biological challenges. However, these models are typically pre-trained with self-supervised tasks that do not consider label variance in the pre-training data. This presents a challenge for phage identification as pre-training on mixed bacterial and phage data may lead to information bias due to the imbalance between bacterial and phage samples.
Results: To overcome this limitation, we proposed a novel conditional BERT framework that incorporates label classes as special tokens during pre-training. Specifically, our conditional BERT model attaches labels directly during tokenization, introducing label constraints into the model's input. Additionally, we introduced a new fine-tuning scheme that enables the conditional BERT to be effectively utilized for classification tasks. This framework allows the BERT model to acquire label-specific contextual representations from mixed sequence data during pre-training and applies the conditional BERT as a classifier during fine-tuning. We named the fine-tuned model PharaCon. We evaluated PharaCon against several existing methods on both simulated sequence datasets and real metagenomic contig datasets. The results demonstrate PharaCon's effectiveness and efficiency in phage identification, highlighting the advantages of incorporating label information during both pre-training and fine-tuning.
Availability and implementation: The source code and associated data can be accessed at https://github.com/Celestial-Bai/PharaCon.
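The idea of attaching a label as a special token during tokenization can be sketched as follows. This is an illustration under stated assumptions, not the PharaCon code: the token strings `[PHAGE]` and `[BACTERIA]`, the placement of the label token right after `[CLS]`, and the use of overlapping 6-mers are all hypothetical choices made for the example.

```python
def kmer_tokens(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers (assumed tokenization)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def conditional_tokenize(sequence, label, k=6):
    """Build a token list in which the class label is itself a special token.

    The label-to-token mapping below is hypothetical and only illustrates
    how a label constraint can be introduced into the model input.
    """
    label_tokens = {"phage": "[PHAGE]", "bacteria": "[BACTERIA]"}
    if label not in label_tokens:
        raise ValueError(f"unknown label: {label!r}")
    return ["[CLS]", label_tokens[label]] + kmer_tokens(sequence, k) + ["[SEP]"]


# The same sequence yields different inputs depending on the attached label.
as_phage = conditional_tokenize("ACGTACGT", "phage")
as_bacteria = conditional_tokenize("ACGTACGT", "bacteria")
```

With inputs built this way, a masked-language-model objective sees the label alongside the sequence tokens, which is one plausible way for representations to become label-specific.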
Title: PharaCon: a new framework for identifying bacteriophages via conditional representation learning. Authors: Zeheng Bai, Yao-Zhong Zhang, Yuxuan Pang, Seiya Imoto. Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf085. Journal: Bioinformatics (Oxford, England).
Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf046
Mateusz Staniak, Ting Huang, Amanda M Figueroa-Navedo, Devon Kohler, Meena Choi, Trent Hinkle, Tracy Kleinheinz, Robert Blake, Christopher M Rose, Yingrong Xu, Pierre M Jean Beltran, Liang Xue, Małgorzata Bogdan, Olga Vitek
Motivation: Bottom-up mass spectrometry-based proteomics studies changes in protein abundance and structure across conditions. Since the currency of these experiments is peptides, i.e. subsets of protein sequences that carry the quantitative information, conclusions at a different level must be computationally inferred. The inference is particularly challenging in situations where the peptides are shared by multiple proteins or post-translational modifications. While many approaches infer the underlying abundances from unique peptides, there is a need to distinguish the quantitative patterns when peptides are shared.
Results: We propose a statistical approach for estimating protein abundances, as well as site occupancies of post-translational modifications, based on quantitative information from shared peptides. The approach treats the quantitative patterns of shared peptides as convex combinations of abundances of individual proteins or modification sites, and estimates the abundance of each source in a sample together with the weights of the combination. In simulation-based evaluations, the proposed approach improved the precision of estimated fold changes between conditions. We further demonstrated the practical utility of the approach in experiments with diverse biological objectives, ranging from protein degradation and thermal proteome stability, to changes in protein post-translational modifications.
Availability and implementation: The approach is implemented in the open-source R package MSstatsWeightedSummary. The package is currently available at https://github.com/Vitek-Lab/MSstatsWeightedSummary (doi: 10.5281/zenodo.14662989). Code required to reproduce the results presented in this article can be found in the repository https://github.com/mstaniak/MWS_reproduction (doi: 10.5281/zenodo.14656053).
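The convex-combination view of a shared peptide can be made concrete with a small two-protein sketch. It is written in Python for illustration and is not the MSstatsWeightedSummary API: the function names and the toy abundance profiles are hypothetical, and the sketch assumes the two protein profiles are already known, whereas the approach described above estimates abundances and weights together.

```python
def mix_profiles(profile_a, profile_b, weight_a):
    """Shared-peptide profile as a convex combination of two protein profiles."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(profile_a, profile_b)]


def estimate_weight(shared, profile_a, profile_b):
    """Least-squares weight of protein A, constrained to [0, 1].

    Minimizes sum((shared - (w * a + (1 - w) * b)) ** 2) over w; for a single
    weight, clipping the unconstrained optimum gives the constrained one.
    """
    numerator = sum((y - b) * (a - b)
                    for y, a, b in zip(shared, profile_a, profile_b))
    denominator = sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))
    if denominator == 0:
        raise ValueError("profiles are identical; the weight is not identifiable")
    return min(1.0, max(0.0, numerator / denominator))


# Toy abundance profiles of two proteins across three samples.
protein_a = [10.0, 12.0, 14.0]
protein_b = [20.0, 18.0, 11.0]
shared_peptide = mix_profiles(protein_a, protein_b, weight_a=0.3)
recovered_weight = estimate_weight(shared_peptide, protein_a, protein_b)
```

The identifiability check reflects a practical point: when the two source profiles move identically across samples, a shared peptide carries no information about how to split its signal between them.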
Title: Relative quantification of proteins and post-translational modifications in proteomic experiments with shared peptides: a weight-based approach. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11879648/pdf/
Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf081
Tornike Onoprishvili, Jui-Hung Yuan, Kamen Petrov, Vijay Ingalalli, Lila Khederlarian, Niklas Leuchtenmuller, Sona Chandra, Aurelien Duarte, Andreas Bender, Yoann Gloaguen
Motivation: Untargeted metabolomics involves a large-scale comparison of the fragmentation pattern of a mass spectrum against a database containing known spectra. Given the number of comparisons involved, this step can be time-consuming.
Results: In this work, we present a GPU-accelerated cosine similarity implementation for tandem mass spectrometry (MS), with an approximately 1000-fold speedup compared to the MatchMS reference implementation, without any loss of accuracy. This improvement enables repository-scale spectral library matching for compound identification without the need for large compute clusters. The benefit extends to any spectral comparison-based method, such as molecular networking approaches and analogue search.
Availability and implementation: All code, results, and supporting notebooks are freely available under the MIT license at https://github.com/pangeAI/simms/.
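For readers unfamiliar with the score being accelerated, the following is a minimal CPU sketch of a greedy cosine similarity between two peak lists. It is neither the SimMS GPU kernel nor the MatchMS code: the function name, the `tolerance` default, and the toy spectra are assumptions made for illustration. Peaks are matched when their m/z values agree within the tolerance, and each peak is used at most once.

```python
import math


def greedy_cosine(spec_a, spec_b, tolerance=0.1):
    """Greedy cosine similarity between two spectra.

    Each spectrum is a list of (m/z, intensity) tuples.
    Returns (score, number_of_matched_peak_pairs).
    """
    # Collect all candidate peak pairs whose m/z values agree within tolerance.
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            if abs(mz_a - mz_b) <= tolerance:
                candidates.append((int_a * int_b, i, j))

    # Assign pairs greedily, largest intensity product first.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    matched_sum, n_matches = 0.0, 0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matched_sum += product
            n_matches += 1

    # Normalize by the Euclidean norms of the two intensity vectors.
    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in spec_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    return matched_sum / (norm_a * norm_b), n_matches


# Toy query and reference spectra sharing one peak near m/z 100.
query = [(100.0, 1.0), (150.0, 0.5)]
reference = [(100.05, 1.0), (200.0, 0.5)]
score, matches = greedy_cosine(query, reference)
```

The double loop over peaks is what makes library-scale matching expensive: every query must be compared against every reference spectrum, which is the workload that lends itself to parallel execution.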
Title: SimMS: a GPU-accelerated cosine similarity implementation for tandem mass spectrometry. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11886821/pdf/
Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf072
Jack Kuipers, Mustafa Anıl Tuncel, Pedro F Ferreira, Katharina Jahn, Niko Beerenwinkel
Motivation: Copy number alterations are driving forces of tumour development and the emergence of intra-tumour heterogeneity. A comprehensive picture of these genomic aberrations is therefore essential for the development of personalised and precise cancer diagnostics and therapies. Single-cell sequencing offers the highest resolution for copy number profiling down to the level of individual cells. Recent high-throughput protocols allow for the processing of hundreds of cells through shallow whole-genome DNA sequencing. The resulting low read-depth data poses substantial statistical and computational challenges to the identification of copy number alterations.
Results: We developed SCICoNE, a statistical model and MCMC algorithm tailored to single-cell copy number profiling from shallow whole-genome DNA sequencing data. SCICoNE reconstructs the history of copy number events in the tumour and uses these evolutionary relationships to identify the copy number profiles of the individual cells. We show the accuracy of this approach in evaluations on simulated data and demonstrate its practicability in applications to two breast cancer samples from different sequencing protocols.
Availability and implementation: SCICoNE is available at https://github.com/cbg-ethz/SCICoNE.
Title: Single-cell copy number calling and event history reconstruction. Journal: Bioinformatics (Oxford, England). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11897432/pdf/
Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf093
Anish K Simhal, Corey Weistuch, Kevin Murgas, Daniel Grange, Jiening Zhu, Jung Hun Oh, Rena Elkin, Joseph O Deasy
Motivation: Although recent advances in sequencing technologies have improved the resolution of genomic and proteomic data to better characterize molecular phenotypes, efficient computational tools to analyze and interpret large-scale omics data are still needed.
Results: To address this, we have developed a network-based bioinformatic tool called Ollivier-Ricci curvature for omics (ORCO). ORCO incorporates omics data and a network describing biological relationships between the genes or proteins and computes Ollivier-Ricci curvature (ORC) values for individual interactions. ORC is an edge-based measure that assesses network robustness. It captures functional cooperation in gene signaling using a consistent information-passing measure, which can help investigators identify therapeutic targets and key regulatory modules in biological systems. ORC has yielded novel insights in multiple cancer types using genomic data and in neurodevelopmental disorders using brain imaging data. This tool is applicable to any data that can be represented as a network.
Availability and implementation: ORCO is an open-source Python package and is publicly available on GitHub at https://github.com/aksimhal/ORC-Omics.
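As a rough illustration of the edge-based measure described above, the sketch below computes Ollivier-Ricci curvature on a small unweighted graph using the standard definition, kappa(x, y) = 1 - W1(m_x, m_y) / d(x, y), where m_x is a probability measure around node x and W1 is the Wasserstein-1 distance. This is not the ORCO package's code or API. It assumes hop-count distances and uniform mass on each node's neighbors, whereas ORCO derives its measures from omics data. The function names (`hop_distances`, `wasserstein_uniform`, `ollivier_ricci`) are hypothetical, and the brute-force transport step is only practical for toy graphs.

```python
from collections import deque
from fractions import Fraction
from itertools import permutations
from math import gcd


def hop_distances(graph, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def wasserstein_uniform(graph, xs, ys):
    """Exact W1 between the uniform measures on node lists xs and ys.

    Each node is copied so both sides hold n equal atoms of mass 1/n; the
    optimal transport plan is then a permutation, found here by brute force.
    """
    n = len(xs) * len(ys) // gcd(len(xs), len(ys))  # least common multiple
    src = [x for x in xs for _ in range(n // len(xs))]
    dst = [y for y in ys for _ in range(n // len(ys))]
    dist = {x: hop_distances(graph, x) for x in set(src)}
    best = min(
        sum(dist[s][t] for s, t in zip(src, perm)) for perm in permutations(dst)
    )
    return Fraction(best, n)


def ollivier_ricci(graph, x, y):
    """Curvature of edge (x, y), assuming uniform mass on each node's neighbors."""
    w1 = wasserstein_uniform(graph, list(graph[x]), list(graph[y]))
    return 1 - w1 / hop_distances(graph, x)[y]


# Toy graphs (illustrative only): a 4-clique and two stars joined by a bridge.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
bridge = {
    "a": {"a1", "a2", "b"}, "b": {"b1", "b2", "a"},
    "a1": {"a"}, "a2": {"a"}, "b1": {"b"}, "b2": {"b"},
}

print(ollivier_ricci(k4, 0, 1))          # 2/3
print(ollivier_ricci(bridge, "a", "b"))  # -2/3
```

The densely connected clique edge has positive curvature and the bridge edge has negative curvature, which matches the reading of ORC as a robustness measure: removing the bridge disconnects the graph, while the clique offers many alternative routes.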
{"title":"ORCO: Ollivier-Ricci Curvature-Omics-an unsupervised method for analyzing robustness in biological systems.","authors":"Anish K Simhal, Corey Weistuch, Kevin Murgas, Daniel Grange, Jiening Zhu, Jung Hun Oh, Rena Elkin, Joseph O Deasy","doi":"10.1093/bioinformatics/btaf093","DOIUrl":"10.1093/bioinformatics/btaf093","url":null,"PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11893153/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143560368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-04 DOI: 10.1093/bioinformatics/btaf088
Xue Zhang, Quan Zou, Mengting Niu, Chunyu Wang
Motivation: Circular RNAs (circRNAs) have been identified as key players in the progression of several diseases; however, their roles have not yet been determined because of the high cost of biological studies. This highlights the urgent need for efficient computational models that can predict circRNA-disease associations, offering an alternative that overcomes the limitations of expensive experimental studies. Although multi-view learning methods have been widely adopted, most approaches fail to fully exploit the latent information across views and overlook the fact that different views contribute with varying degrees of significance.
Results: This study presents MSMCDA, a method that combines multi-view shared units and multichannel attention mechanisms to predict circRNA-disease associations. MSMCDA first constructs similarity and meta-path networks for circRNAs and diseases by introducing shared units to facilitate interactive learning across distinct network features. Multichannel attention mechanisms are then used to optimize the weights within similarity networks. Finally, contrastive learning strengthens the similarity features. Experiments on five public datasets demonstrated that MSMCDA significantly outperformed the baseline methods. Additionally, case studies on colorectal cancer, gastric cancer, and non-small cell lung cancer confirmed the effectiveness of MSMCDA in uncovering new associations.
Availability and implementation: The source code and data are available at https://github.com/zhangxue2115/MSMCDA.git.
{"title":"Predicting circRNA-disease associations with shared units and multi-channel attention mechanisms.","authors":"Xue Zhang, Quan Zou, Mengting Niu, Chunyu Wang","doi":"10.1093/bioinformatics/btaf088","DOIUrl":"10.1093/bioinformatics/btaf088","url":null,"PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11919450/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143568377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}