Bioinformatics (Oxford, England): latest articles, page 4

MMnc: multi-modal interpretable representation for non-coding RNA classification and class annotation.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf051

Constance Creux, Farida Zehraoui, François Radvanyi, Fariza Tahi

Motivation: As the biological roles and disease implications of non-coding RNAs continue to emerge, the need to thoroughly characterize previously unexplored non-coding RNAs becomes increasingly urgent. These molecules hold potential as biomarkers and therapeutic targets. However, the vast and complex nature of non-coding RNA data presents a challenge. We introduce MMnc, an interpretable deep-learning approach designed to classify non-coding RNAs into functional groups. MMnc leverages multiple data sources, such as sequence, secondary structure, and expression, using attention-based multi-modal data integration. This ensures the learning of meaningful representations while accounting for missing sources in some samples.

Results: Our findings demonstrate that MMnc achieves high classification accuracy across diverse non-coding RNA classes. The method's modular architecture allows for the consideration of multiple types of modalities, whereas other tools only consider one or two at most. MMnc is resilient to missing data, ensuring that all available information is effectively utilized. Importantly, the generated attention scores offer interpretable insights into the underlying patterns of the different non-coding RNA classes, potentially driving future non-coding RNA research and applications.

Availability and implementation: Data and source code can be found at EvryRNA.ibisc.univ-evry.fr/EvryRNA/MMnc.
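
One way to picture attention-based integration that tolerates missing sources is a softmax taken only over the modalities present in a sample. The following is a minimal sketch under stated assumptions, not MMnc's implementation: the function name `fuse_modalities`, the fixed attention logits, and the two-dimensional embeddings are all hypothetical, and in a trained model the logits would be learned rather than supplied by hand.

```python
from math import exp


def fuse_modalities(embeddings, logits):
    """Attention-weighted average over the modalities that are present.

    embeddings: dict mapping modality name -> list of floats, or None if missing.
    logits: dict mapping modality name -> unnormalised attention score
            (hypothetical fixed values; a real model would learn them).
    """
    present = [name for name, vec in embeddings.items() if vec is not None]
    if not present:
        raise ValueError("at least one modality must be available")

    # Softmax restricted to the available modalities (missing ones get no weight).
    top = max(logits[name] for name in present)
    raw = {name: exp(logits[name] - top) for name in present}
    total = sum(raw.values())
    weights = {name: value / total for name, value in raw.items()}

    # Weighted sum of the per-modality embeddings, dimension by dimension.
    dim = len(embeddings[present[0]])
    fused = [sum(weights[n] * embeddings[n][d] for n in present) for d in range(dim)]
    return fused, weights


# Hypothetical sample whose secondary-structure modality is missing.
sample = {"sequence": [1.0, 0.0], "structure": None, "expression": [0.0, 1.0]}
scores = {"sequence": 0.0, "structure": 0.0, "expression": 0.0}
fused, weights = fuse_modalities(sample, scores)
print(fused, weights)
```

Because the normalised weights sum to one over the available modalities, they can be read as per-sample importance scores, which is the kind of interpretability the abstract attributes to attention.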


HTSinfer: inferring metadata from bulk Illumina RNA-Seq libraries.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf076

Máté Balajti, Rohan Kandhari, Boris Jurič, Mihaela Zavolan, Alexander Kanitz

Summary: The Sequence Read Archive is one of the largest and fastest-growing repositories of sequencing data, containing tens of petabytes of sequenced reads. Its data are used by a wide scientific community, often beyond the primary study that generated them. Such analyses rely on accurate metadata concerning the type of experiment and library, as well as the organism from which the sequenced reads were derived. These metadata are typically entered manually by contributors in an error-prone process and are frequently incomplete. In addition, easy-to-use computational tools that verify the consistency and completeness of the metadata describing the libraries, and thereby facilitate data reuse, are largely unavailable. Here, we introduce HTSinfer, a Python-based tool to infer metadata directly and solely from bulk RNA-sequencing data generated on Illumina platforms. HTSinfer leverages genome sequence information and diagnostic genes to rapidly and accurately infer the library source and library type, as well as the relative read orientation, 3' adapter sequence and read length statistics. HTSinfer is written in a modular manner, is published under a permissive free and open-source license, and encourages contributions by the community, enabling easy addition of new functionalities, e.g. for the inference of additional metrics or the support of different experiment types or sequencing platforms.

Availability and implementation: HTSinfer is released under the Apache License 2.0. The latest code is available via GitHub at https://github.com/zavolanlab/htsinfer, while releases are published on Bioconda. A snapshot of the HTSinfer version described in this article was deposited at Zenodo (doi: 10.5281/zenodo.13985958).
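
Of the inferred properties listed above, read length statistics are the simplest to illustrate. The sketch below is a standard-library illustration, not HTSinfer's code or command-line interface: the function names `read_lengths` and `length_stats` are hypothetical, and it assumes uncompressed FASTQ input with exactly four lines per record.

```python
import io
from statistics import mean, median, mode


def read_lengths(handle):
    """Yield the length of each read in an uncompressed four-line FASTQ stream."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # the sequence is the second line of each record
            yield len(line.strip())


def length_stats(lengths):
    """Summarise read lengths as min, max, mean, median and mode."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no reads found")
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "mode": mode(lengths),
    }


# Three toy reads of lengths 4, 6 and 4 (hypothetical data).
fastq = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGTAC\n+\nIIIIII\n"
    "@r3\nTTGA\n+\nIIII\n"
)
stats = length_stats(read_lengths(fastq))
print(stats)
```

Inferring the organism, library type, read orientation or adapter sequence requires reference data and alignment and is beyond what a snippet of this size can show.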


Significance in scale space for Hi-C data.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf026

Rui Liu, Zhengwu Zhang, Hyejung Won, J S Marron

Motivation: Hi-C technology has been developed to profile genome-wide chromosome conformation. So far, Hi-C data have been generated from a large compendium of different cell types and tissue types. Among the different chromatin conformation units, chromatin loops were found to play a key role in gene regulation across cell types. While many loop calling algorithms have been developed, most loop callers identify loops shared across cell types rather than cell-type-specific loops.

Results: We propose SSSHiC, a new loop calling algorithm based on significance in scale space, which can be used to understand data at different levels of resolution. By applying SSSHiC to neuronal and glial Hi-C data, we detected more loops that are potentially engaged in cell-type-specific gene regulation. Compared with other loop callers, such as Mustache, these loops were more frequently anchored to gene promoters of cellular marker genes and had better APA scores. Therefore, our results suggest that SSSHiC can effectively capture loops that contain more gene regulatory information.

Availability and implementation: The Hi-C data used in this study can be accessed through the PsychENCODE Knowledge Portal at https://www.synapse.org/#!Synapse:syn21760712. The code for Curvature SSS cited in this study is available at https://github.com/jsmarron/MarronMatlabSoftware/blob/master/Matlab9/Matlab9Combined.zip. All custom code used in this research can be found in the GitHub repository https://github.com/jerryliu01998/HiC. The code has also been submitted to Code Ocean (doi: 10.24433/CO.1912913.v1).


SP-DTI: subpocket-informed transformer for drug-target interaction prediction.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf011

Sizhe Liu, Yuchen Liu, Haofeng Xu, Jun Xia, Stan Z Li

Motivation: Drug-target interaction (DTI) prediction is crucial for drug discovery, significantly reducing costs and time in experimental searches across vast drug compound spaces. While deep learning has advanced DTI prediction accuracy, challenges remain: (i) existing methods often lack generalizability, with performance dropping significantly on unseen proteins and cross-domain settings; and (ii) current molecular relational learning often overlooks subpocket-level interactions, which are vital for a detailed understanding of binding sites.

Results: We introduce SP-DTI, a subpocket-informed transformer model designed to address these challenges through: (i) detailed subpocket analysis using the Cavity Identification and Analysis Routine (CAVIAR) for interaction modeling at both global and local levels, and (ii) integration of pre-trained language models into graph neural networks to encode drugs and proteins, enhancing generalizability to unlabeled data. Benchmark evaluations show that SP-DTI consistently outperforms state-of-the-art models, achieving an area under the receiver operating characteristic curve of 0.873 in unseen protein settings, an 11% improvement over the best baseline.

Availability and implementation: The model scripts are available at https://github.com/Steven51516/SP-DTI.

Contact and supplementary information: For correspondence, contact xiajun@westlake.edu.cn. Supplementary data are available at Bioinformatics online.
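
The headline metric for this entry, the area under the receiver operating characteristic curve, equals the probability that a randomly chosen interacting pair is scored above a randomly chosen non-interacting pair. The sketch below computes it directly from that definition. It is an illustration only: the function name `roc_auc` and the toy labels and scores are hypothetical and do not come from the SP-DTI benchmarks.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth interaction labels.
    scores: iterable of predicted interaction scores (higher = more likely).
    Ties between a positive and a negative count as half a correct ranking.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute ROC-AUC")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(auc)
```

The pairwise loop is quadratic in the number of samples. It is fine for illustration, but rank-based implementations are preferable on benchmark-sized data.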


PharaCon: a new framework for identifying bacteriophages via conditional representation learning.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf085

Zeheng Bai, Yao-Zhong Zhang, Yuxuan Pang, Seiya Imoto

Motivation: Identifying bacteriophages (phages) within metagenomic sequences is essential for understanding microbial community dynamics. Transformer-based foundation models have been successfully employed to address various biological challenges. However, these models are typically pre-trained with self-supervised tasks that do not consider label variance in the pre-training data. This presents a challenge for phage identification as pre-training on mixed bacterial and phage data may lead to information bias due to the imbalance between bacterial and phage samples.

Results: To overcome this limitation, we proposed a novel conditional BERT framework that incorporates label classes as special tokens during pre-training. Specifically, our conditional BERT model attaches labels directly during tokenization, introducing label constraints into the model's input. Additionally, we introduced a new fine-tuning scheme that enables the conditional BERT to be effectively utilized for classification tasks. This framework allows the BERT model to acquire label-specific contextual representations from mixed sequence data during pre-training and applies the conditional BERT as a classifier during fine-tuning. We named the fine-tuned model PharaCon. We evaluated PharaCon against several existing methods on both simulated sequence datasets and real metagenomic contig datasets. The results demonstrate PharaCon's effectiveness and efficiency in phage identification, highlighting the advantages of incorporating label information during both pre-training and fine-tuning.

Availability and implementation: The source code and associated data can be accessed at https://github.com/Celestial-Bai/PharaCon.
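
The idea of attaching a class label as a special token during tokenization can be sketched as follows. This is a hypothetical illustration rather than PharaCon's tokenizer: the overlapping k-mer scheme, the value `k=6`, the token spellings `[PHAGE]` and `[BACTERIA]`, and the choice to place the label token directly after `[CLS]` are all assumptions made for the example.

```python
def tokenize_with_label(sequence, label, k=6):
    """Split a DNA sequence into overlapping k-mers and prepend a label token.

    sequence: DNA string, e.g. "ACGTACGT".
    label: class name such as "phage" or "bacteria" (hypothetical spellings).
    k: k-mer size (assumed; the source does not state the tokenization scheme).
    """
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    label_token = "[" + label.upper() + "]"
    # The label token conditions every later position on the class.
    return ["[CLS]", label_token] + kmers + ["[SEP]"]


tokens = tokenize_with_label("ACGTACGT", "phage")
print(tokens)
```

With the label in the input, a masked-language-model objective can learn class-specific context from mixed bacterial and phage data. How the label slot is handled at fine-tuning time, when the class is unknown, is the part covered by the authors' new fine-tuning scheme and is not reproduced here.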


Relative quantification of proteins and post-translational modifications in proteomic experiments with shared peptides: a weight-based approach.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf046

Mateusz Staniak, Ting Huang, Amanda M Figueroa-Navedo, Devon Kohler, Meena Choi, Trent Hinkle, Tracy Kleinheinz, Robert Blake, Christopher M Rose, Yingrong Xu, Pierre M Jean Beltran, Liang Xue, Małgorzata Bogdan, Olga Vitek

Motivation: Bottom-up mass spectrometry-based proteomics studies changes in protein abundance and structure across conditions. Since peptides, i.e. subsets of protein sequences that carry the quantitative information, are the currency of these experiments, conclusions at a different level must be computationally inferred. The inference is particularly challenging in situations where the peptides are shared by multiple proteins or post-translational modifications. While many approaches infer the underlying abundances from unique peptides, there is a need to distinguish the quantitative patterns when peptides are shared.

Results: We propose a statistical approach for estimating protein abundances, as well as site occupancies of post-translational modifications, based on quantitative information from shared peptides. The approach treats the quantitative patterns of shared peptides as convex combinations of abundances of individual proteins or modification sites, and estimates the abundance of each source in a sample together with the weights of the combination. In simulation-based evaluations, the proposed approach improved the precision of estimated fold changes between conditions. We further demonstrated the practical utility of the approach in experiments with diverse biological objectives, ranging from protein degradation and thermal proteome stability, to changes in protein post-translational modifications.

Availability and implementation: The approach is implemented in an open-source R package MSstatsWeightedSummary. The package is currently available at https://github.com/Vitek-Lab/MSstatsWeightedSummary (doi: 10.5281/zenodo.14662989). Code required to reproduce the results presented in this article can be found in a repository https://github.com/mstaniak/MWS_reproduction (doi: 10.5281/zenodo.14656053).
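
The convex-combination idea can be made concrete for a peptide shared by two proteins. The sketch below is a simplified illustration, not the MSstatsWeightedSummary algorithm. It assumes the two protein abundance profiles are already known and estimates only the mixing weight by constrained least squares, whereas the published approach estimates abundances and weights together. The function name `convex_weight` and all numbers are hypothetical.

```python
def convex_weight(shared, protein_a, protein_b):
    """Least-squares weight w in [0, 1] such that
    shared ~= w * protein_a + (1 - w) * protein_b across samples.

    All three arguments are equal-length lists of abundances, one per sample.
    """
    diff = [a - b for a, b in zip(protein_a, protein_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        # Identical profiles: the weight cannot be identified from the data.
        raise ValueError("protein profiles are identical; weight is not identifiable")
    numer = sum((y - b) * d for y, b, d in zip(shared, protein_b, diff))
    # Project the unconstrained solution onto [0, 1] to keep the combination convex.
    return min(1.0, max(0.0, numer / denom))


# Hypothetical profiles over four samples; the shared peptide is 70% protein A.
protein_a = [10.0, 12.0, 20.0, 22.0]
protein_b = [15.0, 15.0, 15.0, 15.0]
shared = [0.7 * a + 0.3 * b for a, b in zip(protein_a, protein_b)]
w = convex_weight(shared, protein_a, protein_b)
print(round(w, 3))
```

Clipping to the interval [0, 1] is what keeps the fit a convex combination. With more than two source proteins the same idea becomes a least-squares problem on the probability simplex.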


SimMS: a GPU-accelerated cosine similarity implementation for tandem mass spectrometry.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf081

Tornike Onoprishvili, Jui-Hung Yuan, Kamen Petrov, Vijay Ingalalli, Lila Khederlarian, Niklas Leuchtenmuller, Sona Chandra, Aurelien Duarte, Andreas Bender, Yoann Gloaguen

Motivation: Untargeted metabolomics involves a large-scale comparison of the fragmentation pattern of a mass spectrum against a database containing known spectra. Given the number of comparisons involved, this step can be time-consuming.

Results: In this work, we present a GPU-accelerated cosine similarity implementation for Tandem Mass Spectrometry (MS), with an approximately 1000-fold speedup compared to the MatchMS reference implementation, without any loss of accuracy. This improvement enables repository-scale spectral library matching for compound identification without the need for large compute clusters. This impact extends to any spectral comparison-based methods such as molecular networking approaches and analogue search.

Availability and implementation: All code, results, and notebooks supporting this work are freely available under the MIT license at https://github.com/pangeAI/simms/.
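
For readers unfamiliar with the score being accelerated, the sketch below shows a plain CPU version of a greedy cosine similarity between two peak lists. It is written from the general definition of the score and is neither the SimMS GPU kernel nor the MatchMS API: the function name `greedy_cosine`, the 0.1 m/z tolerance, and the toy spectra are assumptions made for illustration.

```python
from math import sqrt


def greedy_cosine(spec_a, spec_b, tolerance=0.1):
    """Greedy cosine score between two spectra given as lists of (mz, intensity).

    Returns (score, number_of_matched_peak_pairs).
    """
    # Collect every candidate peak pair whose m/z values agree within tolerance.
    candidates = [
        (int_a * int_b, i, j)
        for i, (mz_a, int_a) in enumerate(spec_a)
        for j, (mz_b, int_b) in enumerate(spec_b)
        if abs(mz_a - mz_b) <= tolerance
    ]
    # Accept pairs from the highest intensity product down, each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    dot = 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            dot += product

    norm_a = sqrt(sum(intensity ** 2 for _, intensity in spec_a))
    norm_b = sqrt(sum(intensity ** 2 for _, intensity in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    return dot / (norm_a * norm_b), len(used_a)


# Toy spectra: only the peaks near m/z 100 fall within the tolerance.
query = [(100.0, 1.0), (150.0, 0.5)]
reference = [(100.05, 1.0), (300.0, 0.5)]
score, matches = greedy_cosine(query, reference)
print(round(score, 3), matches)
```

Each query-reference comparison is independent of the others, which is what makes library-scale matching a natural fit for massively parallel hardware.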


Single-cell copy number calling and event history reconstruction.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf072

Jack Kuipers, Mustafa Anıl Tuncel, Pedro F Ferreira, Katharina Jahn, Niko Beerenwinkel

Motivation: Copy number alterations are driving forces of tumour development and the emergence of intra-tumour heterogeneity. A comprehensive picture of these genomic aberrations is therefore essential for the development of personalised and precise cancer diagnostics and therapies. Single-cell sequencing offers the highest resolution for copy number profiling down to the level of individual cells. Recent high-throughput protocols allow for the processing of hundreds of cells through shallow whole-genome DNA sequencing. The resulting low read-depth data poses substantial statistical and computational challenges to the identification of copy number alterations.

Results: We developed SCICoNE, a statistical model and MCMC algorithm tailored to single-cell copy number profiling from shallow whole-genome DNA sequencing data. SCICoNE reconstructs the history of copy number events in the tumour and uses these evolutionary relationships to identify the copy number profiles of the individual cells. We show the accuracy of this approach in evaluations on simulated data and demonstrate its practicability in applications to two breast cancer samples from different sequencing protocols.

Availability and implementation: SCICoNE is available at https://github.com/cbg-ethz/SCICoNE.


Citations: 0

ORCO: Ollivier-Ricci Curvature-Omics-an unsupervised method for analyzing robustness in biological systems.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf093

Anish K Simhal, Corey Weistuch, Kevin Murgas, Daniel Grange, Jiening Zhu, Jung Hun Oh, Rena Elkin, Joseph O Deasy

Motivation: Although recent advanced sequencing technologies have improved the resolution of genomic and proteomic data to better characterize molecular phenotypes, efficient computational tools to analyze and interpret large-scale omic data are still needed.

Results: To address this, we have developed a network-based bioinformatic tool called Ollivier-Ricci curvature for omics (ORCO). ORCO takes omics data together with a network describing biological relationships between genes or proteins, and computes Ollivier-Ricci curvature (ORC) values for individual interactions. ORC is an edge-based measure that assesses network robustness. It captures functional cooperation in gene signaling using a consistent information-passing measure, which can help investigators identify therapeutic targets and key regulatory modules in biological systems. ORC has yielded novel insights in multiple cancer types using genomic data and in neurodevelopmental disorders using brain imaging data. The tool is applicable to any data that can be represented as a network.

Availability and implementation: ORCO is an open-source Python package and is publicly available on GitHub at https://github.com/aksimhal/ORC-Omics.
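
To make the edge-level curvature described in the Results concrete, here is a minimal sketch. It is not ORCO's implementation or API. It assumes an unweighted, undirected, connected graph, a uniform probability measure over each node's neighbours, and hop-count distances, whereas ORCO works with omics-informed networks. The function names (`hop_distances`, `wasserstein1`, `ollivier_ricci`) and the toy graph are hypothetical. The curvature of an edge (x, y) is taken as 1 - W1(mu_x, mu_y) / d(x, y), where W1 is the Wasserstein-1 (earth mover's) distance between the two neighbourhood measures.

```python
from collections import deque
from fractions import Fraction


def hop_distances(adj, source):
    """Breadth-first search: hop-count distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wasserstein1(mu, nu, cost):
    """Exact Wasserstein-1 distance between finite distributions mu and nu.

    Solved as a min-cost flow with successive shortest paths (Bellman-Ford).
    Both mu and nu are assumed to sum to 1.
    """
    src, dst = list(mu), list(nu)
    n = len(src) + len(dst) + 2
    s, t = n - 2, n - 1
    # Each edge is [target, residual capacity, cost, index of reverse edge].
    graph = [[] for _ in range(n)]

    def add_edge(u, v, cap, c):
        graph[u].append([v, cap, c, len(graph[v])])
        graph[v].append([u, Fraction(0), -c, len(graph[u]) - 1])

    for i, a in enumerate(src):
        add_edge(s, i, mu[a], 0)
        for j, b in enumerate(dst):
            add_edge(i, len(src) + j, Fraction(1), cost[a][b])
    for j, b in enumerate(dst):
        add_edge(len(src) + j, t, nu[b], 0)

    total = Fraction(0)
    while True:
        dist, prev = [None] * n, [None] * n
        dist[s] = 0
        changed = True
        while changed:  # Bellman-Ford relaxation on the residual graph
            changed = False
            for u in range(n):
                if dist[u] is None:
                    continue
                for k, (v, cap, c, _) in enumerate(graph[u]):
                    if cap > 0 and (dist[v] is None or dist[u] + c < dist[v]):
                        dist[v], prev[v] = dist[u] + c, (u, k)
                        changed = True
        if dist[t] is None:  # no augmenting path left: all mass is transported
            return total
        push, v = None, t
        while v != s:  # bottleneck capacity along the cheapest path
            u, k = prev[v]
            push = graph[u][k][1] if push is None else min(push, graph[u][k][1])
            v = u
        v = t
        while v != s:  # update residual capacities along that path
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total += push * dist[t]


def ollivier_ricci(adj, x, y):
    """Curvature of edge (x, y): 1 - W1(mu_x, mu_y) / d(x, y)."""
    mu = {v: Fraction(1, len(adj[x])) for v in adj[x]}  # uniform over neighbours of x
    nu = {v: Fraction(1, len(adj[y])) for v in adj[y]}  # uniform over neighbours of y
    cost = {a: hop_distances(adj, a) for a in mu}
    d_xy = hop_distances(adj, x)[y]
    return 1 - wasserstein1(mu, nu, cost) / d_xy


# Toy graph (hypothetical): two triangles joined by the bridge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(ollivier_ricci(adj, 0, 1))  # 1/2
print(ollivier_ricci(adj, 2, 3))  # -2/3
```

In this toy graph the edge inside a triangle has positive curvature, while the bridge between the two triangles has negative curvature. This is the usual pattern in which densely connected regions score high and bottleneck edges score low. Exact fractions keep the small example easy to check by hand; a practical implementation would more likely use a numerical optimal-transport or linear-programming solver.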


Citations: 0

Predicting circRNA-disease associations with shared units and multi-channel attention mechanisms.

Bioinformatics (Oxford, England)

Pub Date : 2025-03-04 DOI: 10.1093/bioinformatics/btaf088

Xue Zhang, Quan Zou, Mengting Niu, Chunyu Wang

Motivation: Circular RNAs (circRNAs) have been identified as key players in the progression of several diseases; however, their roles have not yet been determined because of the high cost of biological studies. This highlights the urgent need for efficient computational models that can predict circRNA-disease associations, offering an alternative to expensive experimental studies. Although multi-view learning methods have been widely adopted, most approaches fail to fully exploit the latent information across views and overlook the fact that different views differ in how much they contribute.

Results: This study presents MSMCDA, a method that combines multi-view shared units and multi-channel attention mechanisms to predict circRNA-disease associations. MSMCDA first constructs similarity and meta-path networks for circRNAs and diseases and introduces shared units to facilitate interactive learning across distinct network features. Multi-channel attention mechanisms are then used to optimize the weights within the similarity networks. Finally, contrastive learning strengthens the similarity features. Experiments on five public datasets demonstrated that MSMCDA significantly outperformed the baseline methods. Additionally, case studies on colorectal cancer, gastric cancer, and non-small cell lung cancer confirmed the effectiveness of MSMCDA in uncovering new associations.

Availability and implementation: The source code and data are available at https://github.com/zhangxue2115/MSMCDA.git.
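
The idea that different views should contribute with different weights can be illustrated with a generic channel-attention fusion. This is only a sketch under stated assumptions, not the MSMCDA architecture: each channel is a same-shaped similarity matrix, each channel has one scalar score (learned in a real model, fixed by hand here), and the fused matrix is the softmax-weighted sum. The names `softmax` and `fuse_channels` and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(channels, channel_scores):
    """Fuse same-shaped similarity matrices with softmax attention weights.

    channels: list of matrices (lists of rows), one per view.
    channel_scores: one scalar per channel; assumed to come from a trained
    scoring layer in a real model, supplied by hand in this sketch.
    """
    weights = softmax(channel_scores)
    rows, cols = len(channels[0]), len(channels[0][0])
    fused = [
        [sum(w * ch[i][j] for w, ch in zip(weights, channels)) for j in range(cols)]
        for i in range(rows)
    ]
    return weights, fused


# Two hypothetical similarity views of the same pair of circRNAs.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.5], [0.5, 1.0]]
weights, fused = fuse_channels([view_a, view_b], [0.0, 0.0])
print(weights)  # [0.5, 0.5]
print(fused)    # [[1.0, 0.25], [0.25, 1.0]]
```

With equal scores the fusion reduces to a plain average; raising one channel's score shifts the fused similarities toward that view, which is the behaviour an attention mechanism is meant to learn from data.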


Citations: 0