Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf040
Guy Durant, Fergus Boyles, Kristian Birchall, Brian Marsden, Charlotte M Deane
Motivation: Machine learning-based scoring functions (MLBSFs) have been found to exhibit inconsistent performance on different benchmarks and be prone to learning dataset bias. For the field to develop MLBSFs that learn a generalizable understanding of physics, a more rigorous understanding of how they perform is required.
Results: In this work, we compared the performance of a diverse set of popular MLBSFs (RFScore, SIGN, OnionNet-2, Pafnucy, and PointVS) with that of our proposed baseline models, which can only learn dataset biases, on a range of benchmarks. We found that these baseline models were competitive in accuracy with the MLBSFs in almost all proposed benchmarks, indicating that the MLBSFs only learn dataset biases. Our tests and provided platform, ToolBoxSF, will enable researchers to robustly interrogate MLBSF performance and determine the effect of dataset biases on their predictions.
Availability and implementation: https://github.com/guydurant/toolboxsf.
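The idea of a baseline that can only learn dataset bias can be illustrated with a minimal sketch. The code below is not the ToolBoxSF implementation or any of the paper's baselines; it assumes a hypothetical toy dataset of (ligand heavy-atom count, measured affinity) pairs and fits a one-feature linear model that never sees the protein or the binding pose. If such a model correlates well with held-out affinities, a high score on that benchmark does not by itself show that a scoring function has learned protein-ligand physics.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical toy data (an assumption for illustration only):
# (ligand heavy-atom count, measured binding affinity as pK)
train = [(10, 4.1), (15, 5.0), (20, 5.8), (25, 6.9), (30, 7.7)]
test = [(12, 4.6), (18, 5.3), (24, 6.8), (28, 7.2)]

# The baseline is fitted on ligand size alone; no protein or pose information.
slope, intercept = fit_line([n for n, _ in train], [pk for _, pk in train])
predictions = [slope * n + intercept for n, _ in test]
baseline_r = pearson(predictions, [pk for _, pk in test])

print(f"ligand-only baseline Pearson r = {baseline_r:.3f}")
```

Comparing a full scoring function against a baseline of this kind on the same split is one way to estimate how much of a reported correlation could be explained by ligand-side bias alone.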
Title: Robustly interrogating machine learning-based scoring functions: what are they learning? Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11821266/pdf/
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf025
Thomas-Otavio Peulen, Katherina Hemmen, Annemarie Greife, Benjamin M Webb, Suren Felekyan, Andrej Sali, Claus A M Seidel, Hugo Sanabria, Katrin G Heinze
Summary: We introduce software for reading, writing and processing fluorescence single-molecule and image spectroscopy data and developing analysis pipelines to unify various spectroscopic analysis tools. Our software can be used for processing multiple experiment types, e.g. for time-resolved single-molecule spectroscopy, laser scanning microscopy, fluorescence correlation spectroscopy and image correlation spectroscopy. The software is file format agnostic and processes multiple time-resolved data formats and outputs. Our software eliminates the need for data conversion and mitigates data archiving issues.
Availability and implementation: tttrlib is available via pip (https://pypi.org/project/tttrlib/) and bioconda while the open-source code is available via GitHub (https://github.com/fluorescence-tools/tttrlib). Presented examples and additional documentation demonstrating how to implement in vitro and live-cell image spectroscopy analysis are available at https://docs.peulen.xyz/tttrlib and https://zenodo.org/records/14002224.
Title: tttrlib: modular software for integrating fluorescence spectroscopy, imaging, and molecular modeling. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11796090/pdf/
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf047
Dee Velazquez, Jean Fan
Motivation: Displaying proportional data across many spatially resolved coordinates is a challenging but important data visualization task, particularly for spatially resolved transcriptomics data. Scatter pie plots are one type of commonly used data visualization for such data but present perceptual challenges that may lead to difficulties in interpretation. Increasing the visual saliency of such data visualizations can help viewers more accurately identify proportional trends and compare proportional differences across spatial locations.
Results: We developed scatterbar, an open-source R package that extends ggplot2, to visualize proportional data across many spatially resolved coordinates using scatter stacked bar plots. We apply scatterbar to visualize deconvolved cell-type proportions from a spatial transcriptomics dataset of the adult mouse brain to demonstrate how scatter stacked bar plots can enhance the distinguishability of proportional distributions compared to scatter pie plots.
Availability and implementation: scatterbar is available on CRAN https://cran.r-project.org/package=scatterbar with additional documentation and tutorials at https://jef.works/scatterbar/.
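The geometry behind a scatter stacked bar plot can be sketched in a few lines. The code below does not use or reproduce the scatterbar API (which is an R package); it is a Python illustration that assumes each bar is centred on its spatial coordinate and that the hypothetical function name, argument names, and toy proportions are ours. For every spot it turns the category proportions into stacked rectangles that any plotting library could then draw.

```python
def scatter_stacked_bars(points, width=1.0, height=1.0):
    """Convert proportional data at spatial coordinates into stacked rectangles.

    points: list of (x, y, {category: proportion}) tuples.
    Returns a list of (category, xmin, xmax, ymin, ymax) tuples, one per
    category per spot, stacked bottom-to-top in dictionary order.
    """
    rects = []
    for x, y, proportions in points:
        total = sum(proportions.values())
        if total <= 0:
            continue  # nothing to draw for an empty spot
        bottom = y - height / 2
        for category, value in proportions.items():
            # Normalise so that each bar always fills the full height.
            segment = height * value / total
            rects.append(
                (category, x - width / 2, x + width / 2, bottom, bottom + segment)
            )
            bottom += segment
    return rects


# Hypothetical deconvolved cell-type proportions at two spots.
spots = [
    (0.0, 0.0, {"A": 0.25, "B": 0.75}),
    (1.0, 0.0, {"A": 0.50, "B": 0.50}),
]
rects = scatter_stacked_bars(spots)
for rect in rects:
    print(rect)
```

Because every segment shares a common baseline direction and a fixed bar width, proportions are compared as lengths rather than angles, which is the perceptual argument the abstract makes against scatter pie plots.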
Title: scatterbar: an R package for visualizing proportional data across spatially resolved coordinates. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11829801/pdf/
Motivation: The classification task based on whole-slide images (WSIs) is a classic problem in computational pathology. Multiple instance learning (MIL) provides a robust framework for analyzing whole slide images with slide-level labels at gigapixel resolution. However, existing MIL models typically focus on modeling the relationships between instances while neglecting the variability across the channel dimensions of instances, which prevents the model from fully capturing critical information in the channel dimension.
Results: To address this issue, we propose a plug-and-play module called Multi-scale Channel Attention Block (MCAB), which models the interdependencies between channels by leveraging local features with different receptive fields. By alternately stacking four layers of Transformer and MCAB, we designed a channel attention-based MIL model (CAMIL) capable of simultaneously modeling both inter-instance relationships and intra-channel dependencies. To verify the performance of the proposed CAMIL in classification tasks, several comprehensive experiments were conducted across three datasets: Camelyon16, TCGA-NSCLC, and TCGA-RCC. Empirical results demonstrate that, whether the feature extractor is pretrained on natural images or on WSIs, our CAMIL surpasses current state-of-the-art MIL models across multiple evaluation metrics.
Availability and implementation: All implementation code is available at https://github.com/maojy0914/CAMIL.
Title: CAMIL: channel attention-based multiple instance learning for whole slide image classification. Authors: Jinyang Mao, Junlin Xu, Xianfang Tang, Yongjin Liu, Heaven Zhao, Geng Tian, Jialiang Yang. Pub Date: 2025-02-04. DOI: 10.1093/bioinformatics/btaf024. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11802473/pdf/
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf030
Dmitry Pinchuk, H M A Mohit Chowdhury, Abhishek Pandeya, Oluwatosin Oluwadare
Motivation: The exploration of the 3D organization of DNA within the nucleus in relation to various stages of cellular development has led to experiments generating spatiotemporal Hi-C data. However, there is limited spatiotemporal Hi-C data for many organisms, impeding the study of 3D genome dynamics. To overcome this limitation and advance our understanding of genome organization, it is crucial to develop methods for forecasting Hi-C data at future time points from existing timeseries Hi-C data.
Result: In this work, we designed a novel framework named HiCForecast, adopting a dynamic voxel flow algorithm to forecast future spatiotemporal Hi-C data. We evaluated how well our method generalizes forecasting data across different species and systems, ensuring performance in homogeneous, heterogeneous, and general contexts. Using both computational and biological evaluation metrics, our results show that HiCForecast outperforms the current state-of-the-art algorithm, emerging as an efficient and powerful tool for forecasting future spatiotemporal Hi-C datasets.
Availability and implementation: HiCForecast is publicly available at https://github.com/OluwadareLab/HiCForecast.
Title: HiCForecast: dynamic network optical flow estimation algorithm for spatiotemporal Hi-C data forecasting. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11793695/pdf/
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf031
Nicholas Matsumoto, Hyunjun Choi, Jay Moran, Miguel E Hernandez, Mythreye Venkatesan, Xi Li, Jui-Hsuan Chang, Paul Wang, Jason H Moore
Motivation: LLMs like GPT-4, despite their advancements, often produce hallucinations and struggle with integrating external knowledge effectively. While Retrieval-Augmented Generation (RAG) attempts to address this by incorporating external information, it faces significant challenges such as context length limitations and imprecise vector similarity search. ESCARGOT aims to overcome these issues by combining LLMs with a dynamic Graph of Thoughts and biomedical knowledge graphs, improving output reliability, and reducing hallucinations.
Result: ESCARGOT significantly outperforms industry-standard RAG methods, particularly in open-ended questions that demand high precision. ESCARGOT also offers greater transparency in its reasoning process, allowing for the vetting of both code and knowledge requests, in contrast to the black-box nature of LLM-only or RAG-based approaches.
Availability and implementation: ESCARGOT is available as a pip package and on GitHub at: https://github.com/EpistasisLab/ESCARGOT.
Title: ESCARGOT: an AI agent leveraging large language models, dynamic graph of thoughts, and biomedical knowledge graphs for enhanced reasoning. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11796095/pdf/
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf038
Tiantian Liu, Xiangnan Xu, Tao Wang, Peirong Xu
Motivation: Numerous microbiome studies have revealed significant associations between the microbiome and human health and disease. These findings have motivated researchers to explore the causal role of the microbiome in human complex traits and diseases. However, the complexities of microbiome data pose challenges for statistical analysis and interpretation of causal effects.
Results: We introduced a novel statistical framework, CRAmed, for inferring the mediating role of the microbiome between treatment and outcome. CRAmed improved the interpretability of the mediation analysis by decomposing the natural indirect effect into two parts, corresponding to the presence-absence and abundance of a microbe, respectively. Comprehensive simulations demonstrated the superior performance of CRAmed in recall, precision, and F1 score, with a notable level of robustness, compared to existing mediation analysis methods. Furthermore, two real data applications illustrated the effectiveness and interpretability of CRAmed. Our research revealed that CRAmed holds promise for uncovering the mediating role of the microbiome and understanding the factors influencing host health.
Availability and implementation: The R package CRAmed implementing the proposed methods is available online at https://github.com/liudoubletian/CRAmed.
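The simulation metrics named above (recall, precision, and F1 score) have standard definitions for a mediator-selection task, sketched below. This is not code from the CRAmed package; the function name and the taxon labels are hypothetical, and the sketch assumes that a method's output is simply the set of microbes it flags as mediators, compared against the set of true mediators in a simulated dataset.

```python
def selection_metrics(true_mediators, selected):
    """Precision, recall and F1 for a set of selected mediators.

    true_mediators: iterable of identifiers that truly mediate the effect.
    selected: iterable of identifiers flagged by a mediation method.
    Returns (precision, recall, f1); each is 0.0 when its denominator is zero.
    """
    true_set, selected_set = set(true_mediators), set(selected)
    true_positives = len(true_set & selected_set)
    precision = true_positives / len(selected_set) if selected_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical simulation: four true mediators, three taxa flagged by a method.
truth = ["taxon_1", "taxon_2", "taxon_3", "taxon_4"]
flagged = ["taxon_1", "taxon_2", "taxon_9"]
precision, recall, f1 = selection_metrics(truth, flagged)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Here two of the three flagged taxa are true mediators (precision 2/3) and two of the four true mediators are recovered (recall 1/2), giving an F1 score of 4/7.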
Title: CRAmed: a conditional randomization test for high-dimensional mediation analysis in sparse microbiome data. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11821267/pdf/
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf029
Nick Laurenz Kaiser, Martin H Groschup, Balal Sadeghi
Summary: Virus surveillance programmes are designed to counter the growing threat of viral outbreaks to human health. Nanopore sequencing, in particular, has proven to be suitable for this purpose, as it is readily available and provides rapid results. However, as special bioinformatic programs are required to extract the relevant information from the sequencing data, applications are needed that allow users without extensive bioinformatics knowledge to carry out the relevant analysis steps. We present VirDetector, a bioinformatic pipeline for virus surveillance using nanopore sequencing. The pipeline automatically installs all required programs and databases and allows all its steps to be executed with a single console command. After preprocessing the samples, including the possibility for basecalling, the pipeline classifies each sample taxonomically and reconstructs the viral consensus genomes, which are then used in phylogenetic analyses. This streamlined workflow provides a user-friendly and efficient solution for monitoring viral pathogens.
Availability and implementation: VirDetector is freely available at https://github.com/NLKaiser/VirDetector and https://zenodo.org/records/14637302 (10.5281/zenodo.14637302).
Title: VirDetector: a bioinformatic pipeline for virus surveillance using nanopore sequencing. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11802467/pdf/
Motivation: The increasing accessibility of large-scale protein sequences through advanced sequencing technologies has necessitated the development of efficient and accurate methods for predicting protein function. Computational prediction models have emerged as a promising solution to expedite the annotation process. However, despite making significant progress in protein research, graph neural networks face challenges in capturing long-range structural correlations and identifying critical residues in protein graphs. Furthermore, existing models have limitations in effectively predicting the function of newly sequenced proteins that are not included in protein interaction networks. This highlights the need for novel approaches integrating protein structure and sequence data.
Results: We introduce the Multi-scalE Graph Adaptive neural network (MEGA-GO), which captures features of proteins with diverse sequence lengths at multiple scales. The graph adaptive neural network architecture of MEGA-GO enables a more nuanced extraction of graph structure features, effectively capturing intricate relationships within biological data. Experimental results demonstrate that MEGA-GO outperforms mainstream protein function prediction models in the accuracy of Gene Ontology term classification, yielding areas under the precision-recall curve of 33.4%, 68.9%, and 44.6% on the biological process, molecular function, and cellular component domains, respectively. The remaining experimental results show that our model consistently surpasses the state-of-the-art methods.
Availability and implementation: The source code and data of MEGA-GO are available at https://github.com/Cheliosoops/MEGA-GO.
Title: MEGA-GO: functions prediction of diverse protein sequence length using Multi-scalE Graph Adaptive neural network. Authors: Yujian Lee, Peng Gao, Yongqi Xu, Ziyang Wang, Shuaicheng Li, Jiaxing Chen. Pub Date: 2025-02-04. DOI: 10.1093/bioinformatics/btaf032. Journal: Bioinformatics (Oxford, England). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11810639/pdf/
Motivation: Clustering patients into subgroups based on their microbial compositions can greatly enhance our understanding of the role of microbes in human health and disease etiology. Distance-based clustering methods, such as partitioning around medoids (PAM), are popular due to their computational efficiency and absence of distributional assumptions. However, the performance of these methods can be suboptimal when true cluster memberships are driven by differences in the abundance of only a few microbes, a situation known as the sparse signal scenario.
Results: We demonstrate that classical multidimensional scaling (MDS), a widely used dimensionality reduction technique, effectively denoises microbiome data and enhances the clustering performance of distance-based methods. We propose a two-step procedure that first applies MDS to project high-dimensional microbiome data into a low-dimensional space and then performs distance-based clustering on the low-dimensional coordinates. Extensive simulations show that, under the sparse signal scenario, this procedure outperforms distance-based clustering applied directly to the original data. Several real data applications further illustrate this advantage.
Availability and implementation: The R package MDSMClust is available at https://github.com/wxy929/MDS-project.
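The two-step procedure described in the Results can be sketched roughly as follows. This is a minimal illustration in Python, not the MDSMClust R package or its API: the function names (`classical_mds`, `k_medoids`, `mds_then_cluster`) are hypothetical, the toy data are Gaussian rather than real microbiome abundances, Euclidean distance is an assumed stand-in for whatever ecological distance a real analysis would use, and the clustering step is a simple alternating k-medoids with farthest-first initialisation rather than the full PAM swap search.

```python
import numpy as np

def pairwise_euclidean(X):
    """n x n Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS of an n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)      # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[idx], 0.0, None)    # guard against negative eigenvalues
    return eigvecs[:, idx] * np.sqrt(lam)     # n x n_components coordinates

def k_medoids(D, k, max_iter=100):
    """Alternating k-medoids on a distance matrix (simplified stand-in for PAM)."""
    # Farthest-first initialisation: start from the most central sample.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:
        medoids.append(int(np.argmax(D[:, medoids].min(axis=1))))
    medoids = np.array(medoids)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)     # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return labels, medoids

def mds_then_cluster(D, k, n_components=2):
    """Step 1: embed with classical MDS. Step 2: cluster the embedding."""
    Z = classical_mds(D, n_components)
    return k_medoids(pairwise_euclidean(Z), k)

# Toy sparse-signal example: two groups that differ in only 3 of 40 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 40))
X[15:, :3] += 20.0
labels, medoids = mds_then_cluster(pairwise_euclidean(X), k=2)
```

The number of retained MDS dimensions (`n_components`) is a tuning choice in this sketch; the abstract does not state how it is selected.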
{"title":"Multidimensional scaling improves distance-based clustering for microbiome data.","authors":"Guanhua Chen, Xinyue Wang, Qiang Sun, Zheng-Zheng Tang","doi":"10.1093/bioinformatics/btaf042","DOIUrl":"10.1093/bioinformatics/btaf042","url":null,"PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11814494/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143061508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}