Pub Date: 2024-09-19 DOI: 10.1101/2024.09.14.612549
Vanda A Gaonac'h-Lovejoy, Martin Sauvageau, John S Mattick, Martin A Smith
Accurate prediction of RNA secondary structures is essential for understanding the evolutionary conservation and functional roles of long noncoding RNAs (lncRNAs) across diverse species. In this study, we benchmarked two leading tools for predicting evolutionarily conserved RNA secondary structures (ECSs), SISSIz and R-scape, using two distinct experimental frameworks: one focusing on well-characterized mitochondrial RNA structures and the other on experimentally validated Rfam structures embedded within simulated genome alignments. While both tools performed comparably overall, each displayed subtle preferences in detecting ECSs. To address these limitations, we evaluated two interpretable machine learning approaches that integrate the strengths of both methods. By balancing thermodynamic stability features from RNALalifold and SISSIz with robust covariation metrics from R-scape, a random forest classifier significantly outperformed both conventional tools. This classifier was implemented in ECSfinder, a new tool that provides a robust, interpretable solution for genome-wide identification of conserved RNA structures, offering valuable insights into lncRNA function and evolutionary conservation. ECSfinder is designed for large-scale comparative genomics applications and promises to facilitate the discovery of novel functional RNA elements.
Title: ECSFinder: Optimized prediction of evolutionarily conserved RNA secondary structures from genome sequences. Journal: bioRxiv - Bioinformatics.
Pub Date: 2024-09-19 DOI: 10.1101/2024.09.12.612666
Junjie Tang, Zihao Chen, Kun Qian, Siyuan Huang, Yang He, Shenyi Yin, Xinyu He, Buqing Ye, Yan Zhuang, Hongxue Meng, Jianzhong Xi, Ruibin Xi
Spatial transcriptomics (ST) technologies have revolutionized tissue architecture studies by capturing gene expression with spatial context. However, high-dimensional ST data often have limited spatial resolution and exhibit considerable noise and sparsity, posing significant challenges in deciphering subtle spatial structures and underlying biological activities. Here, we introduce DeepFuseNMF, a multi-modal dimension reduction framework that enhances spatial resolution by integrating ST gene expression with high-resolution histology images. DeepFuseNMF incorporates non-negative matrix factorization into a neural network architecture, enabling the identification of interpretable, high-resolution embeddings. Furthermore, DeepFuseNMF can simultaneously analyze multiple samples and is compatible with various types of histology images. Extensive evaluations on synthetic and real ST datasets from various technologies and tissue types demonstrate that DeepFuseNMF can effectively produce highly interpretable, high-resolution embeddings and detect refined spatial structures. DeepFuseNMF represents a powerful approach for integrating ST data and histology images, offering deeper insights into complex tissue structures and functions.
Title: Interpretable high-resolution dimension reduction of spatial transcriptomics data by DeepFuseNMF. Journal: bioRxiv - Bioinformatics.
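The DeepFuseNMF abstract builds on non-negative matrix factorization (NMF). As a rough illustration of that building block only, the sketch below factors a small non-negative matrix V into W and H using the classic multiplicative update rules. It is not the authors' model: the neural network, the histology integration, and the multi-sample handling are not reproduced, and the toy matrix, the rank, and the function names (`nmf`, `frobenius_error`) are assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialisation keeps every later update non-negative.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

# Toy "spots x genes" matrix with two blocks (hypothetical values).
V = [
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 4.0],
    [0.0, 1.0, 4.0, 3.0],
]
W, H = nmf(V, rank=2)
print(frobenius_error(V, W, H))
```

Because every factor stays non-negative, each row of W can be read as additive loadings on the components in H, which is the usual argument for the interpretability of NMF embeddings.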
Pub Date: 2024-09-19 DOI: 10.1101/2024.08.25.609622
Paul P. Gardner
The development of accurate bioinformatic software tools is crucial for the effective analysis of complex biological data. This study examines the relationship between the academic department affiliations of authors and the accuracy of the bioinformatic tools they develop. By analyzing a corpus of previously benchmarked bioinformatic software tools, we mapped bioinformatic tools to the academic fields of the corresponding authors and evaluated tool accuracy by field. Our results suggest that "Medical Informatics" outperforms all other fields in bioinformatic software accuracy, with a mean proportion of wins in accuracy rankings exceeding the null expectation. In contrast, tools developed by authors affiliated with "Bioinformatics" and "Engineering" fields tend to be less accurate. However, after correcting for multiple testing, no result is statistically significant (p>0.05). Our findings reveal no strong association between academic field and bioinformatic software accuracy. These findings suggest that the development of interdisciplinary software applications can be effectively undertaken by any department with sufficient resources and training.
Title: A Bioinformatician, Computer Scientist, and Geneticist lead bioinformatic tool development - which one is better? Journal: bioRxiv - Bioinformatics.
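The abstract above reports that no field remains significant after correcting for multiple testing, without naming the correction method. The sketch below uses the Benjamini-Hochberg procedure as one common choice (an assumption, not necessarily the method used in the study) to show how several raw p-values below 0.05 can all lose significance once adjusted. The p-values are invented for illustration and are not taken from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values, one per academic field (illustrative only).
raw = [0.012, 0.20, 0.04, 0.65, 0.03]
adjusted = benjamini_hochberg(raw)

for p, q in zip(raw, adjusted):
    print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```

With these toy inputs, three raw p-values fall below 0.05 but every adjusted value exceeds it, which mirrors the pattern the abstract describes.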
Pub Date: 2024-09-19 DOI: 10.1101/2024.06.21.600109
Yuyao Song, Irene Papatheodorou, Alvis Brazma
Computational comparison of single-cell expression profiles across species uncovers functional similarities and differences between cell types. Importantly, it offers the potential to refine evolutionary relationships based on gene expression. Current analysis strategies are limited by their reliance on the ortholog conjecture, a strong hypothesis which implies that orthologs have similar cell type expression patterns. They also lose expression information from non-orthologs, making them inapplicable in practice for large evolutionary distances. To address these limitations, we devised a novel analytical framework, GeneSpectra, to robustly classify genes by their expression specificity and distribution across cell types. This framework allows for the generalization of the ortholog conjecture by evaluating the degree of ortholog class conservation. We utilise different gene classes to decode species effects on cross-species transcriptomics space and compare sequence conservation with expression specificity similarity across different types of orthologs. We develop contextualised cell type similarity measurements while considering species-unique genes and non-one-to-one orthologs. Finally, we consolidate gene classification results into a knowledge graph, GeneSpectraKG, allowing a hierarchical depiction of cell types and orthologous groups, while continuously integrating new data.
Title: GeneSpectra: a method for context-aware comparison of cell type gene expression across species. Journal: bioRxiv - Bioinformatics.
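The GeneSpectra abstract mentions classifying genes by expression specificity across cell types but does not give the rules. As a stand-in, the sketch below computes the Tau index, a widely used specificity score that is 0 for uniform expression and 1 for expression in a single cell type. Tau, the 0.8 cut-off, the class labels, and the toy profiles are all assumptions chosen for illustration and should not be read as GeneSpectra's actual classification scheme.

```python
def tau_specificity(expression):
    """Tau index: 0 for uniform expression, 1 for a single expressing cell type."""
    n = len(expression)
    peak = max(expression)
    if n < 2 or peak <= 0:
        return 0.0  # undefined cases are treated as non-specific
    return sum(1.0 - x / peak for x in expression) / (n - 1)

def classify_gene(expression, threshold=0.8):
    """Label a gene 'specific' or 'broad' using a hypothetical Tau threshold."""
    return "specific" if tau_specificity(expression) >= threshold else "broad"

# Toy mean expression per cell type (hypothetical values).
profiles = {
    "geneA": [10.0, 0.0, 0.0, 0.0],
    "geneB": [5.0, 5.0, 5.0, 5.0],
    "geneC": [8.0, 4.0, 0.0, 0.0],
}
classes = {gene: classify_gene(values) for gene, values in profiles.items()}
print(classes)
```

Comparing such class labels between orthologs in two species is one simple way to ask whether an ortholog pair keeps the same specificity class, in the spirit of the class conservation the abstract describes.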
Pub Date: 2024-09-18 DOI: 10.1101/2024.09.13.612628
Vinnarasi Saravanan, Nessim Raouraoua, Guillaume Brysbaert, Stefano Giordano, Marc F Lensink, Fabrizio Cleri, Ralf Blossey
Uracil-DNA glycosylase (UDG) is the first enzyme in the base-excision repair (BER) pathway, acting on uracil bases in DNA. How UDG finds its targets has not yet been conclusively resolved. Based on available structural and other experimental evidence, two possible pathways are under discussion. In one, the action of UDG on the DNA bases is believed to follow a "pinch-push-pull" model, in which UDG generates the base-flip in an active manner. A second scenario is based on the exploitation of bases flipping out thermally from the DNA. Recent molecular dynamics (MD) studies of DNA in trinucleosome arrays have shown that base-flipping can be readily induced by the action of mechanical forces on DNA alone. This alternative mechanism could enhance the probability of the second scenario of UDG-uracil interaction via the formation of a recognition complex of UDG with a flipped-out base. In this work, we describe DNA structures with flipped-out uracil bases generated by MD simulations, which we then subject to docking simulations with the UDG enzyme. Our results for the UDG-uracil recognition complex support the view that base-flipping induced by DNA mechanics can be a relevant mechanism of uracil base recognition by the UDG glycosylase in chromatin.
Title: The "very moment" when UDG recognizes a flipped-out uracil base in dsDNA. Journal: bioRxiv - Bioinformatics.
Pub Date: 2024-09-18 DOI: 10.1101/2024.09.13.612854
Jean Mainguy, Mäina Vienne, Joanna Fourquet, Vincent Darbot, Céline Noirot, Adrien Castinel, Sylvie Combes, Christine Gaspin, Denis Milan, Cecile Donnadieu, Carole Iampietro, Olivier Bouchez, Géraldine Pascal, Claire Hoede
Background: To study communities of micro-organisms taxonomically and functionally, metagenomic analyses are now often used. If there is no reference gene catalogue, a de novo approach is required. Because genomes are easier to interpret than contigs, the recovery of metagenome-assembled genomes (MAGs) by binning of contigs from metagenomic data has recently become a common task for microbial studies. However, during this process, there is a significant loss of information between the assembly and the binning of contigs. This is why it is important to produce taxonomic and functional matrices for all contigs and not just those included in correct bins. In addition, PacBio HiFi reads (long and of good quality) are now a possible, albeit more expensive, alternative to short Illumina reads. We therefore developed a workflow that is easy to install, with dependencies fixed using Singularity images, and easy to use on a computing cluster, that is capable of analyzing either short or long reads, and that allows analysis at the contig and/or bin level, depending on the user's choice. Here we present metagWGS, a fully automated workflow for metagenomic data analysis. It uses a new tool for refining bins (called Binette) that we will demonstrate is more efficient than competing tools.
Methods: metagWGS is a Nextflow workflow distributed with two Singularity images and complete documentation to facilitate its installation and use. Because the main original features of metagWGS concern binning (short and long reads) and the analysis of HiFi reads, we compared metagWGS with the MAG construction workflow proposed by PacBio on a public dataset used by PacBio to promote its workflow.
Results: metagWGS differs from existing workflows by (i) offering flexible approaches for the assembly; (ii) supporting short reads (Illumina) or PacBio HiFi reads; (iii) combining multiple binning algorithms with a new bin refinement tool, referred to as Binette, to achieve high-quality genome bins; and (iv) providing taxonomic and functional annotation for all genes, all contigs built, and all bins. On 11 public metagenomic samples from human gut data, metagWGS produces more medium-quality (708) and high-quality (255) bins than the PacBio HiFi dedicated workflow, referred to as the HiFi-MAGS-pipeline (659 medium-quality bins and 231 high-quality bins), primarily due to the better performance of Binette.
Title: metagWGS, a comprehensive workflow to analyze metagenomic data using Illumina or PacBio HiFi reads. Journal: bioRxiv - Bioinformatics.
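The metagWGS results are reported as counts of medium- and high-quality bins, but the abstract does not define those tiers. The sketch below assumes the commonly used completeness and contamination cut-offs in the style of the MIMAG guidelines (high: above 90% complete and below 5% contaminated; medium: at least 50% complete and below 10% contaminated) and also works out the relative gains implied by the reported counts. The thresholds and the function names are assumptions for illustration; only the four bin counts come from the abstract.

```python
def bin_quality(completeness, contamination):
    """Assign a quality tier from completeness and contamination percentages.

    Thresholds are assumed MIMAG-style cut-offs, not taken from the abstract.
    """
    if completeness > 90.0 and contamination < 5.0:
        return "high"
    if completeness >= 50.0 and contamination < 10.0:
        return "medium"
    return "low"

def relative_gain(new, old):
    """Percentage increase of `new` over `old`."""
    return 100.0 * (new - old) / old

# Bin counts reported in the abstract (metagWGS vs HiFi-MAGS-pipeline).
medium_gain = relative_gain(708, 659)
high_gain = relative_gain(255, 231)

print(bin_quality(95.2, 1.3))   # a hypothetical bin
print(round(medium_gain, 1), round(high_gain, 1))
```

Under these numbers the gain is roughly 7% for medium-quality bins and roughly 10% for high-quality bins.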
Pub Date: 2024-09-18 DOI: 10.1101/2024.09.17.613203
Shivani Srivastava, Saba Ehsan, Linkon Chowdhury, Muhammad Omar Faruk, Abhishek Singh, Anmol S Kapoor, Sidharth Bhinder, Mohan P Singh, Divya Mishra
The integration of whole-genome sequencing (WGS), whole-exome sequencing (WES), and microbiome analysis has become essential for advancing our understanding of complex biological systems. However, the fragmented nature of current analytical tools often complicates the process, leading to inefficiencies and potential data loss. To address this challenge, we present PANOMIQ, a comprehensive software solution that unifies the analysis of WGS, WES, and microbiome data into a single, streamlined pipeline. PANOMIQ is designed to facilitate the entire analysis process from raw data to interpretable results. It achieves results much more quickly than traditional pipeline approaches to WGS and WES analysis. It incorporates advanced algorithms for high-accuracy variant calling in both WGS and WES, along with robust tools for characterizing microbial communities. The software's modular architecture allows for seamless integration of these diverse data types, enabling researchers to uncover complex interactions between host genomics and microbiomes. In this study, we demonstrate the capabilities of PANOMIQ by applying it to a series of datasets encompassing a wide range of applications, including disease association studies and environmental microbiome profiling. Our results highlight PANOMIQ's ability to deliver comprehensive insights, significantly reducing the time and computational resources required for multi-omic analysis. By providing a unified platform for WGS, WES, and microbiome analysis, PANOMIQ offers a powerful tool for researchers aiming to explore the full spectrum of genomic and microbial diversity. This software not only simplifies the analytical workflow but also enhances the depth of biological interpretation, paving the way for more integrated and holistic studies in genomics and microbiology.
Title: PANOMIQ: A Unified Approach to Whole-Genome, Exome, and Microbiome Data Analysis. Journal: bioRxiv - Bioinformatics.
Pub Date: 2024-09-18 DOI: 10.1101/2024.09.13.612391
Davide Buzzao, Emma Persson, Dimitri Guala, Erik L L Sonnhammer
FunCoup 6 (https://funcoup6.scilifelab.se/, will be https://funcoup.org after publication) represents a significant advancement in global functional association networks, aiming to provide researchers with a comprehensive view of the functional coupling interactome. This update introduces novel methodologies and integrated tools for improved network inference and analysis. Major new developments in FunCoup 6 include vastly expanded coverage of gene regulatory links, a new framework for bin-free Bayesian training, and a new website. FunCoup 6 integrates a new tool for disease and drug target module identification using the TOPAS algorithm. To expand the utility of the resource for biomedical research, it incorporates pathway enrichment analysis using the ANUBIX and EASE algorithms. The unique comparative interactomics analysis in FunCoup provides insights into network conservation, now allowing users to align orthologs only or query each species network independently. Bin-free training was applied to 23 primary species, and networks were additionally generated for all remaining 618 species in InParanoiDB 9. Accompanying these advancements, FunCoup 6 features a redesigned website together with updated API functionalities, and represents a pivotal step forward in functional genomics research, offering unique capabilities for exploring the complex landscape of protein interactions.
Title: FunCoup 6: advancing functional association networks across species with directed links and improved user experience. Journal: bioRxiv - Bioinformatics.
Pub Date : 2024-09-18 DOI: 10.1101/2024.09.12.612711
Ziyuan Wang, Mei-Juan Tu, Ziyang Liu, Katherine K Wang, Yinshan Fang, Ning Hao, Hao Helen Zhang, Jianwen Que, Xiaoxiao Sun, Ai-Ming Yu, HONGXU DING
Nucleotide modifications perturb nanopore sequencing readouts, thereby generating artifacts during the basecalling of sequence backbones. Here, we present an iterative approach to polish modification-disturbed basecalling results. We show that this approach improves the basecalling accuracy of both artificially synthesized and real-world molecules. With demonstrated efficacy and reliability, we use the approach to precisely basecall therapeutic RNAs containing artificial or natural modifications, as the basis for quantifying the purity and integrity of vaccine mRNAs transcribed in vitro, and for determining modification hotspots of novel therapeutic RNA interference (RNAi) molecules bioengineered (BioRNA) in vivo.
{"title":"An Iterative Approach to Polish the Nanopore Sequencing Basecalling for Therapeutic RNA Quality Control","authors":"Ziyuan Wang, Mei-Juan Tu, Ziyang Liu, Katherine K Wang, Yinshan Fang, Ning Hao, Hao Helen Zhang, Jianwen Que, Xiaoxiao Sun, Ai-Ming Yu, HONGXU DING","doi":"10.1101/2024.09.12.612711","DOIUrl":"https://doi.org/10.1101/2024.09.12.612711","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"9 1"},"publicationDate":"2024-09-18","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142250169"}
Cancer poses a significant global health challenge, characterized by complex disease progression and disrupted growth regulation. A thorough understanding of cellular and molecular biological mechanisms is essential for developing novel treatments and improving the accuracy of patient survival predictions. While prior studies have leveraged gene expression and clinical data to forecast survival outcomes through machine learning and deep learning approaches, gene mutation data, despite being a widely recognized metric, has rarely been incorporated because of its limited information, inadequate representation of gene relationships, and data sparsity, which negatively affect the robustness, effectiveness, and interpretability of current survival analysis approaches. To overcome the challenges of mutation data sparsity, we propose RCoxNet, a novel deep learning framework that integrates the Random Walk with Restart (RWR) algorithm with a deep learning Cox Proportional Hazards model. By applying this framework to mutation data from cBioPortal, our model achieved an average concordance index of 0.62 ± 0.05 across four cancer types, outperforming existing deep neural network models. Additionally, we identified clinical features critical for differentiating between predicted high- and low-risk patients, with the relevance of these features being partially supported by previous studies.
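The abstract does not specify how RCoxNet applies Random Walk with Restart, so the following is only a generic sketch of the algorithm itself: a sparse set of mutated genes is used as the restart distribution, and the walk spreads their signal over a gene network, giving every gene a non-zero score. The toy network, the gene labels, the restart probability of 0.5, and the function name are all hypothetical choices for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of every node.

    adjacency: dict mapping each node to a list of its neighbours
               (every neighbour must itself be a key of the dict)
    seeds:     set of nodes the walker restarts from, e.g. mutated genes
    restart:   probability of jumping back to a seed at each step
    """
    if not seeds:
        raise ValueError("at least one seed node is required")
    nodes = sorted(adjacency)
    # Restart distribution: uniform over the seed nodes, zero elsewhere.
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            neighbours = adjacency[u]
            if not neighbours:
                # Isolated node: send its walking mass back to the seeds.
                for v in nodes:
                    nxt[v] += (1.0 - restart) * p[u] * p0[v]
                continue
            # Otherwise split the walking mass evenly among the neighbours.
            share = (1.0 - restart) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


if __name__ == "__main__":
    # Hypothetical three-gene path network with a single mutated gene.
    network = {"geneA": ["geneB"], "geneB": ["geneA", "geneC"], "geneC": ["geneB"]}
    scores = random_walk_with_restart(network, seeds={"geneA"})
    for gene, score in scores.items():
        print(gene, round(score, 4))
```

In this toy case the unmutated genes receive scores that decay with network distance from the mutated one, which is how propagation of this kind can turn a sparse binary mutation profile into a dense feature vector.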
{"title":"RCoxNet: deep learning framework for enhanced cancer survival prediction integrating random walk with restart with mutation and clinical data","authors":"Stuti Kumari, Sakshi Gujral, Smruti Panda, Prashant Gupta, Gaurav Ahuja, Debarka Sengupta","doi":"10.1101/2024.09.17.613428","DOIUrl":"https://doi.org/10.1101/2024.09.17.613428","PeriodicalId":501307,"journal":{"name":"bioRxiv - Bioinformatics","volume":"12 1"},"publicationDate":"2024-09-18","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"142250459"}
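For readers unfamiliar with the concordance index reported in the RCoxNet abstract, the sketch below shows a simplified form of Harrell's C-index: the fraction of comparable patient pairs in which the patient with the earlier observed event was assigned the higher predicted risk. It is a plain illustration with made-up numbers, not the authors' evaluation code. Pairs with tied survival times are ignored here for brevity.

```python
def concordance_index(times, events, risks):
    """Simplified Harrell's C-index for right-censored survival data.

    times:  follow-up time for each patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    risks:  predicted risk score, higher meaning shorter expected survival
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; the index is undefined")
    return concordant / comparable


if __name__ == "__main__":
    # Made-up cohort of four patients; the third one is censored.
    times = [5, 10, 15, 20]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.6, 0.7, 0.1]
    print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported average of 0.62.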