bioRxiv - Bioinformatics: latest articles

ECSFinder: Optimized prediction of evolutionarily conserved RNA secondary structures from genome sequences

bioRxiv - Bioinformatics

Pub Date : 2024-09-19 DOI: 10.1101/2024.09.14.612549

Vanda A Gaonac'h-Lovejoy, Martin Sauvageau, John S Mattick, Martin A Smith

Accurate prediction of RNA secondary structures is essential for understanding the evolutionary conservation and functional roles of long noncoding RNAs (lncRNAs) across diverse species. In this study, we benchmarked two leading tools for predicting evolutionarily conserved RNA secondary structures (ECSs), SISSIz and R-scape, using two distinct experimental frameworks: one focusing on well-characterized mitochondrial RNA structures and the other on experimentally validated Rfam structures embedded within simulated genome alignments. While both tools performed comparably overall, each displayed subtle preferences in detecting ECSs. To address these limitations, we evaluated two interpretable machine learning approaches that integrate the strengths of both methods. By balancing thermodynamic stability features from RNALalifold and SISSIz with robust covariation metrics from R-scape, a random forest classifier significantly outperformed both conventional tools. This classifier was implemented in ECSfinder, a new tool that provides a robust, interpretable solution for genome-wide identification of conserved RNA structures, offering valuable insights into lncRNA function and evolutionary conservation. ECSfinder is designed for large-scale comparative genomics applications and promises to facilitate the discovery of novel functional RNA elements.

Citations: 0

Interpretable high-resolution dimension reduction of spatial transcriptomics data by DeepFuseNMF

bioRxiv - Bioinformatics

Pub Date : 2024-09-19 DOI: 10.1101/2024.09.12.612666

Junjie Tang, Zihao Chen, Kun Qian, Siyuan Huang, Yang He, Shenyi Yin, Xinyu He, Buqing Ye, Yan Zhuang, Hongxue Meng, Jianzhong Xi, Ruibin Xi

Spatial transcriptomics (ST) technologies have revolutionized tissue architecture studies by capturing gene expression with spatial context. However, high-dimensional ST data often have limited spatial resolution and exhibit considerable noise and sparsity, posing significant challenges in deciphering subtle spatial structures and underlying biological activities. Here, we introduce DeepFuseNMF, a multi-modal dimension reduction framework that enhances spatial resolution by integrating ST gene expression with high-resolution histology images. DeepFuseNMF incorporates non-negative matrix factorization into a neural network architecture, enabling the identification of interpretable, high-resolution embeddings. Furthermore, DeepFuseNMF can simultaneously analyze multiple samples and is compatible with various types of histology images. Extensive evaluations on synthetic and real ST datasets from various technologies and tissue types demonstrate that DeepFuseNMF can effectively produce highly interpretable, high-resolution embeddings and detect refined spatial structures. DeepFuseNMF represents a powerful approach for integrating ST data and histology images, offering deeper insights into complex tissue structures and functions.
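
The abstract names non-negative matrix factorization (NMF) as the interpretable core of the method. As a minimal sketch of plain NMF only, the following factorizes a small non-negative spots-by-genes matrix V into W (spot embeddings) and H (gene loadings) with the classic multiplicative updates. It is not the DeepFuseNMF architecture: the neural network, the histology images, the toy matrix, and all function names here are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH (the reconstruction error)."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Plain NMF via multiplicative updates; W and H stay non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so no factor is stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer = matmul(wt, v)
        denom = matmul(matmul(wt, w), h)
        h = [[h[a][j] * numer[a][j] / (denom[a][j] + eps)
              for j in range(m)] for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer = matmul(v, ht)
        denom = matmul(w, matmul(h, ht))
        w = [[w[i][a] * numer[i][a] / (denom[i][a] + eps)
              for a in range(rank)] for i in range(n)]
    return w, h


# Toy data (hypothetical): 4 spots x 4 genes with two expression programs.
toy_v = [[5.0, 4.0, 0.0, 0.0],
         [4.0, 5.0, 0.0, 1.0],
         [0.0, 0.0, 3.0, 4.0],
         [0.0, 1.0, 4.0, 3.0]]
toy_w, toy_h = nmf(toy_v, rank=2)
print("reconstruction error:", frobenius_error(toy_v, toy_w, toy_h))
```

Because every entry of W and H is non-negative, each column of W can be read as an additive program active in particular spots, which is the usual sense in which NMF embeddings are called interpretable.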

Citations: 0

A Bioinformatician, Computer Scientist, and Geneticist lead bioinformatic tool development - which one is better?

bioRxiv - Bioinformatics

Pub Date : 2024-09-19 DOI: 10.1101/2024.08.25.609622

Paul P. Gardner

The development of accurate bioinformatic software tools is crucial for the effective analysis of complex biological data. This study examines the relationship between the academic department affiliations of authors and the accuracy of the bioinformatic tools they develop. By analyzing a corpus of previously benchmarked bioinformatic software tools, we mapped bioinformatic tools to the academic fields of the corresponding authors and evaluated tool accuracy by field. Our results suggest that "Medical Informatics" outperforms all other fields in bioinformatic software accuracy, with a mean proportion of wins in accuracy rankings exceeding the null expectation. In contrast, tools developed by authors affiliated with "Bioinformatics" and "Engineering" fields tend to be less accurate. However, after correcting for multiple testing, no result is statistically significant (p>0.05). Our findings reveal no strong association between academic field and bioinformatic software accuracy. These findings suggest that the development of interdisciplinary software applications can be effectively undertaken by any department with sufficient resources and training.
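
The abstract does not say which multiple-testing correction was applied. As a minimal sketch of why nominally small p-values can stop being significant after correction, the following assumes the Benjamini-Hochberg procedure and uses made-up per-field p-values; neither is taken from the preprint.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, one per academic field (illustrative only).
raw = [0.02, 0.04, 0.03, 0.20, 0.50]
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    print(f"raw p = {p:.2f}  adjusted p = {q:.3f}")
```

In this toy input, three raw p-values fall below 0.05, yet every adjusted value exceeds 0.05. That is the same pattern the abstract reports: apparent differences between fields that do not survive correction.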

Citations: 0

GeneSpectra: a method for context-aware comparison of cell type gene expression across species

bioRxiv - Bioinformatics

Pub Date : 2024-09-19 DOI: 10.1101/2024.06.21.600109

Yuyao Song, Irene Papatheodorou, Alvis Brazma

Cross-species computational comparison of single-cell expression profiles uncovers functional similarities and differences between cell types. Importantly, it offers the potential to refine evolutionary relationships based on gene expression. Current analysis strategies are limited by the strong hypothesis of the ortholog conjecture, which implies that orthologs have similar cell type expression patterns. They also lose expression information from non-orthologs, making them inapplicable in practice for large evolutionary distances. To address these limitations, we devised a novel analytical framework, GeneSpectra, to robustly classify genes by their expression specificity and distribution across cell types. This framework allows for the generalization of the ortholog conjecture by evaluating the degree of ortholog class conservation. We utilise different gene classes to decode species effects on cross-species transcriptomics space and compare sequence conservation with expression specificity similarity across different types of orthologs. We develop contextualised cell type similarity measurements while considering species-unique genes and non-one-to-one orthologs. Finally, we consolidate gene classification results into a knowledge graph, GeneSpectraKG, allowing a hierarchical depiction of cell types and orthologous groups, while continuously integrating new data.
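
The abstract does not spell out how genes are assigned to specificity classes. As a rough sketch of the general idea of classifying a gene by how its expression is distributed across cell types, the following uses the tau specificity index as a stand-in; the index, the 0.8 cutoff, the class labels, and the example profiles are all assumptions and not GeneSpectra's actual scheme.

```python
def tau_specificity(expression):
    """Tau index over cell types: 0 = uniform, 1 = one cell type only.

    `expression` holds non-negative mean expression values,
    one value per cell type.
    """
    n = len(expression)
    if n < 2:
        raise ValueError("need at least two cell types")
    peak = max(expression)
    if peak <= 0:
        # Gene not expressed anywhere: treat as non-specific.
        return 0.0
    return sum(1.0 - x / peak for x in expression) / (n - 1)


def classify_gene(expression, specific_cutoff=0.8):
    """Two-class toy labelling; the cutoff is an arbitrary illustration."""
    if tau_specificity(expression) >= specific_cutoff:
        return "cell-type specific"
    return "broadly expressed"


# Hypothetical ortholog pair measured over the same four cell types.
human_profile = [10.0, 0.0, 0.0, 0.0]
mouse_profile = [5.0, 5.0, 5.0, 5.0]
print(classify_gene(human_profile), "vs", classify_gene(mouse_profile))
```

Comparing the class assigned to each member of an ortholog pair, rather than assuming the two match, is one simple way to picture what evaluating ortholog class conservation could look like.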

Citations: 0

The "very moment" when UDG recognizes a flipped-out uracil base in dsDNA

bioRxiv - Bioinformatics

Pub Date : 2024-09-18 DOI: 10.1101/2024.09.13.612628

Vinnarasi Saravanan, Nessim Raouraoua, Guillaume Brysbaert, Stefano Giordano, Marc F Lensink, Fabrizio Cleri, Ralf Blossey

Uracil-DNA glycosylase (UDG) is the first enzyme in the base-excision repair (BER) pathway, acting on uracil bases in DNA. How UDG finds its targets has not been conclusively resolved yet. Based on available structural and other experimental evidence, two possible pathways are under discussion. In one, the action of UDG on the DNA bases is believed to follow a "pinch-push-pull" model, in which UDG generates the base-flip in an active manner. A second scenario is based on the exploitation of bases flipping out thermally from the DNA. Recent molecular dynamics (MD) studies of DNA in trinucleosome arrays have shown that base-flipping can be readily induced by the action of mechanical forces on DNA alone. This alternative mechanism could possibly enhance the probability for the second scenario of UDG-uracil interaction via the formation of a recognition complex of UDG with a flipped-out base. In this work we describe DNA structures with flipped-out uracil bases generated by MD simulations, which we then subject to docking simulations with the UDG enzyme. Our results for the UDG-uracil recognition complex support the view that base-flipping induced by DNA mechanics can be a relevant mechanism of uracil base recognition by the UDG glycosylase in chromatin.

Citations: 0

metagWGS, a comprehensive workflow to analyze metagenomic data using Illumina or PacBio HiFi reads

bioRxiv - Bioinformatics

Pub Date : 2024-09-18 DOI: 10.1101/2024.09.13.612854

Jean Mainguy, Mäina Vienne, Joanna Fourquet, Vincent Darbot, Céline Noirot, Adrien Castinel, Sylvie Combes, Christine Gaspin, Denis Milan, Cecile Donnadieu, Carole Iampietro, Olivier Bouchez, Géraldine Pascal, Claire Hoede

Background: To study communities of micro-organisms taxonomically and functionally, metagenomic analyses are now often used. If there is no reference gene catalogue, a de novo approach is required. Because genomes are easier to interpret than contigs, the recovery of metagenome-assembled genomes (MAGs) by binning of contigs from metagenomic data has recently become a common task for microbial studies. However, during this process, there is a significant loss of information between the assembly and the binning of contigs. This is why it is important to produce taxonomic and functional matrices for all contigs and not just those included in correct bins. In addition, PacBio HiFi reads (long and of good quality) are now a possible, albeit more expensive, alternative to short Illumina reads. We therefore developed a workflow that is easy to install, with dependencies fixed using Singularity images, and easy to use on a computing cluster; that is capable of analyzing either short or long reads; and that should allow analysis at the contig and/or bin level, depending on the user's choice. Here we present metagWGS, a fully automated workflow for metagenomic data analysis. It uses a new tool for refining bins (called Binette) that we will demonstrate is more efficient than competing tools. Methods: metagWGS is a Nextflow workflow distributed with two Singularity images and complete documentation to facilitate its installation and use. Because the main original features of metagWGS concern binning (short and long reads) and the analysis of HiFi reads, we compared metagWGS with the MAG construction workflow proposed by PacBio on a public dataset used by PacBio to promote its workflow.
Results: metagWGS differs from existing workflows by (i) offering flexible approaches for the assembly; (ii) supporting short reads (Illumina) or PacBio HiFi reads; (iii) combining multiple binning algorithms with a new bin refinement tool, referred to as Binette, to achieve high-quality genome bins; and (iv) providing taxonomic and functional annotation for all genes, all contigs built, and all bins. metagWGS produces more medium (708) and high-quality (255) bins on 11 public metagenomic samples from human gut data than the PacBio HiFi dedicated workflow, referred to as the HiFi-MAGS-pipeline (659 medium-quality bins and 231 high-quality bins), primarily due to the better performance of Binette.
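
The abstract reports bin counts by quality tier without stating the criteria. As a hedged illustration, the sketch below assumes the completeness and contamination thresholds of the MIMAG standard (high quality: above 90% complete and below 5% contaminated; medium quality: at least 50% complete and below 10% contaminated) and ignores MIMAG's additional rRNA and tRNA requirements. Whether the preprint uses exactly these cutoffs is an assumption. The bin counts in the second part are the ones quoted in the abstract.

```python
def classify_bin(completeness, contamination):
    """Assign a quality tier from completeness and contamination (in %).

    Thresholds follow the MIMAG completeness/contamination cutoffs;
    they are assumed here, not taken from the preprint.
    """
    if completeness > 90.0 and contamination < 5.0:
        return "high"
    if completeness >= 50.0 and contamination < 10.0:
        return "medium"
    return "low"


def relative_gain(new, old):
    """Percentage increase of `new` over `old`, rounded to one decimal."""
    return round((new - old) / old * 100.0, 1)


# Bin counts quoted in the abstract: metagWGS vs. the HiFi-MAGS-pipeline.
medium_gain = relative_gain(708, 659)
high_gain = relative_gain(255, 231)
print(f"medium-quality bins: +{medium_gain}%")
print(f"high-quality bins:   +{high_gain}%")
```

On the counts quoted above, this works out to roughly 7% more medium-quality bins and roughly 10% more high-quality bins for metagWGS.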

Citations: 0

PANOMIQ: A Unified Approach to Whole-Genome, Exome, and Microbiome Data Analysis

bioRxiv - Bioinformatics

Pub Date : 2024-09-18 DOI: 10.1101/2024.09.17.613203

Shivani Srivastava, Saba Ehsan, Linkon Chowdhury, Muhammad Omar Faruk, Abhishek Singh, Anmol S Kapoor, Sidharth Bhinder, Mohan P Singh, Divya Mishra

The integration of whole-genome sequencing (WGS), whole-exome sequencing (WES), and microbiome analysis has become essential for advancing our understanding of complex biological systems. However, the fragmented nature of current analytical tools often complicates the process, leading to inefficiencies and potential data loss. To address this challenge, we present PANOMIQ, a comprehensive software solution that unifies the analysis of WGS, WES, and microbiome data into a single, streamlined pipeline. PANOMIQ is designed to facilitate the entire analysis process from raw data to interpretable results. It is reported to produce results much more quickly than traditional pipeline approaches to WGS and WES analysis. It incorporates advanced algorithms for high-accuracy variant calling in both WGS and WES, along with robust tools for characterizing microbial communities. The software's modular architecture allows for seamless integration of these diverse data types, enabling researchers to uncover complex interactions between host genomics and microbiomes. In this study, we demonstrate the capabilities of PANOMIQ by applying it to a series of datasets encompassing a wide range of applications, including disease association studies and environmental microbiome profiling. Our results highlight PANOMIQ's ability to deliver comprehensive insights, significantly reducing the time and computational resources required for multi-omic analysis. By providing a unified platform for WGS, WES, and microbiome analysis, PANOMIQ offers a powerful tool for researchers aiming to explore the full spectrum of genomic and microbial diversity. This software not only simplifies the analytical workflow but also enhances the depth of biological interpretation, paving the way for more integrated and holistic studies in genomics and microbiology.

Citations: 0

FunCoup 6: advancing functional association networks across species with directed links and improved user experience

bioRxiv - Bioinformatics

Pub Date : 2024-09-18 DOI: 10.1101/2024.09.13.612391

Davide Buzzao, Emma Persson, Dimitri Guala, Erik L L Sonnhammer

FunCoup 6 (https://funcoup6.scilifelab.se/, will be https://funcoup.org after publication) represents a significant advancement in global functional association networks, aiming to provide researchers with a comprehensive view of the functional coupling interactome. This update introduces novel methodologies and integrated tools for improved network inference and analysis. Major new developments in FunCoup 6 include vastly expanded coverage of gene regulatory links, a new framework for bin-free Bayesian training, and a new website. FunCoup 6 integrates a new tool for disease and drug target module identification using the TOPAS algorithm. To expand the utility of the resource for biomedical research, it incorporates pathway enrichment analysis using the ANUBIX and EASE algorithms. The unique comparative interactomics analysis in FunCoup provides insights into network conservation, now allowing users to align orthologs only or query each species network independently. Bin-free training was applied to 23 primary species, and networks were additionally generated for all remaining 618 species in InParanoiDB 9. Accompanying these advancements, FunCoup 6 features a redesigned website together with updated API functionalities. It represents a pivotal step forward in functional genomics research, offering unique capabilities for exploring the complex landscape of protein interactions.



Citations: 0

An Iterative Approach to Polish the Nanopore Sequencing Basecalling for Therapeutic RNA Quality Control

bioRxiv - Bioinformatics

Pub Date : 2024-09-18 DOI: 10.1101/2024.09.12.612711

Ziyuan Wang, Mei-Juan Tu, Ziyang Liu, Katherine K Wang, Yinshan Fang, Ning Hao, Hao Helen Zhang, Jianwen Que, Xiaoxiao Sun, Ai-Ming Yu, Hongxu Ding

Nucleotide modifications perturb nanopore sequencing readouts and thereby generate artifacts during the basecalling of sequence backbones. Here, we present an iterative approach to polish modification-disturbed basecalling results. We show that this approach improves the basecalling accuracy of both artificially synthesized and real-world molecules. Given its demonstrated efficacy and reliability, we use the approach to precisely basecall therapeutic RNAs carrying artificial or natural modifications, as the basis for quantifying the purity and integrity of vaccine mRNAs transcribed in vitro, and for determining modification hotspots of novel therapeutic RNA interference (RNAi) molecules bioengineered in vivo (BioRNA).


Citations: 0

RCoxNet: deep learning framework for enhanced cancer survival prediction integrating random walk with restart with mutation and clinical data

bioRxiv - Bioinformatics

Pub Date : 2024-09-18 DOI: 10.1101/2024.09.17.613428

Stuti Kumari, Sakshi Gujral, Smruti Panda, Prashant Gupta, Gaurav Ahuja, Debarka Sengupta

Cancer poses a significant global health challenge, characterized by complex disease progression and disrupted growth regulation. A thorough understanding of cellular and molecular biological mechanisms is essential for developing novel treatments and improving the accuracy of patient survival predictions. Prior studies have leveraged gene expression and clinical data to forecast survival outcomes through current machine learning and deep learning approaches. Gene mutation data, despite being a widely recognized metric, has rarely been incorporated because of its limited information, inadequate representation of gene relationships, and data sparsity, all of which negatively affect the robustness, effectiveness, and interpretability of current survival analysis approaches. To overcome the challenges of mutation data sparsity, we propose RCoxNet, a novel deep learning framework that integrates the Random Walk with Restart (RWR) algorithm with a deep learning Cox Proportional Hazards model. By applying this framework to mutation data from cBioPortal, our model achieved an average concordance index of 0.62 ± 0.05 across four cancer types, outperforming existing deep neural network models. Additionally, we identified clinical features critical for differentiating between predicted high- and low-risk patients, with the relevance of these features being partially supported by previous studies.
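The abstract does not say how RCoxNet configures Random Walk with Restart, so the following is only a minimal sketch of the general RWR technique, showing how a sparse mutation profile might be smoothed over a gene network. The function name `random_walk_with_restart`, the restart probability of 0.3, and the four-gene toy network are all illustrative assumptions, not details from the paper.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-12, max_iter=10_000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence.

    adj   : symmetric adjacency matrix as a list of lists (hypothetical toy input)
    seeds : non-negative seed weights, e.g. 1 for a mutated gene and 0 otherwise
    Assumes every node has at least one neighbour, so W is column-stochastic.
    """
    n = len(adj)
    # Column sums are the node degrees; dividing by them column-normalises adj.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    total = sum(seeds)
    p0 = [s / total for s in seeds]  # restart distribution
    p = p0[:]
    for _ in range(max_iter):
        p_next = []
        for i in range(n):
            # Probability mass flowing into node i from its neighbours.
            flow = sum(adj[i][j] / col_sums[j] * p[j] for j in range(n))
            p_next.append((1 - restart) * flow + restart * p0[i])
        converged = sum(abs(a - b) for a, b in zip(p_next, p)) < tol
        p = p_next
        if converged:
            break
    return p


# Toy network: a path 0 - 1 - 2 - 3, with a mutation observed in gene 0 only.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
mutated = [1, 0, 0, 0]
scores = random_walk_with_restart(adjacency, mutated)
print([round(s, 3) for s in scores])
```

In this toy case the single mutated gene spreads a decaying score to its network neighbours, which turns a binary, sparse vector into a dense one. How such scores would be fed into a Cox model is not specified in the abstract and is not sketched here.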



Citations: 0