Summary: Common approaches for deciphering biological networks involve network embedding algorithms. These approaches strictly focus on clustering the genes' embedding vectors and interpreting such clusters to reveal the hidden information of the networks. However, the difficulty in interpreting the genes' clusters and the limitations of the functional annotations' resources hinder the identification of the currently unknown cell's functioning mechanisms. We propose a new approach that shifts this functional exploration from the embedding vectors of genes in space to the axes of the space itself. Our methodology better disentangles biological information from the embedding space than the classic gene-centric approach. Moreover, it uncovers new data-driven functional interactions that are unregistered in the functional ontologies, but biologically coherent. Furthermore, we exploit these interactions to define new higher-level annotations that we term Axes-Specific Functional Annotations and validate them through literature curation. Finally, we leverage our methodology to discover evolutionary connections between cellular functions and the evolution of species.
Availability and implementation: Data and source code can be accessed at https://gitlab.bsc.es/sdoria/axes-of-biology.git.
The axes of biology: a novel axes-based network embedding paradigm to decipher the functional mechanisms of the cell. Sergio Doria-Belenguer, Alexandros Xenos, Gaia Ceddia, Noël Malod-Dognin, Nataša Pržulj. Bioinformatics Advances, 2024-05-23. DOI: 10.1093/bioadv/vbae075. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11142626/pdf/
Felipe Rojas-Rodríguez, Marjanka K Schmidt, S. Canisius
Most cancer driver gene identification tools have been developed for whole-exome sequencing data. Targeted sequencing is a popular alternative to whole-exome sequencing for large cancer studies due to its greater depth at a lower cost per tumor. Unlike whole-exome sequencing, targeted sequencing only enables mutation calling for a selected subset of genes. Whether existing driver gene identification tools remain valid in that context has not previously been studied. We evaluated the validity of seven popular driver gene identification tools when applied to targeted sequencing data. Based on whole-exome data of 14 different cancer types from TCGA, we constructed matching targeted datasets by keeping only the mutations overlapping with the pan-cancer MSK-IMPACT panel and, in the case of breast cancer, also the breast-cancer-specific B-CAST panel. We then compared the driver gene predictions obtained on whole-exome and targeted mutation data for each of the seven tools. Differences in how the tools model background mutation rates were the most important determinant of their validity on targeted sequencing data. Based on our results, we recommend OncodriveFML, OncodriveCLUSTL, 20/20+, dNdSCv, and ActiveDriver for driver gene identification in targeted sequencing data, whereas MutSigCV and DriverML are best avoided in that context. Supplementary data are available at Bioinformatics Advances online.
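The construction of a matching targeted dataset described above can be illustrated with a minimal sketch. It assumes mutations are records with a chromosome and a position, and that a panel is a list of covered regions. The field names, the helper functions, and the toy coordinates are hypothetical and are not taken from the study, TCGA, MSK-IMPACT, or B-CAST.

```python
# Sketch: derive a "targeted" mutation set from whole-exome calls by keeping
# only mutations that fall inside the regions covered by a panel.
# All names and coordinates below are illustrative assumptions.

def overlaps_panel(mutation, panel_regions):
    """Return True if the mutation position lies inside any panel region."""
    return any(
        chrom == mutation["chrom"] and start <= mutation["pos"] <= end
        for chrom, start, end in panel_regions
    )


def restrict_to_panel(mutations, panel_regions):
    """Keep only the mutations that overlap the panel."""
    return [m for m in mutations if overlaps_panel(m, panel_regions)]


# Toy panel: (chromosome, start, end), inclusive coordinates.
toy_panel = [("chr1", 100, 200), ("chr2", 500, 900)]

# Toy whole-exome calls.
toy_mutations = [
    {"sample": "S1", "chrom": "chr1", "pos": 150, "gene": "GENE_A"},
    {"sample": "S1", "chrom": "chr1", "pos": 950, "gene": "GENE_B"},
    {"sample": "S2", "chrom": "chr2", "pos": 500, "gene": "GENE_C"},
]

targeted = restrict_to_panel(toy_mutations, toy_panel)
print([m["gene"] for m in targeted])
```

A linear scan over regions is enough for a toy panel; a real panel with thousands of regions would likely call for an interval index instead.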
Assessing the validity of driver gene identification tools for targeted genome sequencing data. Bioinformatics Advances, 2024-05-23. DOI: 10.1093/bioadv/vbae073.
Pub Date: 2024-05-22. eCollection Date: 2024-01-01. DOI: 10.1093/bioadv/vbae069
Christos A Ouzounis
Summary: We explore the nuanced temporal and epistemological distinctions among natural sciences, particularly the contrasting treatment of time and the interplay between theory and experimentation. Physics, an exemplar of mature science, relies on theoretical models for predictability and simulations. In contrast, biology, traditionally experimental, is witnessing a computational surge, with data analytics and simulations reshaping its research paradigms. Despite these strides, a unified theoretical framework in biology remains elusive. We propose that contemporary global challenges might usher in a renewed emphasis, presenting an opportunity for the establishment of a novel theoretical underpinning for the life sciences.
Availability and implementation: https://github.com/ouzounis/CLS-emerges (data in JSON format, images in PNG format).
Biology's transformation: from observation through experiment to computation. Bioinformatics Advances, 2024-05-22. DOI: 10.1093/bioadv/vbae069. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11127110/pdf/
DeGeCI is a command line tool that generates fully automated de novo gene predictions from mitochondrial nucleotide sequences by using a reference database of annotated mitogenomes, which is represented as a de Bruijn graph. The input genome is mapped to this graph, creating a subgraph, which is then post-processed by a clustering routine. Version 1.1 of DeGeCI offers a web front-end for GUI-based input. It also introduces a new taxonomic filter pipeline that allows the species in the reference database to be restricted to a user-specified taxonomic classification and allows for gene boundary optimization when the translation table of the input genome is provided. The web platform is accessible at https://degeci.informatik.uni-leipzig.de. Source code is freely available at https://git.informatik.uni-leipzig.de/lfiedler/degeci.
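The general idea of representing annotated references as a de Bruijn graph and mapping an input sequence onto it can be sketched as follows. This is a toy illustration under stated assumptions, not DeGeCI's implementation: the k-mer size, the function names, the toy sequences, and the gene labels are hypothetical, and the clustering post-processing step is omitted.

```python
from collections import defaultdict

# Sketch: annotated reference sequences are decomposed into k-mers. Each k-mer
# is an edge between two (k-1)-mers in a de Bruijn graph and carries the gene
# labels of the references it came from. Mapping an input sequence means
# looking up its k-mers and collecting the labels of the ones that are present.


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(references, k):
    """Build edges between (k-1)-mers and a k-mer -> gene-label index."""
    edges = defaultdict(set)   # (k-1)-mer -> successor (k-1)-mers
    labels = defaultdict(set)  # k-mer -> gene labels seen for that k-mer
    for seq, gene in references:
        for kmer in kmers(seq, k):
            edges[kmer[:-1]].add(kmer[1:])
            labels[kmer].add(gene)
    return edges, labels


def map_input(seq, labels, k):
    """Return (position, gene labels) for each input k-mer found in the graph."""
    return [
        (i, sorted(labels[kmer]))
        for i, kmer in enumerate(kmers(seq, k))
        if kmer in labels
    ]


# Toy annotated references: (sequence, gene label).
toy_references = [("ATGGCCA", "cox1"), ("TTGACGT", "nad1")]
edges, labels = build_graph(toy_references, k=4)

hits = map_input("GGCCATTGAC", labels, k=4)
print(hits)
```

Consecutive hits with the same label hint at a candidate gene region in the input; turning such hits into gene boundaries is the part that a clustering routine would handle.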
DeGeCI 1.1: a web platform for gene annotation of mitochondrial genomes. Lisa Fiedler, Matthias Bernt, Martin Middendorf. Bioinformatics Advances, 2024-05-13. DOI: 10.1093/bioadv/vbae072.
Daejin Hyung, Soo Young Cho, Kyubin Lee, Namhee Yu, Sehwa Hong, Charny Park
Alternative splicing (AS) is a key regulatory mechanism that confers genetic diversity and phenotypic plasticity in humans. Exons and their flanking regions include comprehensive junction-incorporating sequence features, such as splicing factor binding sites and protein domains. These elements are involved in exon usage and ultimately contribute to isoform-specific biological functions. Splicing-associated sequence features are involved in multilayered regulation encompassing DNA and proteins. However, most analysis applications have investigated only a limited set of sequence features, such as protein domains, which is insufficient to explain the comprehensive cause and effect of exon-specific biological processes. With the advent of RNA-seq technology, global AS event analysis has yielded more precise results. As analysis results accumulate, identifying multi-omics sequence features for AS events can be a challenge. An application that investigates multi-omics sequence features is therefore useful for scanning critical evidence. ASpedia-R is an R package for interrogating junction-incorporating sequence features of human genes. Our database collects heterogeneous profiles spanning DNA to protein. Additionally, knowledge-based splicing genes were collected using text mining to test associations with specific pathway terms. Our package retrieves AS events from high-throughput data analysis results via an AS event ID converter. Finally, the result profile can be visualized and saved in multiple formats: sequence feature result table, genome track figure, protein-protein interaction network, and gene set enrichment test result table. Our package is a convenient tool for understanding global regulation mechanisms by splicing. The package source code is freely available to non-commercial users at https://github.com/ncc-bioinfo/ASpedia-R. Supplementary data are available at Bioinformatics Advances online.
ASpedia-R: a package to retrieve junction-incorporating features and knowledge-based functions of human alternative splicing events. Bioinformatics Advances, 2024-05-11. DOI: 10.1093/bioadv/vbae071.
T. Hulshof, Bram Nap, Filippo Martinelli, Ines Thiele
Computational approaches to the functional characterisation of the microbiome, such as the Microbiome Modelling Toolbox, require precise information on microbial composition and relative abundances. However, challenges arise from homosynonyms (different names referring to the same taxon), which can hinder the mapping process and lead to missed species mapping when using microbial metabolic reconstruction resources, such as AGORA and APOLLO. We introduce the integrated MARS pipeline, a user-friendly Python-based solution that addresses these challenges. MARS automates the extraction of relative abundances from metagenomic reads, maps species and genera onto microbial metabolic reconstructions, and accounts for alternative taxonomic names. It normalises microbial reads, provides an optional cut-off for low-abundance taxa, and produces relative abundance tables apt for integration with the Microbiome Modelling Toolbox. A sub-component of the pipeline automates the task of identifying homosynonyms, leveraging web scraping to find taxonomic IDs of given species, searching NCBI for alternative names, and cross-referencing them with microbial reconstruction resources. Taken together, MARS streamlines the entire process from processed metagenomic reads to relative abundance, thereby significantly reducing time and effort when working with microbiome data. MARS is implemented in Python. It can be found as an interactive application at https://mars-pipeline.streamlit.app/, along with detailed documentation at https://github.com/ThieleLab/mars-pipeline.
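The normalisation and cut-off steps mentioned above can be sketched in a few lines. This is a minimal illustration, not the MARS code or its API: the function name, the toy taxa and counts, and in particular the choice to renormalise after applying the cut-off are assumptions.

```python
# Sketch: convert per-taxon read counts to relative abundances, drop taxa
# below an optional cut-off, then renormalise so the kept taxa sum to 1.
# Renormalising after the cut-off is an assumption made for this example.


def relative_abundances(read_counts, cutoff=0.0):
    """Return {taxon: relative abundance} for taxa at or above the cut-off."""
    total = sum(read_counts.values())
    if total <= 0:
        raise ValueError("read counts must sum to a positive number")

    relative = {taxon: count / total for taxon, count in read_counts.items()}
    kept = {taxon: ab for taxon, ab in relative.items() if ab >= cutoff}
    if not kept:
        raise ValueError("cut-off removed every taxon")

    kept_total = sum(kept.values())
    return {taxon: ab / kept_total for taxon, ab in kept.items()}


# Toy read counts for three hypothetical taxa.
toy_counts = {"Taxon A": 90, "Taxon B": 9, "Taxon C": 1}

abundances = relative_abundances(toy_counts, cutoff=0.05)
print(abundances)
```

With the toy counts, the third taxon falls below the 5% cut-off and is removed, and the remaining two are rescaled to sum to 1.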
Microbial abundances retrieved from sequencing data—automated NCBI taxonomy (MARS): a pipeline to create relative microbial abundance data for the microbiome modelling toolbox and utilising homosynonyms for efficient mapping to resources. Bioinformatics Advances, 2024-05-10. DOI: 10.1093/bioadv/vbae068.
B. Akdeniz, O. Frei, Espen Hagen, T. T. Filiz, Sandeep Karthikeyan, Joëlle Pasman, Andreas Jangmo, Jacob Bergstedt, John R Shorter, Richard Zetterberg, J. Meijsen, I. Sønderby, Alfonso Buil, M. Tesli, Yi Lu, Patrick Sullivan, Ole A Andreassen, E. Hovig
The collection and analysis of sensitive data in large-scale consortia for statistical genetics is hampered by multiple challenges, due to their non-shareable nature. Time-consuming issues in installing software frequently arise due to different operating systems, software dependencies, and limited internet access. For federated analysis across sites, it can be challenging to resolve different problems, including format requirements, data wrangling, setting up analysis on high-performance computing facilities, etc. Easier, more standardized, automated protocols and pipelines can be solutions to overcome these issues. We have developed one such solution for statistical genetic data analysis using software container technologies. This solution, named COSGAP: “COntainerized Statistical Genetics Analysis Pipelines”, consists of already established software tools placed into Singularity containers, alongside corresponding code and instructions on how to perform statistical genetic analyses, such as genome-wide association studies, polygenic scoring, LD score regression, Gaussian Mixture Models, and gene-set analysis. Using provided helper scripts written in Python, users can obtain auto-generated scripts to conduct the desired analysis either on HPC facilities or on a personal computer. COSGAP is actively being applied by users from different countries and projects to conduct genetic data analyses without spending much effort on software installation, converting data formats, and other technical requirements. COSGAP is freely available on GitHub (https://github.com/comorment/containers) under the GPLv3 license.
COSGAP: COntainerized statistical genetics analysis pipelines. Bioinformatics Advances, 2024-05-09. DOI: 10.1093/bioadv/vbae067.
Pub Date: 2024-05-08. eCollection Date: 2024-01-01. DOI: 10.1093/bioadv/vbae066
Willow Kion-Crosby, Lars Barquist
Summary: The increasing number of publicly available bacterial gene expression data sets provides an unprecedented resource for the study of gene regulation in diverse conditions, but emphasizes the need for self-supervised methods for the automated generation of new hypotheses. One approach for inferring coordinated regulation from bacterial expression data is through neural networks known as denoising autoencoders (DAEs), which encode large datasets in a reduced bottleneck layer. We have generalized this application of DAEs to include deep networks and explore the effects of network architecture on gene set inference using deep learning. We developed a DAE-based pipeline to extract gene sets from transcriptomic data in Escherichia coli, validated our method by comparing inferred gene sets with known pathways, and used this pipeline to explore how the choice of network architecture impacts gene set recovery. We find that increasing network depth leads the DAEs to explain gene expression in terms of fewer, more concisely defined gene sets, and that adjusting the width results in a tradeoff between generalizability and biological inference. Finally, leveraging our understanding of the impact of DAE architecture, we apply our pipeline to an independent uropathogenic E. coli dataset to identify genes uniquely induced during human colonization.
Availability and implementation: https://github.com/BarquistLab/DAE_architecture_exploration.
Network depth affects inference of gene sets from bacterial transcriptomes using denoising autoencoders. Bioinformatics Advances, 2024-05-08. DOI: 10.1093/bioadv/vbae066. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11256956/pdf/
Pub Date: 2024-05-08. eCollection Date: 2024-01-01. DOI: 10.1093/bioadv/vbae063
Jingcheng Yang, Mo Sun, Zihan Ran, Taehwan Yang, Deepali L Kundnani, Francesca Storici, Penghao Xu
Motivation: Ribonucleoside monophosphates (rNMPs) are the most abundant non-standard nucleotides embedded in genomic DNA. If the presence of rNMP in DNA cannot be controlled, it can lead to genome instability. The actual regulatory functions of rNMPs in DNA remain mainly unknown. Considering the association between rNMP embedment and various diseases and cancer, the phenomenon of rNMP embedment in DNA has become a prominent area of research in recent years.
Results: We introduce the rNMPID database, the first database to reveal rNMP-embedment characteristics, strand bias, and preferred incorporation patterns in the genomic DNA of samples ranging from bacterial to human cells of different genetic backgrounds. The rNMPID database uses datasets generated by different rNMP-mapping techniques. It gives researchers a solid foundation for exploring the features of rNMPs embedded in genomic DNA from multiple sources and their association with cellular functions and, in the future, disease. It also benefits researchers in genetics and genomics who aim to integrate their studies with rNMP-embedment data.
Availability and implementation: rNMPID is freely accessible on the web at https://www.rnmpid.org.
{"title":"rNMPID: a database for riboNucleoside MonoPhosphates in DNA.","authors":"Jingcheng Yang, Mo Sun, Zihan Ran, Taehwan Yang, Deepali L Kundnani, Francesca Storici, Penghao Xu","doi":"10.1093/bioadv/vbae063","DOIUrl":"10.1093/bioadv/vbae063","url":null,"abstract":"<p><strong>Motivation: </strong>Ribonucleoside monophosphates (rNMPs) are the most abundant non-standard nucleotides embedded in genomic DNA. If the presence of rNMP in DNA cannot be controlled, it can lead to genome instability. The actual regulatory functions of rNMPs in DNA remain mainly unknown. Considering the association between rNMP embedment and various diseases and cancer, the phenomenon of rNMP embedment in DNA has become a prominent area of research in recent years.</p><p><strong>Results: </strong>We introduce the rNMPID database, which is the first database revealing rNMP-embedment characteristics, strand bias, and preferred incorporation patterns in the genomic DNA of samples from bacterial to human cells of different genetic backgrounds. The rNMPID database uses datasets generated by different rNMP-mapping techniques. It provides the researchers with a solid foundation to explore the features of rNMP embedded in the genomic DNA of multiple sources, and their association with cellular functions, and, in future, disease. 
It also significantly benefits researchers in the fields of genetics and genomics who aim to integrate their studies with the rNMP-embedment data.</p><p><strong>Availability and implementation: </strong>rNMPID is freely accessible on the web at https://www.rnmpid.org.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11088741/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140913559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ashley V. Schwartz, Karilyn E. Sant, Uduak Z. George
Understanding the pathways and biological processes underlying differential gene expression is fundamental for characterizing gene expression changes in response to an experimental condition. Zebrafish, whose transcriptome closely mirrors that of humans, are frequently used as a model for human development and disease. However, annotations of zebrafish pathways and biological processes are incomplete, whereas more comprehensive annotations exist for humans. This incompleteness may bias functional enrichment findings and cause loss of knowledge. danRerLib, a versatile Python package for zebrafish transcriptomics researchers, overcomes this challenge with a suite of tools that includes gene ID mapping, orthology mapping between the zebrafish and human taxonomies, and functional enrichment analysis using the latest Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. danRerLib enables functional enrichment analysis for GO and KEGG pathways even when they lack direct zebrafish annotations, by transferring human functional annotations through orthology. This lets researchers extend their analysis to a wider range of pathways, elucidating additional mechanisms of interest and giving greater insight into experimental results. danRerLib, along with comprehensive documentation and tutorials, is freely available. The source code is available at https://github.com/sdsucomptox/danrerlib/ with associated documentation and tutorials at https://sdsucomptox.github.io/danrerlib/. The package was developed with Python 3.9 and can be installed through the package management systems PIP (https://pypi.org/project/danrerlib/) and Conda (https://anaconda.org/sdsu_comptox/danrerlib), with additional installation instructions on the documentation website.
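The idea of running enrichment through human orthologs can be illustrated with a minimal sketch. This is not danRerLib's API: the function names, the toy gene identifiers, and the choice of a one-sided hypergeometric (over-representation) test are all assumptions made for illustration, since the abstract does not state which statistical test the package uses.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(population M, n annotated, N drawn)."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def map_pathways_to_zebrafish(human_pathways, human_to_zfish):
    """Translate human-annotated gene sets into zebrafish gene sets via orthologs."""
    mapped = {}
    for pathway, human_genes in human_pathways.items():
        zfish_genes = set()
        for gene in human_genes:
            # A human gene may have zero, one, or several zebrafish orthologs.
            zfish_genes.update(human_to_zfish.get(gene, ()))
        mapped[pathway] = zfish_genes
    return mapped


def enrich(de_genes, background, pathways):
    """Return {pathway: (overlap, p_value)} using an over-representation test."""
    de = set(de_genes) & background
    results = {}
    for pathway, genes in pathways.items():
        in_background = genes & background
        if not in_background:
            continue  # nothing testable for this pathway
        overlap = len(de & in_background)
        p_value = hypergeom_sf(overlap, len(background), len(in_background), len(de))
        results[pathway] = (overlap, p_value)
    return results


# Hypothetical toy data: 20 zebrafish genes, one human-only pathway.
background = {f"zf{i}" for i in range(1, 21)}
human_to_zfish = {"H1": ["zf1", "zf2"], "H2": ["zf3"], "H3": ["zf4"], "H4": []}
human_pathways = {"toy_pathway": ["H1", "H2", "H3", "H4"]}
de_genes = {"zf1", "zf2", "zf3", "zf10"}

mapped = map_pathways_to_zebrafish(human_pathways, human_to_zfish)
results = enrich(de_genes, background, mapped)
```

In this toy case the human pathway maps to four zebrafish genes, three of which are differentially expressed, so the pathway can be tested even though it has no direct zebrafish annotation. For real analyses, the package's own documented functions should be used instead of this sketch.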
{"title":"danRerLib: a python package for zebrafish transcriptomics","authors":"Ashley V. Schwartz, Karilyn E. Sant, Uduak Z. George","doi":"10.1093/bioadv/vbae065","DOIUrl":"https://doi.org/10.1093/bioadv/vbae065","url":null,"abstract":"Understanding the pathways and biological processes underlying differential gene expression is fundamental for characterizing gene expression changes in response to an experimental condition. Zebrafish, with a transcriptome closely mirroring that of humans, are frequently utilized as a model for human development and disease. However, a challenge arises due to the incomplete annotations of zebrafish pathways and biological processes, with more comprehensive annotations existing in humans. This incompleteness may result in biased functional enrichment findings and loss of knowledge. danRerLib, a versatile Python package for zebrafish transcriptomics researchers, overcomes this challenge and provides a suite of tools to be executed in Python including gene ID mapping, orthology mapping for the zebrafish and human taxonomy, and functional enrichment analysis utilizing the latest updated Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. danRerLib enables functional enrichment analysis for GO and KEGG pathways, even when they lack direct zebrafish annotations through the orthology of human-annotated functional annotations. This approach enables researchers to extend their analysis to a wider range of pathways, elucidating additional mechanisms of interest and greater insight into experimental results. danRerLib, along with comprehensive documentation and tutorials, is freely available. The source code is available at https://github.com/sdsucomptox/danrerlib/ with associated documentation and tutorials at https://sdsucomptox.github.io/danrerlib/. 
The package has been developed with Python 3.9 and is available for installation on the package management systems PIP (https://pypi.org/project/danrerlib/) and Conda (https://anaconda.org/sdsu_comptox/danrerlib) with additional installation instructions on the documentation website.","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141007114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}