Pub Date: 2024-06-26; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae094
Jiaying Lai, Yi Yang, Yunzhou Liu, Robert B Scharpf, Rachel Karchin
Summary: Neoplastic tumors originate from a single cell, and their evolution can be traced through lineages characterized by mutations, copy number alterations, and structural variants. These lineages are reconstructed and mapped onto evolutionary trees with algorithmic approaches. However, without ground truth benchmark sets, the validity of an algorithm remains uncertain, limiting potential clinical applicability. With a growing number of algorithms available, there is an urgent need for standardized benchmark sets to evaluate their merits. Benchmark sets rely on in silico simulations of tumor sequences, but there are no accepted standards for simulation tools, presenting a major obstacle to progress in this field.
Availability and implementation: All analysis done in the paper was based on publicly available data from the publication of each accessed tool.
Title: Assessing the merits: an opinion on the effectiveness of simulation techniques in tumor subclonal reconstruction.
Journal: Bioinformatics Advances
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11213631/pdf/
Motivation: In recent years, applying computational modeling to systems biology has caused a substantial surge in both discovery and practical applications and a significant shift in our understanding of the complexity inherent in biological systems.
Results: In this perspective article, we give a brief overview of computational modeling in biology, highlighting recent advancements such as multi-scale modeling due to the omics revolution, single-cell technology, and integration of artificial intelligence and machine learning approaches. We also discuss the primary challenges faced: integration, standardization, model complexity, scalability, and interdisciplinary collaboration. Lastly, we highlight the contribution made by the Computational Modeling of Biological Systems (SysMod) Community of Special Interest (COSI) associated with the International Society for Computational Biology (ISCB) in driving progress within this rapidly evolving field through community engagement (via both in-person and virtual meetings and social media interactions), webinars, and conferences.
Availability and implementation: Additional information about SysMod is available at https://sysmod.info.
Title: Perspectives on computational modeling of biological systems and the significance of the SysMod community.
Authors: Bhanwar Lal Puniya, Meghna Verma, Chiara Damiani, Shaimaa Bakr, Andreas Dräger
Pub Date: 2024-06-26; DOI: 10.1093/bioadv/vbae090
Journal: Bioinformatics Advances
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11213628/pdf/
Pub Date: 2024-06-26; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae084
Melina Klostermann, Kathi Zarnack
Motivation: A vast variety of biological questions connected to RNA-binding proteins can be tackled with UV crosslinking and immunoprecipitation (CLIP) experiments. However, the processing and analysis of CLIP data are rather complex. Moreover, different types of CLIP experiments like iCLIP or eCLIP are often processed in different ways, reducing comparability between multiple experiments. Therefore, we aimed to build an easy-to-use computational tool for the processing of CLIP data that can be used for both iCLIP and eCLIP data, as well as data from other truncation-based CLIP methods.
Results: Here, we introduce racoon_clip, a sustainable and fully automated pipeline for the complete processing of iCLIP and eCLIP data to extract RNA binding signal at single-nucleotide resolution. racoon_clip is easy to install and execute, with multiple pre-settings and fully customizable parameters, and outputs a conclusive summary report with visualizations and statistics for all analysis steps.
Availability and implementation: racoon_clip is implemented as a Snakemake-powered command line tool (Snakemake version ≥7.22, Python version ≥3.9). The latest release can be downloaded from GitHub (https://github.com/ZarnackGroup/racoon_clip/tree/main) and installed via pip. Detailed documentation, including installation, usage, and customization, can be found at https://racoon-clip.readthedocs.io/en/latest/. The example datasets can be downloaded from the Sequence Read Archive (SRA; iCLIP: SRR5646576, SRR5646577, SRR5646578) or the ENCODE Project (eCLIP: ENCSR202BFN).
Title: racoon_clip-a complete pipeline for single-nucleotide analyses of iCLIP and eCLIP data.
Journal: Bioinformatics Advances
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11213630/pdf/
Pub Date: 2024-06-25; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae093
Mehmet Eren Ahsen, Robert Vogel, Gustavo Stolovitzky
Motivation: The integration of vast, complex biological data with computational models offers profound insights and predictive accuracy. Yet, such models face challenges: poor generalization and limited labeled data.
Results: To overcome these difficulties in binary classification tasks, we developed the Method for Optimal Classification by Aggregation (MOCA) algorithm, which addresses the problem of generalization by virtue of being an ensemble learning method and can be used in problems with limited or no labeled data. We developed both an unsupervised (uMOCA) and a supervised (sMOCA) variant of MOCA. For uMOCA, we show how to infer the MOCA weights in an unsupervised way, which are optimal under the assumption of class-conditioned independent classifier predictions. When it is possible to use labels, sMOCA uses empirically computed MOCA weights. We demonstrate the performance of uMOCA and sMOCA using simulated data as well as actual data previously used in Dialogue on Reverse Engineering and Methods (DREAM) challenges. We also propose an application of sMOCA for transfer learning where we use pre-trained computational models from a domain where labeled data are abundant and apply them to a different domain with less abundant labeled data.
Availability and implementation: GitHub repository, https://github.com/robert-vogel/moca.
Title: Optimal linear ensemble of binary classifiers.
Journal: Bioinformatics Advances
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11249386/pdf/
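The MOCA abstract describes a linear ensemble of binary classifiers whose weights are computed empirically when labels are available. As a rough illustration of that general idea only, the sketch below combines rank-transformed classifier outputs, weighting each classifier by the difference between its mean rank on positive and on negative samples. The weighting rule, the function names, and the toy data are assumptions made for illustration; they are not taken from the MOCA paper or its repository.

```python
import numpy as np


def rank_transform(scores):
    """Rank each classifier's scores (one row per classifier) from 1..n.

    Higher score gives higher rank; ties are broken arbitrarily by argsort.
    """
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort(axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, scores.shape[1] + 1)
    return ranks.astype(float)


def empirical_weights(ranks, labels):
    """Illustrative weight per classifier: mean rank of positives minus negatives."""
    labels = np.asarray(labels, dtype=bool)
    return ranks[:, labels].mean(axis=1) - ranks[:, ~labels].mean(axis=1)


def linear_ensemble(ranks, weights):
    """Weighted sum of the classifiers' ranks for every sample."""
    return np.asarray(weights) @ np.asarray(ranks)


# Toy data: the first classifier separates the classes, the second does not.
toy_labels = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
toy_scores = [
    [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],
    [0.1, 0.9, 0.3, 0.8, 0.2, 0.7],
]
toy_ranks = rank_transform(toy_scores)
toy_weights = empirical_weights(toy_ranks, toy_labels)
toy_ensemble = linear_ensemble(toy_ranks, toy_weights)
```

In this toy setting the uninformative classifier receives a weight close to zero, so the ensemble score is dominated by the informative one. An unsupervised variant would need to estimate the weights without `toy_labels`, which this sketch does not attempt.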
Pub Date: 2024-06-21; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae095
Kiran Deol, Griffin M Weber, Yun William Yu
Motivation: Nonlinear low-dimensional embeddings allow humans to visualize high-dimensional data, as is often seen in bioinformatics, where datasets may have tens of thousands of dimensions. However, relating the axes of a nonlinear embedding to the original dimensions is a nontrivial problem. In particular, humans may identify patterns or interesting subsections in the embedding, but cannot easily identify what those patterns correspond to in the original data.
Results: Thus, we present SlowMoMan (SLOW Motions on MANifolds), a web application which allows the user to draw a one-dimensional path onto a 2D embedding. Then, by back-projecting the manifold to the original, high-dimensional space, we sort the original features such that those most discriminative along the manifold are ranked highly. We show a number of pertinent use cases for our tool, including trajectory inference, spatial transcriptomics, and automatic cell classification.
Availability and implementation: Software: https://yunwilliamyu.github.io/SlowMoMan/; Code: https://github.com/yunwilliamyu/SlowMoMan.
Title: SlowMoMan: a web app for discovery of important features along user-drawn trajectories in 2D embeddings.
Journal: Bioinformatics Advances
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11220466/pdf/
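The SlowMoMan abstract describes drawing a path on a 2D embedding, mapping it back to the original high-dimensional space, and ranking the original features by how discriminative they are along the path. A minimal sketch of that workflow, under stated assumptions, follows: each path vertex is mapped to its nearest embedded data point, and features are ranked by the absolute Pearson correlation between their values and the position along the path. The nearest-neighbour back-projection, the correlation-based score, and the function name `rank_features_along_path` are all assumptions chosen for illustration; the tool itself may use a different back-projection and scoring metric.

```python
import numpy as np


def rank_features_along_path(embedding, features, path):
    """Rank original features by |correlation| with position along a drawn path.

    embedding: (n_points, 2) coordinates of the data in the 2D embedding
    features:  (n_points, n_features) original high-dimensional data
    path:      (n_path, 2) ordered vertices of the user-drawn path
    Returns (order, corr): feature indices sorted from most to least
    path-associated, and the signed correlation of every feature.
    """
    embedding = np.asarray(embedding, dtype=float)
    features = np.asarray(features, dtype=float)
    path = np.asarray(path, dtype=float)

    # Back-project: nearest embedded data point for each path vertex.
    dists = np.linalg.norm(path[:, None, :] - embedding[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    profile = features[nearest]  # feature values along the path

    # Pearson correlation of each feature with the position along the path.
    position = np.arange(len(path), dtype=float)
    pos_c = position - position.mean()
    prof_c = profile - profile.mean(axis=0)
    denom = np.sqrt((pos_c ** 2).sum() * (prof_c ** 2).sum(axis=0))
    corr = np.divide(
        pos_c @ prof_c,
        denom,
        out=np.zeros(features.shape[1]),
        where=denom > 0,  # constant features get a correlation of 0
    )
    order = np.argsort(-np.abs(corr))
    return order, corr
```

A correlation score only captures monotonic trends; a feature that rises and then falls along the path would score low here, so a frequency-based or nonparametric score may be more appropriate in practice.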
Pub Date: 2024-06-20; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae077
Serhan Yılmaz, Filipa Blasco Tavares Pereira Lopes, Daniela Schlatzer, Marzieh Ayati, Mark R Chance, Mehmet Koyutürk
Summary: We present RokaiXplorer, an intuitive web tool designed to address the scarcity of user-friendly solutions for proteomics and phospho-proteomics data analysis and visualization. RokaiXplorer streamlines data processing, analysis, and visualization through an interactive online interface, making it accessible to researchers without specialized training in proteomics or data science. With its comprehensive suite of modules, RokaiXplorer facilitates phospho-proteomic analysis at the level of phosphosites, proteins, kinases, biological processes, and pathways. The tool offers functionalities such as data normalization, statistical testing, activity inference, pathway enrichment, subgroup analysis, automated report generation, and multiple visualizations, including volcano plots, bar plots, heat maps, and network views. As a unique feature, RokaiXplorer allows researchers to effortlessly deploy their own data browsers, enabling interactive sharing of research data and findings. Overall, RokaiXplorer fills an important gap in phospho-proteomic data analysis by providing the ability to comprehensively analyze data at multiple levels within a single application.
Availability and implementation: Access RokaiXplorer at: http://explorer.rokai.io.
Title: Making proteomics accessible: RokaiXplorer for interactive analysis of phospho-proteomic data.
Journal: Bioinformatics Advances
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11415779/pdf/
Pub Date: 2024-06-19; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae082
Matthew Macaulay, Mathieu Fourment
Motivation: Navigating the high dimensional space of discrete trees for phylogenetics presents a challenging problem for tree optimization. To address this, hyperbolic embeddings of trees offer a promising approach to encoding trees efficiently in continuous spaces. However, they require a differentiable tree decoder to optimize the phylogenetic likelihood. We present soft-NJ, a differentiable version of neighbour joining that enables gradient-based optimization over the space of trees.
Results: We illustrate the potential for differentiable optimization over tree space for maximum likelihood inference. We then perform variational Bayesian phylogenetics by optimizing embedding distributions in hyperbolic space. We compare the performance of this approximation technique on eight benchmark datasets to state-of-the-art methods. Results indicate that, while this technique is not immune to local optima, it opens a plethora of powerful and parametrically efficient approaches to phylogenetics via tree embeddings.
Availability and implementation: Dodonaphy is freely available on the web at https://www.github.com/mattapow/dodonaphy. It includes an implementation of soft-NJ.
Title: Differentiable phylogenetics via hyperbolic embeddings with Dodonaphy.
Journal: Bioinformatics Advances
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11310108/pdf/
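Neighbour joining picks, at each step, the pair of taxa that minimizes the Q-criterion, and that hard argmin is what blocks gradients. One common way to relax such a choice is a temperature-controlled softmax over the candidate pairs. The sketch below shows only that relaxation for a single selection step; it is an assumption about how a differentiable variant could be built, not the soft-NJ algorithm as implemented in Dodonaphy, and the function names and temperature parameter are hypothetical.

```python
import numpy as np


def nj_q_matrix(d):
    """Standard neighbour-joining criterion: Q(i,j) = (n-2) d(i,j) - r_i - r_j."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    row_sums = d.sum(axis=1)
    return (n - 2) * d - row_sums[:, None] - row_sums[None, :]


def soft_pair_weights(d, temperature=1.0):
    """Softmax relaxation of the argmin over pairs (i < j) of the Q matrix.

    Returns an upper-triangular matrix of weights that sum to 1. As the
    temperature approaches 0 the weights concentrate on the pair that
    hard neighbour joining would select.
    """
    q = nj_q_matrix(d)
    n = q.shape[0]
    upper = np.triu_indices(n, k=1)
    logits = -q[upper] / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    w /= w.sum()
    weights = np.zeros_like(q)
    weights[upper] = w
    return weights


# Five-taxon toy distance matrix (illustrative values).
toy_distances = np.array([
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
], dtype=float)
toy_weights = soft_pair_weights(toy_distances, temperature=1.0)
```

Written in an automatic-differentiation framework instead of NumPy, the same expression would let gradients flow from a downstream loss back to the pairwise distances, which is the property a differentiable tree decoder needs.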
Pub Date: 2024-06-18; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae092
Anna Kennedy, Ella Richardson, Jonathan Higham, Panagiotis Kotsantis, Richard Mort, Barbara Bo-Ju Shih
Motivation: Data sharing by large comprehensive cancer research projects, such as The Cancer Genome Atlas (TCGA), has improved the availability of high-quality data to research labs around the world. However, due to the volume and inherent complexity of high-throughput omics data, analysis of these data is limited by researchers' capacity to process them with programming languages such as R or Python. Existing webtools lack functionality that supports large-scale analysis; typically, users can only input one gene, or a gene list condensed into a gene set, instead of performing individual gene-level analysis. Furthermore, analysis results are usually displayed without other sample-level molecular or clinical annotations. To address these gaps in the existing webtools, we have developed Evergene using R and Shiny.
Results: Evergene is a user-friendly webtool that utilizes RNA-sequencing data, alongside other sample and clinical annotations, for large-scale gene-centric analysis, including principal component analysis (PCA), survival analysis (SA), and correlation analysis (CA). Moreover, Evergene enables in-depth analysis of cancer transcriptomic data, which can be explored through dimensionality reduction methods that relate gene expression to clinical events or other sample information, such as ethnicity, histological classification, and molecular indices. Lastly, users can upload custom data to Evergene for analysis.
Availability and implementation: Evergene webtool is available at https://bshihlab.shinyapps.io/evergene/. The source code and example user input dataset are available at https://github.com/bshihlab/evergene.
Title: Evergene: an interactive webtool for large-scale gene-centric analysis of primary tumours.
Journal: Bioinformatics Advances
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11213629/pdf/
Pub Date: 2024-06-17; eCollection Date: 2024-01-01; DOI: 10.1093/bioadv/vbae089
Priyanka Banerjee, Oliver Eulenstein, Iddo Friedberg
Motivation: Genomic islands (GEIs) are clusters of genes in bacterial genomes that are typically acquired by horizontal gene transfer. GEIs play a crucial role in the evolution of bacteria by rapidly introducing genetic diversity and thus helping them adapt to changing environments. Specifically of interest to human health, many GEIs contain pathogenicity and antimicrobial resistance genes. Detecting GEIs is, therefore, an important problem in biomedical and environmental research. There have been many previous studies for computationally identifying GEIs. Still, most of these studies rely on detecting anomalies in the unannotated nucleotide sequences or on a fixed set of known features on annotated nucleotide sequences.
Results: Here, we present TreasureIsland, which uses a new unsupervised representation of DNA sequences to predict GEIs. We developed a high-precision boundary detection method featuring an incremental fine-tuning of GEI borders, and we evaluated the accuracy of this framework using a new comprehensive reference dataset, Benbow. We show that TreasureIsland's accuracy rivals that of other GEI predictors, enabling faster and more efficient identification of GEIs in unannotated bacterial genomes.
Availability and implementation: TreasureIsland is available under an MIT license at: https://github.com/FriedbergLab/GenomicIslandPrediction.
Title: Discovering genomic islands in unannotated bacterial genomes using sequence embedding.
Journal: Bioinformatics Advances
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11193100/pdf/
Pub Date: 2024-06-14 eCollection Date: 2024-01-01 DOI: 10.1093/bioadv/vbae088
Genevieve R Krause, Walt Shands, Travis J Wheeler
Summary: We present BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs). BATH is built on top of the HMMER3 code base, and simplifies the annotation workflow for pHMM-based translated sequence annotation by providing a straightforward input interface and easy-to-interpret output. BATH also introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions and deletions (indels). BATH matches the accuracy of HMMER3 for annotation of sequences containing no errors, and is more accurate than all tested tools for annotation of sequences containing nucleotide indels. These results suggest that BATH should be used when high annotation sensitivity is required, particularly when frameshift errors are expected to interrupt protein-coding regions, as is true with long-read sequencing data and in the context of pseudogenes.
Availability and implementation: The software is available at https://github.com/TravisWheelerLab/BATH.
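To illustrate why frameshift-inducing indels matter for translated annotation, the sketch below translates a short DNA sequence before and after a single-base insertion. It is a toy demonstration under stated assumptions, not BATH's algorithm: the `translate` helper and the example sequence are hypothetical, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate dna in the given forward frame (0, 1, or 2).

    Hypothetical helper: unknown codons become 'X', stops become '*',
    and trailing bases that do not fill a codon are ignored.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


if __name__ == "__main__":
    original = "ATGGCTAAAGGTGAACTG"
    # Insert one base after the second codon, as a sequencing error might.
    shifted = original[:6] + "C" + original[6:]

    print(translate(original))      # intact reading frame
    print(translate(shifted, 0))    # frame 0 is disrupted after the indel
    print(translate(shifted, 1))    # the downstream peptide reappears in frame 1
```

After the insertion, no single frame contains the whole peptide: the start stays in frame 0 while the remainder continues in frame 1. A frame-by-frame translated search would therefore see two fragments, which is the situation frameshift-aware alignment is designed to handle.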
{"title":"Sensitive and error-tolerant annotation of protein-coding DNA with BATH.","authors":"Genevieve R Krause, Walt Shands, Travis J Wheeler","doi":"10.1093/bioadv/vbae088","journal":{"name":"Bioinformatics advances"},"PeriodicalIF":2.4,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11223822/pdf/"}