The estimation of haplotype structure and frequencies provides crucial information about the composition of genomes. Techniques such as single-individual haplotyping aim to reconstruct individual haplotypes from diploid genome sequencing data. Our focus is different: we address the challenge of reconstructing haplotype structure and frequencies from pooled sequencing samples, in which multiple individuals are sequenced simultaneously. A frequentist method for this problem has recently been proposed. In contrast to this and other methods that compute point estimates, our proposed Bayesian hierarchical model delivers a posterior distribution that also lets us quantify uncertainty. Because applying matching permutations to the haplotype structure matrix and the corresponding frequency matrix leaves their product unchanged, we introduce an order-preserving shrinkage prior that ensures identifiability with respect to permutations. For inference, we introduce a blocked Gibbs sampler that enforces the required constraints. We assess the performance of our method in a simulation study. Furthermore, by applying our approach to two distinct real data sets, we demonstrate that it can reconstruct the dominant haplotypes in a challenging, high-dimensional setting.
{"title":"Estimating Haplotype Structure and Frequencies: A Bayesian Approach to Unknown Design in Pooled Genomic Data.","authors":"Yuexuan Wang, Ritabrata Dutta, Andreas Futschik","doi":"10.1089/cmb.2023.0211","DOIUrl":"https://doi.org/10.1089/cmb.2023.0211","url":null,"PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141492186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
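The permutation ambiguity mentioned in the haplotype abstract above can be checked numerically. The sketch below is a minimal illustration, assuming a toy 0/1 structure matrix `H` (positions by haplotypes) and a frequency matrix `F` (haplotypes by pools); both matrices and all helper names are hypothetical and not taken from the paper. It only shows why the product of the two matrices cannot fix the order of the haplotypes by itself; it does not implement the order-preserving shrinkage prior or the blocked Gibbs sampler.

```python
from itertools import permutations


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def permute_columns(m, perm):
    """Reorder the columns of m (the haplotypes in H)."""
    return [[row[p] for p in perm] for row in m]


def permute_rows(m, perm):
    """Reorder the rows of m (the haplotypes in F)."""
    return [m[p] for p in perm]


def close(a, b, tol=1e-12):
    """Entry-wise comparison that tolerates floating-point reordering."""
    return all(abs(x - y) <= tol
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


# Hypothetical toy data: 4 positions x 3 haplotypes (0/1 alleles).
H = [[0, 1, 1],
     [1, 0, 1],
     [0, 0, 1],
     [1, 1, 0]]

# Hypothetical frequencies: 3 haplotypes x 2 pools, each column sums to 1.
F = [[0.5, 0.2],
     [0.3, 0.7],
     [0.2, 0.1]]

baseline = matmul(H, F)  # expected allele frequencies per position and pool

# Apply the same permutation to the columns of H and the rows of F.
invariant = all(
    close(matmul(permute_columns(H, p), permute_rows(F, p)), baseline)
    for p in permutations(range(3))
)
print("product unchanged under all 6 matching permutations:", invariant)
```

Because every matching permutation gives the same product, a likelihood that depends only on the product has several equivalent modes, which is the identifiability issue that a prior imposing an ordering is meant to remove.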
Tamilarasi M, Kumarganesh S, K Martin Sagayam, Andrew J
The prompt and precise identification and delineation of tumor regions in glioma brain images are critical for mitigating the risks associated with this life-threatening disease. In this study, we employ the UNet convolutional neural network (CNN) architecture for glioma tumor detection. Our proposed methodology comprises a transformation module, a feature extraction module, and a tumor segmentation module. The spatial-domain representation of brain magnetic resonance images is decomposed into low- and high-frequency subbands by a non-subsampled shearlet transform. Leveraging the selective and directional characteristics of this transform enhances the classification efficacy of the proposed system. Shearlet features are extracted from both the low- and high-frequency subbands and then classified using the UNet-CNN architecture to identify tumor regions in glioma brain images. We validate the proposed methodology on two publicly available datasets, Brain Tumor Segmentation (BRATS) 2019 and The Cancer Genome Atlas (TCGA). The mean classification rates achieved by our system are 99.1% for the BRATS 2019 dataset and 97.8% for the TCGA dataset. On the BRATS 2019 dataset, our system achieves 98.2% sensitivity, 98.7% specificity, 98.9% accuracy, 98.7% intersection over union, and 98.5% Dice similarity coefficient. On the TCGA dataset, it achieves 97.7% sensitivity, 98.2% specificity, 98.7% accuracy, 98.6% intersection over union, and 98.4% Dice similarity coefficient. Comparative analysis against state-of-the-art methods underscores the efficacy of the proposed glioma brain tumor detection approach.
{"title":"Detection and Segmentation of Glioma Tumors Utilizing a UNet Convolutional Neural Network Approach with Non-Subsampled Shearlet Transform.","authors":"Tamilarasi M, Kumarganesh S, K Martin Sagayam, Andrew J","doi":"10.1089/cmb.2023.0339","DOIUrl":"https://doi.org/10.1089/cmb.2023.0339","url":null,"PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141457116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
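The segmentation metrics quoted in the glioma abstract above all derive from the four pixel-level confusion counts. The sketch below is a minimal illustration using the standard textbook definitions; the flat 0/1 masks and the function names are hypothetical, and the paper's own evaluation protocol (for example, how scores are averaged over images) is not described in the abstract.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN for two equally long 0/1 pixel masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, tn, fn


def segmentation_metrics(tp, fp, tn, fn):
    """Standard definitions; assumes every denominator is non-zero."""
    return {
        "sensitivity": tp / (tp + fn),            # tumor pixels recovered
        "specificity": tn / (tn + fp),            # background pixels kept
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "iou": tp / (tp + fp + fn),               # intersection over union
        "dice": 2 * tp / (2 * tp + fp + fn),      # Dice similarity coefficient
    }


# Hypothetical flattened masks (1 = tumor pixel, 0 = background).
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
reference = [1, 1, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(predicted, reference)
metrics = segmentation_metrics(*counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a single pair of masks, the two overlap scores are tied by Dice = 2·IoU / (1 + IoU), so Dice is never smaller than IoU on the same prediction.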
Filipp Martin Rondel, Hafsa Farooq, Roya Hosseini, Akshay Juyal, Sergey Knyazev, Serghei Mangul, Artem S Rogovskyy, Alexander Zelikovsky
Evaluating changes in metabolic pathway activity is essential for studying disease mechanisms and developing new treatments, with significant benefits for human health. Here, we propose EMPathways2, a maximum likelihood pipeline based on the expectation-maximization algorithm that evaluates enzyme expression and metabolic pathway activity levels. We first estimate enzyme expression from RNA-seq data; these estimates are then used for the simultaneous estimation of pathway activity levels, based on enzyme participation levels in each pathway. We apply the pipeline to RNA-seq data from several groups of mice, which provides a deeper look at the biochemical changes that occur as a result of bacterial infection, disease, and immune response. Our results show that the estimated enzyme expression, pathway activity levels, and enzyme participation levels in each pathway are robust and stable across all samples. The estimated activity levels of a significant number of metabolic pathways correlate strongly with the infected or uninfected status of the respective rodent types.
{"title":"Estimating Enzyme Expression and Metabolic Pathway Activity in <i>Borreliella</i>-Infected and Uninfected Mice.","authors":"Filipp Martin Rondel, Hafsa Farooq, Roya Hosseini, Akshay Juyal, Sergey Knyazev, Serghei Mangul, Artem S Rogovskyy, Alexander Zelikovsky","doi":"10.1089/cmb.2024.0564","DOIUrl":"https://doi.org/10.1089/cmb.2024.0564","url":null,"PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141457117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jūratė Šaltytė Benth, Fred Espen Benth, Espen Rostrup Nakstad
While the world recovers from the COVID-19 pandemic, another outbreak of contagious disease remains the most likely future risk to public safety. Now is therefore the time to equip health authorities with effective tools to ensure they are operationally prepared for future events. We propose a direct approach to obtain reliable, nearly instantaneous time-varying reproduction numbers for contagious diseases, using only the number of infected individuals as input and utilising the dynamics of the susceptible-infected-recovered (SIR) model. Our approach is based on a multivariate nonlinear regression model that simultaneously assesses the parameters describing the transmission and recovery rates as a function of the SIR model. It enables estimation of daily reproduction numbers shortly after the start of a pandemic. It avoids numerous sources of additional variation and provides a generic tool for monitoring instantaneous reproduction numbers. We use Norwegian COVID-19 data as a case study and demonstrate that our results are well aligned with changes in the number of infected individuals and with the change points following policy interventions. Our estimated reproduction numbers are notably less volatile and provide more credible short-term predictions of the number of infected individuals, and thus compare favorably with the results obtained by two other popular approaches used for monitoring a pandemic. The proposed approach contributes to increased preparedness for future pandemics of contagious diseases, as it can be used as a simple yet powerful tool to monitor a pandemic, provide short-term predictions, and thus support decision making regarding timely and targeted control measures.
{"title":"Nearly Instantaneous Time-Varying Reproduction Number for Contagious Diseases-a Direct Approach Based on Nonlinear Regression.","authors":"Jūratė Šaltytė Benth, Fred Espen Benth, Espen Rostrup Nakstad","doi":"10.1089/cmb.2023.0414","DOIUrl":"https://doi.org/10.1089/cmb.2023.0414","url":null,"PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141457118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
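To make the link between SIR dynamics and a time-varying reproduction number concrete, the sketch below steps a textbook discrete-time SIR model forward one day at a time and reports the effective reproduction number R_t = (β_t / γ) · S_t / N. This is an assumption-laden illustration, not the regression approach of the abstract above: the transmission rates, the recovery rate, the population size, and the function name are all hypothetical, and the rates are supplied here rather than estimated from case counts.

```python
def simulate_sir(beta_by_day, gamma, population, initial_infected):
    """Daily Euler steps of the SIR model.

    Returns one tuple (S, I, R, R_t) per day, where
    R_t = beta_t / gamma * S / N is the effective reproduction number.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    rows = []
    for beta in beta_by_day:
        r_t = beta / gamma * s / population
        rows.append((s, i, r, r_t))
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
    return rows


# Hypothetical scenario: transmission drops on day 20 (e.g. an intervention).
beta_by_day = [0.3] * 20 + [0.05] * 20
rows = simulate_sir(beta_by_day, gamma=0.1, population=1000, initial_infected=1)

for day in (0, 19, 20, 39):
    s, i, r, r_t = rows[day]
    print(f"day {day:2d}: infected={i:8.2f}  R_t={r_t:.3f}")
```

In this toy run, R_t falls below 1 as soon as the transmission rate is reduced, which is the kind of change point the abstract describes after policy interventions.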
Noncoding RNA (ncRNA)-protein interactions (NPIs) play fundamentally important roles in cellular activities. Although various predictors based on molecular features and graphs have been published to aid the identification of NPIs, most of them ignore the information between known NPIs or show insufficient ability to learn from graphs, which makes it difficult to identify NPIs effectively. To develop a more reliable and accurate predictor for NPIs, we propose NPI-DCGNN, an end-to-end NPI predictor based on a dual-channel graph neural network (DCGNN). NPI-DCGNN first treats the known NPIs as an ncRNA-protein bipartite graph. Then, for each ncRNA-protein pair, it extracts from the bipartite graph two local subgraphs, one centered on the ncRNA and one centered on the protein. Next, it uses a dual-channel graph representation learning layer based on GNNs to generate high-level feature representations for the ncRNA-protein pair. Finally, it employs a fully connected network and an output layer to predict whether the ncRNA and the protein interact. Experimental results on four experimentally validated datasets demonstrate that NPI-DCGNN outperforms several state-of-the-art NPI predictors. Our case studies on the NPInter database further demonstrate the predictive power of NPI-DCGNN. With the source code available (https://github.com/zhangxin11111/NPI-DCGNN), we anticipate that NPI-DCGNN could facilitate studies of the ncRNA interactome by providing highly reliable NPI candidates for further experimental validation.
{"title":"NPI-DCGNN: An Accurate Tool for Identifying ncRNA-Protein Interactions Using a Dual-Channel Graph Neural Network.","authors":"Xin Zhang, Liangwei Zhao, Ziyi Chai, Hao Wu, Wei Yang, Chen Li, Yu Jiang, Quanzhong Liu","doi":"10.1089/cmb.2023.0449","DOIUrl":"https://doi.org/10.1089/cmb.2023.0449","url":null,"PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141457119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
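The first two steps described in the NPI-DCGNN abstract above (building a bipartite graph from known interactions, then extracting a local subgraph around each member of a candidate pair) can be sketched with a plain breadth-first search. This is a minimal illustration under stated assumptions: the interaction pairs, the node naming, and the `hops` radius are hypothetical, and the actual extraction rule used by NPI-DCGNN should be taken from its repository rather than from this sketch.

```python
from collections import defaultdict, deque


def build_bipartite(pairs):
    """Adjacency sets for an ncRNA-protein bipartite graph."""
    adj = defaultdict(set)
    for rna, protein in pairs:
        adj[("rna", rna)].add(("protein", protein))
        adj[("protein", protein)].add(("rna", rna))
    return adj


def local_subgraph(adj, center, hops):
    """Nodes and edges within `hops` steps of `center` (breadth-first)."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand beyond the requested radius
        for neighbour in adj.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    nodes = set(distance)
    edges = {frozenset((u, v))
             for u in nodes
             for v in adj.get(u, ())
             if v in nodes}
    return nodes, edges


# Hypothetical known interactions (identifiers are placeholders).
known = [("n1", "p1"), ("n1", "p2"), ("n2", "p2"), ("n3", "p3")]
graph = build_bipartite(known)

# Candidate pair (n1, p3): one subgraph per member of the pair.
rna_nodes, rna_edges = local_subgraph(graph, ("rna", "n1"), hops=1)
prot_nodes, prot_edges = local_subgraph(graph, ("protein", "p3"), hops=1)
print(sorted(rna_nodes), len(rna_edges))
print(sorted(prot_nodes), len(prot_edges))
```

Each subgraph would then be fed to its own channel of the graph network; that learning step is outside the scope of this sketch.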
The combined effect of shape and electrostatic complementarity (Sc and EC) at the interface of interacting protein partners serves as the physical basis for protein-protein interactions (PPIs) and is a strong determinant of their binding energetics. EnCPdock (https://www.scinetmol.in/EnCPdock/) is a comprehensive web platform for the direct, joint comparative analysis of complementarity and binding energetics in PPIs. It interlinks the local (Sc) and nonlocal (EC) complementarity in PPIs using the complementarity plot. It further derives an AI-based ΔG_binding with a prediction accuracy comparable to the state of the art. This book chapter presents a practical manual for understanding and using EnCPdock and its various features and functionalities, which together have the potential to serve as a valuable protein engineering tool in the design of novel protein interfaces.
{"title":"Combining Complementarity and Binding Energetics in the Assessment of Protein Interactions: EnCPdock-A Practical Manual.","authors":"Gargi Biswas, Debasish Mukherjee, Sankar Basu","doi":"10.1089/cmb.2024.0554","DOIUrl":"https://doi.org/10.1089/cmb.2024.0554","url":null,"PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141419324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single-matrix amino acid (AA) substitution models are widely used in phylogenetic analyses; however, they cannot properly model the heterogeneity of AA substitution rates among sites. Multi-matrix mixture models can handle this site rate heterogeneity and outperform single-matrix models. Estimating multi-matrix mixture models is a complex process, and no computer program is available for this task. In this study, we implemented a computer program, QMix, based on the algorithm of LG4X and LG4M with several enhancements, to automatically estimate multi-matrix mixture models from large datasets. QMix employs the QMaker algorithm instead of the XRATE algorithm to estimate model parameters accurately and rapidly. It can estimate mixture models with different numbers of matrices and supports multithreaded computation to efficiently estimate models from thousands of genes. We re-estimated the mixture models LG4X and LG4M from 1471 HSSP alignments. The re-estimated models (HP4X and HP4M) are slightly better than LG4X and LG4M at building maximum likelihood trees from HSSP and TreeBASE datasets. The QMix program required about 10 hours on a computer with 18 cores to estimate a mixture model with four matrices from 200 HSSP alignments. It is easy to use and freely available to researchers.
{"title":"QMix: An Efficient Program to Automatically Estimate Multi-Matrix Mixture Models for Amino Acid Substitution Process.","authors":"Nguyen Huy Tinh, Cuong Cao Dang, Le Sy Vinh","doi":"10.1089/cmb.2023.0403","DOIUrl":"https://doi.org/10.1089/cmb.2023.0403","url":null,"PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141300786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-06-01; Epub Date: 2024-05-17; DOI: 10.1089/cmb.2024.0520
Casper Asbjørn Eriksen, Jakob Lykke Andersen, Rolf Fagerberg, Daniel Merkle
Information on the structure of molecules, retrieved from biochemical databases, plays a pivotal role in various disciplines, including metabolomics, systems biology, and drug discovery. No such database can be complete, and it is often necessary to incorporate data from several sources. However, the molecular structure for a given compound is not necessarily consistent between databases. This article presents StructRecon, a novel tool for resolving unique molecular structures from database identifiers. Currently, identifiers from BiGG, ChEBI, Escherichia coli Metabolome Database (ECMDB), MetaNetX, and PubChem are supported. StructRecon traverses the cross-links between entries in different databases to construct what we call identifier graphs. The goal of these graphs is to offer a more complete view of the total information available on a given compound across all the supported databases. To reconcile discrepancies encountered during the traversal of the databases, we develop an extensible model for molecular structure that supports multiple independent levels of detail, which allows standardization of the structure to be applied iteratively. In some cases, our standardization approach yields multiple candidate structures for a given compound; a random walk-based algorithm is then used to select the most likely structure among the incompatible alternatives. As a case study, we applied StructRecon to the EColiCore2 model. We found at least one structure for 98.66% of its compounds, more than twice as many as can be obtained when the databases are used in more standard ways that do not consider the complex network of cross-database references captured by our identifier graphs. StructRecon is open-source and modular, which enables support for more databases in the future.
{"title":"Toward the Reconciliation of Inconsistent Molecular Structures from Biochemical Databases.","authors":"Casper Asbjørn Eriksen, Jakob Lykke Andersen, Rolf Fagerberg, Daniel Merkle","doi":"10.1089/cmb.2024.0520","DOIUrl":"10.1089/cmb.2024.0520","url":null,"PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140957701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
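The idea of an identifier graph from the StructRecon abstract above can be sketched as a traversal over cross-database references that collects every structure reachable from a starting identifier. The sketch below is a rough illustration only: the cross-link table, the identifiers, and the structure labels are hypothetical placeholders, and counting candidates is a simplification that stands in for neither StructRecon's structure standardization nor its random walk-based selection.

```python
from collections import deque

# Hypothetical cross-references: (database, identifier) -> linked entries.
CROSS_LINKS = {
    ("BiGG", "cmpd_a"): [("MetaNetX", "mnx_1"), ("ChEBI", "chebi_1")],
    ("MetaNetX", "mnx_1"): [("ChEBI", "chebi_1"), ("PubChem", "pc_1")],
    ("ChEBI", "chebi_1"): [],
    ("PubChem", "pc_1"): [("ChEBI", "chebi_2")],
    ("ChEBI", "chebi_2"): [],
    ("ECMDB", "ecmdb_9"): [("PubChem", "pc_9")],
    ("PubChem", "pc_9"): [],
}

# Hypothetical structure annotations attached to some entries.
STRUCTURES = {
    ("ChEBI", "chebi_1"): "structure_X",
    ("PubChem", "pc_1"): "structure_X",
    ("ChEBI", "chebi_2"): "structure_Y",
    ("PubChem", "pc_9"): "structure_Z",
}


def identifier_graph(start):
    """All entries reachable from `start` by following cross-links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for linked in CROSS_LINKS.get(node, []):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return seen


def candidate_structures(start):
    """Count how many reachable entries support each candidate structure."""
    counts = {}
    for node in identifier_graph(start):
        structure = STRUCTURES.get(node)
        if structure is not None:
            counts[structure] = counts.get(structure, 0) + 1
    return counts


reachable = identifier_graph(("BiGG", "cmpd_a"))
candidates = candidate_structures(("BiGG", "cmpd_a"))
print(len(reachable), candidates)
```

In this toy example the BiGG entry carries no structure of its own, yet two candidates become visible once the cross-links are followed, which mirrors the gain in coverage that the abstract attributes to identifier graphs.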