Pub Date : 2024-10-14DOI: 10.1109/TCBB.2024.3480150
Luca Cattelani, Arindam Ghosh, Teemu Rintala, Vittorio Fortino
Machine learning algorithms have been extensively used for accurate classification of cancer subtypes driven by gene expression-based biomarkers. However, biomarker models combining multiple gene expression signatures are often not reproducible in external validation datasets and their feature set size is often not optimized, jeopardizing their translatability into cost-effective clinical tools. We investigated how to solve the multi-objective problem of finding the best trade-offs between classification performance and set size applying seven algorithms for machine learning-driven feature subset selection and analyse how they perform in a benchmark with eight large-scale transcriptome datasets of cancer, covering both training and external validation sets. The benchmark includes evaluation metrics assessing the performance of the individual biomarkers and the solution sets, according to their accuracy, diversity, and stability of the composing genes. Moreover, a new evaluation metric for cross-validation studies is proposed that generalizes the hypervolume, which is commonly used to assess the performance of multi-objective optimization algorithms. Biomarkers exhibiting 0.8 of balanced accuracy on the external dataset for breast, kidney and ovarian cancer using respectively 4, 2 and 7 features, were obtained. Genetic algorithms often provided better performance than other considered algorithms, and the recently proposed NSGA2-CH and NSGA2-CHS were the best performing methods in most cases.
{"title":"A comprehensive evaluation framework for benchmarking multi-objective feature selection in omics-based biomarker discovery.","authors":"Luca Cattelani, Arindam Ghosh, Teemu Rintala, Vittorio Fortino","doi":"10.1109/TCBB.2024.3480150","DOIUrl":"https://doi.org/10.1109/TCBB.2024.3480150","url":null,"abstract":"<p><p>Machine learning algorithms have been extensively used for accurate classification of cancer subtypes driven by gene expression-based biomarkers. However, biomarker models combining multiple gene expression signatures are often not reproducible in external validation datasets and their feature set size is often not optimized, jeopardizing their translatability into cost-effective clinical tools. We investigated how to solve the multi-objective problem of finding the best trade-offs between classification performance and set size applying seven algorithms for machine learning-driven feature subset selection and analyse how they perform in a benchmark with eight large-scale transcriptome datasets of cancer, covering both training and external validation sets. The benchmark includes evaluation metrics assessing the performance of the individual biomarkers and the solution sets, according to their accuracy, diversity, and stability of the composing genes. Moreover, a new evaluation metric for cross-validation studies is proposed that generalizes the hypervolume, which is commonly used to assess the performance of multi-objective optimization algorithms. Biomarkers exhibiting 0.8 of balanced accuracy on the external dataset for breast, kidney and ovarian cancer using respectively 4, 2 and 7 features, were obtained. Genetic algorithms often provided better performance than other considered algorithms, and the recently proposed NSGA2-CH and NSGA2-CHS were the best performing methods in most cases.</p>","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142464221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-14DOI: 10.1109/TCBB.2024.3480088
Fangfang Su, Chong Teng, Fei Li, Bobo Li, Jun Zhou, Donghong Ji
Currently, biomedical event extraction has received considerable attention in various fields, including natural language processing, bioinformatics, and computational biomedicine. This has led to the emergence of numerous machine learning and deep learning models that have been proposed and applied to tackle this complex task. While existing models typically adopt an extraction-based approach, which requires breaking down the extraction of biomedical events into multiple subtasks for sequential processing, making it prone to cascading errors. This paper presents a novel approach by constructing a biomedical event generation model based on the framework of the pre-trained language model T5. We employ a sequence-tosequence generation paradigm to obtain events, the model utilizes constrained decoding algorithm to guide sequence generation, and a curriculum learning algorithm for efficient model learning. To demonstrate the effectiveness of our model, we evaluate it on two public benchmark datasets, Genia 2011 and Genia 2013. Our model achieves superior performance, illustrating the effectiveness of generative modeling of biomedical events.
{"title":"Generative Biomedical Event Extraction with Constrained Decoding Strategy.","authors":"Fangfang Su, Chong Teng, Fei Li, Bobo Li, Jun Zhou, Donghong Ji","doi":"10.1109/TCBB.2024.3480088","DOIUrl":"https://doi.org/10.1109/TCBB.2024.3480088","url":null,"abstract":"<p><p>Currently, biomedical event extraction has received considerable attention in various fields, including natural language processing, bioinformatics, and computational biomedicine. This has led to the emergence of numerous machine learning and deep learning models that have been proposed and applied to tackle this complex task. While existing models typically adopt an extraction-based approach, which requires breaking down the extraction of biomedical events into multiple subtasks for sequential processing, making it prone to cascading errors. This paper presents a novel approach by constructing a biomedical event generation model based on the framework of the pre-trained language model T5. We employ a sequence-tosequence generation paradigm to obtain events, the model utilizes constrained decoding algorithm to guide sequence generation, and a curriculum learning algorithm for efficient model learning. To demonstrate the effectiveness of our model, we evaluate it on two public benchmark datasets, Genia 2011 and Genia 2013. Our model achieves superior performance, illustrating the effectiveness of generative modeling of biomedical events.</p>","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142464222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-11DOI: 10.1109/TCBB.2024.3477909
Ghulam Murtaza, Justin Wagner, Justin M Zook, Ritambhara Singh
Hi-C experiments allow researchers to study and understand the 3D genome organization and its regulatory function. Unfortunately, sequencing costs and technical constraints severely restrict access to high-quality Hi-C data for many cell types. Existing frameworks rely on a sparse Hi-C dataset or cheaper-to-acquire ChIP-seq data to predict Hi-C contact maps with high read coverage. However, these methods fail to generalize to sparse or cross-cell-type inputs because they do not account for the contributions of epigenomic features or the impact of the structural neighborhood in predicting Hi-C reads. We propose GrapHiC, which combines Hi-C and ChIP-seq in a graph representation, allowing more accurate embedding of structural and epigenomic features. Each node represents a binned genomic region, and we assign edge weights using the observed Hi-C reads. Additionally, we embed ChIP-seq and relative positional information as node attributes, allowing our representation to capture structural neighborhoods and the contributions of proteins and their modifications for predicting Hi-C reads. We show that GrapHiC generalizes better than the current state-of-the-art on cross-cell-type settings and sparse Hi-C inputs. Moreover, we can utilize our framework to impute Hi-C reads even when no Hi-C contact map is available, thus making high-quality Hi-C data accessible for many cell types. Availability: https://github.com/rsinghlab/GrapHiC.
{"title":"GrapHiC: An integrative graph based approach for imputing missing Hi-C reads.","authors":"Ghulam Murtaza, Justin Wagner, Justin M Zook, Ritambhara Singh","doi":"10.1109/TCBB.2024.3477909","DOIUrl":"10.1109/TCBB.2024.3477909","url":null,"abstract":"<p><p>Hi-C experiments allow researchers to study and understand the 3D genome organization and its regulatory function. Unfortunately, sequencing costs and technical constraints severely restrict access to high-quality Hi-C data for many cell types. Existing frameworks rely on a sparse Hi-C dataset or cheaper-to-acquire ChIP-seq data to predict Hi-C contact maps with high read coverage. However, these methods fail to generalize to sparse or cross-cell-type inputs because they do not account for the contributions of epigenomic features or the impact of the structural neighborhood in predicting Hi-C reads. We propose GrapHiC, which combines Hi-C and ChIP-seq in a graph representation, allowing more accurate embedding of structural and epigenomic features. Each node represents a binned genomic region, and we assign edge weights using the observed Hi-C reads. Additionally, we embed ChIP-seq and relative positional information as node attributes, allowing our representation to capture structural neighborhoods and the contributions of proteins and their modifications for predicting Hi-C reads. We show that GrapHiC generalizes better than the current state-of-the-art on cross-cell-type settings and sparse Hi-C inputs. Moreover, we can utilize our framework to impute Hi-C reads even when no Hi-C contact map is available, thus making high-quality Hi-C data accessible for many cell types. Availability: https://github.com/rsinghlab/GrapHiC.</p>","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142406376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-09DOI: 10.1109/TCBB.2024.3390374
Zhipeng Cai;Alexander Zelikovsky
{"title":"Guest Editors' Introduction to the Special Section on Bioinformatics Research and Applications","authors":"Zhipeng Cai;Alexander Zelikovsky","doi":"10.1109/TCBB.2024.3390374","DOIUrl":"https://doi.org/10.1109/TCBB.2024.3390374","url":null,"abstract":"","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10712175","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142397043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-09DOI: 10.1109/TCBB.2024.3477592
Dengwei Zhao, Jingyuan Zhou, Shikui Tu, Lei Xu
Generating high-quality and drug-like molecules from scratch within the expansive chemical space presents a significant challenge in the field of drug discovery. In prior research, value-based reinforcement learning algorithms have been employed to generate molecules with multiple desired properties iteratively. The immediate reward was defined as the evaluation of intermediate-state molecules at each step, and the learning objective would be maximizing the expected cumulative evaluation scores for all molecules along the generative path. However, this definition of the reward was misleading, as in reality, the optimization target should be the evaluation score of only the final generated molecule. Furthermore, in previous works, randomness was introduced into the decision-making process, enabling the generation of diverse molecules but no longer pursuing the maximum future rewards. In this paper, immediate reward is defined as the improvement achieved through the modification of the molecule to maximize the evaluation score of the final generated molecule exclusively. Originating from the A ∗ search, path consistency (PC), i.e., f values on one optimal path should be identical, is employed as the objective function in the update of the f value estimator to train a multi-objective de novo drug designer. By incorporating the f value into the decision-making process of beam search, the DrugBA∗ algorithm is proposed to enable the large-scale generation of molecules that exhibit both high quality and diversity. Experimental results demonstrate a substantial enhancement over the state-of-theart algorithm QADD in multiple molecular properties of the generated molecules.
在广阔的化学空间内从零开始生成高质量的类药物分子是药物发现领域的一项重大挑战。在之前的研究中,基于价值的强化学习算法被用来迭代生成具有多种所需特性的分子。即时奖励被定义为每一步对中间状态分子的评估,学习目标是最大化生成路径上所有分子的预期累积评估分数。然而,这种对奖励的定义有误导性,因为实际上,优化目标应该只是最终生成的分子的评价得分。此外,在以前的研究中,决策过程中引入了随机性,从而可以生成多种分子,但不再追求未来的最大回报。在本文中,即时奖励被定义为通过对分子的修改所实现的改进,从而使最终生成的分子的评价得分最大化。路径一致性(PC)源于 A ∗ 搜索,即一条最优路径上的 f 值应完全相同,它被用作更新 f 值估计器的目标函数,以训练多目标全新药物设计器。通过将 f 值纳入波束搜索的决策过程,DrugBA∗ 算法得以大规模生成高质量和多样性的分子。实验结果表明,与最先进的 QADD 算法相比,所生成分子的多种分子特性都有大幅提升。
{"title":"De Novo Drug Design by Multi-Objective Path Consistency Learning with Beam A∗ Search.","authors":"Dengwei Zhao, Jingyuan Zhou, Shikui Tu, Lei Xu","doi":"10.1109/TCBB.2024.3477592","DOIUrl":"10.1109/TCBB.2024.3477592","url":null,"abstract":"<p><p>Generating high-quality and drug-like molecules from scratch within the expansive chemical space presents a significant challenge in the field of drug discovery. In prior research, value-based reinforcement learning algorithms have been employed to generate molecules with multiple desired properties iteratively. The immediate reward was defined as the evaluation of intermediate-state molecules at each step, and the learning objective would be maximizing the expected cumulative evaluation scores for all molecules along the generative path. However, this definition of the reward was misleading, as in reality, the optimization target should be the evaluation score of only the final generated molecule. Furthermore, in previous works, randomness was introduced into the decision-making process, enabling the generation of diverse molecules but no longer pursuing the maximum future rewards. In this paper, immediate reward is defined as the improvement achieved through the modification of the molecule to maximize the evaluation score of the final generated molecule exclusively. Originating from the A ∗ search, path consistency (PC), i.e., f values on one optimal path should be identical, is employed as the objective function in the update of the f value estimator to train a multi-objective de novo drug designer. By incorporating the f value into the decision-making process of beam search, the DrugBA∗ algorithm is proposed to enable the large-scale generation of molecules that exhibit both high quality and diversity. Experimental results demonstrate a substantial enhancement over the state-of-theart algorithm QADD in multiple molecular properties of the generated molecules.</p>","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142390227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-09DOI: 10.1109/TCBB.2024.3429784
Da Yan;Catia Pesquita;Carsten Görg;Jake Y. Chen
{"title":"Guest Editorial Selected Papers From BIOKDD 2022","authors":"Da Yan;Catia Pesquita;Carsten Görg;Jake Y. Chen","doi":"10.1109/TCBB.2024.3429784","DOIUrl":"https://doi.org/10.1109/TCBB.2024.3429784","url":null,"abstract":"","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10712183","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142397044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-09DOI: 10.1109/TCBB.2024.3476619
Xiangwen Wang, Qiaoying Jin, Li Zou, Xianghong Lin, Yonggang Lu
Three-dimensional (3D) reconstruction in single-particle cryo-electron microscopy (cryo-EM) is a critical technique for recovering and studying the fine 3D structure of proteins and other biological macromolecules, where the primary issue is to determine the orientations of projection images with high levels of noise. This paper proposes a method to determine the orientations of cryo-EM projection images using reliable common lines and spherical embeddings. First, the reliability of common lines between projection images is evaluated using a weighted voting algorithm based on an iterative improvement technique and binarized weighting. Then, the reliable common lines are used to calculate the normal vectors and local X-axis vectors of projection images after two spherical embeddings. Finally, the orientations of projection images are determined by aligning the results of the two spherical embeddings using an orthogonal constraint. Experimental results on both synthetic and real cryo-EM projection image datasets demonstrate that the proposed method can achieve higher accuracy in estimating the orientations of projection images and higher resolution in reconstructing preliminary 3D structures than some common line-based methods, indicating that the proposed method is effective in single-particle cryo-EM 3D reconstruction.
单颗粒冷冻电镜(cryo-EM)中的三维(3D)重建是恢复和研究蛋白质及其他生物大分子精细三维结构的关键技术,其中的首要问题是确定高噪声投影图像的方向。本文提出了一种利用可靠的共线和球形嵌入确定冷冻电镜投影图像方向的方法。首先,使用基于迭代改进技术和二值化加权的加权投票算法评估投影图像之间公共线的可靠性。然后,利用可靠的公共线计算经过两次球形嵌入后投影图像的法向量和局部 X 轴向量。最后,利用正交约束对两个球形嵌入的结果进行对齐,从而确定投影图像的方向。在合成和真实冷冻电镜投影图像数据集上的实验结果表明,与一些常见的基于线的方法相比,所提出的方法在估计投影图像的方向方面能达到更高的精度,在重建初步的三维结构方面能达到更高的分辨率,这表明所提出的方法在单粒子冷冻电镜三维重建方面是有效的。
{"title":"Orientation Determination of Cryo-EM Projection Images Using Reliable Common Lines and Spherical Embeddings.","authors":"Xiangwen Wang, Qiaoying Jin, Li Zou, Xianghong Lin, Yonggang Lu","doi":"10.1109/TCBB.2024.3476619","DOIUrl":"10.1109/TCBB.2024.3476619","url":null,"abstract":"<p><p>Three-dimensional (3D) reconstruction in single-particle cryo-electron microscopy (cryo-EM) is a critical technique for recovering and studying the fine 3D structure of proteins and other biological macromolecules, where the primary issue is to determine the orientations of projection images with high levels of noise. This paper proposes a method to determine the orientations of cryo-EM projection images using reliable common lines and spherical embeddings. First, the reliability of common lines between projection images is evaluated using a weighted voting algorithm based on an iterative improvement technique and binarized weighting. Then, the reliable common lines are used to calculate the normal vectors and local X-axis vectors of projection images after two spherical embeddings. Finally, the orientations of projection images are determined by aligning the results of the two spherical embeddings using an orthogonal constraint. Experimental results on both synthetic and real cryo-EM projection image datasets demonstrate that the proposed method can achieve higher accuracy in estimating the orientations of projection images and higher resolution in reconstructing preliminary 3D structures than some common line-based methods, indicating that the proposed method is effective in single-particle cryo-EM 3D reconstruction.</p>","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142390230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-09DOI: 10.1109/TCBB.2024.3477410
Jian Zhong, Haochen Zhao, Qichang Zhao, Jianxin Wang
Precisely predicting Drug-Drug Interactions (DDIs) carries the potential to elevate the quality and safety of drug therapies, protecting the well-being of patients, and providing essential guidance and decision support at every stage of the drug development process. In recent years, leveraging large-scale biomedical knowledge graphs has improved DDI prediction performance. However, the feature extraction procedures in these methods are still rough. More refined features may further improve the quality of predictions. To overcome these limitations, we develop a knowledge graph-based method for multi-typed DDI prediction with contrastive learning (KG-CLDDI). In KG-CLDDI, we combine drug knowledge aggregation features from the knowledge graph with drug topological aggregation features from the DDI graph. Additionally, we build a contrastive learning module that uses horizontal reversal and dropout operations to produce high-quality embeddings for drug-drug pairs. The comparison results indicate that KG-CLDDI is superior to state-of-the-art models in both the transductive and inductive settings. Notably, for the inductive setting, KG-CLDDI outperforms the previous best method by 17.49% and 24.97% in terms of AUC and AUPR, respectively. Furthermore, we conduct the ablation analysis and case study to show the effectiveness of KG-CLDDI. These findings illustrate the potential significance of KG-CLDDI in advancing DDI research and its clinical applications. The codes of KG-CLDDI are available at https://github.com/jianzhong123/KG-CLDDI.
{"title":"A knowledge graph-based method for drug-drug interaction prediction with contrastive learning.","authors":"Jian Zhong, Haochen Zhao, Qichang Zhao, Jianxin Wang","doi":"10.1109/TCBB.2024.3477410","DOIUrl":"https://doi.org/10.1109/TCBB.2024.3477410","url":null,"abstract":"<p><p>Precisely predicting Drug-Drug Interactions (DDIs) carries the potential to elevate the quality and safety of drug therapies, protecting the well-being of patients, and providing essential guidance and decision support at every stage of the drug development process. In recent years, leveraging large-scale biomedical knowledge graphs has improved DDI prediction performance. However, the feature extraction procedures in these methods are still rough. More refined features may further improve the quality of predictions. To overcome these limitations, we develop a knowledge graph-based method for multi-typed DDI prediction with contrastive learning (KG-CLDDI). In KG-CLDDI, we combine drug knowledge aggregation features from the knowledge graph with drug topological aggregation features from the DDI graph. Additionally, we build a contrastive learning module that uses horizontal reversal and dropout operations to produce high-quality embeddings for drug-drug pairs. The comparison results indicate that KG-CLDDI is superior to state-of-the-art models in both the transductive and inductive settings. Notably, for the inductive setting, KG-CLDDI outperforms the previous best method by 17.49% and 24.97% in terms of AUC and AUPR, respectively. Furthermore, we conduct the ablation analysis and case study to show the effectiveness of KG-CLDDI. These findings illustrate the potential significance of KG-CLDDI in advancing DDI research and its clinical applications. The codes of KG-CLDDI are available at https://github.com/jianzhong123/KG-CLDDI.</p>","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142390226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-09DOI: 10.1109/TCBB.2024.3477313
Aditya Malusare, Vaneet Aggarwal
Recent advancements in generative models have established state-of-the-art benchmarks in the generation of molecules and novel drug candidates. Despite these successes, a significant gap persists between generative models and the utilization of extensive biomedical knowledge, often systematized within knowledge graphs, whose potential to inform and enhance generative processes has not been realized. In this paper, we present a novel approach that bridges this divide by developing a framework for knowledge-enhanced generative models called KARL. We develop a scalable methodology to extend the functionality of knowledge graphs while preserving semantic integrity, and incorporate this contextual information into a generative framework to guide a diffusion-based model. The integration of knowledge graph embeddings with our generative model furnishes a robust mechanism for producing novel drug candidates possessing specific characteristics while ensuring validity and synthesizability. KARL outperforms state-of-the-art generative models on both unconditional and targeted generation tasks.
生成模型的最新进展为分子和新型候选药物的生成建立了最先进的基准。尽管取得了这些成就,但在生成模型与利用广泛的生物医学知识(通常在知识图谱中系统化)之间仍然存在着巨大的差距,而这些知识为生成过程提供信息和增强生成过程的潜力尚未实现。在本文中,我们提出了一种新颖的方法,通过开发一个名为 KARL 的知识增强生成模型框架来弥合这一鸿沟。我们开发了一种可扩展的方法来扩展知识图谱的功能,同时保持语义的完整性,并将这种上下文信息纳入生成框架,以指导基于扩散的模型。知识图谱嵌入与我们的生成模型相结合,提供了一种稳健的机制,用于生成具有特定特征的新型候选药物,同时确保有效性和可合成性。KARL 在无条件生成和目标生成任务上的表现都优于最先进的生成模型。
{"title":"Improving Molecule Generation and Drug Discovery with a Knowledge-enhanced Generative Model.","authors":"Aditya Malusare, Vaneet Aggarwal","doi":"10.1109/TCBB.2024.3477313","DOIUrl":"10.1109/TCBB.2024.3477313","url":null,"abstract":"<p><p>Recent advancements in generative models have established state-of-the-art benchmarks in the generation of molecules and novel drug candidates. Despite these successes, a significant gap persists between generative models and the utilization of extensive biomedical knowledge, often systematized within knowledge graphs, whose potential to inform and enhance generative processes has not been realized. In this paper, we present a novel approach that bridges this divide by developing a framework for knowledge-enhanced generative models called KARL. We develop a scalable methodology to extend the functionality of knowledge graphs while preserving semantic integrity, and incorporate this contextual information into a generative framework to guide a diffusion-based model. The integration of knowledge graph embeddings with our generative model furnishes a robust mechanism for producing novel drug candidates possessing specific characteristics while ensuring validity and synthesizability. KARL outperforms state-of-the-art generative models on both unconditional and targeted generation tasks.</p>","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142390228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-08DOI: 10.1109/TCBB.2024.3476453
Kiefer Andre Bedoya Benites, Wilser Andres Garcia-Quispes
Polymerase chain reaction - Restriction Fragment Length Polymorphism (PCR-RFLP) is an established molecular biology technique leveraging DNA sequence variability for organism identification, genetic disease detection, biodiversity analysis, etc. Traditional PCR-RFLP requires wet-laboratory procedures that can result in technical errors, procedural challenges, and financial costs. With the aim of providing an accessible and efficient PCR-RFLP technique complement, we introduce RFLP-inator. This is a comprehensive web-based platform developed in R using the package Shiny, which simulates the PCR-RFLP technique, integrates analysis capabilities, and offers complementary tools for both pre- and post-evaluation of in vitro results. We developed the RFLP-inator's algorithm independently and our platform offers seven dynamic tools: RFLP simulator, Pattern identifier, Enzyme selector, RFLP analyzer, Multiplex PCR, Restriction map maker, and Gel plotter. Moreover, the software includes a restriction pattern database of more than 250,000 sequences of the bacterial 16S rRNA gene. We successfully validated the core tools against published research findings. This new platform is open access and user-friendly, offering a valuable resource for researchers, educators, and students specializing in molecular genetics. RFLP-inator not only streamlines RFLP technique application but also supports pedagogical efforts in genetics, illustrating its utility and reliability. The software is available for free at https://kodebio.shinyapps.io/RFLP-inator/.
{"title":"RFLP-inator: interactive web platform for in silico simulation and complementary tools of the PCR-RFLP technique.","authors":"Kiefer Andre Bedoya Benites, Wilser Andres Garcia-Quispes","doi":"10.1109/TCBB.2024.3476453","DOIUrl":"10.1109/TCBB.2024.3476453","url":null,"abstract":"<p><p>Polymerase chain reaction - Restriction Fragment Length Polymorphism (PCR-RFLP) is an established molecular biology technique leveraging DNA sequence variability for organism identification, genetic disease detection, biodiversity analysis, etc. Traditional PCR-RFLP requires wet-laboratory procedures that can result in technical errors, procedural challenges, and financial costs. With the aim of providing an accessible and efficient PCR-RFLP technique complement, we introduce RFLP-inator. This is a comprehensive web-based platform developed in R using the package Shiny, which simulates the PCR-RFLP technique, integrates analysis capabilities, and offers complementary tools for both pre- and post-evaluation of in vitro results. We developed the RFLP-inator's algorithm independently and our platform offers seven dynamic tools: RFLP simulator, Pattern identifier, Enzyme selector, RFLP analyzer, Multiplex PCR, Restriction map maker, and Gel plotter. Moreover, the software includes a restriction pattern database of more than 250,000 sequences of the bacterial 16S rRNA gene. We successfully validated the core tools against published research findings. This new platform is open access and user-friendly, offering a valuable resource for researchers, educators, and students specializing in molecular genetics. RFLP-inator not only streamlines RFLP technique application but also supports pedagogical efforts in genetics, illustrating its utility and reliability. The software is available for free at https://kodebio.shinyapps.io/RFLP-inator/.</p>","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":null,"pages":null},"PeriodicalIF":3.6,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142390231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}