Pub Date: 2024-07-26 DOI: 10.1109/TCBB.2024.3434461
Siyuan Guo, Jihong Guan, Shuigeng Zhou
In the past decade, artificial intelligence (AI) driven drug design and discovery has been a hot research topic in the AI area, where an important branch is molecule generation by generative models, from GAN-based and VAE-based models to the latest diffusion-based models. However, most existing models mainly pursue basic properties of the generated molecules, such as validity and uniqueness, and only a few go further to explicitly optimize a single important molecular property (e.g., QED or PlogP), which leaves most generated molecules of little use in practice. In this paper, we present a novel approach to generating molecules with desirable properties, which extends the diffusion model framework with multiple innovative designs. The novelty is two-fold. On the one hand, considering that molecular structures are complex and diverse, and that molecular properties are usually determined by certain substructures (e.g., pharmacophores), we propose to perform diffusion on two structural levels, molecules and molecular fragments, from which a mixed Gaussian distribution is obtained for the reverse diffusion process. To obtain desirable molecular fragments, we develop a novel fragmentation method based on electronic effects. On the other hand, we introduce two ways to explicitly optimize multiple molecular properties under the diffusion model framework. First, as potential drug molecules must be chemically valid, we optimize molecular validity with an energy-guidance function. Second, since potential drug molecules should be desirable in various properties, we employ a multi-objective mechanism to optimize multiple molecular properties simultaneously. Extensive experiments on two benchmark datasets, QM9 and ZINC250k, show that the molecules generated by our proposed method have better validity, uniqueness, novelty, Fréchet ChemNet Distance (FCD), QED, and PlogP than those generated by current SOTA models. The code of D2L-OMP is available at https://github.com/bz99bz/D2L-OMP.
Title: Diffusing on Two Levels and Optimizing for Multiple Properties: A Novel Approach to Generating Molecules with Desirable Properties.
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Pub Date: 2024-07-24 DOI: 10.1109/TCBB.2024.3432740
Gao-Fei Wang, Juan Wang, Shasha Yuan, Chun-Hou Zheng, Jin-Xing Liu
Since genomics was proposed, the exploration of genes has been a focus of research. The emergence of single-cell RNA sequencing (scRNA-seq) technology makes it possible to explore gene expression at the single-cell level. Due to the limitations of sequencing technology, the data contain a lot of noise, and they are also high-dimensional and sparse. Clustering is a common method of analyzing scRNA-seq data. This paper proposes a novel single-cell clustering method called Robust Manifold Nonnegative Low-Rank Representation with Adaptive Total-Variation Regularization (MLRR-ATV). Adaptive Total-Variation (ATV) regularization is introduced into the Low-Rank Representation (LRR) model to reduce the influence of noise through gradient learning. Then, the linear and nonlinear manifold structures in the data are learned through Euclidean distance and cosine similarity, so that more valuable information is retained. Because the model is non-convex, we use the Alternating Direction Method of Multipliers (ADMM) to optimize it. We tested the MLRR-ATV model on eight real scRNA-seq datasets and selected nine state-of-the-art methods for comparison. The experimental results show that the MLRR-ATV model performs better than the other nine methods.
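To make the two similarity measures named in the abstract concrete, the following is a minimal sketch, not the MLRR-ATV implementation. The function names and the toy expression vectors are hypothetical and chosen only for illustration; how the paper turns these measures into manifold regularizers is not described in the abstract and is not reproduced here.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two expression vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical toy cells: cell_b has the same expression profile as cell_a
# but twice the counts, as might happen with a different sequencing depth.
cell_a = [1.0, 2.0, 0.0, 3.0]
cell_b = [2.0, 4.0, 0.0, 6.0]

print("Euclidean distance:", euclidean_distance(cell_a, cell_b))
print("Cosine similarity:", cosine_similarity(cell_a, cell_b))
```

In this toy case the Euclidean distance is nonzero while the cosine similarity is essentially 1, which illustrates why the two measures can capture different aspects of the relationship between cells.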
Title: MLRR-ATV: A Robust Manifold Nonnegative LowRank Representation with Adaptive Total-Variation Regularization for scRNA-seq Data Clustering.
Finding the longest common subsequence of multiple sequences (the MLCS problem) is computationally intensive and challenging, and it has significant applications in fields such as text comparison, pattern recognition, and gene diagnosis. Currently, dominant point-based MLCS algorithms are popular and extensively studied. Generally, they construct the directed acyclic graph (DAG) of matching points and convert the MLCS problem into a search for the longest paths in the DAG. Several improvements have been made, focusing on decreasing model size and reducing redundant computations. These include 1) hash methods for eliminating duplicated nodes, 2) dynamic structures for supporting a smaller DAG, and 3) path pruning strategies. However, the algorithms are still too limited for large-scale MLCS problems because 1) the dynamic structures are too time-consuming to maintain and 2) path pruning relies heavily on the tightness of the lower and upper bounds of the MLCS. As a result, the large-scale MLCS problem remains a challenge. We propose a novel algorithm for the large-scale MLCS problem, named dwMLCS. It is based on two models. The first is a dynamic DAG model that is both space and time efficient and can decrease the size of the DAG significantly. The second is a weighted DAG model with new successor strategies. With this model, we design an algorithm for finding a tighter lower bound of the MLCS. Path pruning is then conducted to further reduce the size of the DAG and eliminate redundant computation. Additionally, we propose an upper bound method for improving the efficiency of the path pruning strategy. The experimental results demonstrate that the proposed models and algorithms are more effective and efficient than state-of-the-art algorithms. The source code of dwMLCS can be downloaded from https://github.com/BioLab310/dwMLCS.
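The general idea of turning MLCS into a longest-path search over match points can be sketched as follows. This is a minimal illustration of the generic match-point DAG formulation only, not the dwMLCS algorithm: it has no dominance filtering, no dynamic or weighted DAG, and no bound-based pruning, and the function name `mlcs_length` is hypothetical. It is practical only for short inputs.

```python
from functools import lru_cache


def mlcs_length(seqs):
    """Length of a longest common subsequence of all strings in `seqs`.

    A node is a tuple of positions, one per sequence, at which every
    sequence holds the same character. The successors of a node are, for
    each character, the earliest positions of that character strictly
    after the node in every sequence. The answer is the longest path
    from a virtual source placed before the start of every sequence.
    """
    seqs = tuple(seqs)
    # Only characters present in every sequence can appear in a common subsequence.
    alphabet = set(seqs[0]).intersection(*map(set, seqs[1:]))

    def successors(point):
        for ch in alphabet:
            nxt = []
            for s, p in zip(seqs, point):
                i = s.find(ch, p + 1)
                if i < 0:
                    break  # ch does not occur again in this sequence
                nxt.append(i)
            else:
                yield tuple(nxt)

    @lru_cache(maxsize=None)
    def longest_from(point):
        # Memoized longest path (in matched characters) starting after `point`.
        return max((1 + longest_from(q) for q in successors(point)), default=0)

    return longest_from(tuple(-1 for _ in seqs))


print(mlcs_length(("ABCBDAB", "BDCABA")))
print(mlcs_length(("ABC", "ACB", "BAC")))
```

The memoization plays the role of the duplicate-node elimination mentioned above: each match point is expanded once, however many paths reach it.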
Title: dwMLCS: An Efficient MLCS Algorithm based on Dynamic and Weighted Directed Acyclic Graph.
Authors: Changyong Yu, Dekuan Gao, Xu Guo, Haitao Ma, Yuhai Zhao, Guoren Wang
Pub Date: 2024-07-22 DOI: 10.1109/TCBB.2024.3431558
Pub Date: 2024-07-22 DOI: 10.1109/TCBB.2024.3431688
Aditya Kumar, Deepak Singh
Viruses pose a longstanding and enduring danger to various forms of life. Despite ongoing endeavors to combat viral diseases, there remains a need to explore and develop novel therapeutic options. Antiviral peptides are bioactive molecules with a favorable toxicity profile, making them promising alternatives for treating viral infections. This article therefore employs a generative adversarial network (GAN) for antiviral peptide augmentation and a novel two-step authentication process for the augmented synthetic peptides to enhance antiviral activity prediction. Additionally, five widely used deep learning models are employed for classification. First, a GAN is used to augment the antiviral peptides. In the two-step authentication process, NCBI-BLAST is used to identify the resemblance in antiviral activity between the synthetic and real peptides. Subsequently, the hydrophobicity, hydrophilicity, hydroxylic nature, positive charge, and negative charge of the synthetic and authentic antiviral peptides are compared before the synthetic peptides are used. Finally, to examine the impact of authenticated peptide augmentation on the prediction of antiviral peptides, the results are compared with those of prediction without peptide augmentation. The study demonstrates that the 1-D convolutional neural network with augmented peptides exhibits superior performance compared to the other employed classifiers and state-of-the-art models. The network attains a mean classification accuracy of 95.41%, an AUC value of 0.95, and an MCC value of 0.90 on the benchmark antiviral and anti-corona peptides dataset. Thus, the performance of the proposed model indicates its efficacy in predicting the antiviral activity of peptides.
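For readers less familiar with the reported metrics, accuracy and the Matthews correlation coefficient (MCC) can be computed from a binary confusion matrix as sketched below. The counts used in the example are hypothetical and are not taken from the paper; the function names are likewise illustrative.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1].

    Returns 0.0 when any marginal sum is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical confusion-matrix counts, NOT results from the paper.
tp, tn, fp, fn = 90, 95, 5, 10
print("accuracy:", accuracy(tp, tn, fp, fn))
print("MCC:", mcc(tp, tn, fp, fn))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are imbalanced.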
Title: Generative Adversarial Network-Based Augmentation with Noval 2-step Authentication for Anti-coronavirus Peptide Prediction.
Pub Date: 2024-07-17 DOI: 10.1109/TCBB.2024.3429546
Wenkang Wang, Xiangmao Meng, Ju Xiang, Hayat Dino Bedru, Min Li
Identification of protein complexes is an important issue in the field of systems biology and is crucial to understanding cellular organization and inferring protein functions. Recently, many computational methods have been proposed to detect protein complexes from protein-protein interaction (PPI) networks. However, most of these methods focus only on the local information of proteins in the PPI network, so they are easily affected by noise in the network. Meanwhile, it is still challenging to detect protein complexes, especially in overlapping cases. To address these issues, we propose a new method, named Dopcc, to detect overlapping protein complexes by constructing a multi-metrics network according to different strategies. First, we adopt the Jaccard coefficient to measure the neighbor similarity between proteins and denoise the PPI network. Then, we propose a new strategy, integrating hierarchical compressing with network embedding, to capture the high-order structural similarity between proteins. Further, a new co-core attachment strategy is proposed to detect overlapping protein complexes from the multi-metrics network. The experimental results show that our proposed method, Dopcc, outperforms eight other state-of-the-art methods in terms of F-measure, MMR, and Composite Score on two yeast datasets. The source code and datasets can be downloaded from https://github.com/CSUBioGroup/Dopcc.
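The first step, scoring each interaction by the Jaccard coefficient of the two proteins' neighbor sets and discarding low-scoring edges, might look like the sketch below. This is an illustration under stated assumptions, not the Dopcc code: the use of open neighborhoods, the threshold value of 0.2, the function names, and the toy network are all hypothetical, since the abstract does not specify them.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def denoise(edges, threshold=0.2):
    """Keep only edges whose endpoints have sufficiently similar neighbor sets.

    `threshold` is an assumed cut-off for illustration. Returns a list of
    (u, v, score) triples in the input edge order.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    kept = []
    for u, v in edges:
        score = jaccard(neighbors[u], neighbors[v])
        if score >= threshold:
            kept.append((u, v, score))
    return kept


# Hypothetical toy network: a triangle A-B-C plus a pendant edge C-D.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
for u, v, score in denoise(toy_edges):
    print(u, v, round(score, 3))
```

In the toy network the pendant edge C-D is dropped because its endpoints share no neighbors, while the three triangle edges are kept.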
Title: Dopcc: Detecting overlapping protein complexes via multi-metrics and co-core attachment method.
Pub Date: 2024-07-16 DOI: 10.1109/TCBB.2024.3429234
Rajesh Kumar Mundotiya, Juhi Priya, Divya Kuwarbi, Teekam Singh
One of the primary tasks in the early stages of data mining involves the identification of entities from biomedical corpora. Traditional approaches relying on robust feature engineering face challenges when learning from available (un-)annotated data using data-driven models like deep learning-based architectures. Despite leveraging large corpora and advanced deep learning models, domain generalization remains an issue. Attention mechanisms are effective in capturing longer sentence dependencies and extracting semantic and syntactic information from limited annotated datasets. To address out-of-vocabulary challenges in biomedical text, the PCA-CLS (Position and Contextual Attention with CNN-LSTM-Softmax) model combines global self-attention and character-level convolutional neural network techniques. The model's performance is evaluated on eight distinct biomedical domain datasets encompassing entities such as genes, drugs, diseases, and species. The PCA-CLS model outperforms several state-of-the-art models, achieving notable F1-scores, including 88.19% on BC2GM, 85.44% on JNLPBA, 90.80% on BC5CDR-chemical, 87.07% on BC5CDR-disease, 89.18% on BC4CHEMD, 88.81% on NCBI, and 91.59% on the s800 dataset.
Title: Enhancing Generalizability in Biomedical Entity Recognition: Self-Attention PCA-CLS Model.
Pub Date: 2024-07-15 DOI: 10.1109/TCBB.2024.3427381
Kamal Taha
This review article delves deeply into the various machine learning (ML) methods and algorithms employed in discerning protein functions. Each method discussed is assessed for its efficacy, limitations, potential improvements, and future prospects. We present an innovative hierarchical classification system that arranges algorithms into intricate categories and unique techniques. This taxonomy is based on a tri-level hierarchy, starting with the methodology category and narrowing down to specific techniques. Such a framework allows for a structured and comprehensive classification of algorithms, assisting researchers in understanding the interrelationships among diverse algorithms and techniques. The study incorporates both empirical and experimental evaluations to differentiate between the techniques. The empirical evaluation ranks the techniques based on four criteria. The experimental assessments rank: (1) individual techniques under the same methodology subcategory, (2) different sub-categories within the same category, and (3) the broad categories themselves. Integrating the innovative methodological classification, empirical findings, and experimental assessments, the article offers a well-rounded understanding of ML strategies in protein function identification. The paper also explores techniques for multi-task and multi-label detection of protein functions, in addition to focusing on single-task methods. Moreover, the paper sheds light on the future avenues of ML in protein function determination.
Title: Employing Machine Learning Techniques to Detect Protein Function: A Survey, Experimental, and Empirical Evaluations.
Pub Date : 2024-07-12DOI: 10.1109/TCBB.2024.3426999
Suzanne W Dietrich, Wenli Ma, Yian Ding, Karen H Watanabe, Mary B Zelinski, James P Sluka
The goal of the Multispecies Ovary Tissue Histology Electronic Repository (MOTHER) project is to establish a collection of nonhuman ovary histology images for multiple species as a resource for researchers and educators. An important component of sharing scientific data is the inclusion of the contextual metadata that describes the data. MOTHER extends the Ecological Metadata Language (EML) for documenting research data, leveraging its data provenance and usage license elements and adding metadata specific to ovary histology images. The MOTHER metadata design covers the donor animal (including reproductive cycle status), the slide, and its preparation. MOTHER also extends the ezEML tool, in a version called ezEML+MOTHER, for specifying the metadata. The MOTHER database (MOTHERDB) captures the metadata about the histology images, providing a searchable resource for discovering relevant images. MOTHER also defines a curation process for ingesting a collection of images and its metadata, which verifies the validity of the metadata before the collection is included in MOTHER. A Web search interface uses filters to identify relevant images based on characteristics recorded in the metadata, such as genus and species.
{"title":"MOTHER-DB: A Database for Sharing Nonhuman Ovarian Histology Images.","authors":"Suzanne W Dietrich, Wenli Ma, Yian Ding, Karen H Watanabe, Mary B Zelinski, James P Sluka","doi":"10.1109/TCBB.2024.3426999","DOIUrl":"10.1109/TCBB.2024.3426999","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics"},"PeriodicalIF":3.6,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141599244","RegionNum":3,"RegionCategory":"Biology"}
Deep learning approaches, such as convolutional neural networks (CNNs) and deep recurrent neural networks (RNNs), have been the backbone of protein function prediction, with promising state-of-the-art (SOTA) results. RNNs offer a strong sequential processing mechanism through their built-in ability to (i) focus on past information, (ii) collect both short- and long-range dependency information, and (iii) process sequences bi-directionally. CNNs, however, are confined to focusing on short-term information from both the past and the future, although they offer parallelism. Therefore, a novel bi-directional CNN that strictly complies with the sequential processing mechanism of RNNs is introduced and used to develop a protein function prediction framework, Bi-SeqCNN. This is a sub-sequence-based framework. Further, Bi-SeqCNN+ is an ensemble approach that improves the prediction results. To our knowledge, this is the first time bi-directional CNNs are employed for general temporal data analysis and not just for protein sequences. The proposed architecture produces improvements of up to +5.5% over contemporary SOTA methods on three benchmark protein sequence datasets. Moreover, it is substantially lighter and attains these results with (0.50-0.70 times) fewer parameters than the SOTA methods.
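The abstract does not spell out the Bi-SeqCNN architecture, so the following is only a generic sketch of the underlying idea: a causal convolution sees nothing ahead of the current position, and running it once left-to-right and once right-to-left yields RNN-like bi-directional context. The function names, the zero padding, and the pairing of the two outputs are assumptions made for illustration, not the authors' design.

```python
from typing import List, Sequence, Tuple


def causal_conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Causal 1D convolution: output[t] depends only on x[t-k+1 .. t].

    kernel[-1] weights the current step and kernel[0] the oldest step in the
    window. Positions before the start of the sequence are treated as zeros.
    """
    k = len(kernel)
    # Left-pad with zeros so that no future value is ever visible.
    padded = [0.0] * (k - 1) + [float(v) for v in x]
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(x))
    ]


def bidirectional_causal_conv1d(
    x: Sequence[float],
    kernel_fwd: Sequence[float],
    kernel_bwd: Sequence[float],
) -> List[Tuple[float, float]]:
    """Pair a past-only feature with a future-only feature at every position."""
    forward = causal_conv1d(x, kernel_fwd)
    # Reverse the sequence, apply the same causal operator, then reverse back,
    # so that backward[t] depends only on x[t .. t+k-1].
    backward = causal_conv1d(list(reversed(x)), kernel_bwd)[::-1]
    return list(zip(forward, backward))
```

With the input [1, 2, 3, 4] and the kernel [1, 1] in both directions, the forward features are [1, 3, 5, 7] (current plus previous value) and the backward features are [3, 5, 7, 4] (current plus next value). Changing the last input value leaves the first three forward features untouched, which is the sequential property the abstract attributes to RNNs.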
{"title":"Bi-SeqCNN: A Novel Light-weight Bi-directional CNN Architecture for Protein Function Prediction.","authors":"Vikash Kumar, Akshay Deepak, Ashish Ranjan, Aravind Prakash","doi":"10.1109/TCBB.2024.3426491","DOIUrl":"https://doi.org/10.1109/TCBB.2024.3426491","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics"},"PeriodicalIF":3.6,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141590210","RegionNum":3,"RegionCategory":"Biology"}
Pub Date : 2024-07-08DOI: 10.1109/TCBB.2024.3424400
Wentao Zhu, Zhiqiang Du, Ziang Xu, Defu Yang, Minghan Chen, Qianqian Song
Alzheimer's disease (AD) is the most common neurodegenerative disease, and it consumes considerable medical resources as the number of patients increases every year. Mounting evidence shows that regulatory disruptions altering the intrinsic activity of genes in brain cells contribute to AD pathogenesis. To gain insights into the underlying gene regulation in AD, we proposed a graph learning method, Single-Cell based Regulatory Network (SCRN), to identify the regulatory mechanisms based on single-cell data. SCRN implements γ-decaying heuristic link prediction based on graph neural networks and can identify reliable gene regulatory networks using locally closed subgraphs. In this work, we first performed UMAP dimensionality reduction on single-cell RNA sequencing (scRNA-seq) data of AD and normal samples. Then we used SCRN to construct the gene regulatory network based on three well-recognized AD genes (APOE, CX3CR1, and P2RY12). Enrichment analysis of the regulatory network revealed significant pathways including NGF signaling, ERBB2 signaling, and hemostasis. These findings demonstrate the feasibility of using SCRN to uncover potential biomarkers and therapeutic targets related to AD.
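For readers unfamiliar with the term, a γ-decaying heuristic scores a candidate link by summing contributions from progressively longer paths, each damped by a power of a decay factor γ. The Katz index is a standard example of this form. The sketch below computes a truncated Katz score on a toy undirected graph; the truncation length, the node labels, and the helper names are illustrative assumptions, and the code is not SCRN itself, which the abstract describes as learning such heuristics with a graph neural network.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Adjacency = Dict[Hashable, Set[Hashable]]


def build_adjacency(edges: Iterable[Tuple[Hashable, Hashable]]) -> Adjacency:
    """Build an undirected adjacency map from an edge list."""
    adj: Adjacency = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def truncated_katz(
    adj: Adjacency,
    x: Hashable,
    y: Hashable,
    gamma: float = 0.5,
    max_len: int = 4,
) -> float:
    """Sum over l = 1..max_len of gamma**l * (number of walks of length l from x to y)."""
    # walks[n] = number of walks of the current length that start at x and end at n
    walks: Dict[Hashable, int] = {x: 1}
    score = 0.0
    for length in range(1, max_len + 1):
        extended: Dict[Hashable, int] = {}
        for node, count in walks.items():
            for neighbor in adj.get(node, ()):
                extended[neighbor] = extended.get(neighbor, 0) + count
        walks = extended
        # Longer walks contribute less because gamma < 1.
        score += (gamma ** length) * walks.get(y, 0)
    return score
```

On the toy path graph a-b-c-d with γ = 0.5 and walks of length up to 3, the pair (a, c) scores 0.25 and the pair (a, d) scores 0.125, so the closer pair is ranked as the more likely link. Because distant structure is damped geometrically, such scores can be approximated well from a local enclosing subgraph, which is consistent with the abstract's mention of locally closed subgraphs.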
{"title":"SCRN: Single-cell Gene Regulatory Network Identification in Alzheimer's Disease.","authors":"Wentao Zhu, Zhiqiang Du, Ziang Xu, Defu Yang, Minghan Chen, Qianqian Song","doi":"10.1109/TCBB.2024.3424400","DOIUrl":"https://doi.org/10.1109/TCBB.2024.3424400","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics"},"PeriodicalIF":3.6,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"141558630","RegionNum":3,"RegionCategory":"Biology"}