IEEE/ACM Transactions on Computational Biology and Bioinformatics: Latest Articles, Page 8

KGRLFF: Detecting Drug-Drug Interactions Based on Knowledge Graph Representation Learning and Feature Fusion

IF 3.6, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-07-29 DOI: 10.1109/TCBB.2024.3434992

Xiaoli Lin;Zhuang Yin;Xiaolong Zhang;Jing Hu

Accurate prediction of drug-drug interactions (DDIs) plays an important role in improving the efficiency of drug development and ensuring the safety of combination therapy. Most existing models rely on a single source of information to predict DDIs, and few models can perform tasks on biomedical knowledge graphs. This paper proposes a new hybrid method, namely Knowledge Graph Representation Learning and Feature Fusion (KGRLFF), to fully exploit the information from the biomedical knowledge graph and molecular structure of drugs to better predict DDIs. KGRLFF first uses a Bidirectional Random Walk sampling method based on the PageRank algorithm (BRWP) to obtain higher-order neighborhood information of drugs in the knowledge graph, including neighboring nodes, semantic relations, and higher-order information associated with triple facts. Then, an embedded representation learning model named Knowledge Graph-based Cyclic Recursive Aggregation (KGCRA) is used to learn the embedded representations of drugs by recursively propagating and aggregating messages with drugs as both the source and destination. In addition, the model learns the molecular structures of the drugs to obtain the structured features. Finally, a Feature Representation Fusion Strategy (FRFS) was developed to integrate embedded representations and structured feature representations. Experimental results showed that KGRLFF is feasible for predicting potential DDIs.
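
The neighborhood-sampling step can be illustrated with a small sketch. The abstract does not spell out how BRWP works, so the code below is only a generic random walk with restart over a toy list of triples, with every edge traversable in both directions. The function name `sample_neighborhood`, the restart probability, the walk parameters, and the toy triples are all assumptions made for illustration, not the authors' method.

```python
import random
from collections import Counter, defaultdict


def sample_neighborhood(triples, start, num_walks=200, walk_len=4,
                        restart=0.15, top_k=3, seed=0):
    """Rank nodes near `start` by how often random walks with restart visit them.

    Hypothetical helper: a generic stand-in for PageRank-style sampling,
    not the BRWP procedure from the paper.
    """
    rng = random.Random(seed)

    # Build an adjacency list in which every triple can be walked both ways.
    adj = defaultdict(list)
    for head, rel, tail in triples:
        adj[head].append((rel, tail))
        adj[tail].append((rel + "^-1", head))

    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            # Jump back to the start node with probability `restart`,
            # or when the current node has no neighbors.
            if rng.random() < restart or not adj[node]:
                node = start
                continue
            _, node = rng.choice(adj[node])
            if node != start:
                visits[node] += 1

    # The most frequently visited nodes form the sampled neighborhood.
    return [name for name, _ in visits.most_common(top_k)]


toy_triples = [
    ("drugA", "targets", "protein1"),
    ("drugB", "targets", "protein1"),
    ("drugA", "treats", "disease1"),
    ("drugC", "treats", "disease2"),
]
print(sample_neighborhood(toy_triples, "drugA"))
```

In this toy graph, `drugB` is reachable from `drugA` only through the shared target, so it can enter the sampled neighborhood only as a second-order neighbor, while `drugC` is never reached.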

Volume 21, Issue 6, pp. 2035-2049. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10613488

Citations: 0

HGLA: Biomolecular Interaction Prediction Based on Mixed High-Order Graph Convolution With Filter Network via LSTM and Channel Attention

Pub Date : 2024-07-26 DOI: 10.1109/TCBB.2024.3434399

Zhen Zhang;Zhaohong Deng;Ruibo Li;Wei Zhang;Qiongdan Lou;Kup-Sze Choi;Shitong Wang

Predicting biomolecular interactions is significant for understanding biological systems. Most existing methods for link prediction are based on graph convolution. Although graph convolution methods are advantageous in extracting structure information of biomolecular interactions, two key challenges still remain. One is how to consider both the immediate and high-order neighbors. Another is how to reduce noise when aggregating high-order neighbors. To address these challenges, we propose a novel method, called mixed high-order graph convolution with filter network via LSTM and channel attention (HGLA), to predict biomolecular interactions. Firstly, the basic and high-order features are extracted respectively through the traditional graph convolutional network (GCN) and the two-layer Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing (MixHop). Secondly, these features are mixed and input into the filter network composed of LayerNorm, SENet and LSTM to generate filtered features, which are concatenated and used for link prediction. The advantages of HGLA are: 1) HGLA processes high-order features separately, rather than simply concatenating them; 2) HGLA better balances the basic features and high-order features; 3) HGLA effectively filters the noise from high-order neighbors. It outperforms state-of-the-art networks on four benchmark datasets.
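
The high-order feature extraction can be sketched in a few lines of NumPy. This is a minimal illustration of the general MixHop idea (propagate features with several powers of the normalized adjacency matrix and concatenate the results), not the HGLA implementation. The choice of powers 0, 1, and 2, the ReLU activation, and the function names are assumptions.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def mixhop_layer(a_norm, x, weights):
    """Concatenate ReLU(a_norm^j @ x @ weights[j]) for j = 0, 1, 2, ...

    weights[j] is the (hypothetical) weight matrix for the j-th power.
    """
    outputs = []
    propagated = x  # a_norm^0 @ x
    for w in weights:
        outputs.append(np.maximum(propagated @ w, 0.0))  # ReLU
        propagated = a_norm @ propagated  # move to the next power
    return np.concatenate(outputs, axis=1)


# Toy path graph with 4 nodes and 3 input features per node.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
weights = [rng.normal(size=(3, 2)) for _ in range(3)]

out = mixhop_layer(normalize_adjacency(adj), x, weights)
print(out.shape)
```

Each power of the adjacency matrix contributes its own block of output columns, so immediate and higher-order neighbors stay separable for whatever filtering stage follows.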

Code is available at https://github.com/zznb123/HGLA.

Volume 21, Issue 6, pp. 2011-2024.

Citations: 0

Machine Learning-Assisted High-Throughput Screening for Anti-MRSA Compounds

Pub Date : 2024-07-26 DOI: 10.1109/TCBB.2024.3434340

Fadi Shehadeh;LewisOscar Felix;Markos Kalligeros;Adnan Shehadeh;Beth Burgwyn Fuchs;Frederick M. Ausubel;Paul P. Sotiriadis;Eleftherios Mylonakis

Background: Antimicrobial resistance is a major public health threat, and new agents are needed. Computational approaches have been proposed to reduce the cost and time needed for compound screening. Aims: A machine learning (ML) model was developed for the in silico screening of low molecular weight molecules. Methods: We used the results of a high-throughput Caenorhabditis elegans methicillin-resistant Staphylococcus aureus (MRSA) liquid infection assay to develop ML models for compound prioritization and quality control. Results: The compound prioritization model achieved an AUC of 0.795 with a sensitivity of 81% and a specificity of 70%. When applied to a validation set of 22,768 compounds, the model identified 81% of the active compounds identified by high-throughput screening (HTS) among only 30.6% of the total 22,768 compounds, resulting in a 2.67-fold increase in hit rate. When we retrained the model on all the compounds of the HTS dataset, it further identified 45 discordant molecules classified as non-hits by the HTS, with 42/45 (93%) having known antimicrobial activity. Conclusion: Our ML approach can be used to increase HTS efficiency by reducing the number of compounds that need to be physically screened and identifying potential missed hits, making HTS more accessible and reducing barriers to entry.
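
The relation between the reported recall and the fold increase in hit rate can be checked with simple arithmetic. If a model recovers a fraction r of all actives while flagging only a fraction f of the library, the hit rate in the flagged subset is r/f times the baseline hit rate. The sketch below applies this to the rounded figures quoted in the abstract; the function name is hypothetical, and the small gap to the reported 2.67-fold presumably comes from rounding in the published percentages.

```python
def hit_rate_enrichment(recall, fraction_screened):
    """Fold increase in hit rate when only `fraction_screened` of the library is tested."""
    if not 0 < fraction_screened <= 1:
        raise ValueError("fraction_screened must be in (0, 1]")
    return recall / fraction_screened


# Rounded figures taken from the abstract.
total_compounds = 22_768
fraction_screened = 0.306
recall = 0.81

compounds_to_screen = round(total_compounds * fraction_screened)
fold = hit_rate_enrichment(recall, fraction_screened)
print(compounds_to_screen, round(fold, 2))
```

With these inputs, roughly 6,967 compounds would need physical screening, and the enrichment comes out at about 2.65-fold.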

Volume 21, Issue 6, pp. 1911-1921.

Citations: 0

Diffusing on Two Levels and Optimizing for Multiple Properties: A Novel Approach to Generating Molecules With Desirable Properties

Pub Date : 2024-07-26 DOI: 10.1109/TCBB.2024.3434461

Siyuan Guo;Jihong Guan;Shuigeng Zhou

In the past decade, artificial intelligence (AI)-driven drug design and discovery has been a hot research topic, and one important branch is molecule generation by generative models, from GAN-based and VAE-based models to the latest diffusion-based models. However, most existing models mainly pursue basic properties such as the validity and uniqueness of the generated molecules; only a few go further and explicitly optimize a single important molecular property (e.g., QED or PlogP), so most generated molecules are of little use in practice. In this paper, we present a novel approach to generating molecules with desirable properties, which extends the diffusion model framework with multiple innovative designs. The novelty is two-fold. On the one hand, considering that molecular structures are complex and diverse, and that molecular properties are usually determined by certain substructures (e.g., pharmacophores), we propose to perform diffusion on two structural levels, molecules and molecular fragments, from which a mixed Gaussian distribution is obtained for the reverse diffusion process. To obtain desirable molecular fragments, we develop a novel fragmentation method based on the electronic effect. On the other hand, we introduce two ways to explicitly optimize multiple molecular properties under the diffusion model framework. First, as potential drug molecules must be chemically valid, we optimize molecular validity with an energy-guidance function. Second, since potential drug molecules should be desirable in various properties, we employ a multi-objective mechanism to optimize multiple molecular properties simultaneously. Extensive experiments on two benchmark datasets, QM9 and ZINC250k, show that the molecules generated by our proposed method have better validity, uniqueness, novelty, Fréchet ChemNet Distance (FCD), QED, and PlogP than those generated by current SOTA models.

Code for D2L-OMP is available at https://github.com/bz99bz/D2L-OMP.

Volume 21, Issue 6, pp. 2050-2063.

Citations: 0

MLRR-ATV: A Robust Manifold Nonnegative Low-Rank Representation with Adaptive Total-Variation Regularization for scRNA-seq Data Clustering

Pub Date : 2024-07-24 DOI: 10.1109/TCBB.2024.3432740

Gao-Fei Wang;Juan Wang;Shasha Yuan;Chun-Hou Zheng;Jin-Xing Liu

Since genomics was proposed, the exploration of genes has been a focus of research. The emergence of single-cell RNA sequencing (scRNA-seq) technology makes it possible to explore gene expression at the single-cell level. Due to the limitations of sequencing technology, the data contain a lot of noise and are also high-dimensional and sparse. Clustering is a common method of analyzing scRNA-seq data. This paper proposes a novel single-cell clustering method called Robust Manifold Nonnegative Low-Rank Representation with Adaptive Total-Variation Regularization (MLRR-ATV). Adaptive Total-Variation (ATV) regularization is introduced into the Low-Rank Representation (LRR) model to reduce the influence of noise through gradient learning. Then, the linear and nonlinear manifold structures in the data are learned through Euclidean distance and cosine similarity, so that more valuable information is retained. Because the model is non-convex, we use the Alternating Direction Method of Multipliers (ADMM) to optimize it. We tested the performance of the MLRR-ATV model on eight real scRNA-seq datasets and selected nine state-of-the-art methods for comparison. The experimental results show that MLRR-ATV outperforms the other nine methods.
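
The two similarity measures named above can be computed for a cell-by-gene matrix as follows. This is only a sketch of the pairwise quantities, assuming rows are cells and columns are genes; how the paper turns them into manifold regularizers is not stated in the abstract, and the function names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between rows; all-zero rows get similarity 0."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty cells
    unit = x / norms
    return unit @ unit.T


def euclidean_distance_matrix(x):
    """Pairwise Euclidean distance between rows."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off


# Three toy "cells" with two "genes" each.
cells = np.array([[1.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 3.0]])
cos = cosine_similarity_matrix(cells)
dist = euclidean_distance_matrix(cells)
print(cos.round(2))
print(dist.round(2))
```

The first two toy cells have the same expression direction but different magnitudes, so their cosine similarity is 1 while their Euclidean distance is 1. The two measures therefore capture different aspects of the structure in the data.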

Early access (volume PP, pages not yet assigned).

Citations: 0

dwMLCS: An Efficient MLCS Algorithm Based on Dynamic and Weighted Directed Acyclic Graph

Pub Date : 2024-07-22 DOI: 10.1109/TCBB.2024.3431558

Changyong Yu;Dekuan Gao;Xu Guo;Haitao Ma;Yuhai Zhao;Guoren Wang

Finding the longest common subsequence of multiple sequences (the MLCS problem) is a computationally intensive and challenging problem with significant applications in fields such as text comparison, pattern recognition, and gene diagnosis. Currently, dominant-point-based MLCS algorithms are popular and extensively studied. Generally, they construct a directed acyclic graph (DAG) of matching points and convert the MLCS problem into a search for the longest paths in the DAG. Several improvements have been made, focusing on decreasing model size and reducing redundant computations. These include 1) hash methods for eliminating duplicated nodes, 2) dynamic structures for supporting a smaller DAG, and 3) path pruning strategies. However, these algorithms are still too limited for large-scale MLCS problems because 1) the dynamic structures are too time-consuming to maintain and 2) path pruning relies heavily on the tightness of the lower and upper bounds of the MLCS. As a result, the large-scale MLCS problem remains a challenge. We propose a novel algorithm for the large-scale MLCS problem, named dwMLCS. It is based on two models. The first is a dynamic DAG model that is both space and time efficient and can decrease the size of the DAG significantly. The second is a weighted DAG model with new successor strategies. With this model, we design an algorithm for finding a tighter lower bound of the MLCS. Path pruning is then conducted to further reduce the size of the DAG and eliminate redundant computation. Additionally, we propose an upper bound method for improving the efficiency of the path pruning strategy. The experimental results demonstrate that the proposed models and algorithms are more effective and efficient than state-of-the-art algorithms.
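
The reduction from MLCS to a longest-path search in a DAG of matching points can be sketched for small inputs. The code below is a plain memoized search and is not dwMLCS: it has no dynamic DAG maintenance, no weighting, and no bound-based pruning. A node is a tuple with one position per sequence, and the successor of a node for a given character is the tuple of next occurrences of that character in every sequence. The function name `mlcs` is an assumption.

```python
from functools import lru_cache


def mlcs(sequences):
    """Return one longest common subsequence of all strings in `sequences`.

    Simplified illustration for small inputs; expects a non-empty list.
    """
    # Only characters present in every sequence can appear in the result.
    alphabet = sorted(set.intersection(*(set(s) for s in sequences)))

    def successor(point, ch):
        """Next match point for `ch` strictly after `point`, or None."""
        nxt = []
        for seq, pos in zip(sequences, point):
            idx = seq.find(ch, pos + 1)
            if idx == -1:
                return None
            nxt.append(idx)
        return tuple(nxt)

    @lru_cache(maxsize=None)
    def longest_from(point):
        """Longest path (as a string) starting after `point` in the DAG."""
        best = ""
        for ch in alphabet:
            nxt = successor(point, ch)
            if nxt is not None:
                candidate = ch + longest_from(nxt)
                if len(candidate) > len(best):
                    best = candidate
        return best

    # Virtual source node placed before the first character of each sequence.
    source = tuple(-1 for _ in sequences)
    return longest_from(source)


print(mlcs(["ABCBDAB", "BDCABA"]))
print(mlcs(["ACGTAC", "CGTTAC", "AGTAC"]))
```

The memo table plays the role of the DAG here, and its size is what grows quickly with the number and length of sequences. That growth is the cost that the dynamic-structure and pruning ideas described above aim to reduce.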

The source code of dwMLCS can be downloaded from https://github.com/BioLab310/dwMLCS.

Volume 21, Issue 6, pp. 1987-1999.

Citations: 0

Generative Adversarial Network-Based Augmentation With Noval 2-Step Authentication for Anti-Coronavirus Peptide Prediction

Pub Date : 2024-07-22 DOI: 10.1109/TCBB.2024.3431688

Aditya Kumar;Deepak Singh

Viruses pose a longstanding and enduring danger to various forms of life. Despite ongoing efforts to combat viral diseases, novel therapeutic options still need to be explored and developed. Antiviral peptides are bioactive molecules with a favorable toxicity profile, making them promising alternatives for treating viral infections. This article therefore employs a generative adversarial network (GAN) for antiviral peptide augmentation and a novel two-step authentication process for the augmented synthetic peptides to enhance antiviral activity prediction. Five widely used deep learning models are employed for classification. First, a GAN is used to augment the antiviral peptides. In the two-step authentication process, NCBI-BLAST is used to identify the resemblance in antiviral activity between the synthetic and real peptides. Subsequently, the hydrophobicity, hydrophilicity, hydroxylic nature, positive charge, and negative charge of the synthetic and authentic antiviral peptides are compared before the synthetic peptides are used. To examine the impact of authenticated peptide augmentation on antiviral peptide prediction, the results are then compared with those of prediction without peptide augmentation. The study demonstrates that the 1-D convolutional neural network with augmented peptides outperforms the other classifiers employed and state-of-the-art models. The network attains a mean classification accuracy of 95.41%, an AUC of 0.95, and an MCC of 0.90 on the benchmark antiviral and anti-corona peptide dataset. Thus, the performance of the proposed model indicates its efficacy in predicting the antiviral activity of peptides.
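
The second authentication step compares physicochemical profiles of synthetic and real peptides. A minimal sketch of such a comparison is shown below, using residue-composition fractions for four of the five listed properties. The residue groupings follow one common textbook convention and are an assumption, as are the function names; the abstract does not say how the authors quantify each property, and hydrophilicity is left out here because its definition depends on the scale chosen.

```python
# Assumed residue groupings (one common convention, not taken from the paper).
RESIDUE_GROUPS = {
    "hydrophobic": set("AVLIMFWP"),
    "hydroxylic": set("STY"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def composition(peptide):
    """Fraction of residues in each group for a one-letter peptide sequence."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("empty peptide")
    return {
        name: sum(res in group for res in peptide) / len(peptide)
        for name, group in RESIDUE_GROUPS.items()
    }


def max_profile_gap(peptide_a, peptide_b):
    """Largest absolute difference between the two composition profiles."""
    prof_a, prof_b = composition(peptide_a), composition(peptide_b)
    return max(abs(prof_a[key] - prof_b[key]) for key in prof_a)


# Toy sequences for illustration only; they are not from the benchmark data.
real_peptide = "KKLLDE"
synthetic_peptide = "KRLIDE"
print(composition(real_peptide))
print(max_profile_gap(real_peptide, synthetic_peptide))
```

A screening rule could then keep a synthetic peptide only if its gap to some real peptide falls below a chosen threshold. Any such threshold would be a further assumption.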

Volume 21, Issue 6, pp. 1942-1954. Publication type: Journal Article.

Citations: 0

Dopcc: Detecting Overlapping Protein Complexes via Multi-Metrics and Co-Core Attachment Method

IF 3.6, Zone 3 (Biology), Q2, BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-07-17 DOI: 10.1109/TCBB.2024.3429546

Wenkang Wang;Xiangmao Meng;Ju Xiang;Hayat Dino Bedru;Min Li

Identification of protein complex is an important issue in the field of system biology, which is crucial to understanding the cellular organization and inferring protein functions. Recently, many computational methods have been proposed to detect protein complexes from protein-protein interaction (PPI) networks. However, most of these methods only focus on local information of proteins in the PPI network, which are easily affected by the noise in the PPI network. Meanwhile, it's still challenging to detect protein complexes, especially for overlapping cases. To address these issues, we propose a new method, named Dopcc, to detect overlapping protein complexes by constructing a multi-metrics network according to different strategies. First, we adopt the Jaccard coefficient to measure the neighbor similarity between proteins and denoise the PPI network. Then, we propose a new strategy, integrating hierarchical compressing with network embedding, to capture the high-order structural similarity between proteins. Further, a new co-core attachment strategy is proposed to detect overlapping protein complexes from multi-metrics. The experimental results show that our proposed method, Dopcc, outperforms the other eight state-of-the-art methods in terms of F-measure, MMR, and Composite Score on two yeast datasets.

The source code and datasets are available at https://github.com/CSUBioGroup/Dopcc.
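The first Dopcc step, scoring edges by the Jaccard coefficient of neighbor sets and dropping weak edges, can be sketched in Python as follows. The use of closed neighborhoods (each node counted among its own neighbors), the `threshold` value, and the function names are assumptions made for illustration; the abstract does not specify these details.

```python
def jaccard(neighbors_u: set, neighbors_v: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two neighbor sets."""
    union = neighbors_u | neighbors_v
    return len(neighbors_u & neighbors_v) / len(union) if union else 0.0


def denoise_ppi(edges, threshold=0.2):
    """Keep only edges whose endpoints have similar neighborhoods.

    edges: iterable of (protein_u, protein_v) pairs.
    Returns a list of (u, v, score) for edges with score >= threshold.
    """
    edges = list(edges)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    kept = []
    for u, v in edges:
        # Assumption: closed neighborhoods, so the endpoints themselves count.
        score = jaccard(adj[u] | {u}, adj[v] | {v})
        if score >= threshold:
            kept.append((u, v, score))
    return kept
```

For example, in a triangle `A-B-C` with a pendant edge `C-D`, the pendant edge receives the lowest score and is the first to be removed as the threshold rises.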

Volume 21, Issue 6, pp. 2000-2010. Publication type: Journal Article.

Citations: 0

Enhancing Generalizability in Biomedical Entity Recognition: Self-Attention PCA-CLS Model

IF 3.6, Zone 3 (Biology), Q2, BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-07-16 DOI: 10.1109/TCBB.2024.3429234

Rajesh Kumar Mundotiya;Juhi Priya;Divya Kuwarbi;Teekam Singh

One of the primary tasks in the early stages of data mining involves the identification of entities from biomedical corpora. Traditional approaches relying on robust feature engineering face challenges when learning from available (un-)annotated data using data-driven models like deep learning-based architectures. Despite leveraging large corpora and advanced deep learning models, domain generalization remains an issue. Attention mechanisms are effective in capturing longer sentence dependencies and extracting semantic and syntactic information from limited annotated datasets. To address out-of-vocabulary challenges in biomedical text, the PCA-CLS (Position and Contextual Attention with CNN-LSTM-Softmax) model combines global self-attention and character-level convolutional neural network techniques. The model's performance is evaluated on eight distinct biomedical domain datasets encompassing entities such as genes, drugs, diseases, and species. The PCA-CLS model outperforms several state-of-the-art models, achieving notable F$_{1}$-scores, including 88.19% on BC2GM, 85.44% on JNLPBA, 90.80% on BC5CDR-chemical, 87.07% on BC5CDR-disease, 89.18% on BC4CHEMD, 88.81% on NCBI, and 91.59% on the s800 dataset.
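For readers unfamiliar with the metric, the sketch below shows how an entity-level F1 score is commonly computed from gold and predicted entity spans. It assumes exact-match evaluation on `(start, end, type)` tuples; the abstract does not state which matching scheme the authors used, and the function name is hypothetical.

```python
def entity_f1(gold: set, pred: set) -> float:
    """Exact-match F1 over sets of (start, end, entity_type) tuples."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

A prediction counts as correct only when both the span boundaries and the entity type match a gold annotation, so a correctly located span with the wrong type lowers precision and recall alike.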

Volume 21, Issue 6, pp. 1934-1941. Publication type: Journal Article.

Citations: 0

Employing Machine Learning Techniques to Detect Protein Function: A Survey, Experimental, and Empirical Evaluations

IF 3.6, Zone 3 (Biology), Q2, BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-07-15 DOI: 10.1109/TCBB.2024.3427381

Kamal Taha

This review article delves deeply into the various machine learning (ML) methods and algorithms employed in discerning protein functions. Each method discussed is assessed for its efficacy, limitations, potential improvements, and future prospects. We present an innovative hierarchical classification system that arranges algorithms into intricate categories and unique techniques. This taxonomy is based on a tri-level hierarchy, starting with the methodology category and narrowing down to specific techniques. Such a framework allows for a structured and comprehensive classification of algorithms, assisting researchers in understanding the interrelationships among diverse algorithms and techniques. The study incorporates both empirical and experimental evaluations to differentiate between the techniques. The empirical evaluation ranks the techniques based on four criteria. The experimental assessments rank: (1) individual techniques under the same methodology sub-category, (2) different sub-categories within the same category, and (3) the broad categories themselves. Integrating the innovative methodological classification, empirical findings, and experimental assessments, the article offers a well-rounded understanding of ML strategies in protein function identification. The paper also explores techniques for multi-task and multi-label detection of protein functions, in addition to focusing on single-task methods. Moreover, the paper sheds light on the future avenues of ML in protein function determination.

Volume 21, Issue 6, pp. 1965-1986. Publication type: Journal Article.

Citations: 0