
IEEE/ACM Transactions on Computational Biology and Bioinformatics: Latest Articles

Improving Antifreeze Proteins Prediction With Protein Language Models and Hybrid Feature Extraction Networks
IF 3.6, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-09-24, DOI: 10.1109/TCBB.2024.3467261
Jiashun Wu;Yan Liu;Yiheng Zhu;Dong-Jun Yu
Accurate identification of antifreeze proteins (AFPs) is crucial in developing biomimetic synthetic anti-icing materials and low-temperature organ preservation materials. Although numerous machine learning-based methods have been proposed for AFP prediction, the complex and diverse nature of AFPs limits the prediction performance of existing methods. In this study, we propose AFP-Deep, a new deep learning method that predicts antifreeze proteins by combining sequence embeddings from pre-trained protein language models with evolutionary contexts, both processed by hybrid feature extraction networks. The experimental results demonstrated that the main advantage of AFP-Deep is its utilization of pre-trained protein language models, which can extract discriminative global contextual features from protein sequences. Additionally, the hybrid deep neural networks designed for protein language model and evolutionary context feature extraction enhance the correlation between embeddings and antifreeze patterns. The performance evaluation results show that AFP-Deep outperforms state-of-the-art models on benchmark datasets, achieving AUPRC values of 0.724 and 0.924, respectively.
Journal Article, vol. 21, no. 6, pp. 2349-2358.
Citations: 0
GenoM7GNet: An Efficient N7-Methylguanosine Site Prediction Approach Based on a Nucleotide Language Model
IF 3.6, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-09-20, DOI: 10.1109/TCBB.2024.3459870
Chuang Li;Heshi Wang;Yanhua Wen;Rui Yin;Xiangxiang Zeng;Keqin Li
N$^{7}$-methylguanosine (m7G), one of the mainstream post-transcriptional RNA modifications, occupies an exceedingly significant place in medical treatments. However, classic approaches for identifying m7G sites are costly both in time and equipment. Meanwhile, the existing machine learning methods extract limited hidden information from RNA sequences, thus making it difficult to improve the accuracy. Therefore, we put forward a deep learning network, called "GenoM7GNet," for m7G site identification. This model utilizes Bidirectional Encoder Representations from Transformers (BERT) and is pretrained on nucleotide sequence data to capture hidden patterns from RNA sequences for m7G site prediction. Moreover, through detailed comparative experiments with various deep learning models, we discovered that the one-dimensional convolutional neural network (CNN) exhibits outstanding performance in sequence feature learning and classification. In the performance evaluation, the proposed GenoM7GNet model achieved 0.953 in accuracy, 0.932 in sensitivity, 0.976 in specificity, 0.907 in Matthews Correlation Coefficient and 0.984 in Area Under the receiver operating characteristic Curve. Extensive experimental results further prove that our GenoM7GNet model markedly surpasses other state-of-the-art models in predicting m7G sites, exhibiting high computing performance.
Journal Article, vol. 21, no. 6, pp. 2258-2268.
Citations: 0
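The accuracy, sensitivity, specificity and Matthews Correlation Coefficient quoted in the GenoM7GNet abstract are all functions of a binary confusion matrix. The sketch below shows the standard definitions only; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. The counts used in the example below are made up.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Illustrative counts only: 100 positive and 100 negative sites.
    print(binary_metrics(tp=90, fp=20, tn=80, fn=10))
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which is why it is often reported alongside sensitivity and specificity for site-prediction tasks.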
Topological-Similarity Based Canonical Representations for Biological Link Prediction
IF 4.5, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-09-17, DOI: 10.1109/tcbb.2024.3462730
Mengzhen Li, Mustafa Coşkun, Mehmet Koyutürk
Journal Article, volume listed as "38 1"; no page range or abstract is given. DOI URL: https://doi.org/10.1109/tcbb.2024.3462730
Citations: 0
Accurate Flow Decomposition via Robust Integer Linear Programming
IF 3.6, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-09-13, DOI: 10.1109/TCBB.2024.3433523
Fernando H. C. Dias;Alexandru I. Tomescu
Minimum flow decomposition (MFD) is a common problem across various fields of Computer Science, where a flow is decomposed into a minimum set of weighted paths. However, in Bioinformatics applications, such as RNA transcript or quasi-species assembly, the flow is erroneous since it is obtained from noisy read coverages. Typical generalizations of the MFD problem to handle errors are based on least-squares formulations or modelling the erroneous flow values as ranges. All of these are thus focused on error handling at the level of individual edges. In this paper, we interpret the flow decomposition problem as a robust optimization problem and lift error-handling from individual edges to solution paths. As such, we introduce a new minimum path-error flow decomposition problem, for which we give an Integer Linear Programming formulation. Our experimental results reveal that our formulation can account for errors significantly better, by lowering the inaccuracy rate by 30–50% compared to previous error-handling formulations, with computational requirements that remain practical.
Journal Article, vol. 21, no. 6, pp. 1955-1964.
Citations: 0
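To make the flow decomposition problem in the abstract above concrete, here is a minimal sketch of decomposing an error-free flow on a DAG into weighted source-to-sink paths. It is a simple greedy heuristic, not the Integer Linear Programming formulation or the path-error model the abstract describes, and it does not guarantee a minimum number of paths. The function name, the edge-dictionary input format and the toy network are assumptions made for illustration; the input is assumed to be an acyclic graph with positive integer flows that satisfy flow conservation.

```python
def greedy_flow_decomposition(flow, source, sink):
    """Greedily split a flow on a DAG into weighted source-to-sink paths.

    flow: dict mapping (u, v) edges to positive integer flow values.
    Assumes the graph is acyclic and flow conservation holds at every
    node other than source and sink (no noise, unlike the paper's setting).
    Returns a list of (weight, path) pairs.
    """
    residual = {edge: f for edge, f in flow.items() if f > 0}
    paths = []
    while any(u == source for (u, _v) in residual):
        # Walk from source to sink, always taking the heaviest remaining edge.
        path = [source]
        node = source
        while node != sink:
            candidates = [(f, v) for (u, v), f in residual.items() if u == node]
            _f, node = max(candidates)
            path.append(node)
        edges = list(zip(path, path[1:]))
        weight = min(residual[e] for e in edges)  # bottleneck of this path
        for e in edges:
            residual[e] -= weight
            if residual[e] == 0:
                del residual[e]
        paths.append((weight, path))
    return paths


if __name__ == "__main__":
    # Toy network, invented for illustration.
    toy_flow = {
        ("s", "a"): 5, ("s", "b"): 3,
        ("a", "t"): 3, ("a", "b"): 2,
        ("b", "t"): 5,
    }
    for weight, path in greedy_flow_decomposition(toy_flow, "s", "t"):
        print(weight, " -> ".join(path))
```

With noisy read coverages, conservation no longer holds exactly and a greedy walk like this can get stuck or explain the errors with spurious paths, which is the situation the robust formulation in the abstract is meant to handle.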
A New Graph Autoencoder-Based Multi-Level Kernel Subspace Fusion Framework for Single-Cell Type Identification
IF 3.6, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-09-12, DOI: 10.1109/TCBB.2024.3459960
Juan Wang;Tian-Jing Qiao;Chun-Hou Zheng;Jin-Xing Liu;Jun-Liang Shang
The advent of single-cell RNA sequencing (scRNA-seq) technology offers the opportunity to conduct biological research at the cellular level. Single-cell type identification based on unsupervised clustering is one of the fundamental tasks of scRNA-seq data analysis. Although many single-cell clustering methods have been developed recently, few can fully exploit the deep potential relationships between cells, resulting in suboptimal clustering. In this paper, we propose scGAMF, a graph autoencoder-based multi-level kernel subspace fusion framework for scRNA-seq data analysis. Based on multiple top feature sets, scGAMF unifies deep feature embedding and kernel space analysis into a single framework to learn an accurate clustering affinity matrix. First, we construct multiple top feature sets to avoid the high variability caused by single feature set learning. Second, scGAMF uses a graph autoencoder (GAE) to extract deep information embedded in the data and to learn embeddings that include gene expression patterns and cell-cell relationships. Third, to fully explore the deep potential relationships between cells, we design a multi-level kernel space fusion strategy. This strategy uses a kernel expression model with adaptive similarity preservation to learn a self-expression matrix shared by all embedding spaces of a given feature set, and a consensus affinity matrix across multiple top feature sets. Finally, the consensus affinity matrix is used for spectral clustering, visualization, and identification of gene markers. Extensive validation on real datasets shows that scGAMF achieves higher clustering accuracy than many popular single-cell analysis methods.
Journal Article, vol. 21, no. 6, pp. 2292-2303.
Citations: 0
Using Multi-Encoder Semi-Implicit Graph Variational Autoencoder to Analyze Single-Cell RNA Sequencing Data
IF 3.6, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-09-10, DOI: 10.1109/TCBB.2024.3458170
Shengwen Tian;Cunmei Ji;Jiancheng Ni;Yutian Wang;Chunhou Zheng
Rapid advances in single-cell RNA sequencing (scRNA-seq) have made it possible to characterize cell states at high resolution for large-scale libraries. scRNA-seq data contains a great deal of biological information, which can be mainly used to discover cell subtypes and track cell development. However, traditional methods face many challenges in addressing scRNA-seq data with high dimensions and high sparsity. For better analysis of scRNA-seq data, we propose a new framework called MSVGAE based on variational graph auto-encoders and graph attention networks. Specifically, we introduce multiple encoders to learn features at different scales and control for uninformative features. Moreover, different noises are added to the encoders to promote the propagation of graph structural information and distribution uncertainty. Therefore, some complex posterior distributions can be captured by our model. MSVGAE maps scRNA-seq data with high dimensions and high noise into a low-dimensional latent space, which is beneficial for downstream tasks. In particular, MSVGAE can handle extremely sparse data. Before the experiment, we create 24 simulated datasets to simulate various biological scenarios and collect 8 real-world datasets. The experimental results of clustering, visualization and marker gene analysis indicate that the MSVGAE model has excellent accuracy and robustness in analyzing scRNA-seq data.
Journal Article, vol. 21, no. 6, pp. 2280-2291.
Citations: 0
APMG: 3D Molecule Generation Driven by Atomic Chemical Properties
IF 3.6, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-09-10, DOI: 10.1109/TCBB.2024.3457807
Yang Hua;Zhenhua Feng;Xiaoning Song;Hui Li;Tianyang Xu;Xiao-Jun Wu;Dong-Jun Yu
Recently, mask-fill-based 3D Molecular Generation (MG) methods have become very popular in virtual drug design. However, the existing MG methods ignore the chemical properties of atoms and contain inappropriate atomic position training data, which limits their generation capability. To mitigate the above issues, this paper presents a novel mask-fill-based 3D molecule generation model driven by atomic chemical properties (APMG). Specifically, we construct a new attention-MPNN-based encoder and introduce the electronic information into atom representations to enrich chemical properties. Also, a multi-functional classifier is designed to predict the electronic information of each generated atom, guiding the type prediction of elements and bonds. By design, the proposed method uses the chemical properties of atoms and their correlations for high-quality molecule generation. Second, to optimize the atomic position training data, we propose a novel atomic training position generation approach using the Chi-Square distribution. We evaluate our APMG method on the CrossDocked dataset and visualize the docking states of the pockets and generated molecules. The obtained results demonstrate the superiority and merits of APMG over the state-of-the-art approaches.
Journal Article, vol. 21, no. 6, pp. 2269-2279.
Citations: 0
Combining Zhegalkin Polynomials and SAT Solving for Context-Specific Boolean Modeling of Biological Systems
IF 3.6, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-09-10, DOI: 10.1109/TCBB.2024.3456302
Vincent Deman;Marine Ciantar;Laurent Naudin;Philippe Castera;Anne-Sophie Beignon
Large amounts of knowledge regarding biological processes are readily available in the literature and aggregated in diverse databases. Boolean networks are powerful tools to render that knowledge into models that can mimic and simulate biological phenomena at multiple scales. Yet, when a model is required to understand or predict the behavior of a biological system in given conditions, existing information often does not completely match this context. Networks built from only prior knowledge can overlook mechanisms, lack specificity, and just partially recapitulate experimental observations. To address this limitation, context-specific data needs to be integrated. However, the brute-force identification of qualitative rules matching these data becomes infeasible as the number of candidates explodes for increasingly complex systems. Here, we used Zhegalkin polynomials to transform this identification into a binary value assignment for exponentially fewer variables, which we addressed with a state-of-the-art SAT solver. We evaluated our implemented method alongside two widely recognized tools, CellNetOptimizer and Caspo-ts, on both artificial toy models and large-scale models based on experimental data from the HPN-DREAM challenge. Our approach demonstrated benchmark-leading capabilities on networks of significant size and intricate complexity. It thus appears promising for the in silico modeling of ever more comprehensive biological systems.
Journal Article, vol. 21, no. 6, pp. 2188-2199. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10671585
Citations: 0
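A Zhegalkin polynomial (algebraic normal form) writes a Boolean function as an XOR of AND-monomials, so identifying a rule amounts to choosing one binary coefficient per monomial. The sketch below only illustrates that representation by computing the coefficients from a truth table with the standard binary Möbius transform; it is not the SAT encoding used in the paper above, and the function names and bit-ordering convention are assumptions made for this example.

```python
def zhegalkin_coefficients(truth_table):
    """Zhegalkin (ANF) coefficients of a Boolean function.

    truth_table[i] is the function value (0 or 1) for the input whose
    variable x_k equals bit k of i. In the result, coeffs[m] == 1 means the
    monomial made of all x_k with bit k set in m appears in the polynomial
    (m == 0 is the constant term 1).
    """
    size = len(truth_table)
    if size == 0 or size & (size - 1):
        raise ValueError("truth table length must be a power of two")
    coeffs = list(truth_table)
    step = 1
    while step < size:
        # Binary Moebius transform: fold in one variable per pass.
        for i in range(size):
            if i & step:
                coeffs[i] ^= coeffs[i ^ step]
        step <<= 1
    return coeffs


def format_polynomial(coeffs):
    """Render coefficients as text, using ^ for XOR and * for AND."""
    terms = []
    for mask, c in enumerate(coeffs):
        if c:
            names = [f"x{k}" for k in range(mask.bit_length()) if (mask >> k) & 1]
            terms.append("*".join(names) if names else "1")
    return " ^ ".join(terms) if terms else "0"


if __name__ == "__main__":
    or_table = [0, 1, 1, 1]  # x0 OR x1, indexed by x0 + 2*x1
    print(format_polynomial(zhegalkin_coefficients(or_table)))
```

For the two-variable OR table this prints `x0 ^ x1 ^ x0*x1`, the familiar identity that OR equals the XOR of both inputs and their conjunction.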
An Automated Convergence Diagnostic for Phylogenetic MCMC Analyses
IF 3.6, Zone 3 (Biology), Q2 BIOCHEMICAL RESEARCH METHODS, Pub Date: 2024-09-10, DOI: 10.1109/TCBB.2024.3457875
Lars Berling;Remco Bouckaert;Alex Gavryushkin
Assessing convergence of Markov chain Monte Carlo (MCMC) based analyses is crucial but challenging, especially so in high dimensional and complex spaces such as the space of phylogenetic trees (treespace). In practice, it is assumed that the target distribution is the unique stationary distribution of the MCMC and convergence is achieved when samples appear to be stationary. Here we leverage recent advances in computational geometry of the treespace and introduce a method that combines classical statistical techniques and algorithms with geometric properties of the treespace to automatically evaluate and assess practical convergence of phylogenetic MCMC analyses. Our method monitors convergence across multiple MCMC chains and achieves high accuracy in detecting both practical convergence and convergence issues within treespace. Furthermore, our approach is developed to allow for real-time evaluation during the MCMC algorithm run, eliminating any of the chain post-processing steps that are currently required. Our tool therefore improves reliability and efficiency of MCMC based phylogenetic inference methods and makes analyses easier to reproduce and compare. We demonstrate the efficacy of our diagnostic via a well-calibrated simulation study and provide examples of its performance on real data sets. Although our method performs well in practice, a significant part of the underlying treespace probability theory is still missing, which creates an excellent opportunity for future mathematical research in this area.
Vol. 21, No. 6, pp. 2246-2257 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10675342
Citations: 0
Bridging Between Deviation Indices for Non-Tree-Based Phylogenetic Networks
IF 3.6 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date : 2024-09-09 DOI: 10.1109/TCBB.2024.3456575
Takatora Suzuki;Han Guo;Momoko Hayamizu
Phylogenetic networks are a useful model that can represent reticulate evolution and complex biological data. In recent years, mathematical and computational aspects of tree-based networks have been well studied. However, not all phylogenetic networks are tree-based, so it is meaningful to consider how close a given network is to being tree-based; Francis–Steel–Semple (2018) proposed several different indices to measure the degree of deviation of a phylogenetic network from being tree-based. One is the minimum number of leaves that need to be added to convert a given network to tree-based, and another is the number of vertices that are not included in the largest subtree covering its leaf-set. Both values are zero if and only if the network is tree-based. Both deviation indices can be computed efficiently, but the relationship between the above two is unknown, as each has been studied using different approaches. In this study, we derive a tight inequality for the values of the two measures and also give a characterisation of phylogenetic networks such that they coincide. This characterisation yields a new efficient algorithm for the Maximum Covering Subtree Problem based on the maximal zig-zag trail decomposition.
Vol. 21, No. 6, pp. 2226-2234 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10670207
Citations: 0