
IEEE/ACM Transactions on Computational Biology and Bioinformatics: Latest Articles

Accurate Flow Decomposition via Robust Integer Linear Programming
IF 3.6 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-13 | DOI: 10.1109/TCBB.2024.3433523
Fernando H. C. Dias;Alexandru I. Tomescu
Minimum flow decomposition (MFD) is a common problem across various fields of Computer Science, where a flow is decomposed into a minimum set of weighted paths. However, in Bioinformatics applications, such as RNA transcript or quasi-species assembly, the flow is erroneous since it is obtained from noisy read coverages. Typical generalizations of the MFD problem to handle errors are based on least-squares formulations or modelling the erroneous flow values as ranges. All of these are thus focused on error handling at the level of individual edges. In this paper, we interpret the flow decomposition problem as a robust optimization problem and lift error-handling from individual edges to solution paths. As such, we introduce a new minimum path-error flow decomposition problem, for which we give an Integer Linear Programming formulation. Our experimental results reveal that our formulation can account for errors significantly better, by lowering the inaccuracy rate by 30–50% compared to previous error-handling formulations, with computational requirements that remain practical.
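To make the underlying problem concrete, here is a minimal sketch of plain (error-free) flow decomposition on a DAG. It is a simple greedy heuristic, not the minimum decomposition and not the ILP formulation from the paper; the function name `decompose_flow` and the edge-dictionary input format are assumptions made for illustration.

```python
from collections import defaultdict


def decompose_flow(edges, source, sink):
    """Greedily split a flow on a DAG into weighted source-to-sink paths.

    edges maps (u, v) -> non-negative integer flow, and the flow is assumed
    to satisfy conservation at every node other than source and sink.
    Returns a list of (path, weight) pairs whose superposition equals the flow.
    This heuristic does not guarantee a minimum number of paths.
    """
    residual = dict(edges)
    out = defaultdict(list)
    for (u, v) in edges:
        out[u].append(v)

    paths = []
    while True:
        # Walk from the source along edges that still carry flow.
        path, node = [source], source
        while node != sink:
            nxt = next((v for v in out[node] if residual[(node, v)] > 0), None)
            if nxt is None:
                break
            path.append(nxt)
            node = nxt
        if node != sink:
            break  # no remaining flow leaves the source
        # The path weight is its bottleneck; subtract it from every edge used.
        weight = min(residual[(a, b)] for a, b in zip(path, path[1:]))
        for a, b in zip(path, path[1:]):
            residual[(a, b)] -= weight
        paths.append((path, weight))
    return paths


if __name__ == "__main__":
    flow = {("s", "a"): 5, ("s", "b"): 3, ("a", "t"): 2,
            ("a", "b"): 3, ("b", "t"): 6}
    for path, weight in decompose_flow(flow, "s", "t"):
        print(weight, "->".join(path))
```

With noisy read coverages the input would no longer satisfy conservation exactly, which is the situation the error-handling formulations in the abstract are designed for.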
Vol. 21, No. 6, pp. 1955-1964
Citations: 0
A New Graph Autoencoder-Based Multi-Level Kernel Subspace Fusion Framework for Single-Cell Type Identification
IF 3.6 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-12 | DOI: 10.1109/TCBB.2024.3459960
Juan Wang;Tian-Jing Qiao;Chun-Hou Zheng;Jin-Xing Liu;Jun-Liang Shang
The advent of single-cell RNA sequencing (scRNA-seq) technology offers the opportunity to conduct biological research at the cellular level. Single-cell type identification based on unsupervised clustering is one of the fundamental tasks of scRNA-seq data analysis. Although many single-cell clustering methods have been developed recently, few can fully exploit the deep potential relationships between cells, resulting in suboptimal clustering. In this paper, we propose scGAMF, a graph autoencoder-based multi-level kernel subspace fusion framework for scRNA-seq data analysis. Based on multiple top feature sets, scGAMF unifies deep feature embedding and kernel space analysis into a single framework to learn an accurate clustering affinity matrix. First, we construct multiple top feature sets to avoid the high variability caused by learning from a single feature set. Second, scGAMF uses graph autoencoders (GAEs) to extract deep information embedded in the data and to learn embeddings that capture gene expression patterns and cell-cell relationships. Third, to fully explore the deep potential relationships between cells, we design a multi-level kernel space fusion strategy. This strategy uses a kernel expression model with adaptive similarity preservation to learn a self-expression matrix shared by all embedding spaces of a given feature set, and a consensus affinity matrix across multiple top feature sets. Finally, the consensus affinity matrix is used for spectral clustering, visualization, and identification of gene markers. Extensive validation on real datasets shows that scGAMF achieves higher clustering accuracy than many popular single-cell analysis methods.
Vol. 21, No. 6, pp. 2292-2303
Citations: 0
Using Multi-Encoder Semi-Implicit Graph Variational Autoencoder to Analyze Single-Cell RNA Sequencing Data
IF 3.6 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-10 | DOI: 10.1109/TCBB.2024.3458170
Shengwen Tian;Cunmei Ji;Jiancheng Ni;Yutian Wang;Chunhou Zheng
Rapid advances in single-cell RNA sequencing (scRNA-seq) have made it possible to characterize cell states at high resolution in large-scale libraries. scRNA-seq data contain a great deal of biological information, which can be used mainly to discover cell subtypes and track cell development. However, traditional methods face many challenges in handling high-dimensional, highly sparse scRNA-seq data. For better analysis of scRNA-seq data, we propose a new framework called MSVGAE based on variational graph auto-encoders and graph attention networks. Specifically, we introduce multiple encoders to learn features at different scales and control for uninformative features. Moreover, different noises are added to the encoders to promote the propagation of graph structural information and distribution uncertainty, so that our model can capture some complex posterior distributions. MSVGAE maps high-dimensional, noisy scRNA-seq data into a low-dimensional latent space, which benefits downstream tasks. In particular, MSVGAE can handle extremely sparse data. For the experiments, we create 24 simulated datasets covering various biological scenarios and collect 8 real-world datasets. The results of clustering, visualization and marker gene analysis indicate that the MSVGAE model achieves excellent accuracy and robustness in analyzing scRNA-seq data.
Vol. 21, No. 6, pp. 2280-2291
Citations: 0
APMG: 3D Molecule Generation Driven by Atomic Chemical Properties
IF 3.6 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-10 | DOI: 10.1109/TCBB.2024.3457807
Yang Hua;Zhenhua Feng;Xiaoning Song;Hui Li;Tianyang Xu;Xiao-Jun Wu;Dong-Jun Yu
Recently, mask-fill-based 3D Molecular Generation (MG) methods have become very popular in virtual drug design. However, the existing MG methods ignore the chemical properties of atoms and rely on inappropriate atomic position training data, which limits their generation capability. To mitigate these issues, this paper presents a novel mask-fill-based 3D molecule generation model driven by atomic chemical properties (APMG). First, we construct a new attention-MPNN-based encoder and introduce electronic information into atom representations to enrich chemical properties. We also design a multi-functional classifier to predict the electronic information of each generated atom, guiding the type prediction of elements and bonds. By design, the proposed method uses the chemical properties of atoms and their correlations for high-quality molecule generation. Second, to optimize the atomic position training data, we propose a novel atomic training position generation approach using the Chi-Square distribution. We evaluate our APMG method on the CrossDocked dataset and visualize the docking states of the pockets and generated molecules. The obtained results demonstrate the superiority and merits of APMG over the state-of-the-art approaches.
Vol. 21, No. 6, pp. 2269-2279
Citations: 0
Combining Zhegalkin Polynomials and SAT Solving for Context-Specific Boolean Modeling of Biological Systems
IF 3.6 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-10 | DOI: 10.1109/TCBB.2024.3456302
Vincent Deman;Marine Ciantar;Laurent Naudin;Philippe Castera;Anne-Sophie Beignon
Large amounts of knowledge regarding biological processes are readily available in the literature and aggregated in diverse databases. Boolean networks are powerful tools to render that knowledge into models that can mimic and simulate biological phenomena at multiple scales. Yet, when a model is required to understand or predict the behavior of a biological system in given conditions, existing information often does not completely match this context. Networks built from only prior knowledge can overlook mechanisms, lack specificity, and just partially recapitulate experimental observations. To address this limitation, context-specific data needs to be integrated. However, the brute-force identification of qualitative rules matching these data becomes infeasible as the number of candidates explodes for increasingly complex systems. Here, we used Zhegalkin polynomials to transform this identification into a binary value assignment for exponentially fewer variables, which we addressed with a state-of-the-art SAT solver. We evaluated our implemented method alongside two widely recognized tools, CellNetOptimizer and Caspo-ts, on both artificial toy models and large-scale models based on experimental data from the HPN-DREAM challenge. Our approach demonstrated benchmark-leading capabilities on networks of significant size and intricate complexity. It thus appears promising for the in silico modeling of ever more comprehensive biological systems.
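For readers unfamiliar with Zhegalkin polynomials (the algebraic normal form of a Boolean function over GF(2)), the sketch below shows how the coefficients can be obtained from a truth table with the standard binary Möbius transform. The abstract does not describe the authors' SAT encoding, so this only illustrates the representation itself; the function names `zhegalkin_coefficients` and `evaluate` and the bitmask indexing convention are assumptions.

```python
def zhegalkin_coefficients(truth_table):
    """Return Zhegalkin (ANF) coefficients for a truth table of length 2**n.

    truth_table[i] is f(x), where bit k of i is the value of variable x_k.
    In the result, coeffs[m] == 1 means the monomial made of the variables
    whose bits are set in m is present (m == 0 is the constant term 1).
    """
    coeffs = list(truth_table)
    size = len(coeffs)
    if size == 0 or size & (size - 1):
        raise ValueError("truth table length must be a power of two")
    n = size.bit_length() - 1
    # Binary Moebius transform: one XOR butterfly pass per variable.
    for k in range(n):
        step = 1 << k
        for i in range(size):
            if i & step:
                coeffs[i] ^= coeffs[i ^ step]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the polynomial at assignment x (a bitmask), over GF(2)."""
    value = 0
    for mask, c in enumerate(coeffs):
        # A monomial is 1 exactly when all of its variables are 1 in x.
        if c and (mask & x) == mask:
            value ^= 1
    return value


if __name__ == "__main__":
    # OR of two variables: x0 XOR x1 XOR x0*x1
    print(zhegalkin_coefficients([0, 1, 1, 1]))
```

Each Boolean rule is then fully determined by one binary coefficient per monomial, which is the kind of binary value assignment a SAT solver can search over.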
Vol. 21, No. 6, pp. 2188-2199 | Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10671585
Citations: 0
An Automated Convergence Diagnostic for Phylogenetic MCMC Analyses
IF 3.6 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-10 | DOI: 10.1109/TCBB.2024.3457875
Lars Berling;Remco Bouckaert;Alex Gavryushkin
Assessing convergence of Markov chain Monte Carlo (MCMC) based analyses is crucial but challenging, especially so in high dimensional and complex spaces such as the space of phylogenetic trees (treespace). In practice, it is assumed that the target distribution is the unique stationary distribution of the MCMC and convergence is achieved when samples appear to be stationary. Here we leverage recent advances in computational geometry of the treespace and introduce a method that combines classical statistical techniques and algorithms with geometric properties of the treespace to automatically evaluate and assess practical convergence of phylogenetic MCMC analyses. Our method monitors convergence across multiple MCMC chains and achieves high accuracy in detecting both practical convergence and convergence issues within treespace. Furthermore, our approach is developed to allow for real-time evaluation during the MCMC algorithm run, eliminating any of the chain post-processing steps that are currently required. Our tool therefore improves reliability and efficiency of MCMC based phylogenetic inference methods and makes analyses easier to reproduce and compare. We demonstrate the efficacy of our diagnostic via a well-calibrated simulation study and provide examples of its performance on real data sets. Although our method performs well in practice, a significant part of the underlying treespace probability theory is still missing, which creates an excellent opportunity for future mathematical research in this area.
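The abstract does not name the classical statistical techniques it builds on, and the treespace-specific diagnostic is not reproduced here. As a generic illustration of multi-chain convergence monitoring, the sketch below computes the classical Gelman-Rubin potential scale reduction factor for a scalar summary (for example, log-posterior) traced in several chains. The choice of this statistic and the function name `gelman_rubin` are assumptions, not the authors' method.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equal-length scalar chains.

    chains is a list of m >= 2 sequences, each holding n >= 2 samples of the
    same scalar quantity. Values close to 1 suggest the chains agree; values
    well above 1 suggest they have not yet mixed.
    Raises ZeroDivisionError if every chain is constant (zero within-chain
    variance).
    """
    m = len(chains)
    n = len(chains[0]) if m else 0
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")

    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between_over_n = variance(chain_means)         # B / n: variance of chain means
    pooled = (n - 1) / n * within + between_over_n # pooled variance estimate
    return math.sqrt(pooled / within)


if __name__ == "__main__":
    agreeing = [[0.0, 1.0, 0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]]
    separated = [[0.0, 1.0, 0.0, 1.0, 0.0, 1.0], [10.0, 11.0, 10.0, 11.0, 10.0, 11.0]]
    print(gelman_rubin(agreeing), gelman_rubin(separated))
```

Scalar diagnostics like this one say nothing about tree topologies directly, which is one reason a diagnostic that uses the geometry of treespace is of interest.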
Vol. 21, No. 6, pp. 2246-2257 | Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10675342
Citations: 0
Bridging Between Deviation Indices for Non-Tree-Based Phylogenetic Networks
IF 3.6 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-09 | DOI: 10.1109/TCBB.2024.3456575
Takatora Suzuki;Han Guo;Momoko Hayamizu
Phylogenetic networks are a useful model that can represent reticulate evolution and complex biological data. In recent years, mathematical and computational aspects of tree-based networks have been well studied. However, not all phylogenetic networks are tree-based, so it is meaningful to consider how close a given network is to being tree-based; Francis–Steel–Semple (2018) proposed several different indices to measure the degree of deviation of a phylogenetic network from being tree-based. One is the minimum number of leaves that need to be added to convert a given network to tree-based, and another is the number of vertices that are not included in the largest subtree covering its leaf-set. Both values are zero if and only if the network is tree-based. Both deviation indices can be computed efficiently, but the relationship between the above two is unknown, as each has been studied using different approaches. In this study, we derive a tight inequality for the values of the two measures and also give a characterisation of phylogenetic networks such that they coincide. This characterisation yields a new efficient algorithm for the Maximum Covering Subtree Problem based on the maximal zig-zag trail decomposition.
Vol. 21, No. 6, pp. 2226-2234 | Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10670207
Citations: 0
Relation Extraction in Biomedical Texts: A Cross-Sentence Approach
IF 3.6 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-06 | DOI: 10.1109/TCBB.2024.3451348
Zhijing Li;Liwei Tian;Yiping Jiang;Yucheng Huang
Relation extraction, a crucial task in understanding the intricate relationships between entities in biomedical domains, has predominantly focused on binary relations within single sentences. However, in practical biomedical scenarios, relationships often extend across multiple sentences, leading to extraction errors with potential impacts on clinical decision-making and medical diagnosis. To overcome this limitation, we present a novel cross-sentence relation extraction framework that integrates and enhances coreference resolution and relation extraction models. Coreference resolution serves as the foundation, breaking sentence boundaries and linking entities across sentences. Our framework incorporates pre-trained deep language representations and leverages graph LSTMs to effectively model cross-sentence entity mentions. The use of a self-attentive Transformer architecture and external semantic information further enhances the modeling of intricate relationships. Comprehensive experiments conducted on two standard datasets, namely the BioNLP dataset and THYME dataset, demonstrate the state-of-the-art performance of our proposed approach.
Vol. 21, No. 6, pp. 2156-2166
Citations: 0
CTsynther: Contrastive Transformer Model for End-to-End Retrosynthesis Prediction
IF 3.6 | Zone 3, Biology | Q2 BIOCHEMICAL RESEARCH METHODS | Pub Date: 2024-09-06 | DOI: 10.1109/TCBB.2024.3455381
Hao Lu;Zhiqiang Wei;Kun Zhang;Xuze Wang;Liaqat Ali;Hao Liu
Retrosynthesis prediction is a fundamental problem in organic chemistry and drug synthesis. We propose an end-to-end deep learning model called CTsynther (Contrastive Transformer for single-step retrosynthesis prediction model) that provides single-step retrosynthesis prediction without external reaction templates or specialized knowledge. The model introduces contrastive learning into the Transformer architecture and employs a contrastive learning language representation model at the SMILES sentence level to enhance inference by learning similarities and differences between samples. Mixed global and local attention mechanisms allow the model to capture features and dependencies between different atoms to improve generalization. We further investigate the embedding representations of SMILES learned automatically by the model. Visualization results show that the model effectively acquires information about identical molecules and improves prediction performance. Experiments show that retrosynthesis accuracy reaches 53.5% and 64.4% with and without reaction types, respectively. The validity of the predicted reactants is also improved, and the model is competitive with semi-template methods.
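The abstract does not state which contrastive objective CTsynther uses. As a generic illustration of "learning similarities and differences between samples", the sketch below computes an InfoNCE-style loss on plain Python vectors: the loss is small when an anchor embedding is closer to its positive than to the negatives. The function name `info_nce`, the cosine similarity, and the default temperature are all assumptions for illustration.

```python
import math


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor embedding.

    anchor and positive are embeddings of two views of the same sample
    (for instance, two SMILES strings of one molecule); negatives are
    embeddings of other samples. All vectors must be non-zero and share
    the same length.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # The positive pair goes first, followed by every negative pair.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]

    # Negative log-softmax of the positive, with the max subtracted for stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
    misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
    print(aligned, misaligned)
```

In a real model the embeddings would come from the Transformer encoder and the loss would be averaged over a batch and minimized by gradient descent.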
IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 21, no. 6, pp. 2235-2245. Journal Article.
Citations: 0
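The CTsynther abstract above mentions contrastive learning over SMILES-level representations but does not state the loss it uses. As a rough illustration only, the sketch below shows a generic InfoNCE-style contrastive objective in NumPy. The function name `info_nce_loss`, the `temperature` value, and the random embeddings are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss: row i of `anchors` should match row i of `positives`.

    Hypothetical helper for illustration; not the loss defined in the paper.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                    # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries correspond to the matching (positive) pairs.
    return float(-np.mean(np.diag(log_softmax)))

# Toy embeddings standing in for two views of the same molecules (assumed data).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = info_nce_loss(emb, emb + 0.01 * rng.normal(size=emb.shape))
shuffled = info_nce_loss(emb, emb[::-1].copy())
print(f"aligned pairs: {aligned:.4f}, mismatched pairs: {shuffled:.4f}")
```

The loss is low when each anchor is closest to its own positive and high when the pairing is scrambled, which is the behaviour a contrastive objective is meant to reward.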
Integrating Similarities via Local Interaction Consistency and Optimizing Area Under the Curve Measures via Matrix Factorization for Drug-Target Interaction Prediction
IF 3.6 Zone 3 (Biology) Q2 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-03 DOI: 10.1109/TCBB.2024.3453499
Bin Liu;Grigorios Tsoumakas
In drug discovery, identifying drug-target interactions (DTIs) via experimental approaches is a tedious and expensive procedure. Computational methods efficiently predict DTIs and recommend a small set of potentially interacting pairs for further experimental confirmation, accelerating the drug discovery process. Although fusing heterogeneous drug and target similarities can improve prediction ability, existing similarity combination methods ignore the interaction consistency of neighbouring entities. Furthermore, the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUC) are two widely used evaluation metrics in DTI prediction, yet they are seldom used as losses within existing DTI prediction methods. We propose a local interaction consistency (LIC) aware similarity integration method to fuse vital information from diverse views for DTI prediction models. Furthermore, we propose two matrix factorization (MF) methods that optimize AUPR and AUC, respectively, using convex surrogate losses, and then develop an ensemble MF approach that takes advantage of both area-under-the-curve metrics by combining the two single-metric MF models. Experimental results under different prediction settings show that the proposed methods outperform various competitors in terms of the metric(s) they optimize and are reliable in discovering potential new DTIs.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 21, no. 6, pp. 2212-2225. Journal Article.
Citations: 0
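The abstract above refers to optimizing AUC through a convex surrogate loss without saying which surrogate is used. To make the general idea concrete, the following sketch computes the empirical AUC of a score vector and a pairwise squared-hinge surrogate for it. The choice of squared hinge, the `margin` parameter, the function names, and the toy scores are assumptions for illustration and should not be read as the authors' formulation.

```python
import numpy as np

def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def squared_hinge_auc_loss(scores, labels, margin=1.0):
    """Convex pairwise surrogate (assumed here): penalise any positive that does
    not exceed a negative by at least `margin`."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(np.maximum(0.0, margin - diff) ** 2))

# Toy interaction labels and two candidate score vectors (made-up numbers).
labels = np.array([1, 1, 0, 0, 0])
good = np.array([2.0, 1.5, 0.1, -0.3, 0.0])
bad = np.array([0.0, 0.2, 0.5, 0.1, 0.3])

for name, s in (("good", good), ("bad", bad)):
    print(name, empirical_auc(s, labels), squared_hinge_auc_loss(s, labels))
```

The empirical AUC is a step function of the scores and so gives no useful gradient, whereas the surrogate is convex in the score differences and decreases as the ranking improves. That is the usual motivation for training on a surrogate and evaluating on the metric itself.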