IEEE/ACM Transactions on Computational Biology and Bioinformatics: Latest Articles, Page 2

Discriminative Domain Adaption Network for Simultaneously Removing Batch Effects and Annotating Cell Types in Single-Cell RNA-Seq

IF 3.6 | Zone 3, Biology | Q2, Biochemical Research Methods

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-10-29 DOI: 10.1109/TCBB.2024.3487574

Qi Zhu;Aizhen Li;Zheng Zhang;Chuhang Zheng;Junyong Zhao;Jin-Xing Liu;Daoqiang Zhang;Wei Shao

Machine learning techniques have become increasingly important for analyzing single-cell RNA sequencing (scRNA-seq) data and identifying cell types, providing valuable insights into cellular development and disease mechanisms. However, batch effects pose major challenges in scRNA-seq analysis because the data distribution varies across batches. Although several batch-effect mitigation algorithms have been proposed, most focus only on the correlation of local structure embeddings and ignore global distribution matching and discriminative feature representation in batch correction. In this paper, we propose the discriminative domain adaption network (D2AN) for joint batch-effect correction and cell-type annotation with scRNA-seq data. Specifically, we first capture global low-dimensional embeddings of samples from the source and target domains using an adversarial domain adaptation strategy. Second, a contrastive loss is developed to preliminarily align the source-domain samples. Moreover, semantic alignment of class centroids in the source and target domains is performed for further local alignment. Finally, a self-paced learning mechanism based on an inter-domain loss gradually selects samples with high similarity to the target domain for training, which improves the robustness of the model. Experimental results on multiple real datasets demonstrate that the proposed method outperforms several state-of-the-art methods.
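The self-paced selection step can be illustrated with a minimal sketch. The abstract does not give the loss definition or the pacing schedule, so everything here is an assumption: the classic hard-threshold form of self-paced learning is used, the per-sample inter-domain losses are made-up numbers, and the names `self_paced_select` and `self_paced_schedule` are hypothetical.

```python
def self_paced_select(losses, threshold):
    """Return indices of samples whose loss is at or below the threshold."""
    return [i for i, loss in enumerate(losses) if loss <= threshold]


def self_paced_schedule(losses, start, growth, rounds):
    """Yield the selected index list for each round as the threshold grows.

    Assumption: the threshold is multiplied by `growth` after every round,
    so "easy" samples (low inter-domain loss) enter training first.
    """
    threshold = start
    for _ in range(rounds):
        yield self_paced_select(losses, threshold)
        threshold *= growth


# Hypothetical per-sample inter-domain losses for five source-domain cells
# (lower means more similar to the target domain).
losses = [0.10, 0.80, 0.35, 1.50, 0.20]
schedule = list(self_paced_schedule(losses, start=0.25, growth=2.0, rounds=3))
```

In this toy run the training pool grows from two cells to four over three rounds, while the cell with the largest inter-domain loss is never admitted.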

Vol. 21, no. 6, pp. 2543-2555.

Citations: 0

ESGC-MDA: Identifying miRNA-disease associations using enhanced Simple Graph Convolutional Networks.

Pub Date : 2024-10-28 DOI: 10.1109/TCBB.2024.3486911

Xuehua Bi, Chunyang Jiang, Cheng Yan, Kai Zhao, Linlin Zhang, Jianxin Wang

MiRNAs play an important role in the occurrence and development of human disease, and identifying potential miRNA-disease associations is valuable for disease diagnosis and treatment. Efficient computational methods for predicting potential miRNA-disease associations are therefore urgently needed to reduce the cost and time of wet-lab experiments. In addition, high-quality feature representation remains a challenge for miRNA-disease association prediction with graph neural network methods. In this paper, we propose ESGC-MDA, which employs an enhanced Simple Graph Convolution Network to identify miRNA-disease associations. We first construct a bipartite attributed graph for miRNAs and diseases by computing multi-source similarity. Then, we enhance the feature representations of miRNA and disease nodes by applying two strategies in the simple graph convolution network: randomly dropping messages during propagation, so that the model learns more reliable feature representations, and using adaptive weighting to aggregate features from different layers. Finally, we calculate the prediction scores of miRNA-disease pairs with a fully connected neural network decoder. We conduct 5-fold cross-validation and 10-fold cross-validation on HMDD v2.0 and HMDD v3.2, respectively, and ESGC-MDA achieves better performance than state-of-the-art baseline methods. Case studies on cardiovascular disease, lung cancer, and colon cancer further confirm the effectiveness of ESGC-MDA. The source code is available at https://github.com/bixuehua/ESGC-MDA.
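The two enhancement strategies can be sketched in NumPy under stated assumptions. This is not the code from the ESGC-MDA repository: the function names are hypothetical, message dropping is approximated by masking entries of the normalized adjacency matrix at each layer, and the adaptive layer weights are modeled as a softmax over one scalar per layer (in a real model these would be learned).

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def propagate(adj, features, num_layers, layer_logits, drop_rate=0.0, rng=None):
    """Parameter-free propagation with random message dropping and
    softmax-weighted aggregation of the per-layer outputs."""
    if rng is None:
        rng = np.random.default_rng(0)
    s = normalize_adjacency(adj)
    h = features
    outputs = [h]  # layer 0 is the raw feature matrix
    for _ in range(num_layers):
        if drop_rate > 0.0:
            # Assumption: a "message" is one edge of the propagation matrix;
            # surviving messages are rescaled to keep the expectation unchanged.
            mask = rng.random(s.shape) >= drop_rate
            s_used = s * mask / (1.0 - drop_rate)
        else:
            s_used = s
        h = s_used @ h
        outputs.append(h)
    logits = np.asarray(layer_logits, dtype=float)
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()
    return sum(w * out for w, out in zip(weights, outputs))


# Toy graph: one miRNA node connected to one disease node, one feature each.
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
features = np.array([[1.0], [3.0]])
fused = propagate(adj, features, num_layers=1, layer_logits=[0.0, 0.0])
```

With dropping disabled and equal layer logits, the result is simply the average of the raw features and the one-hop smoothed features; setting `drop_rate` above zero makes each training pass see a different random subset of messages.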

Citations: 0

MLW-BFECF: A Multi-Weighted Dynamic Cascade Forest Based on Bilinear Feature Extraction for Predicting the Stage of Kidney Renal Clear Cell Carcinoma on Multi-Modal Gene Data

Pub Date : 2024-10-25 DOI: 10.1109/TCBB.2024.3486742

Liye Jia;Liancheng Jiang;Junhong Yue;Fang Hao;Yongfei Wu;Xilin Liu

The stage prediction of kidney renal clear cell carcinoma (KIRC) is important for the diagnosis, personalized treatment, and prognosis of patients. Many prediction methods have been proposed, but most are based on unimodal gene data, and their accuracy is difficult to improve further. Therefore, we propose a novel multi-weighted dynamic cascade forest based on bilinear feature extraction (MLW-BFECF) for stage prediction of KIRC using multimodal gene data (RNA-seq, CNA, and methylation). The proposed model uses a dynamic cascade framework with shuffle layers to prevent early degradation of the model. In each cascade layer, a voting technique based on three gene selection algorithms is first employed to retain gene features more relevant to KIRC and eliminate redundant information. Then, two new bilinear models based on a gated attention mechanism are proposed to better extract new intra-modal and inter-modal gene features. Finally, based on the idea of bagging, a multi-weighted ensemble forest classifier module is proposed to extract and fuse probabilistic features of the three-modal gene data. A series of experiments demonstrates that the MLW-BFECF model achieves the highest prediction performance on the three-modal KIRC dataset, with an accuracy of 88.9%.
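The voting step over three gene selection algorithms might look like the following sketch. The abstract names neither the three selectors nor the voting rule, so a simple majority vote is assumed, `vote_select` is a hypothetical helper, and the gene identifiers are placeholders rather than results.

```python
from collections import Counter


def vote_select(selections, min_votes=2):
    """Keep genes chosen by at least `min_votes` of the selection algorithms."""
    # set() guards against one selector listing the same gene twice.
    votes = Counter(gene for selected in selections for gene in set(selected))
    return sorted(gene for gene, count in votes.items() if count >= min_votes)


# Placeholder outputs of three hypothetical gene selectors.
selections = [
    ["g1", "g2", "g3"],
    ["g1", "g4", "g2"],
    ["g1", "g3", "g5"],
]
kept = vote_select(selections)
```

Raising `min_votes` to 3 would keep only the genes on which all selectors agree, trading recall for a smaller and less redundant feature set.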

Vol. 21, no. 6, pp. 2568-2579.

Citations: 0

An End-to-End Knowledge Graph Fused Graph Neural Network for Accurate Protein-Protein Interactions Prediction

Pub Date : 2024-10-24 DOI: 10.1109/TCBB.2024.3486216

Jie Yang;Yapeng Li;Guoyin Wang;Zhong Chen;Di Wu

Protein-protein interactions (PPIs) are essential to understanding cellular mechanisms, signaling networks, disease processes, and drug development, as they represent the physical contacts and functional associations between proteins. Artificial intelligence (AI) methods for predicting PPIs have advanced considerably in recent years. However, these approaches often handle the intricate web of relationships and mechanisms among proteins, drugs, diseases, ribonucleic acid (RNA), and protein structures in a fragmented or superficial manner. This is typically due to the limitations of non-end-to-end learning frameworks, which can lead to sub-optimal feature extraction and fusion and thereby compromise prediction accuracy. To address these deficiencies, this paper introduces a novel end-to-end learning model, the Knowledge Graph Fused Graph Neural Network (KGF-GNN). The model comprises three components: (1) Protein Associated Network (PAN) construction: a PAN is built that extensively captures the diverse relationships and mechanisms linking proteins with drugs, diseases, RNA, and protein structures. (2) Graph neural network for feature extraction: one graph neural network (GNN) distills both topological and semantic features from the PAN, while a second GNN extracts topological features directly from observed PPI networks. (3) Multi-layer perceptron for feature fusion: a multi-layer perceptron integrates these features through end-to-end learning, so that feature extraction and fusion are jointly optimized for PPI prediction. Extensive experiments on real-world PPI datasets validate the effectiveness of the proposed KGF-GNN approach, which achieves high accuracy in predicting PPIs and significantly surpasses existing state-of-the-art models. This work improves the precision of PPI prediction and contributes to the broader application of AI in bioinformatics, with implications for biological research and therapeutic development.

Vol. 21, no. 6, pp. 2518-2530.

Citations: 0

A Comprehensive Evaluation Framework for Benchmarking Multi-Objective Feature Selection in Omics-Based Biomarker Discovery

Pub Date : 2024-10-14 DOI: 10.1109/TCBB.2024.3480150

Luca Cattelani;Arindam Ghosh;Teemu J. Rintala;Vittorio Fortino

Machine learning algorithms have been extensively used for accurate classification of cancer subtypes driven by gene expression-based biomarkers. However, biomarker models combining multiple gene expression signatures are often not reproducible in external validation datasets, and their feature set size is often not optimized, jeopardizing their translatability into cost-effective clinical tools. We investigate how to solve the multi-objective problem of finding the best trade-offs between classification performance and feature set size by applying seven algorithms for machine learning-driven feature subset selection, and we analyze how they perform in a benchmark with eight large-scale cancer transcriptome datasets covering both training and external validation sets. The benchmark includes evaluation metrics that assess the individual biomarkers and the solution sets according to their accuracy, their diversity, and the stability of their constituent genes. Moreover, a new evaluation metric for cross-validation studies is proposed that generalizes the hypervolume, which is commonly used to assess the performance of multi-objective optimization algorithms. Biomarkers with a balanced accuracy of 0.8 on the external datasets for breast, kidney, and ovarian cancer were obtained using 4, 2, and 7 features, respectively. Genetic algorithms often performed better than the other algorithms considered, and the recently proposed NSGA2-CH and NSGA2-CHS were the best-performing methods in most cases.
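For readers unfamiliar with the hypervolume, the sketch below computes the standard two-objective version for (balanced accuracy, feature-set size) solutions. It is the ordinary hypervolume, not the cross-validation generalization the paper proposes; the function name, the reference point, and the sample solutions are all illustrative assumptions.

```python
def hypervolume_2d(points, ref_accuracy, ref_size):
    """Area dominated by (accuracy, size) solutions and bounded by a reference point.

    Accuracy is maximized and feature-set size is minimized, so a solution
    dominates the rectangle from (ref_accuracy, its size) to (its accuracy, ref_size).
    """
    volume = 0.0
    best_accuracy = ref_accuracy
    # Sweep from the smallest feature set to the largest; a larger set only
    # adds area if it improves on the best accuracy seen so far.
    for accuracy, size in sorted(points, key=lambda p: (p[1], -p[0])):
        if size >= ref_size or accuracy <= best_accuracy:
            continue  # dominated, or outside the reference box
        volume += (accuracy - best_accuracy) * (ref_size - size)
        best_accuracy = accuracy
    return volume


# Hypothetical solution set: (balanced accuracy, number of features).
solutions = [(0.70, 2), (0.85, 5), (0.80, 6)]
hv = hypervolume_2d(solutions, ref_accuracy=0.0, ref_size=10)
```

The third solution is dominated by the second (lower accuracy with more features), so it contributes nothing; a larger hypervolume indicates a better overall trade-off front.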

Vol. 21, no. 6, pp. 2432-2446. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10716353

Citations: 0

Generative Biomedical Event Extraction With Constrained Decoding Strategy

Pub Date : 2024-10-14 DOI: 10.1109/TCBB.2024.3480088

Fangfang Su;Chong Teng;Fei Li;Bobo Li;Jun Zhou;Donghong Ji

Biomedical event extraction has received considerable attention in fields including natural language processing, bioinformatics, and computational biomedicine, and numerous machine learning and deep learning models have been proposed and applied to this complex task. However, existing models typically adopt an extraction-based approach, which breaks the extraction of biomedical events into multiple subtasks for sequential processing and is therefore prone to cascading errors. This paper presents a novel approach that constructs a biomedical event generation model based on the pre-trained language model T5. We employ a sequence-to-sequence generation paradigm to obtain events; the model uses a constrained decoding algorithm to guide sequence generation and a curriculum learning algorithm for efficient model learning. To demonstrate the effectiveness of our model, we evaluate it on two public benchmark datasets, Genia 2011 and Genia 2013. Our model achieves superior performance, illustrating the effectiveness of generative modeling of biomedical events.
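Constrained decoding is commonly implemented by restricting each generation step to tokens that keep the output inside a set of valid sequences, often stored in a trie. The sketch below illustrates that general idea with a greedy decoder; it is not the paper's algorithm, it does not call T5, and the token names, scores, and helper functions are hypothetical.

```python
END = "<end>"


def build_trie(sequences):
    """Build a nested-dict trie; every valid sequence terminates in END."""
    root = {}
    for sequence in sequences:
        node = root
        for token in sequence:
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def constrained_greedy_decode(score_fn, trie, max_len=10):
    """At each step, pick the best-scoring token among those the trie allows."""
    node, output = trie, []
    for _ in range(max_len):
        allowed = list(node)
        if not allowed:
            break
        best = max(allowed, key=lambda token: score_fn(output, token))
        if best == END:
            break
        output.append(best)
        node = node[best]
    return output


# Hypothetical valid event structures (trigger type followed by argument role).
trie = build_trie([["Binding", "Theme"], ["Regulation", "Cause"]])

# Stand-in for model scores: the unconstrained favorite "Theme" is not a
# valid first token, so the constraint steers decoding to "Binding".
scores = {"Theme": 0.9, "Binding": 0.6, "Regulation": 0.5, "Cause": 0.4, END: 0.1}
decoded = constrained_greedy_decode(lambda prefix, token: scores.get(token, 0.0), trie)
```

In a real system the score function would be the language model's next-token distribution given the input text and the prefix generated so far, and the same restriction can be applied inside beam search.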

Vol. 21, no. 6, pp. 2471-2484.

Citations: 0

GrapHiC: An integrative graph based approach for imputing missing Hi-C reads.

Pub Date : 2024-10-11 DOI: 10.1109/TCBB.2024.3477909

Ghulam Murtaza, Justin Wagner, Justin M Zook, Ritambhara Singh

Hi-C experiments allow researchers to study and understand the 3D genome organization and its regulatory function. Unfortunately, sequencing costs and technical constraints severely restrict access to high-quality Hi-C data for many cell types. Existing frameworks rely on a sparse Hi-C dataset or cheaper-to-acquire ChIP-seq data to predict Hi-C contact maps with high read coverage. However, these methods fail to generalize to sparse or cross-cell-type inputs because they do not account for the contributions of epigenomic features or the impact of the structural neighborhood in predicting Hi-C reads. We propose GrapHiC, which combines Hi-C and ChIP-seq in a graph representation, allowing more accurate embedding of structural and epigenomic features. Each node represents a binned genomic region, and we assign edge weights using the observed Hi-C reads. Additionally, we embed ChIP-seq and relative positional information as node attributes, allowing our representation to capture structural neighborhoods and the contributions of proteins and their modifications for predicting Hi-C reads. We show that GrapHiC generalizes better than the current state-of-the-art on cross-cell-type settings and sparse Hi-C inputs. Moreover, we can utilize our framework to impute Hi-C reads even when no Hi-C contact map is available, thus making high-quality Hi-C data accessible for many cell types. Availability: https://github.com/rsinghlab/GrapHiC.
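The graph representation described above can be sketched as follows. This is an illustration only, not code from the GrapHiC repository: `build_graph` is a hypothetical helper, the relative position is assumed to be a single value scaled to [0, 1], and the contact counts and ChIP-seq signals are toy numbers.

```python
def build_graph(hic_counts, chipseq_signals):
    """Turn a binned Hi-C contact matrix and per-bin ChIP-seq signals into a graph.

    Nodes are genomic bins; an edge (i, j) carries the observed Hi-C read count.
    Each node attribute vector is its ChIP-seq signals plus a relative position.
    """
    n = len(hic_counts)
    edges = {
        (i, j): hic_counts[i][j]
        for i in range(n)
        for j in range(i, n)  # the contact matrix is symmetric
        if hic_counts[i][j] > 0
    }
    nodes = [
        list(chipseq_signals[i]) + [i / (n - 1) if n > 1 else 0.0]
        for i in range(n)
    ]
    return nodes, edges


# Toy example: three bins, one ChIP-seq track.
hic_counts = [
    [5, 2, 0],
    [2, 4, 1],
    [0, 1, 3],
]
chipseq_signals = [[0.1], [0.7], [0.3]]
nodes, edges = build_graph(hic_counts, chipseq_signals)
```

Bins 0 and 2 have no observed contact, so no edge is created between them; a graph model could still relate them through their shared neighbor and their node attributes.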

Citations: 0

Guest Editors' Introduction to the Special Section on Bioinformatics Research and Applications

IF 3.6, Zone 3 (Biology), Q2, BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-10-09 DOI: 10.1109/TCBB.2024.3390374

Zhipeng Cai;Alexander Zelikovsky

Citations: 0

De Novo Drug Design by Multi-Objective Path Consistency Learning With Beam A* Search

IF 3.6, Zone 3 (Biology), Q2, BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-10-09 DOI: 10.1109/TCBB.2024.3477592

Dengwei Zhao;Jingyuan Zhou;Shikui Tu;Lei Xu

Generating high-quality and drug-like molecules from scratch within the expansive chemical space presents a significant challenge in the field of drug discovery. In prior research, value-based reinforcement learning algorithms have been employed to generate molecules with multiple desired properties iteratively. The immediate reward was defined as the evaluation of intermediate-state molecules at each step, and the learning objective would be maximizing the expected cumulative evaluation scores for all molecules along the generative path. However, this definition of the reward was misleading, as in reality, the optimization target should be the evaluation score of only the final generated molecule. Furthermore, in previous works, randomness was introduced into the decision-making process, enabling the generation of diverse molecules but no longer pursuing the maximum future rewards. In this paper, immediate reward is defined as the improvement achieved through the modification of the molecule to maximize the evaluation score of the final generated molecule exclusively. Originating from the A$^*$ search, path consistency (PC), i.e., $f$ values on one optimal path should be identical, is employed as the objective function in the update of the $f$ value estimator to train a multi-objective de novo drug designer. By incorporating the $f$ value into the decision-making process of beam search, the DrugBA$^*$ algorithm is proposed to enable the large-scale generation of molecules that exhibit both high quality and diversity. Experimental results demonstrate a substantial enhancement over the state-of-the-art algorithm QADD in multiple molecular properties of the generated molecules.
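To make the two ideas in this abstract concrete, the following toy sketch ranks beam-search candidates by an f value and measures how far the f values along one path deviate from being identical. It is a minimal illustration under stated assumptions, not the DrugBA* algorithm itself: the names `beam_search_by_f` and `path_consistency_loss`, the decomposition f = g + h (with g the improvement accumulated so far and h an estimate of the improvement still to come), and the mean-squared-deviation form of the loss are all choices made for the example.

```python
from typing import Callable, Hashable, Iterable, List, Sequence, Tuple

State = Hashable  # stand-in for a molecule; any hashable toy state works


def beam_search_by_f(
    start: State,
    expand: Callable[[State], Iterable[State]],
    score: Callable[[State], float],
    h: Callable[[State], float],
    beam_width: int,
    steps: int,
) -> List[Tuple[State, ...]]:
    """Keep the `beam_width` partial paths with the highest f = g + h."""

    def f(path: Tuple[State, ...]) -> float:
        # g: sum of per-step improvements, which telescopes to
        # score(current) - score(start).
        g = score(path[-1]) - score(path[0])
        # h: estimated future improvement (ASSUMPTION: supplied by caller).
        return g + h(path[-1])

    beam: List[Tuple[State, ...]] = [(start,)]
    for _ in range(steps):
        candidates = [p + (nxt,) for p in beam for nxt in expand(p[-1])]
        if not candidates:
            break
        candidates.sort(key=f, reverse=True)
        beam = candidates[:beam_width]
    return beam


def path_consistency_loss(f_values: Sequence[float]) -> float:
    """Mean squared deviation of the f values on one path from their mean.

    The loss is zero exactly when all f values on the path are identical.
    """
    mean_f = sum(f_values) / len(f_values)
    return sum((v - mean_f) ** 2 for v in f_values) / len(f_values)
```

In a toy setting where states are integers, `expand` returns `s + 1` and `s + 2`, and `score(s) = -(s - 5) ** 2`, a perfect estimator `h(s) = (s - 5) ** 2` gives the same f value at every state up to 5, so `path_consistency_loss` is zero along any path that ends at 5. An imperfect estimator yields a positive loss that training could then reduce.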


Vol. 21, No. 6, pp. 2459-2470.

Citations: 0

Guest Editorial Selected Papers From BIOKDD 2022

IF 3.6, Zone 3 (Biology), Q2, BIOCHEMICAL RESEARCH METHODS

IEEE/ACM Transactions on Computational Biology and Bioinformatics

Pub Date : 2024-10-09 DOI: 10.1109/TCBB.2024.3429784

Da Yan;Catia Pesquita;Carsten Görg;Jake Y. Chen

Citations: 0