Pub Date : 2024-08-29 DOI: 10.1109/TCBB.2024.3451985
Qi Zhang, Yuxiao Wei, Bo Liao, Liwei Liu, Shengli Zhang
The prediction of drug-target affinity (DTA) plays a crucial role in drug development and the identification of potential drug targets. In recent years, computer-assisted DTA prediction has emerged as a significant approach in this field. In this study, we propose a multi-modal deep learning framework called MMD-DTA for predicting drug-target binding affinity and binding regions. The model can predict DTA while simultaneously learning the binding regions of drug-target interactions through unsupervised learning. To achieve this, MMD-DTA first uses graph neural networks and a target structural feature extraction network to extract multi-modal information from the sequences and structures of drugs and targets. It then uses feature interaction and fusion modules to generate interaction descriptors for predicting DTA and interaction strengths for binding region prediction. Our experimental results demonstrate that MMD-DTA outperforms existing models on key evaluation metrics. Furthermore, external validation results indicate that MMD-DTA enhances the generalization capability of the model by integrating sequence and structural information of drugs and targets. The model trained on the benchmark dataset can effectively generalize to independent virtual screening tasks. The visualization of drug-target binding region prediction showcases the interpretability of MMD-DTA, providing valuable insights into the functional regions of drug molecules that interact with proteins.
MMD-DTA: A multi-modal deep learning framework for drug-target binding affinity and binding region prediction. Qi Zhang, Yuxiao Wei, Bo Liao, Liwei Liu, Shengli Zhang. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024-08-29. DOI: 10.1109/TCBB.2024.3451985
Pub Date : 2024-08-29 DOI: 10.1109/TCBB.2024.3443854
Chu-Ting Yu, Bo Tian, Qian-Qian Meng, Zhe-Ran Chen, Ya-Nan Pang, Xun Zhang, Yan Bian, Si-Wei Zhou, Mei-Juan Hao, Ye Gao, Lei Xin, Han Lin, Wei Wang, Luo-Wei Wang
Immunotherapy for esophageal squamous cell carcinoma (ESCC) exhibits notable variability in efficacy. Concurrently, recent research emphasizes circRNAs' impact on the ESCC tumor microenvironment. To further explore the relationship, we leveraged circRNA, microRNA, and mRNA sequence datasets to construct a comprehensive immune-related circRNA-microRNA-mRNA network, revealing competing endogenous RNA (ceRNA) roles in ESCC. The network comprises 16 circular RNAs, 13 microRNAs, and 1,560 mRNAs. Weighted gene co-expression analysis identified immune-related modules, notably cancer-associated fibroblast (CAF) and myeloid-derived suppressor cell modules, correlating significantly with immune and stemness scores. Among them, the CAF module plays a crucial role in extracellular matrix function and effectively discriminates ESCC patients. Four hub collagen family genes within CAF correlated robustly with CAF, macrophage infiltration, and T-cell exclusion. In-house sequencing and RT-qPCR validated their elevated expression. We also identified CAF module-targeting drugs as potential ESCC treatments. In summary, we established an immune-related circRNA-miRNA-mRNA network that not only illuminates ceRNA functionality but also highlights circRNAs' involvement in the CAF through collagen gene targeting. These findings hold promise to predict ESCC immune landscapes and therapy responses, ultimately aiding in more personalized and effective clinical decision-making.
Development and Validation of a Comprehensive Analysis of the Competing Endogenous circRNA/miRNA/mRNA Network for the Identification of Immune-Related Targets in Esophageal Squamous Cell Carcinoma. Chu-Ting Yu, Bo Tian, Qian-Qian Meng, Zhe-Ran Chen, Ya-Nan Pang, Xun Zhang, Yan Bian, Si-Wei Zhou, Mei-Juan Hao, Ye Gao, Lei Xin, Han Lin, Wei Wang, Luo-Wei Wang. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024-08-29. DOI: 10.1109/TCBB.2024.3443854
Pub Date : 2024-08-28 DOI: 10.1109/TCBB.2024.3451051
Huiwei Zhou, Wenchu Li, Weihong Yao, Yingyu Lin, Lei Du
Hypothesis Generation (HG) aims to expedite biomedical research by generating novel hypotheses from existing scientific literature. Most existing studies focus on modeling static snapshots of the corpus, neglecting the temporal evolution of scientific terms. Despite recent efforts to learn term evolution from Knowledge Bases (KBs) for HG, the temporal information in multi-source KBs, which contains important, up-to-date knowledge, is still overlooked. In this paper, an innovative Temporal Contrastive Learning (TCL) framework is introduced to uncover latent associations between entities by jointly modeling their co-evolution across multi-source temporal KBs. Specifically, we first construct a temporal relation graph based on PubMed papers and a biomedical relation database (such as the Comparative Toxicogenomics Database (CTD)). The constructed temporal relation graph and a temporal concept graph (such as Medical Subject Headings (MeSH)) are then used to train two GCN-based recurrent networks, one per graph, for learning the temporal evolutional embeddings of entities. Finally, a cross-view temporal prediction task is designed for learning knowledge-enriched temporal embeddings by contrasting the temporal embeddings learned from the two Temporal Knowledge Graphs (TKGs). Experiments on three real-world biomedical term relationship datasets demonstrate that the proposed approach is clearly superior to approaches based on a single TKG, achieving state-of-the-art performance.
Contrasting Multi-Source Temporal Knowledge Graphs for Biomedical Hypothesis Generation. Huiwei Zhou, Wenchu Li, Weihong Yao, Yingyu Lin, Lei Du. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024-08-28. DOI: 10.1109/TCBB.2024.3451051
Pub Date : 2024-08-23 DOI: 10.1109/TCBB.2024.3448617
Jose A Saez, J Fernando Vera
Categorical attributes are common in many classification tasks, presenting certain challenges as the number of categories grows. This situation can affect data handling, negatively impacting the building time of models, their complexity and, ultimately, their classification performance. In order to mitigate these issues, this research proposes a novel preprocessing technique for grouping attribute categories in classification datasets. This approach combines the exact representation of the association between categorical values in a Euclidean space, clustering methods and attribute quality metrics to group similar attribute categories based on their contribution to the classification task. To estimate its effectiveness, the proposal is evaluated within the context of HIV-1 protease cleavage site prediction, where each attribute represents an amino acid that can take multiple possible values. The results obtained on HIV-1 real-world datasets show a significant reduction in the number of categories per attribute, with an average reduction percentage ranging from 74% to 81%. This reduction leads to simplified data representations and improved classification performances compared to not preprocessing. Specifically, improvements of up to 0.07 in accuracy and 0.19 in geometric mean are observed across different datasets and classification algorithms. Additionally, extensive simulations on synthetic datasets with varied characteristics are carried out, providing consistent and reliable results that validate the robustness of the proposal. These findings highlight the capability of the developed method to enhance cleavage prediction, which could potentially contribute to understanding viral processes and developing targeted therapeutic strategies.
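The abstract above reports gains in accuracy and geometric mean. As a rough illustration only, the sketch below computes both for a binary task, assuming that "geometric mean" refers to the square root of sensitivity times specificity, as is common for imbalanced classification; the abstract does not give the exact definition, and the function name and the toy labels are hypothetical.

```python
import math


def accuracy_and_gmean(y_true, y_pred, positive=1):
    """Return (accuracy, G-mean) for binary labels.

    G-mean is taken here as sqrt(sensitivity * specificity); this
    definition is an assumption, not taken from the paper.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, math.sqrt(sensitivity * specificity)


# Toy example: 4 cleaved (1) and 2 non-cleaved (0) sites (made-up data).
acc, gmean = accuracy_and_gmean([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
print(acc, gmean)
```

Unlike accuracy, the G-mean drops to zero when either class is never predicted correctly, which is why it is often reported alongside accuracy on imbalanced data.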
Compact Class-conditional Attribute Category Clustering: Amino Acid Grouping for Enhanced HIV-1 Protease Cleavage Classification. Jose A Saez, J Fernando Vera. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024-08-23. DOI: 10.1109/TCBB.2024.3448617
Pub Date : 2024-08-22 DOI: 10.1109/TCBB.2024.3447746
Shaokai Wang, Ming Zhu, Bin Ma
Major Histocompatibility Complex (MHC) molecules play a critical role in the immune system by presenting peptides on the cell surface for recognition by T-cells. Tumor cells often produce MHC peptides with amino acid mutations, known as neoantigens, which evade T-cell recognition, leading to rapid tumor growth. In immunotherapies such as TCR-T and CAR-T, identifying these mutated MHC peptide sequences is crucial. Current mass spectrometry-based peptide identification methods primarily rely on database searching, which fails to detect mutated peptides not present in human databases. In this paper, we propose a novel workflow called NeoMS, designed to efficiently identify both non-mutated and mutated MHC-I peptides from mass spectrometry data. NeoMS utilizes a tagging algorithm to generate an expanded sequence database that includes potential mutated proteins for each sample. Furthermore, it employs a machine learning-based scoring function for each peptide-spectrum match (PSM) to maximize search sensitivity. Finally, a rigorous target-decoy approach is implemented to control the false discovery rates (FDR) of the peptides with and without mutations separately. Experimental results for regular peptides demonstrate that NeoMS outperforms four benchmark methods. For mutated peptides, NeoMS successfully identifies hundreds of high-quality mutated peptides in a melanoma-associated sample, with their validity confirmed by further studies.
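The target-decoy step mentioned in the abstract above can be sketched generically as follows. This is a minimal illustration of the standard target-decoy idea (estimated FDR = decoys / targets above a score cutoff), not NeoMS's actual implementation; the function name, the `(score, is_decoy)` input format, and the toy scores are all assumptions.

```python
def score_cutoff_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: iterable of (score, is_decoy) pairs, one per peptide-spectrum match.
    The FDR among PSMs scoring at or above a cutoff is estimated as
    (#decoys) / (#targets). Returns None if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep extending the accepted list while the estimate stays in bounds.
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Toy PSM list (made-up scores): one decoy ranked fourth.
toy = [(10, False), (9, False), (8, False), (7, True), (6, False)]
print(score_cutoff_at_fdr(toy, max_fdr=0.25))  # 6
print(score_cutoff_at_fdr(toy, max_fdr=0.20))  # 8
```

To control the FDR separately for peptides with and without mutations, as the abstract describes, one would call such a function once per group rather than on the pooled list.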
NeoMS: Mass Spectrometry-based Method for Uncovering Mutated MHC-I Neoantigens. Shaokai Wang, Ming Zhu, Bin Ma. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024-08-22. DOI: 10.1109/TCBB.2024.3447746
A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property using both artificial neural networks and mixed integer linear programming. In this paper, we design a new method for inferring a polymer based on the framework. For this, we introduce a new way of representing a polymer as a form of monomer and define new descriptors that feature the structure of polymers. We also use linear regression as a building block for constructing a prediction function in the framework. The results of our computational experiments reveal a set of chemical properties of polymers for which a prediction function constructed with linear regression performs well. We also observe that the proposed method can infer polymers with up to 50 non-hydrogen atoms in a monomer form.
A Method for Inferring Polymers Based on Linear Regression and Integer Programming. Ryota Ido, Shengjuan Cao, Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Hiroshi Nagamochi, Tatsuya Akutsu. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024-08-22. DOI: 10.1109/TCBB.2024.3447780
CircRNA is closely related to human disease, so it is important to predict circRNA-disease association (CDA). However, traditional biological detection methods are difficult to carry out and have low accuracy, and computational methods represented by deep learning neglect the model's ability to explicitly extract local depth information of the CDA. We propose a model based on knowledge graph from recursion and attention aggregation for circRNA-disease association prediction (KGRACDA). This model combines explicit structural features and implicit embedding information of graphs, optimizing graph embedding vectors. First, we build large-scale, multi-source heterogeneous datasets and construct a knowledge graph of multiple RNAs and diseases. Next, we use a recursive method to build multi-hop subgraphs and optimize the graph attention mechanism with a gating mechanism, mining local depth information. At the same time, the model uses a multi-head attention mechanism to balance global and local depth features of graphs and generate CDA prediction scores. KGRACDA surpasses other methods by capturing local and global depth information related to CDA. We also update an interactive web platform, HNRBase v2.0, which visualizes circRNA data and allows users to download data and predict CDA using the model.
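The idea of recursively building a multi-hop subgraph around an entity, mentioned in the abstract above, can be illustrated with a small sketch. It is a generic depth-limited expansion over a toy knowledge graph and makes no claim about how KGRACDA implements this step; the graph layout (a dict of `(relation, neighbor)` lists), the function name, and all node and relation names are hypothetical.

```python
def multi_hop_subgraph(graph, seed, hops):
    """Collect the nodes and edges reachable from `seed` within `hops` steps.

    graph: dict mapping a node to a list of (relation, neighbor) pairs.
    Returns (nodes, edges), where edges are (head, relation, tail) triples.
    Recursion depth is bounded by `hops`, so cycles cannot loop forever.
    """
    nodes, edges = {seed}, set()
    if hops == 0:
        return nodes, edges
    for relation, neighbor in graph.get(seed, []):
        edges.add((seed, relation, neighbor))
        sub_nodes, sub_edges = multi_hop_subgraph(graph, neighbor, hops - 1)
        nodes |= sub_nodes
        edges |= sub_edges
    return nodes, edges


# Toy knowledge graph with made-up entities and relations.
toy_kg = {
    "circA": [("sponges", "miR1")],
    "miR1": [("regulates", "geneX")],
    "geneX": [("associated_with", "diseaseD")],
}
nodes, edges = multi_hop_subgraph(toy_kg, "circA", hops=2)
print(sorted(nodes))  # ['circA', 'geneX', 'miR1']
```

With `hops=3` the expansion also reaches `diseaseD`, which shows how a larger hop budget exposes longer circRNA-to-disease paths.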
KGRACDA: A Model Based on Knowledge Graph from Recursion and Attention Aggregation for CircRNA-disease Association Prediction. Ying Wang, Maoyuan Ma, Yanxin Xie, Qinke Peng, Hongqiang Lyu, Hequan Sun, Laiyi Fu. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024-08-21. DOI: 10.1109/TCBB.2024.3447110
The function labeling of enzymes has wide application value in medicine, industrial biology, and other fields. Scientists define enzyme categories by Enzyme Commission (EC) numbers. Although some tools for enzyme function prediction exist, their performance has not yet reached the application level. To improve the precision of enzyme function prediction, we propose a parallel convolutional contrastive learning (PCCL) method. First, we use the advanced protein language model ESM-2 to preprocess the protein sequences. Second, PCCL combines convolutional neural networks (CNNs) and contrastive learning to improve the prediction precision for multifunctional enzymes. Contrastive learning helps the model deal with class imbalance. Finally, the deep learning framework is mainly composed of three parallel CNNs for fully extracting sample features. We compare PCCL with state-of-the-art enzyme function prediction methods on three evaluation metrics. The performance of our model improves on both test sets. In particular, on the smaller test set, PCCL improves the AUC by 2.57%. The source code can be downloaded from https://github.com/biomg/PCCL.
Parallel convolutional contrastive learning method for enzyme function prediction. Xindi Yu, Shusen Zhou, Mujun Zang, Qingjun Wang, Chanjuan Liu, Tong Liu. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024-08-21. DOI: 10.1109/TCBB.2024.3447037
Pub Date : 2024-08-21 DOI: 10.1109/TCBB.2024.3447273
Yufei Li, Xiaoyong Ma, Xiangyu Zhou, Penghzhen Cheng, Kai He, Tieliang Gong, Chen Li
Biomedical coreference resolution focuses on identifying coreferences in biomedical texts and normally consists of two parts: (i) mention detection, which identifies textual representations of biological entities, and (ii) finding their coreference links. Recently, a popular approach to enhance the task is to embed a knowledge base into deep neural networks. However, the way in which these methods integrate knowledge has the shortcoming that such knowledge may play a larger role in mention detection than in coreference resolution. Specifically, they tend to integrate knowledge prior to mention detection, as part of the embeddings. In addition, they primarily focus on mention-dependent knowledge (KBase), i.e., knowledge entities directly related to mentions, while ignoring the correlated knowledge (K+) between the mentions in a mention pair. For mentions with significant differences in word form, this may limit their ability to extract potential correlations between those mentions. Thus, this paper develops a novel model that integrates both KBase and K+ entities and achieves state-of-the-art performance on the BioNLP and CRAFT-CR datasets. Empirical studies on mention detection with different lengths reveal the effectiveness of the KBase entities. The evaluation on cross-sentence and match/mismatch coreference further demonstrates the superiority of the K+ entities in extracting background potential correlations between mentions.
{"title":"Integrating K+ Entities into Coreference Resolution on Biomedical Texts.","authors":"Yufei Li, Xiaoyong Ma, Xiangyu Zhou, Penghzhen Cheng, Kai He, Tieliang Gong, Chen Li","doi":"10.1109/TCBB.2024.3447273","DOIUrl":"https://doi.org/10.1109/TCBB.2024.3447273","url":"https://doi.org/10.1109/TCBB.2024.3447273","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"PP ","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142017375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
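The distinction the coreference abstract draws between mention-dependent knowledge (KBase) and correlated knowledge between the two mentions of a pair (K+) can be illustrated with a toy example. This is only a sketch under stated assumptions, not the authors' model: the knowledge base `TOY_KB`, the function names, and the choice to approximate K+ by the entities that both mentions share are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical toy knowledge base: mention string -> directly linked entities.
TOY_KB: Dict[str, Set[str]] = {
    "p53": {"tumor suppressor", "apoptosis", "DNA repair"},
    "TP53 gene product": {"tumor suppressor", "chromosome 17"},
    "insulin": {"hormone", "glucose metabolism"},
}


def kbase_entities(mention: str) -> Set[str]:
    """Mention-dependent knowledge: entities directly linked to one mention."""
    return TOY_KB.get(mention, set())


def kplus_entities(mention_a: str, mention_b: str) -> Set[str]:
    """Pair-level knowledge, approximated here (an assumption) as the
    entities linked to both mentions of the pair."""
    return kbase_entities(mention_a) & kbase_entities(mention_b)


if __name__ == "__main__":
    # The two mentions differ in word form but share background knowledge.
    print(kplus_entities("p53", "TP53 gene product"))
    # An unrelated pair shares nothing.
    print(kplus_entities("p53", "insulin"))
```

In this toy setting the pair "p53" / "TP53 gene product" has little surface overlap, yet the shared entity gives a pair-level signal that per-mention lookups alone would not expose.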
Pub Date : 2024-08-19 DOI: 10.1109/TCBB.2024.3433378
Jeremie S Kim, Can Firtina, Meryem Banu Cavlak, Damla Senol Cali, Nastaran Hajinazar, Mohammed Alser, Can Alkan, Onur Mutlu
AirLift is the first read remapping tool that enables users to quickly and comprehensively map a read set that had previously been mapped to one reference genome to another, similar reference. Users can then quickly rerun downstream analyses of their read sets for each new reference release. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4×. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants.
{"title":"AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes.","authors":"Jeremie S Kim, Can Firtina, Meryem Banu Cavlak, Damla Senol Cali, Nastaran Hajinazar, Mohammed Alser, Can Alkan, Onur Mutlu","doi":"10.1109/TCBB.2024.3433378","DOIUrl":"https://doi.org/10.1109/TCBB.2024.3433378","url":"https://doi.org/10.1109/TCBB.2024.3433378","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"PP ","pages":""},"PeriodicalIF":3.6,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142004162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
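The AirLift abstract does not describe how remapping is carried out, so the following is not AirLift's algorithm. It is a generic sketch of why moving reads between two similar references can be cheaper than full mapping: a read that lies entirely inside a block that is identical in both references only needs its coordinate translated, while any other read is flagged for full mapping. The block table `UNCHANGED_BLOCKS`, its values, and the function `lift_read` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical table of half-open blocks (old_start, old_end, new_start) whose
# sequence is assumed identical in the old and new reference, sorted by old_start.
UNCHANGED_BLOCKS: List[Tuple[int, int, int]] = [
    (0, 1000, 0),
    (1200, 5000, 1150),
]


def lift_read(old_pos: int, read_len: int) -> Optional[int]:
    """Return the read's start in the new reference, or None if the read
    is not fully contained in an unchanged block and needs full mapping."""
    starts = [block[0] for block in UNCHANGED_BLOCKS]
    # Index of the last block that starts at or before old_pos.
    i = bisect_right(starts, old_pos) - 1
    if i < 0:
        return None
    old_start, old_end, new_start = UNCHANGED_BLOCKS[i]
    if old_pos + read_len <= old_end:
        # Same offset within the block, shifted to the block's new start.
        return new_start + (old_pos - old_start)
    return None


if __name__ == "__main__":
    for pos in (100, 950, 1300):
        print(pos, lift_read(pos, read_len=150))
```

Reads for which `lift_read` returns `None` would be handed to a conventional read mapper; the fewer such reads there are, the larger the saving relative to mapping every read again.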