The development of therapeutic antibodies is an important aspect of new drug discovery pipelines. The assessment of an antibody's developability-its suitability for large-scale production and therapeutic use-is a particularly important step in this process. Given that experimental assays to assess antibody developability in large scale are expensive and time-consuming, computational methods have been a more efficient alternative. However, the antibody research community faces significant challenges due to the scarcity of readily accessible data on antibody developability, which is essential for training and validating computational models. To address this gap, DOTAD (Database Of Therapeutic Antibody Developability) has been built as the first database dedicated exclusively to the curation of therapeutic antibody developability information. DOTAD aggregates all available therapeutic antibody sequence data along with various developability metrics from the scientific literature, offering researchers a robust platform for data storage, retrieval, exploration, and downloading. In addition to serving as a comprehensive repository, DOTAD enhances its utility by integrating a web-based interface that features state-of-the-art tools for the assessment of antibody developability. This ensures that users not only have access to critical data but also have the convenience of analyzing and interpreting this information. The DOTAD database represents a valuable resource for the scientific community, facilitating the advancement of therapeutic antibody research. It is freely accessible at http://i.uestc.edu.cn/DOTAD/ , providing an open data platform that supports the continuous growth and evolution of computational methods in the field of antibody development.
{"title":"DOTAD: A Database of Therapeutic Antibody Developability.","authors":"Wenzhen Li, Hongyan Lin, Ziru Huang, Shiyang Xie, Yuwei Zhou, Rong Gong, Qianhu Jiang, ChangCheng Xiang, Jian Huang","doi":"10.1007/s12539-024-00613-2","DOIUrl":"10.1007/s12539-024-00613-2","url":null,"abstract":"<p><p>The development of therapeutic antibodies is an important aspect of new drug discovery pipelines. The assessment of an antibody's developability-its suitability for large-scale production and therapeutic use-is a particularly important step in this process. Given that experimental assays to assess antibody developability in large scale are expensive and time-consuming, computational methods have been a more efficient alternative. However, the antibody research community faces significant challenges due to the scarcity of readily accessible data on antibody developability, which is essential for training and validating computational models. To address this gap, DOTAD (Database Of Therapeutic Antibody Developability) has been built as the first database dedicated exclusively to the curation of therapeutic antibody developability information. DOTAD aggregates all available therapeutic antibody sequence data along with various developability metrics from the scientific literature, offering researchers a robust platform for data storage, retrieval, exploration, and downloading. In addition to serving as a comprehensive repository, DOTAD enhances its utility by integrating a web-based interface that features state-of-the-art tools for the assessment of antibody developability. This ensures that users not only have access to critical data but also have the convenience of analyzing and interpreting this information. The DOTAD database represents a valuable resource for the scientific community, facilitating the advancement of therapeutic antibody research. It is freely accessible at http://i.uestc.edu.cn/DOTAD/ , providing an open data platform that supports the continuous growth and evolution of computational methods in the field of antibody development.</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"623-634"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140293414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-01Epub Date: 2024-03-27DOI: 10.1007/s12539-024-00622-1
Jiaqi Zhai, Wenda Wang, Ranxi Zhao, Daiwen Sun, Da Lu, Xinqi Gong
Protein complex structure prediction is an important problem in computational biology. While significant progress has been made for protein monomers, accurate evaluation of protein complexes remains challenging. Existing assessment methods in CASP, lack dedicated metrics for evaluating complexes. DockQ, a widely used metric, has some limitations. In this study, we propose a novel metric called BDM (Based on Distance difference Matrix) for assessing protein complex prediction structures. Our approach utilizes a distance difference matrix derived from comparing real and predicted protein structures, establishing a linear correlation with Root Mean Square Deviation (RMSD). BDM overcomes limitations associated with receptor-ligand differentiation and eliminates the requirement for structure alignment, making it a more effective and efficient metric. Evaluation of BDM using CASP14 and CASP15 test sets demonstrates superior performance compared to the official CASP scoring. BDM provides accurate and reasonable assessments of predicted protein complexes, wide adoption of BDM has the potential to advance protein complex structure prediction and facilitate related researches across scientific domains. Code is available at http://mialab.ruc.edu.cn/BDMServer/ .
{"title":"BDM: An Assessment Metric for Protein Complex Structure Models Based on Distance Difference Matrix.","authors":"Jiaqi Zhai, Wenda Wang, Ranxi Zhao, Daiwen Sun, Da Lu, Xinqi Gong","doi":"10.1007/s12539-024-00622-1","DOIUrl":"10.1007/s12539-024-00622-1","url":null,"abstract":"<p><p>Protein complex structure prediction is an important problem in computational biology. While significant progress has been made for protein monomers, accurate evaluation of protein complexes remains challenging. Existing assessment methods in CASP, lack dedicated metrics for evaluating complexes. DockQ, a widely used metric, has some limitations. In this study, we propose a novel metric called BDM (Based on Distance difference Matrix) for assessing protein complex prediction structures. Our approach utilizes a distance difference matrix derived from comparing real and predicted protein structures, establishing a linear correlation with Root Mean Square Deviation (RMSD). BDM overcomes limitations associated with receptor-ligand differentiation and eliminates the requirement for structure alignment, making it a more effective and efficient metric. Evaluation of BDM using CASP14 and CASP15 test sets demonstrates superior performance compared to the official CASP scoring. BDM provides accurate and reasonable assessments of predicted protein complexes, wide adoption of BDM has the potential to advance protein complex structure prediction and facilitate related researches across scientific domains. Code is available at http://mialab.ruc.edu.cn/BDMServer/ .</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"677-687"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140305533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-01Epub Date: 2024-03-08DOI: 10.1007/s12539-024-00615-0
Qian Deng, Jing Zhang, Jie Liu, Yuqi Liu, Zong Dai, Xiaoyong Zou, Zhanchao Li
As one of the most important post-translational modifications (PTMs), protein phosphorylation plays a key role in a variety of biological processes. Many studies have shown that protein phosphorylation is associated with various human diseases. Therefore, identifying protein phosphorylation site-disease associations can help to elucidate the pathogenesis of disease and discover new drug targets. Networks of sequence similarity and Gaussian interaction profile kernel similarity were constructed for phosphorylation sites, as well as networks of disease semantic similarity, disease symptom similarity and Gaussian interaction profile kernel similarity were constructed for diseases. To effectively combine different phosphorylation sites and disease similarity information, random walk with restart algorithm was used to obtain the topology information of the network. Then, the diffusion component analysis method was utilized to obtain the comprehensive phosphorylation site similarity and disease similarity. Meanwhile, the reliable negative samples were screened based on the Euclidean distance method. Finally, a convolutional neural network (CNN) model was constructed to identify potential associations between phosphorylation sites and diseases. Based on tenfold cross-validation, the evaluation indicators were obtained including accuracy of 93.48%, specificity of 96.82%, sensitivity of 90.15%, precision of 96.62%, Matthew's correlation coefficient of 0.8719, area under the receiver operating characteristic curve of 0.9786 and area under the precision-recall curve of 0.9836. Additionally, most of the top 20 predicted disease-related phosphorylation sites (19/20 for Alzheimer's disease; 20/16 for neuroblastoma) were verified by literatures and databases. These results show that the proposed method has an outstanding prediction performance and a high practical value.
{"title":"Identifying Protein Phosphorylation Site-Disease Associations Based on Multi-Similarity Fusion and Negative Sample Selection by Convolutional Neural Network.","authors":"Qian Deng, Jing Zhang, Jie Liu, Yuqi Liu, Zong Dai, Xiaoyong Zou, Zhanchao Li","doi":"10.1007/s12539-024-00615-0","DOIUrl":"10.1007/s12539-024-00615-0","url":null,"abstract":"<p><p>As one of the most important post-translational modifications (PTMs), protein phosphorylation plays a key role in a variety of biological processes. Many studies have shown that protein phosphorylation is associated with various human diseases. Therefore, identifying protein phosphorylation site-disease associations can help to elucidate the pathogenesis of disease and discover new drug targets. Networks of sequence similarity and Gaussian interaction profile kernel similarity were constructed for phosphorylation sites, as well as networks of disease semantic similarity, disease symptom similarity and Gaussian interaction profile kernel similarity were constructed for diseases. To effectively combine different phosphorylation sites and disease similarity information, random walk with restart algorithm was used to obtain the topology information of the network. Then, the diffusion component analysis method was utilized to obtain the comprehensive phosphorylation site similarity and disease similarity. Meanwhile, the reliable negative samples were screened based on the Euclidean distance method. Finally, a convolutional neural network (CNN) model was constructed to identify potential associations between phosphorylation sites and diseases. Based on tenfold cross-validation, the evaluation indicators were obtained including accuracy of 93.48%, specificity of 96.82%, sensitivity of 90.15%, precision of 96.62%, Matthew's correlation coefficient of 0.8719, area under the receiver operating characteristic curve of 0.9786 and area under the precision-recall curve of 0.9836. Additionally, most of the top 20 predicted disease-related phosphorylation sites (19/20 for Alzheimer's disease; 20/16 for neuroblastoma) were verified by literatures and databases. These results show that the proposed method has an outstanding prediction performance and a high practical value.</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"649-664"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140059304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mass spectrometry is crucial in proteomics analysis, particularly using Data Independent Acquisition (DIA) for reliable and reproducible mass spectrometry data acquisition, enabling broad mass-to-charge ratio coverage and high throughput. DIA-NN, a prominent deep learning software in DIA proteome analysis, generates peptide results but may include low-confidence peptides. Conventionally, biologists have to manually screen peptide fragment ion chromatogram peaks (XIC) for identifying high-confidence peptides, a time-consuming and subjective process prone to variability. In this study, we introduce SeFilter-DIA, a deep learning algorithm, aiming at automating the identification of high-confidence peptides. Leveraging compressed excitation neural network and residual network models, SeFilter-DIA extracts XIC features and effectively discerns between high and low-confidence peptides. Evaluation of the benchmark datasets demonstrates SeFilter-DIA achieving 99.6% AUC on the test set and 97% for other performance indicators. Furthermore, SeFilter-DIA is applicable for screening peptides with phosphorylation modifications. These results demonstrate the potential of SeFilter-DIA to replace manual screening, providing an efficient and objective approach for high-confidence peptide identification while mitigating associated limitations.
{"title":"SeFilter-DIA: Squeeze-and-Excitation Network for Filtering High-Confidence Peptides of Data-Independent Acquisition Proteomics.","authors":"Qingzu He, Huan Guo, Yulin Li, Guoqiang He, Xiang Li, Jianwei Shuai","doi":"10.1007/s12539-024-00611-4","DOIUrl":"10.1007/s12539-024-00611-4","url":null,"abstract":"<p><p>Mass spectrometry is crucial in proteomics analysis, particularly using Data Independent Acquisition (DIA) for reliable and reproducible mass spectrometry data acquisition, enabling broad mass-to-charge ratio coverage and high throughput. DIA-NN, a prominent deep learning software in DIA proteome analysis, generates peptide results but may include low-confidence peptides. Conventionally, biologists have to manually screen peptide fragment ion chromatogram peaks (XIC) for identifying high-confidence peptides, a time-consuming and subjective process prone to variability. In this study, we introduce SeFilter-DIA, a deep learning algorithm, aiming at automating the identification of high-confidence peptides. Leveraging compressed excitation neural network and residual network models, SeFilter-DIA extracts XIC features and effectively discerns between high and low-confidence peptides. Evaluation of the benchmark datasets demonstrates SeFilter-DIA achieving 99.6% AUC on the test set and 97% for other performance indicators. Furthermore, SeFilter-DIA is applicable for screening peptides with phosphorylation modifications. These results demonstrate the potential of SeFilter-DIA to replace manual screening, providing an efficient and objective approach for high-confidence peptide identification while mitigating associated limitations.</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"579-592"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140109963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recognizing drug-target interactions (DTI) stands as a pivotal element in the expansive field of drug discovery. Traditional biological wet experiments, although valuable, are time-consuming and costly as methods. Recently, computational methods grounded in network learning have demonstrated great advantages by effective topological feature extraction and attracted extensive research attention. However, most existing network-based learning methods only consider the low-order binary correlation between individual drug and target, neglecting the potential higher-order correlation information derived from multiple drugs and targets. High-order information, as an essential component, exhibits complementarity with low-order information. Hence, the incorporation of higher-order associations between drugs and targets, while adequately integrating them with the existing lower-order information, could potentially yield substantial breakthroughs in predicting drug-target interactions. We propose a novel dual channels network-based learning model CHL-DTI that converges high-order information from hypergraphs and low-order information from ordinary graph for drug-target interaction prediction. The convergence of high-low order information in CHL-DTI is manifested in two key aspects. First, during the feature extraction stage, the model integrates both high-level semantic information and low-level topological information by combining hypergraphs and ordinary graph. Second, CHL-DTI fully fuse the innovative introduced drug-protein pairs (DPP) hypergraph network structure with ordinary topological network structure information. Extensive experimentation conducted on three public datasets showcases the superior performance of CHL-DTI in DTI prediction tasks when compared to SOTA methods. The source code of CHL-DTI is available at https://github.com/UPCLyy/CHL-DTI .
{"title":"CHL-DTI: A Novel High-Low Order Information Convergence Framework for Effective Drug-Target Interaction Prediction.","authors":"Shudong Wang, Yingye Liu, Yuanyuan Zhang, Kuijie Zhang, Xuanmo Song, Yu Zhang, Shanchen Pang","doi":"10.1007/s12539-024-00608-z","DOIUrl":"10.1007/s12539-024-00608-z","url":null,"abstract":"<p><p>Recognizing drug-target interactions (DTI) stands as a pivotal element in the expansive field of drug discovery. Traditional biological wet experiments, although valuable, are time-consuming and costly as methods. Recently, computational methods grounded in network learning have demonstrated great advantages by effective topological feature extraction and attracted extensive research attention. However, most existing network-based learning methods only consider the low-order binary correlation between individual drug and target, neglecting the potential higher-order correlation information derived from multiple drugs and targets. High-order information, as an essential component, exhibits complementarity with low-order information. Hence, the incorporation of higher-order associations between drugs and targets, while adequately integrating them with the existing lower-order information, could potentially yield substantial breakthroughs in predicting drug-target interactions. We propose a novel dual channels network-based learning model CHL-DTI that converges high-order information from hypergraphs and low-order information from ordinary graph for drug-target interaction prediction. The convergence of high-low order information in CHL-DTI is manifested in two key aspects. First, during the feature extraction stage, the model integrates both high-level semantic information and low-level topological information by combining hypergraphs and ordinary graph. Second, CHL-DTI fully fuse the innovative introduced drug-protein pairs (DPP) hypergraph network structure with ordinary topological network structure information. Extensive experimentation conducted on three public datasets showcases the superior performance of CHL-DTI in DTI prediction tasks when compared to SOTA methods. The source code of CHL-DTI is available at https://github.com/UPCLyy/CHL-DTI .</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"568-578"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140131318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-01Epub Date: 2024-03-01DOI: 10.1007/s12539-024-00606-1
Jin Deng, Kaijun Li, Wei Luo
Sarcomas are malignant tumors from mesenchymal tissue and are characterized by their complexity and diversity. The high recurrence rate making it important to understand the mechanisms behind their recurrence and to develop personalized treatments and drugs. However, previous studies on the association patterns of multi-modal data on sarcoma recurrence have overlooked the fact that genes do not act independently, but rather function within signaling pathways. Therefore, this study collected 290 whole solid images, 869 gene and 1387 pathway data of over 260 sarcoma samples from UCSC and TCGA to identify the association patterns of gene-pathway-cell related to sarcoma recurrences. Meanwhile, considering that most multi-modal data fusion methods based on the joint non-negative matrix factorization (NMF) model led to poor experimental repeatability due to random initialization of factorization parameters, the study proposed the singular value decomposition (SVD)-driven joint NMF model by applying the SVD method to calculate initialized weight and coefficient matrices to achieve the reproducibility of the results. The results of the experimental comparison indicated that the SVD algorithm enhances the performance of the joint NMF algorithm. Furthermore, the representative module indicated a significant relationship between genes in pathways and image features. Multi-level analysis provided valuable insights into the connections between biological processes, cellular features, and sarcoma recurrence. In addition, potential biomarkers were uncovered, while various mechanisms of sarcoma recurrence were identified from an imaging genetic perspective. Overall, the SVD-NMF model affords a novel perspective on combining multi-omics data to explore the association related to sarcoma recurrence.
{"title":"Singular Value Decomposition-Driven Non-negative Matrix Factorization with Application to Identify the Association Patterns of Sarcoma Recurrence.","authors":"Jin Deng, Kaijun Li, Wei Luo","doi":"10.1007/s12539-024-00606-1","DOIUrl":"10.1007/s12539-024-00606-1","url":null,"abstract":"<p><p>Sarcomas are malignant tumors from mesenchymal tissue and are characterized by their complexity and diversity. The high recurrence rate making it important to understand the mechanisms behind their recurrence and to develop personalized treatments and drugs. However, previous studies on the association patterns of multi-modal data on sarcoma recurrence have overlooked the fact that genes do not act independently, but rather function within signaling pathways. Therefore, this study collected 290 whole solid images, 869 gene and 1387 pathway data of over 260 sarcoma samples from UCSC and TCGA to identify the association patterns of gene-pathway-cell related to sarcoma recurrences. Meanwhile, considering that most multi-modal data fusion methods based on the joint non-negative matrix factorization (NMF) model led to poor experimental repeatability due to random initialization of factorization parameters, the study proposed the singular value decomposition (SVD)-driven joint NMF model by applying the SVD method to calculate initialized weight and coefficient matrices to achieve the reproducibility of the results. The results of the experimental comparison indicated that the SVD algorithm enhances the performance of the joint NMF algorithm. Furthermore, the representative module indicated a significant relationship between genes in pathways and image features. Multi-level analysis provided valuable insights into the connections between biological processes, cellular features, and sarcoma recurrence. In addition, potential biomarkers were uncovered, while various mechanisms of sarcoma recurrence were identified from an imaging genetic perspective. Overall, the SVD-NMF model affords a novel perspective on combining multi-omics data to explore the association related to sarcoma recurrence.</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"554-567"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139996217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Molecular representation learning can preserve meaningful molecular structures as embedding vectors, which is a necessary prerequisite for molecular property prediction. Yet, learning how to accurately represent molecules remains challenging. Previous approaches to learning molecular representations in an end-to-end manner potentially suffered information loss while neglecting the utilization of molecular generative representations. To obtain rich molecular feature information, the pre-training molecular representation model utilized different molecular representations to reduce information loss caused by a single molecular representation. Therefore, we provide the MVGC, a unique multi-view generative contrastive learning pre-training model. Our pre-training framework specifically acquires knowledge of three fundamental feature representations of molecules and effectively integrates them to predict molecular properties on benchmark datasets. Comprehensive experiments on seven classification tasks and three regression tasks demonstrate that our proposed MVGC model surpasses the majority of state-of-the-art approaches. Moreover, we explore the potential of the MVGC model to learn the representation of molecules with chemical significance.
{"title":"A Multi-view Molecular Pre-training with Generative Contrastive Learning.","authors":"Yunwu Liu, Ruisheng Zhang, Yongna Yuan, Jun Ma, Tongfeng Li, Zhixuan Yu","doi":"10.1007/s12539-024-00632-z","DOIUrl":"10.1007/s12539-024-00632-z","url":null,"abstract":"<p><p>Molecular representation learning can preserve meaningful molecular structures as embedding vectors, which is a necessary prerequisite for molecular property prediction. Yet, learning how to accurately represent molecules remains challenging. Previous approaches to learning molecular representations in an end-to-end manner potentially suffered information loss while neglecting the utilization of molecular generative representations. To obtain rich molecular feature information, the pre-training molecular representation model utilized different molecular representations to reduce information loss caused by a single molecular representation. Therefore, we provide the MVGC, a unique multi-view generative contrastive learning pre-training model. Our pre-training framework specifically acquires knowledge of three fundamental feature representations of molecules and effectively integrates them to predict molecular properties on benchmark datasets. Comprehensive experiments on seven classification tasks and three regression tasks demonstrate that our proposed MVGC model surpasses the majority of state-of-the-art approaches. Moreover, we explore the potential of the MVGC model to learn the representation of molecules with chemical significance.</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"741-754"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140865514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-09-01Epub Date: 2024-07-02DOI: 10.1007/s12539-024-00621-2
Shouheng Tuo, Jiewei Jiang
To elucidate the genetic basis of complex diseases, it is crucial to discover the single-nucleotide polymorphisms (SNPs) contributing to disease susceptibility. This is particularly challenging for high-order SNP epistatic interactions (HEIs), which exhibit small individual effects but potentially large joint effects. These interactions are difficult to detect due to the vast search space, encompassing billions of possible combinations, and the computational complexity of evaluating them. This study proposes a novel explicit-encoding-based multitasking harmony search algorithm (MTHS-EE-DHEI) specifically designed to address this challenge. The algorithm operates in three stages. First, a harmony search algorithm is employed, utilizing four lightweight evaluation functions, such as Bayesian network and entropy, to efficiently explore potential SNP combinations related to disease status. Second, a G-test statistical method is applied to filter out insignificant SNP combinations. Finally, two machine learning-based methods, multifactor dimensionality reduction (MDR) as well as random forest (RF), are employed to validate the classification performance of the remaining significant SNP combinations. This research aims to demonstrate the effectiveness of MTHS-EE-DHEI in identifying HEIs compared to existing methods, potentially providing valuable insights into the genetic architecture of complex diseases. The performance of MTHS-EE-DHEI was evaluated on twenty simulated disease datasets and three real-world datasets encompassing age-related macular degeneration (AMD), rheumatoid arthritis (RA), and breast cancer (BC). The results demonstrably indicate that MTHS-EE-DHEI outperforms four state-of-the-art algorithms in terms of both detection power and computational efficiency. The source code is available at https://github.com/shouhengtuo/MTHS-EE-DHEI.git .
要阐明复杂疾病的遗传基础,发现导致疾病易感性的单核苷酸多态性(SNPs)至关重要。这对于高阶 SNP 表观交互作用(HEIs)来说尤其具有挑战性,因为这种交互作用表现出较小的个体效应,但可能产生较大的联合效应。由于搜索空间巨大,包含数十亿种可能的组合,而且评估这些组合的计算复杂,因此很难检测到这些相互作用。本研究提出了一种基于显式编码的新型多任务和谐搜索算法(MTHS-EE-DHEI),专门用于解决这一难题。该算法分三个阶段运行。首先,采用和谐搜索算法,利用贝叶斯网络和熵等四种轻量级评估函数,有效探索与疾病状态相关的潜在 SNP 组合。其次,采用 G 检验统计方法过滤掉不重要的 SNP 组合。最后,采用多因素降维(MDR)和随机森林(RF)这两种基于机器学习的方法来验证剩余重要 SNP 组合的分类性能。这项研究旨在证明,与现有方法相比,MTHS-EE-DHEI 在识别 HEI 方面非常有效,有可能为复杂疾病的遗传结构提供有价值的见解。MTHS-EE-DHEI 的性能在二十个模拟疾病数据集和三个真实世界数据集上进行了评估,包括老年性黄斑变性(AMD)、类风湿性关节炎(RA)和乳腺癌(BC)。结果表明,MTHS-EE-DHEI 在检测能力和计算效率方面都优于四种最先进的算法。源代码见 https://github.com/shouhengtuo/MTHS-EE-DHEI.git 。
{"title":"A Novel Detection Method for High-Order SNP Epistatic Interactions Based on Explicit-Encoding-Based Multitasking Harmony Search.","authors":"Shouheng Tuo, Jiewei Jiang","doi":"10.1007/s12539-024-00621-2","DOIUrl":"10.1007/s12539-024-00621-2","url":null,"abstract":"<p><p>To elucidate the genetic basis of complex diseases, it is crucial to discover the single-nucleotide polymorphisms (SNPs) contributing to disease susceptibility. This is particularly challenging for high-order SNP epistatic interactions (HEIs), which exhibit small individual effects but potentially large joint effects. These interactions are difficult to detect due to the vast search space, encompassing billions of possible combinations, and the computational complexity of evaluating them. This study proposes a novel explicit-encoding-based multitasking harmony search algorithm (MTHS-EE-DHEI) specifically designed to address this challenge. The algorithm operates in three stages. First, a harmony search algorithm is employed, utilizing four lightweight evaluation functions, such as Bayesian network and entropy, to efficiently explore potential SNP combinations related to disease status. Second, a G-test statistical method is applied to filter out insignificant SNP combinations. Finally, two machine learning-based methods, multifactor dimensionality reduction (MDR) as well as random forest (RF), are employed to validate the classification performance of the remaining significant SNP combinations. This research aims to demonstrate the effectiveness of MTHS-EE-DHEI in identifying HEIs compared to existing methods, potentially providing valuable insights into the genetic architecture of complex diseases. The performance of MTHS-EE-DHEI was evaluated on twenty simulated disease datasets and three real-world datasets encompassing age-related macular degeneration (AMD), rheumatoid arthritis (RA), and breast cancer (BC). The results demonstrably indicate that MTHS-EE-DHEI outperforms four state-of-the-art algorithms in terms of both detection power and computational efficiency. The source code is available at https://github.com/shouhengtuo/MTHS-EE-DHEI.git .</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"688-711"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141491788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-01Epub Date: 2024-03-08DOI: 10.1007/s12539-024-00609-y
Jun Ma, Zhili Zhao, Tongfeng Li, Yunwu Liu, Jun Ma, Ruisheng Zhang
Accurately predicting compound-protein interactions (CPI) is a critical task in computer-aided drug design. In recent years, the exponential growth of compound activity and biomedical data has highlighted the need for efficient and interpretable prediction approaches. In this study, we propose GraphsformerCPI, an end-to-end deep learning framework that improves prediction performance and interpretability. GraphsformerCPI treats compounds and proteins as sequences of nodes with spatial structures, and leverages novel structure-enhanced self-attention mechanisms to integrate semantic and graph structural features within molecules for deep molecule representations. To capture the vital association between compound atoms and protein residues, we devise a dual-attention mechanism to effectively extract relational features through .cross-mapping. By extending the powerful learning capabilities of Transformers to spatial structures and extensively utilizing attention mechanisms, our model offers strong interpretability, a significant advantage over most black-box deep learning methods. To evaluate GraphsformerCPI, extensive experiments were conducted on benchmark datasets including human, C. elegans, Davis and KIBA datasets. We explored the impact of model depth and dropout rate on performance and compared our model against state-of-the-art baseline models. Our results demonstrate that GraphsformerCPI outperforms baseline models in classification datasets and achieves competitive performance in regression datasets. Specifically, on the human dataset, GraphsformerCPI achieves an average improvement of 1.6% in AUC, 0.5% in precision, and 5.3% in recall. On the KIBA dataset, the average improvement in Concordance index (CI) and mean squared error (MSE) is 3.3% and 7.2%, respectively. Molecular docking shows that our model provides novel insights into the intrinsic interactions and binding mechanisms. Our research holds practical significance in effectively predicting CPIs and binding affinities, identifying key atoms and residues, enhancing model interpretability.
{"title":"GraphsformerCPI: Graph Transformer for Compound-Protein Interaction Prediction.","authors":"Jun Ma, Zhili Zhao, Tongfeng Li, Yunwu Liu, Jun Ma, Ruisheng Zhang","doi":"10.1007/s12539-024-00609-y","DOIUrl":"10.1007/s12539-024-00609-y","url":null,"abstract":"<p><p>Accurately predicting compound-protein interactions (CPI) is a critical task in computer-aided drug design. In recent years, the exponential growth of compound activity and biomedical data has highlighted the need for efficient and interpretable prediction approaches. In this study, we propose GraphsformerCPI, an end-to-end deep learning framework that improves prediction performance and interpretability. GraphsformerCPI treats compounds and proteins as sequences of nodes with spatial structures, and leverages novel structure-enhanced self-attention mechanisms to integrate semantic and graph structural features within molecules for deep molecule representations. To capture the vital association between compound atoms and protein residues, we devise a dual-attention mechanism to effectively extract relational features through .cross-mapping. By extending the powerful learning capabilities of Transformers to spatial structures and extensively utilizing attention mechanisms, our model offers strong interpretability, a significant advantage over most black-box deep learning methods. To evaluate GraphsformerCPI, extensive experiments were conducted on benchmark datasets including human, C. elegans, Davis and KIBA datasets. We explored the impact of model depth and dropout rate on performance and compared our model against state-of-the-art baseline models. Our results demonstrate that GraphsformerCPI outperforms baseline models in classification datasets and achieves competitive performance in regression datasets. Specifically, on the human dataset, GraphsformerCPI achieves an average improvement of 1.6% in AUC, 0.5% in precision, and 5.3% in recall. On the KIBA dataset, the average improvement in Concordance index (CI) and mean squared error (MSE) is 3.3% and 7.2%, respectively. Molecular docking shows that our model provides novel insights into the intrinsic interactions and binding mechanisms. Our research holds practical significance in effectively predicting CPIs and binding affinities, identifying key atoms and residues, enhancing model interpretability.</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"361-377"},"PeriodicalIF":3.9,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140059303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-01Epub Date: 2024-05-11DOI: 10.1007/s12539-024-00619-w
Lihong Peng, Mengnan Ren, Liangliang Huang, Min Chen
Accumulating studies have demonstrated close relationships between long non-coding RNAs (lncRNAs) and diseases. Identification of new lncRNA-disease associations (LDAs) enables us to better understand disease mechanisms and further provides promising insights into cancer targeted therapy and anti-cancer drug design. Here, we present an LDA prediction framework called GEnDDn based on deep learning. GEnDDn mainly comprises two steps: First, features of both lncRNAs and diseases are extracted by combining similarity computation, non-negative matrix factorization, and graph attention auto-encoder, respectively. And each lncRNA-disease pair (LDP) is depicted as a vector based on concatenation operation on the extracted features. Subsequently, unknown LDPs are classified by aggregating dual-net neural architecture and deep neural network. Using six different evaluation metrics, we found that GEnDDn surpassed four competing LDA identification methods (SDLDA, LDNFSGB, IPCARF, LDASR) on the lncRNADisease and MNDR databases under fivefold cross-validation experiments on lncRNAs, diseases, LDPs, and independent lncRNAs and independent diseases, respectively. Ablation experiments further validated the powerful LDA prediction performance of GEnDDn. Furthermore, we utilized GEnDDn to find underlying lncRNAs for lung cancer and breast cancer. The results elucidated that there may be dense linkages between IFNG-AS1 and lung cancer as well as between HIF1A-AS1 and breast cancer. The results require further biomedical experimental verification. GEnDDn is publicly available at https://github.com/plhhnu/GEnDDn.
{"title":"GEnDDn: An lncRNA-Disease Association Identification Framework Based on Dual-Net Neural Architecture and Deep Neural Network.","authors":"Lihong Peng, Mengnan Ren, Liangliang Huang, Min Chen","doi":"10.1007/s12539-024-00619-w","DOIUrl":"10.1007/s12539-024-00619-w","url":null,"abstract":"<p><p>Accumulating studies have demonstrated close relationships between long non-coding RNAs (lncRNAs) and diseases. Identification of new lncRNA-disease associations (LDAs) enables us to better understand disease mechanisms and further provides promising insights into cancer targeted therapy and anti-cancer drug design. Here, we present an LDA prediction framework called GEnDDn based on deep learning. GEnDDn mainly comprises two steps: First, features of both lncRNAs and diseases are extracted by combining similarity computation, non-negative matrix factorization, and graph attention auto-encoder, respectively. And each lncRNA-disease pair (LDP) is depicted as a vector based on concatenation operation on the extracted features. Subsequently, unknown LDPs are classified by aggregating dual-net neural architecture and deep neural network. Using six different evaluation metrics, we found that GEnDDn surpassed four competing LDA identification methods (SDLDA, LDNFSGB, IPCARF, LDASR) on the lncRNADisease and MNDR databases under fivefold cross-validation experiments on lncRNAs, diseases, LDPs, and independent lncRNAs and independent diseases, respectively. Ablation experiments further validated the powerful LDA prediction performance of GEnDDn. Furthermore, we utilized GEnDDn to find underlying lncRNAs for lung cancer and breast cancer. The results elucidated that there may be dense linkages between IFNG-AS1 and lung cancer as well as between HIF1A-AS1 and breast cancer. The results require further biomedical experimental verification. GEnDDn is publicly available at https://github.com/plhhnu/GEnDDn.</p>","PeriodicalId":13670,"journal":{"name":"Interdisciplinary Sciences: Computational Life Sciences","volume":" ","pages":"418-438"},"PeriodicalIF":3.9,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140907796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}