Drug-Protein Interactions Prediction Models Using Feature Selection and Classification Techniques

IF 1.8 · Zone 4 (Medicine) · Q4 Biochemistry & Molecular Biology · Current Drug Metabolism · Pub Date: 2024-01-05 · DOI: 10.2174/0113892002268739231211063718

T. Idhaya, A. Suruliandi, S. P. Raja


Citations: 0

Abstract

Background: Identifying Drug-Protein Interactions (DPI) is crucial in drug discovery. The high dimensionality of drug and protein features makes accurate interaction prediction difficult and necessitates computational techniques. Docking-based methods rely on 3D structures, while ligand-based methods depend on known ligands and neglect protein structure. The preferred approach is therefore chemogenomics-based machine learning, which considers both drug and protein characteristics for DPI prediction.

Methods: Feature selection plays a vital role in machine learning: it improves model performance, reduces overfitting, enhances interpretability, and makes learning more efficient. It helps extract meaningful patterns from drug and protein data while eliminating irrelevant or redundant information, resulting in more effective models. Classification, in turn, enables pattern recognition, decision-making, predictive modeling, anomaly detection, data exploration, and automation; it allows accurate predictions and supports efficient decision-making in DPI prediction. For this work, protein data was sourced from the KEGG database and drug data from the DrugBank database.

Results: To address the imbalance in Drug Protein Pairs (DPP), balancing techniques including Random Over Sampling (ROS), Synthetic Minority Over-sampling Technique (SMOTE), and Adaptive SMOTE were employed. Given the large number of features associated with drugs and proteins, feature selection becomes necessary. Four feature selection methods were evaluated: Correlation, Information Gain (IG), Chi-Square (CS), and Relief. Multiple classifiers, including Support Vector Machines (SVM), Random Forest (RF), AdaBoost, and Logistic Regression (LR), were used to predict DPI. Finally, this research identifies the best balancing, feature selection, and classification methods for accurate DPI prediction.

Conclusion: This comprehensive approach aims to overcome the limitations of existing methods and provide more reliable and efficient predictions in drug-protein interaction studies.
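The first two stages named in the abstract, class balancing and feature selection, can be illustrated with a minimal sketch. It is not the authors' implementation: the toy binary feature vectors, the labels, and the helper names (`random_oversample`, `information_gain`, `top_k_features`) are all assumptions made for illustration, and only the Python standard library is used. It shows Random Over Sampling followed by an Information Gain ranking of discrete features.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS: duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_bal.append(row)
            y_bal.append(label)
    return X_bal, y_bal


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, y):
    """IG(Y; F) = H(Y) - sum_v P(F = v) * H(Y | F = v) for a discrete feature F."""
    n = len(y)
    groups = {}
    for value, label in zip(column, y):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(y) - conditional


def top_k_features(X, y, k):
    """Score every column by information gain and return the k best column indices."""
    scores = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Hypothetical drug-protein pair vectors (three binary features per pair).
# Label 1 = interacting pair (minority class), 0 = non-interacting pair.
X = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 0, 0],
     [0, 0, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 0, 0, 0, 0]

X_bal, y_bal = random_oversample(X, y)
selected, scores = top_k_features(X_bal, y_bal, k=1)
print(Counter(y_bal), selected, scores)
```

In this toy data, feature 0 coincides with the label and feature 1 is constant, so the ranking places feature 0 first. A classifier such as those listed in the abstract would then be trained on the selected columns only. In practice, oversampling is normally applied to the training split alone, so that duplicated minority samples do not leak into the evaluation data.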




Source journal: Current Drug Metabolism (Medicine: Biochemistry & Molecular Biology)

CiteScore: 4.30

Self-citation rate: 4.30%

Articles published:

Review time: 4-8 weeks

About the journal: Current Drug Metabolism aims to cover all the latest and outstanding developments in drug metabolism, pharmacokinetics, and drug disposition. The journal serves as an international forum for the publication of full-length and mini reviews, research articles, and guest-edited issues in drug metabolism. It is an essential journal for academic, clinical, government, and pharmaceutical scientists who wish to stay informed about the most important developments. The journal covers the following general topic areas: pharmaceutics, pharmacokinetics, toxicology, and, most importantly, drug metabolism. More specifically, it covers in vitro and in vivo drug metabolism of phase I and phase II enzymes or metabolic pathways; drug-drug interactions and enzyme kinetics; pharmacokinetics, pharmacokinetic-pharmacodynamic modeling, and toxicokinetics; interspecies differences in metabolism or pharmacokinetics, species scaling, and extrapolations; drug transporters; target organ toxicity and interindividual variability in drug exposure-response; extrahepatic metabolism; and bioactivation, reactive metabolites, and developments for the identification of drug metabolites. It also covers preclinical and clinical reviews describing the drug metabolism and pharmacokinetics of marketed drugs or drug classes.