Application of machine learning models to predict cytotoxicity of ionic liquids using VolSurf principal properties

Computational Toxicology (IF 3.1, JCR Q2, Toxicology) | Pub Date: 2023-05-01 | DOI: 10.1016/j.comtox.2023.100266

Grace Amabel Tabaaza, Bennet Nii Tackie-Otoo, Dzulkarnain B. Zaini, Daniel Asante Otchere, Bhajan Lal


Citations: 0



Application of machine learning models to predict cytotoxicity of ionic liquids using VolSurf principal properties

Ionic liquids (ILs) are considered greener alternatives to traditional organic solvents because of their unique physical and chemical properties. Nevertheless, recent studies have shown that ILs can induce toxic effects in ecosystems. It is therefore essential to determine the level of risk they pose to aquatic life before these ILs can be used successfully. Measuring the toxicity of many ILs experimentally across a broad range of conditions is demanding in time and resources and is at times impractical. Various quantitative structure-activity/property relationship (QSAR/QSPR) studies have addressed the prediction of IL toxicity expressed as EC50. In this study, five supervised machine learning models were trained and tested using nine VolSurf Principal Properties (PPs) as descriptors to predict cytotoxicity toward the rat leukemia cell line IPC-81. Eight feature selection techniques were then used to preprocess the data and improve the performance of the best of the preliminary trained models. On the out-of-sample data set, the Extreme Gradient Boosting (XGBoost) model performed best, with the highest test score (R² = 0.79). It was also the most parsimonious (minimum AIC of 46.50), consistent (minimum RMSE of 0.45), and precise (minimum MAE of 0.32) model in predicting IPC-81 cytotoxicity. The feature importance attribute of XGBoost confirmed that structural features of the IL cation, such as cationic hydrophilicity and side chain length, have a significant impact on toxicity. The anionic part of the IL is nevertheless also important to toxicity and needs to be considered in toxicity prediction. Among the feature selection techniques tested, the random forest technique improved model performance the most (i.e., it gave the lowest error metrics: AIC = 41.22, MAE = 0.31, and RMSE = 0.4259), but at a longer execution time. The wrapper methods, however, were the most robust in improving computational efficiency (i.e., they improved model performance at the shortest execution time). This study therefore improves QSPR studies on toxicity prediction of new ILs through the application of machine learning and feature selection techniques.
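The abstract ranks models by R², RMSE, MAE, and AIC. As a rough illustration of how these four scores relate to a set of predictions, here is a minimal Python sketch. The function name `regression_metrics`, the toy values, and the parameter count are hypothetical. The AIC form used, n·ln(RSS/n) + 2k, is the common Gaussian-error variant and is an assumption, since the abstract does not state which AIC definition the authors applied.

```python
import math


def regression_metrics(y_true, y_pred, n_params):
    """Return R^2, RMSE, MAE and an AIC estimate for a regression fit.

    n_params is the number of fitted parameters (k) charged by the AIC penalty.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    residuals = [t - p for t, p in zip(y_true, y_pred)]
    rss = sum(r * r for r in residuals)  # residual sum of squares
    mean_true = sum(y_true) / n
    tss = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if tss == 0 or rss == 0:
        raise ValueError("R^2 or AIC is undefined for constant targets or a perfect fit")

    return {
        "r2": 1.0 - rss / tss,
        "rmse": math.sqrt(rss / n),
        "mae": sum(abs(r) for r in residuals) / n,
        # Assumed AIC form (Gaussian errors): n * ln(RSS / n) + 2k.
        "aic": n * math.log(rss / n) + 2 * n_params,
    }


# Toy values for illustration only; these are not data from the study.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
scores = regression_metrics(toy_true, toy_pred, n_params=2)
print(scores)
```

Lower RMSE, MAE, and AIC indicate a better model, while a higher R² does. Because AIC adds a penalty of 2 per parameter, it can favor a model with fewer descriptors even when its raw error is slightly higher, which is one reason it is useful for comparing feature selection results.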


Source journal

Computational Toxicology (Computer Science: Computer Science Applications)

CiteScore: 5.50

Self-citation rate: 0.00%

Articles published:

Review time: 56 days

Journal description: Computational Toxicology is an international journal publishing computational approaches that assist in the toxicological evaluation and safety assessment of new and existing chemical substances.
- All effects relating to human health and environmental toxicity and fate
- Prediction of toxicity, metabolism, fate and physico-chemical properties
- The development of models from read-across, (Q)SARs, PBPK, QIVIVE, Multi-Scale Models
- Big Data in toxicology: integration, management, analysis
- Implementation of models through AOPs, IATA, TTC
- Regulatory acceptance of models: evaluation, verification and validation
- From metals, to small organic molecules, to nanoparticles
- Pharmaceuticals, pesticides, foods, cosmetics, fine chemicals
- Bringing together the views of industry, regulators, academia, NGOs