Ionic liquids have unique properties and considerable potential as green solvents. Their toxicity nevertheless remains a concern, which creates a need for accurate predictive models to support safe design and application. This work introduces a general, robust meta-ensemble learning framework that predicts the toxicity of ionic liquids from molecular descriptors and fingerprints. The proposed model combines Random Forest, Support Vector Regression, Categorical Boosting, and a Chemical Convolutional Neural Network as base learners with an Extreme Gradient Boosting meta-learner. The framework uses Recursive Feature Elimination for feature selection and GridSearchCV for hyperparameter tuning. Without data augmentation, the model achieves RMSE = 0.38, MAE = 0.29, a coefficient of determination (R²) of 0.87, and a Pearson correlation of 0.94. Data augmentation further improved performance: RMSE = 0.06, MAE = 0.024, R² = 0.99, and a Pearson correlation of 0.99. These results indicate that the data-augmented model outperforms all existing models in robustness and predictive capacity. The framework therefore offers a valuable tool for computer-aided molecular design of safer and more effective ionic liquids.
Citation: Safa Sadaghiyanfam, Hiqmet Kamberaj, Yalcin Isler. "Enhanced prediction of ionic liquid toxicity using a meta-ensemble learning framework with data augmentation". Artificial Intelligence Chemistry, vol. 3, no. 1, Article 100087 (2025-03-05). DOI: 10.1016/j.aichem.2025.100087
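The four scores quoted in the abstract above (RMSE, MAE, R², Pearson correlation) follow standard definitions. The sketch below shows one way to compute them in plain Python; the function name `regression_metrics` and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R^2 and Pearson r for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)        # residual sum of squares
    rmse = math.sqrt(ss_res / n)                  # root mean squared error
    mae = sum(abs(r) for r in residuals) / n      # mean absolute error

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                    # coefficient of determination

    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    pearson = cov / math.sqrt(ss_tot * var_p)     # Pearson correlation

    return {"rmse": rmse, "mae": mae, "r2": r2, "pearson": pearson}


# Made-up values, purely illustrative (not data from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(y_true, y_pred)
```

Note that R² and the squared Pearson correlation coincide only for a least-squares linear fit evaluated on its own training data, which is why the abstract can report them as separate numbers.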
Pub Date: 2025-03-01 DOI: 10.1016/j.aichem.2025.100085
Runhan Shi, Gufeng Yu, Letian Chen, Yang Yang
Predicting chemical reaction yields is a critical yet challenging task in organic chemistry. While integrating multi-modal information has shown promise, existing methods typically encode the entire reaction in different modalities and then align these embeddings for the same reactions. Such a coarse-grained modal fusion strategy may neglect atomic-level interactions that are crucial for accurate predictions. Recognizing the crucial role of modal fusion in multi-modal learning and the limitations of current methods in real-world scenarios, we propose YieldFCP, a reaction Yield prediction model based on Fine-grained Cross-modal Pre-training. Its cross-modal projector links the molecular SMILES sequence with 3D geometric data, focusing on atomic-level interactions to achieve fine-grained modal fusion and enhance yield prediction. YieldFCP is pre-trained on a large-scale dataset using cross-modal self-supervised learning techniques. Experimental results on high-throughput experiment, real-world electronic laboratory notebook, and real-world organic reaction publication datasets demonstrate the effectiveness of our approach. In particular, YieldFCP outperforms state-of-the-art methods in real-world scenarios and successfully recognizes the key components that determine reaction yields, which gives it valuable interpretability.
Citation: Runhan Shi, Gufeng Yu, Letian Chen, Yang Yang. "YieldFCP: Enhancing Reaction Yield Prediction via Fine-grained Cross-modal Pre-training". Artificial Intelligence Chemistry, vol. 3, no. 1, Article 100085 (2025-03-01). DOI: 10.1016/j.aichem.2025.100085
Green hydrogen, produced through water electrolysis powered by renewable energy, is essential for a sustainable energy future. However, proton exchange membrane (PEM) water electrolyzers face durability issues, particularly corrosion of porous transport layers (PTLs), which limits their widespread commercialization. Protective coatings are used to mitigate PTL corrosion and improve durability. Traditional approaches to predicting the corrosion resistance of coatings rely on extensive experimentation and intricate physical-electrochemical modelling, which costs substantial time and money. This study is the first to apply machine learning (ML) models to predict the corrosion behaviour of PTL coatings with varying alloy compositions for PEM water electrolyzers. Using Nb-Ta coated PTLs with different alloying ratios, coating performance is evaluated through potentiostatic polarization and end-of-life (EOL) tests. The data are split into two datasets: one for predicting corrosion current density and the other for predicting EOL voltage. Extreme gradient boosting (XGB) and artificial neural network (ANN) models are developed, with mean absolute error (MAE) and mean squared error (MSE) used as loss functions. The ANN model with the MSE loss function achieved the best performance, with an R² of 0.993 for corrosion current density. For EOL voltage, the ANN model with a dropout probability of 0.1 and the MSE loss function reached an R² of 0.966, outperforming the XGB models. These findings demonstrate that ML models can accurately predict the anti-corrosion performance of PTL coatings, enabling faster optimization of PTL coating compositions for PEM water electrolyzer applications.
Citation: Pramoth Varsan Madhavan, Leila Moradizadeh, Samaneh Shahgaldi, Xianguo Li. "Data-driven modelling of corrosion behaviour in coated porous transport layers for PEM water electrolyzers". Artificial Intelligence Chemistry, vol. 3, no. 1, Article 100086 (2025-02-25). DOI: 10.1016/j.aichem.2025.100086
Pub Date: 2025-02-05 DOI: 10.1016/j.aichem.2025.100084
Natalia V. Karimova, Ravithree D. Senanayake
Toxoplasmosis, caused by Toxoplasma gondii (T. gondii), is a serious global health concern, particularly in immunocompromised individuals. Inhibiting the enzyme TgDHFR is a promising strategy for developing treatments. This Artificial Intelligence (AI)-driven Quantitative Structure-Activity Relationship (QSAR) study applies deep neural networks (DNNs) to predict pIC50 values for potential inhibitors, using 2D and 3D molecular descriptors and fingerprints. To address training data limitations, we introduced a novel methodology combining targeted descriptor selection, Gaussian noise-based data augmentation, and an ensemble of DNNs. This approach significantly enhanced model performance, increasing the R² from 0.75 with the original dataset to 0.85. The model was further validated using two FDA-approved drugs for T. gondii treatment, pyrimethamine and trimethoprim, yielding relative errors of 3.35% and 2.15% in pIC50 predictions compared to experimental values. Finally, the model was applied to screen FDA-approved drugs after filtering out molecules that did not align with the characteristics of the training dataset. The predicted pIC50 values were then used to calculate ligand efficiency (LE), binding efficiency index (BEI), lipophilic ligand efficiency (LLE), and surface efficiency index (SEI), identifying the most promising TgDHFR inhibitors for further investigation. By leveraging AI and a data augmentation approach, this study provides a powerful tool for pIC50 prediction of TgDHFR inhibitors that can be adapted to other systems.
Citation: Natalia V. Karimova, Ravithree D. Senanayake. "AI-driven prediction of drug activity against Toxoplasma gondii: Data augmentation and deep neural networks for limited datasets". Artificial Intelligence Chemistry, vol. 3, no. 1, Article 100084 (2025-02-05). DOI: 10.1016/j.aichem.2025.100084
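Gaussian noise-based data augmentation, as named in the abstract above, can be illustrated with a minimal sketch. The abstract does not say how the noise is scaled, how many copies are generated, or whether the pIC50 targets are perturbed, so the choices below (noise on descriptors only, a fixed `sigma`, and the function name `augment_with_gaussian_noise`) are assumptions made for illustration.

```python
import random


def augment_with_gaussian_noise(features, targets, copies=2, sigma=0.01, seed=0):
    """Append `copies` noisy replicas of every sample to the original data.

    Assumption: zero-mean Gaussian noise is added to each descriptor value,
    while the target value is copied unchanged.
    """
    rng = random.Random(seed)                 # seeded for reproducibility
    aug_x = [list(row) for row in features]   # keep the original samples first
    aug_y = list(targets)
    for _ in range(copies):
        for row, y in zip(features, targets):
            aug_x.append([v + rng.gauss(0.0, sigma) for v in row])
            aug_y.append(y)
    return aug_x, aug_y


# Hypothetical descriptor rows and pIC50 values, not data from the paper.
X = [[0.12, 1.50, 3.00], [0.40, 1.10, 2.70]]
y = [6.2, 7.1]
X_aug, y_aug = augment_with_gaussian_noise(X, y, copies=2, sigma=0.01)
```

In practice the descriptors would usually be standardized first so that one `sigma` is meaningful across columns, and augmentation would be applied only to the training split so that noisy copies of a molecule do not leak into the test set.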
Pub Date: 2024-12-07 DOI: 10.1016/j.aichem.2024.100083
Jiaqi Chen, Junqing Li, Ziyi Liu, Shitao Sun, Shijia Zhou, Dongqi Wang
This work addresses the proper application of machine learning to the screening of direct propane dehydrogenation (PDH) and oxidative dehydrogenation (ODH) of propane, the two main routes for converting propane to propylene, both of which are characterized by limited experimental data. Current studies mainly adopt a trial-and-error strategy, which is time-consuming and raises environmental and health concerns owing to the release of chemical waste. This motivates a data-driven research paradigm to compensate for the shortcomings of trial and error; such a paradigm, however, relies on large quantities of high-quality data. In this work, a dataset covering PDH and ODH was constructed, and the performance of machine learning algorithms in the study of light alkane activation was evaluated. On this basis, a strategy suited to small datasets is proposed: for small, unbalanced datasets, it is sensible to train the model on the dataset as a whole rather than to fuse multiple specialized models trained on smaller subsets. The results show that models trained with ensemble algorithms gave the best predictions of propylene selectivity: CatBoost and random forest for PDH, and LightGBM for ODH. Based on the optimal model, the key influencing factors in PDH and ODH were identified. This study demonstrates the proper use of a data-driven strategy in catalysis science; the strategy can be adopted for other scientific problems that suffer from limited high-quality data and can contribute to new understanding, e.g. the rational design and optimization of catalytic systems.
Citation: Jiaqi Chen, Junqing Li, Ziyi Liu, Shitao Sun, Shijia Zhou, Dongqi Wang. "Small-dataset-orientated data-driven screening for catalytic propane activation". Artificial Intelligence Chemistry, vol. 3, no. 1, Article 100083 (2024-12-07). DOI: 10.1016/j.aichem.2024.100083
Pub Date: 2024-12-04 DOI: 10.1016/j.aichem.2024.100082
Jie Sun, Zi-Hao Li, Yi-Fei Yang, Shu-Yu Zhang
Privileged structures such as quinoline have diverse biological activities, and their synthetic versatility makes them crucial for drug design. In traditional synthesis, the C-H functionalization of quinoline can be achieved effectively under a range of conditions, especially with transition metal catalysis. Machine learning (ML) techniques enable rapid prediction of C-H functionalization, facilitating drug design and synthesis. In this study, a generalizable approach to predicting site selectivity is developed using an artificial neural network (ANN) suited to quinoline derivatives. With an 80/10/10 training/validation/testing split of 2467 compounds, the model takes SMILES strings as input and uses six quantum chemical descriptors to identify the reactive site(s) of a compound. On the external validation set, 86.5% of all molecules were correctly predicted. This model allows chemists to rapidly predict which site is more likely to undergo electrophilic substitution.
Citation: Jie Sun, Zi-Hao Li, Yi-Fei Yang, Shu-Yu Zhang. "Machine learning for active sites prediction of quinoline derivatives". Artificial Intelligence Chemistry, vol. 3, no. 1, Article 100082 (2024-12-04). DOI: 10.1016/j.aichem.2024.100082
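The 80/10/10 training/validation/testing split of 2467 compounds mentioned in the abstract above can be sketched as follows. The shuffling, the seed, the rounding rule, and the function name `split_80_10_10` are assumptions; the abstract does not describe how the authors partitioned the data.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle `items` and split them into train/validation/test partitions.

    Assumption: fractional sizes are truncated, and any remainder goes to
    the test partition.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)     # seeded for reproducibility
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Integer indices stand in for the 2467 compounds (SMILES strings in the paper).
train, val, test = split_80_10_10(range(2467))
```

With this rounding rule, 2467 compounds give 1973 training, 246 validation, and 248 test samples.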
Pub Date: 2024-11-07 DOI: 10.1016/j.aichem.2024.100080
Abhishek Ojha, Satya S. Bulusu, Arup Banerjee
The polarizability of a molecule describes its response to an external electric field: it quantifies the ability of the system to form an induced dipole moment when subjected to that field. In this work, we investigated the isotropic polarizability and the polarizability anisotropy of gold nanoclusters using various machine learning algorithms. We used high-order invariant descriptors based on spherical harmonics, combined with machine learning models such as artificial neural networks (ANNs), Gaussian process regression, and kernel ridge regression. Our results demonstrate the efficacy of machine learning in accurately predicting the polarizability of gold nanoclusters. We find that the ANN-based model performs better than the other models.
Citation: Abhishek Ojha, Satya S. Bulusu, Arup Banerjee. "Machine learning approaches for modelling of molecular polarizability in gold nanoclusters". Artificial Intelligence Chemistry, vol. 2, no. 2, Article 100080 (2024-11-07). DOI: 10.1016/j.aichem.2024.100080
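For reference, the two quantities named in the abstract above are commonly defined from the polarizability tensor as shown below. These are the widely used textbook conventions; the abstract does not state which convention the authors adopted, so the exact form of the anisotropy in particular should be read as an assumption.

```latex
% Induced dipole moment in a weak external field E (linear response)
\mu_i^{\text{ind}} = \sum_{j \in \{x,y,z\}} \alpha_{ij} E_j

% Isotropic (mean) polarizability: one third of the trace of the tensor
\alpha_{\text{iso}} = \tfrac{1}{3}\left(\alpha_{xx} + \alpha_{yy} + \alpha_{zz}\right)

% Polarizability anisotropy (one common convention)
\Delta\alpha = \sqrt{\tfrac{1}{2}\left[(\alpha_{xx}-\alpha_{yy})^2
  + (\alpha_{yy}-\alpha_{zz})^2 + (\alpha_{zz}-\alpha_{xx})^2\right]
  + 3\left(\alpha_{xy}^2 + \alpha_{yz}^2 + \alpha_{xz}^2\right)}
```

Both quantities are invariant under rotation of the cluster, which is consistent with the use of rotation-invariant descriptors as model inputs.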
Pub Date: 2024-10-17 DOI: 10.1016/j.aichem.2024.100078
Sophia Li, Emma Wang, Leia Pei, Sourodeep Deb, Prashanth Prabhala, Sai Hruday Reddy Nara, Raina Panda, Shiven Eltepu, Marx Akl, Larry McMahan, Edward Njoo
The introduction of fluorine into compounds plays a crucial role in drug development because it strongly influences their final pharmacokinetic and dynamic properties. Given the prevalence of fluorine in FDA-approved drugs in recent years, identifying the mechanisms that drive the chemical transformations of fluorinated compounds has become crucial in drug discovery. ¹⁹F NMR spectroscopy is a powerful analytical technique for examining fluorine-containing compounds, offering valuable information about their structure, dynamics, and reactivity. NMR spectra can be interpreted with the help of Density Functional Theory (DFT). However, the computational cost of DFT limits the screening of compounds and the discovery of feasible drug candidates. Here, we present a machine learning approach to accelerate the prediction of DFT-calculated ¹⁹F NMR chemical shifts. The features of each fluorine atom were derived from its local three-dimensional environment, representing the neighboring atoms within a radius of n Å of that fluorine atom in the compound. A comparative analysis of thirteen regression models was conducted using features extracted from 501 fluorinated compounds in our laboratory's chemical inventory. Among the models, Gradient Boosting Regression (GBR) performed best, achieving a mean absolute error of 3.31 ppm with a local environment radius of 3 Å. This accuracy is comparable to that of DFT calculations, while the computational time drops from several hundred seconds to milliseconds. A radius of 3 Å was also found to be optimal across all models when encoding features for local atomic environments.
Citation: Sophia Li, Emma Wang, Leia Pei, Sourodeep Deb, Prashanth Prabhala, Sai Hruday Reddy Nara, Raina Panda, Shiven Eltepu, Marx Akl, Larry McMahan, Edward Njoo. "Evaluation of machine learning models for the accelerated prediction of density functional theory calculated ¹⁹F chemical shifts based on local atomic environments". Artificial Intelligence Chemistry, vol. 2, no. 2, Article 100078 (2024-10-17). DOI: 10.1016/j.aichem.2024.100078
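The idea of a local atomic environment, meaning the atoms within a radius of n Å of a given fluorine atom, can be sketched in a few lines. The data layout, the function name `local_environment`, and the coordinates are hypothetical; the abstract does not describe how the authors encode the neighbours into model features.

```python
import math


def local_environment(atoms, center_index, radius):
    """Return (element, distance) pairs for atoms within `radius` of the center.

    `atoms` is a list of (element_symbol, (x, y, z)) tuples with coordinates
    in angstroms. The center atom itself is excluded, and the neighbours are
    sorted by increasing distance.
    """
    _, center = atoms[center_index]
    neighbours = []
    for i, (element, position) in enumerate(atoms):
        if i == center_index:
            continue
        distance = math.dist(center, position)   # Euclidean distance
        if distance <= radius:
            neighbours.append((element, distance))
    return sorted(neighbours, key=lambda pair: pair[1])


# Hypothetical coordinates in angstroms, not taken from the paper's dataset.
atoms = [
    ("F", (0.0, 0.0, 0.0)),
    ("C", (1.35, 0.0, 0.0)),
    ("C", (2.1, 1.2, 0.0)),
    ("H", (5.0, 0.0, 0.0)),
]
env = local_environment(atoms, center_index=0, radius=3.0)
```

Here the two carbon atoms fall inside the 3 Å sphere and the distant hydrogen does not. A larger radius captures more of the molecule but also yields longer feature vectors, a trade-off that fits the abstract's finding of an optimal radius.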
Pub Date: 2024-10-16 DOI: 10.1016/j.aichem.2024.100079
Vaneet Saini, Ranjeet Kumar
The Hammett constants, σm and σp, reflect the electron-withdrawing and electron-donating abilities of substituents on aromatic compounds and have been used successfully in various structure-activity relationship studies. However, determining these constants experimentally is both resource-intensive and time-consuming. In this study, we explore the use of graph neural networks (GNNs) to predict Hammett constants from graph-based features. This approach aims to provide rapid and efficient predictions of σm and σp values, eliminating the need for extensive computational and experimental setups. By leveraging GNNs, we hope to streamline the process of obtaining these critical parameters, thereby facilitating more efficient reaction design and broadening the applicability of linear free energy relationship studies in chemical research.
Title: Leveraging graph neural networks to predict Hammett’s constants for benzoic acid derivatives
Journal: Artificial intelligence chemistry, vol. 2, issue 2, Article 100079. Published 2024-10-16.
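For readers unfamiliar with the quantities being predicted, the standard textbook definition of the Hammett constants can be written as below. The abstract itself does not state the equation, so this is background rather than the authors' formulation; the symbols K_X, K_H, and ρ are the conventional ones and do not appear in the abstract.

```latex
% Hammett equation: substituent effect on an equilibrium constant
\log_{10}\frac{K_X}{K_H} = \rho\,\sigma_X

% Reference reaction (ionization of benzoic acids in water at 25 °C),
% for which \rho = 1 by definition:
\sigma_X = \mathrm{p}K_a(\mathrm{H}) - \mathrm{p}K_a(\mathrm{X})
```

Here K_X and K_H are the equilibrium constants for the substituted and unsubstituted compound, ρ is the reaction constant, and σ_X stands for σm or σp depending on whether the substituent X sits meta or para to the reacting group. A positive σ_X corresponds to an electron-withdrawing substituent (a more acidic benzoic acid), a negative one to an electron-donating substituent.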
Pub Date : 2024-08-31 DOI: 10.1016/j.aichem.2024.100077
Kenneth López-Pérez , Juan F. Avellaneda-Tamayo , Lexin Chen , Edgar López-López , K. Eurídice Juárez-Mercado , José L. Medina-Franco , Ramón Alain Miranda-Quintana
Molecular similarity pervades much of our understanding and rationalization of chemistry. This has become particularly evident in the current data-intensive era of chemical research, with similarity measures serving as the backbone of many supervised and unsupervised Machine Learning (ML) procedures. Here, we present a discussion on the role of molecular similarity in drug design, chemical space exploration, chemical “art” generation, molecular representations, and more. We also discuss more recent topics in molecular similarity, like the ability to efficiently compare large molecular libraries.
Title: Molecular similarity: Theory, applications, and perspectives
Journal: Artificial intelligence chemistry, vol. 2, issue 2, Article 100077. Published 2024-08-31.
Open access PDF: https://www.sciencedirect.com/science/article/pii/S2949747724000356/pdfft?md5=7238a1972b367d1732b52f425b046ba9&pid=1-s2.0-S2949747724000356-main.pdf
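To make the notion of a "similarity measure" in the abstract above concrete, the sketch below computes the Tanimoto (Jaccard) coefficient between two binary fingerprints. The abstract names no specific measure, so Tanimoto is chosen here only because it is widely used. The function name `tanimoto`, the representation of a fingerprint as a set of "on" bit positions, and the example bit sets are assumptions for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient between two fingerprints, each given
    as an iterable of 'on' bit positions: |A ∩ B| / |A ∪ B|.
    """
    a, b = set(bits_a), set(bits_b)
    union_size = len(a | b)
    if union_size == 0:
        # Both fingerprints are empty; returning 0.0 is a convention
        # assumed for this sketch, not a universal rule.
        return 0.0
    return len(a & b) / union_size

# Hypothetical fingerprints (sets of on-bit indices), not real molecules.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8, 9, 12}

similarity = tanimoto(fp_a, fp_b)  # 3 shared bits out of 6 distinct bits
print(f"Tanimoto similarity: {similarity:.2f}")
```

Comparing every pair this way scales quadratically with library size, which is the cost that motivates the efficient large-library comparisons mentioned in the abstract.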