Artificial intelligence chemistry: latest articles

Enhanced prediction of ionic liquid toxicity using a meta-ensemble learning framework with data augmentation
Pub Date : 2025-03-05 DOI: 10.1016/j.aichem.2025.100087
Safa Sadaghiyanfam, Hiqmet Kamberaj, Yalcin Isler
Ionic liquids have unique properties and potential as green solvents. However, concerns about their toxicity remain, which calls for reliable predictive models to support safe design and application. This work introduces a general, robust meta-ensemble learning framework for predicting the toxicity of ionic liquids from molecular descriptors and fingerprints. The proposed model uses Random Forest, Support Vector Regression, Categorical Boosting, and a Chemical Convolutional Neural Network as base learners and Extreme Gradient Boosting as the meta-learner. The framework uses Recursive Feature Elimination for feature selection and GridSearchCV for hyperparameter tuning. Without data augmentation, the model achieves an RMSE of 0.38, an MAE of 0.29, a coefficient of determination (R²) of 0.87, and a Pearson correlation of 0.94. With data augmentation, performance improves to RMSE = 0.06, MAE = 0.024, R² = 0.99, and a Pearson correlation of 0.99. These results indicate that the data-augmented model outperforms all existing models in robustness and predictive capacity. Thus, the present framework provides a superior tool for computer-aided molecular design of safer and more effective ionic liquids.
Journal: Artificial intelligence chemistry, Volume 3, Issue 1, Article 100087
Citations: 0
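The four figures of merit quoted in the ionic-liquid toxicity abstract above (RMSE, MAE, R², and Pearson correlation) can be made concrete with a short sketch. This is a minimal illustration using the standard textbook definitions; the function name `regression_metrics` and the toy values are assumptions made here and do not come from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R^2 and Pearson r for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                     # coefficient of determination

    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    pearson = cov / math.sqrt(ss_tot * var_p)      # linear correlation

    return {"rmse": rmse, "mae": mae, "r2": r2, "pearson": pearson}


# Toy values invented for illustration only (not data from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Note that R² and the Pearson correlation measure different things: R² penalizes any systematic offset between predictions and targets, whereas Pearson r only measures linear association, which is one reason an abstract may report both.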
YieldFCP: Enhancing Reaction Yield Prediction via Fine-grained Cross-modal Pre-training
Pub Date : 2025-03-01 DOI: 10.1016/j.aichem.2025.100085
Runhan Shi, Gufeng Yu, Letian Chen, Yang Yang
Predicting chemical reaction yields is a critical yet challenging task in organic chemistry. While integrating multi-modal information has shown promise, existing methods typically encode the entire reaction in different modalities and then align these embeddings for the same reactions. Such a coarse-grained modal fusion strategy may neglect atomic-level interactions crucial for accurate predictions. Recognizing the crucial role of modal fusion in multi-modal learning and the limitations of current methods in real-world scenarios, we propose YieldFCP, a reaction Yield prediction model based on Fine-grained Cross-modal Pre-training. Its cross-modal projector links the molecular SMILES sequence with 3D geometric data, focusing on the atomic-level interactions to achieve fine-grained modal fusion and enhance yield prediction. YieldFCP is pre-trained on a large-scale dataset leveraging cross-modal self-supervised learning techniques. Experimental results on high-throughput experimentation, real-world electronic laboratory notebook, and real-world organic reaction publication datasets demonstrate the effectiveness of our approach. In particular, YieldFCP outperforms the state-of-the-art methods in real-world scenarios and successfully recognizes key components that determine reaction yields with valuable interpretability.
Journal: Artificial intelligence chemistry, Volume 3, Issue 1, Article 100085
Citations: 0
Data-driven modelling of corrosion behaviour in coated porous transport layers for PEM water electrolyzers
Pub Date : 2025-02-25 DOI: 10.1016/j.aichem.2025.100086
Pramoth Varsan Madhavan, Leila Moradizadeh, Samaneh Shahgaldi, Xianguo Li
Green hydrogen, produced through water electrolysis powered by renewable energy, is essential for a sustainable energy future. However, proton exchange membrane (PEM) water electrolyzers face durability issues, particularly corrosion of porous transport layers (PTLs), which limits their widespread commercialization. Protective coatings are used to mitigate PTL corrosion and improve durability. Traditional approaches to predicting coating performance in terms of corrosion resistance rely on extensive experimentation and intricate physical-electrochemical modelling, resulting in substantial time and cost. This study is the first to apply machine learning (ML) models to predict the corrosion behaviour of PTL coatings with varying alloy compositions for PEM water electrolyzers. Using Nb-Ta coated PTLs with different alloying ratios, coating performance is evaluated through potentiostatic polarization and end-of-life (EOL) tests. The data are split into two datasets: one for predicting corrosion current density and the other for predicting EOL voltage. Extreme gradient boosting (XGB) and artificial neural network (ANN) models are developed. To assess the models, mean absolute error (MAE) and mean squared error (MSE) are used as loss functions. The ANN model with the MSE loss function achieved the best performance, with an R² of 0.993 for corrosion current density. Additionally, the ANN model with a 0.1 dropout probability and MSE loss function resulted in an R² of 0.966 for EOL voltage predictions, outperforming the XGB models. These findings demonstrate the ability of ML models to accurately predict the anti-corrosion performance of PTL coatings, facilitating a faster approach to optimizing PTL coating compositions for PEM water electrolyzer applications.
Journal: Artificial intelligence chemistry, Volume 3, Issue 1, Article 100086
Citations: 0
AI-driven prediction of drug activity against Toxoplasma gondii: Data augmentation and deep neural networks for limited datasets
Pub Date : 2025-02-05 DOI: 10.1016/j.aichem.2025.100084
Natalia V. Karimova, Ravithree D. Senanayake
Toxoplasmosis, caused by Toxoplasma gondii (T. gondii), is a serious global health concern, particularly in immunocompromised individuals. Inhibiting the enzyme TgDHFR is a promising strategy for developing treatments. This Artificial Intelligence (AI)-driven Quantitative Structure-Activity Relationship (QSAR) study applies deep neural networks (DNNs) to predict pIC50 values for potential inhibitors, using 2D and 3D molecular descriptors and fingerprints. To address training data limitations, we introduced a novel methodology combining targeted descriptor selection, Gaussian noise-based data augmentation, and an ensemble of DNNs. This approach significantly enhanced model performance, increasing the R² from 0.75 with the original dataset to 0.85. The model was further validated using two FDA-approved drugs for T. gondii treatment, pyrimethamine and trimethoprim, yielding relative errors of 3.35% and 2.15% in pIC50 predictions compared to experimental values. Finally, the model was applied to screen FDA-approved drugs after filtering out molecules that did not align with the characteristics of the training dataset. The predicted pIC50 values were further used to calculate ligand efficiency (LE), binding efficiency index (BEI), lipophilic ligand efficiency (LLE), and surface efficiency index (SEI), identifying the most promising TgDHFR inhibitors for further investigation. By leveraging AI and a data augmentation approach, this study provides a powerful tool for pIC50 predictions of TgDHFR inhibitors, which can be adapted to other systems.
Journal: Artificial intelligence chemistry, Volume 3, Issue 1, Article 100084
Citations: 0
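The Toxoplasma gondii abstract above derives four efficiency indices (LE, BEI, LLE, SEI) from predicted pIC50 values. The sketch below uses the commonly quoted definitions of these indices; whether the paper applies exactly these forms and scaling factors is an assumption, and the function name `efficiency_indices` and the compound values are hypothetical.

```python
def efficiency_indices(pic50, heavy_atoms, mol_weight, logp, polar_surface_area):
    """Compute common ligand-efficiency style indices from a pIC50 value.

    Assumed conventions (not confirmed by the abstract):
      LE  = 1.37 * pIC50 / heavy-atom count   (approx. kcal/mol per heavy atom)
      BEI = pIC50 / (molecular weight in kDa)
      LLE = pIC50 - logP
      SEI = pIC50 / (polar surface area / 100 A^2)
    """
    if heavy_atoms <= 0 or mol_weight <= 0 or polar_surface_area <= 0:
        raise ValueError("size-related inputs must be positive")
    return {
        "LE": 1.37 * pic50 / heavy_atoms,
        "BEI": pic50 / (mol_weight / 1000.0),
        "LLE": pic50 - logp,
        "SEI": pic50 / (polar_surface_area / 100.0),
    }


# Hypothetical compound, invented for illustration only.
indices = efficiency_indices(
    pic50=7.0,
    heavy_atoms=20,
    mol_weight=300.0,          # g/mol
    logp=2.5,
    polar_surface_area=75.0,   # square angstroms
)
print(indices)
```

Normalizing potency by size, lipophilicity, or polarity in this way lets compounds of very different molecular weight be ranked on a comparable footing, which is presumably why the indices are used for the final shortlist.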
Small-dataset-orientated data-driven screening for catalytic propane activation
Pub Date : 2024-12-07 DOI: 10.1016/j.aichem.2024.100083
Jiaqi Chen, Junqing Li, Ziyi Liu, Shitao Sun, Shijia Zhou, Dongqi Wang
This work addresses the proper application of machine learning to screening for direct propane dehydrogenation (PDH) and oxidative dehydrogenation (ODH) of propane, the two main routes for converting propane to propylene, both of which are characterized by limited experimental data. Current studies mainly adopt a trial-and-error strategy, which is time-consuming and raises environmental and health concerns owing to the release of chemical waste. This motivates a data-driven research paradigm to offset the shortcomings of trial and error, although such a paradigm relies on large quantities of high-quality data. In this work, a dataset covering both PDH and ODH was constructed, and the performance of machine learning algorithms in the study of light alkane activation was evaluated. On this basis, a strategy suited to small datasets is proposed: for small, unbalanced datasets, it is more sensible to train a model on the dataset as a whole than to fuse multiple specialized models trained on smaller subsets. The results show that models trained with ensemble algorithms gave the best predictions of propylene selectivity: CatBoost and random forest for PDH, and LightGBM for ODH. Based on the optimal model, the key influencing factors in PDH and ODH were identified. This study demonstrates the proper use of a data-driven strategy in catalysis science; the strategy can be adopted for other scientific problems that suffer from limited high-quality data and can contribute to new understanding, e.g. the rational design and optimization of catalytic systems.
Journal: Artificial intelligence chemistry, Volume 3, Issue 1, Article 100083
Citations: 0
Machine learning for active sites prediction of quinoline derivatives
Pub Date : 2024-12-04 DOI: 10.1016/j.aichem.2024.100082
Jie Sun, Zi-Hao Li, Yi-Fei Yang, Shu-Yu Zhang
Privileged structures, like quinoline, have diverse biological activities, and their synthetic versatility makes them crucial for drug design. In traditional synthesis methods, the C-H functionalization of quinoline can be effectively achieved under different conditions, especially transition metal catalysis. Machine learning (ML) techniques enable rapid prediction of C-H functionalization, facilitating drug design and synthesis. In this study, a generalizable approach to predicting site selectivity is developed using an artificial neural network (ANN) and is suitable for site prediction in quinoline derivatives. Using an 80/10/10 training/validation/testing split of 2467 compounds, the model takes SMILES strings as input and uses six quantum chemical descriptors to identify the reactive site(s) of a compound. On the external validation set, 86.5% of all molecules were correctly predicted. This model allows chemists to rapidly predict which site is more likely to undergo electrophilic substitution.
Journal: Artificial intelligence chemistry, Volume 3, Issue 1, Article 100082
Citations: 0
Machine learning approaches for modelling of molecular polarizability in gold nanoclusters
Pub Date : 2024-11-07 DOI: 10.1016/j.aichem.2024.100080
Abhishek Ojha, Satya S. Bulusu, Arup Banerjee
The polarizability of molecules describes their response to an external electric field. It quantifies the ability of a system to form an induced dipole moment when subjected to an electric field. In this work, we investigated the isotropic polarizability and the polarizability anisotropy of gold nanoclusters using various machine-learning algorithms. We utilized high-order invariant descriptors based on spherical harmonics, integrated with machine-learning models such as artificial neural networks, Gaussian process regression, and kernel ridge regression. Our results demonstrate the efficacy of machine learning in accurately predicting the polarizability of gold nanoclusters. We find that the ANN-based model performs better than the other models.
Journal: Artificial intelligence chemistry, Volume 2, Issue 2, Article 100080
Citations: 0
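The two target quantities named in the gold-nanocluster abstract above, isotropic polarizability and polarizability anisotropy, are both scalar reductions of the 3×3 polarizability tensor. The sketch below uses one widely used convention for each; the abstract does not state which anisotropy definition the authors adopt, so the formula, the function names, and the example tensor are assumptions.

```python
import math


def isotropic_polarizability(alpha):
    """Mean of the diagonal elements (one third of the trace) of a 3x3 tensor."""
    return (alpha[0][0] + alpha[1][1] + alpha[2][2]) / 3.0


def polarizability_anisotropy(alpha):
    """Anisotropy in a common rotationally invariant form (assumed convention)."""
    xx, yy, zz = alpha[0][0], alpha[1][1], alpha[2][2]
    xy, yz, zx = alpha[0][1], alpha[1][2], alpha[2][0]
    diagonal_part = (xx - yy) ** 2 + (yy - zz) ** 2 + (zz - xx) ** 2
    off_diagonal_part = 6.0 * (xy ** 2 + yz ** 2 + zx ** 2)
    return math.sqrt(0.5 * (diagonal_part + off_diagonal_part))


# Made-up symmetric tensor in arbitrary units, for illustration only.
alpha_example = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
]
iso = isotropic_polarizability(alpha_example)
aniso = polarizability_anisotropy(alpha_example)
print(iso, aniso)
```

Both quantities are unchanged by rotating the cluster, which is consistent with the abstract's use of rotation-invariant descriptors as model inputs.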
Evaluation of machine learning models for the accelerated prediction of density functional theory calculated ¹⁹F chemical shifts based on local atomic environments
Pub Date : 2024-10-17 DOI: 10.1016/j.aichem.2024.100078
Sophia Li, Emma Wang, Leia Pei, Sourodeep Deb, Prashanth Prabhala, Sai Hruday Reddy Nara, Raina Panda, Shiven Eltepu, Marx Akl, Larry McMahan, Edward Njoo
The introduction of fluorine into compounds plays a crucial role in drug development, as it greatly influences their final pharmacokinetic and dynamic properties. Due to the prevalence of fluorine in FDA-approved drugs in recent years, identifying the mechanisms driving their chemical transformations has become crucial in the drug discovery landscape. ¹⁹F NMR spectroscopy is a powerful analytical technique for examining fluorine-containing compounds, offering valuable information about their structure, dynamics, and reactivity. NMR spectra can be interpreted with the help of Density Functional Theory (DFT). However, the computational cost of DFT limits the screening of compounds and the discovery of feasible drug candidates. Here, we present a machine learning approach to accelerate the prediction of DFT-calculated ¹⁹F NMR chemical shifts. The fluorine atoms' features in the models were derived from their local three-dimensional environments, representing the neighboring atoms within a radius of n Å of the given fluorine atom in the compound. A comparative analysis of thirteen regression models was conducted using features extracted from 501 fluorinated compounds in our laboratory's chemical inventory. Among the models, Gradient Boosting Regression (GBR) exhibited the highest performance, achieving a mean absolute error of 3.31 ppm with a local environment radius of 3 Å. This demonstrates accuracy comparable to DFT calculations while reducing computational time from several hundred seconds to milliseconds. A radius of 3 Å was also found to be optimal across all models when encoding features for local atomic environments.
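The local-environment idea in the abstract above, collecting the atoms within a fixed radius of a given fluorine atom, can be sketched as follows. The abstract does not say how the neighbors are turned into model features, so this shows only the neighbor search; the function name `neighbors_within_radius` and the coordinates are hypothetical.

```python
import math


def neighbors_within_radius(atoms, center_index, radius):
    """Return (element, distance) pairs for atoms within `radius` of the center.

    `atoms` is a list of (element_symbol, (x, y, z)) tuples with coordinates
    in angstroms. The center atom itself is excluded, and the result is
    sorted by increasing distance.
    """
    _, center_position = atoms[center_index]
    found = []
    for index, (element, position) in enumerate(atoms):
        if index == center_index:
            continue
        distance = math.dist(center_position, position)
        if distance <= radius:
            found.append((element, distance))
    found.sort(key=lambda pair: pair[1])
    return found


# Hypothetical fragment with invented coordinates (not a real geometry).
toy_atoms = [
    ("F", (0.0, 0.0, 0.0)),
    ("C", (1.35, 0.0, 0.0)),
    ("C", (2.1, 1.2, 0.0)),
    ("H", (4.5, 0.0, 0.0)),
]
environment = neighbors_within_radius(toy_atoms, center_index=0, radius=3.0)
print(environment)
```

With a 3 Å cutoff the distant hydrogen is dropped while both carbons are kept, illustrating how the radius controls how much of the molecule each fluorine atom "sees".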
Artificial intelligence chemistry, vol. 2, no. 2, Article 100078
Citations: 0
Leveraging graph neural networks to predict Hammett’s constants for benzoic acid derivatives
Pub Date : 2024-10-16 DOI: 10.1016/j.aichem.2024.100079
Vaneet Saini , Ranjeet Kumar
The Hammett constants, σm and σp, reflect the electron-withdrawing and electron-donating abilities of substituents on aromatic compounds and have been used successfully in various structure-activity relationship studies. However, determining these constants experimentally is both resource-intensive and time-consuming. In this study, we explore the use of graph neural networks (GNNs) to predict Hammett constants from graph-based features. This approach aims to provide rapid and efficient predictions of σm and σp values, eliminating the need for extensive computational and experimental setups. By leveraging GNNs, we hope to streamline the process of obtaining these critical parameters, thereby facilitating more efficient reaction design and enhancing the applicability of linear free energy relationship studies in chemical research.
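For context, the quantity being predicted follows from the standard Hammett definition: for the ionization of substituted benzoic acids in water at 25 °C, ρ is set to 1, so σ = pKa(benzoic acid) − pKa(substituted benzoic acid). The sketch below only illustrates that definition and is not the GNN model described in the abstract. The function names and the input pKa values are illustrative assumptions, not data from the paper.

```python
def hammett_sigma(pka_parent, pka_substituted):
    """Substituent constant from the Hammett definition (rho = 1 for benzoic acids).

    sigma = log10(K_X / K_H) = pKa(parent) - pKa(substituted)
    """
    return pka_parent - pka_substituted

def predicted_log_ratio(sigma, rho):
    """Hammett equation: log10(K_X / K_H) = rho * sigma for a reaction with constant rho."""
    return rho * sigma

# Illustrative inputs only (approximate pKa values, not taken from the paper).
sigma_example = hammett_sigma(pka_parent=4.20, pka_substituted=3.44)

# Positive sigma indicates an electron-withdrawing substituent.
is_withdrawing = sigma_example > 0
```

A model such as the GNN described above would replace the experimental pKa measurement by predicting σ directly from the molecular graph.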
Artificial intelligence chemistry, vol. 2, no. 2, Article 100079
Citations: 0
Molecular similarity: Theory, applications, and perspectives
Pub Date : 2024-08-31 DOI: 10.1016/j.aichem.2024.100077
Kenneth López-Pérez , Juan F. Avellaneda-Tamayo , Lexin Chen , Edgar López-López , K. Eurídice Juárez-Mercado , José L. Medina-Franco , Ramón Alain Miranda-Quintana

Molecular similarity pervades much of our understanding and rationalization of chemistry. This has become particularly evident in the current data-intensive era of chemical research, with similarity measures serving as the backbone of many supervised and unsupervised Machine Learning (ML) procedures. Here, we discuss the role of molecular similarity in drug design, chemical space exploration, chemical “art” generation, molecular representations, and more. We also discuss more recent topics in molecular similarity, such as the ability to efficiently compare large molecular libraries.
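As a concrete example of a similarity measure, the sketch below computes the Tanimoto (Jaccard) coefficient between two binary fingerprints. The abstract does not name a specific measure, so the choice of Tanimoto is an assumption, as are the function name `tanimoto`, the representation of a fingerprint as a set of on-bit indices, and the toy fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as collections of on-bit indices.

    T = |A ∩ B| / |A ∪ B|, ranging from 0 (no shared bits) to 1 (identical).
    """
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / union

# Toy fingerprints, invented for illustration only.
fp_1 = {1, 2, 3, 4}
fp_2 = {3, 4, 5, 6}

similarity = tanimoto(fp_1, fp_2)  # 2 shared bits out of 6 distinct bits
```

Comparing every pair in a library this way scales quadratically with library size, which is why efficient comparison of large molecular libraries is a topic in its own right.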

Artificial intelligence chemistry, vol. 2, no. 2, Article 100077. Open-access PDF: https://www.sciencedirect.com/science/article/pii/S2949747724000356/pdfft?md5=7238a1972b367d1732b52f425b046ba9&pid=1-s2.0-S2949747724000356-main.pdf
Citations: 0