Artificial intelligence chemistry: latest articles (page 2)

A method for predicting molecular point group based on graph neural networks

Artificial intelligence chemistry

Pub Date : 2025-11-04 DOI: 10.1016/j.aichem.2025.100097

Siyuan Zeng , Kuanping Gong , Yongquan Jiang , Yan Yang

Molecular symmetry is fundamental to understanding molecular properties, designing functional materials, and optimizing chemical structures. Traditional symmetry determination methods, based on mathematical and rule-based approaches, are often limited by high computational cost and low efficiency. At present, deep learning methods predicting molecular 3D conformations from 2D structures also neglect molecular symmetry and point group considerations. To address these challenges, we propose a novel task: predicting the point group of a molecule's most stable 3D conformation using only its 2D topological graph, thereby enabling symmetry-aware conformation prediction. We adopt Graph Neural Networks (GNNs) to learn from molecular graph structures, and evaluate several GNN variants on this task. Among them, the Graph Isomorphism Network (GIN) achieves the highest accuracy by effectively capturing both local connectivity and global structural information. Experiments on the QM9 dataset show that our method achieves 92.7 % accuracy and an F1-score of 0.924 on the test set, significantly surpassing both traditional approaches and other GNN-based methods. This work demonstrates the potential of deep learning in automated, efficient, and accurate molecular symmetry prediction, providing a valuable tool for future research in computational chemistry and material science.
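
The sum-based neighbour aggregation that characterises a GIN layer can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the one-hot atom features, and the use of a bare ReLU in place of the learned MLP are all hypothetical and are not taken from the paper, and the classification head that would map the readout to a point-group label is omitted.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]  # adjacency list: atom index -> bonded atom indices


def gin_layer(graph: Graph, feats: Dict[int, List[float]], eps: float = 0.0) -> Dict[int, List[float]]:
    """One GIN-style update: h_v' = ReLU((1 + eps) * h_v + sum of neighbour features).

    A real GIN applies a learned MLP to the aggregated vector; ReLU is a
    stand-in here (assumption made for illustration only).
    """
    out: Dict[int, List[float]] = {}
    for v, neighbours in graph.items():
        agg = [(1.0 + eps) * x for x in feats[v]]
        for u in neighbours:
            agg = [a + b for a, b in zip(agg, feats[u])]
        out[v] = [max(0.0, a) for a in agg]
    return out


def sum_readout(feats: Dict[int, List[float]]) -> List[float]:
    """Graph-level embedding: element-wise sum over all node embeddings."""
    dim = len(next(iter(feats.values())))
    return [sum(h[i] for h in feats.values()) for i in range(dim)]


# Toy 2D graph for a water-like molecule: atom 0 is O, atoms 1 and 2 are H.
water_graph: Graph = {0: [1, 2], 1: [0], 2: [0]}
water_feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0]}  # one-hot [is_O, is_H]

updated = gin_layer(water_graph, water_feats)
embedding = sum_readout(updated)
```

In this toy case the two hydrogen atoms, which are topologically equivalent, receive identical embeddings after the update, which is the kind of structural regularity a symmetry classifier could exploit.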

Vol. 3 (2), Article 100097

Citations: 0

Machine learning-guided synthesis of prospective organic molecular materials: An algorithm with latent variables for understanding and predicting experimentally unobservable reactions

Artificial intelligence chemistry

Pub Date : 2025-10-13 DOI: 10.1016/j.aichem.2025.100096

Kazuhiro Takeda , Naoya Ohtsuka , Toshiyasu Suzuki , Norie Momiyama

Chemists have traditionally relied on heuristic approaches to qualitatively assess chemical structure–property relationships and interpret experimental outcomes. However, these methods are inherently limited in handling large volumes of data and integrating them effectively into experimental planning. Understanding the interrelationships among different substitution patterns of organic molecular materials is crucial for optimizing synthetic conditions and expanding their applicability. In this study, we developed a machine learning (ML) algorithm incorporating latent variables to predict unobservable reactions and synthetic conditions for organic materials, specifically perfluoro-iodinated naphthalene derivatives. The algorithm accurately estimated substitution pattern relationships and reaction yields, which were experimentally validated with high-yield outcomes. Our findings reveal that latent variables effectively capture underlying physicochemical relationships, achieving an R value > 0.99. This approach establishes an ML-guided framework that complements heuristic decision-making in chemistry and optimizes synthetic processes for the target molecule in an extrapolative manner. Further applications of this algorithm will focus on synthetic design and physicochemical property prediction, particularly for catalyst discovery and organic semiconductor optimization.

Vol. 3 (2), Article 100096

Citations: 0

Optimization of catalyst composition and performance for PEM fuel cells: A data-driven approach

Artificial intelligence chemistry

Pub Date : 2025-09-14 DOI: 10.1016/j.aichem.2025.100095

Pramoth Varsan Madhavan , Xin Zeng , Samaneh Shahgaldi , Sushanta K. Mitra , Xianguo Li

Transportation’s rising negative environmental impacts and energy demands highlight the urgent need for clean alternative power sources such as proton exchange membrane (PEM) fuel cells. However, the high cost of platinum catalysts hinders their commercialization, making the development of low-platinum, high-performance catalysts essential for achieving net-zero targets. This study employs a data-driven machine learning approach to optimize the oxygen reduction reaction (ORR) catalyst composition and predict its long-term performance using extreme gradient boosting (XGB), artificial neural networks (ANN), and a genetic algorithm (GA). Linear sweep voltammetry (LSV) data were collected for three distinct catalyst compositions and divided into separate datasets. The data were preprocessed and model hyperparameters were fine-tuned to enhance model accuracy. XGB models trained on these datasets accurately predicted LSV polarization plots for unseen data, as evidenced by R² values > 0.99. To further optimize ORR catalyst design, an ANN model trained on data from the three catalyst compositions was integrated with a genetic algorithm. This predictive framework identified the optimal catalyst composition by maximizing the mass activity of the catalyst. Experimental validation of this optimized composition yielded strong agreement with predicted LSV current values, confirming the reliability of the ANN-GA approach. This research underscores the potential of machine learning-based predictive frameworks to accelerate the development of advanced ORR catalysts for PEM fuel cells.
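
The idea of coupling a trained surrogate model with a genetic algorithm can be sketched as follows. Everything here is an assumption made for illustration: `surrogate_mass_activity` is a hypothetical closed-form stand-in for the trained ANN, the two-component composition vector and all hyperparameters are invented, and the abstract does not describe the authors' actual GA operators.

```python
import random
from typing import Callable, List, Tuple

Composition = Tuple[float, ...]  # fractions in [0, 1]; hypothetical encoding


def surrogate_mass_activity(c: Composition) -> float:
    """Hypothetical stand-in for a trained ANN: peaks at composition (0.3, 0.6)."""
    x, y = c
    return -((x - 0.3) ** 2) - ((y - 0.6) ** 2)


def genetic_search(
    fitness: Callable[[Composition], float],
    pop_size: int = 20,
    generations: int = 40,
    mutation_scale: float = 0.05,
    seed: int = 0,
) -> Composition:
    """Maximise `fitness` with truncation selection, averaging crossover,
    Gaussian mutation, and elitism (the best individual always survives)."""
    rng = random.Random(seed)
    population: List[Composition] = [(rng.random(), rng.random()) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # keep the fitter half as parents
        children: List[Composition] = [ranked[0]]  # elitism
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = tuple(
                min(1.0, max(0.0, (ga + gb) / 2 + rng.gauss(0.0, mutation_scale)))
                for ga, gb in zip(a, b)
            )
            children.append(child)
        population = children
    return max(population, key=fitness)


best = genetic_search(surrogate_mass_activity)
```

Because of elitism, the best fitness in the population can never decrease from one generation to the next, which makes the search easy to monitor; in practice the surrogate call would be replaced by a prediction from the trained model.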

Vol. 3 (2), Article 100095

Citations: 0

GraphSLA: Graph machine learning for predicting small molecule - lncRNA associations

Artificial intelligence chemistry

Pub Date : 2025-08-11 DOI: 10.1016/j.aichem.2025.100094

Ashish Panghalia, Parth Kumar, Vikram Singh

Long non-coding RNAs are increasingly reported to have critical roles in gene expression, regulation of cellular processes, and in the onset and manifestation of various diseases. Recent studies have highlighted the role of small molecules (SMs) in controlling the functioning of lncRNAs, making SM-lncRNA associations (SLAs) a promising approach for therapeutic development. In this study, using 3563 curated SLAs among 115 SMs and 2826 lncRNAs, five graph learning algorithms are developed for the SLA classification. Node2Vec was used to extract the contextual features of SMs and lncRNAs from their bipartite association network, while Mol2Vec and Doc2Vec algorithms were used for the extraction of molecular features of the SMs and lncRNAs, respectively. Principal components corresponding to the 95 % variability in feature vectors were used to train five graph-learning models, namely, Graph Neural Network (GNN), Graph Convolutional Network (GCN), Graph Attention Network (GAT), Graph Sample and Aggregate (GraphSAGE), and Simplified Graph Convolution (SGConv). Among these five models, GraphSAGE achieved the best performance with an accuracy of 98.0 % and an AUC-ROC of 99.4 % when evaluated over 10 training epochs. Generalizability studies were also conducted to assess whether the developed models maintain robustness, reliability, and practical utility when applied to real-world data. The overall results reported in this work exhibit better performance over previously developed SLA prediction methods. This study underscores the potential of graph-learning methods to effectively capture the intricate associations among SMs and lncRNAs, facilitating the discovery of novel SLAs.
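
Retaining the principal components that account for 95 % of the variability amounts to choosing the smallest number of components whose cumulative explained variance reaches that threshold. A minimal sketch, assuming the covariance eigenvalues are already available (the function name and the example eigenvalues are illustrative, not taken from the paper):

```python
from typing import Sequence


def components_for_variance(eigenvalues: Sequence[float], threshold: float = 0.95) -> int:
    """Return the smallest k such that the k largest eigenvalues explain
    at least `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Illustrative eigenvalues: cumulative shares are 0.60, 0.90, 0.97, ...
n_keep = components_for_variance([6.0, 3.0, 0.7, 0.2, 0.1])
```

Here three components are kept, because the first two explain only 90 % of the variance while the first three explain 97 %.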

Vol. 3 (2), Article 100094

Citations: 0

Machine learning prediction of pKa of organic acids

Artificial intelligence chemistry

Pub Date : 2025-08-08 DOI: 10.1016/j.aichem.2025.100092

Juda Baikété , Alhadji Malloum , Jeanet Conradie

The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and ability to cross the plasma membrane. It affects the chemical properties of absorption, distribution, metabolism, excretion, and toxicity. Thus, accurate prediction of pKa values is crucial for understanding and modulating the acidity and basicity of organic molecules, with applications in drug discovery, materials science, and environmental chemistry. Here, we present four tree-based machine learning models for pKa prediction of organic molecules. The four models, Random Forest (RF), Extra Trees (ExTr), Histogram Gradient Boosting (HGBoost), and Gradient Boosting (GBoost), were trained on an experimental pKa dataset and tested on SAMPL6 and SAMPL7, two external datasets. Structural and organic parameter (SPOC)-based descriptors were introduced to represent the physicochemical properties of molecules. Further molecular descriptors have been generated using density functional theory (DFT) calculations, and RDKit library. The model trained with the ExTr algorithm showed the best prediction performance with an overall mean absolute error (MAE) value of 1.41 pKa units. Our model (ExTr) outperforms selected models on a range of benchmark data, while offering two unique advantages: (1) full transparency (open descriptors and data) in contrast to proprietary black boxes, and (2) reduced computational cost compared to hybrid QM/ML approaches. While specialized tools like QupKake (MAE = 0.67) achieve better accuracy, our framework provides an open-source basis for interpretable pKa predictions, efficiently combining molecular physics and machine learning. This model represents a significant advancement in pKa prediction, offering a powerful tool for various applications in chemistry and beyond.
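
For readers unfamiliar with the metric, the mean absolute error quoted above is simply the average magnitude of the difference between predicted and experimental pKa values, expressed in pKa units. A minimal sketch follows; the function name and the numbers are illustrative only and are not results from the paper.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |experimental - predicted| over all molecules."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values: absolute errors of 0.5, 0.5 and 1.0 pKa units.
experimental = [4.76, 9.99, 3.75]
predicted = [5.26, 9.49, 4.75]
mae = mean_absolute_error(experimental, predicted)
```

An MAE of 1.41 therefore means that predictions are off by about 1.4 pKa units on average, which corresponds to more than an order of magnitude in the dissociation constant.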

Vol. 3 (2), Article 100092

Citations: 0

Machine Learning (ML)-driven quantitative structure-pharmacokinetic relationship (QSPKR) modeling of the tissue-to-plasma partition coefficient (Kp) of drugs across different tissues

Artificial intelligence chemistry

Pub Date : 2025-07-31 DOI: 10.1016/j.aichem.2025.100093

Souvik Pore, Kunal Roy

In drug discovery, estimating the drug candidate's pharmacokinetic (PK) parameters is crucial for determining its safety and efficacy within the body. The tissue-to-plasma partition coefficient (Kp) indicates how a drug partitions within a tissue, potentially leading to tissue-specific activity or toxicity. Therefore, determining Kp values for a drug is essential for its safety assessment. However, only a limited number of such studies are available. Here, we developed machine learning (ML)-driven quantitative structure-pharmacokinetic relationship (QSPKR) models to predict the Kp values for drugs across 11 different tissues. Initially, we developed models to predict Kp values for drugs with missing Kp values for specific tissues within the dataset solely based on the structural and physicochemical properties of the drugs. Subsequently, another set of models was developed using both structural and physicochemical properties and the Kp values from other tissues. In this case, predicted values from the initial models were also incorporated where experimental Kp values were unavailable. These models demonstrate significant improvement in predictability (Q²_F1 = 0.79–0.95, Q²_F2 = 0.78–0.95) for a drug compared to the initial models. Here, we conducted a screening using a true external dataset from the SIDER database. This analysis indicates that compounds with higher tissue partitioning are more likely to exhibit toxicity to that specific tissue. Finally, a Python-based tool (https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/kp-calculator) was created to predict Kp values for drugs in different tissues.
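
The two external-validation metrics quoted above differ only in the reference mean used in the denominator: Q²_F1 compares test-set errors against the training-set mean, while Q²_F2 uses the test-set mean. The sketch below follows these commonly used definitions; the function names and toy numbers are illustrative and are not taken from the paper or its tool.

```python
from typing import Sequence


def _press(y_test: Sequence[float], y_pred: Sequence[float]) -> float:
    """Predictive residual sum of squares over the test set."""
    return sum((y - p) ** 2 for y, p in zip(y_test, y_pred))


def q2_f1(y_train: Sequence[float], y_test: Sequence[float], y_pred: Sequence[float]) -> float:
    """Q2_F1 = 1 - PRESS / sum((y_test - mean(y_train))^2)."""
    train_mean = sum(y_train) / len(y_train)
    return 1.0 - _press(y_test, y_pred) / sum((y - train_mean) ** 2 for y in y_test)


def q2_f2(y_test: Sequence[float], y_pred: Sequence[float]) -> float:
    """Q2_F2 = 1 - PRESS / sum((y_test - mean(y_test))^2)."""
    test_mean = sum(y_test) / len(y_test)
    return 1.0 - _press(y_test, y_pred) / sum((y - test_mean) ** 2 for y in y_test)


# Toy values (hypothetical log Kp data): PRESS = 0.5 in both cases.
y_train = [1.0, 2.0, 3.0]
y_test = [2.0, 4.0]
y_pred = [2.5, 3.5]
f1 = q2_f1(y_train, y_test, y_pred)  # denominator 4.0 -> 0.875
f2 = q2_f2(y_test, y_pred)           # denominator 2.0 -> 0.75
```

The toy case shows why the two values can diverge: when the test-set mean is far from the training-set mean, the Q²_F1 denominator grows and the metric looks more favourable than Q²_F2.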

Vol. 3 (2), Article 100093

Citations: 0

ChiralCat: Molecular chirality classification with enhanced spatial representation using learnable queries

Artificial intelligence chemistry

Pub Date : 2025-06-27 DOI: 10.1016/j.aichem.2025.100091

Yichuan Peng , Gufeng Yu , Runhan Shi , Letian Chen , Xi Wang , Wenjie Du , Xiaohong Huo , Yang Yang

Molecular chirality is a key focus of research in chemistry and biology. In nature, there are many complex categories of chirality and it can strongly alter biochemical activities and interactions, particularly in asymmetric catalysis and protein–drug binding. Despite advancements in molecular property prediction approaches, a computational method capable of identifying chiral types has been absent, impeding progress in chirality studies. This gap is primarily due to the inability of current molecular representation models to capture chiral-related spatial features and the scarcity of annotated datasets for complex chiral types. To address these limitations, we develop ChiralCat, a pioneering machine learning method for molecular chirality classification. ChiralCat’s core is a pre-trained multi-modal classifier that enhances spatial molecular representations. This is achieved through learnable queries, guided by chirality-related descriptions generated by a large language model (LLM). To facilitate the model’s training, we construct an extensive chiral molecule dataset comprising 17,181 molecules across various chiral categories. Our experimental results, both quantitative and visualized, reveal that ChiralCat outperforms existing 3D molecular representation learning models in capturing spatial information pertinent to chirality, thereby exhibiting superior capability in discerning complex chiral types.

Vol. 3 (2), Article 100091

Citations: 0

Erratum regarding missing statements in previously published article

Artificial intelligence chemistry

Pub Date : 2025-06-01 DOI: 10.1016/j.aichem.2024.100081

Citations: 0

Corrigendum to “Machine learning assisted analysis and prediction of rubber formulation using existing databases” [Artif. Intell. Chem. 2/1 (2024) 100054]

Artificial intelligence chemistry

Pub Date : 2025-06-01 DOI: 10.1016/j.aichem.2025.100088

Wei Deng , Yuehua Zhao , Yafang Zheng , Yuan Yin , Yan Huan , Lijun Liu , Dapeng Wang

Citations: 0

Generating eco-friendly ionic liquids with enhanced CO2 solubility using language models

Artificial intelligence chemistry

Pub Date : 2025-05-22 DOI: 10.1016/j.aichem.2025.100089

Adroit T.N. Fajar , Guillaume Lambard , Md. Amirul Islam , Bidyut B. Saha , Zakiah D. Nurfajrin , Kevin Septioga

This study presents a viable approach for designing eco-friendly ionic liquids (ILs) with enhanced CO₂ solubility using language models, specifically GPT-2 in conjunction with SMILES-X. The GPT-2 model was fine-tuned on a relatively small, unlabeled IL dataset and subsequently used to generate diverse IL structures. SMILES-X models, trained on IL datasets labeled with CO₂ solubility and eco-toxicity values, were employed to predict the properties of the generated ILs. Trends observed in the predicted IL properties were validated using density functional theory (DFT) and COSMO-RS calculations. The GPT-2 model was then fine-tuned iteratively, with the training data updated by including the top generated ILs from previous cycles. This iterative process led to a gradual improvement in the properties of the generated ILs. It was also observed, however, that continuously adding curated generated ILs to the training data eventually caused the model to produce correct but unrealistic IL structures. These findings highlight both the potential and limitations of language models in designing novel chemicals. Additionally, the CO₂ adsorption capacity of a surrogate IL was experimentally measured, demonstrating the potential of this approach in advancing decarbonization technologies.
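
The generate, score, select, and retrain cycle described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `iterative_refinement`, the toy sampler, the toy property predictors, and the ranking rule (predicted solubility minus predicted toxicity) are all hypothetical stand-ins for the fine-tuned GPT-2 generator and the SMILES-X property models.

```python
import random
from typing import Callable, List


def iterative_refinement(
    seed_data: List[str],
    fine_tune_and_sample: Callable[[List[str], int], List[str]],
    predict_solubility: Callable[[str], float],
    predict_toxicity: Callable[[str], float],
    cycles: int = 3,
    samples_per_cycle: int = 20,
    top_k: int = 5,
) -> List[str]:
    """Grow the training set with the best-scoring generated candidates each cycle."""
    training_data = list(seed_data)
    for _ in range(cycles):
        # Stand-in for "fine-tune the generator on training_data, then sample from it".
        candidates = fine_tune_and_sample(training_data, samples_per_cycle)
        # Keep only candidates that are new, dropping repeats within the batch.
        seen = set(training_data)
        novel = [c for c in dict.fromkeys(candidates) if c not in seen]
        # Assumed ranking rule: prefer high predicted solubility and low predicted toxicity.
        ranked = sorted(
            novel,
            key=lambda c: predict_solubility(c) - predict_toxicity(c),
            reverse=True,
        )
        training_data.extend(ranked[:top_k])
    return training_data


def make_toy_sampler(seed: int) -> Callable[[List[str], int], List[str]]:
    """Hypothetical generator: mutates one character of a random training string."""
    rng = random.Random(seed)

    def sample(training_data: List[str], n: int) -> List[str]:
        out = []
        for _ in range(n):
            base = rng.choice(training_data)
            pos = rng.randrange(len(base))
            out.append(base[:pos] + rng.choice("CNO") + base[pos + 1:])
        return out

    return sample


def toy_solubility(s: str) -> float:
    """Toy predictor, not a chemical model: counts 'O' characters."""
    return float(s.count("O"))


def toy_toxicity(s: str) -> float:
    """Toy predictor, not a chemical model: penalizes 'N' characters."""
    return 0.5 * s.count("N")


if __name__ == "__main__":
    seed_set = ["CCCC", "CCNC"]
    final_set = iterative_refinement(
        seed_set, make_toy_sampler(0), toy_solubility, toy_toxicity
    )
    print(final_set)
```

Because each cycle feeds the generator its own top-ranked outputs, the training set drifts toward whatever the predictors reward. That is consistent with the abstract's observation that repeated cycles eventually produced correct but unrealistic structures.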


Artificial intelligence chemistry, Volume 3, Issue 1, Article 100089 (published 2025-05-22).

Citations: 0