Pub Date : 2025-11-04 DOI: 10.1016/j.aichem.2025.100097
Siyuan Zeng, Kuanping Gong, Yongquan Jiang, Yan Yang
Title: A method for predicting molecular point group based on graph neural networks
Journal: Artificial Intelligence Chemistry, vol. 3, issue 2, Article 100097
Molecular symmetry is fundamental to understanding molecular properties, designing functional materials, and optimizing chemical structures. Traditional symmetry determination methods, based on mathematical and rule-based approaches, are often limited by high computational cost and low efficiency. Current deep learning methods that predict molecular 3D conformations from 2D structures also neglect molecular symmetry and point group considerations. To address these challenges, we propose a novel task: predicting the point group of a molecule's most stable 3D conformation using only its 2D topological graph, thereby enabling symmetry-aware conformation prediction. We adopt Graph Neural Networks (GNNs) to learn from molecular graph structures and evaluate several GNN variants on this task. Among them, the Graph Isomorphism Network (GIN) achieves the highest accuracy by effectively capturing both local connectivity and global structural information. Experiments on the QM9 dataset show that our method achieves 92.7% accuracy and an F1-score of 0.924 on the test set, significantly surpassing both traditional approaches and other GNN-based methods. This work demonstrates the potential of deep learning in automated, efficient, and accurate molecular symmetry prediction, providing a valuable tool for future research in computational chemistry and materials science.
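To make the GIN idea in the abstract above more concrete, here is a minimal sketch of a single GIN-style neighbourhood update followed by a sum readout, written in plain Python. It is an illustration under stated assumptions, not the authors' model: the learned MLP that a real GIN layer applies after aggregation is left out, and the toy graph, the one-hot atom features, and the names `gin_layer` and `sum_readout` are all hypothetical.

```python
from collections import defaultdict


def gin_layer(features, edges, eps=0.0):
    """One simplified GIN update: h_v <- (1 + eps) * h_v + sum of neighbour features.

    A real GIN layer passes this sum through a learned MLP; that step is omitted here.
    """
    neighbours = defaultdict(list)
    for u, v in edges:  # bonds are undirected, so register both directions
        neighbours[u].append(v)
        neighbours[v].append(u)

    updated = {}
    for node, h in features.items():
        agg = [(1.0 + eps) * x for x in h]
        for nb in neighbours[node]:
            agg = [a + b for a, b in zip(agg, features[nb])]
        updated[node] = agg
    return updated


def sum_readout(features):
    """Graph-level embedding: element-wise sum over all atoms."""
    dim = len(next(iter(features.values())))
    return [sum(h[i] for h in features.values()) for i in range(dim)]


# Toy heavy-atom graph C-C-O (hypothetical example); one-hot features over [C, O].
atoms = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
bonds = [(0, 1), (1, 2)]

hidden = gin_layer(atoms, bonds)
graph_embedding = sum_readout(hidden)  # a point-group classifier head would consume this vector
```

Sum aggregation is what distinguishes GIN from mean- or max-pooling variants: it keeps neighbour multiplicities, which is plausibly useful when the target depends on how many equivalent substituents an atom carries.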
Pub Date : 2025-10-13 DOI: 10.1016/j.aichem.2025.100096
Kazuhiro Takeda, Naoya Ohtsuka, Toshiyasu Suzuki, Norie Momiyama
Title: Machine learning-guided synthesis of prospective organic molecular materials: An algorithm with latent variables for understanding and predicting experimentally unobservable reactions
Journal: Artificial Intelligence Chemistry, vol. 3, issue 2, Article 100096
Chemists have traditionally relied on heuristic approaches to qualitatively assess chemical structure–property relationships and interpret experimental outcomes. However, these methods are inherently limited in handling large volumes of data and integrating them effectively into experimental planning. Understanding the interrelationships among different substitution patterns of organic molecular materials is crucial for optimizing synthetic conditions and expanding their applicability. In this study, we developed a machine learning (ML) algorithm incorporating latent variables to predict unobservable reactions and synthetic conditions for organic materials, specifically perfluoro-iodinated naphthalene derivatives. The algorithm accurately estimated substitution pattern relationships and reaction yields, which were experimentally validated with high-yield outcomes. Our findings reveal that latent variables effectively capture underlying physicochemical relationships, achieving an R value > 0.99. This approach establishes an ML-guided framework that complements heuristic decision-making in chemistry and optimizes synthetic processes for the target molecule in an extrapolative manner. Further applications of this algorithm will focus on synthetic design and physicochemical property prediction, particularly for catalyst discovery and organic semiconductor optimization.
Pub Date : 2025-09-14 DOI: 10.1016/j.aichem.2025.100095
Pramoth Varsan Madhavan, Xin Zeng, Samaneh Shahgaldi, Sushanta K. Mitra, Xianguo Li
Title: Optimization of catalyst composition and performance for PEM fuel cells: A data-driven approach
Journal: Artificial Intelligence Chemistry, vol. 3, issue 2, Article 100095
Transportation’s rising negative environmental impacts and energy demands highlight the urgent need for clean alternative power sources such as proton exchange membrane (PEM) fuel cells. However, the high cost of platinum catalysts hinders their commercialization, making the development of low-platinum, high-performance catalysts essential for achieving net-zero targets. This study employs a data-driven machine learning approach to optimize the oxygen reduction reaction (ORR) catalyst composition and predict its long-term performance using extreme gradient boosting (XGB), artificial neural networks (ANN), and a genetic algorithm (GA). Linear sweep voltammetry (LSV) data is collected for three distinct catalyst compositions and divided into separate datasets. Data is preprocessed and model hyperparameters are fine-tuned to enhance model accuracy. XGB models trained on these datasets accurately predicted LSV polarization plots for unseen data, as evidenced by R² values > 0.99. To further optimize ORR catalyst design, an ANN model trained on data from three different catalyst compositions is integrated with a genetic algorithm. This predictive framework effectively identified the optimal catalyst composition by maximizing the mass activity of the catalyst. Experimental validation of this optimized composition yielded strong agreement with predicted LSV current values, confirming the reliability of the ANN-GA approach. This research underscores the potential of machine learning-based predictive frameworks to accelerate the development of advanced ORR catalysts for PEM fuel cells.
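The ANN-GA coupling described above amounts to a genetic algorithm searching a composition space while a trained model supplies the fitness. The sketch below shows that loop in plain Python under loud assumptions: `surrogate_mass_activity` is a hypothetical analytic stand-in for the trained ANN, the two "genes" are arbitrary normalized composition variables, and the selection, crossover, and mutation settings are illustrative choices rather than the paper's.

```python
import random


def surrogate_mass_activity(x):
    """Hypothetical stand-in for a trained ANN; it peaks at x = (0.3, 0.6)."""
    return -((x[0] - 0.3) ** 2 + (x[1] - 0.6) ** 2)


def genetic_search(fitness, n_genes=2, pop_size=20, generations=40, seed=0):
    """Maximize `fitness` over [0, 1]^n_genes with a small elitist genetic algorithm."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection; parents survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            # Gaussian mutation, clipped back into the valid composition range.
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.05))) for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


best = genetic_search(surrogate_mass_activity)
```

Because the parents are carried over unchanged, the best fitness never decreases between generations. In a real workflow the returned composition would still need the kind of experimental validation the abstract reports.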
Pub Date : 2025-08-11 DOI: 10.1016/j.aichem.2025.100094
Ashish Panghalia, Parth Kumar, Vikram Singh
Title: GraphSLA: Graph machine learning for predicting small molecule - lncRNA associations
Journal: Artificial Intelligence Chemistry, vol. 3, issue 2, Article 100094
Long non-coding RNAs (lncRNAs) are increasingly reported to have critical roles in gene expression, regulation of cellular processes, and the onset and manifestation of various diseases. Recent studies have highlighted the role of small molecules (SMs) in controlling the functioning of lncRNAs, making SM-lncRNA associations (SLAs) a promising approach for therapeutic development. In this study, using 3563 curated SLAs among 115 SMs and 2826 lncRNAs, five graph learning algorithms are developed for SLA classification. Node2Vec was used to extract the contextual features of SMs and lncRNAs from their bipartite association network, while the Mol2Vec and Doc2Vec algorithms were used to extract molecular features of the SMs and lncRNAs, respectively. Principal components accounting for 95% of the variability in the feature vectors were used to train five graph-learning models, namely, Graph Neural Network (GNN), Graph Convolutional Network (GCN), Graph Attention Network (GAT), Graph Sample and Aggregate (GraphSAGE), and Simplified Graph Convolution (SGConv). Among these five models, GraphSAGE achieved the best performance with an accuracy of 98.0% and an AUC-ROC of 99.4% when evaluated over 10 training epochs. Generalizability studies were also conducted to assess whether the developed models maintain robustness, reliability, and practical utility when applied to real-world data. The overall results reported in this work outperform previously developed SLA prediction methods. This study underscores the potential of graph-learning methods to effectively capture the intricate associations among SMs and lncRNAs, facilitating the discovery of novel SLAs.
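The step of keeping the principal components that account for 95% of the variability can be illustrated with a small helper that picks the number of leading components from the covariance eigenvalues. This is a sketch, not the GraphSLA code: the function name `components_for_variance` and the eigenvalues are hypothetical, and a real pipeline would obtain the spectrum from a PCA library.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Return the smallest number of leading principal components whose
    cumulative explained-variance ratio reaches `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical covariance eigenvalues of a concatenated feature matrix.
eigvals = [5.0, 3.0, 1.6, 0.3, 0.1]
n_components = components_for_variance(eigvals)
```

With these illustrative eigenvalues the first two components explain 80% of the variance and the first three explain 96%, so three components are kept.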
Pub Date : 2025-08-08 DOI: 10.1016/j.aichem.2025.100092
Juda Baikété, Alhadji Malloum, Jeanet Conradie
Title: Machine learning prediction of pKa of organic acids
Journal: Artificial Intelligence Chemistry, vol. 3, issue 2, Article 100092
The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and the ability to cross the plasma membrane. It therefore influences absorption, distribution, metabolism, excretion, and toxicity. Thus, accurate prediction of pKa values is crucial for understanding and modulating the acidity and basicity of organic molecules, with applications in drug discovery, materials science, and environmental chemistry. Here, we present four tree-based machine learning models for pKa prediction of organic molecules. The four models, Random Forest (RF), Extra Trees (ExTr), Histogram Gradient Boosting (HGBoost), and Gradient Boosting (GBoost), were trained on an experimental pKa dataset and tested on two external datasets, SAMPL6 and SAMPL7. Structural and organic parameter (SPOC)-based descriptors were introduced to represent the physicochemical properties of molecules. Further molecular descriptors were generated using density functional theory (DFT) calculations and the RDKit library. The model trained with the ExTr algorithm showed the best prediction performance, with an overall mean absolute error (MAE) of 1.41 pKa units. Our model (ExTr) outperforms selected models on a range of benchmark data, while offering two unique advantages: (1) full transparency (open descriptors and data) in contrast to proprietary black boxes, and (2) reduced computational cost compared to hybrid QM/ML approaches. While specialized tools like QupKake (MAE = 0.67) achieve better accuracy, our framework provides an open-source basis for interpretable pKa predictions, efficiently combining molecular physics and machine learning. This model represents a significant advancement in pKa prediction, offering a powerful tool for various applications in chemistry and beyond.
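To put the MAE figures quoted above in context, the sketch below computes a mean absolute error in pKa units and uses the Henderson-Hasselbalch relation to show how a pKa error of that size shifts the predicted ionized fraction of a monoprotic acid. The numbers are illustrative only and are not taken from the paper's datasets; the function names are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation, here in pKa units."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def ionized_fraction_acid(pka, ph):
    """Henderson-Hasselbalch: fraction of a monoprotic acid present as A- at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Illustrative reference and predicted pKa values (assumptions, not paper data).
reference = [4.8, 9.2, 3.8]
predicted = [5.3, 8.2, 3.8]
mae = mean_absolute_error(reference, predicted)

# Effect of a +1.41 pKa unit error on the ionized fraction at pH 7.4.
shift = ionized_fraction_acid(4.8, 7.4) - ionized_fraction_acid(4.8 + 1.41, 7.4)
```

The ionized fraction is most sensitive to pKa errors when the pKa lies close to the pH of interest, which is one reason sub-unit accuracy matters for physiological predictions.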
Pub Date : 2025-07-31 DOI: 10.1016/j.aichem.2025.100093
Souvik Pore, Kunal Roy
Title: Machine Learning (ML)-driven quantitative structure-pharmacokinetic relationship (QSPKR) modeling of the tissue-to-plasma partition coefficient (Kp) of drugs across different tissues
Journal: Artificial Intelligence Chemistry, vol. 3, issue 2, Article 100093
In drug discovery, estimating a drug candidate's pharmacokinetic (PK) parameters is crucial for determining its safety and efficacy within the body. The tissue-to-plasma partition coefficient (Kp) indicates how a drug partitions within a tissue, potentially leading to tissue-specific activity or toxicity. Therefore, determining Kp values for a drug is essential for its safety assessment. However, only a limited number of such studies are available. Here, we developed machine learning (ML)-driven quantitative structure-pharmacokinetic relationship (QSPKR) models to predict the Kp values for drugs across 11 different tissues. Initially, we developed models to predict Kp values for drugs with missing Kp values for specific tissues within the dataset, based solely on the structural and physicochemical properties of the drugs. Subsequently, another set of models was developed using both structural and physicochemical properties and the Kp values from other tissues. In this case, predicted values from the initial models were also incorporated where experimental Kp values were unavailable. These models demonstrate a significant improvement in predictability (Q²F1 = 0.79–0.95, Q²F2 = 0.78–0.95) compared to the initial models. We also conducted a screening using a true external dataset from the SIDER database. This analysis indicates that compounds with higher tissue partitioning are more likely to exhibit toxicity to that specific tissue. Finally, a Python-based tool (https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/kp-calculator) was created to predict Kp values for drugs in different tissues.
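The abstract reports external validation metrics without defining them. The sketch below assumes the definitions commonly used in QSAR work: both metrics compare the squared prediction error on the test set with a baseline sum of squares, where Q²F1 measures the baseline around the training-set mean and Q²F2 around the test-set mean. The function names and the data are hypothetical.

```python
def q2_external(y_test, y_pred, reference_mean):
    """1 - (prediction error sum of squares) / (sum of squares around a reference mean)."""
    ss_res = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss_ref = sum((y - reference_mean) ** 2 for y in y_test)
    return 1.0 - ss_res / ss_ref


def q2_f1(y_train, y_test, y_pred):
    """Q2_F1: baseline is the mean of the training-set responses."""
    return q2_external(y_test, y_pred, sum(y_train) / len(y_train))


def q2_f2(y_test, y_pred):
    """Q2_F2: baseline is the mean of the test-set responses."""
    return q2_external(y_test, y_pred, sum(y_test) / len(y_test))


# Hypothetical log Kp values, chosen only to make the arithmetic easy to follow.
y_train = [0.0, 1.0, 2.0]
y_test = [1.0, 3.0]
y_pred = [1.5, 2.5]

f1 = q2_f1(y_train, y_test, y_pred)  # 1 - 0.5 / 4.0 = 0.875
f2 = q2_f2(y_test, y_pred)           # 1 - 0.5 / 2.0 = 0.75
```

The two values differ whenever the training and test means differ, which is why both are usually reported side by side.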
Pub Date : 2025-06-27 DOI: 10.1016/j.aichem.2025.100091
Yichuan Peng, Gufeng Yu, Runhan Shi, Letian Chen, Xi Wang, Wenjie Du, Xiaohong Huo, Yang Yang
Title: ChiralCat: Molecular chirality classification with enhanced spatial representation using learnable queries
Journal: Artificial Intelligence Chemistry, vol. 3, issue 2, Article 100091
Molecular chirality is a key focus of research in chemistry and biology. In nature, chirality falls into many complex categories and can strongly alter biochemical activities and interactions, particularly in asymmetric catalysis and protein–drug binding. Despite advancements in molecular property prediction approaches, a computational method capable of identifying chiral types has been absent, impeding progress in chirality studies. This gap is primarily due to the inability of current molecular representation models to capture chirality-related spatial features and the scarcity of annotated datasets for complex chiral types. To address these limitations, we develop ChiralCat, a pioneering machine learning method for molecular chirality classification. ChiralCat’s core is a pre-trained multi-modal classifier that enhances spatial molecular representations. This is achieved through learnable queries, guided by chirality-related descriptions generated by a large language model (LLM). To facilitate the model’s training, we construct an extensive chiral molecule dataset comprising 17,181 molecules across various chiral categories. Our experimental results, both quantitative and visualized, reveal that ChiralCat outperforms existing 3D molecular representation learning models in capturing spatial information pertinent to chirality, thereby exhibiting superior capability in discerning complex chiral types.
Pub Date : 2025-05-22 DOI: 10.1016/j.aichem.2025.100089
Adroit T.N. Fajar, Guillaume Lambard, Md. Amirul Islam, Bidyut B. Saha, Zakiah D. Nurfajrin, Kevin Septioga
Title: Generating eco-friendly ionic liquids with enhanced CO₂ solubility using language models
Journal: Artificial Intelligence Chemistry, vol. 3, issue 1, Article 100089
This study presents a viable approach for designing eco-friendly ionic liquids (ILs) with enhanced CO₂ solubility using language models, specifically GPT-2 in conjunction with SMILES-X. The GPT-2 model was fine-tuned on a relatively small, unlabeled IL dataset and subsequently used to generate diverse IL structures. SMILES-X models, trained on IL datasets labeled with CO₂ solubility and eco-toxicity values, were employed to predict the properties of the generated ILs. Trends observed in the predicted IL properties were validated using density functional theory (DFT) and COSMO-RS calculations. The GPT-2 model was then fine-tuned iteratively, with the training data updated by including the top generated ILs from previous cycles. This iterative process led to a gradual improvement in the properties of the generated ILs. It was also observed, however, that continuously adding curated generated ILs to the training data eventually caused the model to produce correct but unrealistic IL structures. These findings highlight both the potential and limitations of language models in designing novel chemicals. Additionally, the CO₂ adsorption capacity of a surrogate IL was experimentally measured, demonstrating the potential of this approach in advancing decarbonization technologies.
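The iterative scheme described above (generate, score, keep the top candidates, add them to the training data, repeat) can be sketched as a short loop. Everything here is a toy under stated assumptions: numbers stand in for IL structures, `toy_generate` stands in for fine-tuning and sampling a language model, `toy_score` stands in for a property predictor, and the cycle counts and `top_k` are arbitrary. None of these names or settings come from the paper.

```python
import random


def iterate_generation(seed_set, generate, score, cycles=3, n_samples=50, top_k=5):
    """Generate -> score -> keep the top candidates -> extend the training set."""
    training = list(seed_set)
    history = []
    for _ in range(cycles):
        candidates = generate(training, n_samples)  # stand-in for fine-tuning + sampling
        novel = set(candidates) - set(training)     # drop duplicates and already-known items
        best = sorted(novel, key=score, reverse=True)[:top_k]
        training.extend(best)
        history.append(best)
    return training, history


# Toy stand-ins: a number plays the role of a molecule and its value is the "property".
rng = random.Random(0)


def toy_generate(training, n_samples):
    """Hypothetical generator: perturbs randomly chosen training items."""
    return [round(rng.choice(training) + rng.uniform(-1.0, 1.0), 3) for _ in range(n_samples)]


def toy_score(candidate):
    """Hypothetical property predictor: higher is better."""
    return candidate


training, history = iterate_generation([0.0, 0.5, 1.0], toy_generate, toy_score)
```

The loop deliberately contains no realism check, so it can drift toward high-scoring but implausible items as the training set fills with its own outputs, which mirrors the limitation the abstract reports.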