Mael A Briand, Loïc Dreano, Ashenafi Legehar, Evgeni Grazhdankin, Leo Ghemtio, Henri Xhaard
Cooperative molecular contacts play an important role in protein structure and ligand binding. Here, we constructed a PostgreSQL database that stores structural information in the form of atomic environments and allows flexible mining of molecular contacts. Taking the Ser-His-Asp/Glu catalytic triad as a first test case, we demonstrate that the presence of a carboxylate oxygen atom in the vicinity of a His is associated with a shorter Ser-OH..N-His bond in the PDB30 subset. We prospectively mine catalytic triads in unannotated proteins, which suggests possible catalytic functions for them. As a second test case, we demonstrate that this database system can include ligand atoms, represented by Sybyl atom types, by evaluating the proportion of counter-ions for ligand carboxylate oxygens.
{"title":"Exploring cooperative molecular contacts using a PostgreSQL database system.","authors":"Mael A Briand, Loïc Dreano, Ashenafi Legehar, Evgeni Grazhdankin, Leo Ghemtio, Henri Xhaard","doi":"10.1002/minf.202200235","DOIUrl":"https://doi.org/10.1002/minf.202200235","journal":{"name":"Molecular Informatics"},"PeriodicalIF":3.6,"publicationDate":"2023-05-01","publicationTypes":"Journal Article"}
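For the catalytic-triad test case described in the abstract above, the kind of contact query involved can be illustrated outside the database. The following is a minimal Python sketch, not the authors' PostgreSQL schema or query: the `Atom` record, the toy coordinates, and the 3.5 Å cutoff are all assumptions made for illustration. It reports a Ser OG that lies within the cutoff of one His ring nitrogen while an Asp/Glu carboxylate oxygen lies within the cutoff of the other ring nitrogen.

```python
from math import dist
from typing import NamedTuple


class Atom(NamedTuple):
    res_name: str  # three-letter residue name, e.g. "SER"
    res_id: int    # residue number (chains are ignored in this sketch)
    name: str      # PDB atom name, e.g. "OG"
    xyz: tuple     # (x, y, z) coordinates in angstroms


CUTOFF = 3.5  # assumed contact cutoff in angstroms (hypothetical value)

CARBOXYLATE_O = {("ASP", "OD1"), ("ASP", "OD2"), ("GLU", "OE1"), ("GLU", "OE2")}


def find_triads(atoms, cutoff=CUTOFF):
    """Return sorted (ser_id, his_id, acid_id, ser_his_distance) tuples."""
    ser_og = [a for a in atoms if (a.res_name, a.name) == ("SER", "OG")]
    acid_o = [a for a in atoms if (a.res_name, a.name) in CARBOXYLATE_O]

    # Group the two imidazole nitrogens of each His by residue number.
    his_n = {}
    for a in atoms:
        if a.res_name == "HIS" and a.name in ("ND1", "NE2"):
            his_n.setdefault(a.res_id, {})[a.name] = a

    best = {}  # (ser_id, his_id, acid_id) -> shortest Ser-His distance
    for his_id, ring in his_n.items():
        if len(ring) < 2:
            continue  # skip incomplete His side chains
        # Either nitrogen may face the Ser; the other must face the acid.
        for to_ser, to_acid in (("NE2", "ND1"), ("ND1", "NE2")):
            for s in ser_og:
                d_ser = dist(s.xyz, ring[to_ser].xyz)
                if d_ser > cutoff:
                    continue
                for o in acid_o:
                    if dist(o.xyz, ring[to_acid].xyz) <= cutoff:
                        key = (s.res_id, his_id, o.res_id)
                        best[key] = min(d_ser, best.get(key, d_ser))
    return [(*key, round(d, 2)) for key, d in sorted(best.items())]


# Toy coordinates, invented for illustration only.
atoms = [
    Atom("SER", 195, "OG", (0.0, 0.0, 0.0)),
    Atom("HIS", 57, "NE2", (2.8, 0.0, 0.0)),
    Atom("HIS", 57, "ND1", (4.9, 0.5, 0.0)),
    Atom("ASP", 102, "OD1", (7.6, 0.5, 0.0)),
    Atom("ASP", 102, "OD2", (8.5, 2.0, 0.0)),
]
triads = find_triads(atoms)
print(triads)
```

Collecting the Ser-His distance for every hit, with and without the carboxylate condition, is what would allow the two distance distributions to be compared.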
Predicting drug-target interactions (DTIs) is crucial in drug discovery because it narrows the range of candidates to search, speeding up the drug screening process. Because in vitro and in vivo experiments are costly and time-consuming, there has been a surge in computational techniques, especially machine learning (ML) methods, for DTI prediction. This study presents a methodology that uses molecular structures to generate PubChem fingerprints for drugs and amino acid sequences to generate position-specific scoring matrices (PSSMs) for targets. The proposed work uses a novel technique, NearestCUS, to handle the class imbalance problem of the benchmark datasets. We use Isomap embedding to extract features from the PSSMs. Feature selection is performed using ANOVA. CatBoost is used, for the first time, to predict the interaction between drugs and targets. To quantify the efficacy of NearestCUS, we compared it with other sampling techniques. We found that the proposed methodology performed better than state-of-the-art approaches.
{"title":"A machine learning strategy with clustering under sampling of majority instances for predicting drug target interactions.","authors":"Tanya Liyaqat, Tanvir Ahmad","doi":"10.1002/minf.202200102","DOIUrl":"https://doi.org/10.1002/minf.202200102","journal":{"name":"Molecular Informatics"},"PeriodicalIF":3.6,"publicationDate":"2023-05-01","publicationTypes":"Journal Article"}
Graph generative models have recently emerged as an interesting approach to construct molecular structures atom-by-atom or fragment-by-fragment. In this study, we adopt the fragment-based strategy and decompose each input molecule into a set of small chemical fragments. In drug discovery, a few drug molecules are designed by replacing certain chemical substituents with their bioisosteres or alternative chemical moieties. This inspires us to group decomposed fragments into different fragment clusters according to their local structural environment around bond-breaking positions. In this way, an input structure can be transformed into an equivalent three-layer graph, in which individual atoms, decomposed fragments, or obtained fragment clusters act as graph nodes at each corresponding layer. We further implement a prototype model, named multi-resolution graph variational autoencoder (MRGVAE), to learn embeddings of the constituent nodes at each layer in a fine-to-coarse order. Our decoder adopts a similar hierarchical structure in the reverse order. It first predicts the next possible fragment cluster, then samples an exact fragment structure out of the determined fragment cluster, and sequentially attaches it to the preceding chemical moiety. Our proposed approach performs well on molecular evaluation metrics relative to several other graph-based molecular generative models. The introduction of the additional fragment cluster graph layer will hopefully increase the odds of assembling new chemical moieties absent in the original training set and enhance their structural diversity. We hope that our prototyping work will inspire more creative research to explore the possibility of incorporating different kinds of chemical domain knowledge into a similar multi-resolution neural network architecture.
{"title":"Fragment-based deep molecular generation using hierarchical chemical graph representation and multi-resolution graph variational autoencoder.","authors":"Zhenxiang Gao, Xinyu Wang, Blake Blumenfeld Gaines, Xuetao Shi, Jinbo Bi, Minghu Song","doi":"10.1002/minf.202200215","DOIUrl":"https://doi.org/10.1002/minf.202200215","journal":{"name":"Molecular Informatics"},"PeriodicalIF":3.6,"publicationDate":"2023-05-01","publicationTypes":"Journal Article"}
Stefan Kohlbacher, Gökhan Ibis, Christian Permann, Sharon Bryant, Thierry Langer, Thomas Seidel
Dissemination of novel research methods, especially in the form of chemoinformatics software, depends heavily on how easily they can be applied by non-expert users with little or no programming skill or computer science knowledge. Visual programming has become widely popular over the last few years, enabling researchers without in-depth programming skills to develop tailored data processing pipelines using elements from a repository of predefined standard procedures. In this work, we present the development of a set of nodes for the KNIME platform implementing the QPhAR algorithm. We show how the developed KNIME nodes can be included in a typical workflow for biological activity prediction. Furthermore, we present best-practice guidelines that should be followed to obtain high-quality QPhAR models. Finally, we show a typical workflow to train and optimise a QPhAR model in KNIME for a set of given input compounds, applying the discussed best practices.
{"title":"A new set of KNIME nodes implementing the QPhAR algorithm.","authors":"Stefan Kohlbacher, Gökhan Ibis, Christian Permann, Sharon Bryant, Thierry Langer, Thomas Seidel","doi":"10.1002/minf.202200245","DOIUrl":"https://doi.org/10.1002/minf.202200245","journal":{"name":"Molecular Informatics"},"PeriodicalIF":3.6,"publicationDate":"2023-05-01","publicationTypes":"Journal Article"}
The main protease (Mpro) is an essential enzyme for the life cycle of SARS-CoV-2 and a validated target for the treatment of COVID-19 infection. Structure-based pharmacophore modeling combined with QSAR calculations was employed to identify new chemical scaffolds of Mpro inhibitors from a natural products repository. Hundreds of pharmacophore models were manually built from their corresponding X-ray crystallographic structures. A pharmacophore model that was validated by receiver operating characteristic (ROC) curve analysis and selected using the statistically optimum QSAR equation was implemented as a 3D search tool to mine the AnalytiCon Discovery database of natural products. Captured hits that showed the highest predicted inhibitory activities were bioassayed. Three active Mpro inhibitors (pseurotin A, lactupicrin, and alpinetin) were successfully identified with IC50 values in the low micromolar range.
{"title":"Discovery of natural-derived Mpro inhibitors as therapeutic candidates for COVID-19: Structure-based pharmacophore screening combined with QSAR analysis.","authors":"Mohammad A Khanfar, Nada Salaas, Reem Abumostafa","doi":"10.1002/minf.202200198","DOIUrl":"https://doi.org/10.1002/minf.202200198","journal":{"name":"Molecular Informatics"},"PeriodicalIF":3.6,"publicationDate":"2023-04-01","publicationTypes":"Journal Article"}
Philippe Pinel, Gwenn Guichaoua, Matthieu Najm, Stéphanie Labouille, Nicolas Drizard, Yann Gaston-Mathé, Brice Hoffmann, Véronique Stoven
Identification of novel chemotypes with biological activity similar to a known active molecule is an important challenge in drug discovery called 'scaffold hopping'. Small-, medium-, and large-step scaffold hopping efforts may lead to increasing degrees of chemical structure novelty with respect to the parent compound. In the present paper, we focus on the problem of large-step scaffold hopping. We assembled a high-quality and well-characterized dataset of scaffold hopping examples comprising pairs of active molecules and including a variety of protein targets. This dataset was used to build a benchmark corresponding to the setting of real-life applications: one active molecule is known, and the second active is searched among a set of decoys chosen in a way to avoid statistical bias. This allowed us to evaluate the performance of computational methods for solving large-step scaffold hopping problems. We assessed how difficult these problems are, particularly for classical 2D and 3D ligand-based methods. We also showed that a machine-learning chemogenomic algorithm outperforms classical methods and we provided some useful hints for future improvements.
{"title":"Exploring isofunctional molecules: Design of a benchmark and evaluation of prediction performance.","authors":"Philippe Pinel, Gwenn Guichaoua, Matthieu Najm, Stéphanie Labouille, Nicolas Drizard, Yann Gaston-Mathé, Brice Hoffmann, Véronique Stoven","doi":"10.1002/minf.202200216","DOIUrl":"https://doi.org/10.1002/minf.202200216","journal":{"name":"Molecular Informatics"},"PeriodicalIF":3.6,"publicationDate":"2023-04-01","publicationTypes":"Journal Article"}
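The benchmark setting described in the abstract above (one active is known, and the second active must be found among decoys) can be sketched with a simple 2D ligand-based baseline. This is only an illustration under stated assumptions, not the authors' benchmark or their chemogenomic method: fingerprints are represented as plain sets of integer feature IDs, similarity is Tanimoto, ties are resolved in favour of the hidden active, and the toy data are invented.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_of_hidden_active(query, hidden_active, decoys):
    """Rank (1 = best) of the hidden active among the decoys.

    Candidates are ordered by similarity to the known active (the query).
    Only decoys scoring strictly higher push the hidden active down,
    so ties are counted optimistically.
    """
    target = tanimoto(query, hidden_active)
    return 1 + sum(tanimoto(query, d) > target for d in decoys)


# Toy fingerprints, invented for illustration only.
query = frozenset({1, 2, 3, 4})                 # known active
hidden_active = frozenset({3, 4, 5, 6, 7})      # different scaffold, few shared features
decoys = [
    frozenset({1, 2, 3, 9}),                    # close analogue of the query, but inactive
    frozenset({4, 10, 11}),
    frozenset({20, 21}),
    frozenset({1, 2, 8, 9, 10, 11}),
]

rank = rank_of_hidden_active(query, hidden_active, decoys)
print(rank)
```

In this toy case a decoy that resembles the query outranks the true second active, which is the failure mode that makes large-step scaffold hopping hard for similarity-based methods.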
Hanoch Senderowitz, Malkeet Singh Bahia, Omer Kaspi, Meir Touitou, Idan Binayev, Seema Dhail, Jacob Spiegel, Netaly Khazanov, Abraham Yosipof
QSAR models are widely and successfully used in many research areas. The success of such models highly depends on molecular descriptors typically classified as 1D, 2D, 3D, or 4D. While 3D information is likely important, e.g., for modeling ligand-protein binding, previous comparisons between the performances of 2D and 3D descriptors were inconclusive. Yet in such comparisons the modeled ligands were not necessarily represented by their bioactive conformations. With this in mind, we mined the PDB for sets of protein-ligand complexes sharing the same protein for which uniform activity data were reported. The results, totaling 461 structures spread across six series, were compiled into a carefully curated, first-of-its-kind dataset in which each ligand is represented by its bioactive conformation. Next, each set was characterized by 2D, 3D and 2D + 3D descriptors and modeled using three machine learning algorithms, namely, k-Nearest Neighbors, Random Forest and Lasso Regression. The models' performances were evaluated on external test sets derived from the parent datasets either randomly or in a rational manner. We found that many more significant models were obtained when combining 2D and 3D descriptors. We attribute these improvements to the ability of 2D and 3D descriptors to code for different, yet complementary molecular properties.
{"title":"A comparison between 2D and 3D descriptors in QSAR modeling based on bio-active conformations.","authors":"Hanoch Senderowitz, Malkeet Singh Bahia, Omer Kaspi, Meir Touitou, Idan Binayev, Seema Dhail, Jacob Spiegel, Netaly Khazanov, Abraham Yosipof","doi":"10.1002/minf.202200186","DOIUrl":"https://doi.org/10.1002/minf.202200186","journal":{"name":"Molecular Informatics"},"PeriodicalIF":3.6,"publicationDate":"2023-04-01","publicationTypes":"Journal Article"}
To analyze the Chimiothèque Nationale (CN), the French National Compound Library, in the context of screening and biologically relevant compounds, the library was compared with the ZINC in-stock collection and ChEMBL. This includes the study of chemical space coverage, physicochemical properties and Bemis-Murcko (BM) scaffold populations. More than 5,000 CN-unique scaffolds (relative to the ZINC and ChEMBL collections) were identified. Generative Topographic Maps (GTMs) accommodating those libraries were generated and used to compare the compound populations. Hierarchical GTM ("zooming") was applied to generate an ensemble of maps at various resolution levels, from global overview to precise mapping of individual structures. The respective maps were added to the ChemSpace Atlas website. The analysis of synthetic accessibility in the context of combinatorial chemistry showed that only 29.7% of CN compounds can be fully synthesized using commercially available building blocks.
{"title":"French dispatch: GTM-based analysis of the Chimiothèque Nationale Chemical Space.","authors":"Polina Oleneva, Yuliana Zabolotna, Dragos Horvath, Gilles Marcou, Fanny Bonachera, Alexandre Varnek","doi":"10.1002/minf.202200208","DOIUrl":"https://doi.org/10.1002/minf.202200208","journal":{"name":"Molecular Informatics"},"PeriodicalIF":3.6,"publicationDate":"2023-04-01","publicationTypes":"Journal Article"}
Arkaprava Banerjee, Agnieszka Gajewicz-Skretna, K Roy
In this study, the specific surface area of various perovskites was modeled using a novel quantitative read-across structure-property relationship (q-RASPR) approach, which combines Read-Across (RA) and quantitative structure-property relationship (QSPR) modeling. After optimization of the hyper-parameters, certain similarity-based error measures for each query compound were obtained. Combining some of these error-based measures with the previously selected features and the Read-Across prediction function, a number of machine learning models were developed using Partial Least Squares (PLS), Ridge Regression (RR), Linear Support Vector Regression (LSVR), Random Forest (RF) regression, Gradient Boosting (GBoost), Adaptive Boosting (AdaBoost), Multilayer Perceptron (MLP) regression and k-Nearest Neighbor (kNN) regression. Based on the repeated cross-validation as well as external prediction quality and interpretability, the PLS model (nTraining = 38, nTest = 12, $R_{Train}^{2} = 0.737$, $Q_{LOO}^{2} = 0.637$, $R_{Test}^{2} = 0.898$, $Q_{F1(Test)}^{2} = 0.901$) was selected as the best predictor which underscored the previously reported results. The finally selected model should efficiently predict specific surface areas of other perovskites for their use in photocatalysis. The new q-RASPR method also appears promising for the prediction of several other property endpoints of interest in materials science.
{"title":"A machine learning q-RASPR approach for efficient predictions of the specific surface area of perovskites.","authors":"Arkaprava Banerjee, Agnieszka Gajewicz-Skretna, K Roy","doi":"10.1002/minf.202200261","DOIUrl":"https://doi.org/10.1002/minf.202200261","journal":{"name":"Molecular Informatics"},"PeriodicalIF":3.6,"publicationDate":"2023-04-01","publicationTypes":"Journal Article"}
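The validation statistics quoted for the PLS model above can be made concrete with a short sketch. The functions below use the commonly cited definitions of the coefficient of determination and of the external metric Q²F1, in which the test-set residuals are compared against deviations from the training-set mean. Whether the paper computes its R² and Q² values in exactly this way is an assumption, and the numbers are toy values, not the perovskite data.

```python
from statistics import fmean


def r2(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = fmean(y_obs)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_obs, y_pred))
    ss_tot = sum((y - mean_obs) ** 2 for y in y_obs)
    return 1.0 - ss_res / ss_tot


def q2_f1(y_test, y_test_pred, y_train):
    """External Q2_F1: test residuals relative to deviations from the TRAINING mean."""
    mean_train = fmean(y_train)
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_test_pred))
    ss_ref = sum((y - mean_train) ** 2 for y in y_test)
    return 1.0 - press / ss_ref


# Toy values, invented for illustration only.
y_train = [1.0, 2.0, 3.0, 4.0]
y_test = [2.0, 5.0]
y_test_pred = [2.5, 4.5]

r2_test_value = r2(y_test, y_test_pred)
q2_f1_value = q2_f1(y_test, y_test_pred, y_train)
print(round(r2_test_value, 3), round(q2_f1_value, 3))
```

A leave-one-out Q² has the same form as `r2`, with each prediction produced by a model refitted without the corresponding training compound.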