SAR based on self consistent classifier
L A Stolbov, D A Filimonov, V V Poroikov
SAR and QSAR in Environmental Research, published 2022-10-01 (Journal Article)
DOI: 10.1080/1062936X.2022.2139751 (https://doi.org/10.1080/1062936X.2022.2139751)
The accuracy and performance of (Q)SAR models depend significantly on the data used for training. Datasets prepared from publicly available databases contain structures belonging to different chemical classes and have a highly imbalanced ratio of actives to inactives. Currently, hundreds of structural descriptors are used in (Q)SAR studies. This abundance of descriptors raises the problem of the stability of the constructed (Q)SAR models. The methods frequently used to select a small fraction of the 'best' descriptors usually lack sufficient mathematical justification. We propose a new approach to a self-consistent classifier for SAR analysis to overcome these problems. Logistic (SCLC) and extreme (SCEC) extensions of self-consistent regression (SCR) were implemented to enhance the classification capabilities of SCR. The approach was applied to develop classification models for inhibitory activity endpoints in HIV-1-related data and for toxicity endpoints, with subsequent fivefold cross-validation to estimate the models' performance. Comparison of the proposed SCLC and SCEC models with those developed using the original SCR and a support vector machine demonstrated comparable accuracy. Advantages in feature selection using our approach provide more generalizable (Q)SAR models. In particular, the crucial factors responsible for the observed value are determined unambiguously.
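The fivefold cross-validation on imbalanced actives/inactives data mentioned in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' SCLC/SCEC implementation. The stratified splitting, the balanced-accuracy metric, the one-descriptor threshold classifier `fit_threshold`, and the toy data are all assumptions introduced here for illustration; the abstract does not state which splitting scheme or metric was used.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with a similar class ratio in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        # Deal the indices of each class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (assumed metric, robust to imbalance)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return (sens + spec) / 2

def fit_threshold(X_train, y_train):
    """Hypothetical stand-in classifier: cut halfway between the class means
    of the first descriptor. Replace with any real model."""
    act = [x[0] for x, t in zip(X_train, y_train) if t == 1]
    ina = [x[0] for x, t in zip(X_train, y_train) if t == 0]
    cut = (sum(act) / len(act) + sum(ina) / len(ina)) / 2
    return lambda x: 1 if x[0] > cut else 0

def cross_validate(fit, X, y, k=5, seed=0):
    """Average the balanced accuracy over k stratified folds."""
    scores = []
    for train, test in stratified_kfold(y, k, seed):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        y_pred = [predict(X[i]) for i in test]
        scores.append(balanced_accuracy([y[i] for i in test], y_pred))
    return sum(scores) / len(scores)

# Toy, fabricated-for-illustration data: 10 actives and 90 inactives,
# each described by a single descriptor value.
X = [[1.0 + 0.01 * i] for i in range(10)] + [[0.005 * i] for i in range(90)]
y = [1] * 10 + [0] * 90

mean_score = cross_validate(fit_threshold, X, y, k=5)
print(f"mean balanced accuracy over 5 folds: {mean_score:.3f}")
```

Stratifying the folds keeps the small active class represented in every test split. Without it, a 1:9 dataset can produce folds with almost no actives, which makes per-fold estimates unstable.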
About the journal:
SAR and QSAR in Environmental Research is an international journal welcoming papers on the fundamental and practical aspects of the structure-activity and structure-property relationships in the fields of environmental science, agrochemistry, toxicology, pharmacology and applied chemistry. A unique aspect of the journal is the focus on emerging techniques for the building of SAR and QSAR models in these widely varying fields. The scope of the journal includes, but is not limited to, the topics of topological and physicochemical descriptors, mathematical, statistical and graphical methods for data analysis, computer methods and programs, original applications and comparative studies. In addition to primary scientific papers, the journal contains reviews of books and software and news of conferences. Special issues on topics of current and widespread interest to the SAR and QSAR community will be published from time to time.