Feature Selection on Imbalanced Data and Its Application on Toxicity Prediction
Jincheng Li
2020 International Conference on Machine Learning and Cybernetics (ICMLC), published 2020-12-02
DOI: https://doi.org/10.1109/ICMLC51923.2020.9469564
The principle of computational toxicity prediction is that chemicals with similar molecular structures may share similar toxicological pathways and effects. Many methods represent each chemical by a set of descriptors, which experts have identified as promising properties for predicting biological activity or toxicity. These chemical descriptors play a critical role in computational methods, because descriptors that correlate with the task help achieve high prediction performance. However, few works compare the effectiveness of chemical descriptors or evaluate their performance in toxicity prediction. In this paper, we propose a novel ensemble feature selection method based on random under-sampling to analyze the effectiveness of the chemical descriptors adopted in toxicity prediction applications. The proposed method is efficient and can relieve the imbalanced data problem in toxicity datasets. Experimental results on the Tox21 toxicity prediction dataset show that the "molecular property", "connectivity", and "topological" descriptors are the three most important for toxicity prediction tasks among the 12 popular descriptors adopted in toxicity prediction applications. The results of this study can serve as a guide for proposing new descriptors for chemical toxicity prediction.
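The abstract names the ingredients of the method (random under-sampling, an ensemble of feature selectors) but not its details, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes binary labels with both classes present, a hypothetical base scorer (the standardized difference of class means), and aggregation by averaging scores over rounds. The function names `undersample`, `score_features`, and `ensemble_feature_ranking` are illustrative.

```python
import random
from statistics import mean, pstdev


def undersample(X, y, rng):
    """Keep every minority-class row plus an equal-size random draw of majority rows."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    idx = minority + rng.sample(majority, len(minority))
    return [X[i] for i in idx], [y[i] for i in idx]


def score_features(X, y):
    """Assumed base scorer: |mean(class 1) - mean(class 0)| / std, per feature."""
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        ones = [v for v, label in zip(col, y) if label == 1]
        zeros = [v for v, label in zip(col, y) if label == 0]
        spread = pstdev(col) or 1.0  # guard against constant features
        scores.append(abs(mean(ones) - mean(zeros)) / spread)
    return scores


def ensemble_feature_ranking(X, y, n_rounds=25, seed=0):
    """Average the base scores over several balanced subsets and rank features."""
    rng = random.Random(seed)
    n_features = len(X[0])
    totals = [0.0] * n_features
    for _ in range(n_rounds):
        X_bal, y_bal = undersample(X, y, rng)
        for j, s in enumerate(score_features(X_bal, y_bal)):
            totals[j] += s
    avg = [t / n_rounds for t in totals]
    ranking = sorted(range(n_features), key=lambda j: avg[j], reverse=True)
    return ranking, avg
```

Each round sees a balanced subset, so the majority class cannot dominate the score, while repeating the draw reuses majority rows that a single under-sampled subset would discard. To compare descriptor groups rather than single features, one could sum or average `avg` over the columns belonging to each group; that grouping step is likewise an assumption here.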