Comparative analysis on features extraction strategies for GPCR classification
Safia Bekhouche, Yamina Mohamed Ben Ali
2018 4th International Conference on Computer and Technology Applications (ICCTA), 2018-05-01
DOI: 10.1109/CATA.2018.8398676 (https://doi.org/10.1109/CATA.2018.8398676)
A protein is an alphabetical sequence of amino acids. In this form it cannot be processed directly by data mining and machine learning algorithms, which require numerical data. Feature extraction strategies transform the alphabetical sequence into a feature vector representing the properties of the sequence, but each method produces an attribute vector whose size and properties differ from those of the others. Our work compares three of the most widely used feature extraction strategies, amino acid composition (AAC), pseudo amino acid composition (PseAAC) and dipeptide composition (DC), using five selected machine learning algorithms deployed on the Weka platform. The results are evaluated by accuracy, F-measure, Matthews correlation coefficient (MCC) and error rate. This comparison helps decide which feature extraction strategy is best suited when applying computationally expensive machine learning algorithms to protein sequence data. Experiments suggest that the AAC, PseAAC and DC methods all perform best for GPCR classification at the sub-sub-family level when combined with the multilayer perceptron (MLP) algorithm. The other classifiers perform well only when the data subset is not too large, that is, when the number of classes stays small. The study therefore concludes that better performance is reached when a good classifier is established.
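To make the idea of turning a sequence into a feature vector concrete, the following is a minimal Python sketch of two of the three strategies, AAC (20 values) and DC (400 values). It is an illustration under stated assumptions, not the authors' implementation: the function names `aac` and `dc`, the alphabet ordering, and the choice to skip non-standard residues are all assumptions made here. PseAAC is left out because it depends on physicochemical property tables and parameters that the abstract does not specify.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids; this ordering is an assumption of the sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aac(sequence):
    """Amino acid composition: relative frequency of each of the 20 residues."""
    # Assumption: residues outside the standard alphabet (e.g. 'X') are skipped.
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def dc(sequence):
    """Dipeptide composition: relative frequency of each of the 400 adjacent pairs."""
    seq = sequence.upper()
    # Assumption: a pair is counted only if both residues are standard.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        return [0.0] * len(DIPEPTIDES)
    counts = Counter(valid)
    return [counts[d] / len(valid) for d in DIPEPTIDES]


if __name__ == "__main__":
    example = "MNGTEGPNFYVPFSNKTGVV"  # hypothetical fragment, for illustration only
    print(len(aac(example)), len(dc(example)))
```

The sketch shows why the vectors differ in size: AAC always yields 20 attributes and DC always yields 400, regardless of sequence length, which is what allows sequences of different lengths to be fed to the same classifier.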