Authors: Sasiporn Tongman, N. Wattanakitrungroj
Venue: 2018 5th International Conference on Advanced Informatics: Concept Theory and Applications (ICAICTA)
Publication date: 2018-08-01
DOI: https://doi.org/10.1109/ICAICTA.2018.8541274
Classifying Positive or Negative Text Using Features Based on Opinion Words and Term Frequency - Inverse Document Frequency
Content on websites and social networks is generated rapidly. Opinions and reviews can be analyzed and classified into two classes, positive or negative, by machine learning methods. The main issue, however, is how to represent each text as a suitable set of variables, a p-feature vector, so that an effective classifier can be obtained from a supervised learning approach with suitable parameter settings. In this study, a two-feature vector representing the positive and negative moods of each text was prepared using lists of positive and negative words, and then combined with term frequency-inverse document frequency (TF-IDF) features. kNN and SVM classifiers were built on this feature set and on baseline feature sets for comparison, then used to predict each test vector and measure effectiveness. Text reviews from Yelp, Amazon, and IMDB were evaluated with 10-fold cross-validation while varying parameters and reducing the feature set using PCA. The best accuracy across the three datasets, approximately 0.81 to 0.87, was obtained by SVM classifiers, in each case with a reduced feature set much smaller than the original.
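The feature construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the word lists, the tokenizer, the choice of raw counts for the two opinion features, and the particular TF-IDF variant (tf = count / document length, idf = log(N / df)) are all assumptions made here for illustration, and the three sample reviews are invented.

```python
import math
from collections import Counter

# Hypothetical, tiny opinion lexicons; the word lists used in the paper are not given here.
POSITIVE_WORDS = {"good", "great", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible"}


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (assumed scheme)."""
    tokens = (tok.strip(".,!?;:\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def opinion_features(tokens):
    """Two-feature vector: counts of positive and negative lexicon words (assumed as counts)."""
    pos = sum(1 for tok in tokens if tok in POSITIVE_WORDS)
    neg = sum(1 for tok in tokens if tok in NEGATIVE_WORDS)
    return [float(pos), float(neg)]


def tfidf_matrix(token_lists):
    """Plain TF-IDF (assumed variant): tf = count / doc length, idf = log(N / df)."""
    n_docs = len(token_lists)
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        length = max(len(tokens), 1)  # guard against empty documents
        rows.append([
            (counts[tok] / length) * math.log(n_docs / doc_freq[tok])
            for tok in vocab
        ])
    return vocab, rows


def build_features(texts):
    """Concatenate the two opinion features with the TF-IDF features of each text."""
    token_lists = [tokenize(text) for text in texts]
    vocab, tfidf_rows = tfidf_matrix(token_lists)
    features = [
        opinion_features(tokens) + row
        for tokens, row in zip(token_lists, tfidf_rows)
    ]
    return vocab, features


# Invented sample reviews, for illustration only.
reviews = [
    "Great food and excellent service!",
    "Terrible plot, bad acting.",
    "Good phone, poor battery.",
]
vocab, X = build_features(reviews)
```

Each row of `X` holds the two opinion-word features followed by one TF-IDF value per vocabulary term. Rows of this shape could then be passed to a dimensionality reduction step and a classifier, for example scikit-learn's `PCA` followed by `SVC` or `KNeighborsClassifier` under `cross_val_score`, to approximate the evaluation the abstract describes.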