Natural Language Classification Algorithm of Comments Based on Bayesian Chasing-Clustering Model
Yifeng Wang, Yang Wang, Weisheng Chen, Yingzhen Lin
2020 19th International Symposium on Distributed Computing and Applications for Business Engineering and Science (DCABES), published 2020-10-01
DOI: https://doi.org/10.1109/DCABES50732.2020.00012
Citations: 1
Abstract
Clustering algorithms group data objects according to their similarity. As a form of unsupervised learning, clustering is convenient to use because the data need not be labeled before training. At the same time, however, the clustering results carry no grouping information, that is, no indication of what each cluster of data means. Moreover, in modern Internet applications the information space is very high-dimensional, especially in user comment sections involving natural language. If the clustering direction is not guided and constrained, the results can easily end up far from the goal of the algorithm. We therefore propose a Bayesian chasing-clustering model, an improvement on the k-means clustering algorithm. During training, it tends toward a clustering scheme set in advance and ultimately achieves a clustering that better meets our goals. In addition, we propose the S-T method for word embedding representation, which is well suited to natural language processing problems involving comments. We applied the approach to film review websites and e-commerce platforms and clustered the website content based on users' comments. Compared with traditional clustering methods, the accuracy rate improved by 9%-57%.
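For orientation, the k-means baseline that the abstract names as its starting point can be sketched as follows. This is a minimal illustration of standard k-means (Lloyd's algorithm) only; it is not the Bayesian chasing-clustering model or the S-T embedding, whose details the abstract does not give. The function name `kmeans`, the `init_centroids` parameter, and the toy 2-D points standing in for comment embeddings are all assumptions made for this example.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def kmeans(points, init_centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) on tuples of floats.

    `init_centroids` is supplied by the caller. This is the baseline only,
    not the paper's Bayesian chasing-clustering model.
    """
    centroids = [tuple(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return labels, centroids


# Toy 2-D vectors standing in for comment embeddings (made-up data).
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, init_centroids=[(0.0, 0.0), (5.0, 5.0)])
print(labels)
```

Note that the returned labels are bare cluster indices with no attached meaning, which is the limitation of unguided clustering that the abstract describes.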