{"title":"改进K近邻为字符串向量版本的文本分类","authors":"T. Jo","doi":"10.23919/ICACT.2019.8702043","DOIUrl":null,"url":null,"abstract":"This research is concerned with the string vector based version of the KNN which is the approach to the text categorization. Traditionally, texts have been encoded into numerical vectors for using the traditional version of KNN, and encoding so leads to the three main problems: huge dimensionality, sparse distribution, and poor transparency. In order to solve the problems, this research propose that texts should be encoded into string vectors the similarity measure between string vectors is defined, and the KNN is modified into the version where string vector is given its input. The proposed KNN version is validated empirically by comparing it with the traditional KNN version on the three collections: NewsPage.com, Opiniopsis, and 20NewsGroups. The goal of this research is to improve the text categorization performance by solving them.","PeriodicalId":226261,"journal":{"name":"2019 21st International Conference on Advanced Communication Technology (ICACT)","volume":"107 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Improving K Nearest Neighbor into String Vector Version for Text Categorization\",\"authors\":\"T. Jo\",\"doi\":\"10.23919/ICACT.2019.8702043\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This research is concerned with the string vector based version of the KNN which is the approach to the text categorization. Traditionally, texts have been encoded into numerical vectors for using the traditional version of KNN, and encoding so leads to the three main problems: huge dimensionality, sparse distribution, and poor transparency. In order to solve the problems, this research propose that texts should be encoded into string vectors the similarity measure between string vectors is defined, and the KNN is modified into the version where string vector is given its input. The proposed KNN version is validated empirically by comparing it with the traditional KNN version on the three collections: NewsPage.com, Opiniopsis, and 20NewsGroups. The goal of this research is to improve the text categorization performance by solving them.\",\"PeriodicalId\":226261,\"journal\":{\"name\":\"2019 21st International Conference on Advanced Communication Technology (ICACT)\",\"volume\":\"107 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 21st International Conference on Advanced Communication Technology (ICACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/ICACT.2019.8702043\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 21st International Conference on Advanced Communication Technology (ICACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/ICACT.2019.8702043","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Improving K Nearest Neighbor into String Vector Version for Text Categorization
This research is concerned with the string vector based version of the KNN which is the approach to the text categorization. Traditionally, texts have been encoded into numerical vectors for using the traditional version of KNN, and encoding so leads to the three main problems: huge dimensionality, sparse distribution, and poor transparency. In order to solve the problems, this research propose that texts should be encoded into string vectors the similarity measure between string vectors is defined, and the KNN is modified into the version where string vector is given its input. The proposed KNN version is validated empirically by comparing it with the traditional KNN version on the three collections: NewsPage.com, Opiniopsis, and 20NewsGroups. The goal of this research is to improve the text categorization performance by solving them.