Improving K Nearest Neighbor into String Vector Version for Text Categorization

2019 21st International Conference on Advanced Communication Technology (ICACT), Pub Date: 2019-02-01, DOI: 10.23919/ICACT.2019.8702043

T. Jo

Citations: 4

Abstract

This research concerns a string vector based version of KNN as an approach to text categorization. Traditionally, texts have been encoded into numerical vectors in order to use the traditional version of KNN, and this encoding leads to three main problems: huge dimensionality, sparse distribution, and poor transparency. To solve these problems, this research proposes that texts be encoded into string vectors, that a similarity measure between string vectors be defined, and that KNN be modified into a version that takes string vectors as its input. The proposed KNN version is validated empirically by comparing it with the traditional KNN version on three collections: NewsPage.com, Opiniopsis, and 20NewsGroups. The goal of this research is to improve text categorization performance by solving these problems.
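The pipeline outlined in the abstract (encode texts as string vectors, measure similarity between string vectors, classify with KNN) can be sketched roughly as follows. The abstract does not specify the encoding or the similarity measure, so this sketch makes its own assumptions: a string vector is taken to be the d most frequent words of a text in rank order, and word similarity is a plain exact-match placeholder. All function names here are hypothetical and are not taken from the paper.

```python
from collections import Counter

def encode(text, d=3):
    """Encode a text as a string vector: its d most frequent words, in rank order.
    Assumption: frequency ranking; the abstract does not state the encoding."""
    counts = Counter(text.lower().split())
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    # Pad short texts so every vector has exactly d elements.
    return tuple((ranked + [""] * d)[:d])

def word_sim(a, b):
    """Placeholder word similarity (assumption): 1.0 for identical words, else 0.0."""
    return 1.0 if a and a == b else 0.0

def vector_sim(u, v):
    """Average the word similarities over aligned positions of two string vectors."""
    return sum(word_sim(a, b) for a, b in zip(u, v)) / len(u)

def knn_classify(query, training, k=3, d=3):
    """Label a query text by majority vote among its k most similar training texts."""
    q = encode(query, d)
    ranked = sorted(training,
                    key=lambda item: vector_sim(q, encode(item[0], d)),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy usage with made-up training texts.
training = [
    ("stocks market stocks rally market stocks", "business"),
    ("market stocks shares stocks market stocks", "business"),
    ("team match team goal match team", "sports"),
    ("match team goal team match team", "sports"),
]
label = knn_classify("stocks stocks market stocks market fall", training)
```

Because each element of a string vector is a word rather than a number, the vector stays short and a reader can inspect it directly, which is the transparency argument the abstract makes. A graded word similarity (for example, one derived from corpus co-occurrence) could replace `word_sim` without changing the rest of the sketch.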


