A Large-Scale k-Nearest Neighbor Classification Algorithm Based on Neighbor Relationship Preservation

Yunsheng Song, Xiaohan Kong, Chao Zhang
DOI: 10.1155/2022/7409171
Journal: Wirel. Commun. Mob. Comput., pp. 7409171:1-7409171:11
Published: 2022-01-07 (Journal Article)
Citations: 6

Abstract

Owing to the absence of hypotheses about the underlying distribution of the data and its strong generalization ability, the k-nearest neighbor (kNN) classification algorithm is widely used in face recognition, text classification, sentiment analysis, and other fields. However, kNN must compute the similarity between the unlabeled instance and every training instance at prediction time, which makes it difficult to handle large-scale data. To overcome this difficulty, a growing number of acceleration algorithms based on data partition have been proposed; however, they lack a theoretical analysis of the effect of data partition on classification performance. This paper analyzes that effect theoretically using empirical risk minimization and proposes a large-scale k-nearest neighbor classification algorithm based on neighbor relationship preservation. The nearest-neighbor search is formulated as a constrained optimization problem, and an estimate is derived for the difference in the objective function value at the optimal solution with and without data partition. According to this estimate, minimizing the similarity between instances assigned to different subsets largely reduces the effect of data partition, so the minibatch k-means clustering algorithm is chosen to perform the partition for its effectiveness and efficiency. Finally, the nearest neighbors of the test instance are searched in a set built by successively merging candidate subsets, selected by the similarity between the test instance and the cluster centers, until the neighbors no longer change. Experimental results on public datasets show that the proposed algorithm largely preserves the same nearest neighbors as the original kNN algorithm with no significant difference in classification accuracy, and outperforms two state-of-the-art algorithms.
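The partition-then-search idea described in the abstract can be sketched roughly as follows: partition the training set with k-means, rank the clusters by the distance from the query to each centroid, and merge candidate clusters one at a time, re-running kNN on the merged set and stopping once the k nearest neighbors no longer change. This is a minimal illustrative sketch, not the authors' implementation: it uses a plain (not mini-batch) k-means for self-containment, and all function names and parameters are assumptions.

```python
# Illustrative sketch of the abstract's partition-based kNN search.
# NOT the paper's code: plain k-means stands in for mini-batch k-means,
# and the stopping rule is the "neighbors stopped changing" test.
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, n_clusters, iters=10, seed=0):
    """Simple Lloyd's k-means; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    for _ in range(iters):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            j = min(range(n_clusters), key=lambda c: dist(p, centers[c]))
            clusters[j].append(p)
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers, clusters


def knn_in(candidates, query, k):
    """Exact k nearest neighbors within a candidate set."""
    return sorted(candidates, key=lambda p: dist(p, query))[:k]


def partitioned_knn(points, query, k, n_clusters=4):
    centers, clusters = kmeans(points, n_clusters)
    # Visit clusters in order of centroid similarity to the query.
    order = sorted(range(n_clusters), key=lambda j: dist(query, centers[j]))
    merged, best = [], None
    for j in order:
        merged.extend(clusters[j])
        if len(merged) < k:
            continue
        current = knn_in(merged, query, k)
        if current == best:  # neighbors stabilized: stop merging early
            break
        best = current
    return best
```

With a single cluster the search degenerates to exact brute-force kNN; with several clusters it inspects only the clusters nearest the query, which is the source of the speedup (and of the approximation the paper's analysis bounds).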