A Large-Scale k-Nearest Neighbor Classification Algorithm Based on Neighbor Relationship Preservation

Yunsheng Song, Xiaohan Kong, Chao Zhang
DOI: 10.1155/2022/7409171
Journal: Wirel. Commun. Mob. Comput., pp. 7409171:1-7409171:11
Published: 2022-01-07 (Journal Article)
Citations: 6

Abstract

Owing to the absence of hypotheses about the underlying distribution of the data and its strong generalization ability, the k-nearest neighbor (kNN) classification algorithm is widely used in face recognition, text classification, sentiment analysis, and other fields. However, kNN must compute the similarity between the unlabeled instance and every training instance at prediction time, which makes it difficult to handle large-scale data. To overcome this difficulty, a growing number of acceleration algorithms based on data partition have been proposed; however, they lack a theoretical analysis of the effect of data partition on classification performance. This paper analyzes that effect theoretically using empirical risk minimization and proposes a large-scale k-nearest neighbor classification algorithm based on neighbor relationship preservation. The nearest-neighbor search is formulated as a constrained optimization problem, and an estimate is derived for the difference in the objective function value at the optimal solution with and without data partition. According to this estimate, minimizing the similarity between instances assigned to different subsets largely reduces the effect of data partition, so the minibatch k-means clustering algorithm is chosen to perform the partition for its effectiveness and efficiency. Finally, the nearest neighbors of the test instance are searched in a set built by successively merging candidate subsets, selected by the similarity between the test instance and the cluster centers, until the neighbors no longer change. Experimental results on public datasets show that the proposed algorithm largely preserves the same nearest neighbors as the original kNN algorithm with no significant difference in classification accuracy, and outperforms two state-of-the-art algorithms.
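The partition-then-search idea described in the abstract can be sketched roughly as follows: partition the training set with k-means, rank the clusters by the distance from the query to each centroid, and merge candidate clusters one at a time, re-running kNN on the merged set and stopping once the k nearest neighbors no longer change. This is a minimal illustrative sketch, not the authors' implementation: it uses a plain (not mini-batch) k-means for self-containment, and all function names and parameters are assumptions.

```python
# Illustrative sketch of the abstract's partition-based kNN search.
# NOT the paper's code: plain k-means stands in for mini-batch k-means,
# and the stopping rule is the "neighbors stopped changing" test.
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, n_clusters, iters=10, seed=0):
    """Simple Lloyd's k-means; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    for _ in range(iters):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            j = min(range(n_clusters), key=lambda c: dist(p, centers[c]))
            clusters[j].append(p)
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers, clusters


def knn_in(candidates, query, k):
    """Exact k nearest neighbors within a candidate set."""
    return sorted(candidates, key=lambda p: dist(p, query))[:k]


def partitioned_knn(points, query, k, n_clusters=4):
    centers, clusters = kmeans(points, n_clusters)
    # Visit clusters in order of centroid similarity to the query.
    order = sorted(range(n_clusters), key=lambda j: dist(query, centers[j]))
    merged, best = [], None
    for j in order:
        merged.extend(clusters[j])
        if len(merged) < k:
            continue
        current = knn_in(merged, query, k)
        if current == best:  # neighbors stabilized: stop merging early
            break
        best = current
    return best
```

With a single cluster the search degenerates to exact brute-force kNN; with several clusters it inspects only the clusters nearest the query, which is the source of the speedup (and of the approximation the paper's analysis bounds).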