A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification

2015 IEEE Trustcom/BigDataSE/ISPA | Pub Date: 2015-08-20 | DOI: 10.1109/Trustcom.2015.577

Jesús Maillo, I. Triguero, F. Herrera


Citations: 73

Abstract



A MapReduce-Based k-Nearest Neighbor Approach for Big Data Classification

The k-Nearest Neighbor (k-NN) classifier is one of the best-known methods in data mining because of its effectiveness and simplicity. Because it compares every new example against all stored training examples, its application may be restricted to problems with a limited number of examples, especially when runtime matters. However, classifying large amounts of data is becoming a necessary task in many real-world applications. This problem is known as big data classification, and standard data mining techniques normally fail to handle such volumes of data. In this contribution we propose a MapReduce-based approach for k-nearest neighbor classification. The model simultaneously classifies large numbers of unseen cases (test examples) against a big training dataset. To do so, the map phase determines the k nearest neighbors within different splits of the data. The reduce stage then computes the definitive neighbors from the lists obtained in the map phase. The design allows the k-NN classifier to scale to datasets of arbitrary size simply by adding more computing nodes if necessary. Moreover, this parallel implementation yields exactly the same classification rate as the original k-NN model. Experiments on a dataset with up to 1 million instances show the promising scalability of the proposed approach.
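The map and reduce phases described in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation: it simulates the splits in plain Python rather than running on a MapReduce framework, and the function names (`knn_map`, `knn_reduce`, `majority_vote`, `mapreduce_knn`), the Euclidean distance, and the round-robin splitting are all assumptions made for illustration. It also assumes class labels are mutually comparable (for example, all strings) so that distance ties are broken deterministically.

```python
import heapq
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_map(train_split, test_points, k):
    """Map phase (sketch): for each test point, keep the k nearest
    (distance, label) pairs found inside ONE split of the training data."""
    partial = {}
    for idx, x in enumerate(test_points):
        candidates = ((dist(x, features), label) for features, label in train_split)
        partial[idx] = heapq.nsmallest(k, candidates)
    return partial


def knn_reduce(partial_results, k):
    """Reduce phase (sketch): merge the per-split candidate lists and keep
    the k globally nearest neighbors for each test point."""
    merged = {}
    for partial in partial_results:
        for idx, neighbors in partial.items():
            merged.setdefault(idx, []).extend(neighbors)
    return {idx: heapq.nsmallest(k, cands) for idx, cands in merged.items()}


def majority_vote(neighbors):
    """Predict the most frequent label among the selected neighbors."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def mapreduce_knn(train, test_points, k, num_splits):
    """Simulate the whole pipeline in one process (hypothetical driver)."""
    # Round-robin split of the training set; a real job would use file blocks.
    splits = [train[i::num_splits] for i in range(num_splits)]
    partials = [knn_map(split, test_points, k) for split in splits]
    final = knn_reduce(partials, k)
    return [majority_vote(final[i]) for i in range(len(test_points))]


if __name__ == "__main__":
    # Toy data: two well-separated classes in 2-D.
    train = [
        ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"),
        ((0.9, 1.1), "b"), ((0.2, 0.1), "a"), ((1.2, 0.9), "b"),
    ]
    test = [(0.05, 0.05), (1.0, 0.95)]
    print(mapreduce_knn(train, test, k=3, num_splits=3))
```

Each split returns its k best candidates, so any training example among the k globally nearest is necessarily among the k nearest within its own split. Merging the partial lists therefore recovers the same neighbors as a sequential search, which is why the abstract can claim the same classification rate as the original k-NN. Running the sketch with `num_splits=1` gives the sequential baseline for comparison.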

