Improved Data Streams Classification with Fast Unsupervised Feature Selection

2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT). Pub Date: 2016-12-01. DOI: 10.1109/PDCAT.2016.056

Lulu Wang, Hong Shen

Citations: 6

Abstract

Data stream classification poses three major challenges: infinite length, concept drift, and feature evolution. The first two have been widely studied, but most existing data stream classification techniques ignore the third. DXMiner [17] is the first model to address feature evolution; it uses past labeled instances to select the top-ranked features based on a score computed by a formula. This semi-supervised feature selection method depends on the quality of past classification and neglects possible correlations among features, so it cannot produce an optimal feature subset, which degrades classification accuracy. Multi-Cluster Feature Selection (MCFS) [5], proposed for static data classification and clustering, applies unsupervised feature selection to address the feature-evolution problem, but suffers from high computational cost in feature selection. In this paper, we apply MCFS within the DXMiner framework to handle each window of data in a data stream for dynamic data stream classification. With unsupervised feature selection, our method produces the optimal feature subset and hence improves on DXMiner's classification accuracy. We further reduce the time complexity of the feature selection process in MCFS by using the locality-sensitive hashing forest (LSH Forest) [4]. The empirical results indicate that our approach outperforms state-of-the-art stream classification techniques in classifying real-life data streams.
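The abstract does not spell out how hashing speeds up feature selection. One plausible reading is that MCFS needs a nearest-neighbor graph over each window, and hashing avoids comparing every pair of instances. The following is a minimal sketch of that idea only, not the authors' implementation. It uses plain random-hyperplane LSH with fixed-length signatures as a simplified stand-in for LSH Forest (which uses variable-length prefixes stored in trees). All names and parameters here (`build_tables`, `approx_knn`, `n_tables`, `n_bits`, the toy `window`) are hypothetical and chosen for illustration.

```python
import math
import random
from collections import defaultdict


def signature(point, planes):
    """One sign bit per random hyperplane (LSH for cosine similarity)."""
    return tuple(
        1 if sum(x * w for x, w in zip(point, plane)) >= 0 else 0
        for plane in planes
    )


def build_tables(points, n_tables=8, n_bits=4, seed=0):
    """Index every point of one window in n_tables independent hash tables."""
    rng = random.Random(seed)
    dim = len(points[0])
    tables = []
    for _ in range(n_tables):
        # Each table gets its own set of Gaussian hyperplanes through the origin.
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
        buckets = defaultdict(list)
        for idx, point in enumerate(points):
            buckets[signature(point, planes)].append(idx)
        tables.append((planes, buckets))
    return tables


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def approx_knn(query_idx, points, tables, k):
    """Return up to k approximate neighbors of points[query_idx].

    Only points sharing a bucket with the query in at least one table
    are compared, so true neighbors can occasionally be missed.
    """
    query = points[query_idx]
    candidates = set()
    for planes, buckets in tables:
        candidates.update(buckets.get(signature(query, planes), []))
    candidates.discard(query_idx)
    ranked = sorted(candidates, key=lambda j: cosine(query, points[j]), reverse=True)
    return ranked[:k]


# Toy window (assumed data): two pairs of vectors pointing in opposite directions.
window = [[1.0, 2.0], [2.0, 4.0], [-1.0, -2.0], [-2.0, -4.0]]
tables = build_tables(window)
neighbors = {i: approx_knn(i, window, tables, k=2) for i in range(len(window))}
```

In a windowed setting, the tables would be rebuilt (or updated) for each new window, and the resulting neighbor lists would feed the graph construction step of the feature selector. Raising `n_bits` makes buckets smaller and queries faster, while raising `n_tables` recovers neighbors that a single table would miss.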
