Improved Data Streams Classification with Fast Unsupervised Feature Selection

Lulu Wang, Hong Shen
{"title":"Improved Data Streams Classification with Fast Unsupervised Feature Selection","authors":"Lulu Wang, Hong Shen","doi":"10.1109/PDCAT.2016.056","DOIUrl":null,"url":null,"abstract":"Data streams classification poses three major challenges, namely, infinite length, concept-drift, and featureevolution. The first two issues have been widely studied. However, most existing data stream classification techniques ignore the last one. DXMiner [17], the first model which addresses featureevolution by using the past labeled instances to select the top ranked features based on a scores computed by a formula. This semi-supervised feature selection method depends on the quality of the past classification and neglects the possible correlation among different features, thus unable to produce an optimal feature subset which deteriorates the accuracy of classification. Multi-Cluster Feature Selection (MCFS) [5] proposed for static data classification and clustering applies unsupervised feature selection to address the feature-evolution problem, but suffers from the high computational cost in feature selection. In this paper, we apply MCFS in the DXMiner framework to handle each window of data in a data stream for dynamic data stream-classification. With unsupervised feature selection, our method produces the optimal feature subset and hence improves DXMiner on the classification accuracy. We further improve the time complexity of the feature selection process in MCFS by using the locality sensitive hashing forest (LSH Forest) [4]. The empirical results indicate that our approach outperforms stateof-the-art streams classification techniques in classifying real-life data streams.","PeriodicalId":203925,"journal":{"name":"2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDCAT.2016.056","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Data streams classification poses three major challenges, namely, infinite length, concept-drift, and featureevolution. The first two issues have been widely studied. However, most existing data stream classification techniques ignore the last one. DXMiner [17], the first model which addresses featureevolution by using the past labeled instances to select the top ranked features based on a scores computed by a formula. This semi-supervised feature selection method depends on the quality of the past classification and neglects the possible correlation among different features, thus unable to produce an optimal feature subset which deteriorates the accuracy of classification. Multi-Cluster Feature Selection (MCFS) [5] proposed for static data classification and clustering applies unsupervised feature selection to address the feature-evolution problem, but suffers from the high computational cost in feature selection. In this paper, we apply MCFS in the DXMiner framework to handle each window of data in a data stream for dynamic data stream-classification. With unsupervised feature selection, our method produces the optimal feature subset and hence improves DXMiner on the classification accuracy. We further improve the time complexity of the feature selection process in MCFS by using the locality sensitive hashing forest (LSH Forest) [4]. The empirical results indicate that our approach outperforms stateof-the-art streams classification techniques in classifying real-life data streams.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于快速无监督特征选择的改进数据流分类
数据流分类面临着无限长、概念漂移和特征演化三大挑战。前两个问题已被广泛研究。然而,大多数现有的数据流分类技术都忽略了最后一项。DXMiner[17]是第一个通过使用过去标记的实例根据公式计算的分数选择排名最高的特征来解决特征进化的模型。这种半监督特征选择方法依赖于过去分类的质量,忽略了不同特征之间可能存在的相关性,无法产生最优的特征子集,从而降低了分类的准确性。针对静态数据分类和聚类提出的多聚类特征选择(Multi-Cluster Feature Selection, MCFS)[5]采用无监督特征选择来解决特征演化问题,但特征选择的计算成本较高。在本文中,我们在DXMiner框架中应用MCFS来处理数据流中的每个数据窗口,以实现动态数据流分类。通过无监督特征选择,我们的方法产生了最优的特征子集,从而提高了DXMiner的分类精度。我们通过使用局部敏感哈希森林(locality sensitive hash forest, LSH forest)进一步提高了MCFS中特征选择过程的时间复杂度[4]。实证结果表明,我们的方法在分类现实数据流方面优于最先进的流分类技术。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
A Learning-Based System for Monitoring Electrical Load in Smart Grid A Domain-Independent Hybrid Approach for Automatic Taxonomy Induction CUDA-Based Parallel Implementation of IBM Word Alignment Algorithm for Statistical Machine Translation Optimal Scheduling Algorithm of MapReduce Tasks Based on QoS in the Hybrid Cloud Pre-Impact Fall Detection Based on Wearable Device Using Dynamic Threshold Model
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1