Selecting samples for labeling in unbalanced streaming data environments
Hanqing Hu, M. Kantardzic, Tegjyot Singh Sethi
2013 XXIV International Conference on Information, Communication and Automation Technologies (ICAT), 2013-10-01
DOI: 10.1109/ICAT.2013.6684046 (https://doi.org/10.1109/ICAT.2013.6684046)
This paper proposes an alternative to random selection for labeling extremely unbalanced stream data sets, in which one class makes up only 1-10% of the data. Labeling, especially when it requires human effort, is often time-consuming and expensive. In an extremely unbalanced data set, many data points usually have to be labeled to obtain enough minority class samples. The goal of this research was to reduce the total number of samples that must be labeled when training new classification models to update a streaming-data ensemble classifier. The proposed approach finds minority class clusters using the grid density algorithm and samples minority class instances inside those regions. Results on a synthetic data set showed that the efficiency of the proposed approach varies with grid size. Results on real-world data sets confirmed that observation and showed that, when the data set has high dimensionality, dimensionality reduction was useful for reducing the number of grids in the data space, which increased sampling efficiency. The best results showed a 19.4% improvement for an eight-dimensional data set without dimensionality reduction and a 27.4% improvement for a thirty-six-dimensional data set with dimensionality reduction.
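
The general idea of sampling inside grid regions that are dense in minority class points can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose details the abstract does not give. The function names, the fixed `cell_size`, the `min_count` threshold, and the rule of filling the labeling budget from dense cells first are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def cell_of(point, cell_size):
    """Map a point to the integer index of the grid cell that contains it."""
    return tuple(int(x // cell_size) for x in point)


def dense_minority_cells(labeled, cell_size, min_count=2):
    """Return the cells that hold at least min_count labeled minority points.

    labeled is a list of (point, is_minority) pairs.
    The min_count threshold is an illustrative assumption.
    """
    counts = defaultdict(int)
    for point, is_minority in labeled:
        if is_minority:
            counts[cell_of(point, cell_size)] += 1
    return {cell for cell, n in counts.items() if n >= min_count}


def select_for_labeling(unlabeled, dense_cells, cell_size, budget, rng):
    """Pick up to budget unlabeled points, taking those in dense cells first."""
    inside = [p for p in unlabeled if cell_of(p, cell_size) in dense_cells]
    outside = [p for p in unlabeled if cell_of(p, cell_size) not in dense_cells]
    rng.shuffle(inside)   # random order within each group
    rng.shuffle(outside)
    return (inside + outside)[:budget]


# Toy 2-D data: two labeled minority points fall in cell (0, 0).
labeled = [
    ((0.1, 0.1), True),
    ((0.2, 0.3), True),
    ((0.4, 0.4), False),
    ((5.0, 5.0), False),
    ((7.2, 7.1), True),   # a lone minority point, below the threshold
]
unlabeled = [(0.5, 0.5), (3.3, 3.3), (0.9, 0.1), (8.0, 8.0)]

dense = dense_minority_cells(labeled, cell_size=1.0)
chosen = select_for_labeling(unlabeled, dense, 1.0, budget=2, rng=random.Random(0))
print(sorted(chosen))  # [(0.5, 0.5), (0.9, 0.1)]
```

In this sketch a larger `cell_size` produces fewer, coarser cells and a smaller one produces more, sparser cells, which is one way to see why the choice of grid size can affect how efficiently minority samples are found.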