Mining the Big Data: The Critical Feature Dimension Problem
Qingzhong Liu, B. Ribeiro, A. Sung, Divya Suryakumar
2014 IIAI 3rd International Conference on Advanced Applied Informatics, December 2014. DOI: 10.1109/IIAI-AAI.2014.105
Citations: 16
Abstract
In mining massive datasets, two of the most important and immediate problems are often sampling and feature selection. Proper sampling and feature selection contribute to reducing the size of the dataset while obtaining satisfactory results in model building. It is therefore theoretically interesting to investigate whether a given dataset possesses a critical feature dimension, i.e., the minimum number of features required for a given learning machine to achieve "satisfactory" performance. (Likewise, the critical sampling size problem asks whether, for a given dataset, there is a minimum number of data points that must be included in any sample for a learning machine to achieve satisfactory performance.) Here the specific meaning of "satisfactory" performance is to be defined by the user. This paper addresses the complexity of both problems in one general theoretical setting and shows that they have the same complexity and are highly intractable. Next, an empirical method is applied in an attempt to find the approximate critical feature dimension of datasets. It is demonstrated that, under generally reasonable assumptions about feature ranking algorithms, the critical feature dimension is successfully discovered by the empirical method for a number of datasets of various sizes. The results are encouraging, achieving significant reductions in feature set size, and they point to a promising way of dealing with big data. The significance of the existence of a critical dimension in datasets is also explained.
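The abstract does not specify which feature ranking algorithm, learning machine, or performance criterion the authors used. The sketch below is only an illustration of the kind of empirical search described: rank the features, then scan prefixes of the ranking for the smallest one whose performance clears a user-chosen "satisfactory" threshold. The mutual-information ranker, random-forest learner, and 95% cross-validated accuracy threshold are all assumptions for the example, not the paper's actual method.

```python
# Minimal sketch of an empirical search for a critical feature dimension,
# under assumed choices (not taken from the paper):
#   ranker   = mutual information between each feature and the label
#   learner  = random forest
#   "satisfactory" = mean 5-fold CV accuracy >= threshold
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score


def critical_feature_dimension(X, y, threshold=0.95):
    """Return (k, score) for the smallest k such that the top-k ranked
    features reach `threshold` accuracy, or None if no prefix does."""
    # Rank features from most to least informative (one possible ranker).
    ranking = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]
    for k in range(1, X.shape[1] + 1):
        top_k = ranking[:k]
        score = cross_val_score(
            RandomForestClassifier(random_state=0), X[:, top_k], y, cv=5
        ).mean()
        if score >= threshold:
            return k, score
    return None


data = load_breast_cancer()
print(critical_feature_dimension(data.data, data.target))
```

Because the prefix sizes are scanned in increasing order, the first k that clears the threshold is, by construction, the approximate critical feature dimension relative to this particular ranker, learner, and threshold.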