{"title":"D-optimal plans for variable selection in data bases","authors":"J. Schiffner, C. Weihs","doi":"10.17877/DE290R-8705","DOIUrl":null,"url":null,"abstract":"This paper is based on an article of Pumplun et al. (2005a) that investigates the use of Design of Experiments in data bases in order to select variables that are relevant for classification in situations where a sufficient number of measurements of the explanatory variables is available, but measuring the class label is hard, e. g. expensive or time-consuming. Pumplun et al. searched for D-optimal designs in existing data sets by means of a genetic algorithm and assessed variable importance based on the found plans. If the design matrix is standardized these D-optimal plans are almost orthogonal and the explanatory variables are nearly uncorrelated. Thus Pumplun et al. expected that their importance for discrimination can be judged independently of each other. In a simulation study Pumplun et al. applied this approach in combination with five classification methods to eight data sets and the obtained error rates were compared with those resulting from variable selection on the basis of the complete data sets. Based on the D-optimal plans in some cases considerably lower error rates were achieved. Although Pumplun et al. (2005a) obtained some promising results, it was not clear for different reasons if D-optimality actually is beneficial for variable selection. For example, D-efficiency and orthogonality of the resulting plans were not investigated and a comparison with variable selection based on random samples of observations of the same size as the D-optimal plans was missing. In this paper we extend the simulation study of Pumplun et al. (2005a) in order to verify their results and as basis for further research in this field. Moreover, in Pumplun et al. D-optimal plans are only used for data preprocessing, that is variable selection. The classification models are estimated on the whole data set in order to assess the effects of D-optimality on variable selection separately. Since the number of measurements of the class label in fact is limited one would normally employ the same observations that were used for variable selection for learning, too. For this reason in our simulation study the appropriateness of D-optimal plans for training classification methods is additionally investigated. It turned out that in general in terms of the error rate there is no difference between variable selection on the basis of D-optimal plans and variable selection on random samples. However, for training of linear classification methods D-optimal plans seem to be beneficial.","PeriodicalId":10841,"journal":{"name":"CTIT technical reports series","volume":"64 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2009-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"CTIT technical reports series","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.17877/DE290R-8705","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper builds on an article by Pumplun et al. (2005a) that investigates the use of Design of Experiments in data bases in order to select variables that are relevant for classification in situations where a sufficient number of measurements of the explanatory variables is available, but measuring the class label is hard, e.g. expensive or time-consuming. Pumplun et al. searched for D-optimal designs in existing data sets by means of a genetic algorithm and assessed variable importance based on the plans found. If the design matrix is standardized, these D-optimal plans are almost orthogonal and the explanatory variables are nearly uncorrelated. Pumplun et al. therefore expected that the importance of the variables for discrimination could be judged independently of each other. In a simulation study, Pumplun et al. applied this approach in combination with five classification methods to eight data sets, and the resulting error rates were compared with those obtained from variable selection on the complete data sets. In some cases the D-optimal plans yielded considerably lower error rates. Although Pumplun et al. (2005a) obtained some promising results, it remained unclear, for several reasons, whether D-optimality is actually beneficial for variable selection. For example, the D-efficiency and orthogonality of the resulting plans were not investigated, and a comparison with variable selection based on random samples of observations of the same size as the D-optimal plans was missing. In this paper we extend the simulation study of Pumplun et al. (2005a) in order to verify their results and to provide a basis for further research in this field. Moreover, in Pumplun et al. the D-optimal plans are used only for data preprocessing, that is, variable selection. The classification models are estimated on the whole data set in order to assess the effect of D-optimality on variable selection separately. Since the number of measurements of the class label is in fact limited, one would normally also use the observations chosen for variable selection for training. For this reason, our simulation study additionally investigates the suitability of D-optimal plans for training classification methods. It turns out that, in terms of the error rate, there is in general no difference between variable selection based on D-optimal plans and variable selection based on random samples. For training linear classification methods, however, D-optimal plans seem to be beneficial.
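To make the underlying idea concrete, the sketch below shows how a D-optimal "plan" can be sought inside an existing data set: select a subset of observations whose standardized design matrix X_S maximizes det(X_S' X_S), and then measure class labels, select variables, and (as studied here) train classifiers only on those observations. This is a minimal illustration, not the authors' implementation: Pumplun et al. used a genetic algorithm for the search, whereas the simple random single-row exchange used here, and the names `greedy_d_optimal_subset`, `n_plan`, and the hypothetical input file, are assumptions made for the example.

```python
import numpy as np

def d_criterion(X_sub):
    """Log-determinant of the information matrix X'X of the selected rows."""
    sign, logdet = np.linalg.slogdet(X_sub.T @ X_sub)
    return logdet if sign > 0 else -np.inf  # singular information matrix gets -inf

def greedy_d_optimal_subset(X, n_plan, n_iter=500, seed=0):
    """Standardize X, then improve a random subset of rows by single-row exchanges.

    Illustrative stand-in for the genetic-algorithm search of Pumplun et al.
    """
    rng = np.random.default_rng(seed)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized design matrix
    n = Xs.shape[0]
    selected = rng.choice(n, size=n_plan, replace=False)
    best = d_criterion(Xs[selected])
    for _ in range(n_iter):
        out = rng.integers(n_plan)                                   # position to swap out
        cand = rng.choice(np.setdiff1d(np.arange(n), selected))      # observation to swap in
        trial = selected.copy()
        trial[out] = cand
        val = d_criterion(Xs[trial])
        if val > best:                                               # keep exchange if det(X'X) grows
            selected, best = trial, val
    return selected, best

# Hypothetical usage: the returned rows form the plan; only these observations
# would need measured class labels for variable selection and/or training.
# X = np.loadtxt("explanatory_variables.csv", delimiter=",")  # hypothetical file
# rows, logdet = greedy_d_optimal_subset(X, n_plan=30)
```

In this setting the comparison reported above amounts to contrasting such a D-criterion-driven subset with a purely random subset of the same size, both used for variable selection and, additionally in this paper, for training.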