Generalized and Heuristic-Free Feature Construction for Improved Accuracy.

Proceedings of the ... SIAM International Conference on Data Mining. Pub Date: 2010-01-01. DOI: 10.1137/1.9781611972801.55

Wei Fan, Erheng Zhong, Jing Peng, Olivier Verscheure, Kun Zhang, Jiangtao Ren, Rong Yan, Qiang Yang

Volume 2010, pages 629-640.

Citations: 28

Abstract

State-of-the-art learning algorithms accept data in feature vector format as input. Examples belonging to different classes may not always be easy to separate in the original feature space. One may ask: can transforming existing features into a new space reveal significant discriminative information that is not obvious in the original space? Three problems arise. First, since there can be an infinite number of ways to extend features, it is impractical to first enumerate them and then perform feature selection. Second, evaluating discriminative power on the complete dataset is not always optimal, because features that are highly discriminative on a subset of examples may not be significant when evaluated on the entire dataset. Third, feature construction ought to be automated and general, such that it does not require domain knowledge and its accuracy improvement holds across a large number of classification algorithms. In this paper, we propose a framework to address these problems through the following steps: (1) divide-and-conquer to avoid exhaustive enumeration; (2) local feature construction and evaluation within subspaces of examples where local error is still high and the features constructed thus far still do not predict well; (3) a search based on weighting rules that is free of domain knowledge and has a provable performance guarantee. Empirical studies indicate that significant improvement (as much as 9% in accuracy and 28% in AUC) is achieved using the newly constructed features over a variety of inductive learners evaluated against a number of balanced, skewed and high-dimensional datasets. Software and datasets are available from the authors.
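The idea behind step (2) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not say which transformations, partitioning rule, or weighting rules the framework uses, so the operator set `OPS`, the single-threshold scoring in `stump_accuracy`, the `error_threshold` parameter, and all function names here are assumptions made purely for illustration. The sketch assumes binary 0/1 labels and numeric features, and constructs a new feature only when the original features predict a given subset of examples poorly.

```python
from itertools import combinations
import operator

# Hypothetical operator set; the abstract does not name the transformations used.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def stump_accuracy(values, labels):
    """Best accuracy of a single-threshold classifier on one feature column."""
    n = len(labels)
    best = max(sum(labels), n - sum(labels)) / n  # majority-class baseline
    for t in sorted(set(values)):
        # Rule: predict 1 when value > t. The flipped rule scores n - hits.
        hits = sum((v > t) == bool(y) for v, y in zip(values, labels))
        best = max(best, hits / n, (n - hits) / n)
    return best


def construct_local_feature(rows, labels, error_threshold=0.2):
    """Return (accuracy, name) of the best pairwise feature on this subset,
    or None when an original feature already predicts the subset well."""
    n_features = len(rows[0])
    columns = [[r[j] for r in rows] for j in range(n_features)]
    local_error = 1.0 - max(stump_accuracy(c, labels) for c in columns)
    if local_error <= error_threshold:
        return None  # local error is already low, so nothing is constructed
    candidates = []
    for i, j in combinations(range(n_features), 2):
        for name, op in OPS.items():
            values = [op(a, b) for a, b in zip(columns[i], columns[j])]
            candidates.append((stump_accuracy(values, labels), f"{name}(x{i}, x{j})"))
    return max(candidates)


# XOR-style subset: neither original feature separates the classes alone.
rows = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, 0, 0]
best = construct_local_feature(rows, labels)
print(best)  # (1.0, 'mul(x0, x1)')
```

On this toy subset each original feature reaches only 50% accuracy with a threshold rule, while the product of the two features separates the classes perfectly. In a divide-and-conquer setting, a function like this would be called once per partition of the examples rather than once on the whole dataset; how the partitions are formed is left open here because the abstract does not describe it.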


