Generalized and Heuristic-Free Feature Construction for Improved Accuracy.

Wei Fan, Erheng Zhong, Jing Peng, Olivier Verscheure, Kun Zhang, Jiangtao Ren, Rong Yan, Qiang Yang
{"title":"Generalized and Heuristic-Free Feature Construction for Improved Accuracy.","authors":"Wei Fan,&nbsp;Erheng Zhong,&nbsp;Jing Peng,&nbsp;Olivier Verscheure,&nbsp;Kun Zhang,&nbsp;Jiangtao Ren,&nbsp;Rong Yan,&nbsp;Qiang Yang","doi":"10.1137/1.9781611972801.55","DOIUrl":null,"url":null,"abstract":"<p><p>State-of-the-art learning algorithms accept data in feature vector format as input. Examples belonging to different classes may not always be easy to separate in the original feature space. One may ask: can transformation of existing features into new space reveal significant discriminative information not obvious in the original space? Since there can be infinite number of ways to extend features, it is impractical to first enumerate and then perform feature selection. Second, evaluation of discriminative power on the complete dataset is not always optimal. This is because features highly discriminative on subset of examples may not necessarily be significant when evaluated on the entire dataset. Third, feature construction ought to be automated and general, such that, it doesn't require domain knowledge and its improved accuracy maintains over a large number of classification algorithms. In this paper, we propose a framework to address these problems through the following steps: (1) divide-conquer to avoid exhaustive enumeration; (2) local feature construction and evaluation within subspaces of examples where local error is still high and constructed features thus far still do not predict well; (3) weighting rules based search that is domain knowledge free and has provable performance guarantee. Empirical studies indicate that significant improvement (as much as 9% in accuracy and 28% in AUC) is achieved using the newly constructed features over a variety of inductive learners evaluated against a number of balanced, skewed and high-dimensional datasets. Software and datasets are available from the authors.</p>","PeriodicalId":74533,"journal":{"name":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","volume":"2010 ","pages":"629-640"},"PeriodicalIF":0.0000,"publicationDate":"2010-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1137/1.9781611972801.55","citationCount":"28","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1137/1.9781611972801.55","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 28

Abstract

State-of-the-art learning algorithms accept data in feature-vector format as input. Examples belonging to different classes may not always be easy to separate in the original feature space. One may ask: can transforming existing features into a new space reveal significant discriminative information that is not obvious in the original space? First, since there is an infinite number of ways to extend features, it is impractical to enumerate them and then perform feature selection. Second, evaluating discriminative power on the complete dataset is not always optimal, because features that are highly discriminative on a subset of examples may not be significant when evaluated on the entire dataset. Third, feature construction ought to be automated and general, so that it does not require domain knowledge and its accuracy improvement holds across a large number of classification algorithms. In this paper, we propose a framework that addresses these problems through the following steps: (1) divide-and-conquer to avoid exhaustive enumeration; (2) local feature construction and evaluation within subspaces of examples where the local error is still high and the features constructed so far do not yet predict well; (3) a weighted-rule-based search that is free of domain knowledge and has a provable performance guarantee. Empirical studies indicate that significant improvement (as much as 9% in accuracy and 28% in AUC) is achieved using the newly constructed features over a variety of inductive learners evaluated on a number of balanced, skewed, and high-dimensional datasets. Software and datasets are available from the authors.
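To make the three steps more concrete, below is a minimal, illustrative sketch (not the authors' implementation) of the divide-and-conquer idea, assuming a scikit-learn environment: examples are partitioned into subspaces, and only in subspaces where a baseline classifier still errs heavily are candidate features constructed and kept when they improve local cross-validated accuracy. The function name `local_feature_construction`, the k-means partition, the pairwise-product candidate features, and the logistic-regression evaluator are all assumptions made for illustration; the paper's weighted-rule-based search and its performance guarantee are not reproduced here.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def local_feature_construction(X, y, n_subspaces=4, error_threshold=0.2, seed=0):
    """Illustrative sketch only: partition examples into subspaces and, where a
    baseline model still errs heavily, search pairwise feature products and keep
    the one that most improves local cross-validated accuracy."""
    # (1) divide-and-conquer: partition the examples instead of enumerating
    #     all possible constructed features globally (k-means is an assumption)
    labels = KMeans(n_clusters=n_subspaces, n_init=10, random_state=seed).fit_predict(X)
    new_cols = []
    for s in range(n_subspaces):
        idx = np.where(labels == s)[0]
        if len(idx) < 20 or len(np.unique(y[idx])) < 2:
            continue  # skip subspaces too small or single-class
        Xs, ys = X[idx], y[idx]
        if np.min(np.unique(ys, return_counts=True)[1]) < 3:
            continue  # need at least 3 examples per class for 3-fold CV
        base = cross_val_score(LogisticRegression(max_iter=1000), Xs, ys, cv=3).mean()
        # (2) construct and evaluate features locally, only where local error is still high
        if 1.0 - base <= error_threshold:
            continue
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(X.shape[1]), 2):
            cand = (Xs[:, i] * Xs[:, j]).reshape(-1, 1)  # hypothetical candidate: feature product
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    np.hstack([Xs, cand]), ys, cv=3).mean()
            if score - base > best_gain:
                best_gain, best_pair = score - base, (i, j)
        if best_pair is not None:
            i, j = best_pair
            # append the winning constructed feature, computed over all examples
            new_cols.append((X[:, i] * X[:, j]).reshape(-1, 1))
    return np.hstack([X] + new_cols) if new_cols else X
```

Calling `local_feature_construction(X, y)` on a numeric feature matrix returns the original matrix augmented with any locally constructed product features, which can then be fed to any downstream inductive learner.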
