Accurate and interpretable regression trees using oracle coaching

U. Johansson, Cecilia Sönströd, Rikard König
{"title":"Accurate and interpretable regression trees using oracle coaching","authors":"U. Johansson, Cecilia Sönströd, Rikard König","doi":"10.1109/CIDM.2014.7008667","DOIUrl":null,"url":null,"abstract":"In many real-world scenarios, predictive models need to be interpretable, thus ruling out many machine learning techniques known to produce very accurate models, e.g., neural networks, support vector machines and all ensemble schemes. Most often, tree models or rule sets are used instead, typically resulting in significantly lower predictive performance. The overall purpose of oracle coaching is to reduce this accuracy vs. comprehensibility trade-off by producing interpretable models optimized for the specific production set at hand. The method requires production set inputs to be present when generating the predictive model, a demand fulfilled in most, but not all, predictive modeling scenarios. In oracle coaching, a highly accurate, but opaque, model is first induced from the training data. This model (“the oracle”) is then used to label both the training instances and the production instances. Finally, interpretable models are trained using different combinations of the resulting data sets. In this paper, the oracle coaching produces regression trees, using neural networks and random forests as oracles. The experiments, using 32 publicly available data sets, show that the oracle coaching leads to significantly improved predictive performance, compared to standard induction. In addition, it is also shown that a highly accurate opaque model can be successfully used as a pre-processing step to reduce the noise typically present in data, even in situations where production inputs are not available. In fact, just augmenting or replacing training data with another copy of the training set, but with the predictions from the opaque model as targets, produced significantly more accurate and/or more compact regression trees.","PeriodicalId":117542,"journal":{"name":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIDM.2014.7008667","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

In many real-world scenarios, predictive models need to be interpretable, thus ruling out many machine learning techniques known to produce very accurate models, e.g., neural networks, support vector machines and all ensemble schemes. Most often, tree models or rule sets are used instead, typically resulting in significantly lower predictive performance. The overall purpose of oracle coaching is to reduce this accuracy vs. comprehensibility trade-off by producing interpretable models optimized for the specific production set at hand. The method requires production set inputs to be present when generating the predictive model, a demand fulfilled in most, but not all, predictive modeling scenarios. In oracle coaching, a highly accurate, but opaque, model is first induced from the training data. This model ("the oracle") is then used to label both the training instances and the production instances. Finally, interpretable models are trained using different combinations of the resulting data sets. In this paper, oracle coaching is used to produce regression trees, with neural networks and random forests serving as oracles. The experiments, using 32 publicly available data sets, show that oracle coaching leads to significantly improved predictive performance, compared to standard induction. In addition, it is shown that a highly accurate opaque model can be successfully used as a pre-processing step to reduce the noise typically present in data, even in situations where production inputs are not available. In fact, just augmenting or replacing training data with another copy of the training set, but with the predictions from the opaque model as targets, produced significantly more accurate and/or more compact regression trees.
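The pipeline described in the abstract is simple enough to sketch in code. The following is a minimal illustration using scikit-learn, not the authors' exact experimental protocol: the California housing data stands in for the paper's 32 data sets, a random forest plays the oracle (the paper also used neural networks), and the forest size and tree depth are arbitrary illustrative choices.

```python
# Sketch of oracle coaching for regression trees, under assumed settings.
# Data set, estimators, and hyperparameters are stand-ins, not the paper's setup.
import numpy as np
from sklearn.datasets import fetch_california_housing  # stand-in data set
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
# "Production" instances: their inputs are available at modeling time,
# but their true targets are treated as unknown.
X_train, X_prod, y_train, y_prod = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 1: induce an accurate but opaque model ("the oracle") from training data.
oracle = RandomForestRegressor(n_estimators=300, random_state=0)
oracle.fit(X_train, y_train)

# Step 2: let the oracle label both the training and the production inputs.
y_train_oracle = oracle.predict(X_train)
y_prod_oracle = oracle.predict(X_prod)

# Step 3a: oracle coaching proper -- fit an interpretable tree on the
# original training data plus oracle-labeled production instances.
X_coach = np.vstack([X_train, X_prod])
y_coach = np.concatenate([y_train, y_prod_oracle])
coached_tree = DecisionTreeRegressor(max_depth=5, random_state=0)
coached_tree.fit(X_coach, y_coach)

# Step 3b: noise-reduction variant -- usable even when production inputs
# are unavailable: keep the training inputs but replace the (noisy)
# targets with the oracle's predictions.
denoised_tree = DecisionTreeRegressor(max_depth=5, random_state=0)
denoised_tree.fit(X_train, y_train_oracle)

# Baseline: standard induction on the raw training data.
baseline_tree = DecisionTreeRegressor(max_depth=5, random_state=0)
baseline_tree.fit(X_train, y_train)

for name, tree in [("baseline", baseline_tree),
                   ("coached", coached_tree),
                   ("denoised", denoised_tree)]:
    print(name, mean_squared_error(y_prod, tree.predict(X_prod)))
```

Note that the true production targets `y_prod` are used only for evaluation at the end; the coached tree sees only the production *inputs*, which is exactly the requirement the abstract flags. Step 3b drops even that requirement, which is why the paper can report the noise-reduction effect in scenarios where no production data exists at training time.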