Some Experiments Comparing Logistic Regression and Random Forests Using Synthetic Data and the Interaction Miner Algorithm

Dhruv Sharma
DOI: 10.2139/ssrn.1858424
Journal: ERN: Discrete Regression & Qualitative Choice Models (Single) (Topic)
Published: 2011-06-05
Citations: 2

Abstract

This paper uses synthetic datasets to characterize the conditions under which random forests may outperform more traditional techniques such as logistic regression. We explore the theoretical implications of these experimental findings and work towards building a theory-based approach to data mining. In these experiments we take the simulations where random forests dominate, add additional dimensionality to the data, and run logistic regression on the additional attributes produced by the I* interaction miner algorithm outlined in Sharma (2011). Using the I* procedure with an adequate number of interaction terms, logistic regression can be made to match the performance of random forests on the synthetic datasets where random forests dominate (Sharma, 2011). This suggests that the interaction miner algorithm, together with some minimal sufficient set of interactions and transformations, allows logistic regression to match ensemble performance. It also implies that, without a certain amount of dimensionality in the data, the interaction miner and logistic regression do not benefit from the interactions. Breiman and others have shown that random forests thrive on dimensionality; that said, in our experience with various datasets, adding additional artificial dimensionality does not help the forest (Breiman, 2001). There appears to be some minimum, or necessary and sufficient, amount of dimensionality beyond which no more information can be extracted from the data. The good news is that dimensionality can be created using the icreater function, which automatically adds Tukey's re-expressions (log, negative reciprocal, and square root) to the data.
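The augmentation strategy described in the abstract can be sketched as follows. This is a hedged illustration, not the paper's actual code: `tukey_reexpress` mimics what icreater is said to do (append log, negative-reciprocal, and square-root re-expressions), while `pairwise_interactions` is a crude hypothetical stand-in for the I* interaction-mining step, whose real procedure is given in Sharma (2011). The synthetic dataset is likewise an assumed example with a built-in multiplicative interaction, the kind of structure on which random forests tend to beat a plain logistic regression.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data whose true signal contains a multiplicative interaction
# and a log term -- features a plain linear logit cannot represent exactly.
n = 4000
X = rng.uniform(0.1, 2.0, size=(n, 3))
logit = 3.0 * X[:, 0] * X[:, 1] - 2.0 * np.log(X[:, 2]) - 1.0
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

def tukey_reexpress(X):
    """Append Tukey re-expressions of each (positive) column:
    log x, -1/x, sqrt(x) -- the transforms icreater is said to add."""
    return np.hstack([X, np.log(X), -1.0 / X, np.sqrt(X)])

def pairwise_interactions(X):
    """Append all pairwise products; a stand-in for mined interactions."""
    i, j = np.triu_indices(X.shape[1], k=1)
    return np.hstack([X, X[:, i] * X[:, j]])

X_aug = pairwise_interactions(tukey_reexpress(X))

X_tr, X_te, Xa_tr, Xa_te, y_tr, y_te = train_test_split(
    X, X_aug, y, test_size=0.5, random_state=0)

# Plain logistic regression on the raw attributes.
auc_plain = roc_auc_score(
    y_te,
    LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# Logistic regression on the augmented (re-expressed + interacted) attributes.
auc_aug = roc_auc_score(
    y_te,
    LogisticRegression(max_iter=5000).fit(Xa_tr, y_tr).predict_proba(Xa_te)[:, 1])

# Random forest on the raw attributes, for comparison.
auc_rf = roc_auc_score(
    y_te,
    RandomForestClassifier(n_estimators=200, random_state=0)
    .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

print(f"plain LR AUC={auc_plain:.3f}  "
      f"augmented LR AUC={auc_aug:.3f}  RF AUC={auc_rf:.3f}")
```

On data of this shape, the augmented design matrix contains the true interaction and log terms, so the augmented logistic regression should close most of the gap to the random forest. For exhaustive interaction generation, scikit-learn's `PolynomialFeatures(interaction_only=True)` is a standard off-the-shelf alternative to the hand-rolled helper above.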