{"title":"Some Experiments Comparing Logistic Regression and Random Forests Using Synthetic Data and the Interaction Miner Algorithm","authors":"Dhruv Sharma","doi":"10.2139/ssrn.1858424","DOIUrl":null,"url":null,"abstract":"This paper uses synthetic datasets to classify the conditions in which random forest may outperform more traditional techniques such as logistic regression. We explore the theoretical implications of these experimental findings, and work towards building a theory based approach to data mining. During the course of these experiments we take the simulations where random forests dominate and add additional dimensionality to the data and run logistic regression using the additional attributes through the I* interaction miner algorithm outlined in Sharma 2011. Using the I* procedure with adequate amount of interaction terms the logistic regression can be made to match performance of random forests in the synthetic data sets where random forests dominate (Sharma, 2011). This makes it seem the interaction miner algorithm along with some minimal sufficient amount of interaction and transformations allow logistic regression to match ensemble performance. This implies that, without a certain amount of dimensionality in the data interaction, miner and logistic regression do not benefit from the interactions. Breiman and other work shows Random Forests thrive on dimensionality that said from experiences with various data sets adding additional artificial dimensionality doesn’t help forest (Breiman, 2001). There appears to be some minimum or necessary and sufficient amount of dimensionality after which more information cannot be extracted from the data. 
The good news is dimensionality can be created using the icreater function which add Tukey’s re-expressions automatically to the data (log, negative reciprocal, and sqrt).","PeriodicalId":165362,"journal":{"name":"ERN: Discrete Regression & Qualitative Choice Models (Single) (Topic)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ERN: Discrete Regression & Qualitative Choice Models (Single) (Topic)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2139/ssrn.1858424","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
This paper uses synthetic datasets to characterize the conditions under which random forests may outperform more traditional techniques such as logistic regression. We explore the theoretical implications of these experimental findings and work towards building a theory-based approach to data mining. In these experiments we take the simulations where random forests dominate, add additional dimensionality to the data, and run logistic regression on the additional attributes produced by the I* interaction miner algorithm outlined in Sharma (2011). Using the I* procedure with an adequate number of interaction terms, logistic regression can be made to match the performance of random forests on the synthetic datasets where random forests dominate (Sharma, 2011). This suggests that the interaction miner algorithm, together with some minimal sufficient set of interactions and transformations, allows logistic regression to match ensemble performance. It also implies that, without a certain amount of dimensionality in the data, the interaction miner and logistic regression do not benefit from the interactions. Breiman and other work show that random forests thrive on dimensionality; that said, experience with various datasets suggests that adding additional artificial dimensionality does not help the forest (Breiman, 2001). There appears to be some minimum, or necessary and sufficient, amount of dimensionality beyond which no more information can be extracted from the data. The good news is that dimensionality can be created using the icreater function, which automatically adds Tukey's re-expressions (log, negative reciprocal, and square root) to the data.
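The feature-expansion idea described above can be sketched in a few lines. The abstract does not give the implementation of icreater or the I* miner, so the following is only a minimal illustration of the two ingredients it names: appending Tukey's re-expressions (log, negative reciprocal, square root) of each positive column, and appending pairwise interaction terms of the kind a logistic regression could then be fit on. The function names and signatures here are assumptions, not the paper's actual code.

```python
import numpy as np

def tukey_reexpress(X):
    """Append Tukey's ladder re-expressions (log, negative reciprocal,
    square root) of each strictly positive column of X.
    Illustrative stand-in for the paper's icreater function; the real
    implementation is not given in the abstract."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, np.log(X), -1.0 / X, np.sqrt(X)])

def pairwise_interactions(X):
    """Append all pairwise products x_i * x_j (i < j) as new columns --
    the simplest kind of interaction term a miner like I* might add
    before refitting logistic regression."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    products = [X[:, i] * X[:, j] for i in range(p) for j in range(i + 1, p)]
    return np.hstack([X, np.column_stack(products)]) if products else X
```

With three original positive features, the re-expressions quadruple the column count to 12, and the pairwise interactions add 3 more columns to the original 3; the expanded design matrix would then be passed to an ordinary logistic regression fit.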