Better trees: an empirical study on hyperparameter tuning of classification decision tree induction algorithms

Data Mining and Knowledge Discovery (IF 2.8; Zone 3, Computer Science; JCR Q2, Computer Science, Artificial Intelligence) Pub Date: 2024-01-31 DOI: 10.1007/s10618-024-01002-5

Rafael Gomes Mantovani, Tomáš Horváth, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André C. P. L. F. de Carvalho



Abstract

Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Because the number of possible hyperparameter configurations is large and their interactions are complex, it is common to use optimization techniques to find settings that lead to high predictive performance. However, efficiently exploring this vast configuration space and handling the trade-off between predictive and runtime performance remain challenging. Furthermore, there are cases where the default hyperparameter values are already a suitable configuration. Additionally, for many reasons, including model validation and compliance with new legislation, there is increasing interest in interpretable models, such as those created by decision tree (DT) induction algorithms. This paper provides a comprehensive approach for investigating the effects of hyperparameter tuning on the two most commonly used DT induction algorithms, CART and C4.5. DT induction algorithms offer high predictive performance and interpretable classification models, though many hyperparameters need to be adjusted. Experiments were carried out with different tuning strategies to induce models and to evaluate the relevance of hyperparameters using 94 classification datasets from OpenML. The experimental results indicate that different hyperparameter profiles for the tuning of each algorithm provide statistically significant improvements in most of the datasets for CART, but in only one-third for C4.5. Although different algorithms may present different tuning scenarios, the tuning techniques generally required few evaluations to find accurate solutions. Furthermore, the best technique for all the algorithms was Irace. Finally, we found that tuning a specific small subset of hyperparameters is a good alternative for achieving optimal predictive performance.
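To make the idea of tuning a small subset of tree hyperparameters with a small evaluation budget concrete, the following is a minimal sketch and not the paper's experimental setup. It assumes a hand-written CART-style tree (Gini impurity, numeric features), synthetic data instead of OpenML datasets, and plain random search in place of Irace. The hyperparameter names `max_depth` and `min_split`, their ranges, the "default" configuration, and the budget of 20 evaluations are all illustrative assumptions.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, max_depth, min_split, depth=0):
    """Grow a CART-style binary tree; leaves are class labels, nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(rows) < min_split or len(set(labels)) == 1:
        return majority
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0])):
        # Dropping the largest value guarantees a non-empty right branch.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # no feature has two distinct values
        return majority
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    grow = lambda idx: build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                  max_depth, min_split, depth + 1)
    return (f, t, grow(li), grow(ri))


def predict(tree, row):
    """Follow splits until a leaf (a non-tuple label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(rows)


def make_data(n, rng):
    """Synthetic two-feature problem with roughly 10% label noise (illustrative)."""
    rows = [(rng.random(), rng.random()) for _ in range(n)]
    labels = [int(x + y > 1.0) ^ int(rng.random() < 0.1) for x, y in rows]
    return rows, labels


def random_search(train, valid, start_cfg, budget, rng):
    """Evaluate the starting configuration, then `budget` random ones; keep the best."""
    best_cfg = start_cfg
    best_acc = accuracy(build_tree(*train, **start_cfg), *valid)
    for _ in range(budget):
        cfg = {"max_depth": rng.randint(1, 10), "min_split": rng.randint(2, 40)}
        acc = accuracy(build_tree(*train, **cfg), *valid)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc


rng = random.Random(0)
train, valid = make_data(200, rng), make_data(100, rng)
default_cfg = {"max_depth": 30, "min_split": 2}  # assumed "default": nearly unpruned
default_acc = accuracy(build_tree(*train, **default_cfg), *valid)
best_cfg, best_acc = random_search(train, valid, default_cfg, budget=20, rng=rng)
print("default:", default_cfg, round(default_acc, 3))
print("tuned:  ", best_cfg, round(best_acc, 3))
```

Because the search starts from the default configuration, the tuned validation accuracy can never be lower than the default one. Whether the gain is statistically significant, which is the question the study addresses, would require repeated resampling and a proper test rather than a single holdout split.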


