Xuecheng Tian, Shuaian Wang, Lu Zhen, Zuo-Jun (Max) Shen
{"title":"[formula omitted]-Tree: Crossing sharp boundaries in regression trees to find neighbors","authors":"Xuecheng Tian, Shuaian Wang, Lu Zhen, Zuo-Jun (Max) Shen","doi":"10.1016/j.ejor.2025.02.031","DOIUrl":null,"url":null,"abstract":"Traditional classification and regression trees (CARTs) utilize a top-down, greedy approach to split the feature space into sharply defined, axis-aligned sub-regions (leaves). Each leaf treats all of the samples therein uniformly during the prediction process, leading to a constant predictor. Although this approach is well known for its interpretability and efficiency, it overlooks the complex local distributions within and across leaves. As the number of features increases, this limitation becomes more pronounced, often resulting in a concentration of samples near the boundaries of the leaves. Such clustering suggests that there is potential in identifying closer neighbors in adjacent leaves, a phenomenon that is unexplored in the literature. Our study addresses this gap by introducing the <mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree methodology, a novel method that extends the search for nearest neighbors beyond a single leaf to include adjacent leaves. This approach has two key innovations: (1) establishing an adjacency relationship between leaves across the tree space and (2) designing novel intra-leaf and inter-leaf distance metrics through an optimization lens, which are tailored to local data distributions within the tree. We explore three implementations of the <mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree methodology: (1) the Post-hoc <mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree (P<mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree), which integrates the <mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree methodology into constructed decision trees, (2) the Advanced <mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree, which seamlessly incorporates the <mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree methodology during the tree construction process, and (3) the P<mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-random forest, which integrates the P<mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree principles with the random forest framework. The results of empirical evaluations conducted on a variety of real-world and synthetic datasets demonstrate that the <mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree methods have greater prediction accuracy over the traditional models. 
These results highlight the potential of the <mml:math altimg=\"si545.svg\" display=\"inline\"><mml:mi>k</mml:mi></mml:math>-Tree methodology in enhancing predictive analytics by providing a deeper insight into the relationships between samples within the tree space.","PeriodicalId":55161,"journal":{"name":"European Journal of Operational Research","volume":"110 1","pages":""},"PeriodicalIF":6.0000,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Operational Research","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1016/j.ejor.2025.02.031","RegionNum":2,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPERATIONS RESEARCH & MANAGEMENT SCIENCE","Score":null,"Total":0}
k-Tree: Crossing sharp boundaries in regression trees to find neighbors
Traditional classification and regression trees (CARTs) use a top-down, greedy approach to split the feature space into sharply defined, axis-aligned sub-regions (leaves). Each leaf treats all of the samples it contains uniformly during prediction, yielding a constant predictor. Although this approach is well known for its interpretability and efficiency, it overlooks the complex local distributions within and across leaves. As the number of features increases, this limitation becomes more pronounced, often resulting in a concentration of samples near leaf boundaries. Such clustering suggests that closer neighbors may lie in adjacent leaves, a possibility that remains unexplored in the literature. Our study addresses this gap by introducing the k-Tree methodology, a novel approach that extends the search for nearest neighbors beyond a single leaf to include adjacent leaves. The approach has two key innovations: (1) establishing an adjacency relationship between leaves across the tree space and (2) designing novel intra-leaf and inter-leaf distance metrics through an optimization lens, tailored to the local data distributions within the tree. We explore three implementations of the k-Tree methodology: (1) the Post-hoc k-Tree (Pk-Tree), which applies the k-Tree methodology to already constructed decision trees; (2) the Advanced k-Tree, which incorporates the k-Tree methodology during tree construction; and (3) the Pk-random forest, which integrates the Pk-Tree principles with the random forest framework. Empirical evaluations on a variety of real-world and synthetic datasets demonstrate that the k-Tree methods achieve greater prediction accuracy than traditional models. These results highlight the potential of the k-Tree methodology to enhance predictive analytics by providing deeper insight into the relationships between samples within the tree space.
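Since the abstract describes the method only conceptually, the following Python sketch illustrates the general idea behind a post-hoc k-Tree-style predictor: fit an ordinary CART, treat two leaves as adjacent when their axis-aligned boxes overlap or share a boundary, and answer a query by a k-nearest-neighbor search over the training samples in the query's own leaf and its adjacent leaves. This is a minimal sketch under simplifying assumptions, not the paper's algorithm: plain Euclidean distance stands in for the optimized intra-leaf and inter-leaf metrics, the box-touching test stands in for the paper's adjacency construction, and the helper names (leaf_boxes, adjacent_leaves, pk_tree_predict) are hypothetical.

```python
# Minimal sketch of a post-hoc "k-Tree-style" predictor, under simplifying
# assumptions: plain Euclidean distance replaces the paper's optimized
# intra-/inter-leaf metrics, and "axis-aligned leaf boxes that touch" replaces
# its adjacency construction. Helper names are hypothetical, not from the paper.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def leaf_boxes(tree, n_features):
    """Map each leaf id of a fitted sklearn tree to its axis-aligned box."""
    t = tree.tree_
    boxes = {}

    def recurse(node, lower, upper):
        if t.children_left[node] == -1:  # -1 marks a leaf in sklearn trees
            boxes[node] = (lower, upper)
            return
        f, thr = t.feature[node], t.threshold[node]
        left_upper = upper.copy()
        left_upper[f] = thr              # left child: feature f <= thr
        right_lower = lower.copy()
        right_lower[f] = thr             # right child: feature f > thr
        recurse(t.children_left[node], lower, left_upper)
        recurse(t.children_right[node], right_lower, upper)

    recurse(0, np.full(n_features, -np.inf), np.full(n_features, np.inf))
    return boxes


def adjacent_leaves(boxes, leaf):
    """Leaves whose boxes overlap or share a boundary with `leaf`'s box."""
    lo, hi = boxes[leaf]
    return [other for other, (lo2, hi2) in boxes.items()
            if other != leaf and np.all(lo <= hi2) and np.all(lo2 <= hi)]


def pk_tree_predict(tree, X_train, y_train, x_query, k=5):
    """Average the targets of the k nearest training samples drawn from the
    query's own leaf and its adjacent leaves."""
    boxes = leaf_boxes(tree, X_train.shape[1])
    train_leaves = tree.apply(X_train)
    query_leaf = tree.apply(x_query.reshape(1, -1))[0]
    candidates = [query_leaf, *adjacent_leaves(boxes, query_leaf)]
    mask = np.isin(train_leaves, candidates)
    X_cand, y_cand = X_train[mask], y_train[mask]
    order = np.argsort(np.linalg.norm(X_cand - x_query, axis=1))  # simplified metric
    return y_cand[order[:k]].mean()


# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = X[:, 0] + np.sin(5 * X[:, 1]) + 0.1 * rng.normal(size=500)
cart = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
x_new = rng.uniform(size=3)
print("CART leaf-mean prediction:", cart.predict(x_new.reshape(1, -1))[0])
print("k-Tree-style prediction:  ", pk_tree_predict(cart, X, y, x_new, k=7))
```

The point of the sketch is the change in prediction behavior: instead of returning a leaf-constant value (the CART baseline printed first), the predictor averages a neighborhood that can draw on samples just across a split boundary, which is the effect the abstract motivates.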
About the journal:
The European Journal of Operational Research (EJOR) publishes high-quality, original papers that contribute to the methodology of operational research (OR) and to the practice of decision making.