Computing Random Forest-distances in the presence of missing data

ACM Transactions on Knowledge Discovery from Data (IF 4; Zone 3, Computer Science; Q1, Computer Science, Information Systems) Pub Date: 2024-04-08 DOI: 10.1145/3656345

Manuele Bicego, Ferdinando Cicalese

Citations: 0

Abstract

In this paper, we study the problem of computing Random Forest (RF) distances in the presence of missing data. We present a general framework that avoids pre-imputation and uses the information contained in the input points in an agnostic way. We centre our investigation on RatioRF, an RF-based distance recently introduced in the context of clustering and shown to outperform most known RF-based distance measures. We also show that the same framework can be applied to several other state-of-the-art RF-based measures, and we provide their extensions to the missing-data case. We provide substantial empirical evidence of the effectiveness of the proposed framework through extensive experiments with RatioRF on 15 datasets. Finally, we show that our method compares favourably with many alternative distances from the literature that can be computed with missing values.

Source journal

ACM Transactions on Knowledge Discovery from Data
Categories: Computer Science, Information Systems; Computer Science, Software Engineering

CiteScore: 6.70
Self-citation rate: 5.60%
Articles published: 172
Review time: 3 months

About the journal: TKDD welcomes papers on a full range of research in the knowledge discovery and analysis of diverse forms of data. Such subjects include, but are not limited to: scalable and effective algorithms for data mining and big data analysis; mining brain networks; mining data streams; mining multi-media data; mining high-dimensional data; mining text, Web, and semi-structured data; mining spatial and temporal data; data mining for community generation, social network analysis, and graph structured data; security and privacy issues in data mining; visual, interactive and online data mining; pre-processing and post-processing for data mining; robust and scalable statistical methods; data mining languages; foundations of data mining; KDD framework and process; and novel applications and infrastructures exploiting data mining technology, including massively parallel processing and cloud computing platforms. TKDD encourages papers that explore the above subjects in the context of large distributed networks of computers, parallel or multiprocessing computers, or new data devices. TKDD also encourages papers that describe emerging data mining applications that cannot be satisfied by the current data mining technology.