A novel and efficient risk minimisation-based missing value imputation algorithm

IF 7.6 | Zone 1, Computer Science | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Knowledge-Based Systems | Pub Date: 2024-11-25 | Epub Date: 2024-08-28 | DOI: 10.1016/j.knosys.2024.112435

Yu-Lin He, Jia-Yin Yu, Xu Li, Philippe Fournier-Viger, Joshua Zhexue Huang

Knowledge-Based Systems, vol. 304, Article 112435. Journal Article. https://www.sciencedirect.com/science/article/pii/S0950705124010694


Abstract

Missing value imputation (MVI) is a key task in data science, where learning models must be built from incomplete data. In contrast to externally driven MVI algorithms, this study proposes a novel risk minimisation-based MVI algorithm (RM-MVI) that considers both the internal characteristics of the missing data and the external performance on a specific classification application. RM-MVI is designed for labelled data and proceeds in two stages: filling with structural risk minimisation (SRM) and refining with empirical risk minimisation (ERM). In the filling stage, an autoencoder with a single hidden layer is trained on the part of the original dataset that has no missing values. The missing values are first initialised with random numbers, and the imputation values are then preliminarily optimised with a derived updating rule that minimises a structural risk-oriented objective function. In the refining stage, a neural-network-based classifier is trained to fine-tune the imputation values further by reducing the empirical risk. Experiments were conducted on several benchmark datasets to validate the feasibility, rationality, and effectiveness of RM-MVI. The results show that (1) the optimisation processes of the imputation values under both SRM and ERM converge, so optimised imputation values can be obtained; (2) SRM ensures the distribution consistency of the imputation values preliminarily optimised in the filling stage, while ERM fine-tunes them in the refining stage, which makes them more helpful for classifier training; and (3) RM-MVI yields considerably better MVI performance on the benchmark datasets than 11 well-known MVI algorithms, for example a 26% higher distribution consistency ratio and 2% to 5% higher testing accuracies for 6 classifiers on average. These results demonstrate that RM-MVI is a viable approach for addressing MVI problems.
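The two-stage idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the structural risk objective or the derived updating rule, so the sketch assumes a plain squared reconstruction error for the filling stage, a softmax layer as a stand-in for the neural-network-based classifier, an alternating update schedule in the refining stage, and synthetic data. All function names, hyperparameters, and the data are hypothetical.

```python
import numpy as np

def train_autoencoder(X, hidden=3, lr=0.05, epochs=3000, seed=0):
    """Single-hidden-layer autoencoder (tanh hidden, linear output), full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        R = H @ W2 + b2 - X                      # reconstruction residual
        dH = (R @ W2.T) * (1.0 - H ** 2)         # back-propagate through tanh
        W2 -= lr * (H.T @ R) / n
        b2 -= lr * R.mean(axis=0)
        W1 -= lr * (X.T @ dH) / n
        b1 -= lr * dH.mean(axis=0)
    return W1, b1, W2, b2

def reconstruction_error(X, params):
    W1, b1, W2, b2 = params
    R = np.tanh(X @ W1 + b1) @ W2 + b2 - X
    return 0.5 * float(np.mean(np.sum(R ** 2, axis=1)))

def fill(X_incomplete, mask, params, lr=0.05, steps=500, seed=0):
    """Filling-stage sketch: random initialisation, then gradient descent on the
    missing entries only, with the autoencoder weights held fixed.
    ASSUMPTION: squared reconstruction error replaces the paper's SRM objective."""
    W1, b1, W2, b2 = params
    rng = np.random.default_rng(seed)
    X = X_incomplete.copy()
    X[mask] = rng.normal(size=int(mask.sum()))
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)
        R = H @ W2 + b2 - X
        # Gradient of 0.5 * ||AE(x) - x||^2 with respect to x: J^T r - r
        grad = ((R @ W2.T) * (1.0 - H ** 2)) @ W1.T - R
        X[mask] -= lr * grad[mask]
    return X

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def refine(X_filled, y, mask, n_classes, lr_w=0.1, lr_x=0.05, epochs=200):
    """Refining-stage sketch: alternate classifier updates with updates of the
    imputed entries, both reducing cross-entropy (the empirical risk).
    ASSUMPTION: softmax layer and alternating schedule are illustrative choices."""
    n, d = X_filled.shape
    X = X_filled.copy()
    Y = np.eye(n_classes)[y]
    W = np.zeros((d, n_classes)); b = np.zeros(n_classes)
    for _ in range(epochs):
        G = softmax(X @ W + b) - Y               # gradient w.r.t. logits, per sample
        grad_X = G @ W.T                         # per-sample input gradient (1/n absorbed in lr_x)
        W -= lr_w * (X.T @ G) / n
        b -= lr_w * G.mean(axis=0)
        X[mask] -= lr_x * grad_X[mask]
    return X, (W, b)

# Synthetic, hypothetical data: two classes, four standardised features.
rng = np.random.default_rng(42)
n = 200
y = rng.integers(0, 2, size=n)
X_true = rng.normal(size=(n, 4)) + 2.0 * y[:, None]
X_true[:, 3] = X_true[:, 0] + X_true[:, 1] + 0.1 * rng.normal(size=n)
X_true = (X_true - X_true.mean(axis=0)) / X_true.std(axis=0)
mask = rng.random(X_true.shape) < 0.2            # True marks a missing entry
X_incomplete = np.where(mask, np.nan, X_true)

complete_rows = ~mask.any(axis=1)                # rows without missing values
params = train_autoencoder(X_incomplete[complete_rows])
X_filled = fill(X_incomplete, mask, params)
X_refined, clf = refine(X_filled, y, mask, n_classes=2)
print("reconstruction error after filling:", reconstruction_error(X_filled, params))
```

In both stages only the entries flagged in `mask` are updated, so observed values are never altered; the model weights are fixed during filling and trained jointly with the imputed values during refining in this sketch.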


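Source journal: Knowledge-Based Systems (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 14.80

Self-citation rate: 12.50%

Articles published: 1245

Review time: 7.8 months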

About the journal: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.