Comparative Study of Missing Value Imputation Techniques on E-Commerce Product Ratings

Informatica | Pub Date: 2023-06-05 | DOI: 10.31449/inf.v47i3.4156 | IF 3.3 | JCR Q2 (Computer Science, Information Systems) | CAS Zone 4 (Computer Science)

Dimple Chehal, Parul Gupta, Payal Gulati, Tanisha Gupta

Abstract

Missing data is common in practically all studies and adds a layer of ambiguity to data interpretation. Missing values in a dataset mean the loss of important information, and they are among the most common data quality issues. Missing values are values that are not present in the dataset; they are usually written as NaNs, blanks, or other placeholders. They create imbalanced observations and biased estimates, and sometimes lead to misleading results. The majority of real-world datasets have missing values. To deliver an efficient and valid analysis, they must therefore be handled appropriately. By filling in the missing values, a complete dataset can be created and the challenge of dealing with complex patterns of missingness can be avoided. Missing values can be of both continuous and categorical types. To get more precise results, a variety of techniques can be used to fill in missing values. In the present study, nine imputation methods were compared: Simple Imputer, Last Observation Carried Forward (LOCF), KNN Imputation (KNN), Hot Deck, Linear Regression, MissForest, Random Forest Regression, DataWig, and Multivariate Imputation by Chained Equations (MICE). The comparison was performed on a real-time Amazon dataset using three evaluation criteria: R-squared (R²), mean squared error (MSE), and mean absolute error (MAE). For R², KNN had the best outcomes while DataWig had the worst; the R-squared value ranges from 0 to 1. In terms of MSE and MAE, the hot deck imputation approach fared best, whereas MissForest performed worst (MAE). The hot deck imputation method appears to be of interest and merits further investigation in practice.
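The evaluation described in the abstract (hide some known ratings, impute them, then score the imputed values against the hidden truth) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the ratings are synthetic rather than the Amazon dataset, only three of the nine methods appear, the function names are hypothetical, and the hot deck variant shown is a simple random hot deck, which may differ from the variant used in the paper.

```python
import random
from statistics import mean


def mean_impute(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def locf_impute(values):
    """Last observation carried forward; leading gaps use the first observed value."""
    last = next(v for v in values if v is not None)
    filled = []
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def random_hot_deck_impute(values, rng):
    """Random hot deck (assumed variant): copy a randomly chosen observed donor value."""
    donors = [v for v in values if v is not None]
    return [rng.choice(donors) if v is None else v for v in values]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    centre = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - centre) ** 2 for t in y_true)
    return float("nan") if ss_tot == 0 else 1 - ss_res / ss_tot


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic 1-5 star ratings standing in for real product ratings.
    truth = [rng.randint(1, 5) for _ in range(200)]
    hidden = sorted(rng.sample(range(len(truth)), 40))  # positions to mask
    hidden_set = set(hidden)
    observed = [None if i in hidden_set else r for i, r in enumerate(truth)]

    methods = {
        "mean": mean_impute,
        "LOCF": locf_impute,
        "random hot deck": lambda vals: random_hot_deck_impute(vals, rng),
    }
    y_true = [truth[i] for i in hidden]
    for name, impute in methods.items():
        filled = impute(observed)
        y_pred = [filled[i] for i in hidden]  # score only the masked positions
        print(f"{name}: R2={r2(y_true, y_pred):.3f} "
              f"MSE={mse(y_true, y_pred):.3f} MAE={mae(y_true, y_pred):.3f}")
```

Scoring only the masked positions keeps the untouched observed values from inflating the metrics. Note that with the definition of R² used in this sketch, the value can fall below 0 when the imputed values fit worse than a constant mean prediction, so results on purely random synthetic ratings should not be read as a comparison of the methods themselves.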

Source journal

Informatica (Engineering and Technology; Computer Science: Information Systems)

CiteScore: 5.90

Self-citation rate: 6.90%

Review time: 12 months

Journal description: The quarterly journal Informatica provides an international forum for high-quality original research and publishes papers on mathematical simulation and optimization, recognition and control, programming theory and systems, and automation systems and elements. Informatica provides a multidisciplinary forum for scientists and engineers involved in research and design, including experts who implement and manage information systems applications.