Comparative Study of Missing Value Imputation Techniques on E-Commerce Product Ratings

Impact Factor: 3.3 | Zone 4 (Computer Science) | Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS) | Informatica | Pub Date: 2023-06-05 | DOI: 10.31449/inf.v47i3.4156
Dimple Chehal, Parul Gupta, Payal Gulati, Tanisha Gupta

Abstract

Missing data is a common occurrence in practically all studies, and it adds a layer of ambiguity to data interpretation. Missing values in a dataset mean a loss of important information, and they are among the most common data quality issues. Missing values are values that are not present in the dataset; they are usually written as NaNs, blanks, or other placeholders. Missing values create imbalanced observations and biased estimates, and sometimes lead to misleading results. The majority of real-world datasets have missing values. As a result, they must be handled appropriately to deliver an efficient and valid analysis. By filling in the missing values, a complete dataset can be created and the challenge of dealing with complex patterns of missingness can be avoided. Missing values can be of both continuous and categorical types. To get more precise results, a variety of techniques can be used to fill in missing values. In the present study, nine imputation methods were compared: Simple Imputer, Last Observation Carried Forward (LOCF), KNN Imputation (KNN), Hot Deck, Linear Regression, MissForest, Random Forest Regression, DataWig, and Multivariate Imputation by Chained Equations (MICE). The comparison was performed on an Amazon real-time dataset using three evaluation criteria: R-squared (R²), mean squared error (MSE), and mean absolute error (MAE). For R-squared, which ranges from 0 to 1, KNN had the best outcomes, while DataWig had the worst. In terms of MSE and MAE, the hot deck imputation approach fared best, whereas MissForest performed worst on MAE. The hot deck imputation method appears to be of interest and merits further investigation in practice.
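The kind of comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it uses synthetic 1-to-5 ratings rather than the Amazon dataset, covers only three of the nine methods (mean imputation standing in for a simple imputer, LOCF, and a random hot deck), and every function name here is hypothetical. Values are hidden at random, each method fills them in, and MSE, MAE, and R² are computed on the hidden positions only.

```python
import random


def mean_impute(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def locf_impute(values):
    """Last observation carried forward.

    Leading gaps are filled with the first observed value (an assumption;
    other conventions exist). Requires at least one observed value.
    """
    last = next(v for v in values if v is not None)
    out = []
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def hot_deck_impute(values, rng):
    """Random hot deck: draw each missing value from the observed donors."""
    donors = [v for v in values if v is not None]
    return [rng.choice(donors) if v is None else v for v in values]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic ratings (assumption): 200 values, 40 hidden at random.
    truth = [float(rng.randint(1, 5)) for _ in range(200)]
    hidden = sorted(rng.sample(range(len(truth)), 40))
    hidden_set = set(hidden)
    observed = [None if i in hidden_set else v for i, v in enumerate(truth)]

    results = {
        "mean": mean_impute(observed),
        "locf": locf_impute(observed),
        "hot_deck": hot_deck_impute(observed, rng),
    }
    y_true = [truth[i] for i in hidden]
    for name, imputed in results.items():
        y_pred = [imputed[i] for i in hidden]
        print(name, mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Scoring only the hidden positions keeps the observed values from inflating the metrics. Note that R² computed this way on held-out values can fall below 0 when a method predicts worse than the mean of the true values, so the 0-to-1 range quoted above does not necessarily hold in this setup.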