Impact of Missing Data on Correlation Coefficient Values: Deletion and Imputation Methods for Data Preparation

Malaysian Journal of Fundamental and Applied Sciences (IF 0.8, JCR Q3, Multidisciplinary Sciences) | Pub Date: 2023-12-04 | DOI: 10.11113/mjfas.v19n6.3098
Mohamed Shantal, Z. Othman, Azuraliza Abu Bakar
Citations: 0

Abstract

The correlation coefficient is one of the essential statistical techniques used to discover relationships among variables. Depending on the data type, correlation can be quantified with Pearson's, Spearman's, or Kendall's correlation coefficient. As with any use of data, missing values reduce the amount of data available and can affect the results. Furthermore, removing records with missing values, as in complete case analysis or available case analysis, may introduce selection bias. In this paper, we investigate the impact of missing data on the correlation coefficient by calculating the difference between the correlation coefficients of the original complete dataset and those of a dataset with missing data. Two deletion strategies (Listwise and Pairwise) and three imputation strategies (Mean, k-Nearest Neighbors (k-NN), and Expectation-Maximization) were used to prepare the data before calculating the correlation coefficient. The unique correlation coefficient values were converted to a one-dimensional array, and the root-mean-square error (RMSE) was used to evaluate the experiments. Eight UCI and Kaggle datasets with different sizes and numbers of attributes were used in this study. The experimental results demonstrate that the Pairwise strategy, among the deletion methods, and k-NN, among the imputation methods, give good results on the correlation coefficient when the missing rate is moderate or lower. Pairwise uses all the available values and discards only the missing values of the related attribute, while k-NN fills the missing values with new values that produce correlation coefficient values close to the actual values.
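
The evaluation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy data, the function names, and the reading of "unique correlation coefficient values" as the upper triangle of the correlation matrix are all assumptions. It covers only Pearson correlation with Listwise deletion, Pairwise deletion, and Mean imputation; k-NN and Expectation-Maximization are left out.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of floats."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def unique_correlations(columns, mode="listwise"):
    """Flat list of correlations, one per unordered attribute pair.

    columns maps attribute name -> list of values, with None for missing.
    "listwise" drops every row that has any missing value;
    "pairwise" drops, for each pair, only rows missing in that pair.
    (Assumption: "unique values" means the upper triangle of the matrix.)
    """
    names = list(columns)
    n_rows = len(columns[names[0]])
    complete_rows = [
        i for i in range(n_rows)
        if all(columns[c][i] is not None for c in names)
    ]
    result = []
    for a, b in combinations(names, 2):
        if mode == "pairwise":
            rows = [
                i for i in range(n_rows)
                if columns[a][i] is not None and columns[b][i] is not None
            ]
        else:
            rows = complete_rows
        result.append(pearson([columns[a][i] for i in rows],
                              [columns[b][i] for i in rows]))
    return result


def mean_impute(columns):
    """Replace each missing value with the mean of its attribute."""
    filled = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        mean = sum(observed) / len(observed)
        filled[name] = [mean if v is None else v for v in values]
    return filled


def rmse(estimated, reference):
    """Root-mean-square error between two equal-length lists."""
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy dataset; the paper uses eight UCI and Kaggle datasets.
complete = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "z": [5.0, 3.0, 4.0, 1.0, 2.0, 0.5],
}
# Same data with two values of "z" removed.
with_missing = dict(complete, z=[5.0, None, 4.0, 1.0, None, 0.5])

reference = unique_correlations(complete)
listwise = unique_correlations(with_missing, mode="listwise")
pairwise = unique_correlations(with_missing, mode="pairwise")
mean_filled = unique_correlations(mean_impute(with_missing))

print(f"Listwise RMSE: {rmse(listwise, reference):.4f}")
print(f"Pairwise RMSE: {rmse(pairwise, reference):.4f}")
print(f"Mean imputation RMSE: {rmse(mean_filled, reference):.4f}")
```

In this toy case only `z` has gaps, so Pairwise deletion still uses all six rows for the `x`-`y` pair and reproduces that coefficient exactly, whereas Listwise deletion recomputes every coefficient from the four complete rows. That is the behaviour the abstract attributes to Pairwise: it discards only the missing values of the related attribute.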