Authors: Theeraphat Thanwiset, W. Srisodaphol
Journal: Sains Malaysiana
Publication date: 2023-09-30
DOI: https://doi.org/10.17576/jsm-2023-5209-20
Statistical Methods for Finding Outliers in Multivariate Data Using a Boxplot and Multiple Linear Regression
This study proposes a method for detecting outliers in multivariate data based on a boxplot and multiple linear regression. In our proposed method, the boxplot is first applied to every variable to split the data set into two sets: normal data (observations lying within the lower and upper fences of the boxplot) and data that could be outliers. The normal data are then used to fit a multiple linear regression model, and the maximum residual error of that model serves as the cut-off point. To evaluate the proposed method, a simulation study was conducted on multivariate normal data with and without contamination at various levels. The proposed method was compared with existing methods, namely the Mahalanobis distance and the Mahalanobis distance with robust estimators obtained by the minimum volume ellipsoid method, the minimum covariance determinant method, and the minimum vector variance method. The results showed that the proposed method outperformed the compared methods at all contamination levels. When applied to real data, the proposed method identified outliers that were consistent with the real data.
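The two-stage procedure described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the abstract does not say which variable is the regression response, what fence multiplier is used, or whether residuals are taken in absolute value. The sketch assumes the conventional 1.5 × IQR fences, an arbitrarily chosen response column, and absolute residuals. The names `boxplot_fences` and `detect_outliers` are hypothetical.

```python
import numpy as np


def boxplot_fences(X, k=1.5):
    """Per-variable lower and upper boxplot fences (assumed Q1 - k*IQR, Q3 + k*IQR)."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def detect_outliers(X, response=0, k=1.5):
    """Flag rows of X as outliers; returns (boolean flags, residual cut-off).

    Assumption: column `response` is regressed on all other columns.
    """
    X = np.asarray(X, dtype=float)

    # Stage 1: rows inside the fences on every variable are treated as normal.
    lower, upper = boxplot_fences(X, k)
    inside = np.all((X >= lower) & (X <= upper), axis=1)

    # Stage 2: fit multiple linear regression (with intercept) on normal rows only.
    y = X[:, response]
    A = np.column_stack([np.ones(len(X)), np.delete(X, response, axis=1)])
    beta, *_ = np.linalg.lstsq(A[inside], y[inside], rcond=None)

    # Cut-off: largest absolute residual observed among the normal rows.
    resid = np.abs(y - A @ beta)
    cutoff = resid[inside].max()

    # A candidate row is declared an outlier if its residual exceeds the cut-off.
    flags = ~inside & (resid > cutoff)
    return flags, cutoff


# Small synthetic demonstration (made-up data, not from the paper).
rng = np.random.default_rng(0)
predictors = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * predictors[:, 0] - predictors[:, 1] + rng.normal(scale=0.1, size=200)
data = np.column_stack([y, predictors])
data[0, 0] = 100.0  # plant one gross outlier in the response

flags, cutoff = detect_outliers(data, response=0)
print("flagged rows:", np.flatnonzero(flags), "cut-off:", cutoff)
```

Under this reading, a row outside the fences is only a candidate: it is kept as normal if the regression fitted on the clean rows still predicts it within the cut-off.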
About the journal:
Sains Malaysiana is a refereed journal committed to the advancement of scholarly knowledge and research findings in the various branches of science and technology. It contains articles on Earth Sciences, Health Sciences, Life Sciences, Mathematical Sciences and Physical Sciences. The journal publishes articles, reviews, and research notes whose content and approach are of interest to a wide range of scholars. Sains Malaysiana is published by the UKM Press, and its autonomous Editorial Board is drawn from the Faculty of Science and Technology, Universiti Kebangsaan Malaysia. In addition, distinguished scholars from local and foreign universities are appointed to serve as advisory board members and referees.