It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article, we show that the optimal training/testing splitting ratio is $\sqrt{p}:1$, where $p$ is the number of parameters in a linear regression model that explains the data well.
{"title":"Optimal ratio for data splitting","authors":"V. R. Joseph","doi":"10.1002/sam.11583","DOIUrl":"https://doi.org/10.1002/sam.11583","url":null,"abstract":"It is common to split a dataset into training and testing sets before fitting a statistical or machine learning model. However, there is no clear guidance on how much data should be used for training and testing. In this article, we show that the optimal training/testing splitting ratio is p:1$$ sqrt{p}:1 $$ , where p$$ p $$ is the number of parameters in a linear regression model that explains the data well.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125405067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal data arise in various applications where information about the same phenomenon is acquired from multiple sensors and across different imaging modalities. Learning from multimodal data is of great interest in machine learning and statistics research, as it offers the possibility of capturing complementary information among modalities. Multimodal modeling helps to explain the interdependence between heterogeneous data sources, discovers new insights that may not be available from a single modality, and improves decision‐making. Recently, coupled matrix–tensor factorization has been introduced for multimodal data fusion to jointly estimate latent factors and identify complex interdependence among them. However, most of the prior work on coupled matrix–tensor factorization focuses on unsupervised learning, and there is little work on supervised learning using the jointly estimated latent factors. This paper considers the multimodal tensor data classification problem. A coupled support tensor machine (C‐STM), built upon the latent factors jointly estimated by an advanced coupled matrix–tensor factorization, is proposed. C‐STM combines individual and shared latent factors with multiple kernels and estimates a maximal‐margin classifier for coupled matrix–tensor data. The classification risk of C‐STM is shown to converge to the optimal Bayes risk, making it a statistically consistent rule. C‐STM is validated through simulation studies as well as a simultaneous analysis of electroencephalography and functional magnetic resonance imaging data. The empirical evidence shows that C‐STM can utilize information from multiple sources and provide better classification performance than traditional single‐mode classifiers.
{"title":"Coupled support tensor machine classification for multimodal neuroimaging data","authors":"L. Peide, Seyyid Emre Sofuoglu, T. Maiti, Selin Aviyente","doi":"10.1002/sam.11587","DOIUrl":"https://doi.org/10.1002/sam.11587","url":null,"abstract":"Multimodal data arise in various applications where information about the same phenomenon is acquired from multiple sensors and across different imaging modalities. Learning from multimodal data is of great interest in machine learning and statistics research as this offers the possibility of capturing complementary information among modalities. Multimodal modeling helps to explain the interdependence between heterogeneous data sources, discovers new insights that may not be available from a single modality, and improves decision‐making. Recently, coupled matrix–tensor factorization has been introduced for multimodal data fusion to jointly estimate latent factors and identify complex interdependence among the latent factors. However, most of the prior work on coupled matrix–tensor factors focuses on unsupervised learning and there is little work on supervised learning using the jointly estimated latent factors. This paper considers the multimodal tensor data classification problem. A coupled support tensor machine (C‐STM) built upon the latent factors jointly estimated from the advanced coupled matrix–tensor factorization is proposed. C‐STM combines individual and shared latent factors with multiple kernels and estimates a maximal‐margin classifier for coupled matrix–tensor data. The classification risk of C‐STM is shown to converge to the optimal Bayes risk, making it a statistically consistent rule. C‐STM is validated through simulation studies as well as a simultaneous analysis on electroencephalography with functional magnetic resonance imaging data. The empirical evidence shows that C‐STM can utilize information from multiple sources and provide a better classification performance than traditional single‐mode classifiers.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114211773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recognizing COVID‐19 patients at a greater risk of mortality helps medical staff identify who will benefit from more intensive care. We developed and validated prediction models for two‐week mortality of inpatients with COVID‐19 infection based on clinical predictors. A prospective cohort study was started in February 2020 and is still ongoing. In total, 57,705 inpatients with both a positive reverse transcription‐polymerase chain reaction test and positive chest CT findings for COVID‐19 were included. The outcome was mortality within 2 weeks of admission. Three prognostic models were developed for young, adult, and senior patients. Data from the capital province (Tehran) of Iran were used for validation, and data from all other provinces were used for development of the models. The Young model fitted the data well (p < 0.001, Nagelkerke R² = 0.697, C‐statistic = 0.88), and the Adult (p < 0.001, Nagelkerke R² = 0.340, C‐statistic = 0.70) and Senior (p < 0.001, Nagelkerke R² = 0.208, C‐statistic = 0.68) models were also significant. Intubation, oxygen saturation below 93%, impaired consciousness, acute respiratory distress syndrome, and cancer treatment were major risk factors. Elderly people were at greater risk of mortality. Young patients with a history of hypertension, vomiting, and fever, and adults with diabetes mellitus and cardiovascular disease, had a higher mortality risk. Young patients with myalgia, and adult patients with nausea, anorexia, and headache, showed a lower risk of mortality than others.
{"title":"Development and validation of models for two‐week mortality of inpatients with COVID‐19 infection: A large prospective cohort study","authors":"M. Fathi, N. M. Moghaddam, L. Kheyrati","doi":"10.1002/sam.11572","DOIUrl":"https://doi.org/10.1002/sam.11572","url":null,"abstract":"Recognizing COVID‐19 patients at a greater risk of mortality assists medical staff to identify who benefits from more serious care. We developed and validated prediction models for two‐week mortality of inpatients with COVID‐19 infection based on clinical predictors. A prospective cohort study was started in February 2020 and is still continuing. In total, 57,705 inpatients with both a positive reverse transcription‐polymerase chain reaction test and positive chest CT findings for COVID‐19 were included. The outcome was mortality within 2 weeks of admission. Three prognostic models were developed for young, adult, and senior patients. Data from the capital province (Tehran) of Iran were used for validation, and data from all other provinces were used for development of the models. The model Young, was well‐fitted to the data (p < 0.001, Nagelkerke R2 = 0.697, C‐statistics = 0.88) and the models Adult (p < 0.001, Nagelkerke R2 = 0.340, C‐statistics = 0.70) and Senior (p < 0.001, Nagelkerke R2 = 0.208, C‐statistics = 0.68) were also significant. Intubation, saturated O2 < 93%, impaired consciousness, acute respiratory distress syndrome, and cancer treatment were major risk factors. Elderly people were at greater risk of mortality. Young patients with a history of blood hypertension, vomiting, and fever; and adults with diabetes mellitus and cardiovascular disease had more mortality risk. Young people with myalgia; and adult patients with nausea, anorexia, and headache showed less risk of mortality than others.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"8 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129175910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recommender systems are information filtering tools that seek to match customers with products or services of interest. Most of the prevalent collaborative filtering recommender systems, such as matrix factorization and AutoRec, suffer from the “cold‐start” problem: they fail to provide meaningful recommendations for new users or new items because the relevant information is missing from the training data. To address this problem, we propose a weighted AutoEncoding model that leverages information from other users or items sharing similar characteristics. The proposed method provides an effective strategy for borrowing strength from user‐ or item‐specific clustering structure as well as pairwise similarity in the training data, while achieving high computational efficiency and dimension reduction and preserving nonlinear relationships between user preferences and item features. Simulation studies and applications to three real datasets show that the proposed model offers better prediction accuracy than current state‐of‐the‐art approaches.
{"title":"Weighted AutoEncoding recommender system","authors":"Shuying Zhu, Weining Shen, Annie Qu","doi":"10.1002/sam.11571","DOIUrl":"https://doi.org/10.1002/sam.11571","url":null,"abstract":"Recommender systems are information filtering tools that seek to match customers with products or services of interest. Most of the prevalent collaborative filtering recommender systems, such as matrix factorization and AutoRec, suffer from the “cold‐start” problem, where they fail to provide meaningful recommendations for new users or new items due to informative‐missing from the training data. To address this problem, we propose a weighted AutoEncoding model to leverage information from other users or items that share similar characteristics. The proposed method provides an effective strategy for borrowing strength from user or item‐specific clustering structure as well as pairwise similarity in the training data, while achieving high computational efficiency and dimension reduction, and preserving nonlinear relationships between user preferences and item features. Simulation studies and applications to three real datasets show advantages in prediction accuracy of the proposed model compared to current state‐of‐the‐art approaches.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122942317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article describes an analytical method for comparing geographical sites and transferring fog forecasting models, trained with data mining techniques at one site, across Italian airports. The portability method uses a specific intersite similarity measure based on the Euclidean distance between the performance vectors associated with each airport site. Performance vectors are useful for characterizing geographical sites; the components of a performance vector are the performance metrics of an ensemble descriptive model. In the tests carried out, the comparison method provided very promising results, and the forecast model, when applied and evaluated on a new compatible site, showed only a small decrease in performance. The portability scheme provides a meta‐learning methodology for applying predictive models to new sites where a new model cannot be trained from scratch owing to class imbalance or a lack of training data. The methodology also offers a measure for clustering geographical sites and extending weather knowledge from one site to another.
{"title":"Portability analysis of data mining models for fog events forecasting","authors":"G. Zazzaro","doi":"10.1002/sam.11568","DOIUrl":"https://doi.org/10.1002/sam.11568","url":null,"abstract":"This article describes an analytical method for comparing geographical sites and transferring fog forecasting models, trained by Data Mining techniques on a fixed site, across Italian airports. This portability method uses a specific intersite similarity measure based on the Euclidean distance between the performance vectors associated with each airport site. Performance vectors are useful for characterizing geographical sites. The components of a performance vector are the performance metrics of an Ensemble descriptive model. In the tests carried out, the comparison method provided very promising results, and the forecast model, when applied and evaluated on a new compatible site, shows only a small decrease in performance. The portability schema provides a meta‐learning methodology for applying predictive models to new sites where a new model cannot be trained from scratch owing to the class imbalance problem or the lack of data for a specific learning. The methodology offers a measure for clustering geographical sites and extending weather knowledge from one site to another.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131743632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Handwriting analysis is conducted by forensic document examiners who are able to visually recognize characteristics of writing to evaluate the evidence of writership. Recently, there have been incentives to investigate how to quantify the similarity between two written documents to support the conclusions drawn by experts. We use an automatic algorithm from the “handwriter” package in R to decompose a handwritten sample into small graphical units of writing. These graphs are sorted into 40 exemplar groups, or clusters. We hypothesize that the frequency with which a person contributes graphs to each cluster is characteristic of their handwriting. Given two questioned handwritten documents, we can then use the vectors of cluster frequencies to quantify the similarity between the two documents. We extract features from the difference between the vectors and combine them using a random forest. The output from the random forest is used as the similarity score to compare documents. We estimate the distributions of the similarity scores computed from multiple pairs of documents known to have been written by the same and by different persons, and use these estimated densities to obtain score‐based likelihood ratios (SLRs) that rely on different assumptions. We find that the SLRs are able to indicate whether the similarity observed between two documents is more likely under the same‐writer or the different‐writer hypothesis.
{"title":"Handwriting identification using random forests and score‐based likelihood ratios","authors":"M. Q. Johnson, Danica M. Ommen","doi":"10.1002/sam.11566","DOIUrl":"https://doi.org/10.1002/sam.11566","url":null,"abstract":"Handwriting analysis is conducted by forensic document examiners who are able to visually recognize characteristics of writing to evaluate the evidence of writership. Recently, there have been incentives to investigate how to quantify the similarity between two written documents to support the conclusions drawn by experts. We use an automatic algorithm within the “handwriter” package in R, to decompose a handwritten sample into small graphical units of writing. These graphs are sorted into 40 exemplar groups or clusters. We hypothesize that the frequency with which a person contributes graphs to each cluster is characteristic of their handwriting. Given two questioned handwritten documents, we can then use the vectors of cluster frequencies to quantify the similarity between the two documents. We extract features from the difference between the vectors and combine them using a random forest. The output from the random forest is used as the similarity score to compare documents. We estimate the distributions of the similarity scores computed from multiple pairs of documents known to have been written by the same and by different persons, and use these estimated densities to obtain score‐based likelihood ratios (SLRs) that rely on different assumptions. We find that the SLRs are able to indicate whether the similarity observed between two documents is more or less likely depending on writership.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124842427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Missing data in regression models is one of the most widely studied topics in statistics. In this paper, we propose a class of efficient importance sampling imputation algorithms (EIS) for quantile and composite quantile regression with missing covariates: an EIS algorithm for quantile regression (EISQ) and three extensions for composite quantile regression (EISCQ). Our EISQ uses an interior point (IP) approach, while the EISCQ algorithms use IP and two other well‐known approaches: majorize‐minimization (MM) and coordinate descent (CD). The proposed EIS algorithms aim to decrease the variance of the estimators while relieving the computational burden, improving both the statistical and the computational efficiency of the coefficient estimators. To compare our EIS algorithms with existing competitors, including complete‐case analysis and multiple imputation, the paper carries out a series of simulation studies with different sample sizes and different missing rates under different missing‐mechanism models. Finally, we apply all the algorithms to part of the examination data in the National Health and Nutrition Examination Survey.
{"title":"Efficient importance sampling imputation algorithms for quantile and composite quantile regression","authors":"Haoyang Cheng","doi":"10.1002/sam.11565","DOIUrl":"https://doi.org/10.1002/sam.11565","url":null,"abstract":"Nowadays, missing data in regression model is one of the most well‐known topics. In this paper, we propose a class of efficient importance sampling imputation algorithms (EIS) for quantile and composite quantile regression with missing covariates. They are an EIS in quantile regression (EISQ) and its three extensions in composite quantile regression (EISCQ). Our EISQ uses an interior point (IP) approach, while EISCQ algorithms use IP and other two well‐known approaches: Majorize‐minimization (MM) and coordinate descent (CD). The aims of our proposed EIS algorithms are to decrease estimated variances and relieve computational burden at the same time, which improves the performances of coefficients estimators in both estimated and computational efficiencies. To compare our EIS algorithms with other existing competitors including complete cases analysis and multiple imputation, the paper carries out a series of simulation studies with different sample sizes and different levels of missing rates under different missing mechanism models. Finally, we apply all the algorithms to part of the examination data in National Health and Nutrition Examination Survey.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115452235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While many survival models have been proposed, the Cox model and the proportional odds model are among the most popular. Both are special cases of the linear transformation model. The linear transformation model typically assumes a linear function of the covariates, which may not reflect the complex relationship between covariates and survival outcomes. A nonlinear functional form can also be specified in the linear transformation model; nonetheless, the underlying functional form is unknown, and misspecifying it leads to biased estimates and reduced prediction accuracy. To address this issue, we develop a neural‐network transformation model. Similar to neural networks, the neural‐network transformation model uses its hierarchical structure to learn complex features from simpler ones and is capable of approximating the underlying functional form of the covariates. It also inherits advantages from the linear transformation model, making it applicable to both time‐to‐event analyses and recurrent event analyses. Simulations demonstrate that the neural‐network transformation model outperforms the linear transformation model in estimation and prediction accuracy when the covariate effects are nonlinear. The advantage of the new model over the linear transformation model is also illustrated in two real applications.
{"title":"Neural‐network transformation models for counting processes","authors":"Rongzi Liu, Chenxi Li, Qing Lu","doi":"10.1002/sam.11564","DOIUrl":"https://doi.org/10.1002/sam.11564","url":null,"abstract":"While many survival models have been invented, the Cox model and the proportional odds model are among the most popular ones. Both models are special cases of the linear transformation model. The linear transformation model typically assumes a linear function on covariates, which may not reflect the complex relationship between covariates and survival outcomes. Nonlinear functional form can also be specified in the linear transformation model. Nonetheless, the underlying functional form is unknown and mis‐specifying it leads to biased estimates and reduced prediction accuracy of the model. To address this issue, we develop a neural‐network transformation model. Similar to neural networks, the neural‐network transformation model uses its hierarchical structure to learn complex features from simpler ones and is capable of approximating the underlying functional form of covariates. It also inherits advantages from the linear transformation model, making it applicable to both time‐to‐event analyses and recurrent event analyses. Simulations demonstrate that the neural‐network transformation model outperforms the linear transformation model in terms of estimation and prediction accuracy when the covariate effects are nonlinear. The advantage of the new model over the linear transformation model is also illustrated via two real applications.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122063264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear mixed models are widely used for analyzing longitudinal datasets, and inference for the variance component parameters often relies on the bootstrap. However, health systems and technology companies routinely generate massive longitudinal datasets that make the traditional bootstrap infeasible. To solve this problem, we extend the highly scalable bag of little bootstraps method for independent data to longitudinal data and develop a highly efficient Julia package, MixedModelsBLB.jl. Simulation experiments and real data analysis demonstrate the favorable statistical performance and computational advantages of our method compared to the traditional bootstrap. For the statistical inference of variance components, it achieves a 200-fold speedup at the scale of 1 million subjects (20 million total observations) and is the only currently available tool that can handle more than 10 million subjects (200 million total observations) on desktop computers.
{"title":"Bag of little bootstraps for massive and distributed longitudinal data","authors":"Xinkai Zhou, Jin J. Zhou, Hua Zhou","doi":"10.1002/sam.11563","DOIUrl":"https://doi.org/10.1002/sam.11563","url":null,"abstract":"Linear mixed models are widely used for analyzing longitudinal datasets, and the inference for variance component parameters relies on the bootstrap method. However, health systems and technology companies routinely generate massive longitudinal datasets that make the traditional bootstrap method infeasible. To solve this problem, we extend the highly scalable bag of little bootstraps method for independent data to longitudinal data and develop a highly efficient Julia package MixedModelsBLB.jl. Simulation experiments and real data analysis demonstrate the favorable statistical performance and computational advantages of our method compared to the traditional bootstrap method. For the statistical inference of variance components, it achieves 200 times speedup on the scale of 1 million subjects (20 million total observations), and is the only currently available tool that can handle more than 10 million subjects (200 million total observations) using desktop computers.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121133935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern applications generate large amounts of data in which anomalies occur naturally for many reasons, due to both hardware and humans. It is therefore necessary to develop efficient detection tools that are easily adaptable to various data. This paper presents an innovative use of classical statistical tools to detect outliers in multidimensional data sets. The proposed approach applies well‐known statistical methods in a new way and achieves a high level of efficiency through multi‐level aggregation. The effectiveness of the proposed method is demonstrated by a series of numerical experiments.
{"title":"Intuitively adaptable outlier detector","authors":"Krystyna Kiersztyn","doi":"10.1002/sam.11562","DOIUrl":"https://doi.org/10.1002/sam.11562","url":null,"abstract":"Nowadays, we have been dealing with a large amount of data in which anomalies occur naturally for many reasons, both due to hardware and humans. Therefore, it is necessary to develop efficient tools that are easily adaptable to various data. The paper presents an innovative use of classical statistical tools to detect outliers in multidimensional data sets. The proposed approach uses well‐known statistical methods in an innovative way and allows for a high level of efficiency to be achieved using multi‐level aggregation. The effectiveness of the proposed innovative method is demonstrated by a series of numerical experiments.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122257246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}