Doubly robust estimation for non-probability samples with modified intertwined probabilistic factors decoupling
Zhanxu Liu, Junbo Zheng, Yingli Pan (doi:10.1002/sam.11614, published 2023-02-18)
In recent years, non-probability samples, such as web survey samples, have become increasingly popular in many fields, but they may be subject to selection bias, which makes inference from them difficult. Doubly robust (DR) estimation is one approach to making inferences from non-probability samples. When many covariates are available, variable selection becomes important in DR estimation. In this paper, a new DR estimator for the finite population mean is constructed, in which the intertwined probabilistic factors decoupling (IPAD) method and a modified IPAD are used to select important variables in the propensity score model and the outcome superpopulation model, respectively. Unlike traditional variable selection approaches, such as the adaptive least absolute shrinkage and selection operator and the smoothly clipped absolute deviation penalty, IPAD and the modified IPAD not only select important variables and estimate parameters but also control the false discovery rate, which yields more accurate population estimators. Asymptotic theory and variance estimation for the DR estimator with the modified IPAD are established. Results from simulation studies indicate that our proposed estimator performs well. We apply the proposed method to the analysis of Pew Research Center data and Behavioral Risk Factor Surveillance System data.
Estimation of disease progression for ischemic heart disease using latent Markov with covariates
Zarina Oflaz, Ceylan Yozgatlıgil, A. S. Selcuk-Kestel (doi:10.1002/sam.11589, published 2023-02-01)
Contemporaneous monitoring of disease progression, in addition to early diagnosis, is important for the treatment of patients with chronic conditions. Chronic disease-related factors are not easily traced, and existing data sets do not clearly reflect them, making diagnosis difficult. The primary issue is that databases maintained by health care, insurance, or governmental organizations typically do not contain clinical information and instead focus on patient appointments and demographic profiles. Without thorough information on potential risk factors for a single patient, investigations into the nature of a disease are imprecise. We suggest a latent Markov model with covariates in the latent process, because it enables panel analysis of many forms of data. The purpose of this study is to evaluate unobserved factors in ischemic heart disease (IHD) using longitudinal data from electronic health records. Based on the results, we designate the latent states as healthy, light, moderate, and severe to represent stages of disease progression. This study demonstrates that gender, patient age, and hospital visit frequency are all significant factors in the development of the disease. Females acquire IHD more rapidly than males, frequently progressing through the moderate and severe states. In addition, the study demonstrates that individuals under the age of 20 bypass the light state of IHD and proceed directly to the moderate state.
Adaptive boosting for ordinal target variables using neural networks
Insung Um, Geonseok Lee, K. Lee (doi:10.1002/sam.11613, published 2023-01-26)
Boosting has proven its superiority in a wide range of classification problems by increasing the diversity of base classifiers. In practice, target variables in classification are often formed from numerical variables and therefore carry ordinal information. However, existing boosting algorithms for classification cannot exploit such ordinal target variables, resulting in suboptimal solutions. In this paper, we propose a novel ordinal-encoding adaptive boosting (AdaBoost) algorithm that uses a multi-dimensional encoding scheme for ordinal target variables. Extending the original binary-class AdaBoost, the proposed algorithm is equipped with a multi-class exponential loss function. We show that it achieves the Bayes classifier and admits a forward stagewise additive modeling interpretation. We demonstrate the performance of the proposed algorithm with a neural network as the base learner. Our experiments show that it outperforms existing boosting algorithms on various ordinal datasets.
Bilateral-Weighted Online Adaptive Isolation Forest for anomaly detection in streaming data
Gabor Hannak, G. Horváth, Attila Kádár, Márk Dániel Szalai (doi:10.1002/sam.11612, published 2023-01-14)
We propose a method called Bilateral-Weighted Online Adaptive Isolation Forest (BWOAIF) for unsupervised anomaly detection based on Isolation Forest (IF), which is applicable to streaming data and able to cope with concept drift. Like IF, the proposed method has only a few hyperparameters, whose effects on performance are intuitive to interpret and therefore easy to tune. BWOAIF ingests data and classifies it as normal or anomalous, and simultaneously adapts its classifier by removing old trees and creating new ones. We show that BWOAIF adapts gradually to slow concept drift while also adapting quickly to sudden changes in the data distribution. Numerical results show the efficacy of the proposed algorithm and its ability to handle different classes of concept drift, such as slow/fast concept shift, concept split, concept appearance, and concept disappearance.
Specifying composites in structural equation modeling: A refinement of the Henseler–Ogasawara specification
Xi Yu, Florian Schuberth, J. Henseler (doi:10.1002/sam.11608, published 2023-01-05)
Structural equation modeling (SEM) plays an important role in business and social science, and so do composites, that is, linear combinations of variables. However, existing approaches to integrating composites into structural equation models still have limitations. A major leap forward has been the Henseler–Ogasawara (H–O) specification, which for the first time allows composites to be seamlessly integrated into structural equation models. In doing so, it relies on emergent variables, that is, the composite of interest, and one or more orthogonal excrescent variables, that is, composites that have no surplus meaning but simply span the remaining space of the emergent variable's components. Although the H–O specification enables researchers to flexibly model composites in SEM, it comes with several practical problems: (i) it is difficult to visualize graphically; (ii) its complexity can create difficulties for analysts; and (iii) SEM software packages sometimes encounter convergence issues with it. In this paper, we present a refinement of the original H–O specification that addresses these three problems. In the new specification, only two components load on each excrescent variable, whereas the excrescent variables are allowed to covary among themselves. This results in a simpler graphical visualization. Additionally, researchers facing convergence issues with the original H–O specification are provided with an alternative specification. Finally, we illustrate the application of the new specification by means of an empirical example and provide guidance on how (standardized) weights, including their standard errors, can be calculated in the R package lavaan. The corresponding Mplus model syntax is provided in the Supplementary Material.
Model selection with bootstrap validation
Rafael Savvides, Jarmo Mäkelä, K. Puolamäki (doi:10.1002/sam.11606, published 2023-01-04)
Model selection is one of the most central tasks in supervised learning. Validation set methods are the standard way to accomplish it: models are trained on training data, and the model with the smallest loss on the validation data is selected. However, it is generally not obvious how much validation data is required to make a reliable selection, which matters when labeled data are scarce or expensive. We propose a bootstrap-based algorithm, bootstrap validation (BSV), that uses the bootstrap to adjust the validation set size and to find the best-performing model within a user-specified tolerance parameter. We find that BSV works well in practice and can be used as a drop-in replacement for validation set methods or k-fold cross-validation. The main advantage of BSV is that less validation data is typically needed, so more data can be used to train the models, resulting in better approximations and efficient use of validation data.
Hierarchy-assisted gene expression regulatory network analysis
Han Yan, Sanguo Zhang, Shuangge Ma (doi:10.1002/sam.11609, published 2023-01-04)
Gene expression has been extensively studied in biomedical research. With gene expression data, network analysis, which takes a system perspective and examines the interconnections among genes, has been established as highly important and meaningful. In the construction of gene expression networks, a commonly adopted technique is high-dimensional regularized regression. Network construction can be unadjusted (focusing on gene expressions only) or adjusted (also incorporating regulators of gene expression); the two types of construction have different implications and can be equally important. In this article, we propose a variable selection hierarchy to connect the unadjusted regression-based network construction with the adjusted construction that incorporates two or more types of regulators. This hierarchy is sensible and amounts to additional information for both constructions, and thus has the potential to improve variable selection and estimation. An effective computational algorithm is developed, and extensive simulation demonstrates the superiority of the proposed construction over multiple closely related alternatives. The analysis of TCGA data further demonstrates the practical utility of the proposed approach.
Robust deep neural network surrogate models with uncertainty quantification via adversarial training
Lixiang Zhang, Jia Li (doi:10.1002/sam.11610, published 2023-01-04)
Surrogate models have been used to emulate mathematical simulators of physical or biological processes for computational efficiency. High-speed simulation is crucial for uncertainty quantification (UQ), where the simulation must be repeated over many randomly sampled input points (the Monte Carlo method). A simulator can be so computationally intensive that UQ is only feasible with a surrogate model. Recently, deep neural network (DNN) surrogate models have gained popularity for their state-of-the-art emulation accuracy. However, it is well known that DNNs are prone to severe errors when input data are perturbed in particular ways, the very phenomenon that has inspired great interest in adversarial training. For surrogate models, the concern is less about a deliberate attack exploiting the vulnerability of a DNN and more about the high sensitivity of its accuracy to input directions, an issue largely ignored by researchers using emulation models. In this paper, we show the severity of this issue through empirical studies and hypothesis testing. Furthermore, we adopt methods from adversarial training to enhance the robustness of DNN surrogate models. Experiments demonstrate that our approaches significantly improve the robustness of the surrogate models without compromising emulation accuracy.
Semi-supervised multi-label learning with missing labels by exploiting feature-label correlations
Runxin Li, Xuefeng Zhao, Zhenhong Shang, Lianyin Jia (doi:10.1002/sam.11607, published 2022-12-31)
Most multi-label learning techniques now in use presuppose that enough fully labeled instances are available. In real-world applications, however, each training instance frequently comes with only partial labels, either because obtaining a fully labeled training set takes a lot of time and effort or because doing so is expensive. Multi-label learning with missing labels therefore has greater practical value. In this paper, we propose a new semi-supervised multi-label learning method (SMLMFC) that specifically addresses missing-label scenarios. After filling in the missing labels for instances using two-stage label correlations, SMLMFC trains a semi-supervised multi-label classifier by imposing feature-label correlation restrictions directly on the label outputs. In particular, the complex relationships between features and labels can be learned and implicitly captured through feature-label correlations. Experimental results on a number of real-world multi-label datasets confirm that SMLMFC is strongly competitive with state-of-the-art methods.
Simplicial depth and its median: Selected properties and limitations
Stanislav Nagy (doi:10.1002/sam.11605, published 2022-12-02)
Depth functions are important tools of nonparametric statistics that extend orderings, ranks, and quantiles to the setting of multivariate data. We revisit the classical definition of the simplicial depth and explore its theoretical properties when evaluated with respect to datasets or measures that do not necessarily possess a symmetric density. Recent advances from discrete geometry are used to refine the results about the robustness and continuity of the simplicial depth and its induced multivariate median. Further, we compute the exact simplicial depth in several scenarios and point out some undesirable behavior: (i) the simplicial depth need not be maximized at the center of symmetry of the distribution; (ii) it is not necessarily unimodal and can possess local extremes; and (iii) the sets of the induced multivariate medians or other central regions need not be connected.