This study presents a dynamic Bayesian network framework that facilitates intuitive gradual edge changes. We use two conditional dynamics to model the edge addition and deletion, and edge selection separately. Unlike previous research that uses a mixture network approach, which restricts the number of possible edge changes, or structural priors to induce gradual changes, which can lead to unclear network evolution, our model induces more frequent and intuitive edge change dynamics. We employ Markov chain Monte Carlo (MCMC) sampling to estimate the model structures and parameters and demonstrate the model's effectiveness in a portfolio selection application.
{"title":"Dynamic Bayesian Networks with Conditional Dynamics in Edge Addition and Deletion","authors":"Lupe S. H. Chan, Amanda M. Y. Chu, Mike K. P. So","doi":"arxiv-2409.08965","DOIUrl":"https://doi.org/arxiv-2409.08965","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"203 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we study a Bayesian approach for solving large-scale linear inverse problems arising in various scientific and engineering fields. We propose a fused $L_{1/2}$ prior with edge-preserving and sparsity-promoting properties and show that it can be formulated as a Gaussian mixture Markov random field. Since the density function of this family of priors is neither log-concave nor Lipschitz, gradient-based Markov chain Monte Carlo methods cannot be applied to sample the posterior. Thus, we present a Gibbs sampler in which all the conditional posteriors involved have closed-form expressions. The Gibbs sampler works well for small problems, but it is computationally intractable for large-scale problems because it requires sampling from a high-dimensional Gaussian distribution. To reduce the computational burden, we construct a Gibbs bouncy particle sampler (Gibbs-BPS) based on a piecewise deterministic Markov process. This new sampler combines elements of the Gibbs sampler with the bouncy particle sampler, and its computational complexity is an order of magnitude smaller. We show that the new sampler converges to the target distribution. With computed tomography examples, we demonstrate that the proposed method is competitive with existing popular Bayesian methods and is highly efficient in large-scale problems.
{"title":"Fused $L_{1/2}$ prior for large scale linear inverse problem with Gibbs bouncy particle sampler","authors":"Xiongwen Ke, Yanan Fan, Qingping Zhou","doi":"arxiv-2409.07874","DOIUrl":"https://doi.org/arxiv-2409.07874","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gaussian process (GP) methods have been widely studied recently, especially for large-scale systems with big data and for the even more extreme cases in which data is sparse. Key advantages of these methods are that they: 1) provide inherent ways to assess the impact of uncertainties (especially in the data and the environment) on the solutions, 2) have efficient factorisation-based implementations, and 3) can be implemented easily in a distributed manner and hence provide scalable solutions. This paper reviews recently developed key factorised GP methods, such as the hierarchical off-diagonal low-rank approximation methods and GPs with Kronecker structure. An example illustrates the performance of these methods with respect to accuracy and computational complexity.
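To make the Kronecker-structure idea concrete, the following is a minimal sketch, not the implementation reviewed in the paper. It assumes the inputs lie on a two-dimensional grid so that the kernel matrix factorises as a Kronecker product of two small matrices, and it solves the noisy GP linear system through the eigendecompositions of the factors. The function names, the squared-exponential kernel, and the grid sizes are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs x (illustrative choice)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def kron_gp_solve(K1, K2, Y, noise_var):
    """Solve (kron(K1, K2) + noise_var * I) alpha = vec(Y) without forming kron(K1, K2).

    Y has shape (n1, n2); vec(Y) means row-major flattening, i.e. Y.ravel().
    Returns alpha reshaped to (n1, n2).
    """
    w1, Q1 = np.linalg.eigh(K1)  # K1 = Q1 diag(w1) Q1^T
    w2, Q2 = np.linalg.eigh(K2)  # K2 = Q2 diag(w2) Q2^T
    # Rotate the data into the joint eigenbasis of the Kronecker product.
    T = Q1.T @ Y @ Q2
    # In that basis the system is diagonal with entries w1[i] * w2[j] + noise_var.
    T = T / (np.outer(w1, w2) + noise_var)
    # Rotate back to the original coordinates.
    return Q1 @ T @ Q2.T

# Hypothetical grid: 30 x 20 inputs, so the full kernel matrix would be 600 x 600.
x1 = np.linspace(0.0, 1.0, 30)
x2 = np.linspace(0.0, 1.0, 20)
K1 = rbf_kernel(x1, lengthscale=0.3)
K2 = rbf_kernel(x2, lengthscale=0.3)
rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 20))
alpha = kron_gp_solve(K1, K2, Y, noise_var=0.1)
```

The design point is that only the two small factors are decomposed, so the cost is driven by the factor sizes rather than by the size of the full kernel matrix.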
{"title":"Review of Recent Advances in Gaussian Process Regression Methods","authors":"Chenyi Lyu, Xingchi Liu, Lyudmila Mihaylova","doi":"arxiv-2409.08112","DOIUrl":"https://doi.org/arxiv-2409.08112","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Community detection is a crucial problem in the analysis of multi-layer networks. In this work, we introduce a new method, called regularized debiased sum of squared adjacency matrices (RDSoS), to detect latent communities in multi-layer networks. RDSoS is developed based on a novel regularized Laplacian matrix that regularizes the debiased sum of squared adjacency matrices. In contrast, the classical regularized Laplacian matrix typically regularizes the adjacency matrix of a single-layer network. Therefore, at a high level, our regularized Laplacian matrix extends the classical regularized Laplacian matrix to multi-layer networks. We establish the consistency property of RDSoS under the multi-layer stochastic block model (MLSBM) and further extend RDSoS and its theoretical results to the degree-corrected version of the MLSBM model. The effectiveness of the proposed methods is evaluated and demonstrated through synthetic and real datasets.
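The construction described above can be sketched roughly as follows. This is an illustration under assumptions, not the paper's exact definition: the debiased sum of squares is taken to be the sum over layers of $A_l^2 - D_l$ (with $D_l$ the diagonal degree matrix of layer $l$), the regularized Laplacian is taken to be $D_\tau^{-1/2} S D_\tau^{-1/2}$ with $D_\tau$ equal to the diagonal matrix of row sums of $S$ plus $\tau I$, and the value of `tau` is a hypothetical choice. Function names are illustrative.

```python
import numpy as np

def debiased_sum_of_squares(layers):
    """Sum over layers of A @ A minus the diagonal degree matrix of A."""
    n = layers[0].shape[0]
    S = np.zeros((n, n))
    for A in layers:
        S += A @ A - np.diag(A.sum(axis=1))
    return S

def regularized_laplacian(S, tau):
    """Assumed form: D_tau^{-1/2} S D_tau^{-1/2}, D_tau = diag(row sums of S) + tau * I."""
    d = S.sum(axis=1) + tau
    inv_sqrt = 1.0 / np.sqrt(d)
    return inv_sqrt[:, None] * S * inv_sqrt[None, :]

def leading_eigenvectors(L, k):
    """Eigenvectors belonging to the k eigenvalues of L with largest absolute value."""
    w, V = np.linalg.eigh(L)
    idx = np.argsort(-np.abs(w))[:k]
    return V[:, idx]

# Hypothetical example: 3 layers, 40 nodes, 2 planted communities.
rng = np.random.default_rng(1)
n, k = 40, 2
labels = np.repeat([0, 1], n // 2)
P = np.where(labels[:, None] == labels[None, :], 0.3, 0.05)
layers = []
for _ in range(3):
    upper = np.triu((rng.random((n, n)) < P).astype(float), 1)
    layers.append(upper + upper.T)  # symmetric, binary, no self-loops

S = debiased_sum_of_squares(layers)
L = regularized_laplacian(S, tau=1.0)
U = leading_eigenvectors(L, k)  # rows of U would then be clustered, e.g. by k-means
```

Subtracting the degree matrix removes the diagonal of each squared adjacency matrix, which is the source of the bias the abstract refers to; the final clustering step on the rows of `U` is left out of the sketch.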
{"title":"Community detection in multi-layer networks by regularized debiased spectral clustering","authors":"Huan Qing","doi":"arxiv-2409.07956","DOIUrl":"https://doi.org/arxiv-2409.07956","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Easy-to-interpret effect estimands are highly desirable in survival analysis. In the competing risks framework, one good candidate is the restricted mean time lost (RMTL). It is defined as the area under the cumulative incidence function up to a prespecified time point and, thus, it summarizes the cumulative incidence function into a meaningful estimand. While existing RMTL-based tests are limited to two-sample comparisons and mostly to two event types, we aim to develop general contrast tests for factorial designs and an arbitrary number of event types based on a Wald-type test statistic. Furthermore, we avoid the often-made, rather restrictive continuity assumption on the event time distribution. This allows for ties in the data, which often occur in practical applications, e.g., when event times are measured in whole days. In addition, we develop more reliable tests for RMTL comparisons that are based on a permutation approach to improve the small sample performance. In a second step, multiple tests for RMTL comparisons are developed to test several null hypotheses simultaneously. Here, we incorporate the asymptotically exact dependence structure between the local test statistics to gain more power. The small sample performance of the proposed testing procedures is analyzed in simulations and finally illustrated by analyzing a real data example about leukemia patients who underwent bone marrow transplantation.
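The definition of the RMTL given above can be written out as follows. The notation is assumed here rather than taken from the paper: $T$ is the event time, $D$ the event type, $F_k$ the cumulative incidence function of event type $k$, and $\tau$ the prespecified time point.

```latex
\mathrm{RMTL}_k(\tau) = \int_0^{\tau} F_k(t)\,\mathrm{d}t
  = \mathbb{E}\bigl[(\tau - \min(T,\tau))\,\mathbf{1}\{D = k\}\bigr],
\qquad F_k(t) = P(T \le t,\ D = k).
```

The second equality shows the interpretation: the RMTL for event type $k$ is the expected amount of time lost before $\tau$ due to that event type.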
{"title":"Multiple tests for restricted mean time lost with competing risks data","authors":"Merle Munko, Dennis Dobler, Marc Ditzhaus","doi":"arxiv-2409.07917","DOIUrl":"https://doi.org/arxiv-2409.07917","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"398 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Studying racial bias in policing is a critically important problem, but one that comes with a number of inherent difficulties due to the nature of the available data. In this manuscript we tackle multiple key issues in the causal analysis of racial bias in policing. First, we formalize race and place policing, the idea that individuals of one race are policed differently when they are in neighborhoods primarily made up of individuals of other races. We develop an estimand to study this question rigorously, show the assumptions necessary for causal identification, and develop sensitivity analyses to assess robustness to violations of key assumptions. Additionally, we investigate difficulties with existing estimands targeting racial bias in policing. We show for these estimands, and the estimands developed in this manuscript, that estimation can benefit from incorporating mobility data into analyses. We apply these ideas to a study in New York City, where we find a large amount of racial bias, as well as race and place policing, and that these findings are robust to large violations of untestable assumptions. We additionally show that mobility data can make substantial impacts on the resulting estimates, suggesting it should be used whenever possible in subsequent studies.
{"title":"Causal inference and racial bias in policing: New estimands and the importance of mobility data","authors":"Zhuochao Huang, Brenden Beck, Joseph Antonelli","doi":"arxiv-2409.08059","DOIUrl":"https://doi.org/arxiv-2409.08059","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Functional data analysis (FDA) and ensemble learning can be powerful tools for analyzing complex environmental time series. Recent literature has highlighted the key role of diversity in enhancing accuracy and reducing variance in ensemble methods. This paper introduces Randomized Spline Trees (RST), a novel algorithm that bridges these two approaches by incorporating randomized functional representations into the Random Forest framework. RST generates diverse functional representations of input data using randomized B-spline parameters, creating an ensemble of decision trees trained on these varied representations. We provide a theoretical analysis of how this functional diversity contributes to reducing generalization error and present empirical evaluations on six environmental time series classification tasks from the UCR Time Series Archive. Results show that RST variants outperform standard Random Forests and Gradient Boosting on most datasets, improving classification accuracy by up to 14%. The success of RST demonstrates the potential of adaptive functional representations in capturing complex temporal patterns in environmental data. This work contributes to the growing field of machine learning techniques focused on functional data and opens new avenues for research in environmental time series analysis.
{"title":"Randomized Spline Trees for Functional Data Classification: Theory and Application to Environmental Time Series","authors":"Donato Riccio, Fabrizio Maturo, Elvira Romano","doi":"arxiv-2409.07879","DOIUrl":"https://doi.org/arxiv-2409.07879","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The focus of this study is to evaluate the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations. To achieve this, we develop several ML-based methods with varying architectures and implement them as two-sample tests. Each method is an ensemble (stacking) that combines predictions from classical two-sample tests. This paper presents the results of training the proposed ML methods, examines their statistical power compared to classical two-sample tests, analyzes the distribution of test statistics for the proposed methods when the null hypothesis is true, and evaluates the significance of the features incorporated into the proposed methods. All results from numerical experiments were obtained from a synthetic dataset generated using the Smirnov transform (Inverse Transform Sampling) and replicated multiple times through Monte Carlo simulation. To test the two-sample problem with right-censored observations, one can use the proposed two-sample methods. All necessary materials (source code, example scripts, dataset, and samples) are available on GitHub and Hugging Face.
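As an illustration of how right-censored samples can be generated with the Smirnov transform (inverse transform sampling), here is a minimal sketch. It is not the authors' generator: the Weibull lifetime distribution, the exponential censoring distribution, and all function and parameter names are assumptions made for the example.

```python
import numpy as np

def weibull_inverse_cdf(u, shape, scale):
    """Inverse CDF of a Weibull(shape, scale) distribution, applied to u in [0, 1)."""
    return scale * (-np.log1p(-u)) ** (1.0 / shape)

def exponential_inverse_cdf(u, rate):
    """Inverse CDF of an Exponential(rate) distribution, applied to u in [0, 1)."""
    return -np.log1p(-u) / rate

def simulate_right_censored(n, shape, scale, censor_rate, rng):
    """Draw n right-censored observations via inverse transform sampling.

    Returns (observed, event): observed = min(lifetime, censoring time),
    event is True where the lifetime was observed (not censored).
    """
    lifetime = weibull_inverse_cdf(rng.random(n), shape, scale)
    censoring = exponential_inverse_cdf(rng.random(n), censor_rate)
    observed = np.minimum(lifetime, censoring)
    event = lifetime <= censoring
    return observed, event

# Two hypothetical samples that differ only in the Weibull scale parameter.
rng = np.random.default_rng(42)
sample_a = simulate_right_censored(100, shape=1.5, scale=2.0, censor_rate=0.2, rng=rng)
sample_b = simulate_right_censored(100, shape=1.5, scale=2.5, censor_rate=0.2, rng=rng)
```

Repeating such draws many times and applying a two-sample test to each pair is the basic loop of a Monte Carlo power study; setting both samples to the same parameters gives the distribution of the test statistic under the null hypothesis.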
{"title":"Machine Learning for Two-Sample Testing under Right-Censored Data: A Simulation Study","authors":"Petr Philonenko, Sergey Postovalov","doi":"arxiv-2409.08201","DOIUrl":"https://doi.org/arxiv-2409.08201","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Giorgia Zaccaria, Luis A. García-Escudero, Francesca Greselin, Agustín Mayo-Íscar
Real-world applications may be affected by outlying values. In the model-based clustering literature, several methodologies have been proposed to detect units that deviate from the majority of the data (rowwise outliers) and trim them from the parameter estimates. However, the discarded observations can encompass valuable information in some observed features. Following the more recent cellwise contamination paradigm, we introduce a Gaussian mixture model for cellwise outlier detection. The proposal is estimated via an Expectation-Maximization (EM) algorithm with an additional step for flagging the contaminated cells of a data matrix and then imputing -- instead of discarding -- them before the parameter estimation. This procedure adheres to the spirit of the EM algorithm by treating the contaminated cells as missing values. We analyze the performance of the proposed model in comparison with other existing methodologies through a simulation study with different scenarios and illustrate its potential use for clustering, outlier detection, and imputation on three real data sets.
{"title":"Cellwise outlier detection in heterogeneous populations","authors":"Giorgia Zaccaria, Luis A. García-Escudero, Francesca Greselin, Agustín Mayo-Íscar","doi":"arxiv-2409.07881","DOIUrl":"https://doi.org/arxiv-2409.07881","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Seong-ho Lee, Brian D. Richardson, Yanyuan Ma, Karen S. Marder, Tanya P. Garcia
In Huntington's disease research, a current goal is to understand how symptoms change prior to a clinical diagnosis. Statistically, this entails modeling symptom severity as a function of the covariate 'time until diagnosis', which is often heavily right-censored in observational studies. Existing estimators that handle right-censored covariates have varying statistical efficiency and robustness to misspecified models for nuisance distributions (those of the censored covariate and censoring variable). On one extreme, complete case estimation, which utilizes uncensored data only, is free of nuisance distribution models but discards informative censored observations. On the other extreme, maximum likelihood estimation is maximally efficient but inconsistent when the covariate's distribution is misspecified. We propose a semiparametric estimator that is robust and efficient. When the nuisance distributions are modeled parametrically, the estimator is doubly robust, i.e., consistent if at least one distribution is correctly specified, and semiparametric efficient if both models are correctly specified. When the nuisance distributions are estimated via nonparametric or machine learning methods, the estimator is consistent and semiparametric efficient. We show empirically that the proposed estimator, implemented in the R package sparcc, has its claimed properties, and we apply it to study Huntington's disease symptom trajectories using data from the Enroll-HD study.
{"title":"Robust and efficient estimation in the presence of a randomly censored covariate","authors":"Seong-ho Lee, Brian D. Richardson, Yanyuan Ma, Karen S. Marder, Tanya P. Garcia","doi":"arxiv-2409.07795","DOIUrl":"https://doi.org/arxiv-2409.07795","url":null,"PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}