Giorgia Zaccaria, Luis A. García-Escudero, Francesca Greselin, Agustín Mayo-Íscar
arXiv - STAT - Methodology. Published 2024-09-12. DOI: arxiv-2409.07881 (https://doi.org/arxiv-2409.07881)
Citations: 0
Cellwise outlier detection in heterogeneous populations
Real-world applications may be affected by outlying values. In the
model-based clustering literature, several methodologies have been proposed to
detect units that deviate from the majority of the data (rowwise outliers) and
trim them from the parameter estimates. However, the discarded observations can
encompass valuable information in some observed features. Following the more
recent cellwise contamination paradigm, we introduce a Gaussian mixture model
for cellwise outlier detection. The proposal is estimated via an
Expectation-Maximization (EM) algorithm with an additional step for flagging
the contaminated cells of a data matrix and then imputing -- instead of
discarding -- them before the parameter estimation. This procedure adheres to
the spirit of the EM algorithm by treating the contaminated cells as missing
values. We analyze the performance of the proposed model in comparison with
other existing methodologies through a simulation study with different
scenarios and illustrate its potential use for clustering, outlier detection,
and imputation on three real data sets.
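
The idea of treating flagged cells as missing values and imputing them can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the flagging mask and the mixture parameters are already given, and it only shows one plausible imputation step, in which each flagged cell is replaced by its Gaussian conditional mean given the unflagged cells of the same row, averaged over components with posterior weights computed from the unflagged cells alone. All function and variable names (`conditional_mean_impute`, `observed_log_density`, `mixture_impute`, `flagged`) are hypothetical.

```python
import numpy as np


def conditional_mean_impute(x, flagged, mu, sigma):
    """Replace the flagged cells of one row by their Gaussian conditional mean."""
    x = np.asarray(x, dtype=float)
    flagged = np.asarray(flagged, dtype=bool)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    obs = ~flagged
    out = x.copy()
    if not flagged.any():
        return out
    if not obs.any():
        # No reliable cell in this row: fall back to the component mean.
        out[flagged] = mu[flagged]
        return out
    sigma_oo = sigma[np.ix_(obs, obs)]      # covariance of unflagged cells
    sigma_mo = sigma[np.ix_(flagged, obs)]  # cross-covariance flagged/unflagged
    # E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
    out[flagged] = mu[flagged] + sigma_mo @ np.linalg.solve(sigma_oo, x[obs] - mu[obs])
    return out


def observed_log_density(x, flagged, mu, sigma):
    """Log density of the unflagged cells under the marginal Gaussian."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    obs = ~np.asarray(flagged, dtype=bool)
    d = int(obs.sum())
    if d == 0:
        return 0.0
    diff = x[obs] - mu[obs]
    sigma_oo = sigma[np.ix_(obs, obs)]
    _, logdet = np.linalg.slogdet(sigma_oo)
    maha = diff @ np.linalg.solve(sigma_oo, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def mixture_impute(x, flagged, weights, mus, sigmas):
    """Posterior-weighted imputation of one row across mixture components."""
    log_post = np.array([
        np.log(w) + observed_log_density(x, flagged, m, s)
        for w, m, s in zip(weights, mus, sigmas)
    ])
    log_post -= log_post.max()  # stabilise before exponentiating
    resp = np.exp(log_post)
    resp /= resp.sum()
    imputed = sum(
        r * conditional_mean_impute(x, flagged, m, s)
        for r, m, s in zip(resp, mus, sigmas)
    )
    return imputed, resp


# Toy usage: the second cell is flagged, so it is imputed from the first one.
# With zero means, unit variances and covariance 0.5, it becomes 0 + 0.5 * (2 - 0) = 1.0.
row = conditional_mean_impute([2.0, 100.0], [False, True],
                              [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])
```

In a full EM procedure this step would alternate with flagging and parameter updates; the sketch isolates the imputation so that the "contaminated cells as missing values" idea is visible on its own.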