Selective inference for clustering with unknown variance

Impact factor: 1 · CAS Zone 4 (Mathematics) · JCR Q3 (Statistics & Probability) · Electronic Journal of Statistics · Pub Date: 2023-01-01 · DOI: 10.1214/23-ejs2143

Y. Yun, R. Barber

Citations: 2

Abstract

In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test and to test those hypotheses; that is, the same data serve both exploratory and confirmatory analysis. Reusing the same dataset for both exploration and testing can lead to massive selection bias and, in turn, many false discoveries. Selective inference is a framework that allows valid inference even when the same data are reused for exploration and testing. In this work, we are interested in the problem of selective inference for data clustering, where a clustering procedure is used to hypothesize a separation of the data points into a collection of subgroups, and we then wish to test whether these data-dependent clusters in fact represent meaningful differences within the data. Recent work by Gao et al. [2022] provides a framework for selective inference in this setting when a hierarchical clustering algorithm produces the cluster assignments; this framework was then extended to k-means clustering by Chen and Witten [2022]. Both of these works rely on assuming a known covariance structure for the data, but in practice the noise level needs to be estimated, and this is particularly challenging when the true cluster structure is unknown. In our work, we extend this line of work to the setting of noise with unknown variance and provide a selective inference method for this more general setting. Empirical results show that our new method is better able to maintain high power while controlling Type I error when the true noise level is unknown.
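The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method, nor that of Gao et al. or Chen and Witten. It only shows the problem those methods address: data are drawn from a single Gaussian (so no true clusters exist), split into two groups by an exact one-dimensional 2-means step, and then tested with a naive pooled z-test that ignores how the groups were chosen. The function names, sample size, and number of simulations are illustrative assumptions.

```python
import math
import random


def two_means_split_1d(xs):
    """Exact 2-means in one dimension: try every split of the sorted data."""
    s = sorted(xs)
    n = len(s)
    best_cost, best_k = float("inf"), 1
    for k in range(1, n):  # left cluster = s[:k], right cluster = s[k:]
        left, right = s[:k], s[k:]
        ml, mr = sum(left) / k, sum(right) / (n - k)
        cost = sum((x - ml) ** 2 for x in left) + sum((x - mr) ** 2 for x in right)
        if cost < best_cost:
            best_cost, best_k = cost, k
    return s[:best_k], s[best_k:]


def naive_pvalue(a, b):
    """Two-sided pooled z-test p-value (normal approximation).

    Deliberately ignores that a and b were chosen by clustering the same data.
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (na + nb - 2)
    z = (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return math.erfc(abs(z) / math.sqrt(2))


def naive_rejection_rate(n_sims=200, n=50, alpha=0.05, seed=0):
    """Fraction of null datasets where the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one Gaussian, no clusters
        a, b = two_means_split_1d(xs)
        if naive_pvalue(a, b) < alpha:
            rejections += 1
    return rejections / n_sims


rate = naive_rejection_rate()
print(f"naive rejection rate under the null: {rate:.2f}")
```

A valid test at level 0.05 would reject in roughly 5% of these null simulations. The naive test rejects far more often, because the clustering step places the two groups as far apart as possible before the test is run. Selective inference methods correct for this by conditioning on the clustering outcome.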

Source journal: Electronic Journal of Statistics (Statistics & Probability)

CiteScore: 1.80

Self-citation rate: 9.10%

Articles published: 100

Review time: 3 months

About the journal: The Electronic Journal of Statistics (EJS) publishes research articles and short notes on theoretical, computational and applied statistics. The journal is open access. Articles are refereed and are held to the same standard as articles in other IMS journals. Articles become publicly available shortly after they are accepted.