Selective inference for clustering with unknown variance

IF 1.0 · Q3 (STATISTICS & PROBABILITY) · CAS Region 4, Mathematics · Electronic Journal of Statistics · Pub Date: 2023-01-01 · DOI: 10.1214/23-ejs2143
Y. Yun, R. Barber
{"title":"Selective inference for clustering with unknown variance","authors":"Y. Yun, R. Barber","doi":"10.1214/23-ejs2143","DOIUrl":null,"url":null,"abstract":"In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test, and to test these hypotheses-that is, both for exploratory and confirmatory data analysis. Reusing the same dataset for both exploration and testing can lead to massive selection bias, leading to many false discoveries. Selective inference is a framework that allows for performing valid inference even when the same data is reused for exploration and testing. In this work, we are interested in the problem of selective inference for data clustering, where a clustering procedure is used to hypothesize a separation of the data points into a collection of subgroups, and we then wish to test whether these data-dependent clusters in fact represent meaningful differences within the data. Recent work by Gao et al. [2022] provides a framework for doing selective inference for this setting, where a hierarchical clustering algorithm is used for producing the cluster assignments, which was then extended to k-means clustering by Chen and Witten [2022]. Both these works rely on assuming a known covariance structure for the data, but in practice, the noise level needs to be estimated-and this is particularly challenging when the true cluster structure is unknown. In our work, we extend this work to the setting of noise with unknown variance, and provide a selective inference method for this more general setting. Empirical results show that our new method is better able to maintain high power while controlling Type I error when the true noise level is unknown.","PeriodicalId":49272,"journal":{"name":"Electronic Journal of Statistics","volume":" ","pages":""},"PeriodicalIF":1.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Electronic Journal of Statistics","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1214/23-ejs2143","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
Citations: 2

Abstract

In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test and to test these hypotheses; that is, both for exploratory and confirmatory data analysis. Reusing the same dataset for both exploration and testing can lead to massive selection bias and, in turn, to many false discoveries. Selective inference is a framework that allows valid inference to be performed even when the same data is reused for exploration and testing. In this work, we are interested in the problem of selective inference for data clustering, where a clustering procedure is used to hypothesize a separation of the data points into a collection of subgroups, and we then wish to test whether these data-dependent clusters in fact represent meaningful differences within the data. Recent work by Gao et al. [2022] provides a framework for selective inference in this setting, where a hierarchical clustering algorithm is used to produce the cluster assignments; this was then extended to k-means clustering by Chen and Witten [2022]. Both of these works rely on assuming a known covariance structure for the data, but in practice the noise level needs to be estimated, and this is particularly challenging when the true cluster structure is unknown. We extend this line of work to the setting of noise with unknown variance, and provide a selective inference method for this more general setting. Empirical results show that our new method is better able to maintain high power while controlling Type I error when the true noise level is unknown.
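To make the abstract's motivating point concrete, the sketch below (ours, not the authors' procedure, and assuming only numpy and scipy are available) simulates the naive "cluster, then test" workflow on pure-noise data: a standard two-sample t-test between data-dependent clusters rejects far more often than the nominal 5% level, which is exactly the Type I error inflation that selective inference methods such as those of Gao et al. [2022], Chen and Witten [2022], and this paper are designed to correct.

```python
# A minimal sketch (not the paper's method) of the selection bias described in the
# abstract: generate data with NO true cluster structure, use hierarchical clustering
# to split it into two groups, then naively run a standard two-sample t-test between
# the discovered groups. Because the same data chose the groups, the naive Type I
# error is far above the nominal level.
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n, n_sims, alpha = 30, 1000, 0.05
rejections, valid = 0, 0

for _ in range(n_sims):
    X = rng.normal(size=(n, 1))                    # pure noise: no real clusters
    labels = fcluster(linkage(X, method="average"), t=2, criterion="maxclust")
    g1, g2 = X[labels == 1, 0], X[labels == 2, 0]
    if min(len(g1), len(g2)) < 2:                  # skip degenerate splits
        continue
    valid += 1
    p = stats.ttest_ind(g1, g2).pvalue             # naive, non-selective p-value
    rejections += p < alpha

print(f"naive Type I error: {rejections / valid:.2f} (nominal level {alpha})")
```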
Source journal
Electronic Journal of Statistics (STATISTICS & PROBABILITY)
CiteScore: 1.80
Self-citation rate: 9.10%
Articles published per year: 100
Review time: 3 months
Journal description: The Electronic Journal of Statistics (EJS) publishes research articles and short notes on theoretical, computational and applied statistics. The journal is open access. Articles are refereed and are held to the same standard as articles in other IMS journals. Articles become publicly available shortly after they are accepted.
Latest articles in this journal
Direct Bayesian linear regression for distribution-valued covariates
Statistical inference via conditional Bayesian posteriors in high-dimensional linear regression
Subnetwork estimation for spatial autoregressive models in large-scale networks
Tests for high-dimensional single-index models
Variable selection for single-index varying-coefficients models with applications to synergistic G × E interactions