Selective inference for k-means clustering.

Journal of Machine Learning Research | Pub Date: 2023-05-01 | IF 4.3 | Zone 3 (Computer Science) | JCR Q1 (Automation & Control Systems)
Yiqun T Chen, Daniela M Witten
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10805457/pdf/
Citations: 0

Abstract

Selective inference for k-means clustering.

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal to handwritten digits data and to single-cell RNA-sequencing data.
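
The inflated Type I error described in the abstract can be reproduced with a small simulation. The sketch below is an illustration only, not the authors' selective p-value: it draws data with no true clusters, splits each sample with a simplified one-dimensional Lloyd's algorithm (k = 2), and applies a classical two-sample z-test to the resulting cluster means. The function names, the min/max initialization, the known variance of 1, and the simulation settings are all assumptions made for this example.

```python
import math
import random


def lloyd_two_means_1d(x, max_iter=100):
    """Lloyd's algorithm with k=2 on 1-D data; returns a 0/1 label per point."""
    # Assumption: initialise the two centers at the smallest and largest value.
    centers = [min(x), max(x)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in x
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for c in (0, 1):
            members = [v for v, lab in zip(x, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def naive_z_pvalue(x, labels, sigma=1.0):
    """Classical two-sided z-test for equal group means (known sigma).

    It ignores the fact that the groups were chosen by looking at x.
    """
    g0 = [v for v, lab in zip(x, labels) if lab == 0]
    g1 = [v for v, lab in zip(x, labels) if lab == 1]
    diff = sum(g0) / len(g0) - sum(g1) / len(g1)
    se = sigma * math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


def rejection_rate(n_sims=500, n=30, alpha=0.05, cluster=True, seed=0):
    """Fraction of null data sets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Global null: every observation comes from the same N(0, 1).
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if cluster:
            labels = lloyd_two_means_1d(x)  # groups chosen from the data
        else:
            labels = [i % 2 for i in range(n)]  # groups fixed before seeing data
        if naive_z_pvalue(x, labels) < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    print("naive test on k-means clusters:", rejection_rate(cluster=True))
    print("naive test on a fixed split:   ", rejection_rate(cluster=False))
```

With these settings the naive test applied to k-means clusters should reject far more often than the nominal 5% level, while the same test on a split fixed before seeing the data should reject at roughly the nominal rate. This gap is what a selective p-value that conditions on the clustering is designed to close.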

Source journal
Journal of Machine Learning Research (Engineering & Technology; Computer Science: Artificial Intelligence)
CiteScore: 18.80
Self-citation rate: 0.00%
Articles published: 2
Review time: 3 months
Journal description: The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online. JMLR has a commitment to rigorous yet rapid reviewing. JMLR seeks previously unpublished papers on machine learning that contain: new principled algorithms with sound empirical validation, and with justification of theoretical, psychological, or biological nature; experimental and/or theoretical studies yielding new insight into the design and behavior of learning in intelligent systems; accounts of applications of existing techniques that shed light on the strengths and weaknesses of the methods; formalization of new learning tasks (e.g., in the context of new applications) and of methods for assessing performance on those tasks; development of new analytical frameworks that advance theoretical studies of practical learning methods; computational models of data from natural learning systems at the behavioral or neural level; or extremely well-written surveys of existing work.