Post-clustering difference testing: Valid inference and practical considerations with applications to ecological and biological data
Benjamin Hivert, Denis Agniel, Rodolphe Thiébaut, Boris P. Hejblum
Computational Statistics & Data Analysis (Impact Factor 1.5; JCR Q3, Computer Science, Interdisciplinary Applications)
Publication date: 2024-01-17
DOI: 10.1016/j.csda.2023.107916
URL: https://www.sciencedirect.com/science/article/pii/S016794732300227X
Abstract
Clustering is an unsupervised analysis method that groups samples into homogeneous, well-separated subgroups of observations, called clusters. To interpret the clusters, statistical hypothesis testing is often used to identify the variables that significantly separate the estimated clusters from one another. The hypotheses tested in this way are data-driven, because they are derived from the clustering results themselves. This double use of the data causes traditional hypothesis tests to fail to control the Type I error rate, particularly because of uncertainty in the clustering process and the artificial differences it can create. Three novel statistical hypothesis tests are introduced, each designed to account for the clustering process. These tests efficiently control the Type I error rate by identifying only variables that contain a true signal separating groups of observations. The proposed tests were applied in two distinct contexts, animal ecology and immunology, demonstrating the relevance of the results on real datasets.
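The double use of the data described in the abstract can be illustrated with a small simulation. The sketch below is not one of the three tests proposed in the paper; it only shows the problem they address. It assumes a single Gaussian variable with no true clusters, a hand-written one-dimensional 2-means step, and a Welch statistic compared with the normal 5% critical value 1.96 (a large-sample approximation). All function names and parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np


def two_means_1d(x, n_iter=50):
    """Lloyd's algorithm with k=2 on a 1-D array; returns boolean labels."""
    c0, c1 = x.min(), x.max()  # deterministic initial centers
    for _ in range(n_iter):
        labels = np.abs(x - c1) < np.abs(x - c0)  # True -> closer to c1
        new_c0, new_c1 = x[~labels].mean(), x[labels].mean()
        if new_c0 == c0 and new_c1 == c1:
            break  # centers stopped moving
        c0, c1 = new_c0, new_c1
    return labels


def welch_stat(a, b):
    """Welch two-sample statistic (difference in means over its standard error)."""
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


def rejection_rates(n_sim=500, n=100, seed=0, crit=1.96):
    """Share of null datasets where each procedure rejects at the nominal 5% level."""
    rng = np.random.default_rng(seed)
    naive = 0
    fixed = 0
    for _ in range(n_sim):
        x = rng.normal(size=n)  # one Gaussian population: no true clusters
        labels = two_means_1d(x)
        # Naive approach: test the difference between the estimated clusters.
        naive += abs(welch_stat(x[labels], x[~labels])) > crit
        # Control: groups fixed in advance, independently of the data values.
        fixed += abs(welch_stat(x[: n // 2], x[n // 2:])) > crit
    return naive / n_sim, fixed / n_sim


naive_rate, fixed_rate = rejection_rates()
print(f"rejection rate, clusters estimated from the data: {naive_rate:.3f}")
print(f"rejection rate, groups fixed in advance:          {fixed_rate:.3f}")
```

Under these assumptions, the test on estimated clusters should reject far more often than the nominal 5%, while the test on groups fixed in advance should stay close to it, because k-means separates the observations along the very variable being tested.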
About the journal:
Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research and applications in the areas of computational statistics and data analysis. The journal consists of four refereed sections which are divided into the following subject areas:
I) Computational Statistics - Manuscripts dealing with: 1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics, computer graphics, computer-intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge-based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statistical databases, statistical systems), and 2) the development, evaluation and validation of statistical software and algorithms. Software and algorithms can be submitted with manuscripts and will be stored together with the online article.
II) Statistical Methodology for Data Analysis - Manuscripts dealing with novel and original data analytical strategies and methodologies applied in biostatistics (design and analytic methods for clinical trials, epidemiological studies, statistical genetics, or genetic/environmental interactions), chemometrics, classification, data exploration, density estimation, design of experiments, environmetrics, education, image analysis, marketing, model-free data exploration, pattern recognition, psychometrics, statistical physics, image processing, and robust procedures.
[...]
III) Special Applications - [...]
IV) Annals of Statistical Data Science [...]