Tandem clustering with invariant coordinate selection

Econometrics and Statistics (IF 2.0, JCR Q2, Economics) | Pub Date: 2024-03-16 | DOI: 10.1016/j.ecosta.2024.03.002
Andreas Alfons, Aurore Archimbaud, Klaus Nordhausen, Anne Ruiz-Gazen

Abstract

For multivariate data, tandem clustering is a well-known technique aiming to improve cluster identification through initial dimension reduction. Nevertheless, the usual approach using principal component analysis (PCA) has been criticized for focusing solely on inertia so that the first components do not necessarily retain the structure of interest for clustering. To address this limitation, a new tandem clustering approach based on invariant coordinate selection (ICS) is proposed. By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while providing affine invariant components. Certain theoretical results have been previously derived and guarantee that under some elliptical mixture models, the group structure can be highlighted on a subset of the first and/or last components. However, ICS has garnered minimal attention within the context of clustering. Two challenges associated with ICS include choosing the pair of scatter matrices and selecting the components to retain. For effective clustering purposes, it is demonstrated that the best scatter pairs consist of one scatter matrix capturing the within-cluster structure and another capturing the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully chosen subset size that is smaller than usual. The performance of ICS as a dimension reduction method is evaluated in terms of preserving the cluster structure in the data. In an extensive simulation study and empirical applications with benchmark data sets, various combinations of scatter matrices as well as component selection criteria are compared in situations with and without outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the PCA-based approach.
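The joint diagonalization step described in the abstract can be illustrated with a minimal sketch. It assumes the classical scatter pair COV and COV4 (the covariance matrix and a fourth-moment scatter), which is chosen here only because it is easy to compute; it is not one of the pairs the abstract recommends (local shape, pairwise scatters, or a small-subset MCD). The function name `ics_cov_cov4` and the simulated data are hypothetical. The data place two balanced clusters along a low-variance coordinate, so a variance-driven method such as PCA would rank that direction last, whereas with this scatter pair a balanced two-group structure is expected to show up on the last invariant component.

```python
import numpy as np
from scipy.linalg import eigh


def ics_cov_cov4(X):
    """Sketch of ICS with the scatter pair (COV, COV4); names are illustrative."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # First scatter: the sample covariance matrix.
    S1 = np.cov(Xc, rowvar=False)
    # Squared Mahalanobis distances with respect to S1.
    d2 = np.einsum("ij,ij->i", Xc, np.linalg.solve(S1, Xc.T).T)
    # Second scatter: fourth-moment scatter COV4, scaled by 1 / (p + 2).
    S2 = (Xc * d2[:, None]).T @ Xc / (n * (p + 2))
    # Joint diagonalization: solve S2 b = rho * S1 b, so that
    # B.T @ S1 @ B = I and B.T @ S2 @ B = diag(rho).
    rho, B = eigh(S2, S1)
    order = np.argsort(rho)[::-1]  # sort eigenvalues in decreasing order
    rho, B = rho[order], B[:, order]
    Z = Xc @ B  # invariant coordinates
    return rho, B, Z, S1, S2


# Simulated data (an assumption for illustration): two balanced clusters
# separated along coordinate 0, which has less variance than the noise
# coordinates 1 to 3.
rng = np.random.default_rng(0)
n, p = 1000, 4
labels = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p)) * np.array([1.0, 5.0, 5.0, 5.0])
X[:, 0] += np.where(labels == 0, -3.0, 3.0)

rho, B, Z, S1, S2 = ics_cov_cov4(X)
# Absolute correlation between the last invariant component and the labels.
sep = abs(np.corrcoef(Z[:, -1], labels)[0, 1])
print("generalized eigenvalues:", np.round(rho, 3))
print("|corr(last component, labels)|:", round(sep, 3))
```

In a tandem clustering workflow, the selected components (here the last column of `Z`) would then be passed to a clustering method of choice; that second step is omitted from this sketch.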
Source journal
CiteScore: 3.10
Self-citation rate: 10.50%
Articles published: 84
Journal description: Econometrics and Statistics is the official journal of the networks Computational and Financial Econometrics and Computational and Methodological Statistics. It publishes research papers in all aspects of econometrics and statistics and comprises two sections, Part A: Econometrics and Part B: Statistics.