A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses.

IF 3.9 | Zone 3 (Medicine) | JCR Q1: HEALTH CARE SCIENCES & SERVICES | BMC Medical Research Methodology | Pub Date: 2024-12-18 | DOI: 10.1186/s12874-024-02427-8
Pinyan Liu, Han Yuan, Yilin Ning, Bibhas Chakraborty, Nan Liu, Marco Aurélio Peres
Citations: 0

A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses.

Background: Traditional clustering techniques are typically restricted to either continuous or categorical variables. However, most real-world clinical data are mixed type. This study aims to introduce a clustering technique specifically designed for datasets containing both continuous and categorical variables to offer better clustering compatibility, adaptability, and interpretability than other mixed type techniques.

Methods: This paper proposed a modified Gower distance incorporating feature importance as weights to maintain equal contributions between continuous and categorical features. The algorithm (DAFI) was evaluated using five simulated datasets with varying proportions of important features and real-world datasets from the 2011-2014 National Health and Nutrition Examination Survey (NHANES). Effectiveness was demonstrated through comparisons with 13 clustering techniques. Clustering performance was assessed using the adjusted Rand index (ARI) for accuracy in simulation studies and the silhouette score for cohesion and separation in NHANES. Additionally, multivariable logistic regression estimated the association between periodontitis (PD) and cardiovascular diseases (CVDs), adjusting for clusters in NHANES.
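
As an illustration of the kind of distance described in the Methods, the following is a minimal Python sketch of a generic weighted Gower distance between two mixed-type records. It is not the authors' DAFI implementation: the function name `weighted_gower`, its parameters, and the example weights are hypothetical, and the abstract does not say how DAFI derives feature importance or how it balances continuous against categorical contributions.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def weighted_gower(
    a: Sequence[Value],
    b: Sequence[Value],
    is_categorical: Sequence[bool],
    ranges: Sequence[Optional[float]],
    weights: Sequence[float],
) -> float:
    """Generic weighted Gower distance between two records (illustrative only).

    Continuous features contribute |x - y| / range; categorical features
    contribute 0 on a match and 1 on a mismatch. Each contribution is
    multiplied by its weight (for example, a feature-importance score)
    and the total is normalised by the sum of the weights.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, cat, rng, w in zip(a, b, is_categorical, ranges, weights):
        if cat:
            d = 0.0 if x == y else 1.0
        else:
            # A zero range means the feature is constant, so it cannot separate records.
            d = abs(x - y) / rng if rng and rng > 0 else 0.0
        numerator += w * d
        denominator += w
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical record layout: (age in years, smoker flag); assumed weights 0.75 and 0.25.
d = weighted_gower(
    (50.0, "yes"),
    (60.0, "no"),
    is_categorical=(False, True),
    ranges=(40.0, None),  # age range observed in the (assumed) dataset
    weights=(0.75, 0.25),
)
print(d)  # 0.4375
```

With equal weights this reduces to the ordinary Gower distance; unequal weights let features judged more important dominate the pairwise distances passed to a clustering algorithm.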

Results: In the simulation studies, the DAFI-Gower algorithm consistently outperformed the baseline methods on the adjusted Rand index across the settings investigated, especially on datasets with more redundant features. In NHANES, 3,760 people were analyzed. DAFI-Gower achieved the highest silhouette score (0.79), and four distinct clusters with diverse health profiles were identified. With feature importance incorporated, cluster formation was more strongly influenced by CVD-related factors. After adjusting for clusters, periodontitis was significantly associated with cardiovascular diseases (adjusted OR 1.95, 95% CI 1.50 to 2.55, p = 0.012), highlighting severe periodontitis as a potential risk factor for cardiovascular diseases.
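
For readers unfamiliar with the accuracy metric used in the simulations, the adjusted Rand index can be computed from the contingency table of true and predicted labels. The sketch below uses the standard pair-counting formula in plain Python; the function name `adjusted_rand_index` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]
) -> float:
    """Adjusted Rand index via the standard pair-counting formula."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Pairs placed together in both partitions, in the true one, and in the predicted one.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs  # agreement expected by chance
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (together_both - expected) / (maximum - expected)


# Cluster names do not matter, only the grouping does.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

A value of 1 means the two partitions are identical up to relabelling, values near 0 mean agreement no better than chance, and negative values mean less agreement than chance.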

Conclusions: DAFI performed better than classic clustering baselines on both simulated and real-world datasets. It effectively captures cluster characteristics by considering feature importance, which is crucial in clinical settings where many variables may be similar or irrelevant. We envisage that DAFI offers an effective solution for mixed type clustering.

Source journal
BMC Medical Research Methodology (Medicine: Health Care)
CiteScore: 6.50
Self-citation rate: 2.50%
Articles published: 298
Review time: 3-8 weeks
About the journal: BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.
Latest articles from the journal
A generative model for evaluating missing data methods in large epidemiological cohorts.
Discrepancies in safety reporting for chronic back pain clinical trials: an observational study from ClinicalTrials.gov and publications.
Multiple states clustering analysis (MSCA), an unsupervised approach to multiple time-to-event electronic health records applied to multimorbidity associated with myocardial infarction.
Matching plus regression adjustment for the estimation of the average treatment effect on survival outcomes: a case study with mosunetuzumab in relapsed/refractory follicular lymphoma.
Protocol publication rate and comparison between article, registry and protocol in RCTs.