Evaluating multivariate time-series clustering using simulated ecological momentary assessment data

Mandani Ntekouli, Gerasimos Spanakis, Lourens Waldorp, Anne Roefs
Machine Learning with Applications, volume 14, Article 100512. Published 2023-11-20. DOI: 10.1016/j.mlwa.2023.100512
Full text: https://www.sciencedirect.com/science/article/pii/S2666827023000658
Citation count: 0

Abstract

During an Ecological Momentary Assessment (EMA) study, repeated digital questionnaires make it possible to collect multivariate time-series (MTS) data for every participant. Although such data are commonly analyzed per participant, the richness of these datasets raises the question of whether meaningful groups of individuals can be uncovered to better understand the underlying processes at both the individual and the group level. Such groupings can be obtained by clustering. This paper therefore examines the performance of various clustering approaches for grouping individuals based on the similarity of their raw time-series patterns. Clustering is an unsupervised task in which the true underlying groups are usually unavailable, which makes the result difficult to evaluate. For this reason, simulated irregular time-series data resembling EMA are used to validate the performance of several methods under different clustering-related choices, such as the distance metric. Data are generated with varying numbers of clusters, individuals, time-points and variables, and with varying proportions of noisy variables, while the time-series follow well-shaped patterns typically observed in emotional behavior. After clustering was applied to all simulated datasets, performance was first assessed by comparing the true and predicted labels, and the impact of the different dataset parameters was also examined. Because ground-truth labels are not always available, or do not even exist, in real-world scenarios, clustering evaluation through distance-based and distance-free measures was further investigated. Overall, all clustering methods (e.g. k-means, hierarchical clustering, fuzzy k-medoids) proved reliable in different configurations, revealing the true number of clusters. Moreover, kernel-based methods appeared more efficient when highly noisy variables are involved, making them more promising for real-world data. In a second part, two specific simulated scenarios (datasets) are illustrated, showing in more detail all the analysis steps taken before drawing a conclusion about the optimal number of clusters.
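
The evaluation loop described in the abstract (simulate grouped time-series, cluster them, then score the result with and without ground-truth labels) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. Everything in it is an assumption made for the example: the two pattern shapes, the regular (not irregular) sampling grid, the Euclidean distance on flattened series, the plain (non-fuzzy) k-medoids routine, and all function names and parameter values. The adjusted Rand index stands in for the label-based comparison and the silhouette score for a distance-based measure.

```python
import math
import random
from collections import Counter


def simulate(n_per_group=10, n_time=30, n_vars=3, noise=0.3, seed=0):
    """Return (data, labels): one flattened multivariate series per individual."""
    rng = random.Random(seed)
    patterns = [
        lambda t: math.sin(2 * math.pi * t / n_time),  # group 0: one full cycle
        lambda t: 2 * t / (n_time - 1) - 1,            # group 1: linear rise from -1 to 1
    ]
    data, labels = [], []
    for label, pattern in enumerate(patterns):
        for _ in range(n_per_group):
            # every variable follows the group pattern plus Gaussian noise
            data.append([pattern(t) + rng.gauss(0, noise)
                         for _var in range(n_vars) for t in range(n_time)])
            labels.append(label)
    return data, labels


def distance_matrix(data):
    """Pairwise Euclidean distances between flattened series."""
    return [[math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) for b in data]
            for a in data]


def k_medoids(dist, k, max_iter=100):
    """Plain k-medoids with deterministic farthest-first initialisation."""
    n = len(dist)
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        assign = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if assign[i] == c]
            # the new medoid is the member with the smallest total distance to its cluster
            new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign


def adjusted_rand_index(true, pred):
    """Label-based agreement between two partitions (1.0 means identical)."""
    pairs = lambda x: x * (x - 1) // 2
    sum_ij = sum(pairs(v) for v in Counter(zip(true, pred)).values())
    sum_a = sum(pairs(v) for v in Counter(true).values())
    sum_b = sum(pairs(v) for v in Counter(pred).values())
    expected = sum_a * sum_b / pairs(len(true))
    max_index = (sum_a + sum_b) / 2
    return 1.0 if max_index == expected else (sum_ij - expected) / (max_index - expected)


def silhouette(dist, assign):
    """Mean silhouette width; needs no ground-truth labels."""
    n = len(assign)
    total = 0.0
    for i in range(n):
        means = {}
        for c in set(assign):
            members = [j for j in range(n) if assign[j] == c and j != i]
            if members:
                means[c] = sum(dist[i][j] for j in members) / len(members)
        a = means.pop(assign[i], None)
        if a is None or not means:
            continue  # singleton cluster or only one cluster: contributes 0
        b = min(means.values())
        total += (b - a) / max(a, b)
    return total / n


data, labels = simulate()
dist = distance_matrix(data)
pred = k_medoids(dist, k=2)
ari = adjusted_rand_index(labels, pred)
sil = silhouette(dist, pred)
print(f"ARI = {ari:.2f}, silhouette = {sil:.2f}")
```

In a setup like this, the silhouette score could be recomputed for several candidate values of `k` to suggest a number of clusters when no true labels exist, while the adjusted Rand index is only available because the data are simulated.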

Source journal: Machine Learning with Applications
Subject areas: Management Science and Operations Research, Artificial Intelligence, Computer Science Applications
Self-citation rate: 0.00%
Articles published: 0
Review time: 98 days