Evaluating multivariate time-series clustering using simulated ecological momentary assessment data

Machine learning with applications (IF 4.9), Volume 14, Article 100512. Pub Date: 2023-12-15. Epub Date: 2023-11-20. DOI: 10.1016/j.mlwa.2023.100512

Mandani Ntekouli, Gerasimos Spanakis, Lourens Waldorp, Anne Roefs

Article page: https://www.sciencedirect.com/science/article/pii/S2666827023000658

Citations: 0

Abstract

During an Ecological Momentary Assessment (EMA) study, repeated digital questionnaires make it possible to collect multiple multivariate time-series (MTS) data for all participants. Although such data are commonly analyzed per participant, the richness of such a dataset raises the question of whether meaningful groups of individuals could be uncovered to better understand the underlying processes at both the individual and the group level. Such a grouping could be obtained by clustering. Therefore, this paper examines the performance of various clustering approaches for grouping individuals based on the similarity of their raw time-series data patterns. Clustering is an unsupervised task in which the true underlying groups are usually not available, making the result difficult to evaluate. Therefore, in the current paper, simulated irregular time-series data resembling EMA are used to validate the performance of several methods under different clustering-related choices, such as the distance metric. Data are generated with a varying number of clusters, total number of individuals and time-points, number of variables, and proportion of noisy variables, while their time-series represent well-shaped patterns typically observed in emotional behavior. After applying clustering to all simulated datasets, clustering performance was first assessed by comparing the true and predicted labels, and the impact of the different dataset parameters was also examined. Because ground-truth labels are not always available, or do not even exist, in real-world scenarios, clustering evaluation through distance-based and distance-free measures was further investigated. Overall, all clustering methods (e.g. k-means, hierarchical clustering, fuzzy k-medoids) proved reliable in different configurations, revealing the true number of clusters. Moreover, kernel-based methods appeared more efficient when highly noisy variables are involved, making them more promising for real-world data. In a second part, two specific simulated scenarios (datasets) are illustrated, showing in more detail all the analysis steps taken before drawing a conclusion about the optimal number of clusters.
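The evaluation workflow described in the abstract (simulate labeled data, cluster it, compare predicted with true labels, then use a label-free measure to pick the number of clusters) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two pattern shapes, the regular sampling, the flattening of variables into one vector, plain k-means with Euclidean distance, the Rand index as the label-based measure, and the mean silhouette width as the distance-based measure.

```python
import math
import random


def simulate(n_per_cluster, n_points, noise_sd, rng):
    """Simulate individuals with two variables: one follows the cluster's
    pattern shape, the other is pure noise. Shapes are hypothetical."""
    shapes = [
        lambda t: math.sin(2 * math.pi * t),  # oscillating pattern
        lambda t: 2.0 * t - 1.0,              # steadily increasing pattern
    ]
    data, labels = [], []
    for label, shape in enumerate(shapes):
        for _ in range(n_per_cluster):
            informative = [shape(i / (n_points - 1)) + rng.gauss(0, noise_sd)
                           for i in range(n_points)]
            noisy = [rng.gauss(0, noise_sd) for _ in range(n_points)]
            data.append(informative + noisy)  # flatten variables into one vector
            labels.append(label)
    return data, labels


def dist(a, b):
    """Euclidean distance between two flattened series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(data, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(dist(p, c) for c in centers)))
    assign = []
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda j: dist(p, centers[j])) for p in data]
        for j in range(k):
            members = [p for p, a in zip(data, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def rand_index(a, b):
    """Fraction of pairs of individuals on which two labelings agree."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)


def mean_silhouette(data, assign):
    """Average silhouette width; singleton clusters contribute 0."""
    total = 0.0
    for i, p in enumerate(data):
        by_cluster = {}
        for j, q in enumerate(data):
            if j != i:
                by_cluster.setdefault(assign[j], []).append(dist(p, q))
        own = by_cluster.pop(assign[i], None)
        if own and by_cluster:
            a = sum(own) / len(own)  # mean distance within own cluster
            b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
            total += (b - a) / max(a, b)
    return total / len(data)


rng = random.Random(0)
data, true_labels = simulate(n_per_cluster=10, n_points=20, noise_sd=0.1, rng=rng)

# Label-based evaluation: only possible because the data are simulated.
pred = kmeans(data, k=2)
ri = rand_index(true_labels, pred)

# Label-free evaluation: choose the k with the highest mean silhouette.
scores = {k: mean_silhouette(data, kmeans(data, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(ri, best_k)
```

In this toy setting the two groups are well separated, so both the label-based and the label-free measure point to two clusters. Raising `noise_sd` or adding more pure-noise variables is a simple way to probe how such measures degrade, which is the kind of question the paper studies with its own simulation design and methods.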


Source journal

Machine learning with applications: Management Science and Operations Research, Artificial Intelligence, Computer Science Applications

Self-citation rate

0.00%

Articles published

Review time

98 days