SOMTimeS: self organizing maps for time series clustering and its application to serious illness conversations

Ali Javed, Donna M. Rizzo, Byung Suk Lee, Robert Gramling

Data Mining and Knowledge Discovery, published 2023-10-20. DOI: 10.1007/s10618-023-00979-9 (https://doi.org/10.1007/s10618-023-00979-9)
Abstract
There is demand for scalable algorithms capable of clustering and analyzing large time series data. The Kohonen self-organizing map (SOM) is an unsupervised artificial neural network for clustering, visualizing, and reducing the dimensionality of complex data. Like all clustering methods, it requires a measure of similarity between input data (in this work, time series). Dynamic time warping (DTW) is one such measure, and a top performer that accommodates distortions when aligning time series. Despite its popularity in clustering, DTW is limited in practice because its runtime complexity is quadratic in the length of the time series. To address this, we present a new self-organizing map for clustering TIME Series, called SOMTimeS, which uses DTW as the distance measure. The method has accuracy similar to other DTW-based clustering algorithms, yet scales better and runs faster. The computational performance stems from pruning unnecessary DTW computations during the SOM’s training phase. For comparison, we implement a similar pruning strategy for K-means, and call the latter K-TimeS. SOMTimeS and K-TimeS pruned 43% and 50% of the total DTW computations, respectively. Pruning effectiveness, accuracy, execution time, and scalability are evaluated using 112 benchmark time series datasets from the UC Riverside classification archive; for similar accuracy, SOMTimeS and K-TimeS achieve a 1.8× speed-up on average, with rates varying between 1× and 18× depending on the dataset. We also apply SOMTimeS to a healthcare study of patient-clinician serious illness conversations to demonstrate the algorithm’s utility with complex, temporally sequenced natural language.
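The speed-up described in the abstract comes from skipping full DTW evaluations whenever a cheap lower bound already shows a candidate cannot become the best match. The sketch below illustrates that general idea in Python; it is not the authors' implementation, and the specific choices of LB_Keogh with a Sakoe-Chiba band, as well as the names `dtw_distance`, `lb_keogh`, and `best_matching_unit`, are my own illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of lower-bound pruning during a best-matching-unit (BMU) search,
# in the spirit of the DTW pruning described in the abstract. Assumes equal-length
# series and a Sakoe-Chiba warping window so that LB_Keogh is a valid lower bound.
import numpy as np

def dtw_distance(x, y, radius):
    """DTW with a Sakoe-Chiba band of half-width `radius` (dynamic programming)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - radius), min(m, i + radius) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return np.sqrt(cost[n, m])

def lb_keogh(x, y, radius):
    """LB_Keogh: distance from x to the warping envelope of y; never exceeds
    the band-constrained DTW distance with the same radius."""
    lb = 0.0
    for i, xi in enumerate(x):
        window = y[max(0, i - radius):min(len(y), i + radius + 1)]
        upper, lower = window.max(), window.min()
        if xi > upper:
            lb += (xi - upper) ** 2
        elif xi < lower:
            lb += (xi - lower) ** 2
    return np.sqrt(lb)

def best_matching_unit(series, weights, radius=5):
    """Return the index and DTW distance of the SOM node closest to `series`,
    pruning the full DTW computation whenever the lower bound cannot beat the
    best distance found so far."""
    best_idx, best_dist = -1, np.inf
    for idx, w in enumerate(weights):
        if lb_keogh(series, w, radius) >= best_dist:
            continue  # pruned: the full (quadratic) DTW computation is skipped
        d = dtw_distance(series, w, radius)
        if d < best_dist:
            best_idx, best_dist = idx, d
    return best_idx, best_dist

# Toy usage: 4 SOM nodes, each with a weight "series" of length 50.
rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 50))
query = rng.standard_normal(50)
print(best_matching_unit(query, weights))
```

Because the lower bound is linear in the series length while DTW is quadratic (or banded-quadratic), every pruned candidate avoids a comparatively expensive computation; accumulated over an entire SOM training run, such pruning is the kind of mechanism that yields the dataset-dependent speed-ups reported above.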
About the journal:
Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.