Pub Date: 2023-10-20 DOI: 10.1007/s10618-023-00979-9
Ali Javed, Donna M. Rizzo, Byung Suk Lee, Robert Gramling
Abstract: There is demand for scalable algorithms capable of clustering and analyzing large time series data. The Kohonen self-organizing map (SOM) is an unsupervised artificial neural network for clustering, visualizing, and reducing the dimensionality of complex data. Like all clustering methods, it requires a measure of similarity between input data (in this work, time series). Dynamic time warping (DTW) is one such measure, and a top performer that accommodates distortions when aligning time series. Despite its popularity in clustering, DTW is limited in practice because its runtime complexity is quadratic in the length of the time series. To address this, we present a new self-organizing map for clustering TIME Series, called SOMTimeS, which uses DTW as the distance measure. The method has similar accuracy to other DTW-based clustering algorithms, yet scales better and runs faster. The computational performance stems from the pruning of unnecessary DTW computations during the SOM’s training phase. For comparison, we implement a similar pruning strategy for K-means, and call the latter K-TimeS. SOMTimeS and K-TimeS pruned 43% and 50% of the total DTW computations, respectively. Pruning effectiveness, accuracy, execution time and scalability are evaluated using 112 benchmark time series datasets from the UC Riverside classification archive; the results show that, for similar accuracy, SOMTimeS and K-TimeS achieve a 1.8× speed-up on average, with rates varying between 1× and 18× depending on the dataset. We also apply SOMTimeS to a healthcare study of patient-clinician serious illness conversations to demonstrate the algorithm’s utility with complex, temporally sequenced natural language.
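To make the pruning idea concrete, the following is a minimal Python sketch of a best-matching-unit search that skips DTW computations whenever a cheap lower bound already rules a candidate out. The abstract does not say which bound SOMTimeS uses, so the choice of the LB_Keogh lower bound here is an assumption, as are the function names (`dtw`, `lb_keogh`, `best_matching_unit`), the Sakoe-Chiba `window` parameter, and the restriction to equal-length series. It is an illustration of bound-based pruning in general, not the authors' implementation.

```python
import math


def dtw(a, b, window):
    """DTW distance (squared point costs, square root at the end),
    restricted to a Sakoe-Chiba band of half-width `window`."""
    n, m = len(a), len(b)
    w = max(window, abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            cur[j] = cost + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return math.sqrt(prev[m])


def lb_keogh(query, candidate, window):
    """LB_Keogh lower bound on the banded DTW distance (equal lengths assumed).
    Costs O(n * window) instead of filling the DTW table."""
    total = 0.0
    n = len(candidate)
    for i, q in enumerate(query):
        segment = candidate[max(0, i - window):min(n, i + window + 1)]
        upper, lower = max(segment), min(segment)
        if q > upper:
            total += (q - upper) ** 2
        elif q < lower:
            total += (q - lower) ** 2
    return math.sqrt(total)


def best_matching_unit(series, weights, window):
    """Find the closest weight vector under DTW, skipping any candidate whose
    lower bound is already no better than the best distance found so far."""
    best_idx, best_dist, pruned = -1, math.inf, 0
    for idx, weight in enumerate(weights):
        if lb_keogh(series, weight, window) >= best_dist:
            pruned += 1  # full DTW cannot beat the current best
            continue
        dist = dtw(series, weight, window)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist, pruned
```

Because the lower bound never exceeds the true banded DTW distance, the pruned search returns the same best-matching unit as an exhaustive search; only the number of full DTW evaluations changes.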
Title: SOMTimeS: self organizing maps for time series clustering and its application to serious illness conversations Journal: Data Mining and Knowledge Discovery
Pub Date: 2023-10-20 DOI: 10.1007/s10618-023-00957-1
João Luiz Junho Pereira, Kate Smith-Miles, Mario Andrés Muñoz, Ana Carolina Lorena
Title: Optimal selection of benchmarking datasets for unbiased machine learning algorithm evaluation Journal: Data Mining and Knowledge Discovery
Pub Date: 2023-10-17 DOI: 10.1007/s10618-023-00970-4
Martin Khannouz, Tristan Glatard
Title: Mondrian forest for data stream classification under memory constraints Journal: Data Mining and Knowledge Discovery
Pub Date: 2023-10-16 DOI: 10.1007/s10618-023-00978-w
Nestor Cabello, Elham Naghizade, Jianzhong Qi, Lars Kulik
Abstract: Time series classification (TSC) aims to predict the class label of a given time series, which is critical to a rich set of application areas such as economics and medicine. State-of-the-art TSC methods have mostly focused on classification accuracy, without considering classification speed. However, efficiency is important for big data analysis. Datasets with a large training size or long series challenge the use of the current highly accurate methods, because they are usually computationally expensive. Similarly, classification explainability, which is an important property required by modern big data applications such as appliance modeling and by legislation such as the European General Data Protection Regulation, has received little attention. To address these gaps, we propose a novel TSC method, the Randomized-Supervised Time Series Forest (r-STSF). r-STSF is extremely fast and achieves state-of-the-art classification accuracy. It is an efficient interval-based approach that classifies time series according to aggregate values of the discriminatory sub-series (intervals). To achieve state-of-the-art accuracy, r-STSF builds an ensemble of randomized trees using the discriminatory sub-series. It uses four time series representations, nine aggregation functions and a supervised binary-inspired search combined with a feature ranking metric to identify highly discriminatory sub-series. The discriminatory sub-series enable explainable classifications. Experiments on extensive datasets show that r-STSF achieves state-of-the-art accuracy while being orders of magnitude faster than most existing TSC methods and enabling explanations of the classifier's decisions.
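As an illustration of what "aggregate values of sub-series (intervals)" can look like, here is a small Python sketch that turns a time series into interval-based features. The abstract does not list the nine aggregation functions or describe the supervised interval search, so the three aggregates used here (mean, standard deviation, least-squares slope), the hand-picked intervals, and the names `slope`, `AGGREGATES` and `interval_features` are all assumptions for illustration. This is not r-STSF itself; it only shows the kind of feature a tree ensemble could be trained on.

```python
import statistics


def slope(values):
    """Least-squares slope of the values against their positions 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Hypothetical subset of aggregation functions; the paper uses nine.
AGGREGATES = {
    "mean": statistics.fmean,
    "std": statistics.pstdev,
    "slope": slope,
}


def interval_features(series, intervals):
    """Map each (start, end) interval and aggregate name to one feature value.
    Intervals are half-open: series[start:end]."""
    features = {}
    for start, end in intervals:
        window = series[start:end]
        for name, fn in AGGREGATES.items():
            features[(start, end, name)] = fn(window)
    return features
```

Each feature key records the interval it came from, which is what makes an interval-based classifier inspectable: a split on `(0, 4, "slope")` points directly at the region of the series that drove the decision.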
Title: Fast, accurate and explainable time series classification through randomization Journal: Data Mining and Knowledge Discovery
Pub Date: 2023-10-13 DOI: 10.1007/s10618-023-00981-1
Sowon Jeon, Gilhee Lee, Hyoungshick Kim, Simon S. Woo
Title: Design and evaluation of highly accurate smart contract code vulnerability detection framework Journal: Data Mining and Knowledge Discovery
Pub Date: 2023-10-09 DOI: 10.1007/s10618-023-00976-y
Guihong Wan, Baokun He, Haim Schweitzer
Title: The art of centering without centering for robust principal component analysis Journal: Data Mining and Knowledge Discovery
Pub Date: 2023-10-09 DOI: 10.1007/s10618-023-00975-z
Hélder Alves, Paula Brito, Pedro Campos
Title: Community detection in interval-weighted networks Journal: Data Mining and Knowledge Discovery
Pub Date: 2023-10-09 DOI: 10.1007/s10618-023-00977-x
Ilaria Bombelli, Ichcha Manipur, Mario Rosario Guarracino, Maria Brigida Ferraro
Title: Representing ensembles of networks for fuzzy cluster analysis: a case study Journal: Data Mining and Knowledge Discovery
Pub Date: 2023-09-29 DOI: 10.1007/s10618-023-00980-2
Moritz Herrmann, Daniyal Kazempour, Fabian Scheipl, Peer Kröger
Abstract: We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: clustering embedding vectors that represent the inherent structure of a dataset, instead of the observed feature vectors themselves, is highly beneficial. To demonstrate, we combine the manifold learning method UMAP, for inferring the topological structure, with the density-based clustering method DBSCAN. Synthetic and real data results show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems, including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces the parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how separable the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. The approach is successful because it performs the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.
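The second stage of the pipeline described above, density-based clustering of embedding vectors, can be sketched in plain Python. This is a textbook-style DBSCAN written for illustration, not the implementation used in the paper, and it assumes the embedding has already been computed by a manifold learning step such as UMAP (not shown here). The function name `dbscan` and its parameters `eps` and `min_pts` are naming choices made for this sketch.

```python
import math


def dbscan(points, eps, min_pts):
    """Assign a cluster label to each point; -1 marks noise.
    A point is a core point if at least `min_pts` points (itself included)
    lie within distance `eps` of it."""
    n = len(points)
    # Brute-force neighborhoods: fine for a sketch, O(n^2) in general.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep growing
    return labels
```

The two parameters `eps` and `min_pts` are the ones whose sensitivity the abstract says is reduced when the input is a topology-preserving embedding rather than the raw feature vectors.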
Title: Enhancing cluster analysis via topological manifold learning Journal: Data Mining and Knowledge Discovery