Pub Date : 2023-12-21 DOI: 10.1007/s10618-023-00995-9
Jaewan Chun, Geon Lee, Kijung Shin, Jinhong Jung
Random walk with restart (RWR) is a widely used measure of node similarity in graphs, and it has proved useful for ranking, community detection, link prediction, anomaly detection, etc. Since RWR typically must be computed separately for a large number of query nodes, or even for all nodes, fast computation is indispensable. For hypergraphs, however, the fast computation of RWR has been unexplored, despite its great potential. In this paper, we propose ARCHER, a fast computation framework for RWR on hypergraphs. Specifically, we first formally define RWR on hypergraphs, and then we propose two computation methods that compose ARCHER. Since the two methods are complementary (i.e., each offers relative advantages on different hypergraphs), we also develop a method for automatic selection between them, which takes a very short time compared to the total running time. Through extensive experiments on 18 real-world hypergraphs, we demonstrate (a) the speed and space efficiency of ARCHER, (b) the complementary nature of the two computation methods composing ARCHER, (c) the accuracy of its automatic selection method, and (d) its successful application to anomaly detection on hypergraphs.
{"title":"Random walk with restart on hypergraphs: fast computation and an application to anomaly detection","authors":"Jaewan Chun, Geon Lee, Kijung Shin, Jinhong Jung","doi":"10.1007/s10618-023-00995-9","DOIUrl":"https://doi.org/10.1007/s10618-023-00995-9","journal":{"name":"Data Mining and Knowledge Discovery"},"publicationDate":"2023-12-21","publicationTypes":"Journal Article"}
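As an illustration of the measure the abstract above builds on, the following is a minimal sketch of RWR by power iteration on an ordinary (non-hyper) graph. It is not ARCHER and does not use the paper's hypergraph definition, which the abstract does not spell out. The function name `rwr_scores`, the restart probability of 0.15, and the choice to send the mass of neighbor-less nodes back to the query node are all assumptions made for this example.

```python
def rwr_scores(adjacency, seed, restart_prob=0.15, tol=1e-10, max_iter=1000):
    """Random walk with restart on a plain graph, solved by power iteration.

    adjacency: dict mapping every node to a list of its out-neighbors
               (every neighbor must itself be a key of the dict).
    seed:      the query node the walker restarts at.
    Returns a dict of stationary visiting probabilities that sum to 1.
    """
    nodes = list(adjacency)
    scores = {u: 0.0 for u in nodes}
    scores[seed] = 1.0
    for _ in range(max_iter):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            nbrs = adjacency[u]
            if not nbrs:
                # Assumption: a node without neighbors returns its mass to the seed.
                nxt[seed] += (1 - restart_prob) * scores[u]
                continue
            # With probability 1 - restart_prob, move to a uniformly chosen neighbor.
            share = (1 - restart_prob) * scores[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        # With probability restart_prob, jump back to the query node.
        nxt[seed] += restart_prob
        diff = sum(abs(nxt[u] - scores[u]) for u in nodes)
        scores = nxt
        if diff < tol:
            break
    return scores


# Hypothetical three-node path graph a - b - c, queried from node "a".
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
path_scores = rwr_scores(path_graph, "a")
```

Each query node needs its own run of this loop, which is why computing RWR for many query nodes becomes expensive.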
Pub Date : 2023-12-21 DOI: 10.1007/s10618-023-00997-7
Antonio R. Moya, Bruno Veloso, João Gama, Sebastián Ventura
{"title":"Improving hyper-parameter self-tuning for data streams by adapting an evolutionary approach","authors":"Antonio R. Moya, Bruno Veloso, João Gama, Sebastián Ventura","doi":"10.1007/s10618-023-00997-7","DOIUrl":"https://doi.org/10.1007/s10618-023-00997-7","journal":{"name":"Data Mining and Knowledge Discovery"},"publicationDate":"2023-12-21","publicationTypes":"Journal Article"}
Pub Date : 2023-12-12 DOI: 10.1007/s10618-023-00990-0
Ling Jian, Kai Shao, Ying Liu, Jundong Li, Xijun Liang
Distilling actionable patterns from large-scale streaming data in the presence of concept drift is a challenging problem, especially when the data is polluted with noisy labels. To date, various data stream mining algorithms have been proposed and extensively used in many real-world applications. Considering that classical online learning algorithms complement one another, and with the goal of combining their advantages, we propose an Online Ensemble Classification (OEC) algorithm that integrates the predictions obtained by different base online classification algorithms. The proposed OEC method works by learning the weights of the base classifiers dynamically through the classical Normalized Exponentiated Gradient (NEG) algorithm framework. As a result, OEC inherits the adaptability and flexibility of concept drift-tracking online classifiers while maintaining the robustness of noise-resistant online classifiers. Theoretically, we show that OEC is a low-regret algorithm, which makes it a good candidate for learning from noisy streaming data. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed OEC method.
{"title":"OEC: an online ensemble classifier for mining data streams with noisy labels","authors":"Ling Jian, Kai Shao, Ying Liu, Jundong Li, Xijun Liang","doi":"10.1007/s10618-023-00990-0","DOIUrl":"https://doi.org/10.1007/s10618-023-00990-0","journal":{"name":"Data Mining and Knowledge Discovery"},"publicationDate":"2023-12-12","publicationTypes":"Journal Article"}
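The weight-learning idea named in the abstract above can be sketched with a generic normalized exponentiated-gradient step. This is not the OEC algorithm itself: the abstract does not state the loss or the learning rate, so using each base classifier's 0/1 loss as the gradient signal, the value `learning_rate=0.5`, and the function names `neg_update` and `ensemble_predict` are assumptions for illustration only.

```python
import math


def neg_update(weights, losses, learning_rate=0.5):
    """One normalized exponentiated-gradient step over base-classifier weights.

    Each weight is multiplied by exp(-learning_rate * loss) and the result is
    renormalized so the weights stay on the probability simplex.
    """
    scaled = [w * math.exp(-learning_rate * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def ensemble_predict(weights, base_predictions):
    """Weighted vote over base predictions given as labels in {-1, +1}."""
    combined = sum(w * p for w, p in zip(weights, base_predictions))
    return 1 if combined >= 0 else -1


# Hypothetical round: two base classifiers, the first one errs on this example.
weights = [0.5, 0.5]
base_predictions = [-1, 1]
true_label = 1
losses = [0.0 if p == true_label else 1.0 for p in base_predictions]
weights = neg_update(weights, losses)
```

Because the update is multiplicative, a base classifier that keeps making mistakes loses weight exponentially fast, while the normalization keeps the ensemble a convex combination of its members.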
Pub Date : 2023-12-08 DOI: 10.1007/s10618-023-00987-9
Nourhan Ahmed, Lars Schmidt-Thieme
Handling incomplete multivariate time series is an important and fundamental concern for a variety of domains. Existing time-series imputation approaches rely on basic assumptions regarding relationship information between sensors, posing significant challenges since inter-sensor interactions in the real world are often complex and unknown beforehand. Specifically, there is a lack of in-depth investigation into (1) the coexistence of relationships between sensors and (2) the incorporation of reciprocal impact between sensor properties and inter-sensor relationships for the time-series imputation problem. To fill this gap, we present the Structure-aware Decoupled imputation network (SaD), which is designed to model sensor characteristics and relationships between sensors in distinct latent spaces. Our approach is equipped with a two-step knowledge integration scheme that incorporates the mutual influence between sensor attribute information and sensor relationship information. The experimental results indicate that, compared to state-of-the-art models for time-series imputation tasks, our proposed method can reduce error by around 15%.
{"title":"Structure-aware decoupled imputation network for multivariate time series","authors":"Nourhan Ahmed, Lars Schmidt-Thieme","doi":"10.1007/s10618-023-00987-9","DOIUrl":"https://doi.org/10.1007/s10618-023-00987-9","journal":{"name":"Data Mining and Knowledge Discovery"},"publicationDate":"2023-12-08","publicationTypes":"Journal Article"}
Pub Date : 2023-12-01 Epub Date: 2023-06-19 DOI: 10.1007/s12070-023-03947-3
Kalyana Sundaram Chidambaram, Manjul Muraleedharan, Amit Keshri, Sabaratnam Mayilvaganan, Nazrin Hameed, Mohd Aqib, Arushi Kumar, Ravi Sankar Manogaran, Raj Kumar
Benign parotid tumors follow an indolent course and present as slow-growing painless swelling in the pre- and infra-auricular areas. The treatment of choice is surgery. Though the gold standard technique is Superficial Parotidectomy, Extracapsular Dissection (ECD) is an alternative option with the same outcome and decreased complications. This study discusses our experience with extracapsular dissection and the surgical nuances for better results. A retrospective study was conducted of histologically confirmed cases of pleomorphic adenoma of the parotid gland that underwent extracapsular dissection between September 2019 and March 2023. The demographic details, clinical characteristics, and outcomes were evaluated. There were 33 patients, including 16 females and 17 males, with a mean age of 32.75 years. All cases presented as slow-growing painless swelling for a mean duration of 5 years. Most of the tumors (94%) were between 2 and 4 cm in size, with few tumors larger than 4 cm. All underwent extracapsular dissection with complete excision. There was only one complication (seroma) and no incidence of facial palsy in our experience with ECD. The goal of benign parotid surgery is the complete removal of the tumor with minimum complications, which could be achieved with ECD, as it offers good tumor clearance and lower rates of complications with good cosmesis. Thus, this minimally invasive parotid surgery could be a worthwhile option in properly selected cases.
{"title":"The Outcomes and Surgical Nuances of Minimally Invasive Parotid Surgery for Pleomorphic Adenoma.","authors":"Kalyana Sundaram Chidambaram, Manjul Muraleedharan, Amit Keshri, Sabaratnam Mayilvaganan, Nazrin Hameed, Mohd Aqib, Arushi Kumar, Ravi Sankar Manogaran, Raj Kumar","doi":"10.1007/s12070-023-03947-3","DOIUrl":"10.1007/s12070-023-03947-3","journal":{"name":"Data Mining and Knowledge Discovery","pages":"3256-3262"},"publicationDate":"2023-12-01","publicationTypes":"Journal Article","openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10645680/pdf/"}
Pub Date : 2023-11-18 DOI: 10.1007/s10618-023-00988-8
Sondre Sørbø, Massimiliano Ruocco
The field of time series anomaly detection is constantly advancing, with several methods available, making it a challenge to determine the most appropriate method for a specific domain. The evaluation of these methods is facilitated by the use of metrics, which vary widely in their properties. Despite the existence of new evaluation metrics, there is limited agreement on which metrics are best suited for specific scenarios and domains, and the most commonly used metrics have faced criticism in the literature. This paper provides a comprehensive overview of the metrics used for the evaluation of time series anomaly detection methods, and also defines a taxonomy of these based on how they are calculated. By defining a set of properties for evaluation metrics and a set of specific case studies and experiments, twenty metrics are analyzed and discussed in detail, highlighting the unique suitability of each for specific tasks. Through extensive experimentation and analysis, this paper argues that the choice of evaluation metric must be made with care, taking into account the specific requirements of the task at hand.
{"title":"Navigating the metric maze: a taxonomy of evaluation metrics for anomaly detection in time series","authors":"Sondre Sørbø, Massimiliano Ruocco","doi":"10.1007/s10618-023-00988-8","DOIUrl":"https://doi.org/10.1007/s10618-023-00988-8","journal":{"name":"Data Mining and Knowledge Discovery"},"publicationDate":"2023-11-18","publicationTypes":"Journal Article"}
Pub Date : 2023-11-13 DOI: 10.1007/s10618-023-00985-x
Luka Biedebach, María Óskarsdóttir, Erna Sif Arnardóttir, Sigridur Sigurdardóttir, Michael Valur Clausen, Sigurveig Þ. Sigurdardóttir, Marta Serwatko, Anna Sigridur Islind
Identifying mouth breathing during sleep in a reliable, non-invasive way is challenging and is currently not included in sleep studies. However, it has high clinical relevance in pediatrics, as it can negatively impact the physical and mental health of children. Since mouth breathing is an anomalous condition in the general population, with only 2% prevalence in our data set, we are facing an anomaly detection problem. This type of human medical data is commonly approached with deep learning methods. However, applying multiple supervised and unsupervised machine learning methods to this anomaly detection problem showed that classic machine learning methods should also be taken into account. This paper compared deep learning and classic machine learning methods on respiratory data during sleep using leave-one-out cross-validation. In this way, we observed the uncertainty of the models and their performance across participants with varying signal quality and prevalence of mouth breathing. The main contribution is identifying the model with the highest clinical relevance to facilitate the diagnosis of chronic mouth breathing, which may allow more affected children to receive appropriate treatment.
{"title":"Anomaly detection in sleep: detecting mouth breathing in children","authors":"Luka Biedebach, María Óskarsdóttir, Erna Sif Arnardóttir, Sigridur Sigurdardóttir, Michael Valur Clausen, Sigurveig Þ. Sigurdardóttir, Marta Serwatko, Anna Sigridur Islind","doi":"10.1007/s10618-023-00985-x","DOIUrl":"https://doi.org/10.1007/s10618-023-00985-x","journal":{"name":"Data Mining and Knowledge Discovery"},"publicationDate":"2023-11-13","publicationTypes":"Journal Article"}
Pub Date : 2023-10-27 DOI: 10.1007/s10618-023-00984-y
João Palet, Vasco Manquinho, Rui Henriques
Individual and societal systems are open systems continuously affected by their situational context. In recent years, context sources have been increasingly considered in different domains to aid short- and long-term forecasts of systems’ behavior. Nevertheless, available research generally disregards the role of prospective context, such as calendrical planning or weather forecasts. This work proposes a multiple-input neural architecture consisting of a sequential composition of long short-term memory units or temporal convolutional networks able to incorporate both historical and prospective sources of situational context to aid time series forecasting tasks. Considering urban case studies, we further assess the impact that different sources of external context have on medical emergency and mobility forecasts. Results show that the incorporation of external context variables, including calendrical and weather variables, can significantly reduce forecasting errors against state-of-the-art forecasters. In particular, the incorporation of prospective context, generally neglected in related work, mitigates error increases along the forecasting horizon.
{"title":"Multiple-input neural networks for time series forecasting incorporating historical and prospective context","authors":"João Palet, Vasco Manquinho, Rui Henriques","doi":"10.1007/s10618-023-00984-y","DOIUrl":"https://doi.org/10.1007/s10618-023-00984-y","journal":{"name":"Data Mining and Knowledge Discovery"},"publicationDate":"2023-10-27","publicationTypes":"Journal Article"}
Pub Date : 2023-10-26 DOI: 10.1007/s10618-023-00983-z
Anne Hartebrodt, Richard Röttger, David B. Blumenthal
Federated learning (FL) is emerging as a privacy-aware alternative to classical cloud-based machine learning. In FL, the sensitive data remains in data silos and only aggregated parameters are exchanged. Hospitals and research institutions that are unwilling to share their data can join a federated study without breaching confidentiality. In addition to the extreme sensitivity of biomedical data, the high dimensionality poses a challenge in the context of federated genome-wide association studies (GWAS). In this article, we present a federated singular value decomposition algorithm suitable for the privacy-related and computational requirements of GWAS. Notably, the algorithm has a transmission cost that is independent of the number of samples and only weakly dependent on the number of features, because the singular vectors corresponding to the samples are never exchanged and the vectors associated with the features are only transmitted to an aggregator for a fixed number of iterations. Although motivated by GWAS, the algorithm is generically applicable to both horizontally and vertically partitioned data.
{"title":"Federated singular value decomposition for high-dimensional data","authors":"Anne Hartebrodt, Richard Röttger, David B. Blumenthal","doi":"10.1007/s10618-023-00983-z","DOIUrl":"https://doi.org/10.1007/s10618-023-00983-z","journal":{"name":"Data Mining and Knowledge Discovery"},"publicationDate":"2023-10-26","publicationTypes":"Journal Article"}
Pub Date : 2023-10-20 DOI: 10.1007/s10618-023-00979-9
Ali Javed, Donna M. Rizzo, Byung Suk Lee, Robert Gramling
There is demand for scalable algorithms capable of clustering and analyzing large time series data. The Kohonen self-organizing map (SOM) is an unsupervised artificial neural network for clustering, visualizing, and reducing the dimensionality of complex data. Like all clustering methods, it requires a measure of similarity between input data (in this work, time series). Dynamic time warping (DTW) is one such measure, and a top performer that accommodates distortions when aligning time series. Despite its popularity in clustering, DTW is limited in practice because its runtime complexity is quadratic in the length of the time series. To address this, we present a new self-organizing map for clustering TIME Series, called SOMTimeS, which uses DTW as the distance measure. The method has similar accuracy compared with other DTW-based clustering algorithms, yet scales better and runs faster. The computational performance stems from the pruning of unnecessary DTW computations during the SOM’s training phase. For comparison, we implement a similar pruning strategy for K-means, and call the latter K-TimeS. SOMTimeS and K-TimeS pruned 43% and 50% of the total DTW computations, respectively. Pruning effectiveness, accuracy, execution time and scalability are evaluated using 112 benchmark time series datasets from the UC Riverside classification archive, and show, for similar accuracy, a 1.8× speed-up on average for SOMTimeS and K-TimeS, with rates varying between 1× and 18× depending on the dataset. We also apply SOMTimeS to a healthcare study of patient-clinician serious illness conversations to demonstrate the algorithm’s utility with complex, temporally sequenced natural language.
{"title":"Somtimes: self organizing maps for time series clustering and its application to serious illness conversations","authors":"Ali Javed, Donna M. Rizzo, Byung Suk Lee, Robert Gramling","doi":"10.1007/s10618-023-00979-9","DOIUrl":"https://doi.org/10.1007/s10618-023-00979-9","journal":{"name":"Data Mining and Knowledge Discovery"},"publicationDate":"2023-10-20","publicationTypes":"Journal Article"}
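The quadratic cost of DTW noted in the abstract above comes from the dynamic program that fills one cell per pair of time steps. The following is a minimal sketch of that textbook recurrence for univariate series; it shows only the baseline distance and not the pruning used by SOMTimeS or K-TimeS, which the abstract does not describe. The function name `dtw_distance` and the squared-difference local cost are assumptions for this example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two univariate sequences.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the cost
    table, so memory is O(len(b)).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # prev[j] holds the best cumulative cost of aligning a[:i-1] with b[:j].
    prev = [inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: step in a only, step in b only, step in both.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m] ** 0.5


# Hypothetical series: the second repeats one value, which warping can absorb.
stretched = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

Every pairwise comparison during clustering pays this nested loop in full, which is why skipping comparisons whose outcome cannot change the result reduces total running time.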