Pub Date: 2024-05-21; DOI: 10.1007/s10618-024-01035-w
Mingjie Qiu, Zhiyi Tan, Bing-Kun Bao
Infectious disease forecasting has been a key research focus and has proved crucial in controlling epidemics. A recent trend is to develop forecasting models based on graph neural networks (GNNs). However, existing GNN-based methods suffer from two key limitations: (1) current models broaden receptive fields by scaling the depth of GNNs, which is insufficient to preserve the semantics of long-range connectivity between distant but epidemic-related areas; (2) previous approaches model epidemics within a single spatial scale, ignoring the multi-scale epidemic patterns derived from different scales. To address these deficiencies, we devise the Multi-scale Spatio-temporal Graph Neural Network (MSGNN) based on an innovative multi-scale view. Specifically, in the proposed MSGNN model, we first devise a novel graph learning module, which directly captures long-range connectivity from trans-regional epidemic signals and integrates it into a multi-scale graph. Based on the learned multi-scale graph, we utilize a newly designed graph convolution module to exploit multi-scale epidemic patterns. This module facilitates multi-scale epidemic modeling by mining both scale-shared and scale-specific patterns. Experimental results on forecasting new cases of COVID-19 in the United States demonstrate the superiority of our method over state-of-the-art methods. Further analyses and visualization also show that MSGNN offers not only accurate, but also robust and interpretable forecasting results. Code is available at https://github.com/JashinKorone/MSGNN.
Title: MSGNN: Multi-scale Spatio-temporal Graph Neural Network for epidemic forecasting. Journal: Data Mining and Knowledge Discovery.
Pub Date: 2024-05-19; DOI: 10.1007/s10618-024-01027-w
David Guijo-Rubio, Matthew Middlehurst, Guilherme Arcencio, Diego Furtado Silva, Anthony Bagnall
Time Series Extrinsic Regression (TSER) involves using a set of training time series to form a predictive model of a continuous response variable that is not directly related to the regressor series. The TSER archive for comparing algorithms was released in 2022 with 19 problems. We increase the size of this archive to 63 problems and reproduce the previous comparison of baseline algorithms. We then extend the comparison to include a wider range of standard regressors and the latest versions of TSER models used in the previous study. We show that none of the previously evaluated regressors can outperform a regression adaptation of a standard classifier, rotation forest. We introduce two new TSER algorithms developed from related work in time series classification. FreshPRINCE is a pipeline estimator consisting of a transform into a wide range of summary features followed by a rotation forest regressor. DrCIF is a tree ensemble that creates features from summary statistics over random intervals. Our study demonstrates that both algorithms, along with InceptionTime, exhibit significantly better performance compared to the other 18 regressors tested. More importantly, DrCIF is the only one that significantly outperforms a standard rotation forest regressor.
Title: Unsupervised feature based algorithms for time series extrinsic regression. Journal: Data Mining and Knowledge Discovery.
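The idea of building features from summary statistics over random intervals, as described for DrCIF above, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sample_intervals`, `interval_features`, `slope`), the choice of mean, standard deviation and slope as the summary statistics, and the default parameters are all assumptions made for illustration.

```python
import random
import statistics


def sample_intervals(series_length, n_intervals, min_len=3, seed=0):
    """Draw random (start, end) intervals; the same set is reused for every series."""
    rng = random.Random(seed)
    intervals = []
    for _ in range(n_intervals):
        length = rng.randint(min_len, series_length)
        start = rng.randint(0, series_length - length)
        intervals.append((start, start + length))
    return intervals


def slope(window):
    """Least-squares slope of the window values against their time index."""
    n = len(window)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(window)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(window))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def interval_features(series, intervals):
    """Concatenate (mean, std, slope) for each interval into one feature vector."""
    feats = []
    for start, end in intervals:
        window = series[start:end]
        feats += [statistics.fmean(window), statistics.pstdev(window), slope(window)]
    return feats
```

In an ensemble, each tree would typically draw its own intervals and be trained on the resulting feature vectors; any standard tree regressor could consume the output of `interval_features`.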
Pub Date: 2024-04-30; DOI: 10.1007/s10618-024-01020-3
Raphael Fischer, Thomas Liebig, Katharina Morik
With machine learning (ML) becoming a popular tool across all domains, practitioners are in dire need of comprehensive reporting on the state-of-the-art. Benchmarks and open databases provide helpful insights for many tasks, but they suffer from several shortcomings. First, they focus too heavily on prediction quality, which is problematic considering the demand for more sustainability in ML. Depending on the use case at hand, interested users might also face tight resource constraints and thus should be allowed to interact with reporting frameworks in order to prioritize certain reported characteristics. Furthermore, as some practitioners might not yet be well versed in ML, it is important to convey information on a more abstract, comprehensible level. Usability and extendability are key for keeping up with the state-of-the-art, and in order to be trustworthy, frameworks should explicitly address reproducibility. In this work, we analyze established reporting systems with respect to the aforementioned issues. Afterwards, we propose STREP, our novel framework that aims at overcoming these shortcomings and paves the way towards more sustainable and trustworthy reporting. We use STREP’s (publicly available) implementation to investigate various existing report databases. Our experimental results reveal the need for making reporting more resource-aware and demonstrate our framework’s ability to overcome current reporting limitations. With our work, we want to initiate a paradigm shift in reporting and help make ML advances more considerate of sustainability and trustworthiness.
Title: Towards more sustainable and trustworthy reporting in machine learning. Journal: Data Mining and Knowledge Discovery.
Pub Date: 2024-04-25; DOI: 10.1007/s10618-024-01010-5
Kacper Sokol, Peter Flach
Interpretable representations are the backbone of many explainers that target black-box predictive systems based on artificial intelligence and machine learning algorithms. They translate the low-level data representation necessary for good predictive performance into high-level human-intelligible concepts used to convey the explanatory insights. Notably, the explanation type and its cognitive complexity are directly controlled by the interpretable representation, and tweaking it allows an explainer to target a particular audience and use case. However, many explainers built upon interpretable representations overlook their merit and fall back on default solutions that often carry implicit assumptions, thereby degrading the explanatory power and reliability of such techniques. To address this problem, we study properties of interpretable representations that encode presence and absence of human-comprehensible concepts. We demonstrate how they are operationalised for tabular, image and text data; discuss their assumptions, strengths and weaknesses; identify their core building blocks; and scrutinise their configuration and parameterisation. In particular, this in-depth analysis allows us to pinpoint their explanatory properties, desiderata and scope for (malicious) manipulation in the context of tabular data where a linear model is used to quantify the influence of interpretable concepts on a black-box prediction. Our findings lead to a range of recommendations for designing trustworthy interpretable representations; specifically, the benefits of class-aware (supervised) discretisation of tabular data, e.g., with decision trees, and sensitivity of image interpretable representations to segmentation granularity and occlusion colour.
Title: Interpretable representations in explainable AI: from theory to practice. Journal: Data Mining and Knowledge Discovery.
Pub Date: 2024-04-19; DOI: 10.1007/s10618-024-01022-1
Matthew Middlehurst, Patrick Schäfer, Anthony Bagnall
In 2017, a research paper (Bagnall et al. Data Mining and Knowledge Discovery 31(3):606-660. 2017) compared 18 Time Series Classification (TSC) algorithms on 85 datasets from the University of California, Riverside (UCR) archive. This study, commonly referred to as a ‘bake off’, identified that only nine algorithms performed significantly better than the Dynamic Time Warping (DTW) and Rotation Forest benchmarks that were used. The study categorised each algorithm by the type of feature it extracts from time series data, forming a taxonomy of five main algorithm types. This categorisation of algorithms, alongside the provision of code and accessible results for reproducibility, has helped fuel an increase in popularity of the TSC field. Over six years have passed since this bake off; the UCR archive has expanded to 112 datasets, and a large number of new algorithms have been proposed. We revisit the bake off, seeing how each of the proposed categories has advanced since the original publication, and evaluate the performance of newer algorithms against the previous best-of-category using an expanded UCR archive. We extend the taxonomy to include three new categories to reflect recent developments. Alongside the originally proposed distance, interval, shapelet, dictionary and hybrid based algorithms, we compare newer convolution and feature based algorithms as well as deep learning approaches. We introduce 30 classification datasets either recently donated to the archive or reformatted to the TSC format, and use these to further evaluate the best performing algorithm from each category. Overall, we find that two recently proposed algorithms, MultiROCKET+Hydra (Dempster et al. 2022) and HIVE-COTEv2 (Middlehurst et al. Mach Learn 110:3211-3243. 2021), perform significantly better than other approaches on both the current and new TSC problems.
Title: Bake off redux: a review and experimental evaluation of recent time series classification algorithms. Journal: Data Mining and Knowledge Discovery.
Pub Date: 2024-04-10; DOI: 10.1007/s10618-024-01019-w
Helen L. Smith, Patrick J. Biggs, Nigel P. French, Adam N. H. Smith, Jonathan C. Marshall
Levels of a predictor variable that are absent when a classification tree is grown cannot be subject to an explicit splitting rule. This is an issue if these absent levels are present in a new observation for prediction. To date, there remains no satisfactory solution for absent levels in random forest models. Unlike missing data, absent levels are fully observed and known. Ordinal encoding of predictors allows absent levels to be integrated and used for prediction. Using a case study on source attribution of Campylobacter species with whole genome sequencing (WGS) data as predictors, we examine how target-agnostic versus target-based encoding of predictor variables with absent levels affects the accuracy of random forest models. We show that a target-based encoding approach using class probabilities, with absent levels designated the highest rank, is systematically biased, and that this bias is resolved by encoding absent levels according to the a priori hypothesis of equal class probability. We present a novel method of ordinal encoding predictors via principal coordinates analysis (PCO) which capitalizes on the similarity between pairs of predictor levels. Absent levels are encoded according to their similarity to each of the other levels in the training data. We show that the PCO-encoding method performs at least as well as the target-based approach and is not biased.
Title: Lost in the Forest: Encoding categorical variables and the absent levels problem. Journal: Data Mining and Knowledge Discovery.
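The target-based encoding described above, where a level unseen during training receives the value implied by equal class probability, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: it encodes each level by the observed proportion of one chosen class (a one-vs-rest view), and the names `fit_target_encoding` and `encode_level` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_target_encoding(levels, labels, positive):
    """Map each observed level to the proportion of rows with the `positive` class."""
    counts = defaultdict(Counter)
    for level, label in zip(levels, labels):
        counts[level][label] += 1
    return {lvl: c[positive] / sum(c.values()) for lvl, c in counts.items()}


def encode_level(level, encoding, n_classes):
    """Encode a level; absent levels fall back to the equal-probability value 1 / n_classes."""
    return encoding.get(level, 1.0 / n_classes)
```

For example, with two classes a level never seen in training is encoded as 0.5, which places it between levels that lean towards either class instead of at the extreme rank.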
Pub Date: 2024-04-01; DOI: 10.1007/s10618-024-01018-x
Abstract
Time series data, spanning applications ranging from climatology to finance to healthcare, presents significant challenges in data mining due to its size and complexity. One open issue lies in time series clustering, which is crucial for processing large volumes of unlabeled time series data and unlocking valuable insights. Traditional and modern analysis methods, however, often struggle with these complexities. To address these limitations, we introduce R-Clustering, a novel method that utilizes convolutional architectures with randomly selected parameters. Through extensive evaluations, R-Clustering demonstrates superior performance over existing methods in terms of clustering accuracy, computational efficiency and scalability. Empirical results obtained using the UCR archive demonstrate the effectiveness of our approach across diverse time series datasets. The findings highlight the significance of R-Clustering in various domains and applications, contributing to the advancement of time series data mining.
Title: Time series clustering with random convolutional kernels. Journal: Data Mining and Knowledge Discovery.
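To make the phrase "convolutional architectures with randomly selected parameters" more concrete, the sketch below shows one common way to turn a time series into features with random, untrained convolutional kernels. It is only an assumption-laden illustration of the general idea, not the R-Clustering algorithm itself: the kernel length, the Gaussian weights, the two pooled statistics (maximum and proportion of positive values) and the function names are all choices made here for illustration.

```python
import random


def make_kernels(n_kernels, length=3, seed=0):
    """Create random kernels: mean-centred Gaussian weights plus a uniform bias."""
    rng = random.Random(seed)
    kernels = []
    for _ in range(n_kernels):
        weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
        mean_w = sum(weights) / length
        weights = [w - mean_w for w in weights]
        bias = rng.uniform(-1.0, 1.0)
        kernels.append((weights, bias))
    return kernels


def transform(series, kernels):
    """Convolve the series with each kernel and pool two features per kernel."""
    feats = []
    for weights, bias in kernels:
        k = len(weights)
        outputs = [
            sum(w * series[i + j] for j, w in enumerate(weights)) + bias
            for i in range(len(series) - k + 1)
        ]
        feats.append(max(outputs))  # strongest response
        feats.append(sum(o > 0 for o in outputs) / len(outputs))  # share of positive responses
    return feats
```

The resulting fixed-length vectors could then be passed to any standard clustering algorithm, such as k-means; that step is omitted here.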
Pub Date: 2024-03-25; DOI: 10.1007/s10618-024-01015-0
Abstract
One of the central issues of several machine learning applications on real data is the choice of the input features. Ideally, the designer should select a small number of the relevant, nonredundant features to preserve the complete information contained in the original dataset, with little collinearity among features. This procedure helps mitigate problems like overfitting and the curse of dimensionality, which arise when dealing with high-dimensional problems. On the other hand, it is not desirable to simply discard some features, since they may still contain information that can be exploited to improve results. Instead, dimensionality reduction techniques are designed to limit the number of features in a dataset by projecting them into a lower dimensional space, possibly considering all the original features. However, the projected features resulting from the application of dimensionality reduction techniques are usually difficult to interpret. In this paper, we seek to design a principled dimensionality reduction approach that maintains the interpretability of the resulting features. Specifically, we propose a bias-variance analysis for linear models and we leverage these theoretical results to design an algorithm, Linear Correlated Features Aggregation (LinCFA), which aggregates groups of continuous features with their average if their correlation is “sufficiently large”. In this way, all features are considered, the dimensionality is reduced and the interpretability is preserved. Finally, we provide numerical validations of the proposed algorithm both on synthetic datasets to confirm the theoretical results and on real datasets to show some promising applications.
摘要 对真实数据进行机器学习的若干应用的核心问题之一是输入特征的选择。理想情况下,设计者应选择少量相关的、非冗余的特征,以保留原始数据集所包含的完整信息,且特征之间的共线性很小。这一程序有助于缓解处理高维问题时出现的过拟合和维度诅咒等问题。另一方面,简单地丢弃某些特征并不可取,因为这些特征可能仍然包含可用于改进结果的信息。相反,降维技术旨在将数据集中的特征投影到一个更低的维度空间,从而限制特征的数量,并可能考虑到所有原始特征。然而,应用降维技术得到的投影特征通常很难解释。在本文中,我们试图设计一种有原则的降维方法,以保持所得特征的可解释性。具体来说,我们提出了线性模型的偏差-方差分析,并利用这些理论结果设计了一种算法--线性相关特征聚合(Linear Correlated Features Aggregation,LinCFA)。通过这种方法,所有特征都被考虑在内,维度得以降低,可解释性得以保持。最后,我们在合成数据集上对所提出的算法进行了数值验证,以确认理论结果,并在真实数据集上展示了一些有前景的应用。
{"title":"Interpretable linear dimensionality reduction based on bias-variance analysis","authors":"","doi":"10.1007/s10618-024-01015-0","DOIUrl":"https://doi.org/10.1007/s10618-024-01015-0","url":null,"abstract":"<h3>Abstract</h3> <p>One of the central issues of several machine learning applications on real data is the choice of the input features. Ideally, the designer should select a small number of the relevant, nonredundant features to preserve the complete information contained in the original dataset, with little collinearity among features. This procedure helps mitigate problems like overfitting and the curse of dimensionality, which arise when dealing with high-dimensional problems. On the other hand, it is not desirable to simply discard some features, since they may still contain information that can be exploited to improve results. Instead, <em>dimensionality reduction</em> techniques are designed to limit the number of features in a dataset by projecting them into a lower dimensional space, possibly considering all the original features. However, the projected features resulting from the application of dimensionality reduction techniques are usually difficult to interpret. In this paper, we seek to design a principled dimensionality reduction approach that maintains the interpretability of the resulting features. Specifically, we propose a bias-variance analysis for linear models and we leverage these theoretical results to design an algorithm, <em>Linear Correlated Features Aggregation</em> (LinCFA), which aggregates groups of continuous features with their average if their correlation is “sufficiently large”. In this way, all features are considered, the dimensionality is reduced and the interpretability is preserved. 
Finally, we provide numerical validations of the proposed algorithm both on synthetic datasets to confirm the theoretical results and on real datasets to show some promising applications.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"86 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140300618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
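The aggregation step described in the LinCFA abstract above (replace a group of strongly correlated continuous features with their average) can be illustrated with a small sketch. This is not the authors' algorithm: in the paper the "sufficiently large" correlation threshold comes from the bias-variance analysis, whereas here `threshold=0.9` is an arbitrary assumed constant, and the greedy grouping rule and the names `pearson` and `aggregate_correlated` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def aggregate_correlated(columns, threshold=0.9):
    """Greedily group features and replace each group by its row-wise mean.

    columns: dict mapping feature name -> list of floats (equal lengths).
    threshold: ASSUMED fixed cut-off; LinCFA derives its own threshold.
    """
    groups = []
    for name, values in columns.items():
        for group in groups:
            # Join the first group whose members are all strongly correlated
            # with the current feature.
            if all(pearson(values, columns[m]) >= threshold for m in group):
                group.append(name)
                break
        else:
            groups.append([name])  # no suitable group: start a new one

    aggregated = {}
    for group in groups:
        n_rows = len(columns[group[0]])
        key = "mean(" + ",".join(group) + ")"
        aggregated[key] = [
            sum(columns[m][i] for m in group) / len(group) for i in range(n_rows)
        ]
    return groups, aggregated


# Toy data: "a" and "b" are perfectly correlated, "c" is not.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
}
groups, aggregated = aggregate_correlated(columns)
print(groups)
print(aggregated)
```

Each reduced feature is named after the original features it averages, which is the sense in which this kind of aggregation stays interpretable compared with a general linear projection.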
Pub Date : 2024-03-22 DOI: 10.1007/s10618-024-01017-y
Annabelle Redelmeier, Martin Jullum, Kjersti Aas, Anders Løland
We introduce MCCE: Monte Carlo sampling of valid and realistic Counterfactual Explanations for tabular data, a novel counterfactual explanation method that generates on-manifold, actionable and valid counterfactuals by modeling the joint distribution of the mutable features given the immutable features and the decision. Unlike other on-manifold methods that tend to rely on variational autoencoders and have strict prediction model and data requirements, MCCE handles any type of prediction model and categorical features with more than two levels. MCCE first models the joint distribution of the features and the decision with an autoregressive generative model where the conditionals are estimated using decision trees. Then, it samples a large set of observations from this model, and finally, it removes the samples that do not obey certain criteria. We compare MCCE with a range of state-of-the-art on-manifold counterfactual methods using four well-known data sets and show that MCCE outperforms these methods on all common performance metrics and speed. In particular, including the decision in the modeling process improves the efficiency of the method substantially.
{"title":"MCCE: Monte Carlo sampling of valid and realistic counterfactual explanations for tabular data","authors":"Annabelle Redelmeier, Martin Jullum, Kjersti Aas, Anders Løland","doi":"10.1007/s10618-024-01017-y","DOIUrl":"https://doi.org/10.1007/s10618-024-01017-y","url":null,"abstract":"<p>We introduce MCCE: Monte Carlo sampling of valid and realistic Counterfactual Explanations for tabular data, a novel counterfactual explanation method that generates on-manifold, actionable and valid counterfactuals by modeling the joint distribution of the mutable features given the immutable features and the decision. Unlike other on-manifold methods that tend to rely on variational autoencoders and have strict prediction model and data requirements, MCCE handles any type of prediction model and categorical features with more than two levels. MCCE first models the joint distribution of the features and the decision with an autoregressive generative model where the conditionals are estimated using decision trees. Then, it samples a large set of observations from this model, and finally, it removes the samples that do not obey certain criteria. We compare MCCE with a range of state-of-the-art on-manifold counterfactual methods using four well-known data sets and show that MCCE outperforms these methods on all common performance metrics and speed. 
In particular, including the decision in the modeling process improves the efficiency of the method substantially.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"365 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
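The three steps named in the MCCE abstract above (model the mutable features autoregressively given the immutable features and the decision, sample many candidates, discard those that fail the criteria) can be sketched on toy categorical data. This is only an illustration under stated assumptions: the paper estimates the conditionals with decision trees, while this sketch swaps in plain empirical lookup tables keyed on the exact preceding values; the only filter criterion used is validity under the prediction model; and all names (`fit_conditionals`, `sample_candidates`, `filter_valid`, the toy features and `predict`) are hypothetical.

```python
import random
from collections import defaultdict


def fit_conditionals(rows, order):
    """For the k-th feature in `order`, store the values observed for each
    combination of the k preceding features (a stand-in for decision trees)."""
    tables = []
    for k, feature in enumerate(order):
        table = defaultdict(list)
        for row in rows:
            context = tuple(row[f] for f in order[:k])
            table[context].append(row[feature])
        tables.append(table)
    return tables


def sample_candidates(tables, order, fixed, n_samples, rng):
    """Sample the remaining features one at a time, each conditioned on the
    values already set. `fixed` must hold exactly the leading features of
    `order` (here: the immutable features and the desired decision)."""
    samples = []
    for _ in range(n_samples):
        row = dict(fixed)
        for k in range(len(fixed), len(order)):
            context = tuple(row[f] for f in order[:k])
            seen = tables[k].get(context)
            if not seen:
                break  # context never observed: drop this draw
            row[order[k]] = rng.choice(seen)
        else:
            samples.append(row)
    return samples


def filter_valid(samples, predict, desired):
    """Keep only the samples for which the model returns the desired decision."""
    return [s for s in samples if predict(s) == desired]


def predict(row):
    """Hypothetical black-box model: approve if income or savings is high."""
    return 1 if row["income"] == "high" or row["savings"] == "high" else 0


# Immutable feature first, then the decision, then the mutable features.
order = ["age", "decision", "income", "savings"]
train = [
    {"age": "young", "decision": 1, "income": "high", "savings": "high"},
    {"age": "young", "decision": 1, "income": "high", "savings": "low"},
    {"age": "young", "decision": 0, "income": "low", "savings": "low"},
    {"age": "old", "decision": 1, "income": "low", "savings": "high"},
]

tables = fit_conditionals(train, order)
rng = random.Random(0)
candidates = sample_candidates(
    tables, order, {"age": "young", "decision": 1}, n_samples=20, rng=rng
)
valid = filter_valid(candidates, predict, desired=1)
print(len(candidates), len(valid))
```

Because the lookup tables can only reproduce feature combinations seen in the training rows, the validity filter removes nothing in this toy run. A tree-based conditional model generalizes to unseen combinations, which is when the filtering step does real work.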
Pub Date : 2024-03-18 DOI: 10.1007/s10618-024-01014-1
Pablo González, Alejandro Moreo, Fabrizio Sebastiani
Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data, and is of special interest when the labelled data on which the predictor has been trained and the unlabelled data are not IID, i.e., suffer from dataset shift. To date, quantification methods have mostly been tested only on a special case of dataset shift, i.e., prior probability shift; the relationship between quantification and other types of dataset shift remains, by and large, unexplored. In this work we carry out an experimental analysis of how current quantification algorithms behave under different types of dataset shift, in order to identify limitations of current approaches and hopefully pave the way for the development of more broadly applicable methods. We do this by proposing a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift, and by testing existing quantification methods on the datasets thus generated. One finding that results from this investigation is that many existing quantification methods that had been found robust to prior probability shift are not necessarily robust to other types of dataset shift. A second finding is that no existing quantification method seems to be robust enough to deal with all the types of dataset shift we simulate in our experiments. The code needed to reproduce all our experiments is publicly available at https://github.com/pglez82/quant_datasetshift.
{"title":"Binary quantification and dataset shift: an experimental investigation","authors":"Pablo González, Alejandro Moreo, Fabrizio Sebastiani","doi":"10.1007/s10618-024-01014-1","DOIUrl":"https://doi.org/10.1007/s10618-024-01014-1","url":null,"abstract":"<p>Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data, and is of special interest when the labelled data on which the predictor has been trained and the unlabelled data are not IID, i.e., suffer from <i>dataset shift</i>. To date, quantification methods have mostly been tested only on a special case of dataset shift, i.e., <i>prior probability shift</i>; the relationship between quantification and other types of dataset shift remains, by and large, unexplored. In this work we carry out an experimental analysis of how current quantification algorithms behave under different types of dataset shift, in order to identify limitations of current approaches and hopefully pave the way for the development of more broadly applicable methods. We do this by proposing a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift, and by testing existing quantification methods on the datasets thus generated. One finding that results from this investigation is that many existing quantification methods that had been found robust to prior probability shift are not necessarily robust to other types of dataset shift. A second finding is that no existing quantification method seems to be robust enough to deal with all the types of dataset shift we simulate in our experiments. 
The code needed to reproduce all our experiments is publicly available at https://github.com/pglez82/quant_datasetshift.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"159 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140167619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
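To make the quantification task in the abstract above concrete, the sketch below estimates class prevalence in an unlabelled set with two standard baselines from the quantification literature, classify-and-count (CC) and adjusted classify-and-count (ACC). The abstract does not name these methods, so treating them as representative is an assumption, and the function names and toy numbers are hypothetical. ACC corrects the raw count using the classifier's true and false positive rates, a correction that is justified under prior probability shift, the special case the abstract says most methods have been tested on.

```python
def classify_and_count(predictions):
    """CC: the fraction of items the classifier labels positive (1)."""
    return sum(predictions) / len(predictions)


def estimate_rates(y_true, y_pred):
    """True and false positive rates measured on labelled held-out data."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


def adjusted_classify_and_count(predictions, tpr, fpr):
    """ACC: solve cc = p * tpr + (1 - p) * fpr for the prevalence p."""
    cc = classify_and_count(predictions)
    if tpr == fpr:
        return cc  # correction undefined: fall back to the raw count
    estimate = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))  # clip to a valid prevalence


# Labelled validation data: 10 positives (8 detected), 10 negatives (1 false alarm).
val_true = [1] * 10 + [0] * 10
val_pred = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 9
tpr, fpr = estimate_rates(val_true, val_pred)

# Unlabelled test set whose true prevalence is 0.5: with these rates the
# classifier flags 40 of 50 positives and 5 of 50 negatives, 45 items in total.
test_pred = [1] * 45 + [0] * 55
cc = classify_and_count(test_pred)
acc = adjusted_classify_and_count(test_pred, tpr, fpr)
print(cc, acc)
```

The correction relies on the true and false positive rates staying the same between validation and test data. That holds when only the class prior changes and can fail under other kinds of shift, which is in line with the abstract's finding that robustness to prior probability shift does not carry over to other types of dataset shift.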