Data Mining and Knowledge Discovery: Latest Articles


Optimal selection of benchmarking datasets for unbiased machine learning algorithm evaluation

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-10-20 DOI: 10.1007/s10618-023-00957-1

João Luiz Junho Pereira, Kate Smith-Miles, Mario Andrés Muñoz, Ana Carolina Lorena

Citations: 0

Mondrian forest for data stream classification under memory constraints

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-10-17 DOI: 10.1007/s10618-023-00970-4

Martin Khannouz, Tristan Glatard

Citations: 0

Fast, accurate and explainable time series classification through randomization

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-10-16 DOI: 10.1007/s10618-023-00978-w

Nestor Cabello, Elham Naghizade, Jianzhong Qi, Lars Kulik

Abstract: Time series classification (TSC) aims to predict the class label of a given time series, which is critical to a rich set of application areas such as economics and medicine. State-of-the-art TSC methods have mostly focused on classification accuracy, without considering classification speed. However, efficiency is important for big data analysis. Datasets with a large training size or long series challenge the use of the current highly accurate methods, because they are usually computationally expensive. Similarly, classification explainability, which is an important property required by modern big data applications such as appliance modeling and legislation such as the European General Data Protection Regulation, has received little attention. To address these gaps, we propose a novel TSC method, the Randomized-Supervised Time Series Forest (r-STSF). r-STSF is extremely fast and achieves state-of-the-art classification accuracy. It is an efficient interval-based approach that classifies time series according to aggregate values of the discriminatory sub-series (intervals). To achieve state-of-the-art accuracy, r-STSF builds an ensemble of randomized trees using the discriminatory sub-series. It uses four time series representations, nine aggregation functions and a supervised binary-inspired search combined with a feature ranking metric to identify highly discriminatory sub-series. The discriminatory sub-series enable explainable classifications. Experiments on extensive datasets show that r-STSF achieves state-of-the-art accuracy while being orders of magnitude faster than most existing TSC methods and enabling explanations of the classifier's decisions.
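
The interval-based idea in this abstract (describing a sub-series by a few aggregate values) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the nine aggregation functions, so the five used here, and the names `interval_features`, `slope`, and `AGGREGATES`, are assumptions chosen for illustration only.

```python
from statistics import mean, pstdev


def slope(values):
    """Least-squares slope of the values against their indices 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Illustrative aggregates only (assumption): the abstract does not name
# the nine aggregation functions that r-STSF actually uses.
AGGREGATES = {"mean": mean, "std": pstdev, "min": min, "max": max, "slope": slope}


def interval_features(series, start, end):
    """Summarise the sub-series series[start:end] by its aggregate values."""
    sub = series[start:end]
    if not sub:
        raise ValueError("empty interval")
    return {name: fn(sub) for name, fn in AGGREGATES.items()}


# A toy series with a rising interval followed by a flat one.
series = [0.0, 1.0, 2.0, 3.0, 10.0, 10.0, 10.0, 10.0]
rising = interval_features(series, 0, 4)
flat = interval_features(series, 4, 8)
```

In an interval-based classifier, feature vectors like `rising` and `flat` would be computed for many candidate intervals and passed to a tree ensemble; how r-STSF selects the discriminatory intervals is not shown here.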

Citations: 1

Design and evaluation of highly accurate smart contract code vulnerability detection framework

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-10-13 DOI: 10.1007/s10618-023-00981-1

Sowon Jeon, Gilhee Lee, Hyoungshick Kim, Simon S. Woo

Citations: 0

MODE-Bi-GRU: orthogonal independent Bi-GRU model with multiscale feature extraction

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-10-09 DOI: 10.1007/s10618-023-00964-2

Wei Wang, Wenhan Ruan, Xiangfu Meng

Citations: 0

The art of centering without centering for robust principal component analysis

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-10-09 DOI: 10.1007/s10618-023-00976-y

Guihong Wan, Baokun He, Haim Schweitzer

Citations: 0

Community detection in interval-weighted networks

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-10-09 DOI: 10.1007/s10618-023-00975-z

Hélder Alves, Paula Brito, Pedro Campos

Citations: 2

Representing ensembles of networks for fuzzy cluster analysis: a case study

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-10-09 DOI: 10.1007/s10618-023-00977-x

Ilaria Bombelli, Ichcha Manipur, Mario Rosario Guarracino, Maria Brigida Ferraro

Citations: 0

Enhancing cluster analysis via topological manifold learning

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-09-29 DOI: 10.1007/s10618-023-00980-2

Moritz Herrmann, Daniyal Kazempour, Fabian Scheipl, Peer Kröger

Abstract: We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: we show that clustering embedding vectors representing the inherent structure of a dataset instead of the observed feature vectors themselves is highly beneficial. To demonstrate, we combine the manifold learning method UMAP for inferring the topological structure with the density-based clustering method DBSCAN. Synthetic and real data results show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how separable the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. The approach is successful because it performs the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.
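
To make the clustering half of this pipeline concrete, here is a minimal pure-Python sketch of DBSCAN applied to points that stand in for embedding vectors. It is an illustration under assumptions, not the authors' code: the UMAP embedding step is omitted entirely, the 2-D points in `embedded` are made up, and the function names and the `eps` / `min_pts` values are chosen only for this example.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Made-up stand-ins for low-dimensional embedding vectors (assumption):
# two tight groups and one isolated point.
embedded = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (2.5, 9.0),
]
labels = dbscan(embedded, eps=0.3, min_pts=3)
```

When groups are this well separated, a wide range of `eps` values yields the same labels, which is one way to read the abstract's point about reduced parameter sensitivity after embedding.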

Citations: 0

Traffic forecasting on new roads using spatial contrastive pre-training (SCPT)

Zone 3 (Computer Science), Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery

Pub Date : 2023-09-27 DOI: 10.1007/s10618-023-00982-0

Arian Prabowo, Hao Xue, Wei Shao, Piotr Koniusz, Flora D. Salim

Abstract: New roads are being constructed all the time. However, the capabilities of previous deep forecasting models to generalize to new roads not seen in the training data (unseen roads) are rarely explored. In this paper, we introduce a novel setup called a spatio-temporal split to evaluate the models’ capabilities to generalize to unseen roads. In this setup, the models are trained on data from a sample of roads, but tested on roads not seen in the training data. Moreover, we also present a novel framework called Spatial Contrastive Pre-Training (SCPT) where we introduce a spatial encoder module to extract latent features from unseen roads during inference time. This spatial encoder is pre-trained using contrastive learning. During inference, the spatial encoder only requires two days of traffic data on the new roads and does not require any re-training. We also show that the output from the spatial encoder can be used effectively to infer latent node embeddings on unseen roads during inference time. The SCPT framework also incorporates a new layer, named the spatially gated addition layer, to effectively combine the latent features from the output of the spatial encoder with existing backbones. Additionally, since there is limited data on the unseen roads, we argue that it is better to decouple traffic signals into trivial-to-capture periodic signals and difficult-to-capture Markovian signals, and for the spatial encoder to only learn the Markovian signals. Finally, we empirically evaluated SCPT using the ST split setup on four real-world datasets. The results showed that adding SCPT to a backbone consistently improves forecasting performance on unseen roads. More importantly, the improvements are greater when forecasting further into the future. The code is available on GitHub: https://github.com/cruiseresearchgroup/forecasting-on-new-roads
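
The evaluation setup described here (train on a sample of roads, test on roads absent from training) can be sketched as a split over road identifiers. This is a simplified illustration, not the code from the linked repository: it covers only the road-level partition, and the function names, the `unseen_fraction` parameter, and the toy `readings` data are all assumptions.

```python
import random


def split_roads(road_ids, unseen_fraction, seed=0):
    """Randomly partition road IDs into (seen, unseen) sets."""
    if not 0.0 < unseen_fraction < 1.0:
        raise ValueError("unseen_fraction must be strictly between 0 and 1")
    ids = sorted(road_ids)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_unseen = max(1, round(len(ids) * unseen_fraction))
    return set(ids[n_unseen:]), set(ids[:n_unseen])


def select(readings, roads):
    """Keep only the readings whose road ID is in `roads`."""
    return {road: values for road, values in readings.items() if road in roads}


# Hypothetical data: road ID -> a short list of traffic speed readings.
readings = {f"road_{i}": [50.0 + i, 48.0 + i, 52.0 + i] for i in range(10)}

seen, unseen = split_roads(readings.keys(), unseen_fraction=0.3)
train = select(readings, seen)    # used to fit the forecasting model
test = select(readings, unseen)   # roads the model never saw in training
```

Because the partition is made over road IDs rather than over individual readings, no measurement from a test road can leak into training, which is the property the setup relies on.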

Citations: 1
