To predict rainfall, our proposed model architecture combines a Convolutional Neural Network (CNN) based on the pre-trained ResNet-152 model with a Recurrent Neural Network (RNN) built on Long Short-Term Memory (LSTM) layers. The CNN encodes the cloud images to extract image feature vectors, which are then combined with meteorological data and used as the input for training the RNN. After training, the accuracy of the prediction model reaches up to 82%. The results show that the proposed rainfall prediction method not only outperforms general prediction methods in terms of cost and prediction time, but is also accurate and feasible.
{"title":"Machine Learning-based Short-term Rainfall Prediction from Sky Data","authors":"Fu Jie Tey, Tin-Yu Wu, Jiann-Liang Chen","doi":"10.1145/3502731","DOIUrl":"https://doi.org/10.1145/3502731","url":null,"abstract":"To predict rainfall, our proposed model architecture combines the Convolutional Neural Network (CNN), which uses the ResNet-152 pre-training model, with the Recurrent Neural Network (RNN), which uses the Long Short-term Memory Network (LSTM) layer, for model training. By encoding the cloud images through CNN, we extract the image feature vectors in the training process and train the vectors and meteorological data as the input of RNN. After training, the accuracy of the prediction model can reach up to 82%. The result has proven not only the outperformance of our proposed rainfall prediction method in terms of cost and prediction time, but also its accuracy and feasibility compared with general prediction methods.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"239 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115601148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, learning and mining from data streams with incremental feature spaces have attracted extensive attention, where data may dynamically expand over time in both volume and feature dimensions. Existing approaches usually assume that incoming instances always receive true labels. However, in many real-world applications, e.g., environment monitoring, acquiring the true labels is costly because of the human effort needed to annotate the data. To tackle this problem, we propose a novel incremental Feature spaces Learning with Label Scarcity (FLLS) algorithm, together with two variants. When data streams arrive with augmented features, we first leverage margin-based online active learning to select valuable instances to be labeled and thus build superior predictive models with minimal supervision. After receiving the labels, we combine the online passive-aggressive update rule and the margin-maximization principle to jointly update the dynamic classifier in the shared and augmented feature space. Finally, we use a projected truncation technique to build a sparse but efficient model. We theoretically analyze the error bounds of FLLS and its two variants, and we conduct experiments on synthetic data and real-world applications to further validate the effectiveness of the proposed algorithms.
{"title":"Incremental Feature Spaces Learning with Label Scarcity","authors":"Shilin Gu, Yuhua Qian, Chenping Hou","doi":"10.1145/3516368","DOIUrl":"https://doi.org/10.1145/3516368","url":null,"abstract":"Recently, learning and mining from data streams with incremental feature spaces have attracted extensive attention, where data may dynamically expand over time in both volume and feature dimensions. Existing approaches usually assume that the incoming instances can always receive true labels. However, in many real-world applications, e.g., environment monitoring, acquiring the true labels is costly due to the need of human effort in annotating the data. To tackle this problem, we propose a novel incremental Feature spaces Learning with Label Scarcity (FLLS) algorithm, together with its two variants. When data streams arrive with augmented features, we first leverage the margin-based online active learning to select valuable instances to be labeled and thus build superior predictive models with minimal supervision. After receiving the labels, we combine the online passive-aggressive update rule and margin-maximum principle to jointly update the dynamic classifier in the shared and augmented feature space. Finally, we use the projected truncation technique to build a sparse but efficient model. We theoretically analyze the error bounds of FLLS and its two variants. Also, we conduct experiments on synthetic data and real-world applications to further validate the effectiveness of our proposed algorithms.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124814616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One key objective of artificial intelligence involves the continuous adaptation of machine learning models to new tasks. This branch of continual learning is also referred to as lifelong learning (LL), where a major challenge is to minimize catastrophic forgetting, i.e., forgetting previously learned tasks. While previous work on catastrophic forgetting has focused on vision problems, this work targets time-series data. In addition to choosing an architecture appropriate for time-series sequences, our work addresses limitations in previous work, including the handling of distribution shifts in class labels. We present multi-objective learning with three loss functions that simultaneously minimize catastrophic forgetting, prediction error, and errors in generalizing across label shifts. We build a multi-task autoencoder network with a hierarchical convolutional recurrent architecture. The proposed method is capable of learning multiple time-series tasks simultaneously. For cases where the model needs to learn multiple new tasks, we propose sequential learning, starting with the tasks that have the best individual performance. The solution was evaluated on four benchmark human activity recognition datasets collected from mobile sensing devices. A wide set of baseline comparisons is performed, and an ablation analysis evaluates the impact of the different losses in the proposed multi-objective method. The results demonstrate up to a 4% improvement in catastrophic forgetting compared to the loss functions used in state-of-the-art solutions, while showing minimal losses compared to the upper-bound methods of traditional fine-tuning (FT) and multi-task learning (MTL).
{"title":"Multi-objective Learning to Overcome Catastrophic Forgetting in Time-series Applications","authors":"Reem A. Mahmoud, Hazem M. Hajj","doi":"10.1145/3502728","DOIUrl":"https://doi.org/10.1145/3502728","url":null,"abstract":"One key objective of artificial intelligence involves the continuous adaptation of machine learning models to new tasks. This branch of continual learning is also referred to as lifelong learning (LL), where a major challenge is to minimize catastrophic forgetting, or forgetting previously learned tasks. While previous work on catastrophic forgetting has been focused on vision problems; this work targets time-series data. In addition to choosing an architecture appropriate for time-series sequences, our work addresses limitations in previous work, including the handling of distribution shifts in class labels. We present multi-objective learning with three loss functions to minimize catastrophic forgetting, prediction error, and errors in generalizing across label shifts, simultaneously. We build a multi-task autoencoder network with a hierarchical convolutional recurrent architecture. The proposed method is capable of learning multiple time-series tasks simultaneously. For cases where the model needs to learn multiple new tasks, we propose sequential learning, starting with tasks that have the best individual performances. This solution was evaluated on four benchmark human activity recognition datasets collected from mobile sensing devices. A wide set of baseline comparisons is performed, and an ablation analysis is run to evaluate the impact of the different losses in the proposed multi-objective method. The results demonstrate an up to 4% performance improvement in catastrophic forgetting compared to the use of loss functions in state-of-the-art solutions while demonstrating minimal losses compared to upper bound methods of traditional fine-tuning (FT) and multi-task learning (MTL).","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124918689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Monitoring systems have hundreds or thousands of distributed sensors gathering and transmitting real-time streaming data. The early detection of events in these systems, such as an earthquake in a seismic monitoring system, is the basis for essential tasks such as warning generation. To detect such events, it is usual to compute pairwise correlations across the disparate signals generated by the sensors. Since the data sources (e.g., sensors) are spatially separated, it is essential to consider the lagged correlation between the signals. Moreover, many applications require processing a specific band of frequencies depending on the type of event, which demands a filtering pre-processing step before computing correlations. Because of the high speed of data generation and the large number of sensors in these systems, the filtering and lagged cross-correlation operations need to be efficient enough to provide real-time responses without data loss. This article proposes a technique named FilCorr that efficiently computes both operations in one single step. We achieve an order-of-magnitude speedup by maintaining frequency transforms over sliding windows. Our method is exact, devoid of sensitive parameters, and easily parallelizable. Besides our algorithm, we also provide a publicly available real-time system named Seisviz that employs FilCorr as its core mechanism for monitoring a seismometer network. We demonstrate that our technique is suitable for several monitoring applications, such as seismic signal monitoring, motion monitoring, and neural activity monitoring.
{"title":"Combining Filtering and Cross-Correlation Efficiently for Streaming Time Series","authors":"Sheng Zhong, Vinicius M. A. Souza, A. Mueen","doi":"10.1145/3502738","DOIUrl":"https://doi.org/10.1145/3502738","url":null,"abstract":"Monitoring systems have hundreds or thousands of distributed sensors gathering and transmitting real-time streaming data. The early detection of events in these systems, such as an earthquake in a seismic monitoring system, is the base for essential tasks as warning generations. To detect such events is usual to compute pairwise correlation across the disparate signals generated by the sensors. Since the data sources (e.g., sensors) are spatially separated, it is essential to consider the lagged correlation between the signals. Besides, many applications require to process a specific band of frequencies depending on the event’s type, demanding a pre-processing step of filtering before computing correlations. Due to the high speed of data generation and a large number of sensors in these systems, the operations of filtering and lagged cross-correlation need to be efficient to provide real-time responses without data losses. This article proposes a technique named FilCorr that efficiently computes both operations in one single step. We achieve an order of magnitude speedup by maintaining frequency transforms over sliding windows. Our method is exact, devoid of sensitive parameters, and easily parallelizable. Besides our algorithm, we also provide a publicly available real-time system named Seisviz that employs FilCorr in its core mechanism for monitoring a seismometer network. We demonstrate that our technique is suitable for several monitoring applications as seismic signal monitoring, motion monitoring, and neural activity monitoring.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130132557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A time-varying dynamic Bayesian network (TVDBN) is essential for describing time-evolving directed conditional dependence structures in complex multivariate systems. In this article, we construct a TVDBN model together with a score-based method for its structure learning. The model adopts a vector autoregressive (VAR) formulation to describe inter-slice and intra-slice relations between variables. By allowing the VAR parameters to change segment-wise over time, the time-varying dynamics of the network structure can be described. Furthermore, since external information can provide additional similarity information about the variables, a graph Laplacian regularizer is imposed to encourage similar nodes to have similar network structures. The regularized maximum a posteriori estimate in the Bayesian inference framework is used as a score function for TVDBN structure evaluation, and the alternating direction method of multipliers (ADMM) with the L-BFGS-B algorithm is used for optimal structure learning. Thorough simulation studies and a real case study verify the efficacy and efficiency of the proposed method.
{"title":"Segment-Wise Time-Varying Dynamic Bayesian Network with Graph Regularization","authors":"Xingxuan Yang, Chen Zhang, Baihua Zheng","doi":"10.1145/3522589","DOIUrl":"https://doi.org/10.1145/3522589","url":null,"abstract":"Time-varying dynamic Bayesian network (TVDBN) is essential for describing time-evolving directed conditional dependence structures in complex multivariate systems. In this article, we construct a TVDBN model, together with a score-based method for its structure learning. The model adopts a vector autoregressive (VAR) model to describe inter-slice and intra-slice relations between variables. By allowing VAR parameters to change segment-wisely over time, the time-varying dynamics of the network structure can be described. Furthermore, considering some external information can provide additional similarity information of variables. Graph Laplacian is further imposed to regularize similar nodes to have similar network structures. The regularized maximum a posterior estimation in the Bayesian inference framework is used as a score function for TVDBN structure evaluation, and the alternating direction method of multipliers (ADMM) with L-BFGS-B algorithm is used for optimal structure learning. Thorough simulation studies and a real case study are carried out to verify our proposed method’s efficacy and efficiency.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113958294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As a novel deep learning model, gcForest has been widely used in various applications. However, the multi-grained scanning in current gcForest produces many redundant feature vectors, which increases the time cost of the model. To screen out redundant feature vectors, we introduce a hashing screening mechanism for multi-grained scanning and propose a model called HW-Forest, which adopts two strategies: hashing screening and window screening. In the hashing screening strategy, HW-Forest employs a perceptual hashing algorithm to calculate the similarity between feature vectors, which is used to remove the redundant feature vectors produced by multi-grained scanning and can significantly decrease the time cost and memory consumption. Furthermore, we adopt a self-adaptive instance screening strategy called window screening to improve the performance of our approach, which achieves higher accuracy without hyperparameter tuning on different datasets. Our experimental results show that HW-Forest achieves higher accuracy than other models while also reducing the time cost.
{"title":"HW-Forest: Deep Forest with Hashing Screening and Window Screening","authors":"Pengfei Ma, Youxi Wu, Y. Li, Lei Guo, He Jiang, Xingquan Zhu, X. Wu","doi":"10.1145/3532193","DOIUrl":"https://doi.org/10.1145/3532193","url":null,"abstract":"As a novel deep learning model, gcForest has been widely used in various applications. However, current multi-grained scanning of gcForest produces many redundant feature vectors, and this increases the time cost of the model. To screen out redundant feature vectors, we introduce a hashing screening mechanism for multi-grained scanning and propose a model called HW-Forest which adopts two strategies: hashing screening and window screening. HW-Forest employs perceptual hashing algorithm to calculate the similarity between feature vectors in hashing screening strategy, which is used to remove the redundant feature vectors produced by multi-grained scanning and can significantly decrease the time cost and memory consumption. Furthermore, we adopt a self-adaptive instance screening strategy called window screening to improve the performance of our approach, which can achieve higher accuracy without hyperparameter tuning on different datasets. Our experimental results show that HW-Forest has higher accuracy than other models, and the time cost is also reduced.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121643267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, there has been ever-increasing interest in profiling various aspects of city life, especially in the context of smart cities. This interest has become even more relevant recently, as we have realized how dramatic events, such as the Covid-19 pandemic, can deeply affect city life and produce drastic changes. Identifying and analyzing such changes, both at the city level and within single neighborhoods, may be a fundamental tool to better manage the current situation and provide sound strategies for future planning. Furthermore, such a fine-grained and up-to-date characterization can represent a valuable asset for other tools and services, e.g., web mapping applications or real estate agency platforms. In this article, we propose a framework featuring a novel methodology to model and track changes in areas of the city by extracting information from online newspaper articles. The problem of uncovering clusters of news at specific times is tackled through the joint use of state-of-the-art language models to represent the articles and a density-based streaming clustering algorithm, properly shaped to deal with high-dimensional text embeddings. Furthermore, we propose a method to automatically label the obtained clusters in a semantically meaningful way, and we introduce a set of metrics aimed at tracking the temporal evolution of clusters. A case study focusing on the city of Rome during the Covid-19 pandemic is illustrated and discussed to evaluate the effectiveness of the proposed approach.
{"title":"A News-Based Framework for Uncovering and Tracking City Area Profiles: Assessment in Covid-19 Setting","authors":"A. Bechini, Alessandro Bondielli, José Luis Corcuera Bárcena, P. Ducange, F. Marcelloni, Alessandro Renda","doi":"10.1145/3532186","DOIUrl":"https://doi.org/10.1145/3532186","url":null,"abstract":"In the last years, there has been an ever-increasing interest in profiling various aspects of city life, especially in the context of smart cities. This interest has become even more relevant recently when we have realized how dramatic events, such as the Covid-19 pandemic, can deeply affect the city life, producing drastic changes. Identifying and analyzing such changes, both at the city level and within single neighborhoods, may be a fundamental tool to better manage the current situation and provide sound strategies for future planning. Furthermore, such fine-grained and up-to-date characterization can represent a valuable asset for other tools and services, e.g., web mapping applications or real estate agency platforms. In this article, we propose a framework featuring a novel methodology to model and track changes in areas of the city by extracting information from online newspaper articles. The problem of uncovering clusters of news at specific times is tackled by means of the joint use of state-of-the-art language models to represent the articles, and of a density-based streaming clustering algorithm, properly shaped to deal with high-dimensional text embeddings. Furthermore, we propose a method to automatically label the obtained clusters in a semantically meaningful way, and we introduce a set of metrics aimed at tracking the temporal evolution of clusters. A case study focusing on the city of Rome during the Covid-19 pandemic is illustrated and discussed to evaluate the effectiveness of the proposed approach.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132346378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
“I’m an MC still as honest” – Eminem, Rap God

We present MCRapper, an algorithm for the efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds on the maximum deviation of sample means from their expectations, so it can be used to find both (1) statistically significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility is a big advantage of MCRapper over previously proposed solutions, which could achieve only one of the two. MCRapper uses upper bounds on the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop TFP-R, an algorithm for the task of True Frequent Pattern (TFP) mining, by appropriately computing approximations of the negative and positive borders of the collection of patterns of interest, which allow an effective pruning of the pattern space and the computation of strong bounds on the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state of the art for their respective tasks.
{"title":"MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining","authors":"Leonardo Pellegrina, Cyrus Cousins, Fabio Vandin, Matteo Riondato","doi":"10.1145/3532187","DOIUrl":"https://doi.org/10.1145/3532187","url":null,"abstract":"“I’m an MC still as honest” – Eminem, Rap God We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both (1) statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility offered by MCRapper is a big advantage over previously proposed solutions, which could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining, by appropriately computing approximations of the negative and positive borders of the collection of patterns of interest, which allow an effective pruning of the pattern space and the computation of strong bounds to the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116047692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
People’s location data are continuously tracked from various devices and sensors, enabling ongoing analysis of sensitive information that can violate people’s privacy and reveal confidential information. Synthetic data have been used to generate representative location sequences while maintaining users’ privacy. Nonetheless, the privacy-accuracy tradeoff between these two measures has not been addressed systematically. In this article, we analyze the use of different synthetic data generation models for long location sequences, including long short-term memory networks (LSTMs), Markov Chains (MC), and variable-order Markov models (VMMs). We employ different performance measures, such as data similarity and privacy, and discuss the inherent tradeoff. Furthermore, we introduce additional measurements to quantify each of these measures. Based on the anonymous data of 300 thousand cellular-phone users, our work offers a road map for developing policies for synthetic data generation processes. We propose a framework for building data generation models and evaluating their effectiveness with respect to these accuracy and privacy measures.
{"title":"Synthesis of Longitudinal Human Location Sequences: Balancing Utility and Privacy","authors":"Maya Benarous, Eran Toch, I. Ben-Gal","doi":"10.1145/3529260","DOIUrl":"https://doi.org/10.1145/3529260","url":null,"abstract":"People’s location data are continuously tracked from various devices and sensors, enabling an ongoing analysis of sensitive information that can violate people’s privacy and reveal confidential information. Synthetic data have been used to generate representative location sequences yet to maintain the users’ privacy. Nonetheless, the privacy-accuracy tradeoff between these two measures has not been addressed systematically. In this article, we analyze the use of different synthetic data generation models for long location sequences, including extended short-term memory networks (LSTMs), Markov Chains (MC), and variable-order Markov models (VMMs). We employ different performance measures, such as data similarity and privacy, and discuss the inherent tradeoff. Furthermore, we introduce other measurements to quantify each of these measures. Based on the anonymous data of 300 thousand cellular-phone users, our work offers a road map for developing policies for synthetic data generation processes. We propose a framework for building data generation models and evaluating their effectiveness regarding those accuracy and privacy measures.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130036552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anomaly detection is an essential task for quality management in smart manufacturing. An accurate data-driven detection method usually needs enough data and labels. In practice, however, newly set-up manufacturing processes are common, and they have only quite limited data available for analysis. Borrowing the term from recommender systems, we call such a process a cold-start process. The sparsity of anomalies, the deviation of the profile, and noise aggravate the detection difficulty. Transfer learning can help detect anomalies for cold-start processes by transferring knowledge from more experienced processes to the new processes. However, existing transfer learning and multi-task learning frameworks are established on task- or domain-level relatedness. We observe instead that, within a domain, some components (background and anomaly) share more commonality, while others (profile deviation and noise) do not. To this end, we propose a more delicate component-level transfer learning scheme, i.e., decomposition-based hybrid transfer learning (DHTL): it first decomposes a domain (e.g., a data source containing profiles) into different components (smooth background, profile deviation, anomaly, and noise); then, each component's transferability is analyzed with expert knowledge; lastly, different transfer learning techniques can be tailored accordingly. We adopt a Bayesian probabilistic hierarchical model to formulate parameter transfer for the background, and an "L2,1 + L1"-norm to formulate low-dimensional feature-representation transfer for the anomaly. An efficient algorithm based on Block Coordinate Descent is proposed to learn the parameters. A case study based on glass coating pressure profiles demonstrates the improved accuracy and completeness of the detected anomalies, and a simulation demonstrates the fidelity of the decomposition results.
{"title":"Profile Decomposition Based Hybrid Transfer Learning for Cold-Start Data Anomaly Detection","authors":"Ziyue Li, Haodong Yan, F. Tsung, Ke Zhang","doi":"10.1145/3530990","DOIUrl":"https://doi.org/10.1145/3530990","url":null,"abstract":"Anomaly detection is an essential task for quality management in smart manufacturing. An accurate data-driven detection method usually needs enough data and labels. However, in practice, there commonly exist newly set-up processes in manufacturing, and they only have quite limited data available for analysis. Borrowing the name from the recommender system, we call this process a cold-start process. The sparsity of anomaly, the deviation of the profile, and noise aggravate the detection difficulty. Transfer learning could help to detect anomalies for cold-start processes by transferring the knowledge from more experienced processes to the new processes. However, the existing transfer learning and multi-task learning frameworks are established on task- or domain-level relatedness. We observe instead, within a domain, some components (background and anomaly) share more commonality, others (profile deviation and noise) not. To this end, we propose a more delicate component-level transfer learning scheme, i.e., decomposition-based hybrid transfer learning (DHTL): It first decomposes a domain (e.g., a data source containing profiles) into different components (smooth background, profile deviation, anomaly, and noise); then, each component’s transferability is analyzed by expert knowledge; Lastly, different transfer learning techniques could be tailored accordingly. We adopted the Bayesian probabilistic hierarchical model to formulate parameter transfer for the background, and “L2,1+L1”-norm to formulate low dimension feature-representation transfer for the anomaly. An efficient algorithm based on Block Coordinate Descend is proposed to learn the parameters. A case study based on glass coating pressure profiles demonstrates the improved accuracy and completeness of detected anomaly, and a simulation demonstrates the fidelity of the decomposition results.","PeriodicalId":435653,"journal":{"name":"ACM Transactions on Knowledge Discovery from Data (TKDD)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126670152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}