E-commerce platforms rely heavily on automatic personalized recommender systems, e.g., collaborative filtering models, to improve customer experience. Several hybrid models have been proposed recently to address the deficiencies of existing models, but their performance drops significantly when the dataset is sparse. Most recent works have not fully addressed this shortcoming; at best, some only alleviate the problem by considering either user-side or item-side content information. In this article, we propose a novel recommender model called Hybrid Variational Autoencoder (HVAE) to improve performance on sparse datasets. Unlike existing approaches, we encode both user and item information into a latent space for semantic relevance measurement. In parallel, we use collaborative filtering to find the implicit factors of users and items, and combine the two outputs to deliver a hybrid solution. In addition, we compare Gaussian and multinomial distributions for learning representations of the textual data. Our experimental results show that HVAE significantly outperforms state-of-the-art models with robust performance.
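As a rough, hedged illustration of the hybrid idea described above, the Python sketch below shows the Gaussian reparameterization step commonly used in VAE encoders and one hypothetical way of combining content latents with collaborative-filtering factors; the function names and the additive fusion are assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Gaussian reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

def hybrid_score(z_user, z_item, p_user, q_item):
    """Hypothetical fusion: semantic relevance from VAE latents (z_user, z_item)
    plus an implicit-factor term from collaborative filtering (p_user, q_item)."""
    return float(z_user @ z_item + p_user @ q_item)

# toy usage with random 8-dimensional latents and factors
z_u = reparameterize(rng.normal(size=8), rng.normal(size=8))
z_i = reparameterize(rng.normal(size=8), rng.normal(size=8))
score = hybrid_score(z_u, z_i, rng.normal(size=8), rng.normal(size=8))
```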
Hybrid Variational Autoencoder for Recommender Systems. Hangbin Zhang, R. Wong, Victor W. Chu. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-09-04. https://doi.org/10.1145/3470659
The pervasiveness of smartphones has shaped our lives, social norms, and the structures that dictate human behavior. Smartphones now directly influence how individuals demand resources and interact with network services. In this scenario, identifying key locations in cities is fundamental for investigating human mobility and understanding social problems. In this context, we propose the first graph-based methodology in the literature to quantify the power of Points-of-Interest (POIs) over their vicinity by means of user mobility trajectories. Unlike prior work, our analysis considers the flow of people instead of the number of neighboring POIs or their structural locations in the city. We model POI visits with a multiflow graph model in which each POI is a node and the transitions of users between POIs are weighted directed edges. Using this multiflow graph model, we compute the attract, support, and independence powers. The attract and support powers measure how many visits a POI gathers from and disseminates over its neighborhood, respectively, while the independence power captures the capacity of a POI to receive visitors independently of other POIs. We tested our methodology on well-known university campus mobility datasets and validated it on Location-Based Social Network (LBSN) datasets from various cities around the world. Our findings show that on university campuses: (i) buildings have low support and attract powers; (ii) people tend to move among a few buildings and spend most of their time in the same building; and (iii) there is only slight dependence among buildings; even those with high independence power receive visits from other buildings on campus. Globally, we reveal that (i) our metrics capture places that affect the number of visits in their neighborhood; (ii) cities on the same continent have similar independence patterns; and (iii) highly visited places and city central areas are the regions with the highest degree of independence.
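A minimal sketch of the multiflow graph computation, assuming a matrix of observed transition counts between POIs; the paper's exact definitions of the three powers are not reproduced here, so the formulas below are illustrative stand-ins rather than the authors' metrics.

```python
import numpy as np

def poi_powers(W):
    """W[i, j]: number of user transitions observed from POI i to POI j.
    Returns illustrative attract, support, and independence scores per POI."""
    W = np.asarray(W, dtype=float)
    self_visits = np.diag(W).copy()          # visits that start and end at the same POI
    off = W - np.diag(self_visits)           # transitions between distinct POIs
    inflow = off.sum(axis=0)                 # visits a POI gathers from its neighborhood
    outflow = off.sum(axis=1)                # visits a POI disseminates over its neighborhood
    total = max(off.sum(), 1.0)
    attract = inflow / total
    support = outflow / total
    independence = self_visits / np.maximum(self_visits + inflow, 1.0)
    return attract, support, independence

attract, support, independence = poi_powers([[5, 2, 0], [1, 8, 3], [0, 1, 2]])
```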
Assessing Large-Scale Power Relations among Locations from Mobility Data. L. S. Oliveira, P. V. D. Melo, A. C. Viana. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-09-04. https://doi.org/10.1145/3470770
Measuring the built and natural environment at a fine-grained scale is now possible with low-cost urban environmental sensor networks. However, fine-grained city-scale data analysis is complicated by tedious data cleaning, including removing outliers and imputing missing data. While many methods exist to automatically correct anomalies and impute missing entries, challenges remain for data with large spatial-temporal scales and shifting patterns. To address these challenges, we propose an online robust tensor recovery (OLRTR) method to preprocess streaming high-dimensional urban environmental datasets. A small dictionary that captures the underlying patterns of the data is computed and constantly updated with new data. OLRTR enables online recovery for large-scale sensor networks that provide continuous data streams, with lower computational memory usage than offline batch counterparts. In addition, we formulate the objective function so that OLRTR can detect structured outliers, such as faulty readings over a long period of time. We validate OLRTR on a synthetically degraded National Oceanic and Atmospheric Administration temperature dataset and apply it to the Array of Things city-scale sensor network in Chicago, IL, showing superior results compared with several established online and batch-based low-rank decomposition methods.
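To make the dictionary-based recovery concrete, here is a heavily simplified per-sample sketch: given a learned dictionary D, each incoming sensor snapshot is split into a low-rank part D @ r and a sparse outlier part s by alternating least squares with soft-thresholding. The actual OLRTR objective, tensor structure, and online dictionary update are more involved; the penalty and iteration scheme below are assumptions.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def recover_snapshot(D, y, lam=1.0, n_iters=20):
    """Split one sensor snapshot y into a pattern part D @ r and sparse outliers s,
    so that y is approximately D @ r + s (a simplified robust-recovery step,
    not the full OLRTR algorithm)."""
    s = np.zeros_like(y)
    for _ in range(n_iters):
        r, *_ = np.linalg.lstsq(D, y - s, rcond=None)   # fit low-rank coefficients
        s = soft_threshold(y - D @ r, lam)              # update sparse outlier estimate
    return D @ r, s   # cleaned signal and detected structured outliers
```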
Streaming Data Preprocessing via Online Tensor Recovery for Large Environmental Sensor Networks. Yue Hu, Ao Qu, Yanbing Wang, D. Work. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-09-01. https://doi.org/10.1145/3532189
Multivariate time-series data are frequently observed in critical care settings and are typically characterized by sparsity (missing information) and irregular time intervals. Existing approaches for learning representations in this domain handle these challenges by either aggregation or imputation of values, which in turn suppresses fine-grained information and adds undesirable noise/overhead to the machine learning model. To tackle this problem, we propose a Self-supervised Transformer for Time-Series (STraTS) model, which overcomes these pitfalls by treating a time-series as a set of observation triplets instead of using the standard dense matrix representation. It employs a novel Continuous Value Embedding technique to encode continuous times and variable values without the need for discretization. It is composed of a Transformer component with multi-head attention layers, which enables it to learn contextual triplet embeddings while avoiding the recurrence and vanishing-gradient problems of recurrent architectures. In addition, to tackle the limited availability of labeled data typical of many healthcare applications, STraTS leverages unlabeled data through self-supervision, using time-series forecasting as an auxiliary proxy task to learn better representations. Experiments on real-world multivariate clinical time-series benchmark datasets demonstrate that STraTS achieves better mortality-prediction performance than state-of-the-art methods, especially when labeled data is limited. Finally, we also present an interpretable version of STraTS, which can identify important measurements in the time-series data. Our data preprocessing and model implementation codes are available at https://github.com/sindhura97/STraTS.
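The triplet representation and a stand-in for Continuous Value Embedding can be sketched as follows; the variable names, hidden width, activation, and random weights are illustrative assumptions, not the paper's exact CVE.

```python
import numpy as np

# A sparse, irregularly sampled series kept as (time, variable, value) triplets
# instead of a padded or imputed dense matrix (variable names are made up here).
triplets = [(0.5, "heart_rate", 82.0), (1.3, "lactate", 2.1), (4.0, "heart_rate", 90.0)]

def scalar_embedding(x, W1, b1, W2, b2):
    """Embed one continuous scalar (a timestamp or a measured value) with a small
    feed-forward map standing in for Continuous Value Embedding."""
    h = np.tanh(x * W1 + b1)      # W1, b1: shape (d_hidden,)
    return h @ W2 + b2            # W2: (d_hidden, d_model), b2: (d_model,)

d_hidden, d_model = 8, 16
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=d_hidden), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)
time_embedding = scalar_embedding(triplets[0][0], W1, b1, W2, b2)   # one d_model vector
```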
Self-Supervised Transformer for Sparse and Irregularly Sampled Multivariate Clinical Time-Series. Sindhu Tipirneni, C. Reddy. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-07-29. https://doi.org/10.1145/3516367
In smartphone data analysis, both energy consumption modeling and user behavior mining have been explored extensively, but the relationship between energy consumption and user behavior has rarely been studied. This article explores such a relationship over a large user population. Based on energy consumption data, where each user's feature vector is represented by the energy breakdown over hardware components of different apps, User Behavior Models (UBM) are established to capture user behavior patterns (i.e., app preference, usage time). The challenge lies in the high diversity of user behaviors (i.e., the massive number of apps and ways of using them), which leads to high dimensionality and dispersion of the data. To overcome this challenge, three mechanisms are designed. First, to reduce the dimensionality, apps are ranked and the top ones are identified as typical apps to represent the rest. Second, dispersion is reduced by scaling each user's feature vector over the typical apps to unit ℓ1 norm. The scaled vector becomes the Usage Pattern, while the ℓ1 norm of the vector before scaling is treated as the Usage Intensity. Third, the usage patterns are analyzed with a two-layer clustering approach to further reduce data dispersion. In the upper layer, each typical app is studied across its users with respect to hardware components to identify Typical Hardware Usage Patterns (THUP). In the lower layer, users are studied with respect to these THUPs to identify Typical App Usage Patterns (TAUP). The analytical results of these two layers are consolidated into Usage Pattern Models (UPM), and UBMs are finally established as a union of UPMs and Usage Intensity Distributions (UID). By carrying out experiments on energy consumption data from 18,308 distinct users over 10 days, 33 UBMs are extracted from the training data. On the test data, these UBMs cover 94% of user behaviors and achieve up to a 20% improvement in the accuracy of energy representation compared with the baseline method, PCA. Finally, potential applications and implications of these UBMs are illustrated for smartphone manufacturers, app developers, network providers, and others.
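The pattern/intensity split described above can be sketched directly; the clustering stage is shown here as a single k-means over usage patterns, whereas the paper's two-layer procedure (THUPs, then TAUPs) is more detailed, so treat the code as an assumption-laden simplification.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_pattern_intensity(X):
    """X: (n_users, n_typical_apps) energy breakdown per user.
    Returns the unit-l1-norm Usage Patterns and the Usage Intensities."""
    intensity = np.abs(X).sum(axis=1)                      # l1 norm of each user's vector
    pattern = X / np.maximum(intensity[:, None], 1e-12)    # scale each vector to unit l1 norm
    return pattern, intensity

# The two-layer clustering is collapsed into one k-means purely for illustration.
X = np.random.rand(100, 12)
patterns, intensities = split_pattern_intensity(X)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(patterns)
```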
Establishing Smartphone User Behavior Model Based on Energy Consumption Data. M. Ding, Tianyu Wang, Xudong Wang. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-07-21. https://doi.org/10.1145/3461459
Rediet Abebe, T-H. Hubert Chan, J. Kleinberg, Zhibin Liang, D. Parkes, Mauro Sozio, Charalampos E. Tsourakakis
A long line of work in social psychology has studied variations in people's susceptibility to persuasion, that is, the extent to which they are willing to modify their opinions on a topic. This body of literature suggests an interesting perspective on theoretical models of opinion formation by interacting parties in a network: in addition to considering interventions that directly modify people's intrinsic opinions, it is also natural to consider interventions that modify people's susceptibility to persuasion. Motivated by this, we propose an influence optimization problem. Specifically, we adopt a popular model of social opinion dynamics, where each agent has a fixed innate opinion and a resistance that measures the importance it places on its innate opinion; agents influence one another's opinions through an iterative process. Under certain conditions, this iterative process converges to an equilibrium opinion vector. In the unbudgeted variant of the problem, the goal is to modify the resistance of any number of agents (within some given range) so that the sum of the equilibrium opinions is minimized; in the budgeted variant, the algorithm is additionally given an upfront restriction on the number of agents whose resistance may be modified. We prove that the objective function is in general non-convex; hence, formulating the problem as a convex program, as in an early version of this work (Abebe et al., KDD'18), might have correctness issues. We instead analyze the structure of the objective function and show that any local optimum is also a global optimum, which is somewhat surprising given that the objective function is not convex in general. Furthermore, we combine the iterative process and the local search paradigm to design very efficient algorithms that can solve the unbudgeted variant of the problem optimally on large-scale graphs containing millions of nodes. Finally, we propose and experimentally evaluate a family of heuristics for the budgeted variant of the problem.
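As a hedged illustration of the underlying dynamics, the sketch below iterates a Friedkin-Johnsen-style update to the equilibrium opinion vector whose sum the unbudgeted variant minimizes over the resistances; the paper's exact model and update may differ in details.

```python
import numpy as np

def equilibrium_opinions(W, s, alpha, n_iters=10000, tol=1e-9):
    """Iterate z <- alpha * s + (1 - alpha) * (W @ z) until convergence.
    W: row-stochastic influence weights over neighbors, s: innate opinions,
    alpha: resistances in (0, 1] (a Friedkin-Johnsen-style update, assumed here)."""
    s = np.asarray(s, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    z = s.copy()
    for _ in range(n_iters):
        z_new = alpha * s + (1.0 - alpha) * (W @ z)
        if np.max(np.abs(z_new - z)) < tol:
            return z_new
        z = z_new
    return z

# The unbudgeted variant then minimizes equilibrium_opinions(W, s, alpha).sum()
# over the resistances alpha, each constrained to a given range.
```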
Opinion Dynamics Optimization by Varying Susceptibility to Persuasion via Non-Convex Local Search. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-07-21. https://doi.org/10.1145/3466617
Juan Isidro González Hidalgo, S. G. T. C. Santos, Roberto S. M. Barros
A data stream can be defined as a system that continually generates large amounts of data over time. Today, processing data streams imposes new demands and challenging tasks in the data mining and machine learning areas. Concept drift is a problem commonly characterized as a change in the distribution of the data within a data stream. Implementing new methods for dealing with data streams where concept drifts occur requires algorithms that can adapt to several scenarios to improve their performance across the different experimental situations in which they are tested. This research proposes a strategy for dynamic parameter adjustment in the presence of concept drifts. The Parameter Estimation Procedure (PEP) is a general method for dynamically adjusting parameters, applied here to the diversity parameter (λ) of several classification ensembles commonly used in the area. To this end, the proposed estimation method (PEP) was used to create Boosting-like Online Learning Ensemble with Parameter Estimation (BOLE-PE), Online AdaBoost-based M1 with Parameter Estimation (OABM1-PE), and Oza and Russell's Online Bagging with Parameter Estimation (OzaBag-PE), based on the existing ensembles BOLE, OABM1, and OzaBag, respectively. To validate them, experiments were performed with artificial and real-world datasets using the Hoeffding Tree (HT) as base classifier. The accuracy results were statistically evaluated using a variation of the Friedman test and the Nemenyi post-hoc test. The experimental results showed that dynamic estimation of the diversity parameter (λ) produced good results in most scenarios, i.e., the modified methods achieved improved accuracy in the experiments with both artificial and real-world datasets.
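For context, the diversity parameter λ in Oza and Russell-style online bagging controls how many times each arriving instance is shown to each base learner (a Poisson(λ) draw). The sketch below shows that mechanism with λ exposed, so that a procedure such as PEP could adjust it over the stream; the adjustment rule itself is not reproduced here, and the base-learner interface is an assumption.

```python
import numpy as np

class OnlineBaggingLambda:
    """Oza/Russell-style online bagging in which each base learner sees each
    arriving instance Poisson(lam) times; lam is the diversity parameter that a
    procedure such as PEP would tune over the stream (tuning not shown)."""

    def __init__(self, base_learners, lam=1.0, seed=0):
        self.learners = base_learners          # objects exposing partial_fit / predict
        self.lam = lam
        self.rng = np.random.default_rng(seed)

    def partial_fit(self, x, y, classes=None):
        x = np.asarray(x).reshape(1, -1)
        for learner in self.learners:
            k = self.rng.poisson(self.lam)     # how many times this learner trains on (x, y)
            for _ in range(k):
                learner.partial_fit(x, [y], classes=classes)

    def predict(self, x):
        x = np.asarray(x).reshape(1, -1)
        votes = [learner.predict(x)[0] for learner in self.learners]
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]       # majority vote over the ensemble
```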
Dynamically Adjusting Diversity in Ensembles for the Classification of Data Streams with Concept Drift. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-07-21. https://doi.org/10.1145/3466616
K. O'Hare, Anna Jurek-Loughrey, Cassio P. de Campos
Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as part of the record linkage pipeline to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose High-Value Token-Blocking (HVTB), a simple and efficient blocking approach that is unsupervised and schema-agnostic, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss results in terms of accuracy, use of computational resources, and different characteristics of datasets and records. The simplicity of HVTB yields fast computation and does not harm its accuracy compared with existing approaches; indeed, it is shown to be significantly superior to other methods, suggesting that simpler blocking methods should be considered before resorting to more sophisticated ones.
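A minimal sketch of TF-IDF-driven token blocking in the spirit of HVTB, assuming whitespace tokenization and a fixed score threshold; the paper's exact weighting and selection of high-value tokens may differ.

```python
from collections import defaultdict
import math

def high_value_token_blocks(records, threshold=0.05):
    """records: dict record_id -> iterable of attribute values.
    Score tokens with TF-IDF and place each record into the blocks of its
    high-scoring tokens (tokenization and threshold are simplifying assumptions)."""
    docs = {rid: [tok.lower() for val in vals for tok in str(val).split()]
            for rid, vals in records.items()}
    n_docs = len(docs)
    df = defaultdict(int)                      # document frequency per token
    for toks in docs.values():
        for tok in set(toks):
            df[tok] += 1
    blocks = defaultdict(set)
    for rid, toks in docs.items():
        for tok in set(toks):
            tf = toks.count(tok) / len(toks)
            idf = math.log(n_docs / df[tok])
            if tf * idf >= threshold:          # keep only high-value tokens
                blocks[tok].add(rid)
    # only blocks holding at least two records produce candidate comparisons
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}

blocks = high_value_token_blocks({
    1: ["Hybrid Variational Autoencoder", "recommender systems"],
    2: ["variational autoencoder for recommender systems"],
    3: ["online tensor recovery", "sensor networks"],
})
```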
High-Value Token-Blocking: Efficient Blocking Method for Record Linkage. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-07-21. https://doi.org/10.1145/3450527
Zhao Li, Junshuai Song, Zehong Hu, Zhen Wang, Jun Gao
Impression regulation plays an important role in various online ranking systems; e.g., e-commerce ranking systems often need to meet local commercial demands on certain pre-labeled target items, such as cultivating fresh items or counteracting fraudulent ones, while maximizing global revenue. However, local impression regulation may cause “butterfly effects” on the global scale; e.g., in e-commerce, price-preference fluctuations in the initial conditions (overpriced or underpriced items) may create a significantly different outcome, affecting the shopping experience and bringing economic losses to platforms. To prevent such “butterfly effects”, some researchers define their regulation objectives with global constraints, using a contextual bandit at the page level that requires all items on one page to share the same regulation action, which fails to regulate impressions for individual items. To address this problem, in this article, we propose a personalized impression regulation method that directly makes regulation decisions for each user-item pair. Specifically, we model the regulation problem as a Constrained Dual-level Bandit (CDB) problem, where the local regulation actions and reward signals are at the item level, while the global effect constraint on platform impressions can be calculated only at the page level. To handle the asynchronous signals, we first expand the page-level constraint to the item level and then derive the policy update as a second-order cone optimization problem. CDB approaches the optimal policy by iteratively solving this optimization problem. Experiments are performed on both offline and online datasets, and the results, theoretically and empirically, demonstrate that CDB outperforms state-of-the-art algorithms.
Constrained Dual-Level Bandit for Personalized Impression Regulation in Online Ranking Systems. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-07-21. https://doi.org/10.1145/3461340
Guangtao Wang, G. Cong, Ying Zhang, Zhen Hai, Jieping Ye
Streams in which multiple transactions are associated with the same key are prevalent in practice; for example, a customer has multiple shopping records arriving at different times. Itemset frequency estimation on such streams is very challenging because sampling-based methods, such as the widely used reservoir sampling, cannot be applied. In this article, we propose a novel k-Minimum Value (KMV) synopsis based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract a KMV synopsis for each item from the stream. Then, we propose a novel estimator to estimate the frequency of an itemset over the KMV synopses. Compared with the existing estimator, our method is not only more accurate and efficient to compute but also satisfies the downward-closure property. These properties enable the incorporation of our new estimator into existing frequent itemset mining (FIM) algorithms (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove that it guarantees the accuracy of FIM with a bounded KMV synopsis size. Experimental results on massive streams show that our estimator significantly improves accuracy for both itemset frequency estimation and FIM compared with existing estimators.
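For orientation, the sketch below builds classical KMV synopses (the k smallest normalized hash values of the keys seen for an item) and applies the textbook distinct-count and set-intersection estimators; the paper's own itemset-frequency estimator is not reproduced here, so the combination step is an assumption.

```python
import hashlib

def kmv_synopsis(keys, k=256):
    """Keep the k smallest normalized hash values of the keys seen for an item."""
    def h(x):
        return int(hashlib.md5(str(x).encode()).hexdigest(), 16) / float(2 ** 128)
    return sorted({h(x) for x in keys})[:k]

def distinct_estimate(synopsis, k=256):
    """Classical KMV distinct-count estimator: (k - 1) / h_(k)."""
    if len(synopsis) < k:
        return len(synopsis)              # exact when fewer than k distinct keys were seen
    return (k - 1) / synopsis[k - 1]

def itemset_estimate(synopses, k=256):
    """Textbook KMV set-intersection estimate for the keys shared by all items:
    take the k smallest values of the union and scale the fraction of them that
    appear in every per-item synopsis."""
    union = sorted(set().union(*synopses))[:k]
    kk = len(union)
    if kk < 2:
        return float(kk)
    sets = [set(s) for s in synopses]
    in_all = sum(1 for v in union if all(v in s for s in sets))
    d_union = (kk - 1) / union[kk - 1] if kk >= k else kk
    return (in_all / kk) * d_union
```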
A Synopsis Based Approach for Itemset Frequency Estimation over Massive Multi-Transaction Stream. ACM Transactions on Knowledge Discovery from Data (TKDD), 2021-07-21. https://doi.org/10.1145/3465238