Inaccurate road intersections in digital road maps can seriously affect mobile navigation and other applications. Massive travel trajectories from thousands of vehicles enable frequent updating of road intersection topology. In this paper, we first expand the road intersection detection issue into a topology calibration problem for the road intersection influence zone. Unlike existing road intersection update methods, we not only determine the location and coverage of each road intersection, but also identify incorrect or missing turning paths within the whole influence zone, based on trajectories that do not match the existing map. The main challenges are that trajectories are mixed with exceptional data and that road intersections vary in size and shape. To address these challenges, we propose a three-phase calibration framework called CITT. It consists of trajectory quality improvement, core zone detection, and topology calibration within the road intersection influence zone. Together, these components automatically produce a high-quality topology of the road intersection influence zone. Extensive experiments against state-of-the-art methods, using trajectory data obtained from Didi Chuxing and Chicago campus shuttles, demonstrate that CITT is stable and robust and significantly outperforms the existing methods.
Automatic Calibration of Road Intersection Topology using Trajectories. Lisheng Zhao, Jiali Mao, Min Pu, Guoping Liu, Cheqing Jin, Weining Qian, Aoying Zhou, Xiang Wen, Runbo Hu, Hua Chai. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1633-1644. Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00145
Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00124
Jinkun Geng, Dan Li, Shuai Wang
Distributed machine learning (DML) has become common practice in industry because of the explosive volume of training data and the growing complexity of training models. Traditional DML follows data parallelism but incurs significant communication cost due to the huge amount of parameter transmission. The recently emerging model-parallel solutions can reduce the communication workload, but lead to load imbalance and serious straggler problems. More importantly, the existing solutions, whether data-parallel or model-parallel, ignore the flexible parallelism inherent in most DML tasks, thus failing to fully exploit GPU computation power. To address these drawbacks, we propose Fela, which incorporates both flexible parallelism and an elastic tuning mechanism to accelerate DML. To fully leverage GPU power and reduce communication cost, Fela adopts hybrid parallelism and uses flexible parallel degrees to train different parts of the model. Meanwhile, Fela uses a token-based scheduling policy to elastically tune the workload among different workers, thus mitigating the straggler effect and achieving better load balance. Our comparative experiments show that Fela significantly improves training throughput and outperforms the three main baselines (i.e., data-parallel, model-parallel, and hybrid-parallel) by up to 3.23×, 12.22×, and 1.85×, respectively.
Fela: Incorporating Flexible Parallelism and Elastic Tuning to Accelerate Large-Scale DML. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1393-1404.
Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00116
Xiang Yu, Guoliang Li, Chengliang Chai, N. Tang
Join order selection (JOS), the problem of finding the optimal join order for an SQL query, is a primary focus of database query optimizers. The problem is hard due to its large solution space. Exhaustively traversing the solution space is prohibitively expensive, so traversal is often combined with heuristic pruning. Despite decades-long effort, traditional optimizers still suffer from low scalability or low accuracy when handling complicated SQL queries. Recent attempts using deep reinforcement learning (DRL), which encode join trees with fixed-length hand-tuned feature vectors, have shed some light on JOS. However, fixed-length feature vectors cannot capture the structural information of a join tree, which may produce poor join plans. Moreover, they may also require retraining the neural network when handling schema changes (e.g., adding tables/columns) or multi-alias table names, which are common in SQL queries. In this paper, we present RTOS, a novel learned optimizer that uses Reinforcement learning with Tree-structured long short-term memory (LSTM) for join Order Selection. RTOS improves existing DRL-based approaches in two main aspects: (1) it adopts graph neural networks to capture the structures of join trees; and (2) it supports modification of the database schema and multi-alias table names. Extensive experiments on the Join Order Benchmark (JOB) and TPC-H show that RTOS outperforms traditional optimizers and existing DRL-based learned optimizers. In particular, the plans RTOS generates for JOB have, on average, 101% of the (estimated) cost and 67% of the latency (i.e., execution time) of the plans produced by dynamic programming, which is known to produce state-of-the-art join plans.
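To make the dynamic-programming baseline mentioned in this abstract concrete, here is a minimal sketch of join order selection over table subsets. It is not RTOS and not any particular optimizer's implementation: the cost model (sum of intermediate result sizes), the restriction to left-deep plans, and the `card` and `sel` inputs are all assumptions chosen for illustration.

```python
from itertools import combinations

def best_left_deep_order(card, sel):
    """Cheapest left-deep join order under a toy cost model.

    card: {table: row count}                      (hypothetical statistics)
    sel:  {frozenset({a, b}): join selectivity}   (missing pair = cross product)
    Cost of a plan = sum of the sizes of all intermediate results.
    """
    tables = sorted(card)
    # best[S] = (cost, rows, order) for the cheapest plan that joins exactly S
    best = {frozenset([t]): (0.0, float(card[t]), (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            for t in subset:  # t is the table joined last
                cost, rows, order = best[s - {t}]
                factor = 1.0
                for r in s - {t}:
                    factor *= sel.get(frozenset((r, t)), 1.0)
                out_rows = rows * card[t] * factor
                candidate = (cost + out_rows, out_rows, order + (t,))
                if s not in best or candidate[0] < best[s][0]:
                    best[s] = candidate
    return best[frozenset(tables)]

# Hypothetical three-table query: A joins B, B joins C, no predicate between A and C.
card = {"A": 1000, "B": 10, "C": 100}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
cost, rows, order = best_left_deep_order(card, sel)
```

The table is filled for every subset of tables, which is why the search space grows exponentially and why pruning or learned policies become attractive for queries with many joins.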
Reinforcement Learning with Tree-LSTM for Join Order Selection. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1297-1308.
Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00018
Lei Guo, Hongzhi Yin, Qinyong Wang, B. Cui, Zi Huang, Li-zhen Cui
Group Recommendation (GR) is the task of suggesting relevant items/events to a group of users in online systems; its major challenge is to aggregate the preferences of group members to infer the decision of the group. Prior group recommendation methods applied predefined static strategies for preference aggregation. However, these static strategies are insufficient to model the complicated decision-making process of a group, especially for occasional groups, which are formed ad hoc. Compared to the conventional individual recommendation task, GR is rather dynamic, and each group member may contribute differently to the final group decision. Recent works argue that group members should have non-uniform weights in forming the decision of a group, and try to use a standard attention mechanism to aggregate the preferences of group members, but they do not model the interaction behavior among group members, and the decision-making process is largely unexplored. In this work, we study GR in a more general scenario, namely Occasional Group Recommendation (OGR), and focus on solving the preference aggregation problem and the data sparsity issue of group-item interactions. Instead of exploring a new heuristic or vanilla attention-based mechanism, we propose a new social self-attention based aggregation strategy that directly models the interactions among group members, namely Group Self-Attention (GroupSA). In GroupSA, we treat the group decision-making process as multiple voting processes, and develop a stacked social self-attention network to simulate how a group consensus is reached. To overcome the data sparsity issue, we resort to the relatively abundant user-item and user-user interaction data, and enhance the representation of users with two types of aggregation methods. In the training process, we further propose a joint training method to learn the user/item embeddings in the group-item recommendation task and the user-item recommendation task simultaneously. Finally, we conduct extensive experiments on two real-world datasets. The experimental results demonstrate the superiority of our proposed GroupSA method over several state-of-the-art methods in terms of HR and NDCG.
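As a rough illustration of what aggregating member preferences with self-attention means, the sketch below applies a single round of scaled dot-product self-attention to member embeddings and then averages the result. It is not the GroupSA architecture: there are no learned projections, no social mask, and no stacking, and the function names and the final averaging step are assumptions made only for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_group_embedding(members):
    """One round of scaled dot-product self-attention, then mean pooling.

    members: list of equal-length embedding vectors, one per group member.
    """
    dim = len(members[0])
    updated = []
    for query in members:
        # How strongly this member attends to every member (including itself).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in members]
        weights = softmax(scores)
        updated.append([sum(w * m[i] for w, m in zip(weights, members))
                        for i in range(dim)])
    # Mean pooling of the updated member vectors gives one group vector.
    return [sum(vec[i] for vec in updated) / len(updated) for i in range(dim)]

group_vector = self_attention_group_embedding([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])
```

Each member's updated vector is a weighted mix of all members' vectors, so members with similar preferences reinforce each other before the group vector is formed.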
Group Recommendation with Latent Voting Mechanism. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 121-132.
Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00168
Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas
Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, current approaches have severe limitations: they either require prior domain knowledge, or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. We recently proposed NorM, a novel approach suitable for domain-agnostic anomaly detection, which addresses the aforementioned problems by detecting anomalies based on their (dis)similarity to a model that represents normal behavior. The experimental results on several real datasets demonstrate that the proposed approach outperforms the current state of the art in terms of both accuracy and execution time. In this demonstration, we present a system for unsupervised Subsequence Anomaly Detection (SAD) that uses the NorM method. Through various scenarios with real datasets, we showcase the challenges of the problem, and we demonstrate the advantages of the proposed system.
SAD: An Unsupervised System for Subsequence Anomaly Detection. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1778-1781.
Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00195
Stella Giannakopoulou, M. Karpathiotakis, A. Ailamaki
Data cleaning is a time-consuming process that depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate offline process that takes place before analysis begins. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and the query workload, and therefore spends effort on understanding and cleaning data that the analysis does not need. We propose an approach that performs probabilistic repair of functional dependency violations on demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query workloads over dirty data by weaving cleaning operators into the query plan. Our evaluation shows that Daisy adapts to the workload and outperforms traditional offline cleaning on both synthetic and real-world workloads.
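For readers unfamiliar with functional dependency (FD) violations, the sketch below shows what one looks like and how candidate repairs might be attached to only the values a query touches. It is not Daisy's operator: the frequency-based probabilities, the function name, and the `touched_keys` parameter are assumptions for illustration, and the abstract does not describe the actual repair model.

```python
from collections import Counter, defaultdict

def fd_repair_candidates(rows, lhs, rhs, touched_keys):
    """Find violations of the FD lhs -> rhs among the lhs values a query touches.

    Returns {lhs value: {candidate rhs value: relative frequency}} for every
    touched lhs value that maps to more than one rhs value.
    """
    counts = defaultdict(Counter)
    for row in rows:
        if row[lhs] in touched_keys:  # clean on demand: ignore untouched data
            counts[row[lhs]][row[rhs]] += 1
    candidates = {}
    for key, values in counts.items():
        if len(values) > 1:  # same lhs value, different rhs values: a violation
            total = sum(values.values())
            candidates[key] = {v: c / total for v, c in values.items()}
    return candidates

# Hypothetical dirty table with the FD zip -> city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Boston"},
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicgo"},
]
repairs = fd_repair_candidates(rows, "zip", "city", touched_keys={"10001", "02139"})
```

In this example the violation on zip `60601` is left alone because no query touched it, which is the intuition behind interleaving cleaning with analysis rather than cleaning everything up front.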
Query-driven Repair of Functional Dependency Violations. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1886-1889.
Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00107
Lei Li, Mengxuan Zhang, Wen Hua, Xiaofang Zhou
The shortest path query is a fundamental operation in various location-based services (LBS), most of which process queries on the server side. As the business expands, scalability becomes a severe issue. Instead of simply deploying more servers to cope with the quickly increasing number of queries, batch shortest path algorithms have been proposed recently to answer a set of queries together using shareable computation. In addition, they can work in a highly dynamic environment because no index is needed. However, the existing batch algorithms either assume the batch queries are finely decomposed or just process them without differentiation, resulting in poor query efficiency. In this paper, we aim to improve the performance of batch shortest path algorithms by revisiting the problem of query clustering. Specifically, we first propose three query decomposition methods to cluster queries: Zigzag, which considers the 1-N shared computation; Search-Space Estimation, which further incorporates search space estimation; and Co-Clustering, which considers the source and target’s spatial locality. After that, we propose two batch algorithms that take advantage of the previously decomposed query sets for efficient query answering: Local Cache, which improves the existing Global Cache with a higher cache hit ratio, and R2R, which finds a set of approximate shortest paths from one region to another with bounded error. Experiments on large real-world query sets verify the effectiveness and efficiency of our decomposition methods compared with the state-of-the-art batch algorithms.
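The 1-N shared computation mentioned in this abstract can be illustrated with a small sketch: queries that share a source are grouped, and a single Dijkstra search answers all of them. This is only the basic sharing idea, not the Zigzag, Local Cache, or R2R algorithms, and the function names and graph representation are assumptions for illustration.

```python
import heapq
from collections import defaultdict

def dijkstra_to_targets(graph, source, targets):
    """One Dijkstra search from source that stops once all targets are settled.

    graph: {node: [(neighbor, non-negative weight), ...]}
    """
    dist = {source: 0}
    settled = set()
    remaining = set(targets)
    heap = [(0, source)]
    while heap and remaining:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        remaining.discard(u)
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    # Targets that were never settled are unreachable from the source.
    return {t: dist[t] if t in settled else float("inf") for t in targets}

def answer_batch(graph, queries):
    """Answer (source, target) queries with one search per distinct source."""
    by_source = defaultdict(set)
    for s, t in queries:
        by_source[s].add(t)
    answers = {}
    for s, targets in by_source.items():
        found = dijkstra_to_targets(graph, s, targets)
        for t in targets:
            answers[(s, t)] = found[t]
    return answers

# Hypothetical toy road network.
graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2), ("d", 5)], "c": [("d", 1)], "d": []}
answers = answer_batch(graph, [("a", "d"), ("a", "c"), ("b", "d")])
```

Here three queries need only two searches, because the two queries starting at `a` share one expansion of the graph.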
Fast Query Decomposition for Batch Shortest Path Processing in Road Networks. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1189-1200.
Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00039
Yuanduo He, Xu Chu, Yasha Wang
Unsupervised time series mining has been attracting great interest from both academic and industrial communities. As the two most basic data mining tasks, the discovery of frequent and rare subsequences has been extensively studied in the literature. Specifically, frequent/rare subsequences are defined as the ones with the smallest/largest 1-nearest neighbor distance, and are also known as motifs/discords. However, discord fails to identify a rare subsequence when it occurs more than once in the time series, which is widely known as the twin freak problem. This problem is just the "tip of the iceberg" of the issues caused by the 1-nearest neighbor distance based definitions. In this work, we provide, for the first time, a clear theoretical analysis of motif/discord as 1-nearest neighbor based nonparametric density estimation of subsequences. In particular, we focus on matrix profile, a recently proposed mining framework that unifies the discovery of motifs and discords under the same computing model. We then point out three inherent issues: low-quality density estimation, gravity-defiant behavior, and lack of a reusable model, which degrade the performance of matrix profile in both efficiency and subsequence quality. To overcome these issues, we propose Neighbor Profile, which robustly models the subsequence density by bagging nearest neighbors for the discovery of frequent/rare subsequences. Specifically, we leverage multiple subsamples and average the density estimations from the subsamples using adjusted nearest neighbor distances, which not only enhances the estimation robustness but also yields a reusable model for efficient learning. We check the sanity of neighbor profile on synthetic data and further evaluate it on real-world datasets. The experimental results demonstrate that neighbor profile can correctly model subsequences of different densities and significantly outperforms matrix profile on the real-world arrhythmia dataset. We also show that neighbor profile is efficient for massive datasets.
Neighbor Profile: Bagging Nearest Neighbors for Unsupervised Time Series Mining. 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 373-384.
Reconstructing the topology of a diffusion network based on observed diffusion results is an open challenge in data mining. Existing approaches mostly assume that the observed diffusion results are available and consist of not only the final infection statuses of nodes, but also the exact timestamps that pinpoint when infections occur. Nonetheless, the exact infection timestamps are often unavailable in practice, due to the high cost and uncertainty of monitoring node infections. In this work, we investigate the problem of how to infer the topology of a diffusion network from only the final infection statuses of nodes. To this end, we propose a new scoring criterion for diffusion network reconstruction, which is able to estimate the likelihood of potential topologies of the objective diffusion network based on infection status results with a relatively low statistical error. As the proposed scoring criterion is decomposable, our problem is transformed into finding, for each node in the network, a set of most probable parent nodes that maximizes the value of a local score. Furthermore, to eliminate redundant computations during the search for the most probable parent nodes, we identify insignificant candidate parent nodes by checking whether their infections have negative or extremely low positive correlations with the infections of a corresponding child node, and exclude them from the search space. Extensive experiments on both synthetic and real-world networks are conducted, and the results verify the effectiveness and efficiency of our approach.
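The pruning step described above can be sketched as follows. This is only an illustration under stated assumptions: final infection statuses are stored as a 0/1 matrix with one row per diffusion process and one column per node, Pearson correlation stands in for the correlation measure, and the cutoff `min_corr` and the function name `candidate_parents` are hypothetical. The paper's scoring criterion itself is not reproduced here.

```python
import numpy as np


def candidate_parents(statuses, min_corr=0.1):
    """Keep, for each node, only the nodes whose infections correlate with it.

    statuses: array-like of shape (n_processes, n_nodes) with 0/1 entries.
    min_corr: assumed cutoff; pairs with correlation at or below it are pruned.
    """
    S = np.asarray(statuses, dtype=float)
    n_nodes = S.shape[1]
    centered = S - S.mean(axis=0)
    std = S.std(axis=0)
    cov = centered.T @ centered / S.shape[0]
    denom = np.outer(std, std)
    # Nodes with constant status have zero variance; treat their correlation as 0.
    corr = np.zeros_like(cov)
    np.divide(cov, denom, out=corr, where=denom > 0)
    return {
        j: [i for i in range(n_nodes) if i != j and corr[i, j] > min_corr]
        for j in range(n_nodes)
    }
```

A local score would then only need to be evaluated over subsets of each node's surviving candidates, which is where the saving in search effort would come from.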
{"title":"Statistical Estimation of Diffusion Network Topologies","authors":"Ke‐qi Han, Yuan Tian, Yunjia Zhang, Ling Han, H. Huang, Yunjun Gao","doi":"10.1109/ICDE48307.2020.00060","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00060","url":null,"abstract":"Reconstructing the topology of a diffusion network based on observed diffusion results is an open challenge in data mining. Existing approaches mostly assume that the observed diffusion results are available and consist of not only the final infection statuses of nodes, but also the exact timestamps that pinpoint when infections occur. Nonetheless, the exact infection timestamps are often unavailable in practice, due to a high cost and uncertainties in the monitoring of node infections. In this work, we investigate the problem of how to infer the topology of a diffusion network from only the final infection statuses of nodes. To this end, we propose a new scoring criterion for diffusion network reconstruction, which is able to estimate the likelihood of potential topologies of the objective diffusion network based on infection status results with a relatively low statistical error. As the proposed scoring criterion is decomposable, our problem is transformed into finding for each node in the network a set of most probable parent nodes that maximizes the value of a local score. Furthermore, to eliminate redundant computations during the search of most probable parent nodes, we identify insignificant candidate parent nodes by checking whether their infections have negative or extremely low positive correlations with the infections of a corresponding child node, and exclude them from the search space. 
Extensive experiments on both synthetic and real-world networks are conducted, and the results verify the effectiveness and efficiency of our approach.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"146 1","pages":"625-636"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88684404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2020-04-01 DOI: 10.1109/ICDE48307.2020.00239
Yong Zhang, Jiacheng Wu, Jin Wang, Chunxiao Xing
Set similarity search is a fundamental operation in a variety of applications [3], [5], [2]. There is a long line of research on the problem of set similarity search. Given a collection of set records, a query, and a similarity function, the algorithm returns all set records that are similar to the query. There are many metrics for measuring the similarity between two sets, such as Overlap, Jaccard, Cosine, and Dice. In this paper we use the widely applied Jaccard metric to quantify the similarity between two sets, but our proposed techniques can be easily extended to other set-based similarity functions. Previous approaches require users to specify a similarity threshold. However, in many scenarios it is rather difficult to specify such a threshold. For example, when users type keywords into a search engine, they pay more attention to the results ranked near the top, say the top five. In this case, if we use threshold-based search instead of KNN similarity search, it is difficult to find the results that are more attractive to users.
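To make the problem statement concrete, here is a brute-force baseline for KNN set similarity search under Jaccard. It is a sketch of the task only, not the transformation-based framework the paper proposes; the function names `jaccard` and `knn_jaccard` are hypothetical, and treating two empty sets as fully similar is an assumed convention.

```python
import heapq


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two collections of tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)


def knn_jaccard(records, query, k):
    """Return the k records most similar to query as (similarity, index) pairs."""
    scored = ((jaccard(rec, query), i) for i, rec in enumerate(records))
    # nlargest keeps only k items in memory; no similarity threshold is needed.
    return heapq.nlargest(k, scored)


records = [{"a", "b", "c"}, {"a", "b"}, {"x", "y"}, {"a", "b", "c", "d"}]
top2 = knn_jaccard(records, {"a", "b", "c"}, k=2)
```

The caller asks for a result count rather than a threshold, which matches the search-engine scenario above. This baseline scans every record, so it is useful mainly as a correctness reference for faster methods.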
{"title":"A Transformation-based Framework for KNN Set Similarity Search(Extended Abstract)","authors":"Yong Zhang, Jiacheng Wu, Jin Wang, Chunxiao Xing","doi":"10.1109/ICDE48307.2020.00239","DOIUrl":"https://doi.org/10.1109/ICDE48307.2020.00239","url":null,"abstract":"Set similarity search is a fundamental operation in a variety of applications [3] , [5] , [2] . There is a long stream of research on the problem of set similarity search. Given a collection of set records, a query and a similarity function, the algorithm will return all the set records that are similarity with the query. There are many metrics to measure the similarity between two sets, such as Overlap, Jaccard, Cosine and Dice. In this paper we use the widely applied Jaccard to quantify the similarity between two sets, but our proposed techniques can be easily extended to other set-based similarity functions. Previous approaches require users to specify a threshold of similarity. However, in many scenarios it is rather difficult to specify such a threshold. For example, when users types some keywords in the search engine, they will pay more attention for the results which rank in the front, say the top five ones. In this case, if we use threshold-based search instead of KNN similarity search, it is difficult to find the results that are more attractive for users.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"2040-2041"},"PeriodicalIF":0.0,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89392907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}