2020 IEEE 36th International Conference on Data Engineering (ICDE): Latest Papers, Page 2

Automatic Calibration of Road Intersection Topology using Trajectories

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00145

Lisheng Zhao, Jiali Mao, Min Pu, Guoping Liu, Cheqing Jin, Weining Qian, Aoying Zhou, Xiang Wen, Runbo Hu, Hua Chai

Inaccurate road intersections in a digital road map can seriously degrade mobile navigation and other applications. Massive travel trajectories from thousands of vehicles make it possible to update road intersection topology frequently. In this paper, we first extend the road intersection detection problem into a topology calibration problem for the road intersection influence zone. Unlike existing road intersection update methods, we not only determine the location and coverage of a road intersection, but also identify incorrect or missing turning paths within the whole influence zone, based on trajectories that do not match the existing map. The main challenges are that trajectories are mixed with exceptional data and that road intersections vary in size and shape. To address these challenges, we propose a three-phase calibration framework called CITT. It consists of trajectory quality improvement, core zone detection, and topology calibration within the road intersection influence zone. With these components, it automatically obtains a high-quality topology of the road intersection influence zone. Extensive experiments against state-of-the-art methods, using trajectory data obtained from Didi Chuxing and Chicago campus shuttles, demonstrate that CITT is stable and robust and significantly outperforms the existing methods.

Pages: 1633-1644

Citations: 9

Fela: Incorporating Flexible Parallelism and Elastic Tuning to Accelerate Large-Scale DML

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00124

Jinkun Geng, Dan Li, Shuai Wang

Distributed machine learning (DML) has become common practice in industry because of the explosive volume of training data and the growing complexity of training models. Traditional DML follows data parallelism but incurs significant communication cost due to the huge amount of parameter transmission. The recently emerging model-parallel solutions can reduce the communication workload but lead to load imbalance and serious straggler problems. More importantly, the existing solutions, whether data-parallel or model-parallel, ignore the flexible parallelism inherent in most DML tasks and thus fail to fully exploit GPU computation power. To address these drawbacks, we propose Fela, which incorporates both flexible parallelism and an elastic tuning mechanism to accelerate DML. To fully leverage GPU power and reduce communication cost, Fela adopts hybrid parallelism and uses flexible parallel degrees to train different parts of the model. Meanwhile, Fela uses a token-based scheduling policy to elastically tune the workload among different workers, thus mitigating the straggler effect and achieving better load balance. Our comparative experiments show that Fela significantly improves training throughput and outperforms the three main baselines (i.e., data-parallel, model-parallel, and hybrid-parallel) by up to 3.23×, 12.22×, and 1.85×, respectively.

Pages: 1393-1404

Citations: 2

Reinforcement Learning with Tree-LSTM for Join Order Selection

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00116

Xiang Yu, Guoliang Li, Chengliang Chai, N. Tang

Join order selection (JOS), the problem of finding the optimal join order for an SQL query, is a primary focus of database query optimizers. The problem is hard due to its large solution space. Exhaustively traversing the solution space is prohibitively expensive, so exhaustive search is often combined with heuristic pruning. Despite decades-long effort, traditional optimizers still suffer from low scalability or low accuracy when handling complicated SQL queries. Recent attempts using deep reinforcement learning (DRL), which encode join trees with fixed-length hand-tuned feature vectors, have shed some light on JOS. However, fixed-length feature vectors cannot capture the structural information of a join tree, which may produce poor join plans. Moreover, they may also require retraining the neural network when handling schema changes (e.g., adding tables/columns) or multi-alias table names, which are common in SQL queries. In this paper, we present RTOS, a novel learned optimizer that uses Reinforcement learning with Tree-structured long short-term memory (LSTM) for join Order Selection. RTOS improves existing DRL-based approaches in two main aspects: (1) it adopts graph neural networks to capture the structures of join trees; and (2) it supports the modification of database schema and multi-alias table names. Extensive experiments on Join Order Benchmark (JOB) and TPC-H show that RTOS outperforms traditional optimizers and existing DRL-based learned optimizers. In particular, the plan RTOS generated for JOB is 101% on (estimated) cost and 67% on latency (i.e., execution time) on average, compared with dynamic programming, which is known to produce state-of-the-art results on join plans.
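
To make the size of the search space concrete, the following minimal Python sketch enumerates every left-deep join order by brute force. It is not RTOS or any optimizer named in the abstract; `plan_cost`, `best_left_deep_order`, the cardinalities in `card`, and the selectivities in `sel` are hypothetical, and the cost model (sum of intermediate result sizes) is a toy assumption.

```python
from itertools import permutations


def plan_cost(order, card, sel):
    """Toy cost of a left-deep plan: the sum of intermediate result sizes.

    card maps table name -> row count; sel maps frozenset({t1, t2}) -> join
    selectivity (pairs without an entry are treated as cross products).
    """
    joined = [order[0]]
    size = float(card[order[0]])
    cost = 0.0
    for table in order[1:]:
        size *= card[table]
        # Apply every join predicate between the new table and tables already joined.
        for other in joined:
            size *= sel.get(frozenset((other, table)), 1.0)
        joined.append(table)
        cost += size
    return cost


def best_left_deep_order(card, sel):
    """Exhaustive search: tries all n! left-deep orders (assumption: n is tiny)."""
    return min(permutations(card), key=lambda order: plan_cost(order, card, sel))


card = {"A": 1000, "B": 10, "C": 100}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
best = best_left_deep_order(card, sel)
print(best, plan_cost(best, card, sel))
```

With n tables the loop visits n! orders, which is why exhaustive search is only practical for small queries and is usually paired with pruning.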

Pages: 1297-1308

Citations: 85

Group Recommendation with Latent Voting Mechanism

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00018

Lei Guo, Hongzhi Yin, Qinyong Wang, B. Cui, Zi Huang, Li-zhen Cui

Group Recommendation (GR) is the task of suggesting relevant items/events for a group of users in online systems, whose major challenge is to aggregate the preferences of group members to infer the decision of a group. Prior group recommendation methods applied predefined static strategies for preference aggregation. However, these static strategies are insufficient to model the complicated decision-making process of a group, especially for occasional groups that are formed ad hoc. Compared to the conventional individual recommendation task, GR is rather dynamic, and each group member may contribute differently to the final group decision. Recent works argue that group members should have non-uniform weights in forming the decision of a group, and try to utilize a standard attention mechanism to aggregate the preferences of group members, but they do not model the interaction behavior among group members, and the decision-making process is largely unexplored. In this work, we study GR in a more general scenario, namely Occasional Group Recommendation (OGR), and focus on solving the preference aggregation problem and the data sparsity issue of group-item interactions. Instead of exploring a new heuristic or vanilla attention-based mechanism, we propose a new social self-attention based aggregation strategy that directly models the interactions among group members, namely Group Self-Attention (GroupSA). In GroupSA, we treat the group decision-making process as multiple voting processes, and develop a stacked social self-attention network to simulate how a group consensus is reached. To overcome the data sparsity issue, we resort to the relatively abundant user-item and user-user interaction data, and enhance the representation of users by two types of aggregation methods. In the training process, we further propose a joint training method to learn the user/item embeddings in the group-item recommendation task and the user-item recommendation task simultaneously. Finally, we conduct extensive experiments on two real-world datasets. The experimental results demonstrate the superiority of our proposed GroupSA method compared to several state-of-the-art methods in terms of HR and NDCG.
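
As a rough illustration of attention-based preference aggregation, the sketch below applies one layer of generic scaled dot-product self-attention to member embeddings and then averages the result. It is not the GroupSA architecture: the identity query/key/value projections, the single layer, and the mean pooling in `group_embedding` are simplifying assumptions, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(members):
    """One self-attention layer with identity projections (an assumption).

    Each member's output is a weighted mix of all members' embeddings, with
    weights given by scaled dot-product similarity.
    """
    dim = len(members[0])
    attended = []
    for query in members:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in members]
        weights = softmax(scores)
        attended.append([sum(w * value[j] for w, value in zip(weights, members))
                         for j in range(dim)])
    return attended


def group_embedding(members):
    """Mean-pool the attended member embeddings into one group vector."""
    attended = self_attention(members)
    return [sum(row[j] for row in attended) / len(attended)
            for j in range(len(attended[0]))]


members = [[1.0, 0.0], [0.0, 1.0]]
print(group_embedding(members))
```

Stacking several such layers lets each member's representation be revised repeatedly in light of the others, which is the intuition behind treating the decision as multiple rounds of voting.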

Pages: 121-132

Citations: 29

SAD: An Unsupervised System for Subsequence Anomaly Detection

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00168

Paul Boniol, Michele Linardi, Federico Roncallo, Themis Palpanas

Subsequence anomaly (or outlier) detection in long sequences is an important problem with applications in a wide range of domains. However, current approaches have severe limitations: they either require prior domain knowledge, or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. We recently proposed NorM, a novel approach suitable for domain-agnostic anomaly detection, which addresses the aforementioned problems by detecting anomalies based on their (dis)similarity to a model that represents normal behavior. The experimental results on several real datasets demonstrate that the proposed approach outperforms the current state of the art in terms of both accuracy and execution time. In this demonstration, we present a system for unsupervised Subsequence Anomaly Detection (SAD) that uses the NorM method. Through various scenarios with real datasets, we showcase the challenges of the problem, and we demonstrate the advantages of the proposed system.

Citations: 14

Query-driven Repair of Functional Dependency Violations

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00195

Stella Giannakopoulou, M. Karpathiotakis, A. Ailamaki

Data cleaning is a time-consuming process that depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate offline process that takes place before analysis begins. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and the query workload, thereby requiring effort on understanding and cleaning data that is unnecessary for the analysis. We propose an approach that performs probabilistic repair of functional dependency violations on demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query workloads over dirty data by weaving cleaning operators into the query plan. Our evaluation shows that Daisy adapts to the workload and outperforms traditional offline cleaning on both synthetic and real-world workloads.
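
For readers unfamiliar with the term, a functional dependency X -> Y is violated when two rows agree on X but disagree on Y. The sketch below only detects such violations; it does not implement Daisy's probabilistic repair or its query-plan integration. The function name `fd_violations` and the sample `zip`/`city` rows are hypothetical.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs_values: set_of_conflicting_rhs_values} for the FD lhs -> rhs.

    rows is a list of dicts, lhs a list of column names, rhs a single column.
    """
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[column] for column in lhs)
        seen[key].add(row[rhs])
    # A key that maps to more than one right-hand value violates the dependency.
    return {key: values for key, values in seen.items() if len(values) > 1}


rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
print(fd_violations(rows, ["zip"], "city"))
```

An on-demand cleaner would run a check like this only over the rows a query actually touches, rather than over the whole dataset up front.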

Citations: 3

Fast Query Decomposition for Batch Shortest Path Processing in Road Networks

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00107

Lei Li, Mengxuan Zhang, Wen Hua, Xiaofang Zhou

Shortest path query is a fundamental operation in various location-based services (LBS), and most of them process queries on the server side. As the business expands, scalability becomes a severe issue. Instead of simply deploying more servers to cope with the rapidly growing number of queries, batch shortest path algorithms have been proposed recently to answer a set of queries together using shareable computation. In addition, they can work in a highly dynamic environment because no index is needed. However, the existing batch algorithms either assume the batch queries are finely decomposed or just process them without differentiation, resulting in poor query efficiency. In this paper, we aim to improve the performance of batch shortest path algorithms by revisiting the problem of query clustering. Specifically, we first propose three query decomposition methods to cluster queries: Zigzag, which considers the 1-N shared computation; Search-Space Estimation, which further incorporates search space estimation; and Co-Clustering, which considers the source and target’s spatial locality. After that, we propose two batch algorithms that take advantage of the previously decomposed query sets for efficient query answering: Local Cache, which improves the existing Global Cache with a higher cache hit ratio, and R2R, which finds a set of approximate shortest paths from one region to another with bounded error. Experiments on a large real-world query set verify the effectiveness and efficiency of our decomposition methods compared with the state-of-the-art batch algorithms.
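
The simplest form of 1-N shared computation can be sketched as follows: queries that start at the same source are grouped, and a single Dijkstra run answers all of them. This is only an illustration of the sharing idea, not the Zigzag, Search-Space Estimation, or Co-Clustering methods; grouping by exact source match and the names `dijkstra` and `answer_batch` are assumptions.

```python
import heapq
from collections import defaultdict


def dijkstra(graph, source):
    """Single-source shortest distances; graph maps node -> [(neighbor, weight)]."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def answer_batch(graph, queries):
    """Answer (source, target) queries with one search per distinct source."""
    targets_by_source = defaultdict(list)
    for source, target in queries:
        targets_by_source[source].append(target)
    answers = {}
    for source, targets in targets_by_source.items():
        dist = dijkstra(graph, source)  # shared by every query from this source
        for target in targets:
            answers[(source, target)] = dist.get(target, float("inf"))
    return answers


graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": []}
answers = answer_batch(graph, [("a", "b"), ("a", "c"), ("b", "c")])
print(answers)
```

Here three queries need only two searches. Real workloads rarely share exact sources, which is why clustering queries by spatial proximity matters.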

Pages: 1189-1200

Citations: 32

Neighbor Profile: Bagging Nearest Neighbors for Unsupervised Time Series Mining

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00039

Yuanduo He, Xu Chu, Yasha Wang

Unsupervised time series mining has been attracting great interest from both academic and industrial communities. As the two most basic data mining tasks, the discoveries of frequent/rare subsequences have been extensively studied in the literature. Specifically, frequent/rare subsequences are defined as the ones with the smallest/largest 1-nearest neighbor distance, which are also known as motif/discord. However, discord fails to identify a rare subsequence when it occurs more than once in the time series, which is widely known as the twin freak problem. This problem is just the "tip of the iceberg" of the issues caused by the 1-nearest neighbor distance based definitions. In this work, we provide, for the first time, a clear theoretical analysis of motif/discord as the 1-nearest neighbor based nonparametric density estimation of subsequences. In particular, we focus on matrix profile, a recently proposed mining framework, which unifies the discovery of motif and discord under the same computing model. Thereafter, we point out three inherent issues: low-quality density estimation, gravity-defiant behavior, and lack of a reusable model, which degrade the performance of matrix profile in both efficiency and subsequence quality. To overcome these issues, we propose Neighbor Profile to robustly model the subsequence density by bagging nearest neighbors for the discovery of frequent/rare subsequences. Specifically, we leverage multiple subsamples and average the density estimations from subsamples using adjusted nearest neighbor distances, which not only enhances the estimation robustness but also realizes a reusable model for efficient learning. We check the sanity of neighbor profile on synthetic data and further evaluate it on real-world datasets. The experimental results demonstrate that neighbor profile can correctly model the subsequences of different densities and significantly outperforms matrix profile on the real-world arrhythmia dataset. Also, it is shown that neighbor profile is efficient for massive datasets.
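
The 1-nearest neighbor definition and the twin freak problem can be reproduced with a brute-force sketch. This is neither the matrix profile algorithm nor Neighbor Profile: it is a quadratic-time illustration, and the z-normalization, the exclusion zone of one full window, and the names `znorm` and `nn_distance_profile` are assumptions.

```python
import math


def znorm(window):
    """Z-normalize a window; constant windows map to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def nn_distance_profile(series, m):
    """Distance from each length-m subsequence to its nearest non-overlapping match.

    Small values mark motif candidates, large values mark discord candidates.
    A value stays infinite if no non-overlapping subsequence exists.
    """
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    profile = []
    for i, a in enumerate(windows):
        best = float("inf")
        for j, b in enumerate(windows):
            if abs(i - j) < m:  # exclusion zone: skip overlapping (trivial) matches
                continue
            best = min(best, math.dist(a, b))
        profile.append(best)
    return profile


base = [0, 1, 0, -1] * 8
odd = [0, 3, 3, 0]
single = base[:12] + odd + base[16:]                   # unusual shape occurs once
twin = base[:12] + odd + base[16:24] + odd + base[28:]  # same shape occurs twice
print(nn_distance_profile(single, 4)[12], nn_distance_profile(twin, 4)[12])
```

In `single` the unusual window has no close match, so its distance is large. In `twin` each copy finds the other as its nearest neighbor, its distance collapses, and a purely 1-nearest-neighbor discord definition overlooks it.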

Pages: 373-384

Citations: 8

Statistical Estimation of Diffusion Network Topologies

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00060

Ke‐qi Han, Yuan Tian, Yunjia Zhang, Ling Han, H. Huang, Yunjun Gao

Reconstructing the topology of a diffusion network based on observed diffusion results is an open challenge in data mining. Existing approaches mostly assume that the observed diffusion results are available and consist of not only the final infection statuses of nodes, but also the exact timestamps that pinpoint when infections occur. Nonetheless, the exact infection timestamps are often unavailable in practice, due to a high cost and uncertainties in the monitoring of node infections. In this work, we investigate the problem of how to infer the topology of a diffusion network from only the final infection statuses of nodes. To this end, we propose a new scoring criterion for diffusion network reconstruction, which is able to estimate the likelihood of potential topologies of the objective diffusion network based on infection status results with a relatively low statistical error. As the proposed scoring criterion is decomposable, our problem is transformed into finding for each node in the network a set of most probable parent nodes that maximizes the value of a local score. Furthermore, to eliminate redundant computations during the search of most probable parent nodes, we identify insignificant candidate parent nodes by checking whether their infections have negative or extremely low positive correlations with the infections of a corresponding child node, and exclude them from the search space. Extensive experiments on both synthetic and real-world networks are conducted, and the results verify the effectiveness and efficiency of our approach.
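The pruning step described above, which discards candidate parents whose infections correlate negatively or only very weakly with those of the child node, can be sketched as follows. The abstract does not say which correlation measure or cutoff the authors use, so the Pearson correlation, the `min_corr` threshold, and the function name `candidate_parents` are all assumptions for illustration.

```python
import numpy as np

def candidate_parents(statuses, child, min_corr=0.1):
    """Keep nodes whose final infection statuses correlate with those of `child`.

    statuses: 2-D array, one row per observed diffusion process and one column
              per node, with 1 = infected and 0 = uninfected at the end.
    min_corr: hypothetical cutoff for an "extremely low" positive correlation.
    """
    s = np.asarray(statuses, dtype=float)
    # Constant columns have zero variance and yield NaN; silence that warning.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(s, rowvar=False)[child]
    corr = np.nan_to_num(corr, nan=0.0)  # treat undefined correlation as no signal
    return [j for j in range(s.shape[1]) if j != child and corr[j] > min_corr]

# Eight diffusion processes over four nodes (toy data, not from the paper).
statuses = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
])

parents_of_1 = candidate_parents(statuses, child=1)
```

Here node 2 is negatively correlated with node 1 and node 3 is always infected, so both are excluded and only node 0 remains in the search space for node 1. A local score would then be maximized over subsets of the surviving candidates only.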



Citations: 6

A Transformation-based Framework for KNN Set Similarity Search (Extended Abstract)

2020 IEEE 36th International Conference on Data Engineering (ICDE)

Pub Date : 2020-04-01 DOI: 10.1109/ICDE48307.2020.00239

Yong Zhang, Jiacheng Wu, Jin Wang, Chunxiao Xing

Set similarity search is a fundamental operation in a variety of applications [3], [5], [2]. There is a long line of research on the problem of set similarity search. Given a collection of set records, a query, and a similarity function, the algorithm returns all the set records that are similar to the query. There are many metrics for measuring the similarity between two sets, such as Overlap, Jaccard, Cosine, and Dice. In this paper we use the widely applied Jaccard similarity to quantify the similarity between two sets, but our proposed techniques can be easily extended to other set-based similarity functions. Previous approaches require users to specify a similarity threshold. However, in many scenarios it is rather difficult to specify such a threshold. For example, when users type keywords into a search engine, they pay more attention to the top-ranked results, say the top five. In this case, if we use threshold-based search instead of KNN similarity search, it is difficult to find the results that are most attractive to users.
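The contrast between threshold-based search and KNN search under Jaccard similarity can be sketched with a brute-force baseline. This is not the transformation-based framework proposed in the paper. The function names, the tie-breaking rule, and the convention that two empty sets have similarity 1.0 are assumptions for illustration.

```python
import heapq

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def threshold_search(records, query, threshold):
    """Indices of all records whose similarity to the query reaches the threshold."""
    return [i for i, r in enumerate(records) if jaccard(r, query) >= threshold]

def knn_search(records, query, k):
    """The k most similar records as (similarity, index); ties favor lower index."""
    scored = ((jaccard(r, query), i) for i, r in enumerate(records))
    return heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))

# Toy collection of set records (hypothetical data).
records = [{"a", "b", "c"}, {"a", "b"}, {"x", "y"}, {"a", "b", "c", "d"}]
query = {"a", "b", "c"}

top2 = knn_search(records, query, 2)
strict = threshold_search(records, query, 0.9)
```

With a threshold of 0.9 only the exact match is returned, and a user who wanted a handful of good results would have to guess a lower threshold. `knn_search` returns the two best records directly, with no threshold to tune.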


Citations: 1