2016 IEEE 32nd International Conference on Data Engineering (ICDE): latest papers, page 8

SLR: A scalable latent role model for attribute completion and tie prediction in social networks

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498313

Lizi Liao, Qirong Ho, Jing Jiang, Ee-Peng Lim

Social networks are an important class of networks that span a wide variety of media, ranging from social websites such as Facebook and Google Plus, citation networks of academic papers and patents, caller networks in telecommunications, and hyperlinked document collections such as Wikipedia - to name a few. Many of these social networks now exceed millions of users or actors, each of which may be associated with rich attribute data such as user profiles in social websites and caller networks, or subject classifications in document collections and citation networks. Such attribute data is often incomplete for a number of reasons - for example, users may be unwilling to spend the effort to complete their profiles, while in the case of document collections, there may be insufficient human labor to accurately classify all documents. At the same time, the tie or link information in these networks may also be incomplete - in social websites, users may simply be unaware of potential acquaintances, while in citation networks, authors may be unaware of appropriate literature that should be referenced. Completing and predicting these missing attributes and ties is important to a spectrum of applications, such as recommendation, personalized search, and targeted advertising, yet large social networks can pose a scalability challenge to existing algorithms designed for this task. Towards this end, we propose an integrative probabilistic model, SLR, that captures both attribute and tie information simultaneously, and can be used for attribute completion and tie prediction, in order to enable the above mentioned applications. A key innovation in our model is the use of triangle motifs to represent ties in the network, in order to scale to networks with millions of nodes and beyond. 
Experiments on real world datasets show that SLR significantly improves the accuracy of attribute prediction and tie prediction compared to well-known methods, and our distributed, multi-machine implementation easily scales up to millions of users. In addition to fast and accurate attribute and tie prediction, we also demonstrate how SLR can identify the attributes most responsible for homophily within the network, thus revealing which attributes drive network tie formation.
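
To make the triangle-motif idea concrete, here is a minimal sketch that enumerates closed triangles and open wedges (two edges sharing a center node) from an undirected edge list. It is not the authors' implementation: the function name `triangle_motifs` and the toy graph are assumptions for illustration, and the abstract does not say how SLR samples or models these motifs.

```python
from itertools import combinations

def triangle_motifs(edges):
    """Return (closed, open_) motif lists for an undirected graph.

    closed: sorted node triples where all three edges exist.
    open_:  (center, a, b) wedges where center-a and center-b exist
            but a-b does not.
    """
    # Build adjacency sets from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    closed, open_ = set(), []
    for center in sorted(adj):
        # Every pair of neighbors of `center` forms a wedge or a triangle.
        for a, b in combinations(sorted(adj[center]), 2):
            if b in adj[a]:
                closed.add(tuple(sorted((center, a, b))))
            else:
                open_.append((center, a, b))
    return sorted(closed), open_

# Hypothetical toy graph: one triangle (1, 2, 3) plus a pendant node 4.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
closed, open_ = triangle_motifs(edges)
print(closed)  # [(1, 2, 3)]
print(open_)   # [(3, 1, 4), (3, 2, 4)]
```

In this enumeration the work per node is proportional to the number of neighbor pairs, so the total depends on node degrees rather than on the number of all possible node pairs, which is one plausible reading of why a motif representation helps at scale.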

Pages: 1062-1073

Citations: 6

Fuzzy trajectory linking

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498296

Huayu Wu, Mingqiang Xue, Jianneng Cao, Panagiotis Karras, W. Ng, Kee Kiat Koo

Today, people can access various services with smart carry-on devices, e.g., surf the web with smartphones, make payments with credit cards, or ride a bus with commuting cards. In addition to the convenience offered, access to such services can reveal users' traveled trajectories to service providers. Very often, a user who has signed up for multiple services may expose her trajectory to more than one service provider. This state of affairs raises a privacy concern, but also an opportunity. On the one hand, several colluding service providers, or a government agency that collects information from such service providers, may identify and reconstruct users' trajectories to an extent that can threaten personal privacy. On the other hand, the processing of such rich data may allow for the development of better services for the common good. In this paper, we take a neutral standpoint and investigate the potential for trajectories accumulated from different sources to be linked so as to reconstruct a larger trajectory of a single person. We develop a methodology, called fuzzy trajectory linking (FTL), that achieves this goal, and two instantiations thereof, one based on hypothesis testing and one on Naïve Bayes. We provide a theoretical analysis of the factors that affect FTL and use two real datasets to demonstrate that our algorithms effectively achieve their goals.

Pages: 859-870

Citations: 5

CSI_GED: An efficient approach for graph edit similarity computation

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498246

K. Gouda, M. Hassaan

Graph similarity is a basic and essential operation in many applications. In this paper, we are interested in computing graph similarity based on edit distance. Existing graph edit distance computation methods adopt the best-first search paradigm A*. These methods are time- and space-bound. In practice, they can compute the edit distance of graphs containing at most 12 vertices. To enable graph edit similarity computation on larger and distant graphs, we present CSI_GED, a novel edge-based mapping method for computing graph edit distance through enumeration of common sub-structure isomorphisms. CSI_GED utilizes backtracking search combined with a number of heuristics to reduce memory requirements and quickly prune away a large portion of the mapping search space. Experiments show that CSI_GED is highly efficient for computing the edit distance on small as well as large and distant graphs. Furthermore, we evaluated CSI_GED as a stand-alone graph edit similarity search query method. The experiments show that CSI_GED is effective and scalable, and outperforms the state-of-the-art indexing-based methods by over two orders of magnitude.
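
As a point of reference for what "graph edit distance" measures, the sketch below computes it exactly by trying every vertex mapping. This is a naive baseline, not CSI_GED and not A*: the function name `graph_edit_distance`, the unit cost for every vertex or edge insertion, deletion, and relabeling, and the restriction to unlabeled undirected edges are all assumptions made for illustration.

```python
from itertools import permutations

def graph_edit_distance(labels1, edges1, labels2, edges2):
    """Exact GED under unit costs by trying every vertex mapping.

    labels*: list of vertex labels; vertex i is index i.
    edges*:  iterable of undirected (i, j) index pairs.
    """
    n = max(len(labels1), len(labels2))
    # Pad both graphs with dummy vertices (label None) so a bijection exists;
    # mapping a real vertex to a dummy models a vertex deletion or insertion.
    l1 = list(labels1) + [None] * (n - len(labels1))
    l2 = list(labels2) + [None] * (n - len(labels2))
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}

    best = None
    for perm in permutations(range(n)):  # perm[i] = image of vertex i
        # Vertex cost: one edit whenever the mapped labels differ.
        cost = sum(1 for i in range(n) if l1[i] != l2[perm[i]])
        # Edge cost: edges present in exactly one graph after mapping.
        mapped = {frozenset(perm[i] for i in e) for e in e1}
        cost += len(mapped ^ e2)
        if best is None or cost < best:
            best = cost
    return best

# Triangle versus path on the same labels: delete one edge.
print(graph_edit_distance(["A", "B", "C"], [(0, 1), (1, 2), (0, 2)],
                          ["A", "B", "C"], [(0, 1), (1, 2)]))  # 1
# Edge A-B versus single vertex A: delete the edge and vertex B.
print(graph_edit_distance(["A", "B"], [(0, 1)], ["A"], []))  # 2
```

The loop visits n! mappings, which gives a feel for the size of the mapping search space that any practical method has to prune.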

Citations: 48

Spatial influence - measuring followship in the real world

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498268

Huy Pham, C. Shahabi

Finding influential people in a society has been the focus of social studies for decades due to its numerous applications, such as viral marketing or spreading ideas and practices. A critical first step is to quantify the amount of influence an individual exerts on another, termed pairwise influence. Early social studies had to confine themselves to surveys and manual data collection for this purpose; more recent studies have exploited web data (e.g., blogs). In this paper, for the first time, we utilize people's movement in the real world (aka spatiotemporal data) to derive pairwise influence. We first define followship to capture the phenomenon of an individual visiting a real-world location (e.g., a restaurant) due to the influence of another individual who has visited that same location in the past. Subsequently, we coin the term spatial influence as the concept of inferring pairwise influence from spatiotemporal data by quantifying the amount of followship influence that an individual has on others. We then propose the Temporal and Locational Followship Model (TLFM) to estimate spatial influence, in which we study three factors that impact followship: the time delay between the visits, the popularity of the location, and the inherent coincidences in individuals' visiting behaviors. We conducted extensive experiments using various real-world datasets, which demonstrate the effectiveness of our TLFM model in quantifying spatial influence.
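
The followship definition can be illustrated with a simple counting sketch: for each ordered pair of people, count the visits where the second person arrives at a location after the first, within a bounded delay. This covers only the time-delay factor; it does not model location popularity or coincidence, and it is not TLFM. The function name `followship_counts`, the `max_delay` cutoff, and the sample check-ins are assumptions for illustration.

```python
from collections import defaultdict

def followship_counts(checkins, max_delay):
    """Count, per (leader, follower) pair, visits that trail the leader's
    earlier visit to the same location by at most max_delay time units.

    checkins: iterable of (user, location, time) with numeric times.
    """
    by_location = defaultdict(list)
    for user, location, t in checkins:
        by_location[location].append((t, user))

    counts = defaultdict(int)
    for visits in by_location.values():
        visits.sort()  # chronological order per location
        for i, (t_lead, leader) in enumerate(visits):
            for t_follow, follower in visits[i + 1:]:
                if t_follow - t_lead > max_delay:
                    break  # all later visits are even further away in time
                if follower != leader and t_follow > t_lead:
                    counts[(leader, follower)] += 1
    return dict(counts)

# Hypothetical check-ins: bob trails ann at both the cafe and the park.
checkins = [
    ("ann", "cafe", 1), ("bob", "cafe", 3), ("cat", "cafe", 20),
    ("ann", "park", 5), ("bob", "park", 6),
]
print(followship_counts(checkins, max_delay=5))  # {('ann', 'bob'): 2}
```

A raw count like this would overstate influence at very popular locations, where many people visit soon after one another by chance, which is presumably why the abstract lists popularity and coincidence as separate factors.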

Pages: 529-540

Citations: 17

Efficient fault-tolerance for iterative graph processing on distributed dataflow systems

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498275

Chen Xu, M. Holzemer, Manohar Kaul, V. Markl

Real-world graph processing applications often require combining the graph data with tabular data. Moreover, graph processing is usually part of a larger analytics workflow consisting of data preparation, analysis and model building, and model application. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the processing. Most big graph processing algorithms are iterative and incur a long runtime, as they require multiple passes over the data until convergence. Thus, fault tolerance and quick recovery from any intermittent failure at any step of the workflow are crucial for effective and efficient analysis. In this work, we propose a novel fault-tolerance mechanism for iterative graph processing on distributed dataflow systems with the objective of reducing the checkpointing cost and failure recovery time. Rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner, without breaking pipelined tasks. In contrast to the typical unblocking checkpointing approaches (i.e., managing checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating the checkpoint creation during iterative graph processing. We achieve speedier recovery, i.e., confined recovery, by using the local log files on each node to avoid a complete re-computation from scratch. Our theoretical studies as well as our experimental analysis on Flink give further insight into our fault-tolerance strategies and show that they are more efficient than blocking checkpointing and complete recovery for iterative graph processing on dataflow systems.

Pages: 613-624

Citations: 19

Mercury: Metro density prediction with recurrent neural network on streaming CDR data

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498348

Victor C. Liang, Richard T. B. Ma, W. Ng, Li Wang, M. Winslett, Huayu Wu, Shanshan Ying, Zhenjie Zhang

Telecommunication companies possess mobility information about their phone users, containing accurate locations and velocities of commuters travelling in the public transportation system. Although the value of telecommunication data is widely recognized under the smart city vision, there is no existing solution to transform the data into actionable items for better transportation, mainly due to the lack of an appropriate data utilization scheme and the limited processing capability for massive data. This paper presents the first system implementation of real-time public transportation crowd prediction based on telecommunication data, relying on the analytical power of advanced neural network models and the computation power of parallel streaming analytic engines. By analyzing the feeds of call detail records (CDR) from mobile users in regions of interest, our system is able to predict the number of metro passengers entering stations, the number of passengers waiting on the platforms, and other important crowd density metrics. New techniques, including geographical-spatial data processing, weight-sharing recurrent neural networks, and parallel streaming analytical programming, are employed in the system. These new techniques enable accurate and efficient prediction outputs that meet the real-world business requirements of the public transportation system.

Pages: 1374-1377

Citations: 34

Efficiently computing reverse k furthest neighbors

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498317

Shenlu Wang, M. A. Cheema, Xuemin Lin, Ying Zhang, Dongxi Liu

Given a set of facilities F, a set of users U and a query facility q, a reverse k furthest neighbors (RkFN) query retrieves every user u ∈ U for which q is one of its k-furthest facilities. The RkFN query is the natural complement of the reverse k-nearest neighbors (RkNN) query, which returns every user u for which q is one of its k-nearest facilities. While an RkNN query returns the users that are highly influenced by a query q, an RkFN query aims at finding the users that are least influenced by q. The RkFN query has many applications in location-based services, marketing, facility location, clustering, and recommendation systems. While there exist several algorithms that answer the RkFN query for k = 1, we are the first to propose a solution for arbitrary values of k. Based on several interesting observations, we present an efficient algorithm to process RkFN queries. We also present a rigorous theoretical analysis to study various important aspects of the problem and our algorithm. An extensive experimental study is conducted using both real and synthetic data sets, demonstrating that our algorithm outperforms the state-of-the-art algorithm even for k = 1. The accuracy of our theoretical analysis is also verified by the experiments.
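
The RkFN definition translates directly into a brute-force check, sketched below as a correctness baseline rather than the paper's algorithm. The function name `rkfn_bruteforce`, Euclidean distance in the plane, the treatment of q as one of the candidate facilities, and the tie rule (a facility at exactly the same distance as q does not count as further) are all assumptions for illustration.

```python
from math import dist

def rkfn_bruteforce(users, facilities, q, k):
    """Return users for which q is among their k furthest facilities.

    Naive O(|U| * |F|) check: q qualifies for u when fewer than k
    facilities in F lie strictly further from u than q does.
    """
    result = []
    for u in users:
        d_q = dist(u, q)
        # Count facilities (other than q itself) strictly further than q.
        further = sum(1 for f in facilities if f != q and dist(u, f) > d_q)
        if further < k:
            result.append(u)
    return result

# Hypothetical points in the plane.
facilities = [(0, 0), (10, 0), (5, 8)]
users = [(1, 1), (9, 1), (6, 7)]
q = (0, 0)
print(rkfn_bruteforce(users, facilities, q, k=1))  # [(9, 1), (6, 7)]
```

The user at (1, 1) sits next to q, so q is its nearest facility and never its furthest; the other two users are far from q and close to another facility, so q is their furthest facility.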

Pages: 1110-1121

Citations: 17

Fast top-k search in knowledge graphs

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498307

Shengqi Yang, Fangqiu Han, Yinghui Wu, Xifeng Yan

Given a graph query Q posed on a knowledge graph G, top-k graph querying is to find the k matches in G with the highest ranking scores according to a ranking function. Fast top-k search in knowledge graphs is challenging as both graph traversal and similarity search are expensive. Conventional top-k graph search is typically based on the threshold algorithm (TA), which can no longer fit the demand in the new setting. This work proposes STAR, a top-k knowledge graph search framework. It has two components: (a) a fast top-k algorithm for star queries, and (b) an assembling algorithm for general graph queries. The assembling algorithm uses the star query as a building block and iteratively sweeps the star match lists with a dynamically adjusted bound. For top-k star graph queries where an edge can be matched to a path with bounded length d, we develop a message passing algorithm, achieving time complexity O(d^2|E| + m^d) and space complexity linear in d|V| (assuming the size of Q and k is bounded by a constant), where m is the maximum node degree in G. STAR can further be leveraged to answer general graph queries by decomposing a query into multiple star queries and joining their results later. Learning-based techniques to optimize query decomposition are also developed. We experimentally verify that STAR is 5-10 times faster than the state-of-the-art TA-style graph search algorithm, and 10-100 times faster than a belief propagation approach.
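
The step of decomposing a general query into star queries can be pictured with a simple greedy sketch: repeatedly pick the query node that touches the most uncovered edges as a pivot, and emit that pivot with its uncovered neighbors as one star. This greedy rule is an assumption for illustration only; the abstract says the paper optimizes the decomposition with learning-based techniques, which are not reproduced here. The function name `decompose_into_stars` and the sample query are hypothetical, and the input is assumed to have no self-loops.

```python
def decompose_into_stars(edges):
    """Greedily split an undirected query graph into edge-disjoint stars.

    Returns a list of (pivot, leaves) pairs that together cover every edge.
    Assumes no self-loops.
    """
    remaining = {frozenset(e) for e in edges}
    stars = []
    while remaining:
        # Count uncovered edges per node.
        degree = {}
        for e in remaining:
            for node in e:
                degree[node] = degree.get(node, 0) + 1
        # Pick the node covering the most uncovered edges (ties: smallest node).
        pivot = min(degree, key=lambda n: (-degree[n], n))
        leaves = sorted(next(iter(e - {pivot})) for e in remaining if pivot in e)
        stars.append((pivot, leaves))
        # Edges incident to the pivot are now covered.
        remaining = {e for e in remaining if pivot not in e}
    return stars

# Hypothetical query graph with six nodes.
query_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e"), ("e", "f")]
print(decompose_into_stars(query_edges))
# [('a', ['b', 'c', 'd']), ('e', ['d', 'f'])]
```

Node d appears as a leaf in both stars, so the matches of the two star queries would have to agree on d when their results are joined.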


Pages: 990-1001

Citations: 51

Geo-Social K-Cover Group queries for collaborative spatial computing

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498399

Yafei Li, Rui Chen, Jianliang Xu, Qiao Huang, Haibo Hu, Byron Choi

In this paper, we study a new type of query, the Geo-Social K-Cover Group (GSKCG) query, which, given a set of query points and a social network, retrieves a minimum user group in which each user is socially related to at least k other users and the users' associated regions (e.g., familiar regions or service regions) jointly cover all the query points. Despite its practical usefulness, the GSKCG query problem is NP-hard. We therefore explore a set of effective pruning strategies to derive an efficient algorithm for finding the optimal solution. Moreover, we design a novel index structure tailored to the problem to further accelerate query processing. Extensive experiments demonstrate that our algorithm achieves desirable performance on real-life datasets.
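The two constraints in the GSKCG definition can be written down directly. The sketch below is a brute-force illustration only: it does not use the paper's pruning strategies or index structure, and it assumes, purely for simplicity, that each user's region is an axis-aligned rectangle `(x1, y1, x2, y2)` and that social relations are given as adjacency sets. All names are hypothetical.

```python
from itertools import combinations


def covers(region, point):
    """True if the rectangle (x1, y1, x2, y2) contains the point (px, py)."""
    x1, y1, x2, y2 = region
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def is_valid_group(group, friends, regions, query_points, k):
    """Check the social (at least k in-group friends) and coverage constraints."""
    members = set(group)
    socially_ok = all(len(friends[u] & members) >= k for u in members)
    spatially_ok = all(any(covers(regions[u], p) for u in members)
                       for p in query_points)
    return socially_ok and spatially_ok


def min_group_brute_force(users, friends, regions, query_points, k):
    """Try groups in order of increasing size; exponential in len(users)."""
    for size in range(1, len(users) + 1):
        for group in combinations(users, size):
            if is_valid_group(group, friends, regions, query_points, k):
                return set(group)
    return None  # no group satisfies both constraints


users = ["u1", "u2", "u3", "u4"]
friends = {"u1": {"u2", "u3"}, "u2": {"u1", "u3"},
           "u3": {"u1", "u2", "u4"}, "u4": {"u3"}}
regions = {"u1": (0, 0, 2, 2), "u2": (1, 1, 3, 3),
           "u3": (5, 5, 6, 6), "u4": (0, 0, 10, 10)}
query_points = [(1, 1), (3, 3)]
print(min_group_brute_force(users, friends, regions, query_points, k=1))
```

The exhaustive enumeration is only workable for a handful of users, which is consistent with the NP-hardness noted in the abstract and explains why pruning and indexing are needed in practice.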


Citations: 24

PurTreeClust: A purchase tree clustering algorithm for large-scale customer transaction data

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Pub Date : 2016-05-16 DOI: 10.1109/ICDE.2016.7498279

Xiaojun Chen, J. Huang, Jun Luo

Clustering customer transaction data is an important step in analyzing customer behavior in retail and e-commerce companies. A company's products are often organized as a product tree, in which the leaf nodes are the goods for sale and the internal nodes (except the root node) are product categories at multiple levels. Based on this tree, we propose using a "personalized product tree", called a purchase tree, to represent a customer's transaction data. A customer transaction data set can then be represented as a set of purchase trees. We propose the PurTreeClust algorithm for clustering large numbers of customers represented as purchase trees. We define a new distance metric that computes the distance between two purchase trees across all levels of the tree. A cover tree is then built to index the purchase tree data, and we propose a leveled density estimation method for selecting initial cluster centers from the cover tree. On this basis, we present PurTreeClust, a fast method for clustering large-scale purchase tree data. Finally, we propose a gap statistic based method for estimating the number of clusters from the purchase tree clustering results. A series of experiments was conducted on ten large-scale transaction data sets containing up to four million transaction records, and the results verify the effectiveness and efficiency of the proposed method. We also compared our method with three clustering algorithms, namely spectral clustering, hierarchical agglomerative clustering and DBSCAN. The experimental results demonstrate the superior performance of the proposed method.
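As a rough illustration of comparing two purchase trees across all levels, the sketch below represents each purchase tree as a set of root-to-leaf paths and averages a per-level Jaccard distance. This is an assumed stand-in, not the distance metric defined in the paper, whose exact form is not given in the abstract. The function names and the unweighted average are illustrative choices.

```python
def level_nodes(paths, level):
    """Nodes of a purchase tree at a given level, identified by path prefix."""
    return {path[:level] for path in paths if len(path) >= level}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def purchase_tree_distance(tree_a, tree_b, depth):
    """Unweighted mean of the per-level Jaccard distances (an assumption)."""
    per_level = [
        jaccard_distance(level_nodes(tree_a, lvl), level_nodes(tree_b, lvl))
        for lvl in range(1, depth + 1)
    ]
    return sum(per_level) / depth


# Each path is (top category, sub-category, product); the root is implicit.
alice = {("food", "dairy", "milk"), ("food", "bakery", "bread")}
bob = {("food", "dairy", "cheese"), ("toys", "games", "chess")}
print(purchase_tree_distance(alice, bob, depth=3))
```

Here the two customers overlap at the category levels but share no products, so the distance grows from the top level to the leaf level. A level-aware distance of this kind lets customers who buy different items from the same categories be treated as closer than customers with no categories in common.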


Pages: 661-672

Citations: 13